Description of advanced features
Models/platforms not Explained in Class, Advanced Functions of Modules
Graph Analysis
In our project, the final output produced by Spark can be read into PySpark and used as the input for building a gene network. The gene names in vertices.csv are used to build the gene vertices, and the gene pairs with significant pairwise correlation in graph_edges.csv are used as weighted edges that represent the relationships between genes.
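To make the expected input concrete, the sketch below parses both files with Python's standard csv module. The layout is an assumption inferred from how the files are read later in this section: no header row, one gene name per row in vertices.csv, and source, destination, and weight columns in graph_edges.csv. The gene names and weights are made-up placeholders, not project results.

```python
import csv
import io

# Hypothetical file contents; the real files come from the Spark output.
VERTICES_CSV = "GENE_A\nGENE_B\nGENE_C\n"
EDGES_CSV = "GENE_A,GENE_B,0.012\nGENE_B,GENE_C,0.005\n"


def parse_vertices(text):
    """Return the gene names, one per non-empty row."""
    return [row[0] for row in csv.reader(io.StringIO(text)) if row]


def parse_edges(text):
    """Return (src, dst, weight) tuples with the weight as a float."""
    edges = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        src, dst, weight = row
        edges.append((src, dst, float(weight)))
    return edges


vertices = parse_vertices(VERTICES_CSV)
edges = parse_edges(EDGES_CSV)

# Sanity check: every edge endpoint should appear in the vertex list.
known = set(vertices)
dangling = [e for e in edges if e[0] not in known or e[1] not in known]
print(vertices)
print(edges)
print("edges with unknown genes:", dangling)
```

A check like the last one is a cheap way to catch mismatches between the two files before building the graph.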
First, launch PySpark with the GraphFrames package. Remember to specify --repositories https://repos.spark-packages.org# to work around a bug in which the dl.bintray.com repository used by PySpark's Maven resolution cannot be found:
$ pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11 --repositories https://repos.spark-packages.org#
Next, build the graph and explore the pairwise gene relationships with the following code:
import graphframes as GF
from pyspark.sql import SQLContext

# `sc` is the SparkContext that the PySpark shell creates on startup.
sql_context = SQLContext(sc)

# Read the Spark output; neither file has a header row.
vertice_df = sql_context.read.csv(
    "vertices.csv/part-00000",
    header=False
).toDF("id")
edge_df = sql_context.read.csv(
    "graph_edges.csv/part-00000",
    header=False,
    sep=','
).toDF("src", "dst", "relationship")

# Build the graph and save its vertices and edges as Parquet.
g = GF.GraphFrame(vertice_df, edge_df)
g.vertices.write.parquet("vertices")
g.edges.write.parquet("edges")

# To reload the saved vertices and edges later:
# v = sql_context.read.parquet("hdfs://./vertices")
# e = sql_context.read.parquet("hdfs://./edges")
# Create a graph
# g = GF.GraphFrame(v, e)

# Explore the graph
g.vertices.show()
g.edges.show()
vertexInDegrees = g.inDegrees
vertexInDegrees.show()
vertexOutDegrees = g.outDegrees
vertexOutDegrees.show()

# Explore the network by grouping or filtering
g.edges.filter("relationship > 0.008").count()
g.edges.filter("relationship > 0.008").show()
Build the graph using PySpark GraphFrames
Read the output from the Spark single node / EMR cluster into PySpark as a DataFrame
We can explore the graph's in-degree and out-degree centrality to gain insight into the significant relationships associated with a particular gene.
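As an illustration of what these degree measures count, here is a minimal plain-Python sketch that does not use GraphFrames. The edge list is hypothetical, and treating in-degree plus out-degree as the number of correlated partners assumes that each gene pair is stored only once in the edge file, which the source does not state.

```python
from collections import Counter

# Hypothetical directed edges (src, dst, weight); not project data.
edges = [
    ("GENE_A", "GENE_B", 0.012),
    ("GENE_A", "GENE_C", 0.009),
    ("GENE_B", "GENE_C", 0.005),
    ("GENE_A", "GENE_D", 0.020),
]

# In-degree counts edges arriving at a gene; out-degree counts edges leaving it.
in_degree = Counter(dst for _, dst, _ in edges)
out_degree = Counter(src for src, _, _ in edges)

# Assumption: each correlated pair appears once, so in + out is the
# number of significant partners of a gene.
total_degree = in_degree + out_degree

# List genes from most to least connected.
for gene, degree in total_degree.most_common():
    print(gene, "in:", in_degree[gene], "out:", out_degree[gene], "total:", degree)
```

In GraphFrames, the corresponding DataFrames are g.inDegrees, g.outDegrees, and g.degrees. Because a correlation edge has no natural direction, the combined degree is usually the more meaningful number for a gene.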
We can also explore the relationships between vertices by filtering and with other, more advanced GraphFrames functions, as a further application of our project.
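The filtering idea can be sketched in plain Python as follows: keep only edges above a weight threshold, then look up the strongest partners of one gene. The edge list and the partners helper are hypothetical, and the 0.008 threshold simply mirrors the value used in the edge filter earlier in this section.

```python
from collections import defaultdict

# Hypothetical edges (src, dst, weight); not project data.
edges = [
    ("GENE_A", "GENE_B", 0.012),
    ("GENE_A", "GENE_C", 0.009),
    ("GENE_B", "GENE_C", 0.005),
    ("GENE_A", "GENE_D", 0.020),
]

THRESHOLD = 0.008  # same cut-off as the edge filter used earlier

# Keep only the edges whose weight exceeds the threshold.
strong = [(s, d, w) for s, d, w in edges if w > THRESHOLD]

# Index the strong edges by gene, treating each edge as undirected.
neighbors = defaultdict(list)
for s, d, w in strong:
    neighbors[s].append((d, w))
    neighbors[d].append((s, w))


def partners(gene):
    """Return (partner, weight) pairs for a gene, strongest first."""
    return sorted(neighbors.get(gene, []), key=lambda p: p[1], reverse=True)


print(len(strong), "edges above threshold")
print(partners("GENE_A"))
```

Indexing both endpoints treats each correlation as symmetric, which fits a correlation network but would not suit a graph whose edges carry a real direction.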
Spark GraphFrames makes it easy to examine particular genes at relatively low computational cost, and it also lets users compare tumor or cancerous gene networks with other human gene networks.