How to Load a GraphX Graph from a Parquet File?


Alexander Czech-2
Hey all,
I want to load a Parquet file containing my edges into a GraphX Graph. My code so far looks like this:

val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD = edgesDF.rdd
val graph = Graph.fromEdgeTuples(edgesRDD, 1)

But this produces a compile error:
[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.VertexId)]
[error]     (which expands to)  org.apache.spark.rdd.RDD[(Long, Long)]
[error] Error occurred in an application involving default arguments.
[error]         val graph = Graph.fromEdgeTuples(edgesRDD, 1)

I tried annotating edgesRDD with the expected type, as in the following code, but this just moves the error to the annotated line:
val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
val edgesRDD : RDD[(Long,Long)] = edgesDF.rdd
val graph = Graph.fromEdgeTuples(edgesRDD, 1)
[error] /home/alex/ownCloud/JupyterNotebooks/Diss_scripte/Webgraph_analysis/pagerankscala/src/main/scala/pagerank.scala:17:44: type mismatch;
[error]  found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error]  required: org.apache.spark.rdd.RDD[(Long, Long)]
[error] val edgesRDD : RDD[(Long,Long)] = edgesDF.rdd


So I guess I have to transform org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] into org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.VertexId)], which expands to org.apache.spark.rdd.RDD[(Long, Long)].

How can I achieve this?
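
One possible approach, sketched under assumptions: a type annotation alone cannot turn Rows into tuples, so each Row probably has to be mapped to a (Long, Long) pair before the RDD is handed to Graph.fromEdgeTuples. The sketch below assumes that the first two columns of the Parquet file hold the source and destination vertex IDs and that both are stored as 64-bit integers. The object name LoadEdges, the helper toEdgeTuples, and the local master setting are hypothetical and only there for illustration.

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object LoadEdges {
  // Hypothetical helper: turn each Row into a (srcId, dstId) tuple.
  // Assumes columns 0 and 1 are both of type Long.
  def toEdgeTuples(rows: RDD[Row]): RDD[(VertexId, VertexId)] =
    rows.map(row => (row.getLong(0), row.getLong(1)))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; adjust the master for a real cluster.
    val spark = SparkSession.builder()
      .appName("LoadEdges")
      .master("local[*]")
      .getOrCreate()

    val edgesDF = spark.read.parquet("/path/to/edges/parquet/")
    val edgesRDD = toEdgeTuples(edgesDF.rdd)

    // 1 is the default attribute assigned to every vertex.
    val graph = Graph.fromEdgeTuples(edgesRDD, 1)
    println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

    spark.stop()
  }
}
```

If the ID columns are stored as Int rather than Long, row.getLong fails at runtime with a ClassCastException. Casting both columns to long on the DataFrame before calling .rdd is one way around that.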