[GraphX] - OOM Java Heap Space

Thodoris Zois

I have the edges of a graph stored as Parquet files (about 3 GB). I load the graph and try to compute the total number of triplets and triangles. Here is my code:

// Read the edge list for one year from the partitioned Parquet dataset
val edges_parq = sqlContext.read.parquet(args(0) + "/year=" + year)
// Columns 0 and 1 hold the source and destination vertex ids
val edges: RDD[Edge[Int]] = edges_parq.rdd.map(row => Edge(row.getInt(0), row.getInt(1)))
val graph = Graph.fromEdges(edges, 1).partitionBy(PartitionStrategy.RandomVertexCut)

// The actual computation
val numberOfTriplets = graph.triplets.count
val tmp = graph.triangleCount().vertices.filter { case (vid, count) => count > 0 }
val numberOfTriangles = tmp.map(_._2).sum()

Even though it manages to compute the number of triplets, I can't compute the number of triangles. Every time, some executors throw an OOM (Java heap space) exception and the application fails.
I am using 100 executors (1 core and 6 GB each). I have tried setting hdfsConf.set("mapreduce.input.fileinputformat.split.maxsize", "33554432") in the code, but it made no difference.
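
For context, triangle counting of this kind works from per-vertex neighbor sets: each edge contributes the size of the intersection of its endpoints' neighbor sets. Those sets may well be what fills the heap here, though that is an assumption and not something the report above confirms. A minimal plain-Scala sketch of the idea on a toy graph follows; `TriangleSketch`, `triangleCounts`, and the sample edges are hypothetical names and data for illustration, not GraphX API.

```scala
// Plain-Scala sketch (no Spark needed) of neighbor-set triangle counting.
object TriangleSketch {
  // Per-vertex triangle counts for an undirected graph given as an edge list.
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Canonicalize: drop self-loops, orient src < dst, remove duplicates.
    val canonical = edges
      .filter { case (s, d) => s != d }
      .map { case (s, d) => if (s < d) (s, d) else (d, s) }
      .distinct

    // Neighbor set of every vertex; this is the memory-hungry structure.
    val neighbors: Map[Long, Set[Long]] = canonical
      .flatMap { case (s, d) => Seq(s -> d, d -> s) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // Each edge contributes |N(src) intersect N(dst)| to both endpoints.
    val perEdge = canonical.flatMap { case (s, d) =>
      val common = (neighbors(s) intersect neighbors(d)).size
      Seq(s -> common, d -> common)
    }

    // A triangle at a vertex is seen via two of that vertex's edges, so halve.
    perEdge.groupBy(_._1).map { case (v, cs) => v -> cs.map(_._2).sum / 2 }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy graph: one triangle (1, 2, 3) plus a dangling edge 3-4.
    val counts = triangleCounts(Seq((1L, 2L), (2L, 3L), (3L, 1L), (3L, 4L)))
    // Every triangle is counted once at each of its three corners.
    val distinctTriangles = counts.values.sum / 3
    println(s"per-vertex: $counts, distinct triangles: $distinctTriangles")
  }
}
```

Because each triangle is counted once at each of its three corners, the per-vertex counts sum to three times the number of distinct triangles; dividing the summed vertex counts by 3 gives the graph-wide total.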

Here are some of my configurations:
--conf spark.driver.memory=20G
--conf spark.driver.maxResultSize=20G
--conf spark.yarn.executor.memoryOverhead=6144
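
As a rough sanity check on these numbers: on YARN the memory overhead is requested on top of the executor heap, so it enlarges the container without enlarging the JVM heap where "Java heap space" errors occur. The sketch below only does the arithmetic; it assumes the 6 GB per executor mentioned above is the executor heap size, and `ContainerMath` is a hypothetical name.

```scala
// Back-of-the-envelope container sizing from the figures in the post.
object ContainerMath {
  val executorHeapMb: Int = 6 * 1024   // assumed: 6 GB executor heap, from the post
  val overheadMb: Int = 6144           // spark.yarn.executor.memoryOverhead, from the post
  val executors: Int = 100             // from the post

  // YARN container request per executor = heap + overhead.
  val containerMb: Int = executorHeapMb + overheadMb

  // Total memory requested across all executors, in MB.
  val clusterMb: Long = containerMb.toLong * executors

  def main(args: Array[String]): Unit = {
    println(s"per-executor container: $containerMb MB, heap share: $executorHeapMb MB")
    println(s"total across executors: ${clusterMb / 1024} GB")
  }
}
```

Under these assumptions only half of each container is heap available to the triangle-count tasks.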

- Thodoris