Spark 2.1.1 Graphx graph loader GC overhead error


Spark 2.1.1 Graphx graph loader GC overhead error

Aritra Mandal
I have a 10-node cluster, each node with 4 cores and 16 GB of memory (14.2 GB allocated to Spark). I am trying to load an edge-list file (29 GB) containing 66M vertices and 2B edges. It throws a GC overhead limit error.


// canonicalOrientation = true, numEdgePartitions = 32
val g = GraphLoader.edgeListFile(sc, "/home/amandal/soc_friend.txt", true, 32)


I am using the default storage options in GraphLoader: edgeStorageLevel = StorageLevel.MEMORY_ONLY and vertexStorageLevel = StorageLevel.MEMORY_ONLY.

I can change these to MEMORY_AND_DISK, but I am running on HDDs, not SSDs, so that might take a performance hit.

I tried another graph with 3M vertices and 117M edges and it works fine; it even performs 80 Pregel iterations in 24 minutes.

Any suggestions on the maximum graph size this configuration can handle, or on how to solve this issue?

Aritra Mandal


Re: Spark 2.1.1 Graphx graph loader GC overhead error

yncxcw
hi,

It depends heavily on the algorithms you are going to apply to your data set. Graph applications are usually memory hungry and can cause long GC pauses or even OOM.

Suggestions include:

1. Persist the most heavily reused RDDs with StorageLevel.MEMORY_ONLY and leave the rest at MEMORY_AND_DISK (a sketch follows below).
2. Slightly decrease the parallelism of each executor.
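
A minimal sketch of what suggestion 1 could look like; the path, the variable names, and the degrees example are illustrative, not from the original post:

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Load the graph with MEMORY_AND_DISK so partitions that do not fit in memory
// spill to disk instead of driving the JVM into long GC pauses or OOM.
val graph = GraphLoader.edgeListFile(
  sc, "/path/to/edge_list.txt",
  canonicalOrientation = true,
  numEdgePartitions    = 32,
  edgeStorageLevel     = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel   = StorageLevel.MEMORY_AND_DISK)

// A derived RDD that is reused across many iterations can still be pinned in
// memory explicitly.
val degrees = graph.degrees.persist(StorageLevel.MEMORY_ONLY)
degrees.count() // materialise the cache

For suggestion 2, one option is to lower --executor-cores in spark-submit (for example from 4 to 2), which reduces the number of concurrent tasks per executor and therefore its peak memory pressure.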


Wei Chen
 



Re: Spark 2.1.1 Graphx graph loader GC overhead error

Aritra Mandal

Thanks for the response. I have an implementation of k-core decomposition running using the Pregel framework.
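
For reference, a minimal Pregel skeleton of the shape I mean; this is not the k-core implementation itself, and graph and the max-propagation logic are only placeholders:

import org.apache.spark.graphx._

// Hypothetical placeholder: every vertex repeatedly adopts the largest vertex
// id seen among its neighbours, for up to 80 supersteps.
val initial: Graph[VertexId, Int] = graph.mapVertices((id, _) => id)

val propagated = Pregel(initial, Long.MinValue, maxIterations = 80)(
  (id, attr, msg) => math.max(attr, msg),            // vertex program
  triplet =>                                          // send messages
    if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
    else if (triplet.dstAttr > triplet.srcAttr) Iterator((triplet.srcId, triplet.dstAttr))
    else Iterator.empty,
  (a, b) => math.max(a, b)                            // merge messages
)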

I will try constructing the graph with StorageLevel.MEMORY_AND_DISK and post the outcome here.

The GC overhead error happens even before the algorithm starts its Pregel iterations; it is failing in the GraphLoader.fromEdgeList stage.

Aritra

Re: Spark 2.1.1 Graphx graph loader GC overhead error

yncxcw
hi, 

I think that if the OOM occurs before the computation begins, the input data is probably too big to fit in memory. I recall that graph data expands when it is loaded into memory, and the expansion factor is quite large (based on my experiments with PageRank).
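
As a rough, hedged back-of-envelope for the graph in this thread, counting only the raw per-edge payload (two 8-byte vertex IDs plus the 4-byte Int attribute that GraphLoader assigns) and ignoring JVM object headers, partition indexes, routing tables, and the vertex side entirely:

val edges        = 2e9                      // edges in the input file
val bytesPerEdge = 8 + 8 + 4                // srcId + dstId + Int attribute
val rawEdgeBytes = edges * bytesPerEdge     // roughly 40 GB of raw payload
val clusterMem   = 10 * 14.2                // roughly 142 GB allocated to Spark in total

println(f"raw edge payload ~ ${rawEdgeBytes / 1e9}%.0f GB of ~$clusterMem%.0f GB cluster memory")

Only a fraction of that total is actually available for caching, and the in-memory representation is typically several times larger than the raw payload, so a MEMORY_ONLY graph of this size can plausibly exhaust the heap during loading.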


Wei  Chen



Re: Spark 2.1.1 Graphx graph loader GC overhead error

Aritra Mandal
Hello Wei,

Thanks for the suggestions.

I tried this small piece of code with StorageLevel.MEMORY_AND_DISK; I removed the Pregel call just to test. The code still failed with OOM in the graph-loading stage.

val ygraph = GraphLoader
  .edgeListFile(sc, args(1), true, 32,
    StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
  .partitionBy(PartitionStrategy.RandomVertexCut)

println(ygraph.vertices.count())



Is there a way to calculate the maximum size of graph that a given cluster configuration can process?

Aritra
