GraphX/Spark Partition Strategies Vs Running time doubts

Hi All,

I am relatively new to GraphX/Spark. I have developed an iterative algorithm for detecting communities: the next iteration runs only if at least one vertex changed its community in the current one.
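To make the convergence condition concrete, here is a plain-Scala toy of the loop I mean (with a hypothetical "adopt the smallest community id among yourself and your neighbours" update rule; this is a sketch, not my actual Spark code):

```scala
object ConvergenceSketch {
  // toy undirected graph as an adjacency list; two obvious communities:
  // {1, 2, 3} form a triangle, {4, 5} a disconnected pair
  val neighbors: Map[Long, Seq[Long]] = Map(
    1L -> Seq(2L, 3L), 2L -> Seq(1L, 3L), 3L -> Seq(1L, 2L),
    4L -> Seq(5L), 5L -> Seq(4L))

  def run(): Map[Long, Long] = {
    // every vertex starts in its own community
    var community = neighbors.keys.map(v => v -> v).toMap
    var changed = true
    // iterate only while at least one vertex changed its community
    while (changed) {
      changed = false
      for (v <- neighbors.keys.toSeq.sorted) {
        // hypothetical rule: take the smallest community id seen locally
        val best = (community(v) +: neighbors(v).map(community)).min
        if (best != community(v)) {
          community += (v -> best)
          changed = true
        }
      }
    }
    community
  }
}
```

The distributed version does the same check globally: count the vertices that changed in an iteration, and stop when that count is zero.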

I am struggling with the partitioning strategy, as there is no clear documentation on it. I searched the mailing list and found no pointer to any document that explains it in detail; the only relevant thread I found was "the default GraphX graph-partition strategy on multicore machine", and after reading it I am still left with some questions.

I have a cluster with 8 workers, each with 32 cores and 124 GB of memory.
My input edge file has 1 million edges and is about 500 MB in size.

1) What happens when I pass --num-executors 120 --executor-memory 8g and set the number of partitions to 60? Will some executors sit idle?
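For reference, this is the submission I mean (the idle-executor arithmetic in the comments is my own assumption about how Spark schedules one task per partition per stage, and the jar name is hypothetical):

```shell
# 8 workers x 32 cores = 256 cores in the cluster.
# With only 60 graph partitions, a stage over the graph has at most
# 60 tasks in flight, so if 120 executors are granted, roughly half
# of them would have no task to run during those stages.
spark-submit \
  --num-executors 120 \
  --executor-memory 8g \
  my-community-detection.jar   # hypothetical application jar
```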

2) If I set parallelism to 100 and use the EdgePartition2D strategy, how will the 1M edges be assigned to 8 workers with 32 cores each? In "$.html" they talk about assigning vertices to machines. How are vertices, machines, and partitions related here? Does "machine" mean "partition"? And since I have 256 cores but only 100 partitions, what will happen?

3) In EdgePartition1D, edges with the same source go to the same partition. Is that also true in EdgePartition2D? Does EdgePartition2D take the destination ID into account as well?
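From reading the GraphX PartitionStrategy source, my (possibly simplified) understanding of the two hash functions, sketched in plain Scala:

```scala
object PartitionSketch {
  // large mixing prime as used in the GraphX source; the handling of a
  // non-square partition count is simplified in this sketch
  val mixingPrime = 1125899906842597L

  // EdgePartition1D: the partition depends on the source id only,
  // so all out-edges of a vertex land in a single partition
  def part1D(src: Long, numParts: Int): Int =
    (math.abs(src * mixingPrime) % numParts).toInt

  // EdgePartition2D: view the partitions as a ceil(sqrt(n)) x ceil(sqrt(n))
  // grid; the source id picks the column and the destination id picks the
  // row, so the destination id DOES matter here
  def part2D(src: Long, dst: Long, numParts: Int): Int = {
    val ceilSqrt = math.ceil(math.sqrt(numParts)).toInt
    val col = (math.abs(src * mixingPrime) % ceilSqrt).toInt
    val row = (math.abs(dst * mixingPrime) % ceilSqrt).toInt
    (col * ceilSqrt + row) % numParts
  }
}
```

If this sketch is right, edges with the same source are confined to one grid column under EdgePartition2D, which is where the often-quoted 2 * sqrt(numParts) vertex-replication bound comes from.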

4) Whatever partition strategy I use in Spark, are there always three tables (edge table, vertex table, routing table)? When a task needs to access triplets whose edges sit in other partitions, where does it replicate the required vertex (via the routing table?)? And how is mapReduceTriplets performed when the edges are in different partitions: do separate executors work on separate partitions for the map phase and then communicate through a shuffle for the reduce phase? If so, how do they communicate (using the routing table)?
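To make the question concrete, here is my toy, plain-Scala model of the edge table and routing table (this is my sketch of the vertex-cut idea, not GraphX's actual code):

```scala
object RoutingSketch {
  case class Edge(src: Long, dst: Long)

  // Partition edges by a 1D-style hash on the source, then build a
  // routing table recording, for every vertex id, which edge partitions
  // reference it; the vertex attribute would be shipped (replicated)
  // to exactly those partitions when triplets are materialized.
  def build(edges: Seq[Edge], numParts: Int)
      : (Map[Int, Seq[Edge]], Map[Long, Set[Int]]) = {
    // edge table: edges grouped by partition
    val edgeTable = edges.groupBy(e => (e.src % numParts).toInt)
    // routing table: vertex id -> set of edge partitions that need it
    val routing = edgeTable.toSeq
      .flatMap { case (p, es) => es.flatMap(e => Seq(e.src -> p, e.dst -> p)) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
    (edgeTable, routing)
  }
}
```

In this model, a vertex that appears in edges of two partitions (e.g. as source in one and destination in another) shows up in the routing table with both partition ids, so its attribute would be sent to both.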

5) For small datasets, the number of iterations my algorithm takes does not vary; for large datasets it varies with the number of partitions and with the partition strategy. Why? Please help and explain.
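One thing I am checking on my side (a guess, not a diagnosis): if the merge function passed to mapReduceTriplets is not commutative and associative, the result depends on the order in which messages are combined, and that order changes with the number of partitions and the strategy. A tiny illustration of such an order-dependent merge:

```scala
object OrderSketch {
  // hypothetical merge rule: "keep the first label seen" -
  // not commutative, so the combined result depends on message order
  def keepFirst(a: Long, b: Long): Long = a

  def combine(msgs: Seq[Long]): Long = msgs.reduce(keepFirst)
}
```

Combining the same messages in two different orders gives two different results, which on a large graph could translate into different community assignments per iteration and hence a different iteration count.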