GraphX/Spark partition strategies vs. running time: some doubts
I am relatively new to GraphX/Spark. I have developed an iterative community-detection algorithm (the next iteration runs only if at least one vertex changed its community in the current one).
I am struggling with the partitioning strategy, as there is no clear document on it. I searched the mailing list and found no pointer to a document where it is explained in detail (the only relevant thread I found was "the default GraphX graph-partition strategy on multicore machine?" on this list, and after reading it I am still left with some questions).
I have a cluster with 8 workers, each with 32 cores and 124 GB of memory.
My input edge file has 1 million edges and is about 500 MB in size.
1) What happens when I pass --num-executors 120 --executor-memory 8g but set the number of partitions to 60? Will some executors (or cores) sit idle?
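As a back-of-the-envelope sketch (plain Java, no Spark, using the hypothetical numbers from the question): Spark launches one task per RDD partition per stage, so with 60 partitions at most 60 tasks can run concurrently no matter how many executor cores the cluster offers; the surplus cores sit idle, not the partitions.

```java
public class PartitionsVsCores {
    // A stage cannot launch more simultaneous tasks than it has
    // partitions, nor more than the cluster has cores in total.
    public static int concurrentTasks(int numPartitions, int totalCores) {
        return Math.min(numPartitions, totalCores);
    }

    public static void main(String[] args) {
        int totalCores = 8 * 32;   // 8 workers x 32 cores = 256 cores
        int numPartitions = 60;    // partition count from the question
        System.out.println(concurrentTasks(numPartitions, totalCores)); // 60
    }
}
```

With only 60 partitions on 256 cores, roughly three quarters of the cores have nothing to do in that stage; a common rule of thumb is 2-4 partitions per available core.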
3) In EdgePartition1D, edges with the same source go to the same partition; is that also true in EdgePartition2D? Does EdgePartition2D take the destination ID into account as well?
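For reference, here is a sketch of the placement logic of the two strategies, paraphrased from the GraphX PartitionStrategy source (hedged: exact details may differ across Spark versions; the mixing prime below is the one GraphX uses). EdgePartition1D hashes only the source ID; EdgePartition2D lays edges out on a roughly sqrt(numParts) x sqrt(numParts) grid where the source picks the column and the destination picks the row, so it does consider the destination, and edges with the same source are spread over at most ceil(sqrt(numParts)) partitions rather than a single one.

```java
public class EdgePartitionSketch {
    static final long MIXING_PRIME = 1125899906842597L;

    // EdgePartition1D: only the source ID matters, so every edge with
    // the same source lands in the same partition.
    public static int partition1D(long src, int numParts) {
        return (int) (Math.abs(src * MIXING_PRIME) % numParts);
    }

    // EdgePartition2D: source selects the grid column, destination
    // selects the row, so same-source edges occupy one column of
    // (at most ceil(sqrt(numParts))) partitions.
    public static int partition2D(long src, long dst, int numParts) {
        int ceilSqrt = (int) Math.ceil(Math.sqrt(numParts));
        int col = (int) (Math.abs(src * MIXING_PRIME) % ceilSqrt);
        int row = (int) (Math.abs(dst * MIXING_PRIME) % ceilSqrt);
        return (col * ceilSqrt + row) % numParts;
    }
}
```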
4) When I use any partition strategy in Spark, are there always three tables (edge table, vertex table, routing table)? When a task needs to access triplets whose edges live in other partitions, where does Spark replicate the required vertices (via the routing table?), and how does mapReduceTriplets work when edges are in different partitions? Do separate executors work on separate partitions for the map phase and then communicate via a shuffle for the reduce phase? If so, how do they communicate (using the routing table)?
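To make the routing-table question concrete, here is a toy illustration (plain Java, not GraphX's actual implementation, with made-up vertex and partition IDs): vertices are stored separately from the edge partitions, and the routing table records, for each vertex ID, which edge partitions reference it, so vertex attributes are shipped (replicated) only to the partitions whose triplets need them.

```java
import java.util.*;

public class RoutingTableSketch {
    // edgeParts: edge-partition id -> array of {src, dst} edges.
    // Returns: vertex id -> set of edge partitions that mention it.
    public static Map<Long, Set<Integer>> buildRoutingTable(Map<Integer, long[][]> edgeParts) {
        Map<Long, Set<Integer>> routing = new HashMap<>();
        for (Map.Entry<Integer, long[][]> e : edgeParts.entrySet()) {
            for (long[] edge : e.getValue()) {       // edge = {src, dst}
                for (long v : edge) {
                    routing.computeIfAbsent(v, k -> new HashSet<>()).add(e.getKey());
                }
            }
        }
        return routing;
    }

    public static void main(String[] args) {
        Map<Integer, long[][]> edgeParts = new HashMap<>();
        edgeParts.put(0, new long[][] {{1, 2}, {2, 3}}); // edge partition 0
        edgeParts.put(1, new long[][] {{1, 3}});         // edge partition 1
        // Vertex 1 is referenced by both edge partitions, so its attribute
        // would be replicated to both before triplets are materialized.
        System.out.println(buildRoutingTable(edgeParts).get(1L)); // contains 0 and 1
    }
}
```

In this picture, the map side of mapReduceTriplets runs locally on each edge partition against the replicated vertex copies, and the reduce side combines messages per destination vertex via a shuffle back to the vertex partitions.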
5) The number of iterations my algorithm takes does not vary for small datasets, but for large datasets it varies with the number of partitions and with the partition strategy. Why? Please help and explain.