[apache-spark]-spark-shuffle

[apache-spark]-spark-shuffle

Vijay Kumar
Hi,

I am trying to thoroughly understand the concepts below in Spark.
1. A job reads 2 files and performs a cartesian join (a rough sketch of the job is included after this list).
2. The input sizes are 55.7 MB and 67.1 MB.
3. After reading the input files, Spark performed a shuffle, and for both inputs the shuffle size was only in KB. I want to understand why this size is not the complete size of a file. My understanding is that only the records which need to move from one executor to another are shuffled, not the whole file. Is this correct?

4. What are shuffle spill (memory) and shuffle spill (disk)? Do these represent figures for the same data, one measured in memory and the other on disk? And how are these values calculated?
5. When does a shuffle need to spill? I understand it spills when the data does not fit in memory, but in which situations or scenarios can this happen?
6. On the SQL tab for the above join there are 2 Exchanges. Their data size totals are:
   190. MB (37.3 MB, 51.0 MB, 51.0 MB)
   228.9 MB (46.9 MB, 60.6 MB, 60.6 MB)
How are these figures calculated?
On the Sort operators the figures are:
   524.2 MB peak memory total (min, med, max: 64 KB, 64 KB, 128 MB)
   576.0 MB peak memory total (min, med, max: 144 MB, 144 MB, 144 MB)
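For reference, the job is roughly the following (a simplified sketch; the real paths, file format, and downstream steps are omitted):

import org.apache.spark.sql.SparkSession

// Simplified sketch of the job described above; paths and format are placeholders.
val spark = SparkSession.builder().appName("cartesian-join").getOrCreate()

val left  = spark.read.textFile("hdfs:///data/input_a.txt").toDF("a")  // ~55.7 MB input
val right = spark.read.textFile("hdfs:///data/input_b.txt").toDF("b")  // ~67.1 MB input

// Explicit cartesian product of the two inputs.
val joined = left.crossJoin(right)
joined.write.parquet("hdfs:///data/output")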

I am trying to understand many things here. If you can point me to a guide, link, or book where I can get answers to the above questions (along with the other questions that will come up), that would be great.


Re: [apache-spark]-spark-shuffle

VP
How a Spark job reads data sources depends on the underlying source system and on the job configuration (number of executors and cores per executor):
https://spark.apache.org/docs/latest/rdd-programming-guide.html#external-datasets
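For example, with a file-based source the number of input partitions depends on the source's split/block size, while the parallelism available to process them is bounded by the executors and cores you ask for. Illustrative submit settings only (class and jar names are placeholders):

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.CartesianJob \
  my-job.jar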

About shuffle operations:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
https://stackoverflow.com/questions/32210011/spark-difference-between-shuffle-write-shuffle-spill-memory-shuffle-spill
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs/29012187#29012187

This one has a great explanation of how the shuffle works:
https://stackoverflow.com/questions/37528047/how-are-stages-split-into-tasks-in-spark
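As a generic illustration (not your job): any wide transformation forces a shuffle, and a task spills when its in-memory shuffle buffers outgrow the execution memory it was granted.

// Generic example: groupByKey is a wide transformation, so records with the
// same key must be moved to the same task, i.e. a shuffle.
val pairs = spark.sparkContext
  .textFile("hdfs:///data/input_a.txt")
  .map(line => (line.take(3), line))

pairs.groupByKey().count()

// If a task's in-memory shuffle buffers exceed the execution memory available
// to it, Spark sorts that data and writes it to disk. "Shuffle spill (memory)"
// is the deserialized size of the data in memory at the moment it was spilled;
// "shuffle spill (disk)" is the serialized size actually written to disk,
// which is why the two numbers differ for the same data.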

========
A sample of your code and job configuration, the DAG, and the underlying source (HDFS or other) would help explain the numbers further.
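For example (hypothetical variable name, since the actual code was not posted), printing the physical plan makes it much easier to map the Exchange and Sort metrics you quoted to concrete operators on the SQL tab:

// joined is the DataFrame produced by the cartesian join.
joined.explain(true)   // prints the parsed, analyzed, optimized and physical plans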

thanks
VP



