I am trying to thoroughly understand the following concepts in Spark.
1. A job reads 2 files and performs a cartesian join.
2. The input sizes are 55.7 MB and 67.1 MB.
3. After reading the input files, Spark performed a shuffle, and for both inputs the shuffle size was only in KB. I want to understand why this size is not the full file size. My understanding is that only the records that need to move from one executor to another are shuffled; the whole file does not have to be shuffled. Is this understanding correct?
4. What are shuffle spill (memory) and shuffle spill (disk)? Do these represent the same data, one measured in memory and the other on disk? And how are these values calculated?
5. When does a shuffle need to spill? It needs to spill when the data does not fit in memory, but what are the situations or scenarios in which this can happen?
6. On the SQL tab for the join above there are 2 Exchanges. The data size totals are 228.9 MB (46.9 MB, 60.6 MB, 60.6 MB). How are these figures calculated?
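For questions 4 and 5, here is a toy sketch (plain Python, deliberately not Spark internals) of how I currently picture the two spill metrics: an aggregator buffers records in memory and, when the buffer grows past a threshold, writes it out to disk. In that picture, spill (memory) would be the in-memory size of the data at the moment it was spilled, and spill (disk) would be the serialized size actually written. All names, thresholds, and numbers below are mine, chosen only to force spills:

```python
import os
import pickle
import sys
import tempfile

SPILL_THRESHOLD_BYTES = 4096  # tiny on purpose, to force frequent spills

buffer = []
spill_memory = 0   # cumulative in-memory size of data at spill time
spill_disk = 0     # cumulative serialized size written to disk
spill_files = []


def buffer_size(buf):
    # Rough in-memory footprint of the buffered records.
    return sum(sys.getsizeof(rec) for rec in buf)


def maybe_spill():
    global buffer, spill_memory, spill_disk
    mem = buffer_size(buf=buffer) if False else buffer_size(buffer)
    if mem > SPILL_THRESHOLD_BYTES:
        spill_memory += mem
        f = tempfile.NamedTemporaryFile(delete=False)
        # The serialized form is usually smaller than the in-memory form,
        # which (if my mental model is right) is why the two metrics differ.
        f.write(pickle.dumps(buffer))
        f.close()
        spill_disk += os.path.getsize(f.name)
        spill_files.append(f.name)
        buffer = []


for i in range(2000):
    buffer.append(f"record-{i}")
    maybe_spill()

print(f"spill (memory): {spill_memory} B, spill (disk): {spill_disk} B")

for name in spill_files:
    os.remove(name)
```

Running this, spill (memory) comes out larger than spill (disk), which matches what I see in the Spark UI, but I would like confirmation that this is actually what the two numbers mean and how Spark computes them.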
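For question 6, my working assumption is that the SQL tab renders each metric as total (min, median, max) aggregated over the tasks of the stage, so the three bracketed numbers are per-task statistics and are not expected to sum to the leading total. A small sketch of that aggregation with made-up per-task values (the real per-task numbers from my job are not shown above):

```python
import statistics

# Hypothetical per-task "data size" values in MB; illustrative only.
task_sizes_mb = [40, 60, 60, 68]

total = sum(task_sizes_mb)
lo = min(task_sizes_mb)
med = statistics.median(task_sizes_mb)
hi = max(task_sizes_mb)

# Rendered the way I assume the SQL tab renders it: total (min, median, max)
print(f"{total} MB ({lo} MB, {med} MB, {hi} MB)")
```

Is this reading of the Exchange metrics correct, and how does Spark arrive at the underlying per-task data size in the first place?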