Can reduced parallelism lead to no shuffle spill?

V0lleyBallJunki3
Consider an example where I have a cluster with 5 nodes, each with 64 cores
and 244 GB of memory. I decide to run 3 executors on each node and set
executor-cores to 21 and executor memory to 80 GB, so that each executor can
execute 21 tasks in parallel. Now consider 315 (63 * 5) partitions of data,
of which 314 are 3 GB each but one is 30 GB (due to data skew). Each executor
that receives only 3 GB partitions has 63 GB occupied (21 * 3, since it runs
21 tasks in parallel and each task holds 3 GB in memory). But the one
executor that receives the 30 GB partition will need 90 GB (20 * 3 + 30) of
memory. So will this executor first execute the 20 tasks of 3 GB each and
only then load the 30 GB task, or will it try to load all 21 tasks at once
and find that for one task it has to spill to disk? If I set executor-cores
to just 15, then the executor that receives the 30 GB partition will only
need 14 * 3 + 30 = 72 GB and hence won't spill to disk. So in this case,
will reduced parallelism lead to no shuffle spill?
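
For reference, a minimal sketch of the setup described above. The app name is
a placeholder, and these values are more commonly passed to spark-submit as
--num-executors, --executor-cores and --executor-memory:

    import org.apache.spark.sql.SparkSession

    // Cluster layout described above: 5 nodes * 3 executors = 15
    // executors, each with 21 cores (21 concurrent tasks) and 80 GB.
    val spark = SparkSession.builder()
      .appName("skew-example") // placeholder name
      .config("spark.executor.instances", "15")
      .config("spark.executor.cores", "21")
      .config("spark.executor.memory", "80g")
      .getOrCreate()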




Re: Can reduced parallelism lead to no shuffle spill?

Alexander Czech-2
Why don't you just repartition the dataset? If the partitions are really that
unevenly sized, you should probably fix that first. That potentially also
saves a lot of trouble later on.
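
For concreteness, a rough sketch of two ways to do that. The input path, the
"key" column and the salt range of 10 are made up for illustration, and the
salting variant only helps if the skew comes from a single hot key:

    import org.apache.spark.sql.functions._

    // Assumes a SparkSession named `spark`; path and schema are hypothetical.
    val df = spark.read.parquet("hdfs:///data/events")

    // Option 1: round-robin repartition into evenly sized partitions.
    // Costs a full shuffle, but removes the 30 GB outlier partition.
    val even = df.repartition(630) // e.g. 2 * 315, also hypothetical

    // Option 2: if one hot key causes the skew, add a random salt so
    // the hot key's rows spread across up to 10 partitions.
    val salted = df
      .withColumn("salt", (rand() * 10).cast("int"))
      .repartition(col("key"), col("salt"))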


Re: Can reduced parallelism lead to no shuffle spill?

V0lleyBallJunki3
I am just using the above example to understand how Spark handles partitions.
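
If it helps to see the layout concretely, something like this (a sketch; it
counts rows per partition rather than bytes) shows how the data is spread
across partitions:

    // Count records in each partition of a DataFrame `df` to spot skew.
    val sizes = df.rdd
      .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
      .collect()

    // Print the five largest partitions.
    sizes.sortBy(-_._2).take(5).foreach { case (i, n) =>
      println(s"partition $i: $n rows")
    }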


