[SPARK SQL] Sometimes spark does not scale down on k8s

dmn42
Hi all!
 
We use Spark as a constantly running SQL interface to Parquet data on HDFS and GCS for our in-house app, with autoscaling (dynamic allocation) on the k8s backend. Sometimes (about once a day) something goes wrong and Spark stops scaling down, staying at the maximum number of executors.
I've checked the graphs (https://imgur.com/a/6h3MfPa) and found a few strange things:
- numberTargetExecutors and numberMaxNeededExecutors both increase drastically at the same time and stay high even when there are no requests at all (I tried removing the driver from the backend pool; it still did not scale down after ~20 minutes without requests).
- Lots of events are dropped from the executorManagement queue.
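 
The two observations above may be related. The sketch below is a simplified model, not the Spark source: it assumes the allocation manager derives the needed executor count from the pending and running tasks it has seen in listener events, scaled by executorAllocationRatio. Under that assumption, a dropped task-end event leaves a task counted as running forever, so the target never falls. The function names, the figure of 200 lost events, and tasks_per_executor=1 are illustrative assumptions.

```python
import math

def max_needed_executors(pending_tasks, running_tasks, allocation_ratio, tasks_per_executor):
    """Simplified model: executors needed for all tasks the listener knows about."""
    return math.ceil((pending_tasks + running_tasks) * allocation_ratio / tasks_per_executor)

def target_executors(max_needed, min_executors, max_executors):
    """Clamp the demand into the configured [min, max] range."""
    return max(min_executors, min(max_needed, max_executors))

# Ratio, min and max come from the posted options; tasks_per_executor=1 is an assumption.
RATIO, MIN_EXEC, MAX_EXEC, TASKS_PER_EXEC = 0.5, 5, 50, 1

# No requests and every task-end event delivered: nothing is pending or running.
idle = max_needed_executors(0, 0, RATIO, TASKS_PER_EXEC)

# Hypothetical: 200 task-end events were dropped, so 200 tasks still look "running".
stale = max_needed_executors(0, 200, RATIO, TASKS_PER_EXEC)

print(target_executors(idle, MIN_EXEC, MAX_EXEC))   # falls back to minExecutors
print(target_executors(stale, MIN_EXEC, MAX_EXEC))  # pinned at maxExecutors
```

If this model matches what happens in the driver, the dropped events would be the cause of the stuck target rather than a separate symptom.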
 
I've tried increasing the executorManagement queue size to 30000, but this did not help.
 
Is this a bug or expected behavior? Should I increase the queue size even more, or is there something else to adjust?
 
Thank you.
 
spark: 3.1.1
jvm: openjdk-11-jre-headless:amd64      11.0.10+9-0ubuntu1~18.04
k8s provider: gke
 
Some related Spark options:
 
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=50
spark.dynamicAllocation.executorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
spark.dynamicAllocation.shuffleTracking.timeout=120s
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.schedulerBacklogTimeout=2s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.scheduler.listenerbus.eventqueue.capacity=30000
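 
The last option above sets the default capacity for every listener-bus queue. Spark 3.x also accepts a per-queue key, spark.scheduler.listenerbus.eventqueue.executorManagement.capacity, which enlarges only the executorManagement queue. The sketch below renders a set of options as spark-submit arguments; the helper name to_submit_flags and the chosen values are illustrative assumptions, not recommendations.

```python
import shlex

def to_submit_flags(conf):
    """Render a dict of Spark options as a list of spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags

conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.maxExecutors": "50",
    # Default capacity shared by all listener-bus queues (10000 is Spark's default).
    "spark.scheduler.listenerbus.eventqueue.capacity": "10000",
    # Override for the executorManagement queue only; 30000 is an illustrative value.
    "spark.scheduler.listenerbus.eventqueue.executorManagement.capacity": "30000",
}

# Print a shell-safe argument string to append to a spark-submit command line.
print(shlex.join(to_submit_flags(conf)))
```

Overriding a single queue keeps the other queues at their default size, which limits the extra heap the driver may need.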
 
-- 
Grats, Alex.
 
--------------------------------------------------------------------- To unsubscribe e-mail: [hidden email]
Re: [SPARK SQL] Sometimes spark does not scale down on k8s

dmn42
I've increased spark.scheduler.listenerbus.eventqueue.executorManagement.capacity to 10M, which led to several things.
First, the scaler didn't break when I expected it to: maxNeededExecutors remained low (except for peak values).
Second, the scaler started to behave a bit strangely. With maxExecutors=50 I saw up to 79 executors according to JVM metrics and up to 78 counted from API data (the graphs didn't match; the two values changed independently).
At the same time the pod count didn't change: it peaked at 50 pods.
And one more thing, for dessert: with the 10M queue I ran out of the 10G heap in less than three days. But this was expected, so no questions :)
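 
A rough back-of-the-envelope check of why a 10M-entry queue can exhaust a 10G heap. The average event size is not stated anywhere in this thread; the 1 KiB figure below is purely an assumption, and the result is only an upper bound that applies when the queue actually fills up.

```python
GIB = 1024 ** 3

def queue_heap_bytes(capacity, avg_event_bytes):
    """Worst-case heap held by a full queue of `capacity` events."""
    return capacity * avg_event_bytes

def max_capacity_for_budget(heap_budget_bytes, avg_event_bytes):
    """Largest queue capacity whose worst case fits in the given heap budget."""
    return heap_budget_bytes // avg_event_bytes

AVG_EVENT_BYTES = 1024  # assumption: average retained size of one queued event

# Worst case for the 10M capacity mentioned above.
worst_case = queue_heap_bytes(10_000_000, AVG_EVENT_BYTES)
print(f"full 10M queue: {worst_case / GIB:.1f} GiB")

# Capacity that would stay within a hypothetical 1 GiB budget for this queue.
print("capacity for 1 GiB:", max_capacity_for_budget(1 * GIB, AVG_EVENT_BYTES))
```

Measuring the real event size (for example from a heap dump) would be needed before picking a capacity this way.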
 
 
In reply to the message from 02.04.2021, 17:47, "Alexei" <[hidden email]> (quoted in full above).
 
-- 
Grats, Alex.
 