Large Scheduler Delay Causing Performance Issue in Spark Application


Akshat Bordia

Hi All,

 

We are seeing a performance issue where a particular stage has been running for more than 2 hours and has still not completed. The stage is doing a GROUP BY in Spark SQL. The data size is around 40 GB and there are roughly 11K tasks. The cluster is running in Standalone mode and the job is submitted via the Livy Session APIs. There are 40 executors with 8 cores and 24 GB of memory each. Looking at the stage details, the task durations themselves are short, with a median of 2 seconds. However, there is a large scheduler delay: the 75th percentile is 7.1 minutes and the max is around 21 minutes. What could be causing such a large scheduler delay? Other metrics are shown in the screenshot below, along with the event timeline chart.


[Screenshots: stage summary metrics and event timeline chart]
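
For context, here is a simplified sketch (Scala) of the kind of query and settings involved; the table/column names, output path, and partition count are made up for illustration, not our actual values. With ~40 GB shuffled across ~11K tasks, each task only handles a few MB, so we are wondering whether per-task scheduling overhead dominates and whether fewer, larger partitions would help:

import org.apache.spark.sql.SparkSession

// Hypothetical session; in our setup this runs inside a Livy session.
val spark = SparkSession.builder()
  .appName("groupby-stage-sketch")
  .getOrCreate()

// ~40 GB / ~11K tasks is only ~4 MB of shuffle data per task, so per-task
// scheduling overhead can dominate. 640 here is a guess: 2x the 320 total
// cores (40 executors x 8 cores each).
spark.conf.set("spark.sql.shuffle.partitions", "640")

// Simplified stand-in for the actual query; names are hypothetical.
val grouped = spark.sql(
  """SELECT group_key, COUNT(*) AS cnt
    |FROM input_table
    |GROUP BY group_key""".stripMargin)

grouped.write.mode("overwrite").parquet("/tmp/grouped_out")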


We changed the locality configs to the following to rule out locality-related scheduler delay, but that didn't seem to help much:
spark.locality.wait=0s
spark.shuffle.reduceLocality.enabled=false
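
Since these configs need to be in place when the driver starts, we pass them in the Livy session-creation request. Roughly, the POST /sessions payload looks like this (trimmed to the relevant fields; other fields in our actual request are omitted):

{
  "kind": "spark",
  "conf": {
    "spark.locality.wait": "0s",
    "spark.shuffle.reduceLocality.enabled": "false"
  }
}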


Also, we started seeing executor-lost errors after a couple of hours:

ExecutorLostFailure (executor 39 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 134746 ms
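
For what it's worth, 134746 ms is just above the default spark.network.timeout of 120s, so the executor was probably unresponsive (e.g. stuck in a long GC pause) rather than actually gone. If anyone thinks it is purely a timeout issue, we could try raising the timeouts, e.g.:

spark.network.timeout=600s
spark.executor.heartbeatInterval=60s

(keeping spark.executor.heartbeatInterval well below spark.network.timeout, as the Spark docs recommend; these values are guesses, not something we have tested). But we suspect the lost executors are a symptom of the same underlying problem as the scheduler delay.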


Please let me know if anyone has any thoughts on how this can be improved. Appreciate any help.

 

Thanks,

Akshat