Large Scheduler Delay Causing Performance Issue in Spark Application
We are seeing a performance issue where a particular stage
has been running for more than 2 hours and has still not completed. This stage performs a GROUP BY in Spark SQL. The
data size is around 40 GB and there are about 11K tasks. The cluster is
running in Standalone mode and the job is submitted via Livy Session APIs. There
are 40 executors with 8 cores and 24 GB of memory each. Looking at the stage
details, we see that the tasks themselves are short, with a median duration of 2 seconds.
However, there is a large scheduler delay, with the 75th percentile at
7.1 minutes and a maximum of around 21 minutes. What could be causing
this large scheduler delay? Other metrics are listed in the following screenshot. Also, please find below the event timeline chart.
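For a sense of scale, here is a rough sketch of the arithmetic behind the numbers above. It assumes one core per task (the `spark.task.cpus` default), that every task takes the median 2 seconds, and that 1 GB is 1024 MB; the variable names are only illustrative.

```python
# Back-of-envelope numbers taken from the description above; all values are approximate.
executors = 40
cores_per_executor = 8
total_tasks = 11_000         # "about 11K tasks"
data_size_gb = 40            # "around 40 GB"
median_task_seconds = 2      # median task duration from the stage details

# Assumption: one core per task, so each core is one concurrent task slot.
total_slots = executors * cores_per_executor

# Number of "waves" needed to run every task once (ceiling division).
waves = -(-total_tasks // total_slots)

# Idealized compute time if every task took exactly the median duration.
ideal_compute_seconds = waves * median_task_seconds

# Average amount of input handled by each task.
mb_per_task = data_size_gb * 1024 / total_tasks

print(f"task slots:            {total_slots}")
print(f"waves of tasks:        {waves}")
print(f"ideal compute time:    {ideal_compute_seconds} s")
print(f"average data per task: {mb_per_task:.1f} MB")
```

Under these assumptions the task work alone would fit in roughly 70 seconds across 320 slots, with each task handling only a few MB on average, so almost all of the 2+ hours appears to be spent outside the tasks themselves.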
We changed the locality configs to the following to avoid any
scheduler delay, but that didn’t seem to help much:
We also started seeing executor lost errors after a couple of hours:
ExecutorLostFailure (executor 39 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 134746 ms
Please let me know if anyone has any thoughts on how this can be improved. Any help is appreciated.