[2.4.5 Standalone Master]: Idle cores not being allocated


krchia
Background:
I have a Spark 2.4.5 cluster in standalone mode, orchestrated by Nomad jobs
running on EC2. We deploy a Scala web server as a long-running jar via
`spark-submit` in client mode. Sometimes our in-house autoscaler scales down
and kills workers without checking whether any of their cores are allocated
to existing applications. Those applications then end up with 0 cores, even
though there are healthy workers in the cluster.
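
A minimal sketch of the check the autoscaler is missing, assuming it can
obtain a per-worker count of cores in use. `WorkerStats` and `safeToRemove`
are hypothetical names for illustration, not part of Spark:

```scala
object ScaleDownSketch {
  // Hypothetical snapshot of one worker as seen by the autoscaler.
  final case class WorkerStats(id: String, cores: Int, coresUsed: Int)

  // Return the ids of at most `count` workers to terminate,
  // considering only workers with no cores allocated to applications.
  def safeToRemove(workers: Seq[WorkerStats], count: Int): Seq[String] =
    workers.filter(_.coresUsed == 0).take(count).map(_.id)
}
```

With a filter like this, a worker that still hosts executors is never chosen
for scale-down, so the 0-core state described here would not be reached by
autoscaling alone.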

However, only when I submit a new application or register a new worker in
the cluster does the master finally reallocate cores to the application.
This is problematic, because the long-running 0-core application stays stuck
until then.

Could this be related to the fact that `schedule()` is only triggered by new
workers / new applications as commented here?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L721-L724

If that is the case, should the master be calling `schedule()` after removing
workers in `timeOutDeadWorkers()`?
https://github.com/apache/spark/blob/v2.4.5/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L417

The downscaling causes me to see this in my logs, so I am fairly certain
`timeOutDeadWorkers()` is being called:
```
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60 seconds
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: worker-20200608113523-10.158.242.213-7077
```
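
The suspected behavior, and the effect of rescheduling after a worker is
removed, can be illustrated with a toy model. This is only a sketch of the
hypothesis in this post, not Spark's actual `Master` code: `ToyMaster`,
`Worker`, `App`, and the `rescheduleOnRemoval` flag are all made up for
illustration, and the allocation loop is far simpler than the real scheduler.

```scala
import scala.collection.mutable

object SchedulingSketch {
  // Hypothetical, heavily simplified stand-ins for the master's bookkeeping.
  final class Worker(val id: String, var freeCores: Int)
  final class App(val id: String, val wantedCores: Int) {
    var grantedCores: Int = 0
    val coresByWorker = mutable.Map.empty[String, Int]
  }

  final class ToyMaster(rescheduleOnRemoval: Boolean) {
    private val workers = mutable.LinkedHashMap.empty[String, Worker]
    private val apps = mutable.ArrayBuffer.empty[App]

    // Registering a worker or an application triggers a scheduling pass.
    def registerWorker(w: Worker): Unit = { workers(w.id) = w; schedule() }
    def registerApp(a: App): Unit = { apps += a; schedule() }

    // Removing a worker takes its cores away from every app that used them.
    // A scheduling pass follows only if rescheduleOnRemoval is set.
    def removeWorker(id: String): Unit = {
      workers.remove(id).foreach { _ =>
        apps.foreach { a =>
          a.grantedCores -= a.coresByWorker.remove(id).getOrElse(0)
        }
      }
      if (rescheduleOnRemoval) schedule()
    }

    // Greedy pass: give each app free cores until its request is met.
    private def schedule(): Unit =
      for (a <- apps; w <- workers.values) {
        val grant = math.min(a.wantedCores - a.grantedCores, w.freeCores)
        if (grant > 0) {
          w.freeCores -= grant
          a.grantedCores += grant
          a.coresByWorker(w.id) = a.coresByWorker.getOrElse(w.id, 0) + grant
        }
      }
  }

  // Scenario from this post: the app's only worker is lost while another
  // healthy worker with free cores is already registered.
  def coresAfterWorkerLoss(rescheduleOnRemoval: Boolean): Int = {
    val master = new ToyMaster(rescheduleOnRemoval)
    val app = new App("app-1", wantedCores = 2)
    master.registerWorker(new Worker("w1", 2))
    master.registerApp(app)
    master.registerWorker(new Worker("w2", 2))
    master.removeWorker("w1")
    app.grantedCores
  }
}
```

In this model, `coresAfterWorkerLoss(rescheduleOnRemoval = false)` returns 0
even though `w2` has two free cores, while
`coresAfterWorkerLoss(rescheduleOnRemoval = true)` returns 2. Whether the real
master behaves the same way is exactly the question being asked here.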


