[2.4.5 Standalone Master]: Idle cores not being allocated
I have a Spark cluster (2.4.5) in standalone mode, orchestrated by Nomad jobs
running on EC2. We deploy a Scala web server as a long-running jar via
`spark-submit` in client mode. Sometimes the application ends up with 0 cores
because our in-house autoscaler scales down and kills workers without checking
whether any of a worker's cores are allocated to existing applications. These
applications then end up with 0 cores, even though there are healthy workers
in the cluster.
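For reference, the check our autoscaler is missing would look roughly like the following. This is only a sketch: `WorkerView`, its fields, and `ScaleDown.safeToKill` are hypothetical names made up for illustration, not part of Spark or of our actual autoscaler, and how the cores-in-use figure is obtained is left open.

```scala
// Hypothetical view of a worker as an autoscaler might see it. The field
// names are assumptions for this sketch, not a Spark API.
final case class WorkerView(id: String, cores: Int, coresUsed: Int)

object ScaleDown {
  // Pick up to `count` workers to terminate, considering only workers that
  // have no cores allocated to any application.
  def safeToKill(workers: Seq[WorkerView], count: Int): Seq[WorkerView] =
    workers.filter(_.coresUsed == 0).take(count)
}
```

With such a filter, a worker whose cores are in use would never be picked for termination, at the cost of sometimes scaling down by fewer workers than requested.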
However, the master only reallocates cores to the application once I submit a
new application or register a new worker in the cluster. This is problematic
because, until one of those events happens, the long-running 0-core
application is stuck.
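To make the symptom concrete, here is a toy model of the behaviour I am observing. It is NOT Spark's Master code: `ToyMaster` and all of its methods are hypothetical, and the assumption baked into it (scheduling runs only when a worker or application registers, never when a worker is removed) is my guess from the outside, not something I have confirmed in the source.

```scala
import scala.collection.mutable

// Toy model only: this is NOT Spark's Master. All names are hypothetical.
final class ToyMaster {
  private val freeCores = mutable.LinkedHashMap.empty[String, Int] // workerId -> free cores
  private val wanted    = mutable.LinkedHashMap.empty[String, Int] // appId -> cores wanted
  private val granted   = mutable.Map.empty[(String, String), Int] // (appId, workerId) -> cores

  // Total cores currently granted to an application across all workers.
  def coresOf(appId: String): Int =
    granted.collect { case ((a, _), c) if a == appId => c }.sum

  // Registration events are the only places where scheduling is triggered.
  def registerWorker(id: String, cores: Int): Unit = { freeCores(id) = cores; schedule() }
  def registerApp(id: String, cores: Int): Unit = { wanted(id) = cores; schedule() }

  // Assumption under test: removing a worker drops its grants but does not
  // trigger a new scheduling pass.
  def removeWorker(id: String): Unit = {
    freeCores.remove(id)
    granted.keys.filter(_._2 == id).toList.foreach(k => granted.remove(k))
  }

  // Give each application free cores until it has as many as it asked for.
  private def schedule(): Unit =
    for ((app, want) <- wanted; (worker, free) <- freeCores.toList) {
      val give = math.min(want - coresOf(app), free)
      if (give > 0) {
        granted((app, worker)) = granted.getOrElse((app, worker), 0) + give
        freeCores(worker) = free - give
      }
    }
}
```

In this model, registering workers `w1` and `w2` with 2 cores each and then an application asking for 2 cores places the application on `w1`. Calling `removeWorker("w1")` leaves the application at 0 cores while `w2` sits idle, and only a later `registerWorker` or `registerApp` call hands it the idle cores, which matches what I see on the real cluster.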
The downscaling produces the following in my logs, so I am fairly certain
`timeOutWorkers()` is being called:
20/06/08 11:40:56 INFO Master: Application app-20200608114056-0006 requested to set total executors to 1.
20/06/08 11:40:56 INFO Master: Launching executor app-20200608114056-0006/0 on worker worker-20200608113523-<IP_ADDRESS>-7077
20/06/08 11:41:44 WARN Master: Removing worker-20200608113523-<IP_ADDRESS>-7077 because we got no heartbeat in 60
20/06/08 11:41:44 INFO Master: Removing worker worker-20200608113523-<IP_ADDRESS>-7077 on <IP_ADDRESS>:7077
20/06/08 11:41:44 INFO Master: Telling app of lost executor: 0
20/06/08 11:41:44 INFO Master: Telling app of lost worker: