Seeing a framework registration loop with Spark 2.3.1 on DC/OS 1.10.0
I'm attempting to use Spark 2.3.1 (spark-2.3.1-bin-hadoop2.7.tgz) in cluster mode and running into some issues. This is a cluster where we've had success with Spark 2.2.0 (spark-2.2.0-bin-hadoop2.7.tgz), and I'm simply upgrading our nodes to the new Spark 2.3.1 package and testing it out.
This is a multi-node cluster, and I'm submitting a job that uses the sample spark-pi jar included in the distribution. Occasionally, a spark-submit run completes without issue. Then a run will begin execution and immediately emit a burst of TASK_LOST messages, followed by the BlockManager attempting to remove a handful of non-existent executors. I can also see the driver/scheduler enter a tight loop of SUBSCRIBE requests against the master.mesos service. The request volume and frequency are so high that the Mesos master stops responding to other requests, eventually runs out of memory, and is restarted by systemd after the process fails. If only one job is running and it manages to start an executor (exactly one started in my sample logs), the job will eventually complete. However, if I deploy multiple jobs (five seemed to do the trick), I've seen cases where none of the jobs complete and the cluster suffers cascading failures, because the master stops servicing other API requests under the influx of REGISTER requests from the numerous Spark driver frameworks.
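For reference, the jobs are submitted roughly along these lines. This is a sketch, not the exact command: the dispatcher URL, artifact host, and resource settings below are placeholders for our environment, and the class name and jar are the standard SparkPi example shipped with the Spark distribution.

```shell
# Submit the bundled SparkPi example in cluster mode through the Mesos
# cluster dispatcher. The dispatcher address and jar URL are placeholders;
# substitute the values for your own DC/OS cluster.
spark-submit \
  --master mesos://spark.marathon.mesos:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.cores.max=4 \
  https://artifact-host.example.com/spark-examples_2.11-2.3.1.jar \
  100
```

Launching five of these concurrently is what reliably reproduced the SUBSCRIBE/REGISTER loop described above.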