Seeing a framework registration loop with Spark 2.3.1 on DCOS 1.10.0

David Hesson
I’m attempting to use Spark 2.3.1 (spark-2.3.1-bin-hadoop2.7.tgz) in cluster mode and running into some issues. This is a cluster where we've had success using Spark 2.2.0 (spark-2.2.0-bin-hadoop2.7.tgz), and I'm simply upgrading our nodes with the new Spark 2.3.1 package and testing it out.

Some version information:

Spark v2.3.1
DC/OS v1.10.0
Mesos v1.4.0
Dispatcher: Docker, mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 (Docker image from

This is a multi-node cluster. I'm submitting a job that uses the sample spark-pi jar included in the distribution. Occasionally, spark-submit runs without issue.

Then a run will begin execution where a bunch of TASK_LOST messages occur immediately, followed by the BlockManager attempting to remove a handful of non-existent executors. I can also see the driver/scheduler enter a tight loop of SUBSCRIBE requests to the master.mesos service. The request volume and frequency are so high that the Mesos master stops responding to other requests, eventually runs out of memory, and systemd restarts the failed process.

If there is only one job running and it's able to start an executor (exactly one started in my sample logs), the job will eventually complete. However, if I deploy multiple jobs (five seemed to do the trick), I've seen cases where none of the jobs complete, and the cluster begins to have cascading failures because the master stops servicing other API requests under the influx of REGISTER requests from the numerous Spark driver frameworks.
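For reference, the exact submission command isn't shown above; a typical cluster-mode submission of the bundled SparkPi example through the MesosClusterDispatcher looks roughly like the sketch below. The dispatcher host/port and the jar URL are placeholders, not values from this cluster:

```shell
# Sketch of a cluster-mode submission to a MesosClusterDispatcher.
# Hostname, port, and jar URL are placeholders; in cluster mode the
# application jar must be at a location reachable by the cluster
# (e.g. an http:// or hdfs:// URL), not a local path on the client.
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://spark-dispatcher.example.com:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.executor.docker.image=mesosphere/spark:2.3.1-2.2.1-2-hadoop-2.6 \
  https://example.com/jars/spark-examples_2.11-2.3.1.jar 100
```

The trailing `100` is the number of partitions SparkPi spreads its sampling over.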

Problematic run (stdout, stderr, mesos.master logs):
Successful run (stdout, stderr; for comparison):
Snippet of flood of subscribes hitting master node: