Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

Nimi W (Fri, Jul 13, 2018, 3:39 PM)
I've come across an issue with Mesos 1.4.1 and Spark 2.2.1. We launch Spark tasks using the MesosClusterDispatcher in cluster mode. On a couple of occasions, we have noticed that when the Spark driver crashes (for various reasons, such as human error or network errors) and is then restarted, it sometimes issues hundreds of SUBSCRIBE requests per second to Mesos until the Mesos master node is overwhelmed and crashes. It then does the same to the next master node, over and over, until it has taken down all the master nodes. Usually the only fix is to manually stop the driver and restart it.

Here is a snippet of the Mesos master log, which just shows the repeated SUBSCRIBE command: https://gist.github.com/nemosupremo/28ef4acfd7ec5bdcccee9789c021a97f

Here is the output of the Spark framework, which also just repeats 'Transport endpoint is not connected' over and over: https://gist.github.com/nemosupremo/d098ef4def28ebf96c14d8f87aecd133
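
For anyone trying to confirm the same symptom on their own cluster, here is a rough Python sketch that counts SUBSCRIBE lines per second in a master log. It assumes glog-style line prefixes (for example `I0713 15:39:01.123456 ...`) and that the relevant lines contain the literal text SUBSCRIBE; the function names, the threshold of 100 and the log path are hypothetical, so adjust them to whatever your logs actually look like.

```python
import re
import sys
from collections import Counter

# Assumption: glog-style prefix, i.e. severity letter, MMDD, then HH:MM:SS.micros
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2})\.\d+")


def subscribe_rate(lines, needle="SUBSCRIBE"):
    """Count lines containing `needle`, keyed by (MMDD, HH:MM:SS)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = GLOG_PREFIX.match(line)
        if match:  # lines without the assumed prefix are skipped
            counts[(match.group(1), match.group(2))] += 1
    return counts


def busiest_seconds(counts, threshold=100):
    """Return ((MMDD, HH:MM:SS), count) pairs at or above threshold, in time order."""
    return sorted((key, n) for key, n in counts.items() if n >= threshold)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical path): python subscribe_rate.py /var/log/mesos/mesos-master.INFO
    with open(sys.argv[1], errors="replace") as log_file:
        for (day, second), n in busiest_seconds(subscribe_rate(log_file)):
            print(f"{day} {second}: {n} SUBSCRIBE lines")
```

Any second reported by this script had at least 100 matching lines, which is roughly the request rate described above.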

Thanks for any insights



Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

Susan X. Huynh (Mon, Jul 23, 2018, 12:13 PM)
Hi Nimi,


It turned out to be a bug in libmesos (the client library used to communicate with Mesos): "using a failoverTimeout of 0 with Mesos native scheduler client can result in infinite subscribe loop" (https://issues.apache.org/jira/browse/MESOS-8171). Upgrading to a version of libmesos that includes the fix resolves it.

Susan


--
Susan X. Huynh
Software engineer, Data Agility

Re: Spark on Mesos: Spark issuing hundreds of SUBSCRIBE requests / second and crashing Mesos

Nimi W
That does sound like it could be it - I checked our libmesos version and it is 1.4.1. I'll try upgrading libmesos.
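
In case it helps others check which libmesos their driver loads, here is a minimal sketch. It reads the `MESOS_NATIVE_JAVA_LIBRARY` environment variable that Spark uses to locate libmesos, resolves any symlink, and looks for a version number in the file name. That the version is embedded in the file name (as in `libmesos-1.4.1.so`) is an assumption about how the library was installed, and the function names are hypothetical; if the file name carries no version, the function returns None and you will need to check your package manager instead.

```python
import os
import re

# Assumption: the real file is named like libmesos-1.4.1.so (or .dylib on macOS)
VERSION_IN_NAME = re.compile(r"libmesos-(\d+(?:\.\d+)*)\.(?:so|dylib)$")


def libmesos_version_from_path(path):
    """Return the version embedded in the file name, or None if there is none."""
    match = VERSION_IN_NAME.search(os.path.basename(path))
    return match.group(1) if match else None


def detect_libmesos_version(env=os.environ):
    """Resolve MESOS_NATIVE_JAVA_LIBRARY (following symlinks) and read its version."""
    path = env.get("MESOS_NATIVE_JAVA_LIBRARY")
    if not path:
        return None
    return libmesos_version_from_path(os.path.realpath(path))


if __name__ == "__main__":
    # Prints None when the variable is unset or the name has no version in it
    print(detect_libmesos_version())
```

Run it in the same environment as the Spark driver, since that is where libmesos is loaded.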

Thanks.
