Spark and Zookeeper HA failures


Mark Bidewell
I am trying to set up a Spark cluster with multi-master HA. I have 3 Spark nodes connecting to a single ZooKeeper node running on a separate server. When running in this configuration, over the course of 1-2 hours each node ends its ZooKeeper session because it is not receiving any messages from the server. The standby nodes reconnect, but when the leader hits this, it exits immediately.

The net result is that the cluster slowly dies as each master ends its session and terminates.

The Spark cluster is not in use, so I don't think this is a GC issue. Pings and similar network checks seem reliable. I have tried adjusting timeouts, but that doesn't help either.
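
One thing that may explain why timeout changes appear to have no effect: a ZooKeeper server does not accept a client's requested session timeout unconditionally. It clamps the value between minSessionTimeout and maxSessionTimeout, which default to 2 and 20 times tickTime. The sketch below only illustrates that arithmetic. The function name is hypothetical, and the 2000 ms tickTime is an assumption taken from the sample zoo.cfg, not from this cluster.

```python
def negotiated_session_timeout_ms(requested_ms, tick_time_ms=2000,
                                  min_ms=None, max_ms=None):
    """Return the session timeout a ZooKeeper server would grant.

    Hypothetical helper for illustration only. The defaults mirror the
    documented ZooKeeper behaviour: minSessionTimeout = 2 * tickTime and
    maxSessionTimeout = 20 * tickTime unless set explicitly in zoo.cfg.
    """
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    # The server clamps the client's request into [lower, upper].
    return max(lower, min(requested_ms, upper))


if __name__ == "__main__":
    # Assumed values: a client asking for 60 s against tickTime=2000 ms.
    print(negotiated_session_timeout_ms(60000))  # prints 40000
```

Under these assumptions, a client-side timeout above 40 seconds is silently reduced unless tickTime or maxSessionTimeout is also raised on the ZooKeeper server.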

Any ideas how to resolve this?