Monitoring spark dis-associated workers

Allen Chang
We're running into an issue where the master periodically loses connectivity with workers in our Spark cluster. It seems to happen mostly when the cluster is under heavy load, but we haven't pinned down the exact trigger. I've seen one or two other messages on this list about the same issue, but no one seems to have identified the underlying bug.

So, to work around the issue, we'd like to programmatically monitor the number of workers connected to the master and restart the cluster when the master loses track of some of them. Any ideas on how to write such a health check?
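
To make the question concrete, here is a rough sketch of the kind of check we have in mind. It rests on assumptions we have not confirmed for every Spark version: that the standalone master's web UI serves a JSON status page at `/json`, and that the page contains a `workers` list whose entries carry a `state` field equal to `ALIVE` for connected workers. The function names, the master URL, and the expected worker count are placeholders of our own, not anything from Spark.

```python
import json
import sys
import urllib.error
import urllib.request


def fetch_master_status(master_ui_url, timeout=10):
    """Fetch and parse the master's JSON status page.

    Assumption: the standalone master web UI serves JSON at <ui-url>/json.
    """
    url = master_ui_url.rstrip("/") + "/json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_alive_workers(status):
    """Count workers the master reports as connected.

    Assumption: each entry in status["workers"] has a "state" field,
    with "ALIVE" meaning the worker is currently registered.
    """
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def cluster_is_healthy(status, expected_workers):
    """Healthy means the master sees at least the expected number of workers."""
    return count_alive_workers(status) >= expected_workers


if __name__ == "__main__":
    # Hypothetical usage: python check_workers.py http://master-host:8080 16
    master_ui_url, expected = sys.argv[1], int(sys.argv[2])
    try:
        status = fetch_master_status(master_ui_url)
    except (urllib.error.URLError, ValueError) as exc:
        print("could not read master status: %s" % exc)
        sys.exit(2)  # master unreachable or response was not valid JSON
    alive = count_alive_workers(status)
    print("alive workers: %d (expected %d)" % (alive, expected))
    sys.exit(0 if cluster_is_healthy(status, expected) else 1)
```

The idea would be to run this from cron or a monitoring agent and trigger the cluster restart on a non-zero exit code, probably only after several consecutive failures so that a brief blip under load does not cause a restart.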

Thanks,
Allen