We're running into an issue where the master periodically loses connectivity with workers in our Spark cluster. It seems to happen mostly when the cluster is under heavy load, but we haven't pinned down the exact trigger. I've seen one or two other messages on this list about the same problem, but no one seems to have identified the underlying bug.
As a workaround, we'd like to monitor the number of workers connected to the master and restart the cluster when the master loses track of some of them. Any ideas on how to write such a health check programmatically?
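
For reference, here is a rough sketch of the kind of check we have in mind. It assumes the standalone master's web UI serves a JSON status page at `/json` that includes a `workers` list whose entries carry a `state` field (`ALIVE` for connected workers); we haven't verified that this holds for every Spark version. The host name, port, and expected worker count are placeholders, and the function names are our own.

```python
import json
from urllib.request import urlopen

# Assumption: the standalone master web UI exposes cluster status as JSON
# at /json, with a "workers" list whose entries have a "state" field.
MASTER_STATUS_URL = "http://spark-master:8080/json"  # hypothetical host and port
EXPECTED_WORKERS = 4  # hypothetical cluster size


def fetch_master_status(url=MASTER_STATUS_URL, timeout=10):
    """Download and parse the master's JSON status page."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_alive_workers(status):
    """Count the workers that the master reports as ALIVE."""
    return sum(1 for w in status.get("workers", []) if w.get("state") == "ALIVE")


def cluster_is_healthy(status, expected=EXPECTED_WORKERS):
    """Return True if at least `expected` workers are ALIVE."""
    return count_alive_workers(status) >= expected
```

A cron job or small daemon could call `cluster_is_healthy(fetch_master_status())` every minute or so and kick off the restart procedure when it returns False or when the fetch itself fails. Requiring a few consecutive failures before restarting would probably avoid reacting to a single slow response while the cluster is under load.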