Regression in external shuffle service: Spark 2.3 vs Spark 2.2
Any input on the issue below would be welcome.
We are running with the external shuffle service on a Mesos cluster (1.5.1).
After upgrading our production workload to Spark 2.3, we started to see OOM
failures in the external shuffle services (one running on each node).
Has anybody experienced the same problem?
Any pointers to the relevant code would be helpful. (I know that work was done
on the external shuffle service in 2.3, but from reading the PRs I can't
pinpoint which change is causing those OOMs.)
Unfortunately, there is no test case for reproduction, and even with 2.3 the
OOM failures start after 2+ days of production load.