We've recently started testing Spark on Kubernetes and have found some odd performance decreases. In particular, it's almost an order of magnitude slower pulling data from Kafka than it is in our Mesos cluster.
We've tested a few set-ups:
Baseline: Spark 2.3.0 on Mesos with host networking (~5 million records processed per batch, 12-15s per batch)
Spark 2.3.0 on k8s 1.10 EKS with the AWS CNI plugin
Spark 2.3.0 on k8s 1.10 EKS with the Calico CNI plugin
Spark 2.4.0 on k8s 1.10.x with the Cilium CNI plugin (~5 million records per batch, 80-100s per batch)
All of them show the same performance decrease, though we don't have good numbers for the EKS clusters (those tests were run by a different group). On the 2.4.0/Cilium cluster that I run, I've seen roughly an 8x performance decrease compared to the equivalent job in our Mesos cluster.
Our production Mesos job runs with 20 executors, 2 cores each, and 6g of memory. I set up the Kubernetes equivalent in the same fashion and removed CPU limits so that executors could burst if they needed more CPU. None of that got it to the performance level of our Mesos setup.
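For reference, the Kubernetes submission looked roughly like this. This is a sketch, not the exact command: the master URL, image, class, and jar path are placeholders, and I'm assuming Spark's standard k8s properties here (leaving `spark.kubernetes.executor.limit.cores` unset is what removes the CPU limit on the executor pods):

```shell
# Sketch of the spark-submit used for the k8s tests.
# Master URL, image, class name, and jar path are placeholders.
# spark.kubernetes.executor.limit.cores is deliberately NOT set, so no CPU
# limit is applied to executor pods and they can burst above their request.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=20 \
  --conf spark.executor.cores=2 \
  --conf spark.executor.memory=6g \
  --conf spark.kubernetes.container.image=registry.example.com/spark:2.4.0 \
  --class com.example.StreamingJob \
  local:///opt/app/streaming-job.jar
```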
The Mesos cluster does not use any CNI, so it's all host-based networking, but I wouldn't expect an overlay network to slow down our jobs by an order of magnitude.
I was able to compare by running a simple consumer that just reads from Kafka and measures how fast it can go, and found that running on the overlay was about 8% slower than running on the host network stack. So an overlay penalty alone doesn't account for an order-of-magnitude slowdown.
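That kind of raw-consume comparison can also be reproduced with the perf tool that ships with Kafka, run once from a pod on the overlay network and once from a host-network pod, comparing the reported MB/sec. The broker address and topic below are placeholders:

```shell
# Kafka's bundled consumer benchmark: consume N messages as fast as possible
# and report throughput. Run the identical command from an overlay-network pod
# and a hostNetwork pod, then compare the MB/sec columns.
# Broker address and topic name are placeholders.
kafka-consumer-perf-test.sh \
  --broker-list kafka.example.com:9092 \
  --topic test-topic \
  --messages 5000000 \
  --threads 1
```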
Does anyone else have experience running production streaming workloads on Kubernetes, and have you run into performance issues? Are there any settings I could be missing?
I know most Kubernetes clusters are set up wildly differently, but any tips or insights on where to look would be greatly appreciated.