I was varying the storage level of RDD caching in a KMeans program
implemented with the MLlib library and got some confusing but interesting
results. The base code of the application is from a benchmark suite named
SparkBench <https://github.com/CODAIT/spark-bench>. I changed the storage
level of the data RDD passed to the KMeans train function, and
MEMORY_AND_DISK_SER performs considerably worse than DISK_ONLY.
MEMORY_AND_DISK performed best, as expected, but why the memory-serialized
level performs worse than the disk-serialized level is very confusing to
me. I am using one master node and four slave nodes, with each executor
given a 48g JVM heap, so the cached data should easily fit in memory.
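For reference, the change I made looks roughly like this. This is a
simplified sketch, not the exact SparkBench code; the input path argument
and variable names are my own, and I assume whitespace-separated numeric
features as SparkBench generates:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object KMeansStorageLevelTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansStorageLevelTest"))

    // Parse the input into feature vectors (format assumed: space-separated doubles).
    val data = sc.textFile(args(0))
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    // This is the line I varied across runs:
    // MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, ...
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // KMeans.train iterates over the cached RDD many times,
    // so the storage level directly affects per-iteration cost.
    val model = KMeans.train(data, 8, 10) // k = 8, maxIterations = 10
    println(s"Cost: ${model.computeCost(data)}")
    sc.stop()
  }
}
```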
If anyone has any ideas or suggestions about why this behavior is
happening, please let me know.