I am trying to run kmeans (not mllib verison) on 8 machines (8 cores, 60gb ram each) and having some issues, hopefully someone will have some advice.
Basically, the input data (250gb in CSV format) won't fit in memory (even using Kyro serialization), so I was wondering what the most efficient way of running the job is. I believe (please correct me if I am wrong) that the best thing to do would be to use the MEMORY_AND_DISK_SER option to spill serialized partitions to disk. Would the following setting/parameters be appropriate for this job? Is there anything else I can do to improve performance?