Best practices for larger than memory data

Hi,

I am trying to run k-means (not the MLlib version) on 8 machines (8 cores, 60 GB RAM each) and am running into some issues; hopefully someone will have some advice.

Basically, the input data (250 GB in CSV format) won't fit in memory (even with Kryo serialization), so I am wondering what the most efficient way of running the job is. I believe (please correct me if I am wrong) that the best approach is to use the MEMORY_AND_DISK_SER storage level so that serialized partitions spill to disk. Would the following settings/parameters be appropriate for this job? Is there anything else I can do to improve performance?

System.setProperty("spark.executor.memory", "55g")
System.setProperty("spark.storage.memoryFraction", "0.4")
System.setProperty("spark.default.parallelism", "5000")

import org.apache.spark.storage.StorageLevel

val data = lines.map(parseVector _).persist(StorageLevel.MEMORY_AND_DISK_SER)
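
For context, here is a rough, self-contained sketch of how I am setting things up (the parseVector helper, the input path, and the app name below are simplified placeholders for my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object KMeansJob {
  def main(args: Array[String]): Unit = {
    // Same settings as the System.setProperty calls above, expressed via SparkConf.
    val conf = new SparkConf()
      .setAppName("kmeans-250gb")
      .set("spark.executor.memory", "55g")
      .set("spark.storage.memoryFraction", "0.4")
      .set("spark.default.parallelism", "5000")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Placeholder path; the real job reads ~250 GB of CSV.
    val lines = sc.textFile("hdfs:///data/points.csv")

    // Parse each CSV line into a dense array of doubles.
    def parseVector(line: String): Array[Double] =
      line.split(',').map(_.toDouble)

    // Spill serialized partitions to disk when they don't fit in memory.
    val data = lines.map(parseVector).persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(s"Parsed ${data.count()} points")
    sc.stop()
  }
}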

Does anyone have any thoughts?

Thanks!