I'm experimenting with Spark in both a distributed environment and as a multi-threaded local application.
When I set the Spark master to "local" and try to read a ~20 GB text file from the local file system into an RDD and run computations on it, I don't get an out-of-memory error but rather a "Too many open files" error. Why does this happen? How aggressively does Spark partition the data into intermediate files?
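In case it's relevant, this is how I've been inspecting the per-process file-descriptor limit that the error presumably refers to (the 65536 value in the comment is just an example, not what I've actually set):

```shell
# Inspect the per-process open-file soft limit; "Too many open files"
# means the process exceeded this (the default is often 1024).
ulimit -n

# Inspect the hard limit, i.e. the ceiling the soft limit can be raised to.
ulimit -Hn

# The soft limit can be raised (up to the hard limit) in the shell that
# launches the Spark driver, e.g.:
#   ulimit -n 65536
```

Raising the limit makes the error go away for a while, but I'd still like to understand why Spark opens so many files in the first place.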
I have also tried splitting the text file into roughly 100,000 smaller files and processing them sequentially in batches of 10,000. Even then, Spark seems to bottleneck on reading every file in the batch into the RDD before starting the computation, and it struggles even to read 10,000 files at once. I would have thought Spark could overlap I/O with computation, but does Spark do all of the I/O up front?
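For concreteness, the split-and-batch preprocessing looks roughly like this (file names and sizes are illustrative, scaled down here; in the real run each batch of paths is joined with commas and passed to a single sc.textFile call, since textFile accepts a comma-separated list of paths):

```shell
# Illustrative sketch: split one large input into many pieces,
# then hand the pieces to Spark in fixed-size batches.
mkdir -p parts
printf 'line %s\n' $(seq 1 1000) > big_input.txt

# Split into 100-line chunks named parts/chunk_aa, parts/chunk_ab, ...
split -l 100 big_input.txt parts/chunk_

# Each batch of N file names is then joined with commas and passed to
# one sc.textFile("parts/chunk_aa,parts/chunk_ab,...") call.
ls parts | head -n 10
```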
Or is Spark simply not built for local applications outside of testing?