I have used Hadoop and am now evaluating Spark. What I cannot figure out is whether there is an easy way to tell Spark: when we run out of memory, please spill the data to disk. We can of course always write better Spark jobs, but what I really want is for Spark to store large data sets on disk when they do not fit in memory. Is there a way to do this, or am I missing something?
When your objects are still too large to store efficiently despite tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).
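For the original question about spilling to disk specifically, the `_AND_DISK` storage levels do exactly that: partitions that don't fit in memory are written to disk rather than causing a failure or being recomputed. Below is a minimal sketch combining that with Kryo serialization; the application name and input path are placeholders, not part of any real job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Enable Kryo for smaller serialized sizes than default Java serialization.
val conf = new SparkConf()
  .setAppName("PersistExample")  // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// Placeholder input path for illustration.
val rdd = sc.textFile("hdfs:///data/input")

// MEMORY_AND_DISK_SER keeps partitions serialized in memory and
// spills any partitions that do not fit to local disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

rdd.count()  // the first action materializes and caches the data
```

With plain `MEMORY_ONLY` (the default for `cache()`), partitions that don't fit are simply not cached and get recomputed on demand; choosing a `_AND_DISK` level trades some I/O cost for avoiding that recomputation.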