Persisting RDD: Low Percentage with a lot of memory available
This problem is very annoying and I'm tired of searching the web without finding any good advice to follow.
I have a complex job. It worked fine until I needed to save partial results (RDDs) to files.
So I cached the RDDs, called saveAsTextFile on each of them, and continued the workflow as usual.
The first problem I noticed was that the RDDs were not fully cached.
So I replaced the cache() call with persist(StorageLevel.MEMORY_AND_DISK_SER()), hoping this would persist 100% of each RDD. But it didn't.
This doesn't make any sense to me. With that storage level, aren't the
partitions that don't fit in memory supposed to be spilled to disk?
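For reference, the relevant part of the workflow looks roughly like this (a minimal sketch, not my real job; the input/output paths and the map step are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-sketch"))

    // Some intermediate result of the larger job (placeholder computation).
    val partial = sc.textFile("hdfs:///input/data").map(_.toUpperCase)

    // MEMORY_AND_DISK_SER is documented to keep serialized partitions in
    // memory and spill the ones that don't fit to local disk.
    partial.persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Save the partial result, then keep reusing the persisted RDD downstream.
    partial.saveAsTextFile("hdfs:///output/partial")
    println(partial.count())

    sc.stop()
  }
}
```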
Even insignificant RDDs of about ~5 MB were cached only at 82%.
The last one in the previous image, which had 6628 cached partitions, was distributed in the following way:
The executors' Storage Memory was far from full:
The only thing I noticed that was close to exhaustion was the "Memory" column in the Hadoop cluster metrics.
I don't know the relation between that "memory used" column and the
memory shown in the Spark UI (Storage Memory was almost empty).
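One way to cross-check the UI numbers programmatically is to ask the SparkContext directly (sketch; assumes `sc` is the job's running SparkContext):

```scala
// Print, for each persisted RDD, how many partitions are cached and how
// the bytes are split between memory and disk (same data the Storage tab shows).
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} '${info.name}': " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}
```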
In the end, the job accumulated a lot of stages (~100) recomputing the RDDs
that were not cached, and the cluster failed with an enigmatic and apparently
unrelated error:
16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970
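For the dropped-event warnings specifically, the listener bus queue size is configurable; in Spark releases of this era the property is spark.scheduler.listenerbus.eventqueue.size (renamed to spark.scheduler.listenerbus.eventqueue.capacity in later versions). A sketch of how it could be raised (the value 100000 is just an example, not a recommendation):

```
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.size=100000 \
  ...
```

This only treats the symptom, though; it doesn't explain why the RDDs aren't fully persisted in the first place.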