I have a long-running Spark job that has started running out of memory (strictly speaking, garbage collection pauses grow so long and performance degrades so badly that we restart the application before an actual OOM occurs).
This is our proprietary production application, so I cannot post the actual code, and I have not been able to reproduce the issue in a sample application. I'm hoping for help debugging it.
I'll start the job in the morning, and by the next morning the application has to be restarted. Our monitoring shows CPU usage climbing and then sitting at 100% until we restart.
I have logging that reports the number of block IDs stored by the block manager. During normal operation that number fluctuates between 0 and 1000, always coming back down after usage stops. At some point, regardless of usage, the number stops dropping. I can see that all of the BlockIds in the blockInfoManager are of type BroadcastBlockId, and that most of the blockInfo classTags are Array[Byte].
We use Akka to construct RDDs and then have actors on pinned dispatchers run collect operations, so RDDs are constructed and passed around on various threads, then realized on different ones.
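To make that pattern concrete, here is a minimal sketch of what I mean (actor and method names are made up for illustration; our real code is more involved):

```scala
import akka.actor.{Actor, ActorSystem, Props}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: one thread builds an RDD, then an actor on a
// pinned dispatcher realizes it with collect().
class CollectorActor extends Actor {
  def receive = {
    case rdd: RDD[Int] =>
      // collect() runs on this actor's pinned dispatcher thread,
      // not on the thread that constructed the RDD.
      val result = rdd.collect()
      sender() ! result.length
  }
}

object Pattern {
  def run(sc: SparkContext, system: ActorSystem): Unit = {
    val collector = system.actorOf(
      Props[CollectorActor].withDispatcher("pinned-dispatcher"))
    val rdd = sc.parallelize(1 to 1000) // constructed on the calling thread
    collector ! rdd                      // realized later on the actor's thread
  }
}
```

So the RDD references end up held across several actor mailboxes and threads between construction and the collect.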
We do not use any broadcasts in our own code, so they must all come from Spark internals. I have a few heap dumps I can look through, but all I can see is that most of my memory is taken up by the MemoryStore.
Any advice on finding out what is causing Spark to hold onto these broadcast blocks?