Can someone please advise on the following Spark streaming / ContextCleaner / garbage collection issue we are facing? We suspect a bug is causing a memory leak.
We have a Spark 2.3 cluster running a streaming query. No matter how much memory we allocate to the executors, the JVM heap eventually grows to its limit and GC starts causing frequent timeouts, until the executor is marked "lost" or "dead". With GC logging enabled, we can see that it takes about 30-45 minutes to fill the heap; after that, full GCs become much more frequent. We have tried increasing executor memory, the periodic GC interval, and other relevant memory parameters, but we keep observing the same behavior.
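For context, these are the kinds of settings we experimented with (the values shown are illustrative, not our exact configuration):

```shell
# Example spark-submit invocation sketching the knobs we tuned.
# Values are placeholders; our real job name and resources differ.
spark-submit \
  --master yarn \
  --executor-memory 8g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.cleaner.periodicGC.interval=15min \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  our-streaming-app.jar
```

Raising `--executor-memory` only delays the point at which the heap fills; it does not change the overall growth pattern.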
We enabled ContextCleaner debug logging and observe only broadcast/accumulator-related cleaning messages. We never see RDDs being received for cleanup, i.e. no "Cleaning RDD ..." messages (ref: ContextCleaner.scala#L213). I have attached the context cleaner logs for reference as well.
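For reference, this is roughly how we enabled the cleaner debug output (a minimal log4j.properties fragment, assuming the default log4j 1.x setup that Spark 2.3 ships with):

```properties
# Enable DEBUG output for the ContextCleaner only,
# keeping the rest of Spark at the default level.
log4j.rootCategory=INFO, console
log4j.logger.org.apache.spark.ContextCleaner=DEBUG
```

With this in place we see "Cleaning broadcast ..." and accumulator cleanup lines in the logs, but never the RDD variant, which is why we suspect RDDs are not being registered for (or not reaching) cleanup at all.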