I am using Spark 2.1 with Spark Streaming in my data pipeline. The batch
interval is 3 minutes; we persist a couple of RDDs while processing each
batch and, after processing, rely on Spark's ContextCleaner to clean up
RDDs that are no longer in scope.
We have therefore set "spark.cleaner.periodicGC.interval" = "15s" and
"spark.network.timeout" = "20s".
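For reference, these are the two settings in spark-defaults.conf form (values exactly as we run them; both keys are standard Spark configuration properties):

```
spark.cleaner.periodicGC.interval   15s
spark.network.timeout               20s
```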
Sometimes, when GC is triggered and the cleaner tries to remove all the
out-of-scope RDDs, a Future timeout occurs (file attached:
FuturesTimeout.txt) with the message "Failed to remove RDD 14254".
When I then check that particular RDD under the Storage tab, I can see it
is still there (verified by RDD id 14254).
That would be fine if it were queued for cleanup in a subsequent GC cycle,
but that does not happen: the RDD remains visible under the Storage tab.
The same thing happened for a couple of other RDDs.
I dug into this a bit, and it appears that once ContextCleaner has sent a
request to clean an RDD, it does not re-queue that RDD for cleaning if an
error occurs during the cleanup.
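To make the failure mode concrete, here is a small toy model in Python. This is not Spark's actual ContextCleaner code; the function names, the queue, and the `retry_on_failure` flag are my own invention. It only illustrates the behavior I am describing: a cleanup loop that attempts each removal exactly once and drops the item on failure instead of re-queueing it.

```python
from queue import Queue, Empty


def run_cleaner(to_clean, remove_rdd, retry_on_failure=False):
    """Drain the queue, attempting to remove each RDD exactly once.

    With retry_on_failure=False (mirroring the behavior reported above),
    an RDD whose removal fails is silently lost and is never cleaned.
    With retry_on_failure=True, it is put back for the next GC cycle.
    """
    failed = []
    while True:
        try:
            rdd_id = to_clean.get_nowait()
        except Empty:
            break
        try:
            remove_rdd(rdd_id)
        except TimeoutError:
            if retry_on_failure:
                failed.append(rdd_id)  # re-queue for the next cycle
            # otherwise the RDD is dropped and stays in storage forever
    for rdd_id in failed:
        to_clean.put(rdd_id)


# Demo: removal of RDD 14254 times out on its first attempt only.
calls = {"count": 0}


def flaky_remove(rdd_id):
    if rdd_id == 14254 and calls["count"] == 0:
        calls["count"] += 1
        raise TimeoutError("Failed to remove RDD 14254")


q = Queue()
for rdd_id in (14253, 14254, 14255):
    q.put(rdd_id)

run_cleaner(q, flaky_remove)  # 14254 is lost: nothing left to retry
print(q.qsize())              # -> 0
```

With `retry_on_failure=True`, RDD 14254 would instead be put back on the queue and removed successfully on the next cycle, which is the behavior I would have expected from the cleaner.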
This looks like a bug to me. Could you please look into it?