Spark 2.2.0 GC Overhead Limit Exceeded and OOM errors in the executors
I am trying to do some image analytics type workload using Spark. The images are read in JPEG format and then are converted to the raw format in map functions and this causes the size of the partitions to grow by an order of 1. In addition to this, I am caching some of the data because my pipeline is iterative.
I get OOM errors and GC overhead limit exceeded errors and I fix them by increasing the heap size or number of partitions even though after doing that there is still high GC pressure.
I know that my partitions should be small enough such that it can fit in memory. But when I did the calculation using the size of cache partitions available in Spark UI I think the individual partitions are small enough given the heap size and storage fraction. I am interested in getting your input on what other things can cause OOM errors in executors. Is caching data can be a problem (SPARK-1777)?
In addition to the info there, if you're partitioning by some key where
you've got a lot of data skew, one of the task's memory requirements may be
larger than the RAM of a given executor, where the rest of the tasks may be
just fine. If you're partitioning by some key, you may want to see if some
key has way more data than the others.