Spark 2.2.x - BroadcastHashJoin is not happening even after checkpointing
I am joining two datasets: one with a few hundred million records and another with just 72 records. Without any intervention, Spark tries to do a SortMergeJoin (with a shuffle exchange) and blows up with an OOM error. I expect it to do a map-side join (broadcast join) instead.
I have auto broadcast turned on and I am not repartitioning my dataset.
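For reference, this is a minimal sketch of how the auto broadcast setting can be checked from a session. The application name and the local master are assumptions made only so the snippet is self-contained; `spark.sql.autoBroadcastJoinThreshold` is the standard Spark SQL property, expressed in bytes, and a value of -1 disables auto broadcast.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastThresholdCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("broadcast-threshold-check")
      .master("local[*]")
      .getOrCreate()

    // Size limit in bytes below which Spark may broadcast one side of a join.
    // A value of -1 means auto broadcast is disabled.
    val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    println(s"spark.sql.autoBroadcastJoinThreshold = $threshold")

    spark.stop()
  }
}
```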
It works if I save the small dataset and read it back. It doesn't work if I checkpoint it!
I am attaching two screenshots. The first one is where I checkpoint the small dataset.
Above, it is reading an ExistingRDD from the checkpoint. It has only 72 records, and Spark still decided to do a shuffle join!
Here is what happens when I save it instead:
Now it does a broadcast join.
So my workaround is to save the small dataset and read it back.
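To make the two variants concrete, here is a rough sketch of what I am comparing. It is not my actual job: the datasets, the `key` column, and the `/tmp` paths are hypothetical stand-ins, and the real small dataset comes out of a long lineage rather than `spark.range`. The plans printed by `explain()` on this toy data may differ from what I see in my job.

```scala
import org.apache.spark.sql.SparkSession

object SmallSideJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to keep the sketch self-contained.
    val spark = SparkSession.builder()
      .appName("small-side-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the real datasets.
    val large = spark.range(0L, 1000000L).withColumn("key", $"id" % 72)
    val small = spark.range(0L, 72L).withColumnRenamed("id", "key")

    // Variant 1: checkpoint the small dataset, then join.
    // The checkpoint directory is a hypothetical path.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    val smallCheckpointed = small.checkpoint()
    large.join(smallCheckpointed, Seq("key")).explain()

    // Variant 2 (my workaround): save the small dataset, read it back, then join.
    // The output path is a hypothetical path.
    val path = "/tmp/small-dataset.parquet"
    small.write.mode("overwrite").parquet(path)
    val smallReloaded = spark.read.parquet(path)
    large.join(smallReloaded, Seq("key")).explain()

    spark.stop()
  }
}
```

Comparing the two physical plans side by side shows which join strategy Spark picks for each variant.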
Why didn't checkpointing work?
Why doesn't it work without checkpointing or saving? (I can't include the lineage here because it's too big and complicated.) Checkpointing does help truncate the previous lineage by executing it, but what happened after that was not what I expected.