Spark 2.2.x - BroadcastHashJoin is not happening even after checkpointing


Nirav Patel
I am joining two datasets: one with a few hundred million records and another with just 72 records. Without any workaround, Spark tries to do a SortMergeJoin (shuffle exchange) and blows up with an OOM. I expect it to do a map join (broadcast join).
I have auto broadcast on and I am not repartitioning my dataset.
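
For anyone trying to reproduce this, here is a minimal Scala sketch of how one might confirm that auto broadcast is enabled and see which join the planner picks. The object name `InspectJoinPlan`, the helper `planOf`, the column name `key`, and the generated data are hypothetical stand-ins; the real datasets are not shown in this post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectJoinPlan {
  // Physical plan as text; look for BroadcastHashJoin or SortMergeJoin in it.
  def planOf(df: DataFrame): String = df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("inspect-join-plan")
      .getOrCreate()

    // Threshold in bytes; the default is 10 MB and -1 disables auto broadcast.
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    // Hypothetical stand-ins for the large dataset and the 72-record dataset.
    val large = spark.range(0L, 100000000L).toDF("key")
    val small = spark.range(0L, 72L).toDF("key")

    println(planOf(large.join(small, "key")))
    spark.stop()
  }
}
```

Comparing the printed plan for the checkpointed and the saved-and-reloaded versions of the small dataset would show the same difference as the attached screenshots.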

It works now if I save the small dataset and read it back. It doesn't work if I checkpoint it!

Attaching two screenshots. The first one is where I am checkpointing the small dataset.

Screen Shot 2018-11-07 at 4.04.04 PM.png

Above, it is reading an ExistingRDD from the checkpoint. It has only 72 records and Spark still decided to do a shuffle join!

Here is the plan when I save it:

Screen Shot 2018-11-07 at 4.03.53 PM.png

Now it does a broadcast join.

So my workaround is to save the small dataset and read it back.
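
The save-and-read-back workaround could look roughly like the sketch below. The Parquet format, the temporary output path, the column name `key`, and the names `SaveAndReload` and `roundTrip` are assumptions made for illustration; the post does not say which format or location was actually used.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object SaveAndReload {
  // Writes df to `path` as Parquet (the format is an assumption) and reads it
  // back, so the result is backed by files instead of the original lineage.
  def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("save-and-reload")
      .getOrCreate()

    // Hypothetical output location for the small dataset.
    val path = Files.createTempDirectory("small-dataset").toString

    // Hypothetical stand-ins for the large dataset and the 72-record dataset.
    val large = spark.range(0L, 100000000L).toDF("key")
    val small = roundTrip(spark, spark.range(0L, 72L).toDF("key"), path)

    large.join(small, "key").explain()
    spark.stop()
  }
}
```

A file-backed relation lets Spark estimate the dataset's size from the files on disk, which is presumably what allows the planner to choose a broadcast join here.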

Why didn't checkpointing work?

Why doesn't it work without checkpointing or saving? (I don't have that lineage here as it's too big and complicated.) Checkpointing does help to truncate the previous lineage by executing it, but what happened after that was not expected.
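
One possible explanation, offered as an assumption rather than something confirmed in this thread: a checkpointed Dataset is planned from an RDD, for which Spark may have no usable size estimate, so the planner cannot tell that it fits under the broadcast threshold. If that is the cause, an explicit `broadcast` hint is one way to request the broadcast join without the save-and-read-back step. In this sketch the names `BroadcastHintAfterCheckpoint` and `joinWithHint`, the column `key`, the temporary checkpoint directory, and the generated data are all hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintAfterCheckpoint {
  // Joins on `key`, marking the small side for broadcast via an explicit hint
  // instead of relying on the planner's size estimate.
  def joinWithHint(large: DataFrame, small: DataFrame): DataFrame =
    large.join(broadcast(small), "key")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("broadcast-hint-after-checkpoint")
      .getOrCreate()

    // checkpoint() needs a checkpoint directory; a temp dir stands in here.
    spark.sparkContext.setCheckpointDir(
      Files.createTempDirectory("checkpoints").toString)

    // Hypothetical stand-ins for the large dataset and the 72-record dataset.
    val large = spark.range(0L, 100000000L).toDF("key")
    val small = spark.range(0L, 72L).toDF("key").checkpoint() // eager by default

    joinWithHint(large, small).explain()
    spark.stop()
  }
}
```

The hint only changes how the join is planned; the small dataset is still checkpointed, so its lineage stays truncated.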




 



