join with just 1 record causes all data to go to a single node

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

join with just 1 record causes all data to go to a single node

Marcelo Valle
Hi, 

I am using spark on EMR 5.28.0.

We were having a problem in production where, after a join between 2 dataframes, in some situations all data was being moved to a single node, and then the cluster was failing after retrying many times. 

Our join is something like that:

```
df1.join(df2,  
df1("field1") <=> df2("field1")
&& df1("field2") <=> df2("field2"))
```

After some harvesting, we were able to isolate the corner case - this was only happening when all join fields were NULL. Notice the `<=>` operator instead of `===`.

Would someone be able to explain this behavior? It looks like a bug to me, but I could be missing something.

Thanks,
Marcelo.

This email is confidential [and may be protected by legal privilege]. If you are not the intended recipient, please do not copy or disclose its content but contact the sender immediately upon receipt.

KTech Services Ltd is registered in England as company number 10704940.

Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE, United Kingdom