Spark 2.2.1 Dataframes multiple joins bug?

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Spark 2.2.1 Dataframes multiple joins bug?

Rico B.

Hi all!

Is it possible that Spark creates under certain circumstances duplicate rows when doing multiple joins?

What I did:

buse.count

res0: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").count

res1: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), $"bdef._c9"===$"crnb._c4").count

res2: Long = 20554365

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), $"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), $"crnb._c1"===$"wreg._c5").count

res3: Long = 21633023

For explanation: buse and crnb are 1:1 relationship tables.

In the last join I expected again 20554365 but suddenly duplicate rows exist. "wreg._c5" is a unique key, so it should not create more records:

wreg.groupBy($"_c5").agg(count($"_c2") as "cnt").filter($"cnt">1).show
+---+---+
|_c5|cnt|
+---+---+
+---+---+

When doing a distinct on the 4-way join I get the expected number of records:

buse.alias("buse").join(bdef.alias("bdef"), $"buse._c4"===$"bdef._c4").join(crnb.alias("crnb"), $"bdef._c9"===$"crnb._c4").join(wreg.alias("wreg"), $"crnb._c1"===$"wreg._c5").distinct.count
res10: Long = 20554365

This (in my opinion) means, that Spark is creating duplicte rows, although it shouldn't. Or do I miss something?


Best, Rico.