Any way to make catalyst optimise away join


jelmer
I have two dataframes, let's call them A and B.

A is made up of [unique_id, field1]
B is made up of [unique_id, field2]

They have exactly the same number of rows, and every id in A is also present in B.

If I execute a join like A.join(B, Seq("unique_id")).select($"unique_id", $"field1"), Spark will do an expensive join even though it does not have to, because all the fields it needs are already in A. Is there some trick I can use so that Catalyst will optimise this join away?
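A minimal repro (a sketch; the data below is made up, only the shape matters):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Two toy frames with the same ids and one payload column each
    val A = Seq((1L, "a1"), (2L, "a2")).toDF("unique_id", "field1")
    val B = Seq((1L, "b1"), (2L, "b2")).toDF("unique_id", "field2")

    val q = A.join(B, Seq("unique_id")).select($"unique_id", $"field1")
    // The physical plan still contains the join, even though only A's
    // columns survive the select
    q.explain()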

Re: Any way to make catalyst optimise away join

Jerry Vinokurov
This seems like a suboptimal situation for a join. How can Spark know in advance that every id is present in both frames and that they have the same number of rows? I suppose you could just sort the two frames by id and concatenate them (see the sketch below), but I'm not sure what join optimization is available here.
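If you do try the sort-and-concatenate route, here is a hedged sketch of one way to do it, reusing the toy A and B from the first message and zipping the underlying RDDs (an illustration, not a drop-in solution):

    // Caveat: RDD.zip requires both sides to have the same number of
    // partitions AND the same number of elements per partition. Sorting
    // two equally sized frames by the same key does not guarantee that,
    // so you may need to repartition both sides identically first.
    val sortedA = A.sort("unique_id").rdd
    val sortedB = B.sort("unique_id").rdd

    val zipped = sortedA.zip(sortedB).map { case (rowA, rowB) =>
      (rowA.getLong(0), rowA.getString(1), rowB.getString(1))
    }

    zipped.take(2).foreach(println) // (1,a1,b1), (2,a2,b2)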
