[Spark SQL]: Stability of large many-to-many joins

[Spark SQL]: Stability of large many-to-many joins

nathan grand
Hi,

I have two very large datasets, both with many repeated keys, which I
wish to join.

A simplified example:

dsA

A_1 |A_2
1 |A
2 |A
3 |A
4 |A
5 |A
1 |B
2 |B
3 |B
1 |C

dsB

B_1 |B_2
A |B
A |C
A |D
A |E
A |F
A |G
B |A
B |E
B |G
B |H
C |A
C |B


The join I want to do is:

dsA.join(dsB, dsA("A_2") === dsB("B_1"), "inner")

However, this ends up putting a lot of pressure on the tasks that handle
the frequently occurring keys - the join is either very, very slow to
complete or I run into memory issues.

I've played with grouping both sides by the join key prior to joining
(which would make the join one-to-one), but memory becomes an issue
again because the groups are very large. Roughly, the shape of what I
tried is sketched below.
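
A simplified sketch of that attempt - I'm assuming collect_list here,
which is not exactly my real code, but the shape is the same:

import org.apache.spark.sql.functions._

// One row per join key on each side, so the join itself is one-to-one.
val groupedA = dsA.groupBy("A_2").agg(collect_list("A_1").as("a_ids"))
val groupedB = dsB.groupBy("B_1").agg(collect_list("B_2").as("b_ids"))

val joined = groupedA.join(groupedB, groupedA("A_2") === groupedB("B_1"), "inner")
// The hot keys now produce enormous a_ids/b_ids arrays in single rows,
// which is where memory blows up instead.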

Does anyone have any good suggestions as to how to make large
many-to-many joins reliably complete in Spark?

Reliability is much more important to me than speed - this is for a
general-purpose tool, so I can't over-tune for specific data sizes or
shapes.

Thanks,

Nathan  

Re: [Spark SQL]: Stability of large many-to-many joins

Peyman Mohajerian
This is the classic skew-join issue. Two options: either add salting to
your join, or filter out the records with the frequent keys, join those
separately, and then union the results back. Rough sketches of both are
below.
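
Salting, roughly (numSalts is a factor you would tune - this is a
sketch, not tested against your data):

import org.apache.spark.sql.functions._

val numSalts = 16 // tune to the observed skew

// Spread each key of dsA over numSalts random sub-keys...
val saltedA = dsA.withColumn("salt", (rand() * numSalts).cast("int"))

// ...and replicate dsB across all salt values so every matching pair
// still meets in exactly one task.
val saltedB = dsB.withColumn("salt",
  explode(array((0 until numSalts).map(lit): _*)))

val joined = saltedA
  .join(saltedB,
    saltedA("A_2") === saltedB("B_1") && saltedA("salt") === saltedB("salt"),
    "inner")
  .drop(saltedA("salt"))
  .drop(saltedB("salt"))

And the split-and-union variant (the 100000 threshold is just a
placeholder - pick it from your key histogram):

import org.apache.spark.sql.functions._

// Find the frequently occurring join keys; this result is small, so it
// can be broadcast.
val hotKeys = dsA.groupBy("A_2").count()
  .filter(col("count") > 100000L)
  .select(col("A_2").as("hot_key"))

// Split each side into hot and non-hot rows.
val hotA  = dsA.join(broadcast(hotKeys), dsA("A_2") === col("hot_key"), "left_semi")
val restA = dsA.join(broadcast(hotKeys), dsA("A_2") === col("hot_key"), "left_anti")
val hotB  = dsB.join(broadcast(hotKeys), dsB("B_1") === col("hot_key"), "left_semi")
val restB = dsB.join(broadcast(hotKeys), dsB("B_1") === col("hot_key"), "left_anti")

// Join each population separately (the hot part is where salting or a
// broadcast can be applied), then union the results.
val result = hotA.join(hotB, hotA("A_2") === hotB("B_1"), "inner")
  .union(restA.join(restB, restA("A_2") === restB("B_1"), "inner"))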


Re: [Spark SQL]: Stability of large many-to-many joins

nathan grand
With a many-to-many join between two similarly sized large datasets,
does salting really help?
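
As I understand the technique, the non-salted side still has to be
replicated once per salt value for the join to stay correct - a sketch,
reusing the hypothetical numSalts factor from your example:

import org.apache.spark.sql.functions._

val numSalts = 16

// dsB has to carry every salt value so it meets each salted dsA row,
// so its shuffled volume grows by a factor of numSalts.
val replicatedB = dsB.withColumn("salt",
  explode(array((0 until numSalts).map(lit): _*)))

With both sides similarly large, that seems to trade skew for sheer
volume, which is why I'm asking.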
