[pyspark 2.4+] BucketBy SortBy doesn't retain sort order

[pyspark 2.4+] BucketBy SortBy doesn't retain sort order

rishishah.star
Hi All,

I have two large tables (~1TB) and saved both of them with the command below. When I then join the two tables on join_column, Spark still does a shuffle and sort before the join. Could someone please help?

df.repartition(2000).write.bucketBy(1, join_column).sortBy(join_column).saveAsTable(tablename)
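
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: with repartition(2000) and a single bucket, each of the 2000 write tasks produces its own file for that bucket. Each file is sorted on its own, but the bucket as a whole is not, so the reader cannot rely on the sort order. The pure-Python sketch below only simulates that effect; it does not call Spark, and write_bucket_files and read_bucket are hypothetical helper names.

```python
def write_bucket_files(rows, num_tasks):
    """Simulate a bucketed write: rows are dealt round-robin to tasks,
    and each task sorts its own rows and writes one file for the bucket."""
    tasks = [[] for _ in range(num_tasks)]
    for i, row in enumerate(rows):
        tasks[i % num_tasks].append(row)
    return [sorted(task_rows) for task_rows in tasks]


def read_bucket(files):
    """Simulate a bucket scan: the files are read one after another."""
    return [row for f in files for row in f]


def is_sorted(values):
    """True if values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


keys = [5, 1, 4, 2, 3, 0]  # illustrative join keys

# Two tasks writing into one bucket: two files, each sorted locally.
many_files = write_bucket_files(keys, num_tasks=2)
# One task writing into one bucket: a single sorted file.
one_file = write_bucket_files(keys, num_tasks=1)

print(all(is_sorted(f) for f in many_files))  # every file is sorted
print(is_sorted(read_bucket(many_files)))     # the bucket as a whole is not
print(is_sorted(read_bucket(one_file)))       # a single-file bucket is
```

If this is indeed the cause, writing exactly one file per bucket may let the sort order be used, for example by repartitioning on the join column into as many partitions as there are buckets before the write. That is untested here.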

--
Regards,

Rishi Shah

Re: [pyspark 2.4+] BucketBy SortBy doesn't retain sort order

rishishah.star
Hi All,

Just checking in to see if anyone has any advice on this.

Thanks,
Rishi

On Mon, Mar 2, 2020 at 9:21 PM Rishi Shah <[hidden email]> wrote:
Hi All,

I have two large tables (~1TB) and saved both of them with the command below. When I then join the two tables on join_column, Spark still does a shuffle and sort before the join. Could someone please help?

df.repartition(2000).write.bucketBy(1, join_column).sortBy(join_column).saveAsTable(tablename)

--
Regards,

Rishi Shah


--
Regards,

Rishi Shah