Bucket vs repartition


אורן שמון
Hi all,
I have two Spark jobs: a pre-processing job and a processing job.
The processing job needs to compute a result for each user in the data.
I want to avoid a shuffle (such as the one caused by groupBy), so I am thinking of either saving the pre-processing output as Parquet bucketed by user, or repartitioning by user and then saving the result.
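
To make the two options concrete, here is a rough sketch of how I imagine the end of the pre-processing job in PySpark. The function names, the column name `user`, the table name, the output path and the bucket/partition counts are all placeholders, not my real code. As far as I understand, `bucketBy` only works together with `saveAsTable`, so the bucketed variant writes a table rather than a plain path.

```python
def save_bucketed(df, table_name, num_buckets=200):
    """Option 1: write the DataFrame as a Parquet table bucketed by user.

    `df` is assumed to be a pyspark.sql.DataFrame with a `user` column.
    """
    (df.write
       .bucketBy(num_buckets, "user")   # hash rows into a fixed number of buckets
       .sortBy("user")                  # sort rows by user within each bucket
       .mode("overwrite")
       .format("parquet")
       .saveAsTable(table_name))        # bucketing spec is stored with the table


def save_repartitioned(df, path, num_partitions=200):
    """Option 2: repartition by user in memory, then write plain Parquet files."""
    (df.repartition(num_partitions, "user")  # one shuffle here, in the pre-process job
       .write
       .mode("overwrite")
       .parquet(path))                       # plain files, no bucketing spec recorded
```

The processing job would then read the output with either `spark.table("preprocessed_by_user")` or `spark.read.parquet("/data/preprocessed")` (again, hypothetical names) and run its per-user aggregation on top of that.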

Which is preferred, and why?
Thanks in advance,
Oren