Hi all,


אורן שמון
I have two Spark jobs: a pre-process job and a process job.
The process job needs to compute a result for each user in the data.
I want to avoid a shuffle (such as the one a groupBy triggers), so I am thinking of either saving the pre-process output as Parquet bucketed by user, or repartitioning by user and then saving the result.

Which is preferable, and why?
Thanks in advance,
Oren
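
[Editorial sketch] The idea behind bucketing by user can be illustrated without Spark. The snippet below is a minimal pure-Python sketch, not Spark code: the names `bucket_for`, `write_buckets` and `per_user_totals`, the `(user, amount)` record layout, and the bucket count of 4 are all assumptions made for the example. It shows that once every record of a user lands in the same bucket, each bucket can be aggregated on its own with no data exchanged between buckets, which is the property the process job wants to rely on.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def bucket_for(user: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a user to a bucket with a hash that is stable across runs."""
    return zlib.crc32(user.encode("utf-8")) % num_buckets


def write_buckets(records, num_buckets: int = NUM_BUCKETS):
    """Pre-process step: route every (user, amount) record to its user's bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for user, amount in records:
        buckets[bucket_for(user, num_buckets)].append((user, amount))
    return buckets


def per_user_totals(bucket):
    """Process step: aggregate one bucket in isolation; no other bucket is read."""
    totals = defaultdict(int)
    for user, amount in bucket:
        totals[user] += amount
    return dict(totals)


# Hypothetical sample data.
records = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]
buckets = write_buckets(records)

# Each bucket is processed independently; merging is a plain union because
# no user appears in more than one bucket.
results = {}
for bucket in buckets:
    results.update(per_user_totals(bucket))

print(results)
```

In Spark the second job only benefits if the layout information survives the write. To my understanding, `DataFrameWriter.bucketBy` records the bucketing in the table metadata (and has to be combined with `saveAsTable`), whereas repartitioning by user followed by a plain Parquet write does not record how the rows were distributed, so a later groupBy may still shuffle. This is worth verifying against the documentation for the Spark version in use.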

Re: Hi all,

Jean Georges Perrin
Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...
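
[Editorial sketch] The suggestion of grouping once and reusing the cached result can be shown in plain Python as a rough analogy. This is not Spark code: `group_by_user`, the `group_calls` counter and the sample records are hypothetical names and data introduced for illustration. The grouping stands in for the expensive shuffle, and `functools.lru_cache` stands in for keeping the grouped result in memory.

```python
from collections import defaultdict
from functools import lru_cache

# Hypothetical sample input; a tuple so it can serve as a cache key.
RECORDS = (("alice", 3), ("bob", 5), ("alice", 4), ("bob", 2))

group_calls = 0  # counts how often the expensive grouping actually runs


@lru_cache(maxsize=None)
def group_by_user(records):
    """Expensive step (stand-in for the shuffle): group amounts by user."""
    global group_calls
    group_calls += 1
    grouped = defaultdict(list)
    for user, amount in records:
        grouped[user].append(amount)
    # Freeze the lists into tuples so callers cannot alter the cached amounts.
    return {user: tuple(amounts) for user, amounts in grouped.items()}


# Two different per-user computations reuse the same cached grouping.
totals = {user: sum(amounts) for user, amounts in group_by_user(RECORDS).items()}
maxima = {user: max(amounts) for user, amounts in group_by_user(RECORDS).items()}

print(totals, maxima, group_calls)
```

The cached grouping lives only as long as the process that built it, so this pattern fits several computations inside one application better than separate jobs that run at different times.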


> On Oct 31, 2017, at 08:17, ⁨אורן שמון⁩ <⁨[hidden email]⁩> wrote:
>
> I have two Spark jobs: a pre-process job and a process job.
> The process job needs to compute a result for each user in the data.
> I want to avoid a shuffle (such as the one a groupBy triggers), so I am thinking of either saving the pre-process output as Parquet bucketed by user, or repartitioning by user and then saving the result.
>
> Which is preferable, and why?
> Thanks in advance,
> Oren




Re: Hi all,

אורן שמון
Hi Jean,
We prepare the data for all of our other jobs. We have many jobs that are scheduled at different times, but all of them need to read the same raw data.

On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin <[hidden email]> wrote:
Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and use it in your process, keeping everything in Spark and avoiding save/ingestion...

