spark sql data skew


spark sql data skew

崔苗
Hi,
When I try to count(distinct userId) grouped by company, I run into data skew and the task takes too long. How can I count distinct values by key on skewed data in Spark SQL?

thanks for any reply

jgp

Re: spark sql data skew

jgp
Just thinking out loud… repartition by key? Create a composite key based on company and userId?

How big is your dataset?
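For what it's worth, the composite-key idea can be sketched in Spark SQL roughly like this, assuming a `users(company, userId)` table as described in the thread. Deduplicating on the pair (company, userId) first means the expensive shuffle is keyed by the pair, so a single hot company's rows spread across many tasks instead of one:

```sql
-- Stage 1: dedupe on the composite key (company, userId); the shuffle key
-- is the pair, so one hot company no longer lands in a single task.
-- Stage 2: a plain COUNT per company over the already-deduped rows (cheap).
SELECT company, COUNT(*) AS distinct_users
FROM (
  SELECT DISTINCT company, userId
  FROM users
) deduped
GROUP BY company;
```

In the DataFrame API the same shape would be `df.select("company", "userId").distinct().groupBy("company").count()`.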

Re: Re: spark sql data skew

崔苗
We have 30 GB of user data. After creating a composite key based on company and userId, how do we get the distinct user count?

Re: Re: spark sql data skew

Shawn Wan
Try divide and conquer: create a column x holding the first character of userId, and group by company + x. If the groups are still too large, use the first two characters.
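A rough Spark SQL version of that step, again assuming the `users(company, userId)` table from earlier in the thread. The bucket column x splits each company's rows across one group per leading character, so no single reducer has to hold a whole hot company:

```sql
-- Per-bucket distinct counts: group by (company, first character of userId).
-- Widen the prefix to substring(userId, 1, 2) if buckets are still too big.
SELECT company,
       substring(userId, 1, 1) AS x,
       COUNT(DISTINCT userId) AS cnt
FROM users
GROUP BY company, substring(userId, 1, 1);
```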

Re: Re: Re: spark sql data skew

崔苗
But how do I get count(distinct userId) grouped by company from the count(distinct userId) grouped by company + x?
count(userId) is not the same as count(distinct userId).
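The per-bucket counts can in fact be rolled up exactly, because bucketing on a prefix of userId partitions the IDs: a given userId falls into exactly one bucket, so the distinct sets are disjoint across buckets and their counts simply add. A sketch, with the same assumed `users(company, userId)` table:

```sql
-- Inner query: distinct count within each (company, bucket) pair.
-- Outer query: buckets are disjoint in userId (every userId has exactly one
-- first character), so summing the per-bucket distinct counts gives the
-- exact count(distinct userId) per company.
SELECT company, SUM(cnt) AS distinct_users
FROM (
  SELECT company,
         substring(userId, 1, 1) AS x,
         COUNT(DISTINCT userId) AS cnt
  FROM users
  GROUP BY company, substring(userId, 1, 1)
) per_bucket
GROUP BY company;
```

This rollup would be wrong for buckets that can share a userId; the prefix bucketing is what makes the addition valid. If an approximate answer is acceptable, Spark's built-in `approx_count_distinct` sidesteps the skew in a single pass.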

Re: Re: Re: spark sql data skew

Gourav Sengupta
https://docs.databricks.com/spark/latest/spark-sql/skew-join.html

The page above might help, in case the skew is coming from a join.
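For reference, the skew hint that page documents looks roughly like the following. This is Databricks Runtime-specific syntax, and `orders`, `customers`, and the join columns are placeholder names, so verify the exact forms against the linked page:

```sql
-- Skew hint on the relation whose join key is skewed; the runtime uses it
-- to split the hot key's partitions before performing the join.
SELECT /*+ SKEW('orders') */ o.orderId, c.name
FROM orders o
JOIN customers c
  ON o.customerId = c.customerId;
```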
