Spark SQL group by less performant



lsn24
Hello,

I have a requirement where I need to get the total number of rows (totalRows) and the number of failed rows (failedRows) for each group.

The code looks like this:

myDataset.createOrReplaceTempView("temp_view");

Dataset<Row> countDataset = sparkSession.sql(
    "SELECT column1, column2, column3, column4, column5, column6, column7, column8, "
    + "count(*) AS totalRows, "
    + "sum(CASE WHEN column8 IS NULL THEN 1 ELSE 0 END) AS failedRows "
    + "FROM temp_view "
    + "GROUP BY column1, column2, column3, column4, column5, column6, column7, column8");
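A quick sanity check is to print the physical plan of the query; for count and sum, Spark SQL should plan a partial (map-side) aggregation before the shuffle:

// Prints the physical plan. For this query it should show a partial
// HashAggregate before the Exchange (shuffle) and a final HashAggregate
// after it, i.e. rows are pre-aggregated before being shuffled.
countDataset.explain();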


Up to around 50 million records, the query performance was OK. Beyond that it gives up, mostly resulting in an out-of-memory exception.

I have read the documentation and blog posts; most of them give examples using RDD.reduceByKey. But here I have a Dataset and Spark SQL.
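For reference, the same aggregation can be written directly with the Dataset API, and Spark SQL already performs the map-side partial aggregation that reduceByKey provides on RDDs, so an explicit reduceByKey is not needed. A minimal Java sketch, assuming the column names from the query above:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Dataset-API equivalent of the SQL above. count/sum are partially
// aggregated on the map side before the shuffle, like a combiner.
Dataset<Row> counts = myDataset
    .groupBy("column1", "column2", "column3", "column4",
             "column5", "column6", "column7", "column8")
    .agg(count(lit(1)).alias("totalRows"),
         sum(when(col("column8").isNull(), 1).otherwise(0)).alias("failedRows"));

Both forms compile to essentially the same physical plan, so the SQL formulation is not by itself slower than a hand-written reduceByKey.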

What am I missing here?

Any help will be appreciated.

Thanks!








Re: Spark SQL group by less performant

15313776907
I think you can add executor memory.
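A sketch of what that could look like; the values below are illustrative assumptions, not tuned settings. Raising spark.sql.shuffle.partitions (default 200) so each reduce task holds less shuffled data at once can matter as much as adding executor memory:

import org.apache.spark.sql.SparkSession;

// Illustrative values only; size them to the cluster and data volume.
SparkSession sparkSession = SparkSession.builder()
    .appName("GroupByCounts")                        // hypothetical app name
    .config("spark.executor.memory", "8g")           // more heap per executor
    .config("spark.sql.shuffle.partitions", "400")   // default is 200
    .getOrCreate();

The same settings can be passed at submit time, e.g. spark-submit --executor-memory 8g --conf spark.sql.shuffle.partitions=400.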

15313776907
Email: 15313776907@...

Signature customized by NetEase Mail Master


Re: Spark SQL group by less performant

geoHeil

On Tue, 11 Dec 2018 at 02:09, 15313776907 <[hidden email]> wrote:
I think you can add executor memory.
