Performance Issue


Performance Issue

Tzahi File
Hello, 

I have a performance issue running a SQL query on Spark.

The query joins one Parquet table partitioned by date (each partition is about 200 GB) with a simple table of about 100 records. The Spark cluster uses m5.2xlarge instances (8 cores each). I'm using the Qubole interface to run the SQL query.

After searching for ways to improve the query, I added the following settings to the configuration:
spark.sql.shuffle.partitions=1000
spark.dynamicAllocation.maxExecutors=200

There wasn't any significant improvement. I'm looking for any ideas to reduce the running time.


Thanks! 
Tzahi 

Re: Performance Issue

Jiaan Geng
What is your performance issue?

Re: Performance Issue

Gourav Sengupta
Hi,

Can you please let us know the Spark version, the query, whether the data is in Parquet format or not, and where it is stored?

Regards,
Gourav Sengupta


Re: Performance Issue

Tzahi File
Hi Gourav, 

My version of Spark is 2.1. 

The data is stored in an S3 directory in Parquet format.

Here is an example of the query I would like to run (the raw_e table is stored as Parquet files, and event_day is the partition field):

SELECT *
FROM (SELECT *
      FROM parquet_files.raw_e
      WHERE event_day >= '2018-11-28' AND event_day <= '2018-12-28') AS re
JOIN csv_file AS g
  ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
LEFT JOIN campaigns AS c
  ON c.campaign_id = re.campaign_id
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21

Looking forward to any insights.


Thanks.

Re: Performance Issue

Gourav Sengupta
Hi Tzahi,

By using GROUP BY without any aggregate columns, are you just trying to get the DISTINCT of the columns?

It may also help (I do not know whether the SQL optimiser takes care of this automatically) to run the LEFT JOIN against a smaller data set by doing the device_id join first, as a subquery or a separate query, and to ORDER BY campaign_id when writing out the result of the JOIN between csv_file and raw_e.
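A rough sketch of that restructuring, reusing the table and column names from your example query (untested, and assuming your real column list replaces the asterisks, as in your actual query):

-- Join the small csv_file on device_id/advertiser_id first,
-- then LEFT JOIN campaigns against the already-reduced result:
SELECT *
FROM (SELECT re.*
      FROM parquet_files.raw_e AS re
      JOIN csv_file AS g
        ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id
      WHERE re.event_day >= '2018-11-28' AND re.event_day <= '2018-12-28') AS j
LEFT JOIN campaigns AS c
  ON c.campaign_id = j.campaign_id
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21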

Thanks and Regards,
Gourav Sengupta


Re: Performance Issue

Tzahi File
Hi Gourav,

I just wanted to attach an example of my query, so I replaced my field names with "SELECT *"; I do have aggregate fields in my actual query.

What about improving performance with Spark features, like broadcasting or something similar?

Thanks,
Tzahi

Re: Performance Issue

Gourav Sengupta
Hi Tzahi,

I think Spark broadcasts automatically in recent versions, but you might have to check for your version. Did you try filtering first and then doing the LEFT JOIN?
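For example (a sketch, not verified against your setup): on Spark 2.1 you can raise the threshold below which the planner broadcasts a table automatically, and from Spark 2.2 onwards there is also an explicit broadcast hint in SQL:

-- Spark 2.1+: auto-broadcast tables up to ~100 MB (the default is 10 MB)
SET spark.sql.autoBroadcastJoinThreshold=104857600;

-- Spark 2.2+ only: force-broadcast the small side with a hint
SELECT /*+ BROADCAST(g) */ *
FROM parquet_files.raw_e AS re
JOIN csv_file AS g
  ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id;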

Regards,
Gourav Sengupta

Re: Performance Issue

Tzahi File
Hi Gourav,

I tried removing the LEFT JOIN to see how it influences the performance; the difference was only about 3 minutes.
So I'm looking for a solution that decreases the running time more significantly (currently the running time is about 2 hours).

Re: Performance Issue

Arnaud LARROQUE
Hi,

Indeed, Spark uses spark.sql.autoBroadcastJoinThreshold to decide whether to auto-broadcast a dataset or not; the default value is 10 MB.
You can run an EXPLAIN, check the resulting plans, and see whether BroadcastHashJoins are being used, then adjust the threshold accordingly. There is no point in increasing it too far, as it will use too much memory on each executor.
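For instance, from the SQL interface (a sketch; the exact plan output differs between versions):

EXPLAIN EXTENDED
SELECT *
FROM parquet_files.raw_e AS re
JOIN csv_file AS g
  ON g.device_id = re.id AND g.advertiser_id = re.advertiser_id;
-- A broadcast of csv_file shows up as BroadcastHashJoin (with a
-- BroadcastExchange) in the physical plan; a shuffle join as SortMergeJoin.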

You could also try increasing spark.sql.shuffle.partitions to 2001 or more. In version 2.0.x, I've tracked down that above this limit partitions are compressed, and it may help remove some pressure from the executors.
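For example (assuming your interface passes SET statements through to Spark, as the spark-sql shell does):

-- 2001 crosses the 2000-partition threshold above which Spark
-- switches to a compressed map-status format for shuffle output
SET spark.sql.shuffle.partitions=2001;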

Can you tell us more about your job?
- Garbage collection pressure?
- Total number of tasks vs. number of executors (parallelism), and the CPU and memory allocated to each executor?

You can find all of these inputs in the Spark UI.

Regards,
Arnaud

