Poor performance caused by coalesce to 1

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Poor performance caused by coalesce to 1

James Yu
Hi Team,

We are running into this poor performance issue and seeking your suggestion on how to improve it:

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.


Thanks

James
Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

Silvio Fiorito

Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query full parallelization during processing while at the end giving you a single task that writes data out. Note that if the file is large (e.g. in 1GB or more) you’ll probably still notice slowness while writing. You may want to reconsider the 1-file requirement for larger datasets.

 

From: James Yu <[hidden email]>
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user <[hidden email]>
Subject: Poor performance caused by coalesce to 1

 

Hi Team,

 

We are running into this poor performance issue and seeking your suggestion on how to improve it:

 

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

 

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.

 

 

Thanks

 

James

Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

Stéphane Verlet-2
In reply to this post by James Yu
I had that issue too and from what I gathered, it is an expected optimization... Try using repartiion instead

On Feb 3, 2021, at 11:55, James Yu <[hidden email]> wrote:
Hi Team,

We are running into this poor performance issue and seeking your suggestion on how to improve it:

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.


Thanks

James
Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

srowen
In reply to this post by James Yu
Probably could also be because that coalesce can cause some upstream transformations to also have parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact. or do what Silvio said.

On Wed, Feb 3, 2021 at 12:55 PM James Yu <[hidden email]> wrote:
Hi Team,

We are running into this poor performance issue and seeking your suggestion on how to improve it:

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.


Thanks

James
Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

James Yu
In reply to this post by Silvio Fiorito
Hi Silvio,

The result file is less than 50 MB in size so I think it is small and acceptable enough for one task to write. 

Your suggestion sounds interesting. Could you guide us further on how to easily "add a stage boundary"?

Thanks

From: Silvio Fiorito <[hidden email]>
Sent: Wednesday, February 3, 2021 11:05 AM
To: James Yu <[hidden email]>; user <[hidden email]>
Subject: Re: Poor performance caused by coalesce to 1
 

Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query full parallelization during processing while at the end giving you a single task that writes data out. Note that if the file is large (e.g. in 1GB or more) you’ll probably still notice slowness while writing. You may want to reconsider the 1-file requirement for larger datasets.

 

From: James Yu <[hidden email]>
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user <[hidden email]>
Subject: Poor performance caused by coalesce to 1

 

Hi Team,

 

We are running into this poor performance issue and seeking your suggestion on how to improve it:

 

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

 

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.

 

 

Thanks

 

James

Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

Mich Talebzadeh
In reply to this post by srowen
That sounds like a plan as suggested by Sean, I have also seen caching the RS before coalesce provides benefits, especially for a minute 50MB data. Check Spark GUI storage tab for its effect.

HTH


Mich


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Wed, 3 Feb 2021 at 19:08, Sean Owen <[hidden email]> wrote:
Probably could also be because that coalesce can cause some upstream transformations to also have parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact. or do what Silvio said.

On Wed, Feb 3, 2021 at 12:55 PM James Yu <[hidden email]> wrote:
Hi Team,

We are running into this poor performance issue and seeking your suggestion on how to improve it:

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.


Thanks

James
Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

Gourav Sengupta
In reply to this post by James Yu
Hi,
as always, I would like to first identify the problem before solving the problem. 
So to isolate the problem, first without coalesce try to write the data out to a storage location and check the time. 
Then try to do coalesce to one and check the time.
If the time between writing down between coalesce and writing out to the files is very large, then the issue is coalesce. Otherwise the issue is the chain of transformations before coalesce. 
Anyways, its 2021, and I always get confused when people use RDD's. Any particular reason why dataframes would not work?


Regards,
Gourav Sengupta

On Wed, Feb 3, 2021 at 7:20 PM James Yu <[hidden email]> wrote:
Hi Silvio,

The result file is less than 50 MB in size so I think it is small and acceptable enough for one task to write. 

Your suggestion sounds interesting. Could you guide us further on how to easily "add a stage boundary"?

Thanks

From: Silvio Fiorito <[hidden email]>
Sent: Wednesday, February 3, 2021 11:05 AM
To: James Yu <[hidden email]>; user <[hidden email]>
Subject: Re: Poor performance caused by coalesce to 1
 

Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query full parallelization during processing while at the end giving you a single task that writes data out. Note that if the file is large (e.g. in 1GB or more) you’ll probably still notice slowness while writing. You may want to reconsider the 1-file requirement for larger datasets.

 

From: James Yu <[hidden email]>
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user <[hidden email]>
Subject: Poor performance caused by coalesce to 1

 

Hi Team,

 

We are running into this poor performance issue and seeking your suggestion on how to improve it:

 

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

 

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.

 

 

Thanks

 

James

Reply | Threaded
Open this post in threaded view
|

Re: Poor performance caused by coalesce to 1

Silvio Fiorito
In reply to this post by James Yu

As I suggested, you need to use repartition(1) in place of coalesce(1)

 

That will give you a single file output without losing parallelization for the rest of the job.

 

From: James Yu <[hidden email]>
Date: Wednesday, February 3, 2021 at 2:19 PM
To: Silvio Fiorito <[hidden email]>, user <[hidden email]>
Subject: Re: Poor performance caused by coalesce to 1

 

Hi Silvio,

 

The result file is less than 50 MB in size so I think it is small and acceptable enough for one task to write. 

 

Your suggestion sounds interesting. Could you guide us further on how to easily "add a stage boundary"?

 

Thanks


From: Silvio Fiorito <[hidden email]>
Sent: Wednesday, February 3, 2021 11:05 AM
To: James Yu <[hidden email]>; user <[hidden email]>
Subject: Re: Poor performance caused by coalesce to 1

 

Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query full parallelization during processing while at the end giving you a single task that writes data out. Note that if the file is large (e.g. in 1GB or more) you’ll probably still notice slowness while writing. You may want to reconsider the 1-file requirement for larger datasets.

 

From: James Yu <[hidden email]>
Date: Wednesday, February 3, 2021 at 1:54 PM
To: user <[hidden email]>
Subject: Poor performance caused by coalesce to 1

 

Hi Team,

 

We are running into this poor performance issue and seeking your suggestion on how to improve it:

 

We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough).  We found that after a series of transformations (GROUP BYs, FLATMAPs), we coalesced the final RDD to 1 partition before writing it out, and this coalesce degrade the performance, not that this additional coalesce operation took additional runtime, but it somehow dictates the partitions to use in the upstream transformations.

 

We hope there is a simple and useful way to solve this kind of issue which we believe is quite common for many people.

 

 

Thanks

 

James