30000 partitions vs 1000 partitions with Coalescing

dev nan
I would like to know why it is faster to write out an RDD that has 30,000 partitions as 30,000 files of 1 KB to 2 MB each than to coalesce it to 1,000 partitions and write out 1,000 S3 files of roughly 26 MB each, or even to 100 partitions and 100 S3 files of 260 MB each.

The coalescing takes a long time.
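
The file sizes quoted above can be cross-checked with a quick calculation. The sketch below is only an illustration: it assumes the data is spread evenly across files and treats 1 MB as 10^6 bytes, neither of which is stated in the post. The helper name `average_file_size_mb` is made up for this example.

```python
# Figures come from the question; MB = 10**6 bytes is an assumption.
MB = 10**6


def average_file_size_mb(total_bytes: int, num_files: int) -> float:
    """Average size per output file in MB, assuming an even spread."""
    return total_bytes / num_files / MB


# 1,000 files of roughly 26 MB each implies about 26 GB in total.
total_bytes = 1000 * 26 * MB

for num_files in (30_000, 1_000, 100):
    size = average_file_size_mb(total_bytes, num_files)
    print(f"{num_files:>6} files -> ~{size:.2f} MB each")
```

Under these assumptions the 30,000-file layout averages a little under 1 MB per file, which is consistent with the 1 KB to 2 MB range mentioned in the question.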


Thanks,

Adnan
Re: 30000 partitions vs 1000 partitions with Coalescing

Roland Johann
Hi Adnan,

Coalescing can involve a network shuffle to other executors (always with `repartition`, or with `coalesce` when `shuffle = true`). How many executors are configured for that job?
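
One possible reason the executor count matters: in Spark's RDD API, `coalesce(n)` defaults to `shuffle = false`, which is a narrow dependency, so the stage that produces the data also runs with only `n` tasks. The following is a minimal back-of-the-envelope model in plain Python, not Spark code. The cluster size (50 executors with 4 cores each) is hypothetical, since the thread does not state it, and the helper names are invented for illustration.

```python
import math


def task_waves(num_tasks: int, total_cores: int) -> int:
    """Sequential rounds needed to run num_tasks on total_cores task slots."""
    return math.ceil(num_tasks / total_cores)


def busy_cores(num_tasks: int, total_cores: int) -> int:
    """Cores that have any work at all when the stage has num_tasks tasks."""
    return min(num_tasks, total_cores)


def parent_partitions_per_task(parent_partitions: int, num_tasks: int) -> int:
    """Upper bound on parent partitions each coalesced task reads in sequence."""
    return math.ceil(parent_partitions / num_tasks)


# Hypothetical cluster, not taken from the thread: 50 executors x 4 cores.
total_cores = 50 * 4
parent_partitions = 30_000

for n in (30_000, 1_000, 100):
    print(
        f"{n:>6} tasks: {task_waves(n, total_cores)} waves, "
        f"{busy_cores(n, total_cores)} of {total_cores} cores busy, "
        f"up to {parent_partitions_per_task(parent_partitions, n)} parent partitions per task"
    )
```

In this model, coalescing to 100 partitions leaves half of the assumed cores idle while each task works through up to 300 parent partitions one after another. Whether that explains the slowdown in this particular job depends on the real executor count and on whether a shuffle is used.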

Best regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: [hidden email]
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann



On 23.04.2020 at 20:41, dev nan <[hidden email]> wrote:

I would like to know why it is faster to write out an RDD that has 30,000 partitions as 30,000 files of 1 KB to 2 MB each than to coalesce it to 1,000 partitions and write out 1,000 S3 files of roughly 26 MB each, or even to 100 partitions and 100 S3 files of 260 MB each.

The coalescing takes a long time.


Thanks,

Adnan