[pyspark 2.4] maxrecordsperfile option

rishishah.star
Hi All,

Spark 2.2 introduced the maxRecordsPerFile option for writing data. Could someone help me understand the performance impact of using maxRecordsPerFile (a single pass that writes the data with this option set) versus repartitioning (a two-stage process where we write the data out first and consolidate the files later)?
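
For concreteness, here is a minimal PySpark sketch of the two approaches being compared. The function names, paths, Parquet output format, and default values are hypothetical and chosen only for illustration; `df` is assumed to be an existing DataFrame and `spark` an existing SparkSession.

```python
# Sketch only: function names, paths, format and defaults are assumptions.

def write_single_pass(df, path, max_records=1_000_000):
    """Approach A: one write job; the writer starts a new output file
    whenever the current file reaches max_records rows."""
    (df.write
       .option("maxRecordsPerFile", max_records)
       .mode("overwrite")
       .parquet(path))


def write_then_consolidate(spark, df, staging_path, final_path, num_files=8):
    """Approach B: write the data as-is, then read it back and
    repartition it into num_files partitions before the final write."""
    # Stage 1: write with whatever partitioning df already has.
    df.write.mode("overwrite").parquet(staging_path)

    # Stage 2: re-read, shuffle into num_files partitions, write again.
    (spark.read.parquet(staging_path)
          .repartition(num_files)
          .write
          .mode("overwrite")
          .parquet(final_path))
```

Approach A avoids the second read and the shuffle, while approach B controls the number of output files directly rather than the number of rows in each file.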

--
Regards,

Rishi Shah
Re: [pyspark 2.4] maxrecordsperfile option

Shraddha Shah
After digging in a bit more, it looks like maxRecordsPerFile does not give the full write parallelism we expected. Any thoughts on this would be really helpful.
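
One possible explanation, as far as I understand the writer: the row cap is applied inside each write task, so the option splits a task's output into more files but does not add tasks. The plain-Python sketch below models that behaviour. It is not Spark code; the function name and the row counts are made up for illustration.

```python
def summarize(partition_row_counts, max_records_per_file):
    """Model a write where each partition is one task and each task
    rolls over to a new file every max_records_per_file rows."""
    # Integer ceiling division: files needed for each task's rows.
    files_per_task = [
        -(-rows // max_records_per_file) for rows in partition_row_counts
    ]
    return {
        "write_tasks": len(partition_row_counts),  # parallelism of the write
        "total_files": sum(files_per_task),
    }


# One large partition: ten files, but a single task writes them in sequence.
print(summarize([10_000_000], 1_000_000))

# Same data in four partitions: four tasks can write concurrently.
print(summarize([2_500_000] * 4, 1_000_000))
```

Under this model, the number of concurrent writers is set by how the DataFrame is partitioned before the write, and maxRecordsPerFile only bounds the size of each file a task produces.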

On Sat, Nov 23, 2019 at 11:36 PM Rishi Shah <[hidden email]> wrote:
Hi All,

Spark 2.2 introduced the maxRecordsPerFile option for writing data. Could someone help me understand the performance impact of using maxRecordsPerFile (a single pass that writes the data with this option set) versus repartitioning (a two-stage process where we write the data out first and consolidate the files later)?

--
Regards,

Rishi Shah