[Spark SQL] [Spark 2.4.0] Performance regression when reading parquet files from S3

Yann Moisan
Hello,

A Spark job on EMR reads Parquet files located in an S3 bucket.

I use this option: spark.hadoop.fs.s3a.experimental.input.fadvise=random

When the EC2 instances and the bucket are in the same region, performance is roughly the same as before Spark 2.4.0, but when they are not, performance drops (job duration is multiplied by 2).

Note: using the default value for the parameter mitigates the issue.

spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
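
To compare the two settings side by side, the same job can be submitted once per policy. Below is a minimal sketch that only builds the spark-submit command lines; the script name read_parquet_job.py and the helper spark_submit_command are hypothetical and not part of the actual job.

```python
import shlex

# Configuration key from the post; the "spark.hadoop." prefix makes Spark
# forward the setting to the Hadoop S3A connector.
FADVISE_KEY = "spark.hadoop.fs.s3a.experimental.input.fadvise"

# Input policies accepted by the S3A connector.
VALID_POLICIES = ("sequential", "normal", "random")


def spark_submit_command(policy, app="read_parquet_job.py"):
    """Return a spark-submit argv that sets the S3A input policy.

    `app` is a hypothetical placeholder for the real job script.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown fadvise policy: {policy!r}")
    return ["spark-submit", "--conf", f"{FADVISE_KEY}={policy}", app]


if __name__ == "__main__":
    # Print one command per policy so both runs can be timed separately.
    for policy in ("random", "sequential"):
        print(shlex.join(spark_submit_command(policy)))
```

Timing each run in both the same-region and cross-region setups should show whether the slowdown follows the fadvise value or the Spark version.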

Any idea what changed in Spark 2.4.0 that could explain this issue?