Spark performance over S3


Spark performance over S3

Tzahi File

Hi All,

We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.

The Spark job running on that cluster reads from an S3 bucket and writes back to the same bucket.

The bucket and the EC2 instances are in the same region.

As part of our efforts to reduce the runtime of our Spark jobs, we found there is serious latency when reading from S3.

When the job:

  • reads the parquet files from S3 and also writes to S3, it takes 22 min
  • reads the parquet files from S3 and writes to its local HDFS, it takes the same amount of time (±22 min)
  • reads the parquet files from S3 (they were copied into HDFS beforehand) and writes to its local HDFS, it takes 7 min
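Schematically, the job in all three cases does something like the following (a simplified sketch - the real paths and transformations are omitted, and the names below are placeholders, not our actual ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-parquet-job").getOrCreate()

# Read the input parquet files (s3a:// for the S3 cases, hdfs:// when they were copied in first)
df = spark.read.parquet("s3a://our-bucket/input/")  # placeholder path

# ... our transformations go here ...

# Write the result either back to S3 or to the cluster's local HDFS
df.write.mode("overwrite").parquet("s3a://our-bucket/output/")  # or "hdfs:///output/"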

The Spark job has the following S3-related configuration:

  • spark.hadoop.fs.s3a.connection.establish.timeout=5000
  • spark.hadoop.fs.s3a.connection.maximum=200

When reading from S3 we tried increasing the spark.hadoop.fs.s3a.connection.maximum config parameter from 200 to 400 and then to 900, but it didn't reduce the S3 latency.
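For reference, this is roughly how we set these options (a trimmed sketch of our session setup; the rest of the configuration is omitted):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-parquet-job")
    .config("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")  # we also tried 400 and 900
    .getOrCreate()
)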

Do you have any idea what could be causing the read latency from S3?

I saw this post about improving the transfer speed - is anything there relevant?



Thanks,
Tzahi

Re: Spark performance over S3

Gourav Sengupta
Hi Tzahi,

That is a huge cost. So that I can understand the question before answering it:
1. What Spark version are you using?
2. What is the SQL code that you are using to read and write?

There are several other questions that are pertinent, but the above will be a great starting point.

Regards,
Gourav Sengupta


RE: Spark performance over S3

Boris Litvak

Hi Tzahi,

I don't know the reason for this, but I'd check whether the fs.s3a implementation is using multipart uploads, which I assume it does.
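If you want to inspect or tune the upload side, something along these lines is where I'd start (just a sketch - the property names are from the Hadoop S3A documentation, and the values are examples to experiment with, not recommendations I've verified on your workload):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Size of each part in a multipart upload (example value)
    .config("spark.hadoop.fs.s3a.multipart.size", "128M")
    # Objects above this size are uploaded with multipart (example value)
    .config("spark.hadoop.fs.s3a.multipart.threshold", "128M")
    # Buffer outgoing blocks on local disk rather than in memory
    .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
    .getOrCreate()
)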

 

I would say that none of the comments in the link are relevant to you, as the VPC endpoint is more of a security feature than a performance one.

I got an answer from AWS support recently saying that they tested this against S3 access via the public internet and the differences were negligible.
There is always a chance it was not tested in your region, but that's unlikely. Either way, you can provision and test this with the AWS CLI.
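For example, a quick throughput sanity check from one of the EC2 instances could look like this (a sketch using boto3 instead of the CLI; the bucket and key are placeholders for one of your real parquet objects):

import time
import boto3

BUCKET = "our-bucket"                 # placeholder
KEY = "input/part-00000.parquet"      # placeholder

s3 = boto3.client("s3")

start = time.time()
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()  # download the whole object
elapsed = time.time() - start

mb = len(body) / 1e6
print(f"downloaded {mb:.1f} MB in {elapsed:.2f}s ({mb / elapsed:.1f} MB/s)")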

 

There is always the option of comparing this with EMRFS performance...
I know it requires you to put in some work.

Boris

 


Re: Spark performance over S3

Hariharan
In reply to this post by Tzahi File
Hi Tzahi,

Comparing the first two cases:

> reads the parquet files from S3 and also writes to S3, it takes 22 min
> reads the parquet files from S3 and writes to its local hdfs, it takes the same amount of time (±22 min)

It looks like most of the time is being spent in reading, and the time spent in writing is likely negligible (probably you're not writing much output?).

Can you clarify what the difference is between these two?

> reads the parquet files from S3 and writes to its local hdfs, it takes the same amount of time (±22 min)
> reads the parquet files from S3 (they were copied into the hdfs before) and writes to its local hdfs, the job took 7 min

In the second case, was the data read from HDFS or S3?

Regarding the points from the post you linked to:
1. Enhanced networking does make a difference, but it should be automatically enabled if you're using a compatible instance type and an AWS AMI. However, if you're using a custom AMI, you might want to check that it's enabled for you.
2. VPC endpoints can also make a difference in performance - at least that used to be the case a few years ago. Maybe that has changed now.

A couple of other things you might want to check:
1. If your bucket is versioned, check whether you're using the ListObjectsV2 API in S3A.
2. Also check these recommendations from Cloudera for optimal use of S3A (a rough sketch of a few such settings follows below).
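For example, read-side tuning along these lines is often suggested there (a sketch only - the property names come from the Hadoop/Cloudera S3A docs, and the values are starting points to experiment with, not something I've tested against your workload):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Columnar formats like parquet usually benefit from random-access reads
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
    # How far to read ahead on each stream (example value)
    .config("spark.hadoop.fs.s3a.readahead.range", "1M")
    # Use the ListObjectsV2 API for listings
    .config("spark.hadoop.fs.s3a.list.version", "2")
    .getOrCreate()
)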

Thanks,
Hariharan



Re: Spark performance over S3

Vladimir Prus
A VPC endpoint can also make a major difference in cost. Without it, access to S3 incurs data transfer costs and NAT costs, and these can be large.

Re: Spark performance over S3

Tzahi File
In reply to this post by Hariharan
Hi Hariharan,

Thanks for your reply.

In both cases we are writing the data to S3. The difference is that in the first case we read the data from S3 and in the second we read it from HDFS.
We are using the ListObjectsV2 API in S3A.

The S3 bucket and the cluster are located in the same AWS region.



RE: Spark performance over S3

Boris Litvak

Oh, Tzahi, I misread the metrics in the first reply. It's about reads indeed, not writes.

     
