Spark & S3 - Introducing random values into key names


Spark & S3 - Introducing random values into key names

Subhash Sriram
Hey Spark user community,

I am writing Parquet files from Spark to S3 using S3A. I was reading an article about improving S3 bucket performance, specifically about how introducing randomness into your key names can help spread writes across different partitions.


Is there a straightforward way to accomplish this randomness in Spark via the DataSet API? The only thing I could think of would be to actually split the large set into multiple sets (based on row boundaries) and then write each one with a random key name.

Is there an easier way that I am missing?

Thanks in advance!
Subhash



Re: Spark & S3 - Introducing random values into key names

Vadim Semenov-2
You need to put the randomness at the beginning of the key; if it appears anywhere other than the beginning, good performance is not guaranteed.
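A minimal Python sketch of what "randomness at the beginning of the key" means in practice: prepend a short random hex string before the rest of the key. The helper name and prefix length here are illustrative, not something from the thread.

```python
import secrets

def randomized_key(base_key: str, prefix_len: int = 4) -> str:
    """Prepend a short random hex prefix (illustrative length) so that
    keys are spread across S3's underlying index partitions."""
    prefix = secrets.token_hex(prefix_len // 2)  # prefix_len hex characters
    return f"{prefix}/{base_key}"

# e.g. randomized_key("tables/events/part-00000.parquet")
#      -> "<4 random hex chars>/tables/events/part-00000.parquet"
```

The important point is that the random component leads the key, so requests fan out across partitions rather than hot-spotting a single prefix.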

The way we achieved this was to write to HDFS first and then run a custom DistCp, implemented in Spark, that copies the Parquet files to S3 under random keys. It also saves the list of resulting keys to S3; when we want to use those Parquet files, we just load the listing file, take the keys from it, and pass them to the loader.
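A sketch of the copy-planning half of that approach, in plain Python with no S3 client (function and parameter names are hypothetical; a real implementation would perform the copies with an S3 client and distribute the work across Spark executors):

```python
import secrets

def plan_random_copies(source_keys, dest_prefix="parquet"):
    """Map each source file to a destination key that begins with a random
    hex string, and build a listing of the destination keys so that a
    reader can later load this listing instead of enumerating the bucket."""
    mapping = {}
    for key in source_keys:
        filename = key.rsplit("/", 1)[-1]
        # Random 8-hex-char component first, then a readable suffix.
        dest = f"{secrets.token_hex(4)}/{dest_prefix}/{filename}"
        mapping[key] = dest
    listing = "\n".join(sorted(mapping.values()))
    return mapping, listing
```

The listing file is what makes random keys usable later: since the destination keys share no common prefix, you cannot recover them with a simple prefix listing, so you hand the saved key list to the loader instead.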

You only need to do this when you have a very large number of files. If the number of keys you operate on is reasonably small (say, in the thousands), you won't see any benefit.

Also, S3 buckets have internal optimizations, and over time a bucket adjusts to the workload: additional underlying partitions get added, splits happen, and so on.
If you want good performance from the start, then yes, you need randomization.
Alternatively, you can contact AWS and tell them about the naming scheme you're going to use (but it must be set in stone), and they can try to pre-optimize the bucket for you.

On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram <[hidden email]> wrote:


Re: Spark & S3 - Introducing random values into key names

Subhash Sriram
Thanks, Vadim! That helps and makes sense. I don't think we have so many keys that we need to worry about it. If we do, I would go with an approach similar to the one you suggested.

Thanks again,
Subhash 


On Mar 8, 2018, at 11:56 AM, Vadim Semenov <[hidden email]> wrote:
