Question on bucketing vs sorting


Patrik Iselind
Hi everyone,

I am trying to push my understanding of bucketing vs sorting. I hope I can get some clarification from this list.

Bucketing, as I've come to understand it, is primarily intended for preparing a dataframe for join operations: the goal is to get data that will be joined together into the same partition, to make the joins faster.

Sorting, on the other hand, is simply for when I want my data sorted; nothing strange there, I guess.

The effect of using bucketing, as I see it, would be the same as sorting if I'm not doing any joining and I use enough buckets, as in the following example program, where the sorting or bucketing would replace the `?()` transformation.

```pseudo code
df = spark.read.parquet("s3://...")
# df contains the columns A, B, and C
df2 = df.distinct().?().repartition(num_desired_partitions)
df2.write.parquet("s3://...")
```
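As a toy illustration of why the two are not equivalent even with many buckets, here is a plain-Python sketch (not Spark's actual implementation, but Spark's bucketing does assign rows by a hash of the key modulo the number of buckets, which is what the sketch mimics; the sample data is made up):

```python
# Values of column A, in arbitrary input order (made-up sample data).
rows = [2, 0, 1, 0, 2, 1, 0, 2]

# Bucketing: each row goes to hash(key) % num_buckets, so rows with the
# same A value share a bucket, but bucket order is unrelated to value order.
num_buckets = 4
buckets = {b: [] for b in range(num_buckets)}
for a in rows:
    buckets[hash(a) % num_buckets].append(a)

# Sorting: rows are clustered by the value of A itself.
sorted_layout = sorted(rows)

print(buckets)        # each bucket holds rows of the values hashed into it
print(sorted_layout)  # equal values are adjacent, in value order
```

Both layouts keep equal A values together, but bucketing fixes the number of groups up front and can place several distinct values in one bucket, while sorting gives one contiguous run per value.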

Is my understanding correct or am I missing something?

Is there a performance consideration between sorting and bucketing that I need to keep in mind?

The end goal for me here is not that the data as such is sorted on the A column; it's that each distinct value of A is kept together with all other rows that have the same value in A. If all rows with the same A value cannot fit within one partition, then I accept that there's more than one partition with the same value in the A column. If there's room left in a partition, then it would be fine for rows with another value of A to fill up the partition.

I would like something as depicted below:
```desirable example
-- Partition 1
A|B|C
=====
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
-- Partition 2
A|B|C
=====
2|?|?
0|?|?
0|?|?
0|?|?
1|?|?
```

What I don't want is something like below:

```undesirable example
-- Partition 1
A|B|C
=====
0|?|?
0|?|?
1|?|?
0|?|?
1|?|?
2|?|?
1|?|?
-- Partition 2
A|B|C
=====
0|?|?
0|?|?
0|?|?
1|?|?
2|?|?
```
That is, a layout where the A value varies freely within each partition.
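The desired layout amounts to ordering the rows so that equal A values are adjacent, then cutting the ordered stream into fixed-size partitions, so a value spills into the next partition only when it doesn't fit. A plain-Python sketch of that packing (the partition size and row counts are made up to mirror the example above; any ordering that keeps equal values adjacent would do, sorting is just the simplest):

```python
# Made-up sample: eight rows with A=2, three with A=0, one with A=1.
values = [2] * 8 + [0] * 3 + [1]
partition_size = 7  # illustrative target number of rows per partition

# Keep equal values adjacent, then pack the stream into fixed-size chunks.
ordered = sorted(values)
partitions = [ordered[i:i + partition_size]
              for i in range(0, len(ordered), partition_size)]

print(partitions)  # [[0, 0, 0, 1, 2, 2, 2], [2, 2, 2, 2, 2]]
```

Each distinct value occupies one contiguous run, so it crosses a partition boundary only where a run overflows, which is exactly the tolerance described above.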

Patrik Iselind

Re: Question on bucketing vs sorting

Peyman Mohajerian
You can save your data to HDFS or other targets using either a sorted or a bucketed dataframe. In the case of bucketing, you will have a different data-skipping mechanism when you read the data back, compared to the sorted version.
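To make that data-skipping contrast concrete, here is a rough plain-Python sketch (not Spark's actual reader; the file names, stats, and bucket count are hypothetical): a layout sorted on A gives per-file min/max statistics that prune files for range predicates, while a bucketed layout prunes by hashing the lookup key, which only helps equality predicates:

```python
# Hypothetical per-file (min, max) stats for data sorted on column A.
sorted_files = {"part-0": (0, 3), "part-1": (4, 7), "part-2": (8, 11)}

def files_for_range(lo, hi):
    """Skip any file whose [min, max] range cannot overlap the predicate."""
    return [f for f, (mn, mx) in sorted_files.items() if mx >= lo and mn <= hi]

# Hypothetical bucketed layout: a key lives in exactly one hash bucket.
num_buckets = 3

def bucket_for_key(key):
    """An equality lookup reads one bucket; a range predicate gets no help."""
    return hash(key) % num_buckets

print(files_for_range(5, 6))  # ['part-1'] -- the other files are skipped
print(bucket_for_key(10))     # a single bucket index out of three
```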


Re: Question on bucketing vs sorting

Patrik Iselind
Thank you Peyman for clarifying this for me.
Would you say there's a case for using bucketing here at all, or should I simply focus on the sorting solution? And in general, when would you say bucketing is the preferred solution?

Patrik Iselind



Re: Question on bucketing vs sorting

Peyman Mohajerian
First, there are Hive partitions (at-rest partitioning) versus Spark partitions; make sure you're not confusing the two.

Bucketing is worth using if the cardinality of the column you want to bucket by isn't too high, you don't have data skewness with respect to the buckets, each bucket holds at least 256 MB to 1 GB of data, and your total data size is large enough, not just a few MB.

Each bucket doesn't necessarily translate to one Spark partition: if you have a couple of gigabytes per bucket (at rest), Spark will create many partitions per bucket. So whether you sort or bucket, the same value won't necessarily end up in the same partition, unless you force it by using a specific number of partitions (the same as your number of buckets), which may not be a good idea if you have too much data per bucket.
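The bucket-versus-partition arithmetic above can be sketched like this (the 128 MB figure mirrors Spark's default `spark.sql.files.maxPartitionBytes`; the bucket names and sizes are made up):

```python
import math

# Hypothetical at-rest sizes: two buckets of about 2 GB each.
bucket_bytes = {"bucket-0": 2 * 1024**3, "bucket-1": 2 * 1024**3}
max_partition_bytes = 128 * 1024**2  # 128 MB read-split size (assumed default)

# A large bucket is split into many read partitions, so one bucket does
# not necessarily map to one Spark partition.
partitions_per_bucket = {
    name: math.ceil(size / max_partition_bytes)
    for name, size in bucket_bytes.items()
}
print(partitions_per_bucket)  # 16 read partitions per 2 GB bucket
```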

