Spark Data Frame. PreSorted partitions


Николай Ижиков
Hello, guys!

I am working on a custom DataSource implementation for the Spark DataFrame API and have a question:

For a query such as `SELECT * FROM table1 ORDER BY some_column`, I can sort the data inside each partition in my data source.

Is there a built-in way to tell Spark that the data in each partition is already sorted?

It seems that Spark could benefit from partitions that are already sorted, for example by using a distributed merge sort algorithm.
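
To illustrate the idea, here is a minimal sketch in plain Python (not Spark code, and not a claim about how Spark sorts internally). It assumes each partition arrives already sorted in ascending order; the partition contents and the helper name `merge_sorted_partitions` are hypothetical. A k-way merge only compares the current head of each partition, so it needs roughly O(n log k) comparisons for n rows in k partitions, instead of a full O(n log n) sort.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_sorted_partitions(partitions: Iterable[Iterable[int]]) -> Iterator[int]:
    """Lazily merge partitions that are each already sorted ascending."""
    # heapq.merge keeps one cursor per input and repeatedly yields the
    # smallest head element, so rows inside a partition are never re-sorted.
    return heapq.merge(*partitions)


# Hypothetical partitions, each presorted by the data source.
partitions: List[List[int]] = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = list(merge_sorted_partitions(partitions))
print(merged)
```

The merge step is correct only if every input really is sorted, which is why the engine would need some way to be told that the data source guarantees it.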

Does that make sense to you?


Re: Spark Data Frame. PreSorted partitions

MidwestMike
I'm not sure of any way other than reading from a Hive table that is already sorted. This sounds cool, though, and I would be interested to know the answer as well.

In reply to the message from "Николай Ижиков" <[hidden email]>, Nov 28, 2017 10:40 AM.