Using partition information from parquet files during groupBy

jans70
This post has NOT been accepted by the mailing list yet.
We've been loading data from Parquet files partitioned by a couple of keys, then doing a groupBy on the same keys on the resulting dataset. The job still performs a shuffle during the groupBy, even though the partition keys have been correctly recognised (we verified this by filtering on the same keys). This seems like a waste, given that the partitions read from disk are already grouped correctly.
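
For reference, here is a minimal sketch of the pattern described above. The path, the column names (`year`, `month`, `value`) and the sample rows are made up for illustration and are not our actual job; it assumes a local Spark 2.x session.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-groupby")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location and sample data, for illustration only.
    val path = "/tmp/events_parquet"
    Seq((2017, 1, 10L), (2017, 1, 5L), (2017, 2, 7L))
      .toDF("year", "month", "value")
      .write.mode("overwrite")
      .partitionBy("year", "month") // one directory per (year, month) pair
      .parquet(path)

    // Partition discovery turns the directory names back into columns.
    val df = spark.read.parquet(path)

    // Filtering on the partition keys: print the physical plan to check
    // that the keys are recognised (they show up as partition filters on the scan).
    df.filter($"year" === 2017 && $"month" === 1).explain()

    // Grouping on the same keys: print the physical plan and look for an
    // Exchange (shuffle) step between the partial and final aggregation.
    df.groupBy("year", "month").count().explain()

    spark.stop()
  }
}
```

Comparing the two printed plans is how we checked both points: the filter plan shows the partition keys being used at the scan, while the groupBy plan still contains an exchange on the same keys.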

Is it the case that Parquet partition keys only take effect for filters? Are there any plans to use them for groupBy as well? (We're on Spark 2.1.0.)

The same question was asked last year but was never answered (and was marked as not accepted!): http://apache-spark-user-list.1001560.n3.nabble.com/Using-partition-information-from-parquet-file-to-avoid-shuffle-during-group-by-td27245.html

Any insight into this would be greatly appreciated.

Jan