Using partition information from parquet files during groupBy

We've been loading data from Parquet files partitioned by a couple of keys and then doing a groupBy on those same keys. The job still performs a shuffle during the groupBy, even though the partition keys have been correctly recognised (as verified by filtering on the same keys). This seems wasteful, given that the partitions read from disk are already grouped correctly.
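
For reference, a minimal sketch of the scenario described above might look like the following in Scala. The column names (`year`, `country`, `value`), the sample rows, the temp directory and the local master are all hypothetical placeholders, not the actual job. It only prints the physical plans so they can be inspected.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object PartitionedGroupByRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect the plans.
    val spark = SparkSession.builder()
      .appName("partitioned-groupby-repro")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical output location inside a fresh temp directory.
    val path = Files.createTempDirectory("events").resolve("parquet").toString

    // Write a small dataset partitioned on disk by two keys.
    Seq(
      ("2017", "uk", 1L),
      ("2017", "us", 2L),
      ("2018", "uk", 3L),
      ("2018", "us", 4L)
    ).toDF("year", "country", "value")
      .write
      .partitionBy("year", "country")
      .parquet(path)

    // Reading back discovers year/country as partition columns.
    val df = spark.read.parquet(path)

    // Filtering on the partition keys: inspect the scan node of this plan.
    df.filter($"year" === 2017 && $"country" === "uk").explain()

    // Grouping on the same keys: inspect this plan for an Exchange (shuffle) node.
    df.groupBy("year", "country").count().explain()

    spark.stop()
  }
}
```

Comparing the two printed plans is one way to check both observations: whether the filter is applied at the partition level in the scan, and whether the groupBy plan still contains an exchange between the partial and final aggregation.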

Is it the case that partition keys from Parquet only take effect for filters? Are there any plans to use them for groupBy as well? (We're on Spark 2.1.0.)

This question about the same issue was asked last year but wasn't answered (and marked as not accepted!):

Any insight into this would be greatly appreciated.