Using partition information from parquet files during groupBy
This post has NOT been accepted by the mailing list yet.
We've been looking at loading data from Parquet files partitioned by a couple of keys, then doing a groupBy on the same keys on the resulting dataset. The resulting job performs a shuffle on the dataset during the groupBy even though the partition keys have been correctly recognised (as verified by filtering on the same keys). This seems like a waste, given that the partitions read from disk are already grouped correctly.
Is it the case that partition keys from Parquet only take effect using filters? Any plans to use them for groupBy as well? (We're on Spark 2.1.0)