Informing Spark about a specific partitioning scheme to avoid shuffles
My environment is PySpark with Spark 2.0.0.
I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading, I ensured that records are partitioned by field1 and field2 (without using partitionBy). This was done while the data was still an RDD of lists, before the .toDF() call. So I assume Spark does not know that such a partitioning exists, and might trigger a shuffle if I call a shuffling transform keyed on field1 or field2 and then cache the result.
Is it possible to inform Spark, once I've created the DataFrame, about my custom partitioning scheme? Or would Spark have already discovered it somehow before the shuffling transform is called?