Informing Spark about specific Partitioning scheme to avoid shuffles


saatvikshah1994
Hi everyone,

My environment is PySpark with Spark 2.0.0.

I'm using Spark to load data from a large number of files into a DataFrame with fields, say, field1 to field10. While loading, I ensured that the records are partitioned by field1 and field2 (without using partitionBy). This was done while the data was still an RDD of lists, before the .toDF() call. I therefore assume Spark does not know that this partitioning already exists, and will trigger a shuffle if I call a shuffling transform keyed on field1 or field2 and then cache the result.

Is it possible to inform Spark, once I've created the DataFrame, about my custom partitioning scheme? Or would Spark have already discovered it somehow before the shuffling transform is called?
