Spark join: grouping of records having same value for a particular column in same partition
We have 2 Hive tables which are read in spark and joined using a join key, let’s call it
Then, we write this joined dataset to S3 and register it hive as a 3rd table for subsequent tasks to use this joined dataset.
One of the other columns in the joined dataset is called
We want to group all the user records belonging to the same keychain_id in the same partition for a reason to avoid shuffles later.
So, can I do a repartition(“keychain_id”) before writing to s3 and registering it in Hive , and when I read the same data back from this third table will it still have the same partition grouping (all
users belonging to the
Same keychain_id in the same partition)? Because trying to avoid doing a repartition(“keychain_id”) every time when reading from this 3rd table.
Can you please clarify ? If there is no guarantee that it will retain the same partition grouping while reading, then is there another efficient way this can be done other than caching?