Optimizing a join with bucketing


Vitaliy Pisarev
I am joining two entities.
One of them is ~0.5 TB; the other is ~16 GB.

Both are stored in parquet.

The "smaller" entity also does not change, so I figured I'd pre-bucket it once
to improve join performance.
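
To be concrete about what I mean by pre-bucketing, here is a minimal sketch, assuming PySpark and its `DataFrameWriter.bucketBy` API. The helper name `write_bucketed` and its arguments (table name, join key, bucket count) are hypothetical placeholders, not names from my actual job.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; nothing from Spark is imported at runtime
    from pyspark.sql import DataFrame


def write_bucketed(df: "DataFrame", table_name: str, join_key: str, num_buckets: int) -> None:
    """Persist `df` as a Parquet table bucketed (and sorted) on the join key.

    Assumption: `df` is the smaller, static entity and `join_key` is the
    column used in the join. Bucketed output must be saved as a table,
    so this uses saveAsTable rather than a plain path-based save.
    """
    (
        df.write
        .bucketBy(num_buckets, join_key)  # hash rows into num_buckets by the join key
        .sortBy(join_key)                 # sort rows within each bucket
        .mode("overwrite")
        .format("parquet")
        .saveAsTable(table_name)
    )
```

The value to pass as `num_buckets` is exactly what I am unsure about.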

* What are the guidelines for choosing the number of buckets in this case?
Does it depend solely on the overall size of the bucketed entity, or do I also need to take into account the size of the unbucketed one? If so, how?
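
To illustrate the trade-off I am weighing, here is a rough back-of-the-envelope sketch of how the bucket count translates into the average data per bucket for the ~16 GB entity. The candidate bucket counts are arbitrary assumptions for illustration, not recommendations, and I am treating 16 GB as 16 GiB and assuming the data hashes evenly across buckets.

```python
def per_bucket_mib(total_gib: float, num_buckets: int) -> float:
    """Average data per bucket in MiB, assuming rows hash evenly across buckets."""
    return total_gib * 1024 / num_buckets


SMALL_ENTITY_GIB = 16  # the ~16 GB entity from the question

# Candidate bucket counts are arbitrary, chosen only to show the scaling.
for n in (64, 128, 256, 512):
    print(f"{n} buckets: ~{per_bucket_mib(SMALL_ENTITY_GIB, n):.0f} MiB per bucket")
```

This only covers the bucketed side; it does not say anything about how the ~0.5 TB entity should influence the choice, which is the part I am asking about.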