understanding spark shuffle file re-use better

understanding spark shuffle file re-use better

Koert Kuipers
is shuffle file re-use based on identity or equality of the dataframe?

for example if i run the exact same code twice to load data and do transforms (joins, aggregations, etc.) but without re-using any actual dataframes, will i still see skipped stages thanks to shuffle file re-use?

thanks!
koert

Re: understanding spark shuffle file re-use better

Jacek Laskowski
Hi,

An interesting question that I must admit I'm not sure how to answer myself actually :)

Off the top of my head, I'd **guess** that unless you cache the first query, these two queries would share nothing. With caching, there's a phase in query execution when a canonicalized version of the query is used to look up any cached queries.

Again, I'm not really sure, and if I had to answer it (e.g. as part of an interview) I'd say nothing would be shared / re-used.
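To make the identity-versus-equality distinction concrete, here is a toy model in plain Python. It is not Spark code, and every name in it (`Plan`, `ShuffleDep`, `run_shuffle`, `build_query`) is made up for illustration. It assumes that shuffle output is tracked per shuffle-dependency object (each newly built dependency gets a fresh id), while cached data is looked up by a canonicalized, structural form of the plan.

```python
import itertools

# Toy model only: none of these classes or functions exist in Spark.
_next_shuffle_id = itertools.count()


class Plan:
    """A hypothetical logical plan: an operator name plus child plans."""

    def __init__(self, op, *children):
        self.op = op
        self.children = children

    def canonicalized(self):
        # Structural form, so two separately built but identical plans compare equal.
        return (self.op, tuple(c.canonicalized() for c in self.children))


class ShuffleDep:
    """A hypothetical shuffle dependency; every new object gets a fresh id."""

    def __init__(self, plan):
        self.plan = plan
        self.shuffle_id = next(_next_shuffle_id)


shuffle_outputs = {}  # keyed by shuffle id, i.e. by identity of the dependency
query_cache = {}      # keyed by canonicalized plan, i.e. by equality


def run_shuffle(dep):
    """Return 'skipped' if output for this dependency already exists, else 'ran'."""
    if dep.shuffle_id in shuffle_outputs:
        return "skipped"
    shuffle_outputs[dep.shuffle_id] = "map output files"
    return "ran"


def build_query():
    # The same code on every call, but fresh objects each time.
    return Plan("aggregate", Plan("join", Plan("load a"), Plan("load b")))


first, second = build_query(), build_query()
dep1, dep2 = ShuffleDep(first), ShuffleDep(second)

print(run_shuffle(dep1))  # first run of this dependency
print(run_shuffle(dep1))  # same dependency object again
print(run_shuffle(dep2))  # equal plan, but a different dependency object

# Caching the first query makes an equal second query discoverable.
query_cache[first.canonicalized()] = "cached data"
print(second.canonicalized() in query_cache)
```

In this model, re-running the same dependency object skips the shuffle, a separately built but identical query does not, and only the equality-keyed cache lets the second query find the first one's work. Whether a real Spark job behaves this way for a given query is best checked in the web UI.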

On Wed, Jan 13, 2021 at 5:39 PM Koert Kuipers <[hidden email]> wrote:
is shuffle file re-use based on identity or equality of the dataframe?

for example if i run the exact same code twice to load data and do transforms (joins, aggregations, etc.) but without re-using any actual dataframes, will i still see skipped stages thanks to shuffle file re-use?

thanks!
koert