Broadcast join data reuse

tcondie

We have a case where data that is small enough to be broadcast is joined with multiple tables in a single plan. Looking at the physical plan, I do not see anything that indicates whether the broadcast is done only once, i.e., that the BroadcastExchange is reused and the data is not redistributed from scratch. Could someone with insight into the physical plan strategy for such a case confirm whether previously broadcast data is reused, or whether subsequent BroadcastExchange steps are done from scratch?
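
One way to check this, sketched under assumptions rather than taken from the thread: build a single plan in which the same small DataFrame is broadcast-joined twice, print the physical plan, and look for a `ReusedExchange` node alongside a `BroadcastExchange`. The names `small`, `big1`, `big2`, `label` and `v` are hypothetical. Adaptive query execution is disabled here only so that the plan printed before execution is the one that would run. Whether reuse actually shows up depends on the Spark version and on the `spark.sql.exchange.reuse` setting.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastReuseCheck {
  // Builds one plan with two broadcast joins against the same small DataFrame
  // and returns the physical plan as text.
  def physicalPlanText(spark: SparkSession): String = {
    import spark.implicits._

    // Hypothetical data; real tables would come from your own sources.
    val small = Seq((1L, "a"), (2L, "b")).toDF("id", "label")
    val big1 = spark.range(0, 1000).selectExpr("id % 3 AS id", "id AS v")
    val big2 = spark.range(0, 1000).selectExpr("id % 5 AS id", "id AS v")

    val smallBc = broadcast(small)
    val joined1 = big1.join(smallBc, Seq("id"))
    val joined2 = big2.join(smallBc, Seq("id"))

    // A union keeps both joins inside a single physical plan.
    joined1.union(joined2).queryExecution.executedPlan.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-reuse-check")
      .master("local[*]") // assumption: local run for illustration
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    val plan = physicalPlanText(spark)
    println(plan)
    // Report whether the planner marked any exchange as reused.
    println(s"ReusedExchange present: ${plan.contains("ReusedExchange")}")
    spark.stop()
  }
}
```

If `ReusedExchange` is absent, the two broadcast subtrees were probably not recognised as producing the same result, for example because the small side is filtered or projected differently in each join.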

 

Thanks and best regards,

Tyson

Re: Broadcast join data reuse

Ankur Srivastava
Hi Tyson,

The broadcast variable should remain in the executors' memory and be reused unless you unpersist it, destroy it, or it goes out of scope.
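
The lifecycle described here is that of an explicit broadcast variable created with `SparkContext.broadcast`, which is not necessarily the mechanism the planner uses for a `BroadcastExchange`. A minimal sketch, assuming a hypothetical lookup map and a local master chosen only for illustration:

```scala
import org.apache.spark.sql.SparkSession

object BroadcastLifecycle {
  // Runs two actions against one broadcast variable and returns both counts.
  def run(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext

    // Hypothetical small lookup table, created on the driver.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    val ids = sc.parallelize(1 to 10)

    // First action: tasks read the broadcast value on the executors.
    val matched = ids.filter(i => lookup.value.contains(i)).count()
    // Second action: the same broadcast variable is referenced again.
    val unmatched = ids.filter(i => !lookup.value.contains(i)).count()

    // unpersist() drops the cached copies on executors; the value is re-sent if used again.
    lookup.unpersist()
    // destroy() releases all state; the variable cannot be used afterwards.
    lookup.destroy()

    (matched, unmatched)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-lifecycle")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```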

Hope this helps.

Thanks
Ankur

On Wed, Jun 10, 2020 at 5:28 PM <[hidden email]> wrote:

We have a case where data that is small enough to be broadcast is joined with multiple tables in a single plan. Looking at the physical plan, I do not see anything that indicates whether the broadcast is done only once, i.e., that the BroadcastExchange is reused and the data is not redistributed from scratch. Could someone with insight into the physical plan strategy for such a case confirm whether previously broadcast data is reused, or whether subsequent BroadcastExchange steps are done from scratch?

 

Thanks and best regards,

Tyson

Re: Broadcast join data reuse

gypsysunny
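The broadcast table doesn't seem to be reused across multiple actions, e.g.:
import org.apache.spark.sql.functions.broadcast
// Mark the small DataFrame with a broadcast hint
val small_df_bc = broadcast(small_df)
// Each write below is a separate action with its own query plan
big_df1.join(small_df_bc, Seq("id")).write.parquet("/test1")
big_df2.join(small_df_bc, Seq("id")).write.parquet("/test2")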

We can tell from the Spark web UI that the small DataFrame has been distributed twice.

So how can we make it happen only once?

Thanks a million.
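
The thread does not settle this question. One possible workaround, offered as a sketch rather than a confirmed answer, is to collect the small table to the driver once, wrap it in an explicit broadcast variable, and do the lookup in a UDF, so that both actions reference the same broadcast variable. This assumes the small table fits in driver memory, that `id` is a unique `Long` key with a single `String` payload column (the `label` column and the helpers `broadcastMap` and `lookupJoin` are hypothetical), and that inner-join semantics are wanted. It also bypasses the optimizer's join planning.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BroadcastLookup {
  // Collects the small table once and ships it as an explicit broadcast variable.
  def broadcastMap(spark: SparkSession, small: DataFrame): Broadcast[Map[Long, String]] = {
    import spark.implicits._
    val asMap = small.select("id", "label").as[(Long, String)].collect().toMap
    spark.sparkContext.broadcast(asMap)
  }

  // Emulates an inner join on `id` by looking up each key in the broadcast map.
  def lookupJoin(big: DataFrame, bc: Broadcast[Map[Long, String]]): DataFrame = {
    // Returns None for missing keys, which Spark stores as null.
    val lookup = udf((id: Long) => bc.value.get(id))
    big.withColumn("label", lookup(col("id"))).filter(col("label").isNotNull)
  }
}
```

With the names from the post above, usage would look roughly like this:

```scala
val bc = BroadcastLookup.broadcastMap(spark, small_df)
BroadcastLookup.lookupJoin(big_df1, bc).write.parquet("/test1")
BroadcastLookup.lookupJoin(big_df2, bc).write.parquet("/test2")
```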



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
