when to use broadcast variables

Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD?  

Like, say I'm joining web logs in an RDD with user account data.  I could keep the account data in an RDD or if it's "small", a broadcast variable instead.  How small is small?  Small enough that I know it can easily fit in memory on a single node?  Some other guideline?

Thanks!

Diana

Re: when to use broadcast variables

Prashant Sharma
I'd like to be corrected on this, but I'd say small means on the order of a few hundred MB. Keep in mind that the data gets shipped to every node, so it can be around a GB but not multiple GBs, and it also depends on the network.
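
One rough way to apply a size rule of thumb like this is to measure the serialized size of the candidate data before deciding. The sketch below is plain Python, not Spark code: `BROADCAST_LIMIT_BYTES`, `serialized_size`, and `small_enough_to_broadcast` are hypothetical names, and the 300 MB cutoff is only an assumption read off the "few hundred MB" figure, not a Spark setting.

```python
import pickle

# Hypothetical cutoff based on the "few hundred MB" rule of thumb;
# this is an assumption for illustration, not a Spark configuration value.
BROADCAST_LIMIT_BYTES = 300 * 1024 * 1024


def serialized_size(obj):
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def small_enough_to_broadcast(obj, limit_bytes=BROADCAST_LIMIT_BYTES):
    """Return True if the pickled object fits under the chosen limit."""
    return serialized_size(obj) <= limit_bytes


# Toy account table keyed by user id.
accounts = {user_id: {"name": "user%d" % user_id} for user_id in range(1000)}
print(serialized_size(accounts), small_enough_to_broadcast(accounts))
```

The pickled size is only an estimate: the deserialized objects usually take more memory on each node than their serialized form, so it is worth leaving some headroom.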

Prashant Sharma


On Fri, May 2, 2014 at 6:42 PM, Diana Carroll <[hidden email]> wrote:
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD?  

Like, say I'm joining web logs in an RDD with user account data.  I could keep the account data in an RDD or if it's "small", a broadcast variable instead.  How small is small?  Small enough that I know it can easily fit in memory on a single node?  Some other guideline?

Thanks!

Diana

Re: when to use broadcast variables

Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a
pretty good litmus test for whether or not to broadcast a smaller
dataset or turn it into an RDD.
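
For the web-log example from the original question, the broadcast approach amounts to a lookup against an in-memory table on each worker instead of a shuffle-based join. Here is a minimal sketch of that lookup in plain Python, with no Spark calls; `join_with_lookup` and the sample data are hypothetical. In PySpark the dict would typically be wrapped with `sc.broadcast(...)` and read inside the task through `.value`.

```python
def join_with_lookup(log_records, accounts_by_user):
    """Inner-join (user_id, url) log records against an in-memory dict.

    Returns (user_id, (url, account)) pairs, the same shape a pair-RDD
    join produces. Records with no matching account are dropped.
    """
    joined = []
    for user_id, url in log_records:
        account = accounts_by_user.get(user_id)
        if account is not None:
            joined.append((user_id, (url, account)))
    return joined


# Hypothetical sample data: user 9 has no account and is dropped.
logs = [(1, "/home"), (2, "/cart"), (1, "/checkout"), (9, "/home")]
accounts = {1: "alice", 2: "bob"}
print(join_with_lookup(logs, accounts))
```

If the account table does not fit in memory, this lookup is no longer an option and keeping the accounts in an RDD and joining is the fallback.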

On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma <[hidden email]> wrote:

> I'd like to be corrected on this, but I'd say small means on the order of
> a few hundred MB. Keep in mind that the data gets shipped to every node, so
> it can be around a GB but not multiple GBs, and it also depends on the
> network.