Is it common in spark to broadcast a 10 gb variable?

Is it common in spark to broadcast a 10 gb variable?

Aureliano Buendia
Hi,

I asked a similar question a while ago, didn't get any answers.

I'd like to share a 10 GB double array between 50 to 100 workers. The physical memory of each worker is over 40 GB, so the array fits in each worker's memory. The reason I'm sharing this array is that a cartesian operation is applied to it, and I want to avoid the network shuffle.

1. Is Spark broadcast built for pushing variables of GB size? Does it need special configuration (e.g. Akka settings) to work under this condition?

2. (Not directly related to Spark) Is there an upper limit for Scala/Java arrays other than physical memory? Do they stop working when the element count exceeds a certain number?
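[Editor's note: for concreteness, a quick back-of-the-envelope check of the numbers in the question — plain Python used purely for the arithmetic, no Spark required:]

```python
# Sanity check: is a 10 GB double array even representable as a single
# JVM array, and does one full copy fit in a worker's memory?

GiB = 2**30                     # byte math in binary gigabytes
array_bytes = 10 * GiB          # the 10 GB array from the question
bytes_per_double = 8            # a Java/Scala double is 64 bits = 8 bytes

elements = array_bytes // bytes_per_double
print(elements)                 # 1342177280 doubles

# JVM arrays are indexed by a signed 32-bit int, so the element count
# must stay below Integer.MAX_VALUE = 2**31 - 1.
print(elements < 2**31 - 1)     # True: a single double[] can hold it

# Each worker has > 40 GB of physical memory, so one full copy fits,
# provided the JVM heap is sized accordingly.
print(array_bytes < 40 * GiB)   # True
```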

Re: Is it common in spark to broadcast a 10 gb variable?

Guillaume Pitel
From my experience, it shouldn't be a problem since 0.8.1 (before that, the Akka frame size was the limit). I've broadcast arrays of up to 1.4 GB so far.

Keep in mind that the broadcast is also stored in spark.local.dir, so you must have room on that disk.
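[Editor's note: a minimal sketch of that setting, assuming a SparkConf/properties-style configuration; the directory path is only an example:]

```properties
# Point Spark's scratch space at a disk with enough free room to hold
# the serialized broadcast (path below is a hypothetical example).
spark.local.dir  /mnt/large-disk/spark-tmp
```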

Guillaume
--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05

Re: Is it common in spark to broadcast a 10 gb variable?

bmiller1
If you're using PySpark, beware that there are some known issues associated with large broadcast variables.


-Brad




Re: Is it common in spark to broadcast a 10 gb variable?

Josh Marcus
Aureliano,

Just to answer your second question (unrelated to Spark): arrays in Java and Scala can't have more elements than the maximum value of an Integer (Integer.MAX_VALUE = 2^31 - 1), which limits arrays to roughly 2.1 billion elements. (In practice most JVMs cap it slightly lower still, a few elements under Integer.MAX_VALUE.)
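[Editor's note: in byte terms, that element cap implies a hard ceiling on the size of a single double[] — a minimal check of the arithmetic:]

```python
# Maximum size in bytes of one Java/Scala double[]:
# the index is a signed 32-bit int, and a double is 8 bytes.
max_elements = 2**31 - 1        # Integer.MAX_VALUE
bytes_per_double = 8

max_bytes = max_elements * bytes_per_double
print(max_bytes)                # 17179869176
print(round(max_bytes / 2**30, 2))  # 16.0 -> just under 16 GiB
```

So a 10 GB double array is under the limit, but arrays much past ~16 GiB of doubles cannot exist as a single JVM array at all.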

--j




Re: Is it common in spark to broadcast a 10 gb variable?

Stephen Boesch
Hi Josh,
  So then 2^31 elements (~2.1 billion) * 2^3 bytes (the size of a double: 8 bytes, not 2^6) = 2^34 bytes, i.e. about 16 GiB, would be the maximum byte length of a double array?



Re: Is it common in spark to broadcast a 10 gb variable?

Aureliano Buendia
Is TorrentBroadcastFactory out of beta? Is it preferred over HttpBroadcastFactory for large broadcasts?

What are the benefits of HttpBroadcastFactory as the default factory?




Re: Is it common in spark to broadcast a 10 gb variable?

Matei Zaharia
You should try the Torrent broadcast for this one; it will be faster. It's still experimental, but I believe it works pretty well; it just needs more testing to become the default.
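[Editor's note: for reference, switching the broadcast implementation is a single property, shown here in properties style with the class name as it appeared in Spark 0.8/0.9 — verify against the version in use:]

```properties
# Use the BitTorrent-like broadcast instead of the default HTTP one.
spark.broadcast.factory  org.apache.spark.broadcast.TorrentBroadcastFactory
```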

Matei


Re: Is it common in spark to broadcast a 10 gb variable?

Ryan Compton
In 0.8 I had problems broadcasting variables of around that size; for more info see here:
https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3CCAMgYSQ9sivS0J9dHV9qGDZP9qXGFaDQKrD58B3yNbNHdgkPBmw@...%3E


Re: Is it common in spark to broadcast a 10 gb variable?

Aureliano Buendia
Thanks, Ryan. Was your problem solved in Spark 0.9?




Re: Is it common in spark to broadcast a 10 gb variable?

Ryan Compton
Have not upgraded yet...
