flatMap() returning large class

Don Drake
I'm looking for some advice when I have a flatMap on a Dataset that is creating and returning a sequence of a new case class (Seq[BigDataStructure]) that contains a very large amount of data, much larger than the single input record (think images).

In Python, you can use generators (yield) to avoid creating and returning a large list of structures.

I'm programming this in Scala and was wondering if there are any similar tricks to optimally return a list of classes. I found the for/yield semantics, but it appears the compiler just creates a sequence for you, and this will blow through my heap given the number of elements in the list and the size of each element.

Is there anything else I can use?
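
A minimal sketch of the generator-like behaviour asked about here, assuming plain Scala with no Spark dependency. for/yield over an Iterator (rather than a Seq) desugars to Iterator.map, which builds each element only when the consumer asks for it. The fields of BigDataStructure, the LazyYieldSketch object, expand, and the built counter are hypothetical names used only for illustration. Spark's Dataset.flatMap accepts a function returning a TraversableOnce (IterableOnce on Scala 2.13 builds), so an Iterator like this one could be returned from it; whether that is enough to relieve heap pressure depends on what the rest of the job does with the rows.

```scala
// Hypothetical stand-in for the large output class; the real fields are not shown in the thread.
final case class BigDataStructure(id: Int, payload: Array[Byte])

object LazyYieldSketch {
  // Counts how many elements have actually been constructed so far.
  var built = 0

  // for/yield over an Iterator desugars to Iterator.map, which is lazy:
  // each element is constructed only when the consumer calls next().
  def expand(n: Int, payloadBytes: Int): Iterator[BigDataStructure] =
    for (i <- Iterator.range(0, n)) yield {
      built += 1
      BigDataStructure(i, new Array[Byte](payloadBytes))
    }
}
```

Calling LazyYieldSketch.expand(1000, 1024) allocates nothing up front; each payload is created on the matching next() call and can be garbage-collected once the consumer drops it.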

Thanks.

-Don

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143

Re: flatMap() returning large class

Marcelo Vanzin
This sounds like something mapPartitions should be able to do, not
sure if there's an easier way.
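
A rough sketch of what the mapPartitions suggestion could look like, under stated assumptions. Dataset.mapPartitions takes a function of shape Iterator[T] => Iterator[U], so the function below is written in plain Scala with that shape and can be exercised without Spark. SourceRecord, BigTile (standing in for BigDataStructure), PartitionSketch, and the 1024-byte placeholder payload are all hypothetical.

```scala
// Hypothetical input and output types; the thread does not show the real ones.
final case class SourceRecord(id: Int, parts: Int)
final case class BigTile(recordId: Int, part: Int, payload: Array[Byte])

object PartitionSketch {
  // Same shape as the function Dataset.mapPartitions expects.
  // Iterator.flatMap and Iterator.map are lazy, so inside this function
  // only the element currently being consumed has been built.
  def expandPartition(records: Iterator[SourceRecord]): Iterator[BigTile] =
    records.flatMap { r =>
      Iterator.range(0, r.parts).map { p =>
        BigTile(r.id, p, new Array[Byte](1024)) // placeholder payload
      }
    }
}

// With Spark on the classpath this would be used roughly as follows (not run here),
// given ds: Dataset[SourceRecord] and an implicit Encoder[BigTile] in scope,
// for example via spark.implicits._ :
//   val out = ds.mapPartitions(it => PartitionSketch.expandPartition(it))
```

Keeping the function free of Spark types makes it easy to check on a small in-memory Iterator before running it on a cluster.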

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake <[hidden email]> wrote:

> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In Python, you can use generators (yield) to avoid creating and returning a
> large list of structures.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes. I found the for/yield
> semantics, but it appears the compiler just creates a sequence for you,
> and this will blow through my heap given the number of elements in the list
> and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don



--
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: flatMap() returning large class

Richard Garris
Hi Don,

Good to hear from you. I think the problem is that regardless of whether you use yield or a generator, Spark will internally produce the entire result as a single large JVM object, which will blow up your heap space.

Would it be possible to shrink the overall size of the image object by storing it as a vector or Array instead of a large Java class object?

That might be the more prudent approach.
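
As an illustration of the "Array instead of a large class object" idea, here is a minimal sketch assuming a single-channel (greyscale) image. CompactImage, CompactImageSketch, and fromGrey are hypothetical names, not part of any library. The point is only that one primitive Array[Byte] plus two Ints is a much flatter layout than a nested object graph with one boxed value per pixel.

```scala
// Hypothetical compact layout: one primitive array plus dimensions,
// instead of a nested structure holding one JVM object per pixel.
final case class CompactImage(width: Int, height: Int, pixels: Array[Byte]) {
  require(pixels.length == width * height, "pixels must hold width * height values")

  // Row-major lookup; & 0xff converts the signed byte back to 0..255.
  def pixel(x: Int, y: Int): Int = pixels(y * width + x) & 0xff
}

object CompactImageSketch {
  // Builds a compact image from a row-major sequence of 0..255 grey values.
  def fromGrey(width: Int, height: Int, grey: Seq[Int]): CompactImage =
    CompactImage(width, height, grey.map(_.toByte).toArray)
}
```

How much this saves depends on what the original class actually holds, which the thread does not show.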

-RG

Richard Garris

Principal Architect

Databricks, Inc

650.200.0840

[hidden email]


On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin ([hidden email]) wrote:

> This sounds like something mapPartitions should be able to do, not
> sure if there's an easier way.


Re: flatMap() returning large class

Don Drake
Hey Richard,

Good to hear from you as well. I thought I would ask whether there was something Scala-specific I was missing in handling these large classes.

I can tweak my job to use map() instead, so that only one large object is created and returned at a time, which should let me lower my executor memory size.

Thanks.

-Don


On Thu, Dec 14, 2017 at 2:58 PM, Richard Garris <[hidden email]> wrote:
> Hi Don,
>
> Good to hear from you. I think the problem is that regardless of whether you use yield or a generator, Spark will internally produce the entire result as a single large JVM object, which will blow up your heap space.
>
> Would it be possible to shrink the overall size of the image object by storing it as a vector or Array instead of a large Java class object?
>
> That might be the more prudent approach.
>
> -RG


Re: flatMap() returning large class

Richard Garris
Hi Don,

It’s not so much map() vs flatMap(). You can return a collection and have Spark flatten the result.

My point was more to change from Seq[BigDataStructure] to Seq[SmallDataStructure].

If the use case is really storing image data, I would try to use Seq[Vector] and store the values as a sparse array to reduce the overall size of the object.
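
To make the sparse-array idea concrete, here is a small sketch in plain Scala, assuming the pixel data really is mostly zeros (for dense images a sparse layout can end up larger than the dense one). SparsePixels and SparseSketch are hypothetical names; the size/indices/values layout mirrors the arguments of Spark ML's Vectors.sparse(size, indices, values), which is probably the kind of Vector meant here, though the message does not say so explicitly.

```scala
// Hypothetical sparse layout: keep only the non-zero values and their positions.
final case class SparsePixels(size: Int, indices: Array[Int], values: Array[Double])

object SparseSketch {
  // Dense -> sparse: record the index and value of every non-zero entry.
  def toSparse(dense: Array[Double]): SparsePixels = {
    val nonZero = dense.indices.filter(i => dense(i) != 0.0).toArray
    SparsePixels(dense.length, nonZero, nonZero.map(i => dense(i)))
  }

  // Sparse -> dense, used here to check that the round trip loses nothing.
  def toDense(s: SparsePixels): Array[Double] = {
    val out = new Array[Double](s.size)
    var k = 0
    while (k < s.indices.length) {
      out(s.indices(k)) = s.values(k)
      k += 1
    }
    out
  }
}
```

Each stored entry costs an Int plus a Double, so the saving only appears when the fraction of non-zero values is small.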

Secondly, Databricks released an open-source image-processing utility library in Deep Learning Pipelines, specifically for reading in images and loading them efficiently as arrays in DataFrames or Datasets.

You can potentially reuse this code.



On December 17, 2017 at 3:12:41 PM, Don Drake ([hidden email]) wrote:

> Hey Richard,
>
> Good to hear from you as well. I thought I would ask whether there was something Scala-specific I was missing in handling these large classes.
>
> I can tweak my job to use map() instead, so that only one large object is created and returned at a time, which should let me lower my executor memory size.
>
> Thanks.
>
> -Don

