[Spark 2.x Core] .collect() size limit


[Spark 2.x Core] .collect() size limit

klrmowse
I am currently trying to find a workaround for the Spark application I am
working on so that it does not have to use .collect()

but, for now, it is going to have to use .collect()

what is the size limit (memory for the driver) of an RDD that .collect()
can work with?

I've been scouring Google, S.O., blogs, etc., and everyone cautions about
.collect(), but no one specifies how huge is huge... are we talking about a
few gigabytes? terabytes?? petabytes???



thank you





Re: [Spark 2.x Core] .collect() size limit

Stephen Boesch
Do you have a machine with terabytes of RAM? AFAIK collect() requires RAM, so that would be your limiting factor.




Re: [Spark 2.x Core] .collect() size limit

Deepak Goel
There is such a thing as *virtual memory*.




Re: [Spark 2.x Core] .collect() size limit

Stephen Boesch
While it is certainly possible to use virtual memory, I have seen warnings in a number of places that collect() results must fit in memory. I'm not sure whether that applies to *all* Spark calculations, but at the very least each of the specific collect()s performed would need to be verified.

And maybe all collects do require sufficient memory; would you like to check the source code to see whether disk-backed collects actually happen in some cases?





Re: [Spark 2.x Core] .collect() size limit

Deepak Goel
I believe the virtualization of memory happens at the OS layer, hiding it completely from the application layer.





Re: [Spark 2.x Core] .collect() size limit

Mark Hamstra
In reply to this post by klrmowse




Re: [Spark 2.x Core] .collect() size limit

JayeshLalwani
In reply to this post by Deepak Goel

Although there is such a thing as virtualization of memory done at the OS layer, the JVM imposes its own limit, controlled by the spark.executor.memory and spark.driver.memory configurations. The amount of memory allocated by the JVM is governed by those parameters. General guidelines say that executor and driver memory should be kept at 80-85% of available RAM. So, if the general guidelines are followed, *virtual memory* is moot.
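
For context, a minimal sketch of where those settings live (the sizes mentioned in the comments are placeholders, not recommendations):

    import org.apache.spark.sql.SparkSession

    // spark.driver.memory caps the driver JVM heap, which is what every
    // .collect() result must fit into; spark.executor.memory does the same
    // for the executors. In practice these are usually set at launch time,
    // e.g. spark-submit --driver-memory 8g --executor-memory 8g ...
    // (setting spark.driver.memory programmatically has no effect once the
    // driver JVM is already running).
    val spark = SparkSession.builder()
      .appName("collect-size-demo")
      .getOrCreate()

    println("driver memory:   " + spark.conf.get("spark.driver.memory", "1g (default)"))
    println("executor memory: " + spark.conf.get("spark.executor.memory", "1g (default)"))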



Re: [Spark 2.x Core] .collect() size limit

Deepak Goel
Could you please help us and provide the source for the general guideline (80-85%)?

Even if there is a general guideline, it is probably there to keep the performance of Spark applications high (and to *distinguish* Spark from Hadoop). But if you are not too concerned about the *performance* hit of going from memory to disk, then you could use virtual memory to your advantage. In fact, I think the OS could do a pretty good job of data management by keeping only the necessary data in RAM while imposing no hard limit (it would be great to have benchmarks if anyone has run such a test before).

Also, we should *tread* carefully when applying general guidelines to problems. They might not be *relevant* at all.

Deepak
"Please stop cruelty to Animals, help by becoming a Vegan"

"Plant a Tree, Go Green"





Re: [Spark 2.x Core] .collect() size limit

Vadim Semenov-2
In reply to this post by klrmowse
`.collect` returns an Array, and arrays can't have more than Int.MaxValue elements; in most JVMs the practical limit is even lower: `Int.MaxValue - 8`.
So that puts an upper limit on it; however, you can just create an Array of Arrays, and so on, which is basically limitless, albeit with some gymnastics.
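
A minimal sketch of that "Array of Arrays" idea, assuming a spark-shell session with a SparkContext named sc (the element and partition counts are arbitrary):

    // glom() turns each partition into a single Array, so collect() returns
    // Array[Array[Int]] and no single array has to hold every element.
    // Each per-partition array is still bounded by the JVM array limit,
    // and the whole result still has to fit in driver memory.
    val nested: Array[Array[Int]] =
      sc.parallelize(1 to 1000000, numSlices = 100).glom().collect()

    println(nested.length)               // 100 partitions
    println(nested.map(_.length).sum)    // 1000000 elements in total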

Re: [Spark 2.x Core] .collect() size limit

Irving Duran
In reply to this post by klrmowse
I don't think there is a magic number, so I would say that it will depend on how big your dataset is and the size of your worker(s).

Thank You,

Irving Duran




Re: [EXT] [Spark 2.x Core] .collect() size limit

Michael Mansour
In reply to this post by klrmowse
Well, if you don't need to actually evaluate the information on the driver, but just need to trigger some sort of action, then you may want to consider using the `foreach` or `foreachPartition` method, which is an action and will execute your processing. It won't return anything to the driver and blow out its memory. For instance, I coalesce results to a smaller number of partitions and use foreachPartition to have the pipeline executed and the results written via a custom DB connector.

If you need to save the results of your job, then just write them to disk or a DB.
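
Roughly, that pattern looks like the sketch below (DbWriter is a stand-in for whatever connector you actually use, and the partition count is arbitrary):

    import org.apache.spark.rdd.RDD

    // Stand-in for a real DB connector; here it just prints each row.
    class DbWriter {
      def insert(key: String, value: Int): Unit = println(s"$key -> $value")
      def close(): Unit = ()
    }

    def writeOut(results: RDD[(String, Int)]): Unit =
      results
        .coalesce(16)                    // fewer, larger partitions => fewer connections
        .foreachPartition { partition =>
          val writer = new DbWriter()    // created on the executor, once per partition
          try partition.foreach { case (k, v) => writer.insert(k, v) }
          finally writer.close()
        }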

Please expand on what you're trying to achieve here.

--
Michael Mansour
Data Scientist
Symantec CASB


Re: [EXT] [Spark 2.x Core] .collect() size limit

klrmowse
okay, I may have found an alternative/workaround to using .collect() for what I
am trying to achieve...

initially, for the Spark application that I am working on, I would call
.collect() on two separate RDDs into a couple of ArrayLists (which was the
reason I was asking what the size limit on the driver is)

I need to map the 1st RDD to the 2nd RDD according to a computation/function,
resulting in key-value pairs;

it turns out I don't need to call .collect() if I instead use
.zipPartitions(), to which I can pass the function;

I am currently testing it out...
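
for reference, a rough sketch of that zipPartitions approach (the RDD types and the pairing logic are made up; the two RDDs must have the same number of partitions):

    import org.apache.spark.rdd.RDD

    // Pair up elements of the two RDDs partition by partition,
    // without pulling either one back to the driver.
    def pairUp(keys: RDD[String], values: RDD[Int]): RDD[(String, Int)] =
      keys.zipPartitions(values) { (keyIter, valueIter) =>
        keyIter.zip(valueIter)   // or any function combining the two iterators
      }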



thanks all for your responses


