Using Spark on Data size larger than Memory size

13 messages

Using Spark on Data size larger than Memory size

Vibhor Banga
Hi all,

I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table.

I want to know: in the case where the HBase table grows larger than the RAM available in the cluster, will the application fail, or will there just be an impact on performance?

Any thoughts in this direction are welcome and would be helpful.

Thanks,
-Vibhor

Re: Using Spark on Data size larger than Memory size

Vibhor Banga
Any inputs would be really helpful.

Thanks,
-Vibhor



Re: Using Spark on Data size larger than Memory size

Mayur Rustagi
Clearly there will be an impact on performance, but frankly it depends on what you are trying to achieve with the dataset.

Mayur Rustagi
Ph: +1 (760) 203 3257



Re: Using Spark on Data size larger than Memory size

Aaron Davidson
There is no fundamental issue if you're running on data that is larger than the cluster's memory size. Many operations can stream data through, so memory usage is independent of the input data size. Certain operations require an entire *partition* (not the whole dataset) to fit in memory, but there are not many instances of this left (sorting comes to mind, and it is being worked on).

In general, one problem with Spark today is that you can OOM under certain configurations, and you may need to change the default configuration if you're doing very memory-intensive jobs. However, there are very few cases where Spark would simply fail as a matter of course -- for instance, you can always increase the number of partitions to decrease the size of any given one, or repartition the data to eliminate skew.

Regarding the impact on performance, as Mayur said, there may absolutely be an impact depending on your jobs. If you're doing a join on a very large amount of data with few partitions, then we'll have to spill to disk. If you can't cache your working set of data in memory, you will also see a performance degradation. Spark enables the use of memory to make things fast, but if you just don't have enough memory, it won't be terribly fast.
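
A minimal sketch of the "more, smaller partitions" idea (the RDD, its key type, and the partition count 2000 are illustrative assumptions, not from this thread; the HBase-backed RDD is assumed to already exist, e.g. via newAPIHadoopRDD):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Sketch only: hbaseRows stands in for an RDD already built from the HBase table.
    def shrinkPartitions(hbaseRows: RDD[(String, String)]): RDD[(String, Iterable[String])] = {
      // More partitions means a smaller per-task working set, so less chance of an OOM.
      val finer = hbaseRows.repartition(2000)

      // Shuffle operations such as groupByKey also accept an explicit partition count.
      finer.groupByKey(2000)
    }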



Re: Using Spark on Data size larger than Memory size

Roger Hoover
Hi Aaron,

When you say that sorting is being worked on, can you elaborate a little more, please?

In particular, I want to sort the items within each partition (not globally) without necessarily bringing them all into memory at once.

Thanks,

Roger



Re: Using Spark on Data size larger than Memory size

Roger Hoover
I think it would be very handy to be able to specify that you want sorting during a partitioning stage.



Re: Using Spark on Data size larger than Memory size

Andrew Ash
Hi Roger,

You should be able to sort within partitions using the rdd.mapPartitions() method, and that shouldn't require holding all data in memory at once.  It does require holding the entire partition in memory though.  Do you need the partition to never be held in memory all at once?
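
A rough sketch of that approach (the element type and key ordering are assumed for illustration):

    import org.apache.spark.rdd.RDD

    // Sort each partition locally with mapPartitions. Note that toArray still
    // materializes one whole partition per task, as mentioned above.
    def sortWithinPartitions(rdd: RDD[(String, Long)]): RDD[(String, Long)] =
      rdd.mapPartitions(iter => iter.toArray.sortBy(_._1).iterator,
        preservesPartitioning = true)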

As for the work that Aaron mentioned, I think he might be referring to the discussion and code around https://issues.apache.org/jira/browse/SPARK-983

Cheers!
Andrew



Re: Using Spark on Data size larger than Memory size

Roger Hoover
Andrew, 

Thank you. I'm using mapPartitions(), but as you say, it requires that every partition fit in memory. This works for now but may not always, so I was wondering about another way.

Thanks,

Roger



Re: Using Spark on Data size larger than Memory size

Andrew Ash

If an individual partition becomes too large to fit in memory, then the usual approach is to repartition into more partitions so that each one is smaller. Hopefully it will then fit.
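
As an illustration (the function name and the factor of 4 are assumptions, not from the thread), one way to check how unevenly sized the partitions are and then split them further:

    import org.apache.spark.rdd.RDD

    def rebalance[T](rdd: RDD[T]): RDD[T] = {
      // Count elements per partition without materializing whole partitions;
      // note this runs an extra job just for the diagnostic.
      val sizes = rdd.mapPartitions(iter => Iterator(iter.size)).collect()
      println(s"partitions=${rdd.partitions.size} largest=${sizes.max} smallest=${sizes.min}")

      // If the largest partitions are the problem, spread the data over more of them.
      rdd.repartition(rdd.partitions.size * 4)
    }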


Re: Using Spark on Data size larger than Memory size

Vibhor Banga
In reply to this post by Aaron Davidson
Aaron, thank you for your response and for clarifying things.

-Vibhor



Re: Using Spark on Data size larger than Memory size

Allen Chang
In reply to this post by Aaron Davidson
Thanks for the clarification.

What is the proper way to configure RDDs when your aggregate data size exceeds the available working memory? In particular, in addition to typical operations, I'm performing cogroups, joins, and coalesces/shuffles.

I see that the default storage level for RDDs is MEMORY_ONLY. Do I just need to set the storage level for all of my RDDs to something like MEMORY_AND_DISK? Do I need to do anything else to get graceful behavior in the presence of coalesces/shuffles, cogroups, and joins?

Thanks,
Allen

Re: Using Spark on Data size larger than Memory size

Surendranauth Hiraman
My team has been using DISK_ONLY. The challenge with this approach is knowing when to unpersist if your job creates a lot of intermediate data. The "right solution" would be to mark a transient RDD as being capable of spilling to disk, rather than having to persist it to force this behavior. Hopefully that will be added at some point, now that Iterable is available in the PairRDDFunctions API.
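
A small sketch of that pattern (the RDD names, types, and output path are hypothetical; MEMORY_AND_DISK from the question above would slot in the same way):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Keep a large intermediate RDD on disk only, then release it once consumed.
    def joinViaDisk(left: RDD[(String, Long)], right: RDD[(String, String)]): Unit = {
      val intermediate = left.persist(StorageLevel.DISK_ONLY)
      intermediate.join(right).saveAsTextFile("hdfs:///tmp/join-output")
      // Explicit unpersist matters when a job creates a lot of intermediate data.
      intermediate.unpersist(blocking = true)
    }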

The other thing that was important for us was setting the executor memory to the right level because it seems some intermediate buffers can be large.

We are currently using 16 GB for spark.executor.memory and 18 GB for SPARK_WORKER_MEMORY. Parallelism (spark.default.parallelism) also seems to have an impact, though we are still working on tuning that.
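
For reference, the application-side settings above could be expressed roughly like this (the app name and the parallelism value are placeholders; SPARK_WORKER_MEMORY is set in spark-env.sh on the workers, not in the application):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("larger-than-memory-job")
      .set("spark.executor.memory", "16g")        // value quoted above
      .set("spark.default.parallelism", "256")    // placeholder; tune per cluster
    val sc = new SparkContext(conf)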

We are using 16 executors/workers.

Our test input size is about 10 GB but we generate up to a total of 500GB of intermediate and final data.

Right now, we have gotten past our memory issues and we are now facing a communication timeout issue in some long-tail tasks, so that's something to watch out for.

If you come up with anything else, please let us know. :-)

-Suren

                                                            
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: [hidden email]elos.io
W: www.velos.io




Re: Using Spark on Data size larger than Memory size

Allen Chang
Thanks. We've run into timeout issues at scale as well. We were able to work around them by setting the following JVM options:

-Dspark.akka.askTimeout=300
-Dspark.akka.timeout=300
-Dspark.worker.timeout=300

NOTE: these JVM options *must* be set on the worker nodes (and not just the driver/master) for the settings to take effect.

Allen