ALS memory limits

Debasish Das
Hi,

For our use cases we are looking at 20M x 1M matrices, which is in a similar range to what is outlined in the paper referenced here:

http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html

Does the exponential runtime growth in Spark ALS outlined by the blog still exist in recommendation.ALS?

I am running a Spark cluster of 10 nodes with around 1 TB of total memory and 80 cores...

With rank = 50, the memory requirement for ALS should be 20M x 50 doubles on every worker, which is around 8 GB...

Even if both factor matrices are cached in memory I should be bounded by ~9 GB, but even with 32 GB per worker I see GC errors...
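
Here is the back-of-the-envelope arithmetic I am using (raw factor storage only; it ignores JVM object overhead and whatever intermediate buffers ALS allocates, which is presumably where the rest of the memory goes):

object FactorFootprint {
  def main(args: Array[String]): Unit = {
    val rank           = 50
    val numUsers       = 20000000L   // 20M users
    val numItems       = 1000000L    // 1M items
    val bytesPerDouble = 8L
    val gb             = 1024.0 * 1024 * 1024

    // Raw size of the two factor matrices at rank 50.
    val userFactorsGb = numUsers * rank * bytesPerDouble / gb   // ~7.5 GB
    val itemFactorsGb = numItems * rank * bytesPerDouble / gb   // ~0.4 GB
    println(f"user factors ~ $userFactorsGb%.1f GB, item factors ~ $itemFactorsGb%.1f GB")
  }
}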

I am debugging the scalability and memory requirements of the algorithm further, but any insights would be very helpful...

Also there are two other issues:

1. If GC errors are hit, that worker JVM goes down and I have to restart it manually. Is this expected?

2. When I try to make use of all 80 cores on the cluster, I get java.io.FileNotFoundException errors on /tmp/. Is there some OS limit on how many cores can simultaneously access /tmp from a process?
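
If the /tmp failures turn out to be Spark's shuffle and spill files filling up the default scratch directory (just a guess on my part), I could point spark.local.dir at a bigger local disk, something like:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical workaround: keep Spark's scratch/shuffle files off /tmp.
// /data/spark-local is only an example path on a larger local disk.
val conf = new SparkConf()
  .setAppName("ALS")
  .set("spark.local.dir", "/data/spark-local")
val sc = new SparkContext(conf)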

Thanks.
Deb

On Sun, Mar 16, 2014 at 2:20 PM, Sean Owen <[hidden email]> wrote:
Good point -- there's been another optimization for ALS in HEAD (https://github.com/apache/spark/pull/131), but yes, the better place to pick up just the essential changes since 0.9.0, including the previous one, is the 0.9 branch.

--
Sean Owen | Director, Data Science | London


On Sun, Mar 16, 2014 at 2:18 PM, Patrick Wendell <[hidden email]> wrote:
Sean - was this merged into the 0.9 branch as well? (It seems so based on the message from rxin.) If so, it might make sense to try out the head of branch-0.9 as well, unless there are *also* other changes relevant to this in master.

- Patrick

On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen <[hidden email]> wrote:
> You should simply use a snapshot built from HEAD of github.com/apache/spark
> if you can. The key change is in MLlib and with any luck you can just
> replace that bit. See the PR I referenced.
>
> Sure with enough memory you can get it to run even with the memory issue,
> but it could be hundreds of GB at your scale. Not sure I take the point
> about the JVM; you can give it 64GB of heap and executors can use that much,
> sure.
>
> You could reduce the number of features a lot to work around it too, or
> reduce the input size. (If anyone saw my blog post about StackOverflow and
> ALS -- that's why I snuck in a relatively paltry 40 features and pruned
> questions with <4 tags :) )
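>
> For instance, a rough sketch against the MLlib API (the rank and iteration
> values here are only placeholders):
>
> import org.apache.spark.mllib.recommendation.{ALS, Rating}
> import org.apache.spark.rdd.RDD
>
> // Dropping the rank from 50 to 40 shrinks the factor matrices, and the
> // per-block buffers ALS builds, roughly in proportion.
> def trainSmaller(ratings: RDD[Rating]) =
>   ALS.train(ratings, 40, 10, 0.01)  // rank, iterations, lambda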
>
> I don't think jblas has anything to do with it per se, and the allocation
> fails in Java code, not native code. This should be exactly what that PR I
> mentioned fixes.
>
> --
> Sean Owen | Director, Data Science | London
>
>
> On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das <[hidden email]>
> wrote:
>>
>> Thanks Sean... let me get the latest code. Do you know which PR it was?
>>
>> But will the executors run fine with, say, 32 GB or 64 GB of memory? Doesn't
>> the JVM show issues when the max memory goes beyond a certain limit...
>>
>> Also the failure is due to GC limits from jblas... and I was thinking that
>> jblas is going to call native malloc, right? Maybe 64 GB is not a big deal
>> then... I will try increasing to 32 and then 64...
>>
>> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit
>> exceeded)
>>
>> org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
>> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
>> org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> scala.Array$.fill(Array.scala:267)
>> com.verizon.bigdata.mllib.recommendation.ALSQR.updateBlock(ALSQR.scala:366)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:346)
>> com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:345)
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:32)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> org.apache.spark.scheduler.Task.run(Task.scala:53)
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>>
>>
>>
>> On Sun, Mar 16, 2014 at 11:42 AM, Sean Owen <[hidden email]> wrote:
>>>
>>> Are you using HEAD or 0.9.0? I know a memory issue was fixed a few
>>> weeks ago that made ALS need a lot more memory than it should.
>>>
>>> https://github.com/apache/incubator-spark/pull/629
>>>
>>> Try the latest code.
>>>
>>> --
>>> Sean Owen | Director, Data Science | London
>>>
>>>
>>> On Sun, Mar 16, 2014 at 11:40 AM, Debasish Das <[hidden email]>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I gave my Spark job 16 GB of memory and it is running on 8 executors.
>>>>
>>>> The job needs more memory due to ALS requirements (20M x 1M matrix)
>>>>
>>>> On each node I have 96 GB of memory and I am using 16 GB of it. I want to
>>>> increase the memory, but I am not sure what the right way to do that is...
>>>>
>>>> On 8 executors, if I give 96 GB it might be an issue due to GC...
>>>>
>>>> Ideally, on 8 nodes I would run with 48 executors and each executor would
>>>> get 16 GB of memory... 48 JVMs in total...
>>>>
>>>> Is it possible to increase the number of executors per node?
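
>>>> Roughly what I had in mind (the values are just my current trial numbers;
>>>> SPARK_WORKER_INSTANCES is what the standalone docs seem to suggest for
>>>> running several workers per node):
>>>>
>>>> import org.apache.spark.{SparkConf, SparkContext}
>>>>
>>>> // With SPARK_WORKER_INSTANCES=6 in conf/spark-env.sh on each of the 8
>>>> // nodes, the app should end up with 48 executors, each with this heap.
>>>> val conf = new SparkConf()
>>>>   .setAppName("ALS-20Mx1M")
>>>>   .set("spark.executor.memory", "16g")
>>>> val sc = new SparkContext(conf)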
>>>>
>>>> Thanks.
>>>> Deb
>>>
>>>
>>
>