StackOverflow Error when run ALS with 100 iterations


Xiaoli Li
Hi,

I am testing ALS on 7 nodes, each with 4 cores and 8 GB of memory. The ALS program fails with a StackOverflowError even on a very small training set (about 91 lines) when I set the number of iterations to 100. I think the problem may be caused by the updateFeatures method, which updates the products RDD iteratively by joining it with the previous products RDD.

I am writing a program whose update process is similar to that of ALS. The same problem appears when I iterate too many times (more than 80).

The iterative part of my code is as follows:

solution = outlinks.join(solution).map {
    .......
}


Has anyone had a similar problem? Thanks.


Xiaoli
Re: StackOverflow Error when run ALS with 100 iterations

Cheng Lian

This JIRA issue probably solves your problem. When running with a large number of iterations, the lineage DAG of ALS becomes very deep, and both the DAGScheduler and the Java serializer may overflow the stack because they are implemented recursively. You may resort to checkpointing as a workaround.
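The checkpointing workaround could look roughly like the sketch below for an iterative join such as the one in the original post. This is only an illustration under assumptions: the `step` function, the ring-shaped `outlinks` data, the interval of 10, and the temporary checkpoint directory are all hypothetical stand-ins, and Spark is assumed to be on the classpath. On a real cluster the checkpoint directory would normally be on a shared file system such as HDFS.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CheckpointedIteration {

  // Hypothetical update step; it stands in for the elided map body in the
  // original post. Each node passes its current value along its outgoing link.
  def step(outlinks: RDD[(Int, Int)], solution: RDD[(Int, Double)]): RDD[(Int, Double)] =
    outlinks.join(solution)
      .map { case (_, (dst, value)) => (dst, value) }
      .reduceByKey(_ + _)

  // Runs `iterations` update steps and truncates the lineage every
  // `checkpointInterval` steps so that the DAG never grows very deep.
  def iterate(outlinks: RDD[(Int, Int)],
              initial: RDD[(Int, Double)],
              iterations: Int,
              checkpointInterval: Int): RDD[(Int, Double)] = {
    var solution = initial
    for (i <- 1 to iterations) {
      solution = step(outlinks, solution)
      if (i % checkpointInterval == 0) {
        solution.cache()      // avoid recomputing the RDD when the checkpoint is written
        solution.checkpoint() // only marks the RDD; must be called before any action on it
        solution.count()      // action that materializes the RDD and writes the checkpoint
      }
    }
    solution
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpointed-iteration").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temporary directory is good enough for this demo.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)

      // Hypothetical data: four nodes linked in a ring, each starting at 1.0.
      val outlinks = sc.parallelize(Seq(0 -> 1, 1 -> 2, 2 -> 3, 3 -> 0)).cache()
      val initial = sc.parallelize((0 to 3).map(i => (i, 1.0)))

      val result = iterate(outlinks, initial, iterations = 100, checkpointInterval = 10)
      result.collect().sortBy(_._1).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The interval is a trade-off: checkpointing more often keeps the lineage shorter but writes to storage more often.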



(In reply to Xiaoli Li's post of Wed, Apr 16, 2014 at 5:29 AM.)

Re: StackOverflow Error when run ALS with 100 iterations

Xiaoli Li
Thanks a lot for your information. It really helps me.


(In reply to Cheng Lian's post of Tue, Apr 15, 2014 at 7:57 PM.)


Re: StackOverflow Error when run ALS with 100 iterations

MLnick
I'd also say that running for 100 iterations is a waste of resources, as ALS typically converges quickly, usually within 10-20 iterations.


(In reply to Xiaoli Li's post of Wed, Apr 16, 2014 at 3:54 AM.)



Re: StackOverflow Error when run ALS with 100 iterations

amghost
In reply to this post by Cheng Lian
Hi, could you please explain how to checkpoint the training set RDD, since everything is done inside the ALS.train method?
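One possible approach, sketched here under assumptions rather than confirmed anywhere in this thread: later Spark releases expose a `setCheckpointInterval` setter on the MLlib `ALS` builder, which lets ALS checkpoint its own intermediate RDDs once a checkpoint directory has been set on the SparkContext. The ratings, rank, lambda, and directory below are hypothetical values chosen only for illustration, and Spark with MLlib is assumed to be on the classpath. Versions without that setter would need a different workaround.

```scala
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

object AlsWithCheckpointing {

  // Uses the builder API instead of ALS.train so that the checkpoint
  // interval can be configured. Rank and lambda are hypothetical values.
  def train(ratings: RDD[Rating], iterations: Int): MatrixFactorizationModel =
    new ALS()
      .setRank(5)
      .setIterations(iterations)
      .setLambda(0.01)
      .setCheckpointInterval(10) // has an effect only if a checkpoint directory is set
      .run(ratings)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("als-checkpointing").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temporary directory; use a shared path on a cluster.
      sc.setCheckpointDir(Files.createTempDirectory("als-checkpoint").toString)

      // Hypothetical toy ratings: Rating(user, product, rating).
      val ratings = sc.parallelize(Seq(
        Rating(1, 10, 5.0), Rating(1, 20, 1.0),
        Rating(2, 10, 4.0), Rating(2, 30, 2.0),
        Rating(3, 20, 5.0), Rating(3, 30, 4.0)
      ))

      val model = train(ratings, iterations = 100)
      println(model.predict(1, 10))
    } finally {
      sc.stop()
    }
  }
}
```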
Re: StackOverflow Error when run ALS with 100 iterations

LeoB
I wanted to add a comment to the JIRA ticket, but I don't think I have permission to do so, so I am answering here instead. I am encountering the same issue with a StackOverflowError.
I would like to point out that there is a localCheckpoint method <https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-checkpointing.html> which does not require HDFS to be installed. We could use it instead of checkpoint to cut down the lineage.
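A minimal sketch of that idea, assuming a Spark version that provides `RDD.localCheckpoint()` and Spark on the classpath. The `addRepeatedly` helper, the input values, and the interval of 20 are hypothetical and only serve to build a long chain of transformations. Note that no checkpoint directory is set here. The trade-off is that a local checkpoint lives on the executors rather than in reliable storage, so it gives up some fault tolerance in exchange for speed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LocalCheckpointSketch {

  // Applies `iterations` chained map steps and truncates the lineage every
  // `interval` steps with a local checkpoint.
  def addRepeatedly(start: RDD[Long], iterations: Int, interval: Int): RDD[Long] = {
    var rdd = start
    for (i <- 1 to iterations) {
      rdd = rdd.map(_ + 1L)
      if (i % interval == 0) {
        rdd.localCheckpoint() // marks the RDD; stored on executors, no HDFS required
        rdd.count()           // action that materializes the local checkpoint
      }
    }
    rdd
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("local-checkpoint").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val start = sc.parallelize(Seq(0L, 10L, 20L))
      val result = addRepeatedly(start, iterations = 200, interval = 20)
      println(result.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```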



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
