

Hi,
I'm implementing a recommender based on the algorithm described in
http://www2.research.att.com/~yifanhu/PUB/cf.pdf. This algorithm forms the
basis for Spark's ALS implementation for data sets with implicit features.
The data set I'm working with is proprietary and I cannot share it.
However, I can say that it's based on the same kind of data as in the
paper: relative viewing time of videos. (Specifically, the "rating" for
each video is defined as total viewing time across all visitors divided by
video duration).
I'm seeing counterintuitive, sometimes nonsensical recommendations. For
comparison, I've run the training data through Oryx's in-VM implementation
of implicit ALS with the same parameters. Oryx uses the same algorithm.
(Source in this file:
https://github.com/cloudera/oryx/blob/master/alscommon/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
The recommendations made by the two systems are very
different, more so than I think could be explained by differences in
initial state. The recommendations made by the Oryx models look much
better, especially as I increase the number of latent factors and the
iterations. The Spark models' recommendations don't improve with increases
in either latent factors or iterations. Sometimes, they get worse.
Because of the (understandably) highly optimized and terse style of
Spark's ALS implementation, I've had a very hard time following it well
enough to debug the issue definitively. However, I have found a section of
code that looks incorrect. As described in the paper, part of the implicit
ALS algorithm involves computing a matrix product YtCuY (equation 4 in the
paper). To optimize this computation, this expression is rewritten as YtY
+ Yt(Cu - I)Y. I believe that's what should be happening here:
https://github.com/apache/incubatorspark/blob/v0.9.0incubating/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L376
However, it looks like this code is in fact computing YtY + YtY(Cu - I),
which is the same as YtYCu. If so, that's a bug. Can someone familiar with
this code evaluate my claim?
Cheers,
Michael


Hi Michael,
I can help check the current implementation. Would you please go to
https://sparkproject.atlassian.net/browse/SPARK and create a ticket
about this issue with component "MLlib"? Thanks!
Best,
Xiangrui
On Tue, Mar 11, 2014 at 3:18 PM, Michael Allman < [hidden email]> wrote:
> [...]


On Tue, Mar 11, 2014 at 10:18 PM, Michael Allman < [hidden email]> wrote:
> I'm seeing counterintuitive, sometimes nonsensical recommendations. For
> comparison, I've run the training data through Oryx's in-VM implementation
> of implicit ALS with the same parameters. Oryx uses the same algorithm.
> (Source in this file:
> https://github.com/cloudera/oryx/blob/master/alscommon/src/main/java/com/cloudera/oryx/als/common/factorizer/als/AlternatingLeastSquares.java)
On this note, I should say that Oryx varies from that paper in a
couple of small ways. In particular, the regularization parameter that
is used in the end is not just lambda, but lambda * alpha. (There are
decent reasons for this.)
So the difference with the "same" parameters could be down to this.
What param values are you using? It might be the difference.
(There is another difference in handling of negative values, but that
is probably irrelevant to you? It is in Spark now too though. It was
not in 0.9.0 but is in HEAD.)
> However, it looks like this code is in fact computing YtY + YtY(Cu - I),
> which is the same as YtYCu. If so, that's a bug. Can someone familiar with
> this code evaluate my claim?
I too can't be 100% certain I'm not missing something, but from a look
at that line, I don't think it is computing YtY(Cu - I). It is indeed
trying to accumulate the value Yt(Cu - I)Y by building it up from
pieces, from rows of Y. For one row of Y that piece is, excusing my
notation, Y(i)t (Cu(i) - 1) Y(i). The middle term is just a scalar so
it's fine to multiply it at the end as you see in that line.
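To make the row-by-row argument concrete, here is a small check in plain Python. It is only a sketch with made-up numbers, not the Spark code: the matrix Y, the confidence values c and all helper names are assumptions for illustration. It accumulates (c_i - 1) * y_i y_i^T over the rows of Y and compares the result with Yt(Cu - I)Y computed directly.

```python
# Illustrative check only; Y, c and the helpers are hypothetical, not Spark's code.

def outer(v, w):
    """Outer product v w^T as a list of rows."""
    return [[a * b for b in w] for a in v]

def add(A, B):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    """Multiply every entry of A by the scalar s."""
    return [[s * a for a in row] for row in A]

def matmul(A, B):
    """Plain matrix product A B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Three items, two latent factors; each row of Y is one item vector.
Y = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
# Confidence values c_i for one user (1.0 means "no observation").
c = [3.0, 1.0, 5.0]

# Row-by-row accumulation: sum_i (c_i - 1) * y_i y_i^T.
acc = [[0.0, 0.0], [0.0, 0.0]]
for y_i, c_i in zip(Y, c):
    acc = add(acc, scale(outer(y_i, y_i), c_i - 1.0))

# Direct evaluation of Y^T (C - I) Y with C = diag(c).
n = len(c)
C_minus_I = [[(c[i] - 1.0) if i == j else 0.0 for j in range(n)] for i in range(n)]
direct = matmul(matmul(transpose(Y), C_minus_I), Y)

print(acc)
print(direct)
```

Both printouts show the same 2x2 matrix, and rows with c_i = 1 contribute nothing, which is why only the observed entries need to be visited.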
You may wish to follow HEAD, which is a bit different:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L390
The computation is actually the same as before (for positive input),
expressed a little differently.
Happy to help on this given that I know this code a little and the
code you are comparing it to a lot.


Line 376 should be correct as it is computing \sum_i (c_i - 1) x_i
x_i^T = \sum_i (alpha * r_i) x_i x_i^T. Are you computing some
metrics to tell which recommendation is better?
Xiangrui
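For readers following along, the identity being discussed can be written out in full. This is a sketch in the paper's notation, assuming Y is the item-factor matrix with rows y_i^T, C^u is the diagonal confidence matrix for user u, and c_{ui} = 1 + \alpha r_{ui} as defined in the paper; the x_i in the message above plays the role of y_i here.

```latex
\[
Y^{T} C^{u} Y
  \;=\; Y^{T} Y + Y^{T}\,(C^{u} - I)\,Y
  \;=\; Y^{T} Y + \sum_{i} (c_{ui} - 1)\, y_i y_i^{T}
  \;=\; Y^{T} Y + \sum_{i} \alpha\, r_{ui}\, y_i y_i^{T},
\qquad c_{ui} = 1 + \alpha\, r_{ui}.
\]
```

Because c_{ui} - 1 = 0 whenever r_{ui} = 0, the sum only runs over the items the user actually interacted with.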
On Tue, Mar 11, 2014 at 6:38 PM, Xiangrui Meng < [hidden email]> wrote:
> [...]


It would be helpful to know what parameter inputs you are using.
If the regularization schemes are different (by a factor of alpha, which can often be quite high) this will mean that the same parameter settings could give very different results. A higher lambda would be required with Spark's version to be comparable.
When I submitted the PR for this, I verified (on ml-100k, ml-1m and ml-10m data) that this version gives the same RMSE as Mahout's implicit model, as well as a separate Spark version that I wrote that was a from-scratch port of the Mahout algorithm (though I didn't compare vs Myrrix/Oryx). I'm fairly confident things are correct but if there is a bug let's definitely find and fix it!
@Sean, would it be a good idea to look at changing the regularization in Spark's ALS to alpha * lambda? What is the thinking behind this? If I recall, the Mahout version added something like (# ratings * lambda) as regularization in each factor update (for explicit), but implicit it was just lambda (I may be wrong here).


The Mahout implementation is just a straightforward port of the paper.
No changes have been made.
On 03/12/2014 08:36 AM, Nick Pentreath wrote:
> [...]


On Wed, Mar 12, 2014 at 7:36 AM, Nick Pentreath
< [hidden email]> wrote:
> @Sean, would it be a good idea to look at changing the regularization in
> Spark's ALS to alpha * lambda? What is the thinking behind this? If I
> recall, the Mahout version added something like (# ratings * lambda) as
> regularization in each factor update (for explicit), but implicit it was
> just lambda (I may be wrong here).
I also used a different default alpha than the one suggested in the
paper: 1, instead of 40. But so does MLlib. And if alpha = 1, the
variation I mention here has no effect.
The idea was that alpha "is supposed to" control how much more weight
a known user-item value gets in the factorization. The weight is "1 +
alpha*r" for nonzero r, and of course "1" otherwise, and alpha can
make the difference larger.
But large alpha has the side effect of making the regularization terms
relatively smaller in the cost function. This dual effect seemed
undesirable. So: multiply the regularization term by alpha too to
disconnect these effects.
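A tiny numerical sketch of this point, in plain Python. Everything here is an assumption for illustration (rank 2, made-up item factors and ratings, a hypothetical `solve_user` helper); it is neither the Spark nor the Oryx code. It solves the per-user system (Yt Cu Y + reg * I) x = Yt Cu p from the paper, once with reg = lambda and once with reg = lambda * alpha.

```python
# Sketch only: rank-2 toy problem with invented numbers.

def solve_user(Y, r, alpha, reg):
    """Solve (Y^T C Y + reg * I) x = Y^T C p for one user, rank 2.

    c_i = 1 + alpha * r_i and p_i = 1 if r_i > 0 else 0, as in the paper.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (y1, y2), r_i in zip(Y, r):
        c_i = 1.0 + alpha * r_i
        p_i = 1.0 if r_i > 0 else 0.0
        a11 += c_i * y1 * y1
        a12 += c_i * y1 * y2
        a22 += c_i * y2 * y2
        b1 += c_i * p_i * y1
        b2 += c_i * p_i * y2
    a11 += reg
    a22 += reg
    det = a11 * a22 - a12 * a12
    # Cramer's rule for the symmetric 2x2 system.
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)

Y = [(1.0, 0.2), (0.3, 0.9), (0.5, 0.5)]   # item factors (hypothetical)
r = [2.0, 0.0, 0.5]                        # one user's implicit "ratings"
lam, alpha = 0.1, 40.0

x_plain = solve_user(Y, r, alpha, reg=lam)           # regularize with lambda
x_scaled = solve_user(Y, r, alpha, reg=lam * alpha)  # regularize with lambda * alpha

print(x_plain)
print(x_scaled)
```

With alpha = 40 the two solutions differ and the lambda * alpha variant is shrunk more strongly towards zero; with alpha = 1 the two calls are identical. To compare the two schemes on equal footing, the plain scheme would need its lambda multiplied by alpha.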
Other ALS papers like
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
again use a different definition of lambda by stuffing something else
into it. So the absolute value of lambda is already different in
different contexts.
So depending on Michael's settings this could be a red herring but
worth checking. The only other variation was in choosing the random
initial state but that too is the same now in both implementations (at
least in HEAD). The initial state really shouldn't matter so much. I
can't think of other variations.
Michael, what was your eval metric?


Thank you everyone for your feedback. It's been very helpful, and though I still haven't found the cause of the difference between Spark and Oryx, I feel I'm making progress.
Xiangrui asked me to create a ticket for this issue. I didn't do this originally because it's not yet clear to me whether this is a bug or a mistake on my part. I'd like to see where this conversation goes and then file a more clear-cut issue if applicable.
Sean pointed out that Oryx differs in its use of the regularization parameter lambda. I'm aware of this and have been compensating for this difference from the start. Also, the handling of negative values is indeed irrelevant as I have none in my data.
After reviewing Sean's analysis and running some calculations in the console, I agree that the Spark code does compute YtCuY correctly.
Regarding testing, I'm computing EPR on a test set as outlined in the paper. I'm training on three weeks of data and testing on the following week. I recently updated my data sets and rebuilt and tested the new models. The results were inconclusive in that both models scored about the same.
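For anyone who wants to reproduce this kind of evaluation, here is a minimal sketch of the expected percentile ranking from the paper: the test-set ratings are used as weights on each item's percentile position in the user's ranked list, where 0 is the top of the list and 1 the bottom, so lower is better. The function name, the dictionary layout and the handling of ties (whatever order `sorted` yields) are assumptions for illustration, and every test item is assumed to appear among the scored items.

```python
def expected_percentile_rank(scores, test_ratings):
    """Weighted mean percentile rank: sum(r_ui * rank_ui) / sum(r_ui).

    scores:       {user: {item: predicted preference}} over all candidate items
    test_ratings: {user: {item: observed r_ui in the test period}}
    rank_ui is 0.0 for the user's top recommendation and 1.0 for the last one.
    """
    num = 0.0
    den = 0.0
    for user, observed in test_ratings.items():
        user_scores = scores[user]
        # Items ordered from most to least recommended for this user.
        ranked = sorted(user_scores, key=user_scores.get, reverse=True)
        n = len(ranked)
        pct = {item: (pos / (n - 1) if n > 1 else 0.0)
               for pos, item in enumerate(ranked)}
        for item, r_ui in observed.items():
            num += r_ui * pct[item]
            den += r_ui
    return num / den


scores = {"u1": {"a": 0.9, "b": 0.5, "c": 0.1}}
test_ratings = {"u1": {"a": 2.0, "c": 1.0}}
epr = expected_percentile_rank(scores, test_ratings)
print(epr)
```

In this toy case item "a" sits at percentile 0 and item "c" at percentile 1, so the result is (2.0 * 0 + 1.0 * 1) / 3.0, or one third.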
I'm continuing to investigate the source of the wide difference in recommendations between implementations. I will reply with my findings when I have something more definitive.
Cheers and thanks again.


Ah, thank you, I had actually forgotten about this and this is indeed
probably a difference. This is from the other paper I cited:
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
It's the "WR" in "ALS-WR": weighted regularization. I suppose the
intuition is that you penalize complex explanations of prolific users
and items proportionally more.
The paper claims it helps and I also found it did. That could be the difference.
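For reference, the weighted-regularization objective can be sketched as follows. This is written from memory of that paper's explicit-feedback formulation, so treat the exact notation as an assumption: n_{u_i} is the number of ratings by user i, n_{m_j} the number of ratings of item j, and the first sum runs over the observed ratings.

```latex
\[
f(U, M) \;=\; \sum_{(i,j)\ \mathrm{observed}} \bigl(r_{ij} - u_i^{T} m_j\bigr)^{2}
  \;+\; \lambda \Bigl( \sum_{i} n_{u_i}\,\lVert u_i \rVert^{2}
  \;+\; \sum_{j} n_{m_j}\,\lVert m_j \rVert^{2} \Bigr)
\]
```

Compared with a plain \lambda(\lVert u_i \rVert^2 + \lVert m_j \rVert^2) penalty, each factor vector is penalized in proportion to how many ratings it has to explain.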

Sean Owen | Director, Data Science | London
On Thu, Mar 13, 2014 at 2:30 AM, Michael Allman < [hidden email]> wrote:


I've been thoroughly investigating this issue over the past couple of days and have discovered quite a bit. For one thing, there is definitely (at least) one issue/bug in the Spark implementation that leads to incorrect results for models generated with rank > 1 or a large number of iterations. I will post a bug report with a thorough explanation this weekend or on Monday.
I believe I've been able to track down every difference between the Spark and Oryx implementations that leads to different results. I made some adjustments to the Spark implementation so that, given the same initial product/item vectors, the resulting model is identical to the one produced by Oryx within a small numerical tolerance. I've verified this for small data sets and am working on verifying this with some large data sets.
Aside from those already identified in this thread, another significant difference in the Spark implementation is that it begins the factorization process by computing the product matrix (Y) from the initial user matrix (X). Both of the papers on ALS referred to in this thread begin the process by computing the user matrix. I haven't done any testing comparing the models generated starting from Y or X, but they are very different. Is there a reason Spark begins the iteration by computing Y?
Initializing both X and Y as is done in the Spark implementation seems unnecessary unless I'm overlooking some desired side effect. Only the factor matrix which generates the other in the first iteration needs to be initialized.
I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is there a reason not to do this?
Thanks.


Hi Michael,
Thanks for looking into the details! Computing X first and computing Y
first can deliver different results, because the initial objective
values could differ by a lot. But the algorithm should converge after
a few iterations. It is hard to tell which should go first. After all,
the definitions of "user" and "product" are arbitrary. One trick we
can do is to rescale the columns of X and Y after each iteration such
that they have the same column norms.
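One way to read this rescaling trick, sketched in plain Python: for each latent factor k, multiply column k of X by s_k and divide column k of Y by s_k, with s_k chosen so that both columns end up with the same Euclidean norm. The product X Yt, and therefore every prediction, is unchanged. The function name and the choice s_k = sqrt(||Y_k|| / ||X_k||) are my assumptions about what is meant, not code from Spark.

```python
import math

def balance_columns(X, Y):
    """Return rescaled copies of X and Y whose k-th columns share one norm.

    Column k of X is multiplied by s_k and column k of Y divided by s_k,
    so the product X Y^T is left unchanged.
    """
    rank = len(X[0])
    X = [row[:] for row in X]
    Y = [row[:] for row in Y]
    for k in range(rank):
        nx = math.sqrt(sum(row[k] ** 2 for row in X))
        ny = math.sqrt(sum(row[k] ** 2 for row in Y))
        if nx == 0.0 or ny == 0.0:
            continue  # an all-zero column cannot be balanced
        s = math.sqrt(ny / nx)
        for row in X:
            row[k] *= s
        for row in Y:
            row[k] /= s
    return X, Y


X = [[3.0, 1.0], [4.0, 0.0]]               # 2 users, rank 2 (made-up values)
Y = [[20.0, 2.0], [0.0, 0.0], [0.0, 2.0]]  # 3 items, rank 2 (made-up values)
Xb, Yb = balance_columns(X, Y)
print(Xb)
print(Yb)
```

After balancing, column k of both matrices has norm sqrt(||X_k|| * ||Y_k||), which keeps either side from drifting to very large or very small values between iterations.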
For the comparison, you should compute some metrics to verify the convergence.
I don't think initializing Y is necessary if we start with X. However,
if Y_0 is not used, the data is not actually generated. So the
overhead should be small.
Best,
Xiangrui
On Fri, Mar 14, 2014 at 5:52 PM, Michael Allman < [hidden email]> wrote:
> [...]


On Mar 14, 2014, at 5:52 PM, Michael Allman < [hidden email]> wrote:
> I also found that the product and user RDDs were being rebuilt many times over in my tests, even for tiny data sets. By persisting the RDD returned from updateFeatures() I was able to avoid a raft of duplicate computations. Is there a reason not to do this?
This sounds like a good thing to add, though I’d like to understand why these are being recomputed (it seemed that the code would only use each one once). Do you have any sense why that is?
Matei


The factor matrix Y is used twice in the implicit ALS computation: once to
compute the global Y^T Y, and once to compute the local Y_i^T C_i Y_i.
Xiangrui
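The effect of using a lazily evaluated result twice can be mimicked with a toy model in plain Python. To be clear, this is an analogy and not the Spark API: `LazyDataset`, `persist`, `collect` and `use_twice` are hypothetical names that only imitate the idea that an unpersisted result is recomputed each time it is consumed.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute
        self._cached = None
        self._persist = False
        self.evaluations = 0   # how many times the computation actually ran

    def persist(self):
        """Ask for the result to be kept after the first evaluation."""
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


def use_twice(Y):
    """Consume Y twice, like the global and the local term in implicit ALS."""
    rows = Y.collect()                        # first use: a "global" sum of squares
    global_term = sum(v * v for v in rows)
    rows = Y.collect()                        # second use: "local" per-row terms
    local_terms = [2.0 * v * v for v in rows]
    return global_term, local_terms


plain = LazyDataset(lambda: [1.0, 2.0, 3.0])
cached = LazyDataset(lambda: [1.0, 2.0, 3.0]).persist()
use_twice(plain)
use_twice(cached)
print(plain.evaluations, cached.evaluations)
```

The unpersisted dataset is evaluated once per use, the persisted one only once, which is the kind of saving reported earlier in the thread.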
On Sun, Mar 16, 2014 at 1:18 PM, Matei Zaharia < [hidden email]> wrote:
> [...]


You are correct, in the long run it doesn't matter which matrix you begin the iterative process with. I was thinking in terms of doing a side-by-side comparison to Oryx.
I've posted a bug report as SPARK-1262. I described the problem I found and the mitigation strategy I've used. I think that this problem has many possible solutions, so I'm omitting a patch to let the community hash out the best approach. However, I will suggest we move to a pure Java implementation of a linear system solver to provide better assurances of correctness across platforms (differences in java.lang.Math notwithstanding) and to make the implementation more transparent. It is not clear exactly what native code JBlas is linked to and using for its solver. I suggested the QR decomposition-based solvers provided by Colt and Commons Math as candidate replacements.
Cheers.


Hi Xiangrui,
I don't see how https://github.com/apache/spark/pull/161 relates to ALS. Can you explain?
Also, thanks for addressing the issue with factor matrix persistence in PR 165. I was probably not going to get to that for a while.
I will try to test your changes today for speed improvements.
Cheers,
Michael


I just ran a runtime performance comparison between 0.9.0-incubating and your als branch. I saw a 1.5x improvement in performance.

