Spark Matrix Factorization

Spark Matrix Factorization

Debasish Das
Hi,

I do not see any DSGD implementation of ALS in Spark.

There are two ALS implementations.

org.apache.spark.examples.SparkALS does not run on large matrices and seems more like demo code.

org.apache.spark.mllib.recommendation.ALS looks like a more robust version, and I am experimenting with it.

References here are Jellyfish, Twitter's implementation of Jellyfish called Scalafish, a Google paper called Sparkler, and a similar idea put forward in an IBM paper by Gemulla et al. (large-scale matrix factorization with distributed stochastic gradient descent).

https://github.com/azymnis/scalafish

Are there any plans to add DSGD to Spark, or is there an existing JIRA?

Thanks.
Deb
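
For context, the core scheduling idea in the DSGD paper by Gemulla et al. is to split the rating matrix into a d x d grid of blocks and, in each sub-epoch, let d workers run SGD on d blocks that share no rows and no columns, so their factor updates cannot conflict. The sketch below shows only that block schedule, in plain Python. The function name `dsgd_strata` and the simple cyclic shift are illustrative assumptions, not code taken from Spark, Scalafish, or the paper.

```python
def dsgd_strata(d):
    """Return d strata; each stratum is a list of d (row_block, col_block) pairs.

    In stratum s, worker w is assigned block (w, (w + s) % d). No two blocks
    in a stratum share a row block or a column block, so the workers never
    update the same user factors or the same item factors at the same time.
    """
    return [[(w, (w + s) % d) for w in range(d)] for s in range(d)]


if __name__ == "__main__":
    # Print the schedule for a hypothetical 3-worker setup.
    for s, stratum in enumerate(dsgd_strata(3)):
        print(f"sub-epoch {s}: {stratum}")
```

Running through all d strata visits every block of the grid exactly once, which corresponds to one full pass over the data.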

Re: Spark Matrix Factorization

Ameet Talwalkar
Hi Deb,

Thanks for your email. We currently do not have a DSGD implementation in MLlib. Also, just to clarify, DSGD is not a variant of ALS, but rather a different algorithm for solving the same bi-convex objective function.

It would be a good thing to add, but to the best of my knowledge, no one is actively working on this right now.

Also, as you mentioned, the ALS implementation in mllib is more robust/scalable than the one in spark.examples.

-Ameet
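
For readers following the ALS/DSGD distinction, the objective in question is commonly written as below. This is the standard regularized squared-error formulation, stated here as an assumption because the thread does not spell it out; the symbols $\Omega$ (set of observed entries), $r_{ij}$, $u_i$, $v_j$ and $\lambda$ are notation introduced for this sketch.

```latex
\min_{U,V} \; \sum_{(i,j)\in\Omega} \left( r_{ij} - u_i^{\top} v_j \right)^2
  + \lambda \left( \sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2 \right)
```

With $V$ held fixed this is a convex least-squares problem in $U$, and vice versa, which is what "bi-convex" refers to; the problem is not jointly convex in $(U, V)$. ALS solves each half exactly in turn, while SGD-style methods such as DSGD take gradient steps on both factors.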


Re: Spark Matrix Factorization

Sebastian Schelter
In reply to this post by Debasish Das
Just a minor correction: The Sparkler paper was done by IBM. IIRC they
not only implemented the algorithm but also modified Spark to tune it
for that use case.

--sebastian

Re: Spark Matrix Factorization

Charles Earl
In reply to this post by Ameet Talwalkar
On a slightly related note, I am trying to write a distributed PCA based upon 
The algorithm works by computing an SVD locally and then broadcasting the locally computed principal components.
I wonder if anyone has a recommendation for a Scala-native implementation of SVD.
C

--
- Charles

Re: Spark Matrix Factorization

Sebastian Schelter
> I wonder if anyone has a recommendation for a Scala-native implementation
> of SVD.

Mahout has a Scala implementation of an SVD variant called Stochastic SVD:

https://svn.apache.org/viewvc/mahout/trunk/math-scala/src/main/scala/org/apache/mahout/math/scalabindings/SSVD.scala?view=markup

Otherwise, all the major Java math libraries (Mahout math, jblas,
commons-math) should provide an implementation that you can use in Scala.

--sebastian

Re: Spark Matrix Factorization

Dmitriy Lyubimov
Mahout also has SVD and eigendecompositions mapped to Scala as svd() and eigen(). Unfortunately I have not put this on the wiki yet, but a summary is available here: https://issues.apache.org/jira/browse/MAHOUT-1297

Mahout also has a distributed PCA implementation (which is based on distributed Stochastic SVD and has special provisions for sparse matrix cases). Unfortunately our wiki is in flux right now due to the migration from Confluence to the CMS, and the SSVD page has not yet been migrated, so the Confluence version is here: https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition



Re: Spark Matrix Factorization

Ameet Talwalkar
Hi all,

The following pull request implementing SVD in MLlib is highly relevant to this discussion.

-Ameet


Re: Spark Matrix Factorization

Debasish Das
In reply to this post by Dmitriy Lyubimov
Hi Dmitriy,

We have a Mahout mirror from GitHub, but I don't see any of the math-scala code.

Where can I find the math-scala code? I thought the GitHub mirror was kept in sync with the svn repo.

Thanks.
Deb



Re: Spark Matrix Factorization

Debasish Das
In reply to this post by Ameet Talwalkar
Hi Ameet,

Matrix factorization is a non-convex problem. ALS solves it by alternating between two convex problems, while DSGD solves it by finding a local minimum.

I am experimenting with Spark Parallel ALS but I intend to port Scalafish https://github.com/azymnis/scalafish to Spark as well.

For bigger matrices, the jury is still out on which algorithm provides a better local optimum within an iteration bound. I believe it is also highly dependent on the dataset.

Thanks.
Deb
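
The kind of comparison described above can be tried on a toy problem. The sketch below is a plain-Python illustration, not MLlib or Scalafish code: it assumes a rank-1 factorization so that each ALS subproblem has a closed-form scalar solution, and it runs sequential SGD rather than DSGD. All names (`objective`, `als_sweep`, `sgd_epoch`, `compare`), the toy ratings, and the hyperparameters are hypothetical choices made for this example.

```python
import random


def objective(ratings, u, v, lam):
    """Regularized squared error over the observed entries only."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    reg = lam * (sum(x * x for x in u) + sum(x * x for x in v))
    return err + reg


def als_sweep(ratings, u, v, lam):
    """One ALS sweep: exact solve for u with v fixed, then for v with u fixed."""
    for i in range(len(u)):
        row = [(j, r) for (a, j), r in ratings.items() if a == i]
        u[i] = sum(r * v[j] for j, r in row) / (lam + sum(v[j] ** 2 for j, _ in row))
    for j in range(len(v)):
        col = [(i, r) for (i, b), r in ratings.items() if b == j]
        v[j] = sum(r * u[i] for i, r in col) / (lam + sum(u[i] ** 2 for i, _ in col))


def sgd_epoch(ratings, u, v, lam, lr, rng):
    """One SGD epoch: one gradient step per observed entry, in random order."""
    n_row = [sum(1 for (a, _) in ratings if a == i) for i in range(len(u))]
    n_col = [sum(1 for (_, b) in ratings if b == j) for j in range(len(v))]
    entries = list(ratings.items())
    rng.shuffle(entries)
    for (i, j), r in entries:
        e = r - u[i] * v[j]
        ui, vj = u[i], v[j]
        # The penalty is split across a row's (column's) entries so that one
        # epoch applies lam once per factor, matching objective().
        u[i] += lr * (e * vj - lam * ui / n_row[i])
        v[j] += lr * (e * ui - lam * vj / n_col[j])


def compare(ratings, n_rows, n_cols, lam=0.1, lr=0.01, iters=200, seed=0):
    """Return (initial, after-ALS, after-SGD) objective values from one start."""
    rng = random.Random(seed)
    u_a, v_a = [1.0] * n_rows, [1.0] * n_cols
    u_s, v_s = [1.0] * n_rows, [1.0] * n_cols
    initial = objective(ratings, u_a, v_a, lam)
    for _ in range(iters):
        als_sweep(ratings, u_a, v_a, lam)
        sgd_epoch(ratings, u_s, v_s, lam, lr, rng)
    return initial, objective(ratings, u_a, v_a, lam), objective(ratings, u_s, v_s, lam)


if __name__ == "__main__":
    toy = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 2.0, (2, 2): 1.0}
    print(compare(toy, 3, 3))
```

On a problem this small both methods settle quickly. Which one reaches a lower objective within a fixed iteration budget depends on the data, the step size, and the regularization, so a toy run like this says nothing about behaviour on large matrices.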



Re: Spark Matrix Factorization

Ameet Talwalkar
> Matrix factorization is a non-convex problem. ALS solves it by alternating between two convex problems, while DSGD solves it by finding a local minimum.

ALS and SGD solve the same non-convex objective function, and thus both yield local minima. The following reference provides a nice overview (in particular see equation 2 of this paper):

Re: Spark Matrix Factorization

Dmitriy Lyubimov
In reply to this post by Debasish Das
It's in Mahout 0.9, which should be in its very final stages now.

