Computing cosine similarity using PySpark


Computing cosine similarity using PySpark

jamal sasha
Hi,
  I have a bunch of vectors like
[0.1234,-0.231,0.23131]
.... and so on.

and I want to compute cosine similarity and Pearson correlation using PySpark.
How do I do this?
Any ideas?
Thanks

Re: Computing cosine similarity using PySpark

Andrew Ash
Hi Jamal,

I don't believe there are pre-written algorithms for cosine similarity or Pearson correlation in PySpark that you can reuse. If you end up writing your own implementation, though, the project would definitely appreciate it if you shared that code back for future users to leverage!
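For what it's worth, a minimal hand-rolled sketch might look something like the following. This is untested illustration, not library code: it assumes an existing SparkContext sc and vectors stored as NumPy arrays, and the data is made up.

import numpy as np

def cosine_similarity(x, y):
    # dot product normalized by the product of the vectors' L2 norms
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Illustrative data only; `sc` is assumed to be an existing SparkContext.
vectors = sc.parallelize([np.array([0.1234, -0.231, 0.23131]),
                          np.array([0.2, 0.5, -0.1])])

# All-pairs cosine similarity via the cartesian product.
sims = (vectors.cartesian(vectors)
               .map(lambda pair: cosine_similarity(pair[0], pair[1]))
               .collect())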

Andrew



Re: Computing cosine similarity using PySpark

Andrei
In reply to this post by jamal sasha
Do you need cosine distance and correlation between vectors, or between variables (the elements of the vectors)? It would be helpful if you could tell us the details of your task.
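To make the distinction concrete, here is a rough sketch (assuming data is an RDD of NumPy arrays, and that it is small enough to collect for the between-variables case):

import numpy as np

# Between vectors (rows): correlate one whole vector with another.
# Between variables (columns): correlate the i-th components across all vectors.
mat = np.array(data.collect())                 # assumes the data fits on the driver
between_vars = np.corrcoef(mat, rowvar=False)  # correlation matrix of the columns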



Re: Computing cosine similarity using PySpark

Jeremy Freeman
Hi Jamal,

One nice feature of PySpark is that you can easily use existing functions from NumPy and SciPy inside your Spark code. As a simple example, the following uses Spark's cartesian operation (which pairs every vector with every other vector, yielding tuples), followed by NumPy's corrcoef, to compute the Pearson correlation coefficient between every pair in a set of vectors. The vectors are an RDD of NumPy arrays.

>>> from numpy import array, corrcoef

>>> data = sc.parallelize([array([1, 2, 3]), array([2, 4, 6.1]), array([3, 2, 1.1])])
>>> corrs = data.cartesian(data).map(lambda xy: corrcoef(xy[0], xy[1])[0, 1]).collect()
>>> corrs
[1.0, 0.99990086740991746, -0.99953863896044948 ...

This just returns a flat list of the correlation coefficients. You could also add a key to each array to keep track of which pair is which:

>>> data_with_keys = sc.parallelize([(0, array([1, 2, 3])), (1, array([2, 4, 6.1])), (2, array([3, 2, 1.1]))])
>>> corrs_with_keys = data_with_keys.cartesian(data_with_keys).map(lambda p: ((p[0][0], p[1][0]), corrcoef(p[0][1], p[1][1])[0, 1])).collect()
>>> corrs_with_keys
[((0, 0), 1.0), ((0, 1), 0.99990086740991746), ((0, 2), -0.99953863896044948) ...

Finally, you could replace corrcoef in either of the above with scipy.spatial.distance.cosine. One caveat: SciPy's cosine returns the cosine distance, which is 1 minus the cosine similarity, so subtract the result from 1 to get the similarity.
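For example, a quick sketch reusing the data RDD from above:

>>> from scipy.spatial.distance import cosine
>>> cosine_sims = data.cartesian(data).map(lambda xy: 1 - cosine(xy[0], xy[1])).collect()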

Hope that's useful. As Andrei said, the answer partly depends on exactly what you're trying to do.

-- Jeremy



Re: Computing cosine similarity using PySpark

roxana.danger
In reply to this post by jamal sasha
Hi Jamal,
    Is there any update on this?
    Thanks,
        Roxana