Hi Jamal,

One nice feature of PySpark is that you can easily use existing functions from NumPy and SciPy inside your Spark code. As a simple example, the following uses Spark's cartesian operation (which pairs every element of one RDD with every element of another, yielding tuples), followed by NumPy's corrcoef to compute the Pearson correlation coefficient between every pair of a set of vectors. The vectors are an RDD of NumPy arrays.

>> from numpy import array, corrcoef

>> data = sc.parallelize([array([1,2,3]),array([2,4,6.1]),array([3,2,1.1])])

>> corrs = data.cartesian(data).map(lambda p: corrcoef(p[0], p[1])[0, 1]).collect()

>> corrs

[1.0, 0.99990086740991746, -0.99953863896044948 ...
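Incidentally, if your data fit on one machine you can sanity-check the result locally, since np.corrcoef on a stacked 2-D array returns the full matrix of pairwise coefficients in one call (plain NumPy, no Spark needed):

```python
import numpy as np

# Same three vectors as above, stacked into a 3 x 3 array (one row per vector)
vectors = np.vstack([np.array([1, 2, 3]),
                     np.array([2, 4, 6.1]),
                     np.array([3, 2, 1.1])])

# Rows are treated as variables, so this is the 3x3 matrix of
# pairwise Pearson correlation coefficients
corr_matrix = np.corrcoef(vectors)
```

Entry [i, j] of corr_matrix matches the (i, j) pair from the cartesian version.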

This just returns a flat list of correlation coefficients. You could also add a key to each array to keep track of which pair is which:

>> data_with_keys = sc.parallelize([(0,array([1,2,3])),(1,array([2,4,6.1])),(2,array([3,2,1.1]))])

>> corrs_with_keys = data_with_keys.cartesian(data_with_keys).map(lambda p: ((p[0][0], p[1][0]), corrcoef(p[0][1], p[1][1])[0, 1])).collect()

>> corrs_with_keys

[((0, 0), 1.0), ((0, 1), 0.99990086740991746), ((0, 2), -0.99953863896044948) ...
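One thing to keep in mind: cartesian pairs every element with every element, including self-pairs and both orderings, so you get n^2 results. In plain Python the same keyed pairing looks like the sketch below (itertools.product standing in for cartesian), with a filter keeping each unordered pair once; the same k1 < k2 filter carries over directly to the RDD version.

```python
from itertools import product
from numpy import array, corrcoef

data_with_keys = [(0, array([1, 2, 3])),
                  (1, array([2, 4, 6.1])),
                  (2, array([3, 2, 1.1]))]

# All n^2 keyed pairs, mirroring rdd.cartesian(rdd)
pairs = product(data_with_keys, data_with_keys)

# k1 < k2 drops self-pairs and mirror-image duplicates,
# leaving one result per unordered pair
corrs_with_keys = [((k1, k2), corrcoef(v1, v2)[0, 1])
                   for (k1, v1), (k2, v2) in pairs
                   if k1 < k2]
```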

Finally, you could replace corrcoef in either of the above with scipy.spatial.distance.cosine; note that it returns the cosine distance, so take 1 minus the result to get your cosine similarity.
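One caveat worth a quick local check (assuming SciPy is installed): scipy.spatial.distance.cosine is a distance, equal to 1 minus the cosine similarity, so you subtract it from 1:

```python
import numpy as np
from scipy.spatial.distance import cosine

u = np.array([1, 2, 3])
v = np.array([2, 4, 6.1])

# cosine() returns the distance 1 - (u . v) / (|u| |v|)
cos_dist = cosine(u, v)
cos_sim = 1 - cos_dist
```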

Hope that's useful. As Andrei said, the answer partly depends on exactly what you're trying to do.

-- Jeremy