Using columnSimilarities with a threshold gives results greater than one

Soheil Pourbafrani
To test the columnSimilarities method in Spark, I created a RowMatrix object:

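import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each tuple is one row of the matrix.
val temp = sc.parallelize(Array((5.0, 1.0, 4.0), (2.0, 3.0, 8.0),
  (4.0, 5.0, 10.0), (1.0, 3.0, 6.0)))

// Convert each tuple into a dense MLlib vector.
val rows = temp.map(line => {
  Vectors.dense(Array(line._1, line._2, line._3))
})

val mat = new RowMatrix(rows)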

The matrix is:
5  1   4
2  3   8
4  5   10
1  3   6

Without a threshold, columnSimilarities returns the cosine similarities between the columns of the matrix:
(5, 2, 4, 1)
(1, 3, 5, 3)
(4, 8, 10, 6)
that is:

MatrixEntry(0,2,0.8226366627527562)
MatrixEntry(0,1,0.755742181606458)
MatrixEntry(1,2,0.9847319278346619)
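
As a sanity check, the exact values can be recomputed without Spark. The following is a minimal plain-Scala sketch; the object and helper names (ExactColumnCosine, column, dot, cosine) are made up for this illustration and are not part of the Spark API. It should agree with the MatrixEntry values above up to floating-point rounding.

```scala
// Plain-Scala check (no Spark needed) of the exact column cosine similarities.
object ExactColumnCosine {
  // Same 4 x 3 matrix as above, one inner array per row.
  val rows: Array[Array[Double]] = Array(
    Array(5.0, 1.0, 4.0),
    Array(2.0, 3.0, 8.0),
    Array(4.0, 5.0, 10.0),
    Array(1.0, 3.0, 6.0)
  )

  // Extract column j of the matrix as a vector.
  def column(j: Int): Array[Double] = rows.map(_(j))

  // Dot product of two vectors of equal length.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Cosine similarity: dot product divided by the product of the L2 norms.
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  def main(args: Array[String]): Unit = {
    // Print the similarity for every column pair (i, j) with i < j.
    for (i <- 0 until 3; j <- i + 1 until 3)
      println(s"($i,$j) -> ${cosine(column(i), column(j))}")
  }
}
```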

The problem appears when I set a threshold:
val est = mat.columnSimilarities(0.5)
Some pairs then get a value greater than one, even though a cosine similarity should lie between zero and one for this data:

MatrixEntry(0,2,2.821741602543195)
MatrixEntry(0,1,1.319846878608914)

My primary question is: how should results greater than one be interpreted?
Also, does Spark use the DIMSUM algorithm only for cosine similarities with a threshold, or does it use DIMSUM without a threshold as well?
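
One observation that may help narrow this down (a sketch, not a confirmed explanation): the two values returned with threshold 0.5 are consistent with a sampled, rescaled sum of per-row products. The sketch below assumes an oversampling parameter gamma = 10 * ln(numCols) / threshold, as in the DIMSUM description, and assumes each kept product is divided by gamma because every column norm here exceeds sqrt(gamma). Both are assumptions about Spark internals that are not stated anywhere in this post. The object name ThresholdArithmetic, the estimate helper, and the choice of kept rows are hypothetical.

```scala
// Plain-Scala arithmetic check; it does not call Spark and does no sampling.
object ThresholdArithmetic {
  val numCols = 3
  val threshold = 0.5

  // ASSUMPTION: oversampling parameter taken from the DIMSUM description.
  val gamma: Double = 10 * math.log(numCols) / threshold

  // Per-row products for column pairs (0,2) and (0,1), from the matrix above.
  val products02: Array[Double] = Array(5.0 * 4.0, 2.0 * 8.0, 4.0 * 10.0, 1.0 * 6.0)
  val products01: Array[Double] = Array(5.0 * 1.0, 2.0 * 3.0, 4.0 * 5.0, 1.0 * 3.0)

  // Sum the products of the kept rows only, then rescale by gamma.
  // ASSUMPTION: dividing by gamma stands in for the per-column scaling.
  def estimate(products: Array[Double], keptRows: Set[Int]): Double =
    products.zipWithIndex.collect { case (p, r) if keptRows(r) => p }.sum / gamma

  def main(args: Array[String]): Unit = {
    // HYPOTHETICAL: the first row (index 0) was not sampled for column 0.
    val kept = Set(1, 2, 3)
    println(s"(0,2) -> ${estimate(products02, kept)}")
    println(s"(0,1) -> ${estimate(products01, kept)}")
  }
}
```

Under these assumptions, dropping the first row reproduces both values reported above (about 2.8217 and 1.3198). That would suggest the thresholded results are sampled estimates rather than exact cosine similarities, so a single run can land above one.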