BucketedRandomProjectionLSHModel algorithm details

vvinton
Hi there,

I am using spark-mllib_2.11-2.1.0 and facing an issue: BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns a single result.

The dataset looks like this:

+----+--------------------+-------------+------------------------+----------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|
+----+--------------------+-------------+------------------------+----------------------+
|1045|(16384,[196,11016...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|
|1041|(16384,[4110,1065...|            0|    (16384,[196],[0.2...|  [[0.0], [0.0], [-...|
+----+--------------------+-------------+------------------------+----------------------+
Execution code:

Dataset<Row> approximatedDS = (Dataset<Row>) ((BucketedRandomProjectionLSHModel)model)
                            .approxNearestNeighbors(dataset,
                            vectorToCalculateAgainst, numberOfResults, false, MLFlowConstants.THEMES_PREDICTION_COLUMNS.distance.name());
Where:

numberOfResults = 2
vectorToCalculateAgainst = first vector in predictionVectorFeatures column
The resulting approximatedDS looks as follows:

+----+--------------------+-------------+------------------------+----------------------+------------------+
|  id|            features|kmeansCluster|predictionVectorFeatures|featuresInNewDimension|          distance|
+----+--------------------+-------------+------------------------+----------------------+------------------+
|1061|(16384,[196,11016...|            1|    (16384,[196],[0.2...|  [[0.0], [0.0], [0...|0.8536603178950374|
+----+--------------------+-------------+------------------------+----------------------+------------------+
I suspect that in this part of LSH.scala

  // Compute threshold to get exact k elements.
  // TODO: SPARK-18409: Use approxQuantile to get the threshold
  val modelDatasetSortedByHash = modelDataset.sort(hashDistCol).limit(numNearestNeighbors)
  val thresholdDataset = modelDatasetSortedByHash.select(max(hashDistCol))
  val hashThreshold = thresholdDataset.take(1).head.getDouble(0)

  // Filter the dataset where the hash value is less than the threshold.
  modelDataset.filter(hashDistCol <= hashThreshold)
}
the last filter does the wrong filtering, but I may be wrong (I do not know Scala).
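
To make the quoted steps easier to follow without Scala, here is a minimal plain-Java sketch of how I read them: sort by hash distance, take the first numNearestNeighbors rows, use their maximum as the threshold, then keep every row at or below it. No Spark is involved; the class name HashThresholdSketch, the method filterByThreshold, and the hash distances are all made up for illustration, and the real code works on a Dataset column rather than an array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the threshold logic quoted from LSH.scala.
final class HashThresholdSketch {

    // Returns the indices of rows whose hash distance is <= the k-th smallest one.
    static List<Integer> filterByThreshold(double[] hashDistances, int k) {
        List<Integer> kept = new ArrayList<>();
        if (hashDistances.length == 0 || k <= 0) {
            return kept;
        }
        // Mirrors sort(hashDistCol).limit(numNearestNeighbors).
        double[] sorted = hashDistances.clone();
        Arrays.sort(sorted);
        int limit = Math.min(k, sorted.length);
        // Mirrors select(max(hashDistCol)) over the limited rows.
        double threshold = sorted[limit - 1];
        // Mirrors filter(hashDistCol <= hashThreshold) over the full dataset.
        for (int i = 0; i < hashDistances.length; i++) {
            if (hashDistances[i] <= threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented hash distances for five rows; k = 2 as in my call.
        double[] hashDistances = {0.0, 0.5, 0.5, 2.0, 3.0};
        System.out.println(filterByThreshold(hashDistances, 2)); // prints [0, 1, 2]
    }
}
```

If this reading is right, the filter keeps at least numNearestNeighbors rows whenever the dataset has that many rows with valid hash distances (ties at the threshold can only add more), so this step alone would not seem to explain getting a single row back.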

Can anyone help me understand how to make BucketedRandomProjectionLSHModel.approxNearestNeighbors return multiple "nearest" vectors?

Thanks,