    normalizer = Normalizer(inputCol="feature", outputCol="norm")
    data = normalizer.transform(tfidf)
    mat = IndexedRowMatrix(
        data.select("id", "norm")\
            .rdd.map(lambda row: IndexedRow(row.id, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
I expect the output to be a matrix whose diagonal is all ones, because each vector's similarity to itself is 1. With TF-IDF weights the diagonal is indeed all ones, as desired.
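To make the reasoning about the diagonal concrete, here is a minimal sketch in plain Python (no Spark) of the same idea: L2-normalize each row, then take all pairwise dot products. The toy vectors and the helper names `l2_normalize` and `gram` are hypothetical and only for illustration; the sketch assumes the normalizer uses the Euclidean (p=2) norm.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; an all-zero vector is left unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def gram(rows):
    """Return the matrix of pairwise dot products, i.e. M * M^T."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

# Hypothetical toy document vectors, not taken from the question's data.
docs = [[1.0, 2.0, 0.0], [0.0, 3.0, 4.0], [5.0, 0.0, 12.0]]
sim = gram([l2_normalize(d) for d in docs])

# Each diagonal entry is the squared length of a unit vector, so it should be 1.
for i, row in enumerate(sim):
    print(i, [round(v, 4) for v in row])
```

In this sketch a diagonal entry can differ from 1 only if the row was not actually unit length before the multiplication, for example an all-zero row, which stays zero and gives a diagonal entry of 0.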
The problem arises when I want to weight the words in the vector space with something other than typical TF-IDF. I compute the vector space myself and create a vector for each document in which the indices of that document's words carry the new weights and every other index is zero.
For example, the following vector is for document id 0: