Suppose there are four documents:
D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat
where each word in the vocabulary can be translated to its wordID:
0 the
1 cat
2 sat
3 on
4 mat
Now every document can be represented as a sparse vector of word counts, as shown below:
Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))),
Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),
Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),
Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))
Finally, the principal components can be computed as follows:
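import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One sparse vector per document: (vocabulary size, Seq of (wordID, count))
val data = Array(
  Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))),
  Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),
  Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),
  Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))
)
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Top 4 principal components, returned as a local 5 x 4 matrix
val pc: Matrix = mat.computePrincipalComponents(4)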
What I want to do is read the following dataset and represent each document as a sparse vector, as above, in order to compute the principal components.
Each line has the form: docID wordID count
1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3
However, I am not sure how to read the dataset and represent it as sparse vectors. Any help would be much appreciated.
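One possible direction, as a minimal and untested sketch for spark-shell: parse each line into a (docID, (wordID, count)) pair, group the pairs by docID, and build one sparse vector per group. The helper names parseTriple and loadDocVectors, the path argument, and vocabSize are all hypothetical and not part of the dataset or of Spark. The sketch also assumes that wordIDs are 0-based and smaller than vocabSize; if the IDs are 1-based, subtract 1 from each before building the vectors.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical helper: parse one "docID wordID count" line.
// Returns None when the line does not have exactly three fields.
def parseTriple(line: String): Option[(Int, (Int, Double))] =
  line.trim.split("\\s+") match {
    case Array(doc, word, count) =>
      Some((doc.toInt, (word.toInt, count.toDouble)))
    case _ => None
  }

// Hypothetical helper: build one sparse vector per document.
// Assumes every wordID satisfies 0 <= wordID < vocabSize.
def loadDocVectors(sc: SparkContext, path: String, vocabSize: Int): RDD[Vector] =
  sc.textFile(path)
    .flatMap(line => parseTriple(line).toList)
    .groupByKey() // docID -> all (wordID, count) pairs of that document
    .map { case (_, entries) => Vectors.sparse(vocabSize, entries.toSeq) }

// Usage (the file name is a placeholder):
// val mat = new RowMatrix(loadDocVectors(sc, "docword.txt", vocabSize))
```

The rows of the resulting RDD are not guaranteed to come out in docID order, which should not matter for computing principal components since the result does not depend on row order.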
