Represent documents as a sequence of wordID & frequency and perform PCA


Old-School
Suppose there are four documents:

D1: the cat sat on the mat
D2: the cat sat on the cat
D3: the cat sat
D4: the mat sat

where each word in the vocabulary maps to a wordID:

0 the
1 cat
2 sat
3 on
4 mat

Now every document can be represented as a sparse vector of (wordID, count) pairs over the 5-word vocabulary, as shown below:


Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)))  // D1
Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0)))            // D2
Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0)))                      // D3
Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0)))                      // D4
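
To make the mapping from text to (wordID, count) pairs explicit, here is a minimal plain-Scala sketch with no Spark dependency. The names BagOfWords, buildVocab and toPairs are hypothetical, and assigning wordIDs in order of first appearance is just an assumption that happens to reproduce the IDs listed above.

```scala
object BagOfWords {
  // Assign each distinct word an ID in order of first appearance (an assumption).
  def buildVocab(docs: Seq[String]): Map[String, Int] =
    docs.flatMap(_.split("\\s+")).distinct.zipWithIndex.toMap

  // Count the words of one document and return (wordID, count) pairs sorted by wordID.
  def toPairs(doc: String, vocab: Map[String, Int]): Seq[(Int, Double)] =
    doc.split("\\s+").toSeq
      .groupBy(identity)
      .map { case (word, occurrences) => (vocab(word), occurrences.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

Each resulting Seq has the shape that Vectors.sparse expects as its second argument, with vocab.size as the first.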


Finally, the principal components can be computed as follows:


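import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// One sparse row per document: (vocabulary size, Seq((wordID, count), ...))
val data = Array(
    Vectors.sparse(5, Seq((0, 2.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0))),
    Vectors.sparse(5, Seq((0, 2.0), (1, 2.0), (2, 1.0), (3, 1.0))),
    Vectors.sparse(5, Seq((0, 1.0), (1, 1.0), (2, 1.0))),
    Vectors.sparse(5, Seq((0, 1.0), (2, 1.0), (4, 1.0))))

// sc is the SparkContext (predefined in spark-shell)
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Top 4 principal components, returned as a local 5 x 4 matrix
val pc: Matrix = mat.computePrincipalComponents(4)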


What I want to do is read the following dataset and represent each document as a sparse vector, as above, in order to compute the principal components.

The dataset has one entry per line, in the form: docID wordID count

1 2 1
1 39 1
1 42 3
1 77 1
1 95 1
1 96 1
2 105 1
2 108 1
3 133 3

However, I am not quite sure how to read the dataset and represent it as sparse vectors. Any help would be much appreciated.
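
For reference, the grouping step itself can be sketched in plain Scala, without Spark. This is only an illustration under assumptions: TripletReader, groupByDoc and vocabSize are hypothetical names, fields are assumed to be whitespace-separated, and wordIDs are used as vector indices unchanged (if the file uses 1-based wordIDs, subtract 1 first).

```scala
object TripletReader {
  // Parse "docID wordID count" lines and collect the (wordID, count) pairs of each document.
  // Lines that do not have exactly three fields are skipped.
  def groupByDoc(lines: Seq[String]): Map[Int, Seq[(Int, Double)]] =
    lines
      .map(_.trim.split("\\s+"))
      .collect { case Array(doc, word, count) => (doc.toInt, (word.toInt, count.toDouble)) }
      .groupBy(_._1)
      .map { case (doc, rows) => doc -> rows.map(_._2).sortBy(_._1) }

  // Smallest vector size that can hold every wordID seen (largest wordID + 1).
  def vocabSize(grouped: Map[Int, Seq[(Int, Double)]]): Int =
    grouped.values.flatten.map(_._1).max + 1
}
```

In Spark, the same idea should carry over by reading the file with sc.textFile, mapping each line to (docID, (wordID, count)), calling groupByKey, and passing each document's pairs to Vectors.sparse(size, pairs), where size is the vocabulary size.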