I have a dataset that contains DocID, WordID and frequency (count), as shown below. Note that the first three numbers are (1) the number of documents, (2) the number of words in the vocabulary, and (3) the total number of words in the collection.
The problem is that I am not sure how to read the .txt.gz file as an RDD and create an array of sparse vectors as described above. Please note that I ultimately want to pass this array to the PCA transformer.
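One possible approach, sketched in PySpark under several assumptions that are not stated in the post: each of the three header numbers sits on its own line, every remaining line is a whitespace-separated "docID wordID count" triple, and word IDs start at 1. The helper names (`parse_triple`, `group_by_doc`, `load_sparse_vectors`) and the column names are made up for illustration.

```python
from collections import defaultdict


def parse_triple(line):
    """Return (doc_id, (index, count)) for a data line, or None for a header line."""
    parts = line.split()
    if len(parts) != 3:
        return None  # assumption: each header line holds a single number
    doc_id, word_id, count = parts
    # Assumption: word IDs start at 1, so shift them to 0-based vector indices.
    return int(doc_id), (int(word_id) - 1, float(count))


def group_by_doc(lines):
    """Local (non-Spark) version of the grouping step, useful for checking the parser."""
    docs = defaultdict(dict)
    for line in lines:
        parsed = parse_triple(line)
        if parsed is not None:
            doc_id, (index, count) = parsed
            docs[doc_id][index] = count
    return dict(docs)


def load_sparse_vectors(spark, path, vocab_size):
    """Build a DataFrame of (docId, features) rows from the gzipped text file."""
    from pyspark.ml.linalg import Vectors  # imported lazily; requires pyspark

    rows = (
        spark.sparkContext.textFile(path)  # .gz input is decompressed on read
        .map(parse_triple)
        .filter(lambda parsed: parsed is not None)  # drop the header lines
        .groupByKey()  # collect all (index, count) pairs of one document
        .map(lambda kv: (kv[0], Vectors.sparse(vocab_size, dict(kv[1]))))
    )
    return spark.createDataFrame(rows, ["docId", "features"])
```

If those assumptions hold, the vocabulary size could be taken from the second header number, and the resulting DataFrame could then be handed to `pyspark.ml.feature.PCA` with `inputCol="features"`. A gzipped file is not splittable, so it is read as a single partition; repartitioning after parsing may help for large collections.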