Is it possible to implement Vector Space Model using PySpark


Soheil Pourbafrani
Hi, I want to implement the Vector Space Model for texts using Spark. As a first step, I compute the vector of words across the files (the dictionary) and make it a broadcast variable so that it is accessible to all executors.

# Build the dictionary: every distinct token across all documents
Vector_of_Words = selected_data.select('full_text').rdd\
  .map(lambda x : x[0].encode("ascii", "ignore"))\
  .flatMap(lambda row : tokenizeForVector(row))\
  .distinct()

Vector_of_Words_Broadcast = sc.broadcast(Vector_of_Words.collect())

In the second step, I calculate my customized TF-IDF for each word:

# Assumes tokenize() yields (word, (count, docID)) pairs, with docID as a string
Hybrid_TFIDF = selected_data.rdd.flatMap(lambda x : tokenize(x))\
  .reduceByKey(lambda x, y : (x[0] + y[0], x[1] + ',' + y[1]))\
  .map(lambda x : (x[0], x[1][0], len(set(x[1][1].split(","))),
                   x[1][1], x[1][0] * log10(float(N) / len(set(x[1][1].split(","))))))

Each resulting record has the form (word, term frequency, document frequency, documentIDs, TF-IDF).
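
To make the record layout concrete, here is a minimal plain-Python sketch of the same aggregation without Spark. The toy `pairs` list, the document count `N = 3`, and the helper name `hybrid_tfidf` are assumptions for illustration only; they stand in for the output of `tokenize` and are not part of the actual job.

```python
from collections import defaultdict
from math import log10

# Hypothetical stand-in for tokenize() output: (word, (count, docID)) pairs.
pairs = [
    ("spark", (2, "d1")),
    ("spark", (1, "d2")),
    ("vector", (1, "d1")),
    ("model", (3, "d3")),
]
N = 3  # assumed total number of documents


def hybrid_tfidf(pairs, n_docs):
    """Mimic reduceByKey + map: sum counts and collect docIDs per word."""
    counts = defaultdict(int)
    docs = defaultdict(list)
    for word, (count, doc_id) in pairs:
        counts[word] += count
        docs[word].append(doc_id)
    records = []
    for word in sorted(counts):
        df = len(set(docs[word]))  # number of distinct documents with the word
        tfidf = counts[word] * log10(float(n_docs) / df)
        records.append((word, counts[word], df, ",".join(docs[word]), tfidf))
    return records


records = hybrid_tfidf(pairs, N)
for record in records:
    print(record)
```

A word that occurs in every document gets a weight of zero here, because log10(N / N) is 0.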
Now I don't know how to continue. Overall, my question is: is it possible to compute the Vector Space Model using Apache Spark?
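
One possible continuation, sketched in plain Python under stated assumptions: invert the per-word records into per-document sparse vectors indexed by dictionary position, then compare documents with cosine similarity. The sample `vocabulary` and `records`, and the helper names `build_doc_vectors` and `cosine`, are hypothetical. Each record is simplified here to (word, documentIDs, weight), and the per-word weight is reused for every document that contains the word, which is a simplification. In Spark, the inversion step might be expressed with `flatMap` followed by `groupByKey` on the document ID.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical dictionary and per-word records: (word, documentIDs, weight).
vocabulary = ["model", "spark", "vector"]
records = [
    ("model", "d3", 1.5),
    ("spark", "d1,d2", 0.5),
    ("vector", "d1", 0.4),
]


def build_doc_vectors(records, vocabulary):
    """Turn per-word records into {docID: {dictionary_index: weight}}."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = defaultdict(dict)
    for word, doc_ids, weight in records:
        for doc_id in set(doc_ids.split(",")):
            vectors[doc_id][index[word]] = weight
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[i] for i, w in u.items() if i in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_vectors = build_doc_vectors(records, vocabulary)
sim_d1_d2 = cosine(doc_vectors["d1"], doc_vectors["d2"])
sim_d1_d3 = cosine(doc_vectors["d1"], doc_vectors["d3"])
print(sim_d1_d2, sim_d1_d3)
```

Sparse dicts are used instead of dense lists because most documents contain only a small fraction of the dictionary.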