Number of rows divided by rowsPerBlock cannot exceed maximum integer



Soheil Pourbafrani
Hi,
Multiplying a matrix by its transpose, I got the following error:

pyspark.sql.utils.IllegalArgumentException: requirement failed: Number of rows divided by rowsPerBlock cannot exceed maximum integer.

Here is the code:

from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# L2-normalize the TF-IDF feature vectors
normalizer = Normalizer(inputCol="feature", outputCol="norm")
data = normalizer.transform(tfidf)

# Build a distributed matrix, using the document ID as the row index
mat = IndexedRowMatrix(
    data.select("ID", "norm")
        .rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray()))).toBlockMatrix()

# Pairwise dot products (cosine similarities, since the rows are unit vectors)
dot = mat.multiply(mat.transpose())
dot.toLocalMatrix().toArray()
The error points to this line:

.rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray()))).toBlockMatrix()

I reduced the data to only 5 sentences, but I still get the error!
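
For context on why even five rows can fail this check: toBlockMatrix() requires that numRows() divided by rowsPerBlock not exceed the maximum integer, and for an IndexedRowMatrix numRows() is the largest row index plus one, not the row count. Below is a minimal diagnostic sketch reusing the same data variable and ID column from above; that ID might hold very large values (e.g. hashes or monotonically_increasing_id output) is an assumption, not something stated in the post.

# Sketch: check whether large ID values inflate the implied dimensions.
# numRows() is max(row index) + 1, so oversized IDs can trip the
# Int.MaxValue check even for a handful of rows.
indexed = IndexedRowMatrix(
    data.select("ID", "norm")
        .rdd.map(lambda row: IndexedRow(row.ID, row.norm.toArray())))
print(indexed.numRows(), indexed.numCols())  # numRows() >> row count signals oversized indices

# Possible workaround (sketch): assign contiguous row indices with
# zipWithIndex instead of using ID as the matrix row index.
indexed = IndexedRowMatrix(
    data.select("norm").rdd.zipWithIndex()
        .map(lambda pair: IndexedRow(pair[1], pair[0].norm.toArray())))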