PySpark OOM when running PCA

Riccardo Ferrari
Hi list,

I am having trouble running PCA with PySpark. I am trying to reduce the size of my feature matrix, since after one-hot encoding (OHE) my feature vectors end up roughly 40k wide.
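For context, the relevant part of my pipeline looks roughly like this (a simplified sketch, not my actual code; the column names and k are placeholders):

    from pyspark.ml.feature import OneHotEncoder, PCA, VectorAssembler

    # df: a DataFrame with a StringIndexer-produced "category_idx" column (placeholder name)
    encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_ohe")
    assembler = VectorAssembler(inputCols=["category_ohe"], outputCol="features")
    encoded = assembler.transform(encoder.transform(df))

    pca = PCA(k=100, inputCol="features", outputCol="pca_features")  # k is a placeholder
    model = pca.fit(encoded)  # <- this is the call that blows up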

Spark 2.2.0 stand-alone cluster (Oracle JVM)
pyspark 2.2.0 from a Docker container (OpenJDK)

I'm starting the Spark session from the notebook; however, I make sure to set:
  • PYSPARK_SUBMIT_ARGS: "--packages ... --driver-memory 20G pyspark-shell"
  • sparkConf.set("spark.executor.memory", "24G")
  • sparkConf.set("spark.driver.memory", "20G")
My executors get 24 GB per node, and my driver process starts with:
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx20G org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=24G ... pyspark-shell
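In code, the session setup looks roughly like this (a sketch; the master URL and package list are elided, as above):

    import os
    # Set before any JVM starts, since spark.driver.memory has no effect once the driver JVM is up
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages ... --driver-memory 20G pyspark-shell"

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    sparkConf = SparkConf()
    sparkConf.setMaster("spark://...")  # stand-alone master, elided
    sparkConf.set("spark.executor.memory", "24G")
    sparkConf.set("spark.driver.memory", "20G")
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()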

So I should have plenty of memory to play with. However, when running PCA.fit I see the following in the Spark driver logs:
19/02/08 01:02:43 WARN TaskSetManager: Stage 29 contains a task of very large size (142 KB). The maximum recommended task size is 100 KB.
19/02/08 01:02:43 WARN RowMatrix: 34771 columns will require at least 9672 megabytes of memory!
19/02/08 01:02:46 WARN RowMatrix: 34771 columns will require at least 9672 megabytes of memory!
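
If I read that warning right, the number matches a dense Gramian/covariance matrix over my columns, which the stack trace below suggests mllib materializes during fit. Back-of-the-envelope check:

    n_cols = 34771
    gramian_mb = n_cols * n_cols * 8 / 1e6  # dense n x n matrix of doubles
    print(gramian_mb)                       # ~9672 MB, matching the warning

So even that alone should still fit in the 20G driver heap, as far as I can tell.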

Eventually it fails with:
Py4JJavaError: An error occurred while calling o287.fit.
: java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
...
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
    at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
    at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
...

What am I missing?
Any hints much appreciated,