Speed up Spark writes to Google Cloud Storage


SRK
Hi,

Our Spark writes to GCS are slow. The reason I see is that a staging
directory is used for the initial data generation, followed by copying the
data to the actual directory in GCS. Below are a few configs and the code.
Any suggestions on how to speed this up would be great.

    sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode",
"dynamic")
   
sparkSession.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version",
"2")
   
sparkSession.conf.set("spark.hadoop.mapreduce.use.directfileoutputcommitter",
"true")
    sparkSession.conf.set(
      "spark.hadoop.mapred.output.committer.class",
      "org.apache.hadoop.mapred.DirectFileOutputCommitter"
    )

    sparkSession.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

    sparkSession.sparkContext.hadoopConfiguration
      .set("spark.speculation", "false")


    snapshotInGCS.write
      .option("header", "true")
      .option("emptyValue", "")
      .option("delimiter", "^")
      .option("compression", "gzip")
      .mode(SaveMode.Overwrite)
      .format("csv")
      .partitionBy("date", "id")
      .save(s"gs://${bucketName}/${folderName}")



Thank you,
SK


