Spark 3.1.1 Preliminary results (mainly to do with Spark Structured Streaming)


Mich Talebzadeh
Hi all,

We upgraded a cluster of three nodes from Spark 3.0.1 to the new Spark 3.1.1 release. The cluster runs RHES 7.6 on three nodes with YARN as the resource manager.

We tested a job that creates random data and writes to Hive 3.0. This worked fine as before.

With version 3.0.1 we could run Structured Streaming against Kafka and write to a Google BigQuery table, using PySpark.

We had an issue running the same code on Google Dataproc, which was built on Spark 3.1.1 RC2: it produced no results and was stuck on batchId = 0. I reported this in this forum a few days ago. Once we upgraded our on-premise cluster to 3.1.1 today, we ran the same code to populate Google BigQuery, and I am pleased that it is now working correctly!
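For reference, the general shape of such a job is sketched below: a minimal PySpark Structured Streaming program that reads from a Kafka topic and writes each micro-batch to BigQuery via foreachBatch. This is a sketch only, not the code discussed above; the topic name, bootstrap servers, BigQuery table, GCS bucket, and checkpoint path are all placeholder assumptions, and the exact write options depend on the spark-bigquery connector version in use.

```python
# Minimal sketch of a Kafka -> BigQuery Structured Streaming job.
# All names below (topic "md", servers, table, bucket, checkpoint path)
# are hypothetical placeholders -- substitute your own.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-bigquery").getOrCreate()

def write_to_bigquery(batch_df, batch_id):
    # The spark-bigquery connector stages each micro-batch through a
    # temporary GCS bucket before loading it into the target table.
    (batch_df.write
        .format("bigquery")
        .option("table", "test.randomdata")                  # placeholder table
        .option("temporaryGcsBucket", "tmp_storage_bucket")  # placeholder bucket
        .mode("append")
        .save())

# Read the Kafka topic as a stream; "value" arrives as bytes, so cast it.
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092,host2:9092")  # placeholder
    .option("subscribe", "md")                                   # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("value").cast("string")))

# foreachBatch lets a streaming query use the batch BigQuery writer.
query = (stream_df.writeStream
    .foreachBatch(write_to_bigquery)
    .option("checkpointLocation", "/tmp/checkpoints/md")  # placeholder path
    .start())

query.awaitTermination()
```

The foreachBatch pattern is what makes this work: the BigQuery connector exposes a batch writer, and foreachBatch hands each micro-batch to it as an ordinary DataFrame.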

The jar files needed by Spark 3.1.1 to read from Kafka and write to BigQuery are listed below.

All go under $SPARK_HOME/jars on every node; these are the latest jar files available:

  • commons-pool2-2.9.0.jar
  • spark-token-provider-kafka-0-10_2.12-3.1.0.jar
  • spark-sql-kafka-0-10_2.12-3.1.0.jar
  • kafka-clients-2.7.0.jar
  • spark-bigquery-latest_2.12.jar

The following entries are also added to $SPARK_HOME/conf/spark-defaults.conf on all nodes:

spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar

spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar
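As an alternative to copying the jars into $SPARK_HOME/jars on every node, the same dependencies can be supplied per job at submit time. A hedged sketch follows; the Maven coordinate matches the Spark 3.1.x Kafka connector, but the BigQuery connector jar path and the script name are placeholder assumptions:

```shell
# Sketch: supply the Kafka connector via --packages (pulls kafka-clients
# and commons-pool2 transitively) and the BigQuery connector via --jars.
# The jar path and script name are placeholders.
spark-submit \
  --master yarn \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  --jars /opt/jars/spark-bigquery-latest_2.12.jar \
  kafka_to_bigquery.py
```

Dependencies supplied this way are distributed to the executors automatically, so the spark-defaults.conf classpath entries above are not needed for them.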




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.