Spark 3.1.1 Preliminary results (mainly to do with Spark Structured Streaming)
We upgraded a three-node cluster from Spark 3.0.1 to the newly released Spark 3.1.1. The cluster runs RHES 7.6 and uses YARN.
We tested a job that creates random data and writes to Hive 3.0. This worked fine as before.
In version 3.0.1 we could run Structured Streaming with Kafka and write to a Google BigQuery table, using PySpark.
We had an issue running the same code on Google Dataproc, which was built on Spark 3.1.1-RC2: it produced no results and was stuck on BatchId = 0. I reported this on this forum a few days ago. After upgrading our on-premise cluster to 3.1.1 today, we ran the same code on Spark 3.1.1 to populate Google BigQuery. I am pleased to say that it now works correctly!
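For readers who have not seen this pattern, here is a minimal PySpark sketch of a Kafka to BigQuery streaming job of the kind described above. It is not the code we actually ran. The broker address, topic, table, bucket and checkpoint path are all hypothetical placeholders, and it assumes the Kafka source and the spark-bigquery connector jars are already on the classpath. Each micro-batch is handed to a function through foreachBatch, which receives the batch DataFrame and the batch id.

```python
def make_bigquery_writer(table, temp_bucket):
    """Return a foreachBatch handler that appends each micro-batch to BigQuery.

    `table` ("dataset.table") and `temp_bucket` are hypothetical placeholders.
    """
    def write_batch(batch_df, batch_id):
        # Skip empty micro-batches so no empty load job is submitted.
        if len(batch_df.take(1)) == 0:
            return
        (batch_df.write
            .format("bigquery")                         # spark-bigquery connector
            .option("temporaryGcsBucket", temp_bucket)  # staging bucket for the load
            .mode("append")
            .save(table))
    return write_batch


def run_stream(bootstrap_servers="broker1:9092",          # hypothetical broker
               topic="md",                                # hypothetical topic
               table="test.ticker",                       # hypothetical table
               temp_bucket="tmp_storage_bucket",          # hypothetical bucket
               checkpoint="/tmp/checkpoints/kafka_to_bq"  # hypothetical path
               ):
    # Imported here so the helper above can be read and tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-bigquery").getOrCreate()

    # Read the topic as a stream; Kafka delivers key and value as binary.
    stream = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "latest")
        .load()
        .selectExpr("CAST(key AS STRING) AS key",
                    "CAST(value AS STRING) AS value"))

    # Hand every micro-batch to the BigQuery writer.
    query = (stream.writeStream
        .foreachBatch(make_bigquery_writer(table, temp_bucket))
        .option("checkpointLocation", checkpoint)
        .start())
    query.awaitTermination()
```

To try it, call run_stream() at the bottom of a script submitted with spark-submit. A real job would normally parse the Kafka value against a schema before writing rather than storing it as a raw string.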
The jar files needed with Spark 3.1.1 to read from Kafka and write to BigQuery are as follows:
All of these are placed under $SPARK_HOME/jars on all nodes. They are the latest available jar files.
The following are also added to $SPARK_HOME/conf/spark-defaults.conf on all nodes:
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.