Adding third party specific jars to Spark


Mich Talebzadeh

The impetus for this was developing code to access BigQuery data from PyCharm on premises, so that advanced analytics and graphics can be done locally.

Writes are an issue, as BigQuery buffers data in temporary storage on a GCS bucket before pushing it into the BigQuery database.

One option is to use Dataproc clusters for the write-intensive activities ($$$) and thereafter do the reads on premises (Linux) and locally (assuming you have a powerful enough Windows box). The issue was more with the writes.

Making this work, believe it or not, is a bit of an art, as you need to find the correct version of Spark plus the correct versions of the BigQuery JAR files that work in tandem.

Anyhow, reads and writes to BigQuery work with spark-3.0.1-bin-hadoop3.2 and the following two JAR files:

-rwxr--r--  1 hduser hadoop 33943429 Jan 12 23:30 spark-bigquery-latest_2.12.jar
-rwxr--r--  1 hduser hadoop 17663298 Jan 13 19:20 gcs-connector-hadoop3-2.2.0-shaded.jar
lrwxrwxrwx  1 hduser hadoop       38 Jan 13 19:22 gcs-connector.jar -> gcs-connector-hadoop3-2.2.0-shaded.jar
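For illustration, a minimal PySpark sketch of the kind of read/write described above might look like the following. The project, dataset, table, and bucket names are placeholders of my own, and it assumes the two JARs above are already in $SPARK_HOME/jars; it will only run against a real GCP project.

```python
from pyspark.sql import SparkSession

# Assumes spark-bigquery-latest_2.12.jar and the shaded GCS connector
# are in $SPARK_HOME/jars, as described above.
spark = SparkSession.builder \
    .appName("BigQueryReadWrite") \
    .getOrCreate()

# Read from a BigQuery table (placeholder names).
df = spark.read.format("bigquery") \
    .option("table", "my_project.my_dataset.my_table") \
    .load()

df.show(5)

# Writes are buffered in a temporary GCS bucket before being loaded
# into BigQuery, hence the temporaryGcsBucket option (placeholder bucket).
df.write.format("bigquery") \
    .option("table", "my_project.my_dataset.my_output_table") \
    .option("temporaryGcsBucket", "my-tmp-bucket") \
    .mode("append") \
    .save()
```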

For me, the option that worked was to put these two JAR files in the $SPARK_HOME/jars directory.

Adding them to spark.driver.extraClassPath in $SPARK_HOME/conf/spark-defaults.conf did not work, and using spark-submit from the PyCharm terminal with --jars introduced other issues.

So in short, I put these two files in $SPARK_HOME/jars and it worked. I am not sure this is ideal, but one advantage of this approach is that you can create a single container JAR file, spark-libs.jar:

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

and put it in an HDFS directory so that all nodes of the cluster can access it. You then need to reference it in $SPARK_HOME/conf/spark-defaults.conf.
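A sketch of that archive approach, assuming YARN; the HDFS path is a placeholder of my own, and the exact property (here spark.yarn.archive) depends on your cluster manager:

```shell
# Bundle everything in $SPARK_HOME/jars into one uncompressed archive
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

# Copy it to HDFS so every node in the cluster can reach it
# (the path /spark/jars is a placeholder)
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put spark-libs.jar /spark/jars/

# Then, in $SPARK_HOME/conf/spark-defaults.conf, point Spark at it:
#   spark.yarn.archive  hdfs:///spark/jars/spark-libs.jar
```

This saves uploading the Spark JARs on every application submission, since executors fetch the one cached archive from HDFS instead.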


If anyone has any suggestions please let me know.