jar incompatibility with Spark 3.1.1 for structured streaming with kafka


jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh

Hi,


Could someone test the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark? It throws:


java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


However, the previous version, spark-sql-kafka-0-10_2.12-3.0.1.jar, works fine.


Thanks




Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
Since you haven't shared many details, I presume you've updated only the spark-sql-kafka jar.
KafkaTokenUtil is in the token provider jar.
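
If it helps, which jar actually provides the class can be confirmed with the JDK jar tool, e.g. (a sketch; run it against the jars that are actually on your classpath):

# the class should show up in the token provider jar, not in the sql-kafka jar that calls it
jar tf spark-token-provider-kafka-0-10_2.12-3.1.1.jar | grep KafkaTokenUtil
jar tf spark-sql-kafka-0-10_2.12-3.1.1.jar | grep KafkaTokenUtil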

As a general note, if I'm right, please update Spark as a whole on all nodes rather than updating jars independently.

BR,
G



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
I've just had a deeper look at the possible issue and here are my findings:
* In 3.0.1 KafkaTokenUtil.needTokenUpdate has 3 params
* In 3.1.1 KafkaTokenUtil.needTokenUpdate has 2 params
* I've decompiled spark-token-provider-kafka-0-10_2.12-3.1.1.jar and KafkaTokenUtil.needTokenUpdate has 2 params in its definition
* I've decompiled spark-sql-kafka-0-10_2.12-3.1.1.jar and KafkaTokenUtil.needTokenUpdate is called only from getOrRetrieveConsumer with 2 params
* The types and number of params match within the 3.1.1 jars
* We have a delegation token integration test which compiled and passed (just checked it)

If you think there is an issue in the code and/or packaging, please open a JIRA with more details.
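
For anyone who wants to verify the signatures themselves without a decompiler, javap can print them, e.g. (a sketch; point it at whichever copy of the jar ends up on your classpath):

# print the methods of the Scala object and inspect needTokenUpdate's parameter list
javap -classpath spark-token-provider-kafka-0-10_2.12-3.1.1.jar 'org.apache.spark.kafka010.KafkaTokenUtil$' | grep needTokenUpdate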

BR,
G



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
In reply to this post by Gabor Somogyi
Thanks Gabor.

All nodes are running Spark from spark-3.1.1-bin-hadoop3.2.

So $SPARK_HOME/jars contains all the required jars on all nodes, including commons-pool2-2.9.0.jar.

They are installed identically on all nodes. 
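
For what it's worth, one way to sanity-check that is a sketch like the following (host names are illustrative and SPARK_HOME is assumed to be set in the remote shell):

for h in node1 node2 node3; do
  echo "== $h =="
  ssh "$h" 'ls $SPARK_HOME/jars | grep -Ei "kafka|pool2"'
done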

I have looked at the Spark environment for the classpath. I still don't see why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar but works OK with spark-sql-kafka-0-10_2.12-3.1.0.jar.

Anyway, I unzipped the tarball for Spark 3.1.1 and it does not even contain spark-sql-kafka-0-10_2.12-3.0.1.jar.

I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then I checked Maven for a newer version, which pointed to spark-sql-kafka-0-10_2.12-3.1.1.jar.

So, to confirm, Spark out of the tarball does not ship any:

ltr spark-sql-kafka-*
ls: cannot access spark-sql-kafka-*: No such file or directory


For Spark Structured Streaming (SSS), I had to add these:

   - commons-pool2-2.9.0.jar (the one shipped is commons-pool-1.5.4.jar!)
   - kafka-clients-2.7.0.jar (none shipped)
   - spark-sql-kafka-0-10_2.12-3.0.1.jar (none shipped)
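
To see which Kafka-related jars (and versions) now sit in the distribution after these additions, something like this works (a sketch):

ls $SPARK_HOME/jars | grep -Ei 'kafka|pool2'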

I gather from your second mail that there seems to be an issue with spark-sql-kafka-0-10_2.12-3.1.1.jar?

HTH







Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

srowen
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.
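
In practice that means supplying the connector on submit rather than editing $SPARK_HOME/jars; roughly something like this (a sketch, reusing jar names and versions mentioned elsewhere in this thread):

# let spark-submit resolve the connector that matches the running Spark version
spark-submit --master yarn --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  xyz.py

# or pass explicit local jars; all of them must match the Spark version in use
spark-submit --master yarn --deploy-mode client \
  --jars spark-sql-kafka-0-10_2.12-3.1.1.jar,spark-token-provider-kafka-0-10_2.12-3.1.1.jar,kafka-clients-2.6.0.jar,commons-pool2-2.6.2.jar \
  xyz.py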


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
OK thanks for that.

I am using spark-submit with PySpark as follows:

 spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:33:19Z


spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 xyz.py

(enabling the virtual environment)


That works fine in client mode with any job that does not do structured streaming.


Running on the local node with:


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 xyz.py


works fine with the same Spark version and $SPARK_HOME/jars.


Cheers







Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how a Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I don't see the --packages option...
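
For reference, the deploying section boils down to passing the connector coordinates at submit time, e.g. (a sketch, with the version substituted for this thread):

./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 ...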

G



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
Hi G

Thanks for the heads-up.

In a thread on the 3rd of March I reported that 3.1.1 works in YARN mode.


From that mail:

The jar files needed for version 3.1.1 to read from Kafka and write to BigQuery are as follows:

All under $SPARK_HOME/jars on all nodes. These are the latest available jar files:


   - commons-pool2-2.9.0.jar
   - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
   - spark-sql-kafka-0-10_2.12-3.1.0.jar
   - kafka-clients-2.7.0.jar
   - spark-bigquery-latest_2.12.jar


I just tested it, and in local mode (single JVM) it works fine without adding --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1, BUT with all the above jar files included:

Batch: 17
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
|8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
|8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
|138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
|e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
|0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
|74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
|1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
|1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
|229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
+--------------------+------+-------------------+------+

However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and include the --packages option as suggested in the link:


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

It cannot fetch the data:

root
 |-- parsed_value: struct (nullable = true)
 |    |-- rowkey: string (nullable = true)
 |    |-- ticker: string (nullable = true)
 |    |-- timeissued: timestamp (nullable = true)
 |    |-- price: float (nullable = true)

{'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+----------+-----+
|rowkey|ticker|timeissued|price|
+------+------+----------+-----+
+------+------+----------+-----+

2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


Now I deleted the ~/.ivy2 directory and ran the job again:

Ivy Default Cache set to: /home/hduser/.ivy2/cache
The jars for the packages stored in: /home/hduser/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
 
Let us have a look at the .ivy2/jars directory:

 /home/hduser/.ivy2/jars> ltr
total 13108
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .

Strangely, jar files like org.apache.kafka_kafka-clients-2.6.0.jar and org.apache.commons_commons-pool2-2.6.2.jar seem to be out of date.

Very confusing. It sounds like something has changed in the cluster: as reported on 3rd March it used to work with those jar files, and now it does not.

So, in summary, without those jar files added to $SPARK_HOME/jars it fails completely, even with the --packages option added.
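
One way to see whether the two sources of jars are fighting each other is to compare what --packages resolved into the ivy cache with what sits in the distribution, e.g. (a sketch):

# if both supply the Kafka connector at different versions, the classpath ends up mixed
ls ~/.ivy2/jars | grep -Ei 'kafka|pool2'
ls $SPARK_HOME/jars | grep -Ei 'kafka|pool2'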

Cheers






Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

srowen
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
Fine.

Just to clarify, please:

With SBT assembly and Scala I would create an uber jar file and use it with spark-submit.

As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode by directly using a .py file?

Hence:

spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>
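
For completeness, JVM-side connector jars come in via --packages (or --jars), while extra Python modules the job imports can be shipped alongside with --py-files, e.g. (a sketch; deps.zip is a hypothetical archive of Python dependencies):

spark-submit --master local[4] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  --py-files deps.zip \
  <python_file>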









 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
OK we found out the root cause of this issue.

We were writing to Redis from Spark and had downloaded a recently compiled version of the Redis jar built with Scala 2.12.

spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar

It was giving grief. We removed that one. So the job runs with either

spark-sql-kafka-0-10_2.12-3.1.0.jar

or as packages through

spark-submit .. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1

We will follow the solution suggested in the doc.
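
For reference, a minimal sketch of the submit described in that doc page (xyz.py is the app name used earlier in this thread; the master and deploy-mode settings are just examples):

# let spark-submit resolve the Kafka connector and its transitive deps (token provider,
# kafka-clients, commons-pool2) at launch time instead of copying jars into $SPARK_HOME/jars
spark-submit \
  --master yarn --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  xyz.py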


Batch: 18

-------------------------------------------

+--------------------+------+-------------------+------+

|              rowkey|ticker|         timeissued| price|

+--------------------+------+-------------------+------+

|b539cb54-3ddd-47c...|  ORCL|2021-04-06 20:53:37| 41.32|

|2d4bae2d-649e-4b8...|   VOD|2021-04-06 20:53:37|317.48|

|2f51f188-6da4-4bb...|   MKS|2021-04-06 20:53:37|376.63|

|1a4c4645-8dc7-4ef...|    BP|2021-04-06 20:53:37| 571.5|

|45c9e738-ead7-4e5...|  SBRY|2021-04-06 20:53:37|244.76|

|48f93c13-43ad-422...|   SAP|2021-04-06 20:53:37| 58.71|

|ed4d89b1-7fc1-420...|   IBM|2021-04-06 20:53:37|105.91|

|44b3f0ce-27b8-4a9...|   MRW|2021-04-06 20:53:37|297.85|

|4441b0b5-32c1-4cb...|  MSFT|2021-04-06 20:53:37| 27.83|

|143398a4-13b5-494...|  TSCO|2021-04-06 20:53:37|183.42|

+--------------------+------+-------------------+------+


Now we need to go back to the drawing board and see how to integrate Redis 



Thanks


Mich


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:52, Mich Talebzadeh <[hidden email]> wrote:
Fine.

Just to clarify please.

With SBT assembly and Scala I would create an uber jar file and use that one with spark-submit.

As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode by passing a .py file directly?

Hence:

spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>
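
For completeness, a minimal sketch of the same submit when extra Python modules also need to be shipped alongside the .py file (app.py, deps.zip and mypkg are placeholder names, not from this thread):

# bundle local Python packages and pass them along with the driver script
zip -r deps.zip mypkg/
spark-submit --master local[4] \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  --py-files deps.zip \
  app.py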





   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:39, Sean Owen <[hidden email]> wrote:
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.
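
For the SBT assembly route mentioned above, a minimal sketch (project layout and class name are placeholders; it assumes sbt-assembly is configured, with Spark itself marked "provided" and spark-sql-kafka as a normal compile dependency):

# build an uber jar containing the app and the Kafka connector, then submit it
sbt assembly
spark-submit --master yarn --deploy-mode client \
  --class com.example.StreamingApp \
  target/scala-2.12/streaming-app-assembly-0.1.jar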

On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <[hidden email]> wrote:
Hi G

Thanks for the heads-up.

In a thread on 3rd of March I reported that 3.1.1 works in yarn mode 


From that mail

The needed jar files for version 3.1.1 to read from Kafka and write to
BigQuery for 3.1.1 are as follows:

All under $SPARK_HOME/jars on all nodes. These are the latest available jar
files


   - commons-pool2-2.9.0.jar
   - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
   - spark-sql-kafka-0-10_2.12-3.1.0.jar
   - kafka-clients-2.7.0.jar
   - spark-bigquery-latest_2.12.jar


I just tested it and in local mode single JVM it works fine without the addition of package --> --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
 BUT including all the above jars files

Batch: 17
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
|8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
|8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
|138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
|e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
|0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
|74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
|1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
|1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
|229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
+--------------------+------+-------------------+------+

However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and include the packages as suggested in the link


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

It cannot fetch the data

root
 |-- parsed_value: struct (nullable = true)
 |    |-- rowkey: string (nullable = true)
 |    |-- ticker: string (nullable = true)
 |    |-- timeissued: timestamp (nullable = true)
 |    |-- price: float (nullable = true)

{'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+----------+-----+
|rowkey|ticker|timeissued|price|
+------+------+----------+-----+
+------+------+----------+-----+

2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


Now I deleted ~/.ivy2 directory and ran the job again

Ivy Default Cache set to: /home/hduser/.ivy2/cache
The jars for the packages stored in: /home/hduser/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
 
let us go and have a look at the directory .ivy2/jars

 /home/hduser/.ivy2/jars> ltr
total 13108
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .

Strangely, some of these jar files, such as org.apache.kafka_kafka-clients-2.6.0.jar and org.apache.commons_commons-pool2-2.6.2.jar, appear to be older versions.

Very confusing. It sounds as though something has changed in the cluster: as reported on 3rd March it used to work with those jar files, but now it does not.

So, in summary, without those jar files added to $SPARK_HOME/jars it fails completely, even with the packages added.

Cheers


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <[hidden email]> wrote:
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I don't see the --packages option...

G


On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <[hidden email]> wrote:
OK thanks for that.

I am using spark-submit with PySpark as follows

 spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:33:19Z


spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 xyz.py

enabling with virtual environment


That works fine with any job that does not do structured streaming in a client mode.


Running on local  node with 


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 xyz.py


works fine with the same spark version and $SPARK_HOME/jars 


Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 13:20, Sean Owen <[hidden email]> wrote:
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.

On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks Gabor.

All nodes are running Spark /spark-3.1.1-bin-hadoop3.2

So $SPARK_HOME/jars contains all the required jars on all nodes including the jar file commons-pool2-2.9.0.jar as well.

They are installed identically on all nodes. 

I have looked at the Spark environment for classpath. Still I don't see the reason why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar 
but works ok with  spark-sql-kafka-0-10_2.12-3.1.0.jar 

Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then I enquired the availability of new version from Maven that pointed to spark-sql-kafka-0-10_2.12-3.1.1.jar

So to confirm Spark out of the tarball does not have any

ltr spark-sql-kafka-*
ls: cannot access spark-sql-kafka-*: No such file or directory


For SSS, I had to add these

add commons-pool2-2.9.0.jar. The one shipped is commons-pool-1.5.4.jar!

add kafka-clients-2.7.0.jar  Did not have any

add  spark-sql-kafka-0-10_2.12-3.0.1.jar  Did not have any

I gather from your second mail, there seems to be an issue with spark-sql-kafka-0-10_2.12-3.1.1.jar ?

HTH



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <[hidden email]> wrote:
Since you've not shared too much details I presume you've updated the spark-sql-kafka jar only.
KafkaTokenUtil is in the token provider jar.

As a general note if I'm right, please update Spark as a whole on all nodes and not just jars independently.

BR,
G


On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <[hidden email]> wrote:

Hi,


Any chance of someone testing  the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws 


java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar works fine


Thanks


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
Good to hear it's working.
Happy Spark usage.

G


On Tue, 6 Apr 2021, 21:56 Mich Talebzadeh, <[hidden email]> wrote:
OK we found out the root cause of this issue.

We were writing to Redis from Spark and downloaded a recently compiled version of Redis jar with scala 2.12. 

spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar

It was giving grief. We removed that one. So the job runs with either

spark-sql-kafka-0-10_2.12-3.1.0.jar

or as packages through

spark-submit .. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1

We will follow the suggested solution as per doc


Batch: 18

-------------------------------------------

+--------------------+------+-------------------+------+

|              rowkey|ticker|         timeissued| price|

+--------------------+------+-------------------+------+

|b539cb54-3ddd-47c...|  ORCL|2021-04-06 20:53:37| 41.32|

|2d4bae2d-649e-4b8...|   VOD|2021-04-06 20:53:37|317.48|

|2f51f188-6da4-4bb...|   MKS|2021-04-06 20:53:37|376.63|

|1a4c4645-8dc7-4ef...|    BP|2021-04-06 20:53:37| 571.5|

|45c9e738-ead7-4e5...|  SBRY|2021-04-06 20:53:37|244.76|

|48f93c13-43ad-422...|   SAP|2021-04-06 20:53:37| 58.71|

|ed4d89b1-7fc1-420...|   IBM|2021-04-06 20:53:37|105.91|

|44b3f0ce-27b8-4a9...|   MRW|2021-04-06 20:53:37|297.85|

|4441b0b5-32c1-4cb...|  MSFT|2021-04-06 20:53:37| 27.83|

|143398a4-13b5-494...|  TSCO|2021-04-06 20:53:37|183.42|

+--------------------+------+-------------------+------+


Now we need to go back to the drawing board and see how to integrate Redis 



Thanks


Mich


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:52, Mich Talebzadeh <[hidden email]> wrote:
Fine.

Just to clarify please.

With SBT assembly and Scala I would create an uber jar file and use that one with spark-submit.

As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode by passing a .py file directly?

Hence:

spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>





   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:39, Sean Owen <[hidden email]> wrote:
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.

On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <[hidden email]> wrote:
Hi G

Thanks for the heads-up.

In a thread on 3rd of March I reported that 3.1.1 works in yarn mode 


From that mail

The needed jar files for version 3.1.1 to read from Kafka and write to
BigQuery for 3.1.1 are as follows:

All under $SPARK_HOME/jars on all nodes. These are the latest available jar
files


   - commons-pool2-2.9.0.jar
   - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
   - spark-sql-kafka-0-10_2.12-3.1.0.jar
   - kafka-clients-2.7.0.jar
   - spark-bigquery-latest_2.12.jar


I just tested it and in local mode single JVM it works fine without the addition of package --> --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
 BUT including all the above jars files

Batch: 17
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
|8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
|8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
|138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
|e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
|0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
|74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
|1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
|1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
|229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
+--------------------+------+-------------------+------+

However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and include the packages as suggested in the link


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

It cannot fetch the data

root
 |-- parsed_value: struct (nullable = true)
 |    |-- rowkey: string (nullable = true)
 |    |-- ticker: string (nullable = true)
 |    |-- timeissued: timestamp (nullable = true)
 |    |-- price: float (nullable = true)

{'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+----------+-----+
|rowkey|ticker|timeissued|price|
+------+------+----------+-----+
+------+------+----------+-----+

2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


Now I deleted ~/.ivy2 directory and ran the job again

Ivy Default Cache set to: /home/hduser/.ivy2/cache
The jars for the packages stored in: /home/hduser/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
 
let us go and have a look at the directory .ivy2/jars

 /home/hduser/.ivy2/jars> ltr
total 13108
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .

Strangely, some of these jar files, such as org.apache.kafka_kafka-clients-2.6.0.jar and org.apache.commons_commons-pool2-2.6.2.jar, appear to be older versions.

Very confusing. It sounds as though something has changed in the cluster: as reported on 3rd March it used to work with those jar files, but now it does not.

So, in summary, without those jar files added to $SPARK_HOME/jars it fails completely, even with the packages added.

Cheers


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <[hidden email]> wrote:
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I don't see the --packages option...

G


On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <[hidden email]> wrote:
OK thanks for that.

I am using spark-submit with PySpark as follows

 spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:33:19Z


spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 xyz.py

enabling with virtual environment


That works fine with any job that does not do structured streaming in a client mode.


Running on local  node with 


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 xyz.py


works fine with the same spark version and $SPARK_HOME/jars 


Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 13:20, Sean Owen <[hidden email]> wrote:
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.

On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks Gabor.

All nodes are running Spark /spark-3.1.1-bin-hadoop3.2

So $SPARK_HOME/jars contains all the required jars on all nodes including the jar file commons-pool2-2.9.0.jar as well.

They are installed identically on all nodes. 

I have looked at the Spark environment for classpath. Still I don't see the reason why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar 
but works ok with  spark-sql-kafka-0-10_2.12-3.1.0.jar 

Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then I enquired the availability of new version from Maven that pointed to spark-sql-kafka-0-10_2.12-3.1.1.jar

So to confirm Spark out of the tarball does not have any

ltr spark-sql-kafka-*
ls: cannot access spark-sql-kafka-*: No such file or directory


For SSS, I had to add these

add commons-pool2-2.9.0.jar. The one shipped is commons-pool-1.5.4.jar!

add kafka-clients-2.7.0.jar  Did not have any

add  spark-sql-kafka-0-10_2.12-3.0.1.jar  Did not have any

I gather from your second mail, there seems to be an issue with spark-sql-kafka-0-10_2.12-3.1.1.jar ?

HTH



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <[hidden email]> wrote:
Since you've not shared too much details I presume you've updated the spark-sql-kafka jar only.
KafkaTokenUtil is in the token provider jar.

As a general note if I'm right, please update Spark as a whole on all nodes and not just jars independently.

BR,
G


On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <[hidden email]> wrote:

Hi,


Any chance of someone testing  the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws 


java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar works fine


Thanks


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
Hi Gabor et. al.,

To be honest, I am not convinced that the package (--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1) is really working!

I know for certain that spark-sql-kafka-0-10_2.12-3.1.0.jar works fine. I reported the package as working before because under $SPARK_HOME/jars on all nodes there was a copy of the 3.0.1 jar file. Also, in $SPARK_HOME/conf we had the following entries:

spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar

So the jar file was picked up first anyway.
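
One way to check which jar actually supplies the class on a given node is a quick sketch along these lines (the paths are simply the ones mentioned in this thread):

# list the local jars that contain KafkaTokenUtil
for j in $SPARK_HOME/jars/*.jar ~/.ivy2/jars/*.jar; do
  unzip -l "$j" 2>/dev/null | grep -q 'org/apache/spark/kafka010/KafkaTokenUtil' && echo "$j"
done
# then inspect the needTokenUpdate signature in a candidate jar
javap -classpath ~/.ivy2/jars/org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar \
  'org.apache.spark.kafka010.KafkaTokenUtil$' | grep needTokenUpdate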

The concern I have is that the package pulls in older versions of some jar files, namely the following in .ivy2/jars:

-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar


So I am not sure. Hence I would like someone to verify this independently, in anger.
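
One way to make that check reproducible is to force --packages to resolve into a fresh Ivy directory, so that nothing already cached under ~/.ivy2 can interfere, and then list what was actually resolved. A sketch (the /tmp path is arbitrary, xyz.py as before):

# resolve into an empty ivy dir and confirm both kafka artifacts come back as 3.1.1
spark-submit --master local[4] \
  --conf spark.jars.ivy=/tmp/ivy-check \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  xyz.py
ls -l /tmp/ivy-check/jars | grep -E 'spark-(sql|token-provider)-kafka'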

Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Wed, 7 Apr 2021 at 07:51, Gabor Somogyi <[hidden email]> wrote:
Good to hear it's working.
Happy Spark usage.

G


On Tue, 6 Apr 2021, 21:56 Mich Talebzadeh, <[hidden email]> wrote:
OK we found out the root cause of this issue.

We were writing to Redis from Spark and downloaded a recently compiled version of Redis jar with scala 2.12. 

spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar

It was giving grief. We removed that one. So the job runs with either

spark-sql-kafka-0-10_2.12-3.1.0.jar

or as packages through

spark-submit .. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1

We will follow the suggested solution as per doc


Batch: 18

-------------------------------------------

+--------------------+------+-------------------+------+

|              rowkey|ticker|         timeissued| price|

+--------------------+------+-------------------+------+

|b539cb54-3ddd-47c...|  ORCL|2021-04-06 20:53:37| 41.32|

|2d4bae2d-649e-4b8...|   VOD|2021-04-06 20:53:37|317.48|

|2f51f188-6da4-4bb...|   MKS|2021-04-06 20:53:37|376.63|

|1a4c4645-8dc7-4ef...|    BP|2021-04-06 20:53:37| 571.5|

|45c9e738-ead7-4e5...|  SBRY|2021-04-06 20:53:37|244.76|

|48f93c13-43ad-422...|   SAP|2021-04-06 20:53:37| 58.71|

|ed4d89b1-7fc1-420...|   IBM|2021-04-06 20:53:37|105.91|

|44b3f0ce-27b8-4a9...|   MRW|2021-04-06 20:53:37|297.85|

|4441b0b5-32c1-4cb...|  MSFT|2021-04-06 20:53:37| 27.83|

|143398a4-13b5-494...|  TSCO|2021-04-06 20:53:37|183.42|

+--------------------+------+-------------------+------+


Now we need to go back to the drawing board and see how to integrate Redis 



Thanks


Mich


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:52, Mich Talebzadeh <[hidden email]> wrote:
Fine.

Just to clarify please.

With SBT assembly and Scala I would create an uber jar file and use that one with spark-submit.

As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode by passing a .py file directly?

Hence:

spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>





   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:39, Sean Owen <[hidden email]> wrote:
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.

On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <[hidden email]> wrote:
Hi G

Thanks for the heads-up.

In a thread on 3rd of March I reported that 3.1.1 works in yarn mode 


From that mail

The needed jar files for version 3.1.1 to read from Kafka and write to
BigQuery for 3.1.1 are as follows:

All under $SPARK_HOME/jars on all nodes. These are the latest available jar
files


   - commons-pool2-2.9.0.jar
   - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
   - spark-sql-kafka-0-10_2.12-3.1.0.jar
   - kafka-clients-2.7.0.jar
   - spark-bigquery-latest_2.12.jar


I just tested it and in local mode single JVM it works fine without the addition of package --> --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
 BUT including all the above jars files

Batch: 17
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
|8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
|8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
|138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
|e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
|0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
|74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
|1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
|1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
|229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
+--------------------+------+-------------------+------+

However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and include the packages as suggested in the link


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

It cannot fetch the data

root
 |-- parsed_value: struct (nullable = true)
 |    |-- rowkey: string (nullable = true)
 |    |-- ticker: string (nullable = true)
 |    |-- timeissued: timestamp (nullable = true)
 |    |-- price: float (nullable = true)

{'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+----------+-----+
|rowkey|ticker|timeissued|price|
+------+------+----------+-----+
+------+------+----------+-----+

2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


Now I deleted ~/.ivy2 directory and ran the job again

Ivy Default Cache set to: /home/hduser/.ivy2/cache
The jars for the packages stored in: /home/hduser/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
 
let us go and have a look at the directory .ivy2/jars

 /home/hduser/.ivy2/jars> ltr
total 13108
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .

Strangely, some of these jar files, such as org.apache.kafka_kafka-clients-2.6.0.jar and org.apache.commons_commons-pool2-2.6.2.jar, appear to be older versions.

Very confusing. It sounds as though something has changed in the cluster: as reported on 3rd March it used to work with those jar files, but now it does not.

So, in summary, without those jar files added to $SPARK_HOME/jars it fails completely, even with the packages added.

Cheers


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <[hidden email]> wrote:
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I don't see the --packages option...

G


On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <[hidden email]> wrote:
OK thanks for that.

I am using spark-submit with PySpark as follows

 spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:33:19Z


spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 xyz.py

enabling with virtual environment


That works fine with any job that does not do structured streaming in a client mode.


Running on local  node with 


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 xyz.py


works fine with the same spark version and $SPARK_HOME/jars 


Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 13:20, Sean Owen <[hidden email]> wrote:
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.

On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks Gabor.

All nodes are running Spark /spark-3.1.1-bin-hadoop3.2

So $SPARK_HOME/jars contains all the required jars on all nodes including the jar file commons-pool2-2.9.0.jar as well.

They are installed identically on all nodes. 

I have looked at the Spark environment for classpath. Still I don't see the reason why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar 
but works ok with  spark-sql-kafka-0-10_2.12-3.1.0.jar 

Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then I enquired the availability of new version from Maven that pointed to spark-sql-kafka-0-10_2.12-3.1.1.jar

So to confirm Spark out of the tarball does not have any

ltr spark-sql-kafka-*
ls: cannot access spark-sql-kafka-*: No such file or directory


For SSS, I had to add these

add commons-pool2-2.9.0.jar. The one shipped is commons-pool-1.5.4.jar!

add kafka-clients-2.7.0.jar  Did not have any

add  spark-sql-kafka-0-10_2.12-3.0.1.jar  Did not have any

I gather from your second mail, there seems to be an issue with spark-sql-kafka-0-10_2.12-3.1.1.jar ?

HTH



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <[hidden email]> wrote:
Since you've not shared too much details I presume you've updated the spark-sql-kafka jar only.
KafkaTokenUtil is in the token provider jar.

As a general note if I'm right, please update Spark as a whole on all nodes and not just jars independently.

BR,
G


On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <[hidden email]> wrote:

Hi,


Any chance of someone testing  the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws 


java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar works fine


Thanks


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
Not sure what you mean by "not working". You've added 3.1.1 to --packages, which uses:

I think it is worth an end-to-end dependency-tree analysis of what is really happening on the cluster...
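
One way to get that picture offline is to ask Maven for the tree of the exact artifact being pulled in; a sketch using a throw-away POM (assumes Maven is installed; the /tmp path is arbitrary):

# minimal throw-away project just to print the transitive dependencies of the 3.1.1 connector
mkdir -p /tmp/deptree && cd /tmp/deptree
cat > pom.xml <<'EOF'
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>check</groupId>
  <artifactId>deptree</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
      <version>3.1.1</version>
    </dependency>
  </dependencies>
</project>
EOF
mvn dependency:tree

If the tree shows kafka-clients 2.6.0 and commons-pool2 2.6.2, those are the versions the 3.1.1 artifact declares, not stale downloads.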

G


On Wed, Apr 7, 2021 at 11:11 AM Mich Talebzadeh <[hidden email]> wrote:
Hi Gabor et. al.,

To be honest I am not convinced this package --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 is really working!

I know for certain that spark-sql-kafka-0-10_2.12-3.1.0.jar works fine. I reported the package as working before because under $SPARK_HOME/jars on all nodes there was a copy of the 3.0.1 jar file. Also, in $SPARK_HOME/conf we had the following entries:

spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar

So the jar file was picked up first anyway.

The concern I have is that the package pulls in older versions of some jar files, namely the following in .ivy2/jars:

-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar


So I am not sure. Hence I want someone to verify this independently in anger

Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Wed, 7 Apr 2021 at 07:51, Gabor Somogyi <[hidden email]> wrote:
Good to hear it's working.
Happy Spark usage.

G


On Tue, 6 Apr 2021, 21:56 Mich Talebzadeh, <[hidden email]> wrote:
OK we found out the root cause of this issue.

We were writing to Redis from Spark and downloaded a recently compiled version of Redis jar with scala 2.12. 

spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar

That jar was causing the grief. Once we removed it, the job runs with either

spark-sql-kafka-0-10_2.12-3.1.0.jar

or as packages through

spark-submit .. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1

We will follow the solution suggested in the documentation.


Batch: 18
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|b539cb54-3ddd-47c...|  ORCL|2021-04-06 20:53:37| 41.32|
|2d4bae2d-649e-4b8...|   VOD|2021-04-06 20:53:37|317.48|
|2f51f188-6da4-4bb...|   MKS|2021-04-06 20:53:37|376.63|
|1a4c4645-8dc7-4ef...|    BP|2021-04-06 20:53:37| 571.5|
|45c9e738-ead7-4e5...|  SBRY|2021-04-06 20:53:37|244.76|
|48f93c13-43ad-422...|   SAP|2021-04-06 20:53:37| 58.71|
|ed4d89b1-7fc1-420...|   IBM|2021-04-06 20:53:37|105.91|
|44b3f0ce-27b8-4a9...|   MRW|2021-04-06 20:53:37|297.85|
|4441b0b5-32c1-4cb...|  MSFT|2021-04-06 20:53:37| 27.83|
|143398a4-13b5-494...|  TSCO|2021-04-06 20:53:37|183.42|
+--------------------+------+-------------------+------+


Now we need to go back to the drawing board and see how to integrate Redis 
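
One possible way back in is to keep Redis out of the streaming sink itself and write each micro-batch with foreachBatch. A rough sketch, assuming the spark-redis connector jar is back on the classpath and exposes the "org.apache.spark.sql.redis" source (the table name and the rate source below are only illustrative stand-ins, not taken from our job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# stand-in for the streaming DataFrame of rowkey/ticker/timeissued/price
streaming_df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

def write_to_redis(batch_df, batch_id):
    # spark-redis writes each micro-batch like an ordinary batch DataFrame
    (batch_df.write
        .format("org.apache.spark.sql.redis")   # connector source name, check its docs
        .option("table", "market_data")         # illustrative Redis key prefix
        .mode("append")
        .save())

query = streaming_df.writeStream.foreachBatch(write_to_redis).start()
query.awaitTermination()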



Thanks


Mich


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:52, Mich Talebzadeh <[hidden email]> wrote:
Fine.

Just to clarify please.

With sbt assembly and Scala I would create an uber jar file and use that with spark-submit.

As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode, directly passing a .py file?

Hence:

spark-submit --master local[4] --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>
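
For context, the py file here is a structured streaming job that, roughly speaking, does something like the following. This is only a sketch: the broker and topic names are illustrative, and the column names are the ones shown in the batch output elsewhere in this thread:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, TimestampType, FloatType

spark = SparkSession.builder.appName("md-streaming").getOrCreate()

schema = (StructType()
          .add("rowkey", StringType())
          .add("ticker", StringType())
          .add("timeissued", TimestampType())
          .add("price", FloatType()))

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafkahost:9092")   # illustrative broker
      .option("subscribe", "md")                             # illustrative topic
      .load()
      .select(from_json(col("value").cast("string"), schema).alias("parsed_value")))

query = (df.select("parsed_value.*")
         .writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()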





   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 17:39, Sean Owen <[hidden email]> wrote:
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.

On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <[hidden email]> wrote:
Hi G

Thanks for the heads-up.

In a thread on the 3rd of March I reported that 3.1.1 works in yarn mode.


From that mail:

The jar files needed for version 3.1.1 to read from Kafka and write to BigQuery are as follows:

All are under $SPARK_HOME/jars on all nodes. These are the latest available jar files:


   - commons-pool2-2.9.0.jar
   - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
   - spark-sql-kafka-0-10_2.12-3.1.0.jar
   - kafka-clients-2.7.0.jar
   - spark-bigquery-latest_2.12.jar


I just tested it, and in local mode (single JVM) it works fine without adding the package --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1, BUT with all of the above jar files included.

Batch: 17
-------------------------------------------
+--------------------+------+-------------------+------+
|              rowkey|ticker|         timeissued| price|
+--------------------+------+-------------------+------+
|54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
|8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
|8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
|138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
|e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
|0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
|74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
|1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
|1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
|229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
+--------------------+------+-------------------+------+

However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar and include the package as suggested in the link:


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

It cannot fetch the data

root
 |-- parsed_value: struct (nullable = true)
 |    |-- rowkey: string (nullable = true)
 |    |-- ticker: string (nullable = true)
 |    |-- timeissued: timestamp (nullable = true)
 |    |-- price: float (nullable = true)

{'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+----------+-----+
|rowkey|ticker|timeissued|price|
+------+------+----------+-----+
+------+------+----------+-----+

2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
        at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
        at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
        at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


Now I deleted ~/.ivy2 directory and ran the job again

Ivy Default Cache set to: /home/hduser/.ivy2/cache
The jars for the packages stored in: /home/hduser/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
 
let us go and have a look at the directory .ivy2/jars

 /home/hduser/.ivy2/jars> ltr
total 13108
-rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
-rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
-rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
-rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
-rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
-rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
-rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
-rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
-rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .

Strangely, jar files such as org.apache.kafka_kafka-clients-2.6.0.jar and org.apache.commons_commons-pool2-2.6.2.jar seem to be out of date.

Very confusing. It sounds as if something has changed in the cluster: as reported on 3rd March it used to work with those jar files, and now it does not.

So, in summary, without those jar files added to $SPARK_HOME/jars it fails completely, even with the packages added.
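
To take some of the guesswork out, the running job itself can report which kafka-clients it actually ended up with. A small sketch (it assumes kafka-clients is on the driver classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
# AppInfoParser ships inside kafka-clients and reports its own version
print("kafka-clients in use:", jvm.org.apache.kafka.common.utils.AppInfoParser.getVersion())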

Cheers


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <[hidden email]> wrote:
> Anyway I unzipped the tarball for Spark-3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even

Please see how a Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
I don't see the --packages option...
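
For what it is worth, the --packages flag from that page also has a configuration equivalent, spark.jars.packages, which can be set when the session is built. A sketch (this only takes effect reliably when the script itself starts the JVM, for example when run with plain python; once spark-submit has launched the JVM, the command-line flag is the one that counts):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-sss")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
         .getOrCreate())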

G


On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <[hidden email]> wrote:
OK thanks for that.

I am using spark-submit with PySpark as follows

 spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:33:19Z


spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 xyz.py

enabling with virtual environment


That works fine in client mode for any job that does not do structured streaming.


Running on local  node with 


spark-submit --master local[4] --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 xyz.py


works fine with the same spark version and $SPARK_HOME/jars 


Cheers



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 13:20, Sean Owen <[hidden email]> wrote:
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
You do not in general modify the Spark libs. You need to package libs like this with your app at the correct version.

On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks Gabor.

All nodes are running Spark /spark-3.1.1-bin-hadoop3.2

So $SPARK_HOME/jars contains all the required jars on all nodes, including commons-pool2-2.9.0.jar.

They are installed identically on all nodes. 

I have looked at the Spark environment for the classpath. I still don't see why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar but works fine with spark-sql-kafka-0-10_2.12-3.1.0.jar.

Anyway, I unzipped the tarball for Spark 3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar in it at all.

I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. I then checked Maven for the availability of a newer version, which pointed to spark-sql-kafka-0-10_2.12-3.1.1.jar.

So to confirm, Spark straight out of the tarball does not have any:

ltr spark-sql-kafka-*
ls: cannot access spark-sql-kafka-*: No such file or directory


For SSS, I had to add these:

  • commons-pool2-2.9.0.jar (the one shipped is commons-pool-1.5.4.jar!)
  • kafka-clients-2.7.0.jar (the tarball did not ship one)
  • spark-sql-kafka-0-10_2.12-3.0.1.jar (the tarball did not ship one)

I gather from your second mail that there seems to be an issue with spark-sql-kafka-0-10_2.12-3.1.1.jar?

HTH



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <[hidden email]> wrote:
Since you've not shared too much details I presume you've updated the spark-sql-kafka jar only.
KafkaTokenUtil is in the token provider jar.

As a general note if I'm right, please update Spark as a whole on all nodes and not just jars independently.

BR,
G


On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <[hidden email]> wrote:

Hi,


Any chance of someone testing  the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws 


java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar works fine


Thanks


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

srowen
You shouldn't be modifying your cluster install. You may at this point have conflicting, excess JARs in there somewhere. I'd start it over if you can.



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Gabor Somogyi
+1 on Sean's opinion



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh

I did some tests. The concern is an SSS job running under YARN.


Scenario 1)  use spark-sql-kafka-0-10_2.12-3.1.0.jar

  • Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from anywhere on CLASSPATH including $SPARK_HOME/jars
  • Added the said jar file to spark-submit in client mode (the only mode available to PySpark) with --jars
  • spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true .. bla bla..  --driver-memory 4G --executor-memory 4G --num-executors 2 --executor-cores 2 --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar xyz.py

This works fine


Scenario 2) use spark-sql-kafka-0-10_2.12-3.1.1.jar in spark-submit


  •  spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G --executor-memory 4G --num-executors 2 --executor-cores 2 --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.1.jar xyz.py

it failed with 


  • Caused by: java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


  • spark-submit --master yarn --deploy-mode client --conf spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G --executor-memory 4G --num-executors 2 --executor-cores 2 --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py

it failed with

  • Caused by: java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


HTH


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 





Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Amit Joshi
Hi Mich,

If I understood your problem correctly, it is that the spark-kafka jar is shadowed by the installed kafka-clients jar at run time.
I have been in that place before.
I can recommend resolving the issue with the shade plugin. The example I am pasting here works for pom.xml.
I am very sure you will find something equivalent for sbt as well.
This is a Maven shade plugin relocation that renames the packages while packaging, so the relocated classes cannot clash with whatever is already on the cluster classpath. This will form an uber jar.
<relocations>
    <relocation>
        <pattern>org.apache.kafka</pattern>
        <shadedPattern>shade.org.apache.kafka</shadedPattern>
    </relocation>
</relocations>

Hope this helps.

Regards
Amit Joshi



Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh
Hi Amit,

Many thanks for your suggestion.

My problem is that I am using PySpark in this particular case, so there is no sbt or Maven build, as there would be with Scala, with which to build an uber jar file with shading.

So, regrettably, the only way I could resolve the problem was by adding the jar file to spark-submit through --jars.

Regards

Mich



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


