Spark version verification

Mich Talebzadeh
Hi 

What signature in the Spark version string or binaries would confirm that a release is built on the final Spark 3.1.1, as opposed to 3.1.1-RC-1 or RC-2?

Thanks

Mich


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 

Re: Spark version verification

Attila Zsolt Piros
Hi!

I would check out the Spark source and then diff the two RCs (first just take a look at the list of changed files):

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat
...

The shell scripts in the release can be checked very easily:
 

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep '\.sh '
 bin/docker-image-tool.sh                           |   6 +-
 dev/create-release/release-build.sh                |   2 +-

We are lucky, as docker-image-tool.sh is part of the released distribution.
Is your copy from v3.1.1-rc2 or v3.1.1-rc1?

Of course, this only works if docker-image-tool.sh was not later changed from its v3.1.1-rc2 content back to its v3.1.1-rc1 content.
So let's continue with the Python (and later the R) files:

$ git diff v3.1.1-rc1..v3.1.1-rc2 --stat | grep '\.py '
 python/pyspark/sql/avro/functions.py               |   4 +-
 python/pyspark/sql/dataframe.py                    |   1 +
 python/pyspark/sql/functions.py                    | 285 +++++------
 .../pyspark/sql/tests/test_pandas_cogrouped_map.py |  12 +
 python/pyspark/sql/tests/test_pandas_map.py        |   8 +

...

Once you have enough proof you can stop (how much is enough is for you to decide).
Finally, you can use javap / scalap on the classes in the jars to check some of the code changes, although that is harder to analyze than a simple text file.

Best Regards,
Attila
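
The file-by-file comparison described above could be automated roughly as follows. This is only a sketch under assumptions: the helper names `file_at_tag` and `matching_tags` are hypothetical, a local clone of the Spark repository is assumed, and the paths in the commented usage are placeholders.

```python
import hashlib
import subprocess


def file_at_tag(repo_dir, tag, path):
    """Return the content of `path` as committed at `tag` (runs `git show tag:path`)."""
    return subprocess.run(
        ["git", "-C", repo_dir, "show", f"{tag}:{path}"],
        check=True,
        capture_output=True,
    ).stdout


def matching_tags(local_bytes, candidates):
    """Return the tags whose version of the file is byte-identical to the local copy.

    `candidates` maps a tag name to the file content at that tag.
    """
    local_digest = hashlib.sha256(local_bytes).hexdigest()
    return [
        tag
        for tag, content in candidates.items()
        if hashlib.sha256(content).hexdigest() == local_digest
    ]


# Hypothetical usage (the clone and install locations are assumptions):
# path = "bin/docker-image-tool.sh"
# candidates = {t: file_at_tag("/path/to/spark-clone", t, path)
#               for t in ("v3.1.1-rc1", "v3.1.1-rc2")}
# with open("/path/to/spark-install/" + path, "rb") as f:
#     print(matching_tags(f.read(), candidates))
```

An empty result, or a result that lists both tags, is inconclusive for that file, so another changed file would have to be checked.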


Re: Spark version verification

Mich Talebzadeh
Thanks for the detailed info.

I was hoping for a simpler way to establish the Spark version than a forensic examination of the code base, so to speak.

The motivation for this verification is that on GCP Dataproc, originally built on 3.1.1-rc2, there was an issue with running Spark Structured Streaming (SSS), which I reported to this forum before.

After a while, and after I reported it to Google, they upgraded the base to Spark 3.1.1 itself. I am not privy to how they did the upgrade.

In the meantime we installed 3.1.1 on-premise and ran it with the same Python code for SSS. It worked fine.

However, when I run the same code on GCP Dataproc upgraded to 3.1.1, I occasionally see this error:

21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception

java.util.ConcurrentModificationException

        at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)


Could this be for other reasons, or is it a consequence of upgrading from 3.1.1-rc2 to 3.1.1?



Re: Spark version verification

srowen
I believe you can "SELECT version()" in Spark SQL to see the build version.

Re: Spark version verification

Mich Talebzadeh
Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?
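
For scripted checks, the output above can be split into its two fields. A minimal Python sketch, assuming the output always has the form "<version> <revision>" as in this one example; the function name `parse_version_output` is hypothetical. As the later replies explain, the second field is the git revision the build was made from.

```python
import re

# Assumed format, based only on the single output shown above:
# a version token, whitespace, then a hexadecimal git revision.
_VERSION_RE = re.compile(r"^(?P<version>\S+)\s+(?P<revision>[0-9a-f]{7,40})$")


def parse_version_output(text):
    """Split the `SELECT version()` result into (release version, git revision)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected version() output: {text!r}")
    return match.group("version"), match.group("revision")


# In PySpark the string could be fetched with
# spark.sql("SELECT version()").first()[0], assuming an existing SparkSession `spark`.
version, revision = parse_version_output(
    "3.1.1 1d550c4e90275ab418b9161925049239227f3dc9"
)
print(version)   # 3.1.1
print(revision)  # 1d550c4e90275ab418b9161925049239227f3dc9
```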




Re: Spark version verification

Kent Yao-2

Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from various modern database management systems to Apache Spark.




Re: Spark version verification

Mich Talebzadeh

Hi Kent,

Thanks for the links.

You will have to excuse my ignorance: how do these links relate to establishing the Spark build version?


Re: Spark version verification

Attila Zsolt Piros
Hi!

Thanks Sean and Kent! By reading your answers I have also learnt something new.

[hidden email]: you can see the commit content by prefixing the hash with https://github.com/apache/spark/commit/.
So in your case https://github.com/apache/spark/commit/1d550c4e90275ab418b9161925049239227f3dc9
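
The same prefixing can be done in a script. A small sketch; the function name `commit_url` is hypothetical, and the check only confirms that the input looks like a hexadecimal git revision, not that the commit exists.

```python
import re

COMMIT_URL_PREFIX = "https://github.com/apache/spark/commit/"


def commit_url(revision):
    """Build the GitHub commit URL for a Spark git revision."""
    revision = revision.strip().lower()
    # Accept abbreviated (7+) or full (40) hexadecimal revisions only.
    if not re.fullmatch(r"[0-9a-f]{7,40}", revision):
        raise ValueError(f"not a git revision: {revision!r}")
    return COMMIT_URL_PREFIX + revision


print(commit_url("1d550c4e90275ab418b9161925049239227f3dc9"))
```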

Best Regards,
Attila

Re: Spark version verification

Kent Yao-2
In reply to this post by Mich Talebzadeh
Hi Mich,
> What are the correlations among these links and the ability to establish a spark build version
Check the documentation list at http://spark.apache.org/documentation.html. `latest` always points to the head of that list; for example, http://spark.apache.org/docs/latest/ currently means http://spark.apache.org/docs/3.1.1/.

The Spark build version in Spark releases is created by `spark-build-info`; see https://github.com/apache/spark/blob/89bf2afb3337a44f34009a36cae16dd0ff86b353/build/spark-build-info#L32

Some other options to check the Spark build info:
1. The `RELEASE` file:
cat RELEASE
Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

2. bin/spark-submit --version
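
The first line of the `RELEASE` file can also be read programmatically. A minimal Python sketch, assuming the line always follows the format of the single example shown above (other builds may word it differently); `parse_release_line` is a hypothetical helper name.

```python
import re

# Pattern derived only from the example line quoted in this thread.
_RELEASE_RE = re.compile(
    r"^Spark (?P<version>\S+) \(git revision (?P<revision>[0-9a-f]+)\) "
    r"built for Hadoop (?P<hadoop>\S+)$"
)


def parse_release_line(line):
    """Extract version, git revision and Hadoop version from the first RELEASE line."""
    match = _RELEASE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected RELEASE line: {line!r}")
    return match.groupdict()


info = parse_release_line(
    "Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4"
)
print(info)  # {'version': '3.0.1', 'revision': '2b147c4cd5', 'hadoop': '2.7.4'}
```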


The git revision by itself does not tell you whether the build is an RC or the final release.

If you have the Spark source code locally, you can use `git show 1d550c4e90275ab418b9161925049239227f3dc9` and get the tag info, like `commit 1d550c4e90275ab418b9161925049239227f3dc9 (tag: v3.1.1-rc3, tag: v3.1.1)`.

Or you can compare the revision you have got with all tags here https://github.com/apache/spark/tags 
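
The tag check can be scripted as well. A small Python sketch that reads the tags out of a decorated `commit ...` line such as the one quoted above; the helper names are hypothetical, and the rule "a tag without `-rc` is a final release" is an assumption based on the tag names seen in this thread (`v3.1.1-rc3`, `v3.1.1`).

```python
import re


def tags_from_commit_line(line):
    """Return the tag names listed in a decorated `commit ...` line from `git show`."""
    return re.findall(r"tag: ([^,)\s]+)", line)


def is_final_release(tags):
    """True if at least one tag has no `-rc` suffix (assumed naming: vX.Y.Z vs vX.Y.Z-rcN)."""
    return any("-rc" not in tag for tag in tags)


line = "commit 1d550c4e90275ab418b9161925049239227f3dc9 (tag: v3.1.1-rc3, tag: v3.1.1)"
tags = tags_from_commit_line(line)
print(tags)                    # ['v3.1.1-rc3', 'v3.1.1']
print(is_final_release(tags))  # True
```

A revision that carries both an RC tag and the final tag, as here, suggests that the final release was cut from that RC without further commits.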

Bests,

Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi: a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: a Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: a library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: a library that brings excellent and useful functions from various modern database management systems to Apache Spark.





Re: Spark version verification

Mich Talebzadeh
Thanks Kent. I missed the first link and did not check it.

I think that is what Sean referred to in his post.

`spark-submit --version` is probably the easiest check, together with spark-shell and pyspark. However, none of these identifies whether the build is the final release or a release candidate.

An interesting one is what Attila kindly referred to:

spark-sql> select version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

And that opens the link [image.png]


In GCP I have this on Dataproc:

spark-sql> select version();
3.1.1 122c0da8a0b9f5bc2b068643276b7b5a5a814d58

Trying the link as suggested,

https://github.com/apache/spark/commit/122c0da8a0b9f5bc2b068643276b7b5a5a814d58

points me to a non-existent page. However, this might be due to some customisation of the code, patches, etc.
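
The check above can be written down as a small lookup. This is only a sketch in Python under stated assumptions: `KNOWN_TAGS`, `parse_sql_version` and `classify` are hypothetical names, and the table holds just the one revision-to-tag mapping reported earlier in this thread (in practice it would be filled from `git show` or the GitHub tags page).

```python
from typing import Dict, List, Tuple

# Revision -> upstream tags. Only the mapping reported in this thread is
# listed here; it is not a complete table of Spark releases.
KNOWN_TAGS: Dict[str, List[str]] = {
    "1d550c4e90275ab418b9161925049239227f3dc9": ["v3.1.1-rc3", "v3.1.1"],
}


def parse_sql_version(output: str) -> Tuple[str, str]:
    """Split the 'select version()' result into (version, revision)."""
    version, revision = output.split()
    return version, revision


def classify(revision: str) -> str:
    """Describe a revision using the (assumed) table of upstream tags."""
    tags = KNOWN_TAGS.get(revision.strip().lower())
    if tags is None:
        # Not an upstream tag we know of, e.g. a vendor or private build.
        return "unknown revision (possibly a vendor or private build)"
    if any("-rc" not in tag for tag in tags):
        return "final release: " + ", ".join(tags)
    return "release candidate only: " + ", ".join(tags)


if __name__ == "__main__":
    for line in (
        "3.1.1 1d550c4e90275ab418b9161925049239227f3dc9",
        "3.1.1 122c0da8a0b9f5bc2b068643276b7b5a5a814d58",
    ):
        version, revision = parse_sql_version(line)
        print(version, revision, "->", classify(revision))
```

An "unknown" answer only says the hash is missing from the table; it does not by itself prove that the build differs from the upstream release.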


Thanks everyone again.




Re: Spark version verification

srowen
Right, that commit will not be in the OSS repository; it comes from some minor private variation or fork that GCP is building from.


Re: Spark version verification

Mich Talebzadeh
Ok, thanks. 



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Sun, 21 Mar 2021 at 18:03, Sean Owen <[hidden email]> wrote:
Right, that commit will not be in OSS, but some minor private variation or fork GCP is building from.

On Sun, Mar 21, 2021 at 12:31 PM Mich Talebzadeh <[hidden email]> wrote:
Thanks Kent. I missed the first link 


 and did not check it.

I think that is what Sean referred to in his post.

spark-submit --version is probably easiest together with spark-shell and pyspark. However, none of these go to identifying whether it is the genuine article or release candidate.

An interesting one is what Attila kindly referred to in


spark-sql> select version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

And that opens the link

image.png


in GCP I have this on dataproc


spark-sql> select version();

3.1.1 122c0da8a0b9f5bc2b068643276b7b5a5a814d58


and trying the link as suggested


https://github.com/apache/spark/commit/122c0da8a0b9f5bc2b068643276b7b5a5a814d58


Points me to a non-existent page. However, this might be due to some customisation of code and patches etc.


Thanks everyone again.



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Sun, 21 Mar 2021 at 16:51, Kent Yao <[hidden email]> wrote:
Hi Mich,
> What are the correlations among these links and the ability to establish a spark build version
   Check the documentation list here, http://spark.apache.org/documentation.html . And the `latest` always points to the list head, for example http://spark.apache.org/docs/latest/ means http://spark.apache.org/docs/3.1.1/ for now

The Spark build version in Spark releases is create by `spark-build-info ` see https://github.com/apache/spark/blob/89bf2afb3337a44f34009a36cae16dd0ff86b353/build/spark-build-info#L32 

Some other options to check the spark build info
1. the `RELEASE` file
cat RELEASE
Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

2. bin/spark-submit —version


The git revision itself does not tell you whether the release is rc or final.

If you have the Spark source code locally, you can use `git show 1d550c4e90275ab418b9161925049239227f3dc9` and get the tag info, like `commit 1d550c4e90275ab418b9161925049239227f3dc9 (tag: v3.1.1-rc3, tag: v3.1.1)`.

Or you can compare the revision you have got with all tags here https://github.com/apache/spark/tags 

Bests,

Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubiis a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.

spark-authorizerA Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extrasA library that brings excellent and useful functions from various modern database management systems to Apache Spark.




On 03/22/2021 00:02[hidden email] wrote:

Hi Kent,

Thanks for the links.

You have to excuse my ignorance, what are the correlations among these links and the ability to establish a spark build version?


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Sun, 21 Mar 2021 at 15:55, Kent Yao <[hidden email]> wrote:

Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubiis a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.

spark-authorizerA Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extrasA library that brings excellent and useful functions from various modern database management systems to Apache Spark.




On 03/21/2021 23:28[hidden email] wrote:
Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?




   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Sun, 21 Mar 2021 at 15:14, Sean Owen <[hidden email]> wrote:
I believe you can "SELECT version()" in Spark SQL to see the build version.

On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks for the detailed info.

I was hoping that one can find a simpler answer to the Spark version than doing forensic examination on base code so to speak.

The primer for this verification is that on GCP dataprocs originally built on 3.11-rc2, there was an issue with running Spark Structured Streaming (SSS) which I reported to this forum before.

After a while and me reporting to Google, they have now upgraded the base to Spark 3.1.1 itself. I am not privy to how they did the upgrade itself.

In the meantime we installed 3.1.1 on-premise and ran it with the same Python code for SSS. It worked fine.

However, when I run the same code on GCP dataproc upgraded to 3.1.1, occasionally I see this error

21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception

java.util.ConcurrentModificationException

        at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)


This may be for other reasons or the consequence of upgrading from 3.1.1-rc2 to 3.11?



   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 





Re: Spark version verification

Mich Talebzadeh
Basically, this is what the Dataproc team told us about how they previously built Spark on 3.1.1-rc2:

"For Dataproc we do not download Spark binaries, we build Spark from source code. This specific version of Spark was built from v3.1.1-rc2 GitHub tag with additional Dataproc fixes and features."


So that explains the difference in the build commit hash for 3.1.1.







On Sun, 21 Mar 2021 at 18:16, Mich Talebzadeh <[hidden email]> wrote:
Ok, thanks. 






On Sun, 21 Mar 2021 at 18:03, Sean Owen <[hidden email]> wrote:
Right, that commit will not be in OSS; it will be in some minor private variation or fork that GCP is building from.

On Sun, Mar 21, 2021 at 12:31 PM Mich Talebzadeh <[hidden email]> wrote:
Thanks Kent. I missed the first link and did not check it.

I think that is what Sean referred to in his post.

spark-submit --version is probably the easiest check, together with spark-shell and pyspark. However, none of these identifies whether the build is the final release or a release candidate.

An interesting one is what Attila kindly referred to in


spark-sql> select version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

And that opens the link



On GCP Dataproc I get this:

spark-sql> select version();
3.1.1 122c0da8a0b9f5bc2b068643276b7b5a5a814d58

Trying the link as suggested,

https://github.com/apache/spark/commit/122c0da8a0b9f5bc2b068643276b7b5a5a814d58

points me to a non-existent page. However, this might be due to some customisation of the code, patches, etc.
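
The check above can be sketched in a few lines of Python. This is only an illustration: it assumes the output of select version() always has the "<version> <40-character revision>" shape seen in this thread, and the helper names parse_sql_version and commit_url are made up for the example. In PySpark the input string could presumably be obtained from spark.sql("SELECT version()"), which is not run here.

```python
import re

# Shape taken from the outputs quoted in this thread:
# "<version> <40-character hexadecimal git revision>"
_VERSION_RE = re.compile(r"^(?P<version>\S+)\s+(?P<revision>[0-9a-f]{40})$")


def parse_sql_version(output):
    """Split the string returned by SELECT version() into (version, revision)."""
    match = _VERSION_RE.match(output.strip())
    if match is None:
        raise ValueError(f"unexpected version() output: {output!r}")
    return match.group("version"), match.group("revision")


def commit_url(revision):
    """Build the apache/spark commit URL for a revision (hypothetical helper)."""
    return f"https://github.com/apache/spark/commit/{revision}"


# The Dataproc output quoted above.
version, revision = parse_sql_version(
    "3.1.1 122c0da8a0b9f5bc2b068643276b7b5a5a814d58"
)
print(version)
print(commit_url(revision))
```

If the resulting URL does not resolve, the revision is not part of the public apache/spark history, which is consistent with a vendor building from its own fork.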


Thanks everyone again.






On Sun, 21 Mar 2021 at 16:51, Kent Yao <[hidden email]> wrote:
Hi Mich,
> What are the correlations among these links and the ability to establish a spark build version
   Check the documentation list here: http://spark.apache.org/documentation.html . The `latest` link always points to the head of that list; for example, http://spark.apache.org/docs/latest/ currently means http://spark.apache.org/docs/3.1.1/ .

The Spark build version in Spark releases is created by `spark-build-info`; see https://github.com/apache/spark/blob/89bf2afb3337a44f34009a36cae16dd0ff86b353/build/spark-build-info#L32
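
As a rough sketch of how that build info could be read back from a distribution: the example below assumes the script writes key=value lines into a resource called spark-version-info.properties inside the spark-core jar. That file name and the jar layout are my assumptions here, not something stated in this thread, and the function names are hypothetical. A vendor build may package things differently.

```python
import zipfile

# Assumed resource name; verify it against your own spark-core jar.
PROPS_NAME = "spark-version-info.properties"


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def read_build_info(jar):
    """Read the build properties from a jar given as a path or file object."""
    with zipfile.ZipFile(jar) as archive:
        return parse_properties(archive.read(PROPS_NAME).decode("utf-8"))


# Parsing a sample with the revision quoted elsewhere in this thread.
sample = "version=3.1.1\nrevision=1d550c4e90275ab418b9161925049239227f3dc9\n"
print(parse_properties(sample))
```

read_build_info would be pointed at the spark-core jar under the jars directory of the installation; the exact jar name depends on the Scala and Spark versions.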

Some other options for checking the Spark build info:
1. The `RELEASE` file:
cat RELEASE
Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4
Build flags: -B -Pmesos -Pyarn -Pkubernetes -Psparkr -Pscala-2.12 -Phadoop-2.7 -Phive -Phive-thriftserver -DzincPort=3036

2. bin/spark-submit --version
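
For the first option, a small Python sketch of pulling the fields out of the `RELEASE` file. It assumes the first line always looks like the sample above; the function names are made up for illustration. Note that the revision in `RELEASE` is abbreviated, so it can only be compared as a prefix of a full commit hash.

```python
import re

# Shape taken from the sample RELEASE line quoted above.
_RELEASE_RE = re.compile(
    r"^Spark (?P<version>\S+) \(git revision (?P<revision>[0-9a-f]+)\) "
    r"built for Hadoop (?P<hadoop>\S+)$"
)


def parse_release_line(line):
    """Return the version, git revision and Hadoop version as a dict."""
    match = _RELEASE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected RELEASE line: {line!r}")
    return match.groupdict()


def revision_matches(short_revision, full_revision):
    """True if the abbreviated revision is a prefix of the full hash."""
    return full_revision.startswith(short_revision)


info = parse_release_line(
    "Spark 3.0.1 (git revision 2b147c4cd5) built for Hadoop 2.7.4"
)
print(info["version"], info["revision"], info["hadoop"])
```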


The git revision itself does not tell you whether the release is rc or final.

If you have the Spark source code locally, you can use `git show 1d550c4e90275ab418b9161925049239227f3dc9` and get the tag info, like `commit 1d550c4e90275ab418b9161925049239227f3dc9 (tag: v3.1.1-rc3, tag: v3.1.1)`.

Or you can compare the revision you have got with all tags here https://github.com/apache/spark/tags 
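
The tag comparison can also be scripted. The sketch below is one possible approach, not an official tool: it assumes a local clone of apache/spark with tags fetched, uses `git tag --points-at` to list the tags on a revision, and classifies them with a naming rule (vX.Y.Z for final, vX.Y.Z-rcN for candidates) inferred from the tags mentioned in this thread. The function names are hypothetical.

```python
import re
import subprocess

# Naming rule inferred from the tags mentioned in this thread.
_FINAL_TAG = re.compile(r"^v\d+\.\d+\.\d+$")
_RC_TAG = re.compile(r"^v\d+\.\d+\.\d+-rc\d+$")


def tags_pointing_at(revision, repo_dir="."):
    """List the tags on a revision in a local clone (raises if git fails)."""
    result = subprocess.run(
        ["git", "tag", "--points-at", revision],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def classify(tags):
    """Classify a revision from the tags that point at it."""
    if any(_FINAL_TAG.match(tag) for tag in tags):
        return "final release"
    if any(_RC_TAG.match(tag) for tag in tags):
        return "release candidate"
    return "untagged"


# Tags reported above for 1d550c4e90275ab418b9161925049239227f3dc9.
print(classify(["v3.1.1-rc3", "v3.1.1"]))
```

In a clone, classify(tags_pointing_at(revision, repo_dir)) would combine the two steps. A revision that exists only in a private fork is unknown to the public repository, so the git call is expected to fail rather than return an empty list.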

Bests,

Kent Yao 
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: A Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: A library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: A library that brings excellent and useful functions from various modern database management systems to Apache Spark.




On 03/22/2021 00:02, [hidden email] wrote:

Hi Kent,

Thanks for the links.

You have to excuse my ignorance, what are the correlations among these links and the ability to establish a spark build version?





On Sun, 21 Mar 2021 at 15:55, Kent Yao <[hidden email]> wrote:





On 03/21/2021 23:28, [hidden email] wrote:
Many thanks

spark-sql> SELECT version();
3.1.1 1d550c4e90275ab418b9161925049239227f3dc9

What does 1d550c4e90275ab418b9161925049239227f3dc9 signify please?







On Sun, 21 Mar 2021 at 15:14, Sean Owen <[hidden email]> wrote:
I believe you can "SELECT version()" in Spark SQL to see the build version.

On Sun, Mar 21, 2021 at 4:41 AM Mich Talebzadeh <[hidden email]> wrote:
Thanks for the detailed info.

I was hoping for a simpler way to establish the Spark version than doing a forensic examination of the code base, so to speak.

The reason for this verification is that on GCP Dataproc, originally built on 3.1.1-rc2, there was an issue running Spark Structured Streaming (SSS), which I reported to this forum before.

After I reported it to Google, they have now upgraded the base to Spark 3.1.1 itself. I am not privy to how they did the upgrade.

In the meantime we installed 3.1.1 on-premise and ran it with the same Python code for SSS. It worked fine.

However, when I run the same code on GCP Dataproc upgraded to 3.1.1, I occasionally see this error:

21/03/18 16:53:38 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener EventLoggingListener threw an exception
java.util.ConcurrentModificationException
        at java.util.Hashtable$Enumerator.next(Hashtable.java:1387)


Is this due to other reasons, or is it a consequence of upgrading from 3.1.1-rc2 to 3.1.1?





