[Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same


Gokul
Hello All, 

I am calculating the hash value of a few columns to determine whether a record is an Insert/Delete/Update, but I found a strange scenario: some records return the same hash value even though the keys are completely different.

For instance,

scala> spark.sql("select hash('40514XXXXX'),hash('41751XXXX')").show()
+---------------+---------------+
|hash(40514XXXX)|hash(41751XXXX)|
+---------------+---------------+
|      976573657|      976573657|
+---------------+---------------+

scala> spark.sql("select hash('14589'),hash('40004XXXX')").show()
+-----------+---------------+
|hash(14589)|hash(40004XXXX)|
+-----------+---------------+
|  777096871|      777096871|
+-----------+---------------+

I understand that hash() returns an integer. Have these values reached the maximum value?

Thanks & Regards, 
Gokula Krishnan (Gokul)
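
On the question of a maximum value: both values in the post are well below the largest 32-bit signed integer, and a hash does not saturate at a maximum in any case. A 32-bit hash has only 2^32 possible outputs, so distinct keys can share a value. The following is a minimal sketch in plain Scala, assuming uniformly distributed hash values. The object and method names are hypothetical, and it only estimates collision odds; it does not reproduce Spark's hash() output.

```scala
// Sketch: how likely is it that at least two of n distinct keys share a
// 32-bit hash value? Assumes hash values are uniformly distributed.
object HashCollisionEstimate {
  // Number of distinct values a 32-bit integer hash can take (2^32).
  val hashSpace: Double = math.pow(2.0, 32)

  // Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * hashSpace)).
  def collisionProbability(n: Long): Double =
    1.0 - math.exp(-(n.toDouble * (n - 1)) / (2.0 * hashSpace))

  def main(args: Array[String]): Unit = {
    // The values reported in the post sit well inside the Int range.
    println(s"Int range: ${Int.MinValue} to ${Int.MaxValue}")
    println(s"976573657 below Int.MaxValue: ${976573657 < Int.MaxValue}")

    // Estimated collision probability for a few key counts.
    for (n <- Seq(10000L, 100000L, 1000000L))
      println(f"n = $n%d keys -> P(collision) ~ ${collisionProbability(n)}%.4f")
  }
}
```

Under these assumptions, roughly 77,000 distinct keys are enough for an even chance of at least one collision, so a 32-bit hash alone is a weak basis for deciding that two records are identical. This estimate says nothing about whether the specific pairs in the post are genuine collisions.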
Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

Thakrar, Jayesh

Cannot reproduce your situation.

Can you share Spark version?

 

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select hash('40514XXXXX'),hash('41751XXXX')").show()
+----------------+---------------+
|hash(40514XXXXX)|hash(41751XXXX)|
+----------------+---------------+
|     -1898845883|      916273350|
+----------------+---------------+

scala> spark.sql("select hash('14589'),hash('40004XXXX')").show()
+-----------+---------------+
|hash(14589)|hash(40004XXXX)|
+-----------+---------------+
|  777096871|    -1593820563|
+-----------+---------------+

 

From: Gokula Krishnan D <[hidden email]>
Date: Tuesday, September 25, 2018 at 8:57 PM
To: user <[hidden email]>
Subject: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

 

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

Gokul
Hello Jayesh, 

I have masked the input values with XXXX.


Thanks & Regards, 
Gokula Krishnan (Gokul)


On Wed, Sep 26, 2018 at 2:20 PM Thakrar, Jayesh <[hidden email]> wrote:

Cannot reproduce your situation.

Can you share Spark version?

 

Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

Thakrar, Jayesh

Not sure I get what you mean...

I ran the query that you posted and don't get the same hash values as you.

 

 

From: Gokula Krishnan D <[hidden email]>
Date: Friday, September 28, 2018 at 10:40 AM
To: "Thakrar, Jayesh" <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: [Spark SQL] why spark sql hash() are returns the same hash value though the keys/expr are not same

 

Hello Jayesh, 

 

I have masked the input values with XXXX.
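
Because the posted literals are masked, hashing them cannot reproduce the values seen on the real data. One way to check for genuine collisions without sharing the data is to group the real keys by their hash() value and count the distinct keys in each group. The following is a minimal sketch, assuming Spark 2.x or later on the classpath; the object name, the findCollisions helper, and the "key" column name are hypothetical, and the masked literals from the thread stand in for real values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, hash}

object HashCollisionCheck {
  // Returns one row per hash() value that is shared by more than one
  // distinct key. An empty result means no collisions in this data set.
  def findCollisions(df: DataFrame, keyCol: String): DataFrame =
    df.withColumn("h", hash(col(keyCol)))
      .groupBy("h")
      .agg(countDistinct(col(keyCol)).as("distinct_keys"))
      .filter(col("distinct_keys") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hash-collision-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder keys: replace with the real (unmasked) key column.
    val df = Seq("14589", "40514XXXXX", "41751XXXX", "40004XXXX").toDF("key")

    findCollisions(df, "key").show()

    spark.stop()
  }
}
```

For the masked literals the result should be empty, consistent with the four distinct values in the Spark 2.2.0 output earlier in the thread. Any rows returned for the real data would identify the hash values that more than one key maps to.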

 

 
