Issue with UDF Int Conversion - Str to Int

6 messages
Issue with UDF Int Conversion - Str to Int

ayan guha
Hi,

I am trying to implement simple hashing/checksum logic. The key logic is:

1. Generate the sha1 hash
2. Extract the last 8 chars
3. Convert those 8 chars to an Int (using base 16)

Here is a cut-down version of the code:

---------------------------------------------------------------------------------------
from pyspark.sql.functions import *  # provides udf, col, etc.
from pyspark.sql.types import *
from hashlib import sha1 as local_sha1

df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")

# sha1 hex digest is 40 chars, so [32:] takes the last 8
f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16))
f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:], 16)
sha2Int1 = udf(f1, StringType())
sha2Int2 = udf(f2, IntegerType())

print(f1('4102859263'))
dfr = df.select(df.value_to_hash,
                sha2Int1(df.value_to_hash).alias('1'),
                sha2Int2(df.value_to_hash).alias('2'))
dfr.show(truncate=False)
---------------------------------------------------------------------------------------

I was expecting both columns to contain exactly the same values; however, that is not always the case:

2520346415
+-------------+----------+-----------+
|value_to_hash|1         |2          |
+-------------+----------+-----------+
|4104003141   |478797741 |478797741  |
|4102859263   |2520346415|-1774620881|
+-------------+----------+-----------+

The function itself works fine, as the print statement shows. However, the two columns do not match, and the values vary widely.

Any pointers?

--
Best Regards,
Ayan Guha

Re: Issue with UDF Int Conversion - Str to Int

abeboparebop
You are overflowing the integer type, which goes up to a max value of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to `LongType()` and it works as expected.
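To see the wrap-around concretely, here is a small plain-Python sketch (the specific values are taken from the show() output above; the two's-complement reinterpretation mirrors what happens when the oversized value is forced into a 32-bit int):

```python
import struct

INT_MAX = 2**31 - 1        # 2147483647, the ceiling of a 32-bit signed int

value = 2520346415         # int(sha1('4102859263').hexdigest()[32:], 16)
assert value > INT_MAX     # too big for IntegerType

# Reinterpret the low 32 bits as a signed integer (two's complement):
wrapped = struct.unpack('<i', struct.pack('<I', value & 0xFFFFFFFF))[0]
print(wrapped)             # -1774620881, the value seen in column '2'

# A 64-bit LongType column has plenty of headroom:
assert value <= 2**63 - 1
```

Eight hex digits can never exceed 0xFFFFFFFF (4294967295), so the result of this hashing scheme always fits in a long, never reliably in an int.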

On Mon, Mar 23, 2020 at 6:15 AM ayan guha <[hidden email]> wrote:

Re: Issue with UDF Int Conversion - Str to Int

ayan guha
Thanks a lot. Will try. 

On Mon, Mar 23, 2020 at 8:16 PM Jacob Lynn <[hidden email]> wrote:
You are overflowing the integer type, which goes up to a max value of 2147483647 (2^31 - 1). Change the return type of `sha2Int2` to `LongType()` and it works as expected.

--
Best Regards,
Ayan Guha

Re: Issue with UDF Int Conversion - Str to Int

Enrico Minack
Ayan,

no need for UDFs, the SQL API provides all you need (sha1, substring, conv):

>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+----------+
|  sha2long|
+----------+
| 478797741|
|2520346415|
+----------+

This creates a lean query plan:

>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
== Physical Plan ==
Union
:- *(1) Project [478797741 AS sha2long#74L]
:  +- Scan OneRowRelation[]
+- *(2) Project [2520346415 AS sha2long#76L]
   +- Scan OneRowRelation[]
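As a cross-check, the same expression can be sketched in plain Python (note that Spark SQL's substring is 1-indexed, so substring(..., 33, 8) corresponds to Python's [32:40] slice of the 40-character hex digest; the expected values below are the ones from the show() output):

```python
from hashlib import sha1

def sha_to_long(s: str) -> int:
    # sha1 hex digest is 40 chars; take the last 8 and parse them as base 16
    digest = sha1(s.encode('UTF-8')).hexdigest()
    return int(digest[32:], 16)

print(sha_to_long('4104003141'))  # 478797741
print(sha_to_long('4102859263'))  # 2520346415
```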


Enrico


On 23.03.20 at 06:13, ayan guha wrote:

Re: Issue with UDF Int Conversion - Str to Int

ayan guha
Awesome. I did not know about the conv function, so thanks for that.

On Tue, 24 Mar 2020 at 1:23 am, Enrico Minack <[hidden email]> wrote:
--
Best Regards,
Ayan Guha

Re: Issue with UDF Int Conversion - Str to Int

Vipul Rajan
Hi Ayan,

You don't have to bother with the conversion at all. Functions that operate on numeric columns still work as long as every value in the column is a number; Spark casts the strings to double implicitly (note the 0.0, 2.0, ... and 45.0 results below):
scala> df2.printSchema
root
 |-- id: string (nullable = false)
 |-- id2: string (nullable = false)


scala> df2.show
+---+---+
| id|id2|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
|  3|  3|
|  4|  4|
|  5|  5|
|  6|  6|
|  7|  7|
|  8|  8|
|  9|  9|
+---+---+


scala> df2.select($"id" + $"id2").show
+----------+
|(id + id2)|
+----------+
|       0.0|
|       2.0|
|       4.0|
|       6.0|
|       8.0|
|      10.0|
|      12.0|
|      14.0|
|      16.0|
|      18.0|
+----------+


scala> df2.select(sum("id")).show
+-------+
|sum(id)|
+-------+
|   45.0|
+-------+
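A rough plain-Python analogue of that implicit cast, for illustration: the string values are converted to double before arithmetic, which is why the results above print as 0.0, 2.0, ... and 45.0 rather than as integers.

```python
ids = [str(i) for i in range(10)]    # a "string column" of digits, like df2.id

# Spark-style behavior: cast each string to double, then do the arithmetic
pairwise = [float(a) + float(b) for a, b in zip(ids, ids)]
total = sum(float(x) for x in ids)

print(pairwise[:3])   # [0.0, 2.0, 4.0]
print(total)          # 45.0
```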

On Tue, Mar 24, 2020 at 12:11 AM ayan guha <[hidden email]> wrote: