Compute the Hash of each row in new column

Chetan Khatri
Hi Spark Users,
How can I compute a hash of each row and store it in a new column of a DataFrame? Could someone help me?

Thanks

Re: Compute the Hash of each row in new column

Riccardo Ferrari
Hi Chetan,

Would the SQL function `hash` do the trick for your use case?

Best,
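
For reference, a minimal sketch of how the `hash` function could be called from SQL. The view name `people`, the column names and the sample rows are made up for illustration. `hash` returns a 32-bit integer (Murmur3), so it is not a cryptographic digest and collisions are possible on large tables.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("row-hash-sql").getOrCreate()
import spark.implicits._

// Hypothetical sample data, registered as a temporary view.
val people = Seq((1L, "alice"), (2L, "bob"), (3L, "alice")).toDF("id", "name")
people.createOrReplaceTempView("people")

// hash(...) accepts any number of columns and yields one INT per row.
val hashed = spark.sql("SELECT id, name, hash(id, name) AS row_hash FROM people")
hashed.show(false)
```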

In reply to Chetan Khatri's message of Fri, Feb 28, 2020 at 1:56 PM.

Re: Compute the Hash of each row in new column

Enrico Minack
In reply to this post by Chetan Khatri
This computes the md5 hash of a given column id of Dataset ds:

ds.withColumn("id hash", md5($"id")).show(false)

Test with this Dataset ds:

import org.apache.spark.sql.types._
val ds = spark.range(10).select($"id".cast(StringType))

The available hash functions are md5, sha, sha1, sha2 and hash:
https://spark.apache.org/docs/2.4.5/api/sql/index.html
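
As a sketch of one of the alternatives listed above, `sha2` takes the column plus a bit length (224, 256, 384 or 512) and returns the digest as a hex string. The column name `id sha256` is only an illustrative choice; the Dataset is the test Dataset from this message.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sha2
import org.apache.spark.sql.types.StringType

// Local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sha2-column").getOrCreate()
import spark.implicits._

// Same test Dataset as above: ids 0..9 cast to strings.
val ds = spark.range(10).select($"id".cast(StringType))

// SHA-256 digest of each id, rendered as a 64-character hex string.
val withSha = ds.withColumn("id sha256", sha2($"id", 256))
withSha.show(false)
```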

Enrico



Re: Compute the Hash of each row in new column

Chetan Khatri
Thanks, Enrico.
I want to compute the hash of all the column values in the row.

In reply to Enrico Minack's message of Fri, Feb 28, 2020 at 7:28 PM.

Re: Compute the Hash of each row in new column

Enrico Minack
Well, then apply md5 on all columns:

ds.select(ds.columns.map(col) ++ ds.columns.map(column => md5(col(column)).as(s"$column hash")): _*).show(false)

Enrico

In reply to Chetan Khatri's message of Mar 2, 2020 at 11:10.

Re: Compute the Hash of each row in new column

Chetan Khatri
Thanks, Enrico. I meant a single hash for each row, stored in an extra column, something like this:

val newDs = typedRows.withColumn("hash", hash(typedRows.columns.map(col): _*))
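
A sketch of that idea, for illustration. `hash` over all columns gives a 32-bit Murmur3 integer, which is cheap but can collide on large tables. When a wider digest is wanted, one commonly used alternative is to concatenate the columns into a single string and feed it to `sha2`. The sample data, the `|` separator and the `<null>` placeholder are assumptions made for this example, and the `typedRows` defined here is only a stand-in for the Dataset in the snippet above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, hash, lit, sha2}

// Local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("row-hash").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for typedRows, including one null value.
val typedRows = Seq[(Long, String, Option[Int])](
  (1L, "alice", Some(30)),
  (2L, "bob", None),
  (3L, "carol", Some(41))
).toDF("id", "name", "age")

val allCols = typedRows.columns.map(col)

// Variant 1: one 32-bit Murmur3 hash per row.
val withHash = typedRows.withColumn("hash", hash(allCols: _*))

// Variant 2: one SHA-256 digest per row.
// concat_ws skips nulls, so nulls are replaced by a placeholder first;
// the separator keeps ("ab", "c") distinct from ("a", "bc").
val asStrings = typedRows.columns.map(c => coalesce(col(c).cast("string"), lit("<null>")))
val withSha = withHash.withColumn("sha256", sha2(concat_ws("|", asStrings: _*), 256))

withSha.show(false)
```

Values that themselves contain the separator or the placeholder can still produce identical input strings, so both should be chosen to suit the data.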

In reply to Enrico Minack's message of Mon, Mar 2, 2020 at 3:51 PM.