Column-level encryption in Spark SQL


Column-level encryption in Spark SQL

john washington
Dear Spark team members,

Can you please advise whether column-level encryption is available in Spark SQL?
I am aware that Hive supports column-level encryption.

Appreciate your response.

Thanks,
John

Re: Column-level encryption in Spark SQL

Jacek Laskowski
Hi,

I have never heard of it, although I was once tasked with exploring a similar use case. I'm curious: how would you like it to work? (I have no idea how Hive does this either.)

(In reply to john washington's message of Sat, Dec 19, 2020 at 2:38 AM.)

Re: Column-level encryption in Spark SQL

Mich Talebzadeh
Most enterprise databases provide data encryption of some form; see, for example, Oracle's "Introduction to Transparent Data Encryption" (oracle.com).

As far as I know, Hive supports column-level encryption for text and sequence files, which in turn relies on HDFS data encryption.

In general, this seems to be left to the underlying storage. Most customers rely on tools such as Protegrity tokenization solutions to protect data before it is stored in a data warehouse like Hive or in a cloud database.
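Tokenizing sensitive columns before the data is stored could look roughly like the following. This is a minimal sketch using keyed hashing (HMAC-SHA256) from the Python standard library; it is not how Protegrity or any other product works, and the names tokenize, tokenize_rows and SECRET_KEY are made up for illustration. Keyed hashing is one-way, so unlike encryption the original values cannot be recovered from the tokens.

```python
import hashlib
import hmac

# Hypothetical demo key; a real deployment would fetch it from a key manager.
SECRET_KEY = b"demo-key-not-for-production"


def tokenize(value: str, secret_key: bytes) -> str:
    """Return a deterministic, one-way token for a single value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize_rows(rows, sensitive_columns, secret_key):
    """Tokenize only the sensitive columns; other columns pass through unchanged."""
    return [
        {
            col: tokenize(str(val), secret_key) if col in sensitive_columns else val
            for col, val in row.items()
        }
        for row in rows
    ]


# Column names borrowed from the table discussed in this thread.
rows = [
    {"ID": 1, "CLUSTERED": 10, "SMALL_VC": "a"},
    {"ID": 2, "CLUSTERED": 10, "SMALL_VC": "b"},
]
tokenized = tokenize_rows(rows, {"ID", "CLUSTERED"}, SECRET_KEY)
for row in tokenized:
    print(row)
```

Because the same input always maps to the same token under a given key, joins and group-bys on the tokenized columns still work, which is one reason this is often done before the data reaches the warehouse.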

There should be no reason for Spark not to support it, at least in its simplest form. For example, within PySpark one can create the table explicitly in Hive and try to encrypt the columns ID and CLUSTERED, as shown below:

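# Assumes an active SparkSession named spark with Hive support enabled.
# fullyQualifiedTableName was not defined in the original snippet; it is set here
# on the assumption that it refers to the same table that is created below.
fullyQualifiedTableName = "test.randomDataPy"
sqltext = ""
# If the table already exists, just report its row count
if spark.sql("SHOW TABLES IN test like 'randomDataPy'").count() == 1:
  rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
  print("number of rows is ", rows)
else:
  print("\nTable test.randomDataPy does not exist, creating table ")
  # Ask the SerDe to encode columns ID and CLUSTERED with the AESRewriter class
  sqltext = """
     CREATE TABLE test.randomDataPy(
       ID INT
     , CLUSTERED INT
     , SCATTERED INT
     , RANDOMISED INT
     , RANDOM_STRING VARCHAR(50)
     , SMALL_VC VARCHAR(50)
     , PADDING  VARCHAR(4000)
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
       WITH SERDEPROPERTIES ('column.encode.columns'='ID, CLUSTERED', 'column.encode.classname'='org.apache.hadoop.hive.serde2.AESRewriter')
       STORED AS TEXTFILE
    """
  spark.sql(sqltext)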

Disclaimer: I have not tried this myself, but it is worth trying to see if it works.
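To experiment with different column sets, the DDL from the example above could be generated rather than hand-written. The sketch below only builds the statement string and needs no Spark session; build_encoded_table_ddl is a hypothetical helper, not part of Spark or Hive, and whether a given Hive build honours the 'column.encode.columns' and 'column.encode.classname' properties remains as untested here as in the example it is based on.

```python
# Hypothetical helper (not a Spark or Hive API): it only assembles the DDL text.
def build_encoded_table_ddl(table_name, columns, encoded_columns, encoder_class):
    """Return a CREATE TABLE statement that lists encoded_columns in the SerDe properties."""
    known = {name for name, _ in columns}
    missing = [c for c in encoded_columns if c not in known]
    if missing:
        raise ValueError(f"columns not in table definition: {missing}")
    column_defs = "\n  , ".join(f"{name} {dtype}" for name, dtype in columns)
    encoded = ", ".join(encoded_columns)
    return (
        f"CREATE TABLE {table_name}(\n"
        f"    {column_defs}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\n"
        f"WITH SERDEPROPERTIES ('column.encode.columns'='{encoded}', "
        f"'column.encode.classname'='{encoder_class}')\n"
        "STORED AS TEXTFILE"
    )


# Same table, columns and encoder class as in the example above.
ddl = build_encoded_table_ddl(
    "test.randomDataPy",
    [
        ("ID", "INT"),
        ("CLUSTERED", "INT"),
        ("SCATTERED", "INT"),
        ("RANDOMISED", "INT"),
        ("RANDOM_STRING", "VARCHAR(50)"),
        ("SMALL_VC", "VARCHAR(50)"),
        ("PADDING", "VARCHAR(4000)"),
    ],
    ["ID", "CLUSTERED"],
    "org.apache.hadoop.hive.serde2.AESRewriter",
)
print(ddl)
```

The resulting string can then be passed to spark.sql(ddl) in a session with Hive support. Rejecting unknown column names up front avoids silently creating a table whose SerDe properties refer to columns that do not exist.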

HTH


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

(In reply to Jacek Laskowski's message of Thu, 21 Jan 2021 at 11:44.)

Re: Column-level encryption in Spark SQL

Gourav Sengupta
Hi John,

As always, I would start by asking what it is that you are trying to achieve here. What is the exact security requirement?

We can then start looking at the options available.

Regards,
Gourav Sengupta

(In reply to Mich Talebzadeh's message of Thu, Jan 21, 2021 at 1:59 PM.)