Dear Spark team members,
Can you please advise if column-level encryption is available in Spark SQL? I am aware that Hive supports column-level encryption.
Appreciate your response.
Thanks, John
Hi,

Never heard of it (and have once been tasked to explore a similar use case). I'm curious how you'd like it to work? (no idea how Hive does this either)

Pozdrawiam,
Jacek Laskowski
Most enterprise databases provide data encryption of some form; see, for example, Oracle's Introduction to Transparent Data Encryption (oracle.com).

As far as I know, Hive supports column-level encryption for text and sequence files, which in turn relies on HDFS data encryption. See here.

In general this seems to be left to the underlying storage. Many customers rely on tokenization tools such as Protegrity before data is stored in a data warehouse like Hive or a cloud database.

There should be no reason for Spark not to support it, at least in its simplest form. For example, within PySpark one can create the table explicitly in Hive, attempting to encrypt the columns ID and CLUSTERED as below:
sqltext = ""
if spark.sql("SHOW TABLES IN test LIKE 'randomDataPy'").count() == 1:
    rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
    print("number of rows is", rows)
else:
    print("\nTable test.randomDataPy does not exist, creating table")
    sqltext = """
    CREATE TABLE test.randomDataPy(
        ID INT
      , CLUSTERED INT
      , SCATTERED INT
      , RANDOMISED INT
      , RANDOM_STRING VARCHAR(50)
      , SMALL_VC VARCHAR(50)
      , PADDING VARCHAR(4000)
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
        'column.encode.columns' = 'ID,CLUSTERED'
      , 'column.encode.classname' = 'org.apache.hadoop.hive.serde2.AESRewriter'
    )
    STORED AS TEXTFILE
    """
    spark.sql(sqltext)  # run the DDL only when the table does not yet exist
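An alternative, in the spirit of the tokenization tools mentioned above, is to protect sensitive columns on the application side before the data ever reaches storage. Below is a minimal, hypothetical sketch (Python standard library only) of deterministic tokenization with HMAC-SHA256; the key and sample values are illustrative assumptions, and in practice the key would come from a KMS, not be hard-coded. Such a function could be wrapped in a PySpark UDF and applied to the ID column before writing. Note this is one-way pseudonymization, not reversible encryption.

```python
import base64
import hashlib
import hmac

# Hypothetical key for illustration only; fetch from a KMS in practice.
SECRET_KEY = b"demo-key-do-not-hardcode"

def tokenize(value: str) -> str:
    """Deterministically tokenize a value with HMAC-SHA256.

    Deterministic output means joins and GROUP BYs on the tokenized
    column still work, but the original value cannot be recovered.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

# The same input always yields the same token...
print(tokenize("4111-1111-1111-1111") == tokenize("4111-1111-1111-1111"))  # True
# ...while different inputs yield different tokens.
print(tokenize("4111-1111-1111-1111") == tokenize("4111-1111-1111-1112"))  # False
```

Because the mapping is deterministic per key, referential integrity across tables is preserved, which is the main reason tokenization is often preferred over random encryption for analytics workloads.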
Disclaimer: I have not tried it myself but worth trying to see if it works.
HTH
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.
Hi John,
As always, I would start by asking what it is that you are trying to achieve here. What is the exact security requirement?
We can then start looking at the options available.
Regards,
Gourav Sengupta