Near Real time analytics with Spark and tokenization

Near Real time analytics with Spark and tokenization

Mich Talebzadeh
Hi,

When doing micro-batch streaming of trade data we need to tokenize certain columns before the data lands in HBase, in a Lambda architecture.
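
As a rough illustration of that micro-batch step (a minimal sketch only: the socket source, the CSV column layout and the tokenize() call are placeholder assumptions, and the HBase write is elided):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TradeTokenizer {

  // Placeholder for whatever tokenization service is actually used
  // (vault-based or vault-less).
  def tokenize(clearText: String): String = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TradeTokenizer")
    val ssc  = new StreamingContext(conf, Seconds(2))   // 2-second micro-batches

    // Assumed source and record layout: CSV lines of trade_id,account,amount.
    val trades = ssc.socketTextStream("feed-host", 9999)

    val tokenized = trades.map { line =>
      val Array(tradeId, account, amount) = line.split(",")
      // Replace the sensitive column with its token before it lands in HBase.
      (tradeId, tokenize(account), amount)
    }

    tokenized.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Open one HBase connection per partition and write each row (elided).
        records.foreach { case (tradeId, accountToken, amount) => () }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}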

There are two ways of tokenizing data: vault-based, and vault-less using something like Protegrity tokenization.

Vault-based tokenization requires the clear-text and token values to be stored in a vault, say HBase, and crucially the vault cannot be on the same Hadoop cluster where we do the real-time processing. It could be on another Hadoop cluster dedicated to tokenization.

This adds latency to the real-time analytics, because token values have to be calculated and then stored in the remote HBase vault.
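
For reference, this is roughly what the per-value round trip to the vault looks like with the plain HBase client (a sketch under assumed names: a remote table token_vault with column family cf, the clear value used directly as the row key, and a UUID standing in for a real token generator). Every miss costs a get plus a put against the remote cluster, which is exactly where the latency comes from:

import java.util.UUID

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object RemoteVault {

  private val CF    = Bytes.toBytes("cf")
  private val TOKEN = Bytes.toBytes("token")

  def getOrCreateToken(clearText: String): String = {
    val conf = HBaseConfiguration.create()
    // The vault lives on a *different* cluster, hence the extra network hop.
    conf.set("hbase.zookeeper.quorum", "remote-vault-zk:2181")

    // A connection per call is only for readability; in a Spark job you would
    // create it once per partition (or executor) and reuse it.
    val connection = ConnectionFactory.createConnection(conf)
    val table      = connection.getTable(TableName.valueOf("token_vault"))
    try {
      val result = table.get(new Get(Bytes.toBytes(clearText)))
      if (!result.isEmpty) {
        Bytes.toString(result.getValue(CF, TOKEN))      // existing token
      } else {
        val token = UUID.randomUUID().toString          // simplistic token
        val put   = new Put(Bytes.toBytes(clearText))
        put.addColumn(CF, TOKEN, Bytes.toBytes(token))
        table.put(put)                                  // second round trip
        token
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}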

What is the general approach to this type of issue? It seems the answer is to use vault-less tokenization?
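
By way of contrast, a vault-less token is derived purely from the clear text and a secret, so there is no remote lookup at all. The sketch below uses an HMAC only to illustrate that property; it is not how Protegrity or any particular product actually does it, and being one-way it cannot be detokenized:

import java.util.Base64

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

object VaultlessTokenizer {

  // The shared secret is the only state; there is no vault to consult.
  // If it is ever compromised, every token has to be recalculated.
  private val secret = sys.env.getOrElse("TOKEN_SECRET", "change-me")

  def tokenize(clearText: String): String = {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(new SecretKeySpec(secret.getBytes("UTF-8"), "HmacSHA256"))
    // One-way: fine for joins and analytics, but not reversible like a real
    // (often format-preserving) vault-less product.
    Base64.getEncoder.encodeToString(mac.doFinal(clearText.getBytes("UTF-8")))
  }
}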

Thanks

Dr Mich Talebzadeh

 

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: Near Real time analytics with Spark and tokenization

Jörn Franke
Can't you cache the token vault in a caching solution, such as Ignite? The lookup of single tokens would be really fast.
What sort of volumes are we talking about?
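
A minimal sketch of that idea, assuming an Ignite client node joining an already running cluster and a hypothetical fromVault function that does the slow lookup against the remote HBase vault on a miss:

import org.apache.ignite.{Ignite, Ignition}
import org.apache.ignite.configuration.IgniteConfiguration

object TokenCache {

  // Client node joining an existing Ignite cluster (assumed setup).
  lazy val ignite: Ignite =
    Ignition.start(new IgniteConfiguration().setClientMode(true))

  lazy val tokens = ignite.getOrCreateCache[String, String]("token_vault_cache")

  def lookupToken(clearText: String, fromVault: String => String): String =
    Option(tokens.get(clearText)) match {
      case Some(token) => token                // cache hit: in-memory get
      case None =>
        val token = fromVault(clearText)       // miss: one remote round trip
        tokens.put(clearText, token)
        token
    }
}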

I assume you are referring to PCI DSS, so security may be an important aspect, and that might not be as easy to achieve with vault-less tokenization. Also, with vault-less tokenization you need to recalculate all tokens if the secret is compromised.
There might be other compliance requirements, which may need to be weighed by the users.
