What is the best way to pass secrets to Spark operations, e.g. for encryption, or for salting with a secret before hashing?
I have a few ideas off the top of my head.
The secret's source:
- environment variable
- config property
- remote service accessed through an API.
Passing to the executors:
1. Driver resolves the secret
a. It passes it to the encryption function as a plain argument, which then ends up as a UDF argument or gets interpolated into the expression's generated code.
b. It passes it to the encryption function as a literal expression. For security, I can create a SecretLiteral expression that redacts the value from the pretty-printed and SQL representations. Are there any other concerns here?
2. Executors resolve the secret
a. E.g. each executor reads the secret from an env var/config property/remote service, and only the env var name/property name/path/URI is passed as part of the plan. I need to cache the secret on the executor to avoid a performance hit, especially in the remote-service case.
b. Similar to (1b), I can create an expression that resolves the secret during execution.
In (1) the secret is shipped as part of the plan, so the RPC connections have to be encrypted if an attacker can sniff the network for secrets. Options (1b) and (2b) are superior for composing with existing expressions, e.g. `sha1(concat(colToMask, secretLit("mySecret")))` masks a column deterministically using a cryptographic hash function and a secret salt. (2) likely involves a more complicated design than (1).
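Outside Spark, the masking expression above boils down to something like this (plain Python with `hashlib`; `mask` is my own helper name). The point is determinism: the same value with the same salt always produces the same digest, so masked columns remain joinable, while different salts break linkability:

```python
import hashlib

def mask(value: str, salt: str) -> str:
    # Equivalent of sha1(concat(colToMask, secretLit("mySecret"))):
    # concatenate the column value with the secret salt, then hash.
    return hashlib.sha1((value + salt).encode("utf-8")).hexdigest()
```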
If you can point me to existing work in this space, that would be a great help!