Spark Security

wilbertseoane
Hello,

I plan to load in a local .tsv file from my hard drive using sparklyr (an R package). I have figured out how to do this already on small files.

When I decide to receive my client’s large .tsv file, can I be confident that loading in data this way will be secure? I know that this creates a Spark connection to help process the data more quickly, but I want to verify that the data will be secure after loading it with the Spark connection and sparklyr.


Thanks,

Wilbert J. Seoane

Re: Spark Security

srowen
What do you mean by secure here?

On Fri, May 29, 2020 at 10:21 AM <[hidden email]> wrote:
When I decide to receive my client’s large .tsv file, can I be confident that loading in data this way will be secure? I know that this creates a Spark connection to help process the data more quickly, but I want to verify that the data will be secure after loading it with the Spark connection and sparklyr.


Re: Spark Security

srowen
If you load a file on your computer, that is unrelated to Spark.
Whatever you load via Spark APIs will at some point live in memory on the Spark cluster, or in the storage that backs it if you store it.
Whether the cluster and storage are secure (like, ACLs / auth enabled) is up to whoever runs the cluster.

On Fri, May 29, 2020 at 1:54 PM <[hidden email]> wrote:
Hi Sean

I mean that I won't be opening up my client to any data breaches or anything like that by connecting to Spark and loading their data using sparklyr in RStudio.

Connecting to Spark and loading a .tsv file on my local computer is secure, correct?


Thanks

Wilbert J. Seoane

On May 29, 2020, at 11:25 AM, Sean Owen <[hidden email]> wrote:


What do you mean by secure here?

Re: Spark Security

Anwar AliKhan
In reply to this post by wilbertseoane
What is the size of your .tsv file, sir?
What is the size of your local hard drive, sir?


Regards


Wali Ahaad


Re: Spark Security

wilbertseoane
In reply to this post by srowen
Hello,

This is what happens when I load the data using sparklyr::spark_read_csv() in R. It creates a "derby.log" file that says something along the lines of:
Sun May 31 14:17:02 EDT 2020:
Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance xxxxxxx 
on database directory memory:C:\Users\wseoane\2020-05-31 sparklyr on three rows\databaseName=metastore_db with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$xxxxxxxxx 
Loaded from file:/C:/Users/wseoane/AppData/Local/spark/spark-2.4.3-bin-hadoop2.7/jars/derby-10.12.1.1.jar
java.vendor=Oracle Corporation
java.runtime.version=1.8.0_241-b07
user.dir=C:\Users\wseoane\2020-05-31 sparklyr on three rows
os.name=Windows 10
os.arch=xxxxx
os.version=10.0
derby.system.home=null
Database Class Loader started - derby.database.classpath=''

I can then click to view details about the Spark connection in my browser while I have the Spark connection in sparklyr. Here are the results from a test .tsv file:
Jobs:
Jobs 2020-05-31 142103.png
SQL:
SQL 2020-05-31 142217.png
Stages:
Stages 2020-05-31 142217.png
Storage:
Storage 2020-05-31 142217.png

So, since sparklyr::spark_read_csv() reads in the data locally and not in the cloud, security is determined by my company's IT department, correct (i.e., the firewalls that the IT department has in place on the network, the antivirus software they have installed on my computer, etc.)? If it were in the cloud, the cloud would need its own layer of security ("up to whoever runs the cluster"), but that is not relevant here since I am using sparklyr::spark_read_csv(), correct?


Thanks,

Wilbert Seoane



On Fri, May 29, 2020 at 3:17 PM Sean Owen <[hidden email]> wrote:
Whether the cluster and storage are secure (like, ACLs / auth enabled) is up to whoever runs the cluster.

Re: Spark Security

wilbertseoane
In reply to this post by Anwar AliKhan
Hello,

My hard drive has about 80 GB of free space, and the machine has about 12 GB of RAM.

I am not sure of the size of the .tsv file, but it will most likely be around 30 GB.


Thanks,

Wilbert Seoane



On Fri, May 29, 2020 at 5:03 PM Anwar AliKhan <[hidden email]> wrote:
What is the size of your .tsv file, sir?
What is the size of your local hard drive, sir?


Re: Spark Security

srowen
In reply to this post by wilbertseoane
spark_read_csv() does not simply read the file locally; again, it uses Spark to read it.

If you are literally running Spark in local mode on your machine, then all of this happens on your machine, because the driver and executors run in one local process.
Otherwise, it runs wherever the Spark cluster runs: on machines within your organization, in the cloud, or wherever it was started. In that case you would be running a driver process somewhere else.

Yes, what is relevant is the network firewalls on the machines where Spark runs (and potentially enabling auth in Spark itself).
Of course, it also matters where the data is. Spark has nothing to say about how the data is stored.
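
A minimal sketch of the local-versus-cluster distinction above, assuming the deciding factor is the Spark master URL given when connecting (in sparklyr, the master argument of spark_connect()). The helper names runs_on_this_machine and where_data_lives and the host cluster-host are hypothetical, and the check only recognizes Spark's documented local-mode URL forms. It says nothing about whether the machine itself is secure.

```python
import re

# Local-mode master URLs documented by Spark:
# "local", "local[K]", "local[K,F]", "local[*]", "local[*,F]".
_LOCAL_MASTER = re.compile(r"^local(\[(\*|\d+)(,\s*\d+)?\])?$")


def runs_on_this_machine(master: str) -> bool:
    """Return True if the master URL denotes Spark local mode."""
    return _LOCAL_MASTER.match(master.strip()) is not None


def where_data_lives(master: str) -> str:
    """Describe, in words, where loaded data ends up for a given master URL."""
    if runs_on_this_machine(master):
        return "this machine (driver and executors share one local process)"
    return "wherever the cluster behind '" + master + "' runs"


if __name__ == "__main__":
    # "cluster-host" is a made-up host name used only for illustration.
    for master in ("local", "local[*]", "spark://cluster-host:7077", "yarn"):
        print(master, "->", where_data_lives(master))
```

With a local master, the protections that matter are those of the machine itself; with any other master, they are those of the cluster and the network path to it.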



On Mon, Jun 1, 2020 at 7:20 AM Wilbert S. <[hidden email]> wrote:
So, since sparklyr::spark_read_csv() reads in the data locally and not in the cloud, security is determined by my company's IT department, correct (i.e., the firewalls that the IT department has in place on the network, the antivirus software they have installed on my computer, etc.)? If it were in the cloud, the cloud would need its own layer of security ("up to whoever runs the cluster"), but that is not relevant here since I am using sparklyr::spark_read_csv(), correct?

