Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Ranju Jain

Hi,

 

I need to save the executors' processed data in the form of part files, but I think a persistent volume is not an option for this, as the executors terminate after their work completes.

So I am thinking of using a volume shared across the executor pods.

 

Should I go with NFS, or are there other volume options worth exploring as well?

 

Regards

Ranju
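
[Editor's note: for reference, executor pods can mount a shared volume through Spark's Kubernetes volume configs. A minimal PySpark sketch, assuming a pre-created ReadWriteMany PersistentVolumeClaim backed by an NFS persistent volume; "shared-nfs-pvc" and the mount path are hypothetical names. The claim outlives the executor pods, so the part files survive executor termination.]

    # Sketch only: mount a ReadWriteMany PVC ("shared-nfs-pvc" is hypothetical)
    # into every executor at /mnt/shared.
    from pyspark.sql import SparkSession

    prefix = "spark.kubernetes.executor.volumes.persistentVolumeClaim.shared"

    spark = (SparkSession.builder
             .appName("part-files-on-shared-volume")
             .config(prefix + ".options.claimName", "shared-nfs-pvc")
             .config(prefix + ".mount.path", "/mnt/shared")
             .config(prefix + ".mount.readOnly", "false")
             .getOrCreate())

    # Executors can then write their part files onto the shared mount, e.g.:
    # df.write.csv("file:///mnt/shared/job-output")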


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Jacek Laskowski
Hi,

> as the executors terminate after their work completes.

--conf spark.kubernetes.executor.deleteOnTermination=false ?
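
[Editor's note: a minimal sketch of the same setting applied via SparkSession config, equivalent to passing the --conf flag to spark-submit. Note that keeping terminated executor pods around mainly helps with log inspection; it does not by itself give the driver access to files on the executors' local disks.]

    from pyspark.sql import SparkSession

    # Keep executor pods after they exit instead of deleting them (sketch;
    # equivalent to the --conf flag above).
    spark = (SparkSession.builder
             .config("spark.kubernetes.executor.deleteOnTermination", "false")
             .getOrCreate())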


RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Ranju Jain

Hi Jacek,

 

I use spark.kubernetes.executor.deleteOnTermination=true deliberately; I would change it only to troubleshoot, since otherwise I want to free up resources after the executors complete their job.

Now I want some shared storage that all the executors can write their part files to.

Which Kubernetes storage should I go for?

 

Regards

Ranju


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Jacek Laskowski
Hi,

On GCP I'd go for buckets in Google Cloud Storage. Not sure how reliable it is in production deployments, though; I only have demo experience here.


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh
If the purpose is temporary work, write the output to a temporary sub-directory under a given bucket:

spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])

That dict reference points to this YAML file entry:

GCPVariables:
   tmp_bucket: "tmp_storage_bucket/tmp"

Just create a temporary bucket with a sub-directory tmp underneath:

tmp_storage_bucket/tmp


HTH
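
[Editor's note: as a usage illustration, a hedged PySpark sketch of executors writing part files straight into that temporary path. It assumes the GCS connector is on the classpath; the job sub-directory "job-123" and the placeholder DataFrame are illustrative only.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tmp-bucket-demo").getOrCreate()

    df = spark.range(1000)  # placeholder data

    # Each executor task writes its own part file under the temporary prefix.
    df.write.mode("overwrite").parquet("gs://tmp_storage_bucket/tmp/job-123")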




RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Ranju Jain

Hi Mich,

 

The purpose is that all the Spark executors running on K8s worker nodes write their processed task data [part files] to some shared storage, and the driver pod, running on the same Kubernetes cluster, then accesses that shared storage and merges all those part files into a single file.

So I am looking at the shared storage options available to persist the part files.

What is the best shared storage that can be used to collect all the executors' part files in one place?

 

Regards

Ranju
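
[Editor's note: a hedged sketch of that driver-side merge step, assuming CSV part files under a path visible to the driver; both directory names are hypothetical.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-part-files").getOrCreate()

    shared_dir = "/mnt/shared/job-output"         # hypothetical shared mount
    merged_dir = "/mnt/shared/job-output-merged"  # hypothetical target

    # coalesce(1) funnels all rows through a single task, so the output
    # directory holds exactly one part file; fine for modest data volumes.
    (spark.read.csv("file://" + shared_dir, header=True)
          .coalesce(1)
          .write.mode("overwrite")
          .csv("file://" + merged_dir, header=True))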

 


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh
Hi Ranju,

In your statement:

"What is the best shared storage can be used to collate all executors part files at one place."

Are you looking for performance or durability?

In general, every executor on every node should have access to GCS buckets created under the project (assuming you are using a service account to run the Spark job):

gs://tmp_storage_bucket/


So you can try it and see if it works (create it first). Of course, Spark needs to be aware of it.


HTH
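
[Editor's note: a hedged sketch of one way to make Spark aware of the bucket, assuming the Hadoop GCS connector jar is already on the driver and executor classpath; the key-file path is a hypothetical mounted secret.]

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("gcs-output")
             .config("spark.hadoop.fs.gs.impl",
                     "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
             .config("spark.hadoop.google.cloud.auth.service.account.enable",
                     "true")
             .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                     "/secrets/gcp/key.json")  # hypothetical mounted secret
             .getOrCreate())

    # Any pod with these settings (driver or executor) can now read and
    # write the shared bucket directly.
    spark.range(10).write.mode("overwrite").parquet(
        "gs://tmp_storage_bucket/tmp/demo")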



RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Ranju Jain

Hi Mich,

 

I will check GCP buckets; I don't have much idea of how they work. It will be easier for me to study GCP buckets if you validate my understanding below:

 

Are you looking for performance or durability?

[Ranju]: Durability, or I would say feasibility.

 

In general, every executor on every node should have access to GCS buckets created under the project (assuming you are using a service account to run the Spark job):

[Ranju]: Please check my understanding of the statement above:

1. Does this bucket persist after the executors complete the job [i.e. store the processed records into the bucket] and terminate, or does it work as ephemeral storage that is only there while an executor is live?

2. Is the GCP bucket shareable across the driver pod and the executor pods?

 

Regards

Ranju

 


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh
It will be easier if I show what is in the temporary storage bucket.


I have created a storage bucket called tmp_storage_bucket.


Under that there is a directory /tmp where all the temporary sub-directories are created:

[screenshot: temporary sub-directories under tmp_storage_bucket/tmp]


These temporary directories were created on 3 March 2021, and I believe they will be there for a week, so at least you won't lose them right after your Spark job finishes.


Since your account has access to the storage bucket (r/w etc.), your pods should have access to it too. This definitely works with GCP Dataproc clusters, so I would be surprised if it did not work for your pods. In a way it is visible to all, so it plays the role of NFS here.


Just try it and see if it works for your case.


HTH



Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh
And you can set your retention policy here:

[screenshot: bucket retention policy settings in the GCP console]
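
[Editor's note: a hedged sketch of setting an equivalent retention from code via the google-cloud-storage Python client instead of the console; the bucket name is from the thread and the 7-day age is illustrative.]

    from google.cloud import storage

    client = storage.Client()  # uses the service account credentials
    bucket = client.get_bucket("tmp_storage_bucket")

    # Delete objects automatically once they are 7 days old (illustrative TTL).
    bucket.add_lifecycle_delete_rule(age=7)
    bucket.patch()  # push the lifecycle rule to GCS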




RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Ranju Jain

Hi Mich,

 

(assuming you are using a service account to run the Spark job)

Yes, I have created a Spark service account with a role allowing create, delete and get on the pods resource.

 

With the screenshot attached below, I assume the bucket concept is tied to the Google Cloud Platform. Is that so?

I need to check whether this bucket creation is possible on other platforms; I am using Kubernetes on bare metal.

 

Regards

Ranju

 

 

 


Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

Mich Talebzadeh
On your point, Ranju:

With the screenshot attached below, I assume the bucket concept is tied to the Google Cloud Platform. Is that so?


Yes, that is correct.


I need to check whether this bucket creation is possible on other platforms; I am using Kubernetes on bare metal.


I believe it should work. Please try it.


HTH


Mich

