Spark saveAsTextFile Disk Recommendation


Spark saveAsTextFile Disk Recommendation

Ranju Jain

Hi All,

 

I have a large RDD of around 60-70 GB which I cannot send to the driver using collect(), so I first write it to disk using saveAsTextFile(). The data gets saved as multiple part files on each node of the cluster, and after that the driver reads the data from that storage.

My question: spark.local.dir is the directory Spark uses as scratch space, where map output files and RDDs may be written during shuffle operations etc.

For that directory it is strongly recommended to use a fast, local disk to avoid failures or performance impact.

Is there any similar recommendation to store the part files of a large dataset [or big RDD] on a fast disk?

This will help me configure the right type of disk for the resulting part files.

 

Regards

Ranju


Re: Spark saveAsTextFile Disk Recommendation

Attila Zsolt Piros
Hi!

I would like to respond only to the first part of your mail:

I have a large RDD dataset of around 60-70 GB which I cannot send to driver using collect so first writing that to disk using  saveAsTextFile and then this data gets saved in the form of multiple part files on each node of the cluster and after that driver reads the data from that storage.

What is your use case here?

As you mention collect(), I assume you have to process the data outside of Spark, maybe with a 3rd party tool, is that right?

If you have 60-70 GB of data and you write it to a text file and then read it back within the same application, you still cannot call collect() on it, as it is still 60-70 GB of data, right?

On the other hand, is your data really just a collection of strings without any repetitions? I ask this because of the file format you are using: text file. Even for a text file you can at least pass a compression codec as the 2nd argument of saveAsTextFile() (when you use this link you might need to scroll up a little bit; at least my Chrome displays the saveAsTextFile method without the 2nd codec argument). As IO is slow, compressed data could be read back quicker, as there will be less data on the disk. Check the Snappy codec for example.
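For illustration, here is a minimal PySpark sketch of passing a codec to saveAsTextFile(). The helper name save_compressed and the paths are hypothetical. The sketch defaults to gzip rather than Snappy, on the assumption that a downstream Python script can read gzip with the standard library; Snappy-compressed part files may need extra tooling to read outside Hadoop or Spark.

```python
# Hadoop codec class names, passed to Spark as plain strings.
GZIP_CODEC = "org.apache.hadoop.io.compress.GzipCodec"
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def save_compressed(rdd, output_dir, codec=GZIP_CODEC):
    """Write an RDD of strings as compressed text part files.

    `rdd` is assumed to be a pyspark RDD; `output_dir` must not exist yet,
    because saveAsTextFile() refuses to overwrite an existing directory.
    """
    # In PySpark the 2nd argument of saveAsTextFile is compressionCodecClass.
    rdd.saveAsTextFile(output_dir, compressionCodecClass=codec)
    return output_dir


# Usage with a real SparkContext (hypothetical path):
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   save_compressed(sc.parallelize(["a,1", "b,2"]), "/tmp/parts_gz")
```

With GzipCodec the part files get a .gz suffix. Whether compression actually speeds things up depends on how compressible the records are and on how much CPU is available, so it is worth measuring on a sample first.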

But if your data has a structure and you plan to process it further within Spark, then please consider something much better: a columnar storage format, namely ORC or Parquet.

Best Regards,
Attila


In reply to the post by Ranju Jain of Sun, Mar 21, 2021 at 3:40 AM


RE: Spark saveAsTextFile Disk Recommendation

Ranju Jain
In reply to this post by Ranju Jain

Hi Attila,

 

What is your use case here?

[Ranju]:

The client driver application does not use collect(). Internally it calls a Python script that reads the records [comma-separated strings] from the part files on each cluster node separately and copies them into a final CSV file, merging the data of all part files into a single CSV file. The script runs on every node, and the per-node results are later combined into a single file.
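The actual merge script is not shown in this thread. As a rough sketch of the per-node step described above, something like the following would concatenate the part files of one output directory into a single CSV. The function name merge_part_files and all paths are hypothetical, and it assumes the part files are reachable on the local filesystem and are either plain text or gzip-compressed.

```python
import glob
import gzip
import os
import shutil


def merge_part_files(part_dir, merged_csv):
    """Concatenate part-* files from part_dir into merged_csv, in name order.

    Returns the number of part files merged. Marker files such as _SUCCESS
    and hidden .crc checksum files do not match the "part-*" pattern.
    """
    part_paths = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(merged_csv, "wb") as out:
        for path in part_paths:
            # Parts written with GzipCodec end in ".gz"; decompress on the fly.
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rb") as src:
                # Stream in chunks so a large part is never held in memory.
                shutil.copyfileobj(src, out)
    return len(part_paths)
```

Because saveAsTextFile() writes one record per line, plain concatenation keeps the records intact. Copying in binary mode avoids decoding 60-70 GB of text just to write it out again.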

 

On the other hand is your data really just a collection of strings without any repetitions?

[Ranju]:

Yes, it is a comma-separated string.

I just checked the 2nd argument of saveAsTextFile, and I believe reads and writes on disk will be faster with compression. I will try this.

So I think there is no special requirement on the type of disk for running saveAsTextFile, as these are local I/O operations.

 

Regards

Ranju

 
