Mutating RDD

Mutating RDD

David Thomas
Let's say I have an RDD of text files from HDFS. At runtime, is it possible to check a particular directory for new files and, if any are present, add them to the existing RDD?

Re: Mutating RDD

Mayur Rustagi
RDDs are immutable, so an existing RDD cannot be modified. Instead, you can create an RDD from the new files and union it with the old in-memory RDD to produce a new RDD.
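
A minimal sketch of the union approach, assuming Spark is on the classpath and runs in local mode. The names `UnionSketch` and `addNewData` and the in-memory sample lines are hypothetical stand-ins, not part of the original reply; in practice both RDDs would more likely come from `sc.textFile` on HDFS paths.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Returns a NEW RDD containing the lines of both inputs.
  // Neither input RDD is modified.
  def addNewData(existing: RDD[String], fresh: RDD[String]): RDD[String] =
    existing.union(fresh)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, for illustration only.
    val conf = new SparkConf().setAppName("UnionSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-ins for RDDs read from the old and the new files.
      val oldLines = sc.parallelize(Seq("a", "b"))
      val newLines = sc.parallelize(Seq("c"))

      val combined = addNewData(oldLines, newLines)
      println(s"combined: ${combined.count()} lines")
      println(s"original: ${oldLines.count()} lines") // unchanged by the union
    } finally {
      sc.stop()
    }
  }
}
```

Because `union` returns a new RDD, the usual pattern is to rebind a variable to the result rather than to expect the original RDD to grow.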
Regards
Mayur



On Tue, Feb 18, 2014 at 6:33 PM, David Thomas <[hidden email]> wrote:
Let's say I have an RDD of text files from HDFS. At runtime, is it possible to check a particular directory for new files and, if any are present, add them to the existing RDD?

Re: Mutating RDD

David Thomas
Perfect.


On Tue, Feb 18, 2014 at 7:58 PM, Mayur Rustagi <[hidden email]> wrote:
RDDs are immutable, so an existing RDD cannot be modified. Instead, you can create an RDD from the new files and union it with the old in-memory RDD to produce a new RDD.
Regards
Mayur

Re: Mutating RDD

Ashish Rangole

You could also look at how the Spark Streaming DStream does what you described.

Take a look at the implementation of Spark's StreamingContext.textFileStream.

On Feb 18, 2014 8:02 PM, "David Thomas" <[hidden email]> wrote:
Perfect.


Re: Mutating RDD

Tathagata Das
To add to the discussion, Spark Streaming's text file stream automatically detects new files and generates RDDs from them. For example, if you run 10-second batches, all new files (of the same format) that appear in the directory during each interval will be read into a per-interval RDD. You can then do whatever you want with those RDDs.

var unionRDD = ...  // the RDD accumulated so far

streamingContext.textFileStream(<directory>).foreachRDD { rdd =>
  // Do what you want with this interval's RDD.
  // To keep accumulating, union it into the running RDD:
  unionRDD = unionRDD.union(rdd)
}

However, note that repeatedly unioning RDDs can rapidly increase the number of partitions in the unioned RDD, which may degrade performance. Consider calling RDD.coalesce periodically to reduce the number of partitions.
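
One way the unioning and periodic coalescing could fit together is sketched below. It assumes a local master and the 10-second batch interval mentioned above, and it takes the watched directory from the first command-line argument. The names `StreamingUnionSketch`, `accumulate`, and `MaxPartitions`, and the threshold value of 64, are hypothetical choices for illustration rather than anything prescribed in this thread.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingUnionSketch {
  // Hypothetical threshold; tune it for the actual workload.
  val MaxPartitions = 64

  // Unions a batch into the running RDD and coalesces the result
  // once the partition count exceeds the threshold.
  def accumulate(acc: RDD[String], batch: RDD[String]): RDD[String] = {
    val merged = acc.union(batch)
    if (merged.partitions.length > MaxPartitions) merged.coalesce(MaxPartitions)
    else merged
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, directory to watch passed as args(0).
    val conf = new SparkConf().setAppName("StreamingUnionSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Start from an empty RDD; foreachRDD runs on the driver,
    // so rebinding this variable there is safe.
    var unionRDD: RDD[String] = ssc.sparkContext.emptyRDD[String]

    ssc.textFileStream(args(0)).foreachRDD { rdd =>
      unionRDD = accumulate(unionRDD, rdd)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Checking the partition count on every batch keeps the logic simple; coalescing only every N batches would be another reasonable design.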

TD


On Wed, Feb 19, 2014 at 5:44 AM, Ashish Rangole <[hidden email]> wrote:

You could also look at how the Spark Streaming DStream does what you described.

Take a look at the implementation of Spark's StreamingContext.textFileStream.
