To add to the discussion, Spark Streaming's text file stream automatically detects new files and generates RDDs out of them. For example, if you run 10-second batches, then all new files (of the same format) that appear in the directory during each interval will be read and turned into a per-interval RDD. Then you can do whatever you want with those RDDs.
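A minimal sketch of what that looks like (the app name and directory path are placeholders, not from this thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DirectoryMonitor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DirectoryMonitor")
        // 10-second batch interval, as in the example above.
        val ssc = new StreamingContext(conf, Seconds(10))

        // Each batch yields one RDD containing the lines of all files
        // that appeared in the directory during that interval.
        val lines = ssc.textFileStream("hdfs:///path/to/watched/dir")
        lines.foreachRDD { rdd =>
          // Process the per-interval RDD however you like.
          println(s"Batch contained ${rdd.count()} lines")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }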
However, note that repeatedly unioning RDDs can rapidly increase the number of partitions in the unioned RDD, which may degrade performance. Consider using RDD.coalesce periodically to reduce the number of partitions.
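A rough illustration of that caveat, continuing the sketch above (`ssc` and `lines` are from the previous example; the counts and threshold are arbitrary):

    import org.apache.spark.rdd.RDD

    var combined: RDD[String] = ssc.sparkContext.parallelize(Seq.empty[String])
    var batchCount = 0

    lines.foreachRDD { rdd =>
      combined = combined.union(rdd) // partition count grows with every union
      batchCount += 1
      if (batchCount % 10 == 0) {
        // Periodically collapse back to a bounded number of partitions.
        combined = combined.coalesce(8)
      }
    }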
On Wed, Feb 19, 2014 at 5:44 AM, Ashish Rangole <[hidden email]> wrote:
You could also look at how the Spark Streaming DStream does what you described.
Take a look at the Spark StreamingContext.textFileStream implementation.