Implementation problem with Streaming


Implementation problem with Streaming

sanjay_awat
Hi,

I had initially thought of a streaming approach to solve my problem, but I am stuck at a few places and would like an opinion on whether this problem is suitable for streaming, or whether it is better to stick to basic Spark.

Problem: I get chunks of log files in a folder and need to do some analysis on them on an hourly interval, e.g. 11.00 to 11.59. The file chunks may or may not arrive in real time, and there can be breaks between subsequent chunks.

pseudocode:
while(true) {
  CheckForFile(localFolder)
  CopyToHDFS()
  RDDfile = read(fileFromHDFS)
  RDDHour = RDDHour.union(RDDfile.filter(keyHour == currentHr))
  if (RDDfile.keys().contains(currentHr + 1)) { // the next hour has arrived, so the current hour should be complete
      RDDHour.process()
      deleteFileFromHDFS()
      RDDHour.empty()
      currentHr++
  }
}
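
For concreteness, here is a minimal sketch of that loop against the plain Spark Java API (no Streaming). The paths, the hourOf() helper and the saveAsTextFile() call standing in for process() are only illustrative:

import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HourlyLogDriver {

    // Illustrative helper: pull the hour (0-23) out of a log line,
    // e.g. "2014-03-25 11:42:07 ..." -> 11.
    private static int hourOf(String line) {
        return Integer.parseInt(line.substring(11, 13));
    }

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("HourlyLogAnalysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int currentHr = 11;                                                // hour currently being accumulated
        JavaRDD<String> hourRdd = sc.parallelize(Collections.<String>emptyList());

        while (true) {
            // 1. Check the local folder and copy any new chunk to HDFS (external step, not shown).
            JavaRDD<String> fileRdd = sc.textFile("hdfs:///logs/incoming/chunk-latest");  // placeholder path

            // 2. Keep only the lines that belong to the hour being built up.
            final int hr = currentHr;
            hourRdd = hourRdd.union(fileRdd.filter(line -> hourOf(line) == hr));

            // 3. If the chunk already contains lines from the next hour, the current hour is complete.
            if (fileRdd.filter(line -> hourOf(line) == hr + 1).count() > 0) {
                hourRdd.saveAsTextFile("hdfs:///logs/processed/hour-" + hr);  // stand-in for process()
                // delete the consumed chunk from HDFS here if desired
                hourRdd = sc.parallelize(Collections.<String>emptyList());
                currentHr++;
            }

            TimeUnit.MINUTES.sleep(1);  // poll interval before looking for the next chunk
        }
    }
}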

If I use streaming, I face the following problems:
1) Inability to keep a Java variable (currentHr) in the driver which can be used across batches (see the sketch after this list).
2) The input files may come with a break, e.g. 10.00 - 10.30 arrives and then there is a break of 4 hours. If I use streaming, I can't process the 10.00 - 10.30 batch as it's incomplete, and the 1 hour DStream window for the 10.30 - 11.00 file will have an empty previous RDD, as nothing was received in the preceding 4 hours. Basically, Streaming goes by the time a file arrives, not by the time inside the file's content.
3) No control over deleting a file from HDFS, as the program runs inside a StreamingContext loop.
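
To make point 1 concrete: the nearest built-in mechanism Spark Streaming has for carrying state across batches is per-key state via updateStateByKey, rather than a driver variable. A minimal sketch, assuming a text-file stream and an illustrative hourOf() helper; it keeps a running line count per content-hour, but it does not by itself answer the "has the next hour started yet" question:

import java.util.List;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingStateSketch {

    // Illustrative helper: pull the hour (0-23) out of a log line.
    private static int hourOf(String line) {
        return Integer.parseInt(line.substring(11, 13));
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("HourlyCountsSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(60 * 1000));
        jssc.checkpoint("hdfs:///checkpoints/hourly");  // required for stateful operations

        // New files dropped into this directory become the stream.
        JavaDStream<String> lines = jssc.textFileStream("hdfs:///logs/incoming");

        // Key each line by the hour found *inside* the line, not by arrival time.
        JavaPairDStream<Integer, Long> perHour =
                lines.mapToPair(line -> new Tuple2<Integer, Long>(hourOf(line), 1L));

        // Carry a running count per hour across batches instead of a driver variable.
        JavaPairDStream<Integer, Long> runningCounts = perHour.updateStateByKey(
                (List<Long> newValues, Optional<Long> state) -> {
                    long sum = state.orElse(0L);
                    for (Long v : newValues) {
                        sum += v;
                    }
                    return Optional.of(sum);
                });

        runningCounts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}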

Any ideas on overcoming the above limitations, or on whether streaming is suitable for this kind of problem at all, would be helpful.

Regards,
Sanjay
Re: Implementation problem with Streaming

Mayur Rustagi
Two good benefits of Streaming:
1. It maintains windows as you move across time, removing & adding RDDs as you move through the window.
2. It connects with streaming systems like Kafka to import data as it comes in & process it.

You don't seem to need either of these features; you would be better off using plain Spark with crontab maybe :), serializing your state object to HDFS if it's huge, or maintaining it in memory.
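
A minimal sketch of what that could look like, assuming a crontab-driven spark-submit job and an illustrative hourOf() helper; here the only state carried between runs is the current hour, kept in a small file (HDFS would be the place for anything larger):

// Run once an hour (or more often) from crontab, e.g.
//   5 * * * * spark-submit --class HourlyBatchJob hourly-job.jar
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HourlyBatchJob {

    // Illustrative helper: pull the hour (0-23) out of a log line.
    private static int hourOf(String line) {
        return Integer.parseInt(line.substring(11, 13));
    }

    public static void main(String[] args) throws IOException {
        // State carried between runs: just the hour we are waiting to complete.
        Path stateFile = Paths.get("/var/lib/hourly-job/currentHr");   // illustrative location
        int currentHr = Files.exists(stateFile)
                ? Integer.parseInt(new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8).trim())
                : 0;

        SparkConf conf = new SparkConf().setAppName("HourlyBatchJob");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///logs/incoming/*");   // placeholder path

        final int hr = currentHr;
        // Only process the hour once lines from the next hour are already visible.
        if (lines.filter(l -> hourOf(l) == hr + 1).count() > 0) {
            JavaRDD<String> hourLines = lines.filter(l -> hourOf(l) == hr);
            hourLines.saveAsTextFile("hdfs:///logs/processed/hour-" + hr);  // stand-in for the real analysis
            Files.write(stateFile, String.valueOf(hr + 1).getBytes(StandardCharsets.UTF_8));
            // processed input files could be deleted or moved here
        }

        sc.stop();
    }
}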

Mayur Rustagi
Ph: +1 (760) 203 3257

