Streaming files as a whole


Mayur Rustagi
I am trying to load XML files in streaming, convert them to CSV, and store the result. When I read a file as a text file, it is split on "\n", which breaks the XML parser. Is it possible to receive the data one file at a time from the HDFS folder?

Re: Streaming files as a whole

Woody Christy
Take a look at the Mahout XmlInputFormat class. That should get you started.


On Thu, Jan 30, 2014 at 5:08 AM, Mayur Rustagi <[hidden email]> wrote:
I am trying to load xml in streaming and convert to csv and store it. When I use textfile it separates the file on "\n" and hence breaks the parser. Is it possible to receive the data one file at a time from the hdfs folder ?


--

Woody Christy
Solutions Architect | Partner Engineering | Cloudera Inc
@woodychristy




Re: Streaming files as a whole

Mayur Rustagi
Hi,
I am using Spark Streaming for this. In Streaming, I am trying to open the file as a text file and get it as a DStream.
Regards
Mayur



On Thu, Jan 30, 2014 at 7:17 PM, Woody Christy <[hidden email]> wrote:
Take a look at the Mahout xmlinputformat class. That should get  you started.



Re: Streaming files as a whole

Tathagata Das
This is a very late reply to this thread. If you are trying to read XML files from a directory and put them into a stream, there are two ways that may work.

1. Something like this: streamingContext.fileStream[LongWritable, Text, XmlInputFormat](<directory>)
The XmlInputFormat class is what Woody suggested. If this InputFormat works correctly, any new XML files created in <directory> should be read as an RDD in a DStream. However, there is no guarantee that it will read one file at a time: if two files are created within the same batch interval, both will be read together in the same batch.

2. If you want to manually control how the RDDs are fed, take a look at streamingContext.queueStream. It lets you create RDDs manually and push them into a queue. Spark Streaming will pull RDDs from that queue and treat them as a stream.
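
For option 1, a rough sketch of what the wiring might look like. Several things here are assumptions and not from this thread: the import path of XmlInputFormat (it has moved between Mahout versions), the record tag name "record", the HDFS paths, the batch interval, and the recordToCsv helper, which is purely illustrative and does no CSV quoting. The "xmlinput.start" and "xmlinput.end" keys are the ones Mahout's class reads, but check them against the version you use.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory

import org.apache.hadoop.io.{LongWritable, Text}
// Assumed location; the package of XmlInputFormat varies by Mahout version.
import org.apache.mahout.text.wikipedia.XmlInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.w3c.dom.Node
import org.xml.sax.InputSource

object XmlFileStreamSketch {

  // Hypothetical helper: turns one <record>...</record> string into a CSV line
  // built from the text of its direct child elements. No quoting or escaping.
  def recordToCsv(xml: String): String = {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new InputSource(new StringReader(xml)))
    val children = doc.getDocumentElement.getChildNodes
    (0 until children.getLength)
      .map(i => children.item(i))
      .filter(_.getNodeType == Node.ELEMENT_NODE)
      .map(_.getTextContent.trim)
      .mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("XmlFileStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))

    // XmlInputFormat emits everything between these two tags as one record,
    // regardless of line breaks. The tag name "record" is hypothetical.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("xmlinput.start", "<record>")
    hadoopConf.set("xmlinput.end", "</record>")

    // Hypothetical input directory. Every new file that appears during a batch
    // interval ends up in that batch's RDD, so batches may mix several files.
    val records = ssc
      .fileStream[LongWritable, Text, XmlInputFormat]("hdfs:///data/xml-in")
      .map { case (_, text) => text.toString } // copy out of the reused Text object

    // Writes one output directory per batch under the hypothetical prefix.
    records.map(recordToCsv).saveAsTextFiles("hdfs:///data/csv-out/batch")

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this approach the unit of processing is one XML record, not one file, which is usually enough to keep the parser from seeing a record cut at a line break.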
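
For option 2, a minimal sketch of feeding one file per RDD through queueStream. Pairing it with SparkContext.wholeTextFiles, which returns (path, full contents) pairs, is my suggestion and not something proposed in this thread. The file paths, the batch interval, and the describe helper are hypothetical, and a real job would need its own way of discovering new files.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {

  // Hypothetical per-file step: receives (path, entire file contents).
  // A real job would parse the XML here and emit CSV instead.
  def describe(file: (String, String)): String =
    s"${file._1}: ${file._2.length} characters"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QueueStreamSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Each queued RDD holds the (path, contents) pair of exactly one file.
    val rddQueue = new mutable.Queue[RDD[(String, String)]]()

    // oneAtATime = true: at most one queued RDD is consumed per batch interval,
    // so each batch sees a single whole file.
    val files = ssc.queueStream(rddQueue, oneAtATime = true)
    files.map(describe).print()

    ssc.start()

    // Hypothetical file list, pushed one file per RDD.
    val paths = Seq("hdfs:///data/xml-in/a.xml", "hdfs:///data/xml-in/b.xml")
    paths.foreach { p =>
      val rdd = ssc.sparkContext.wholeTextFiles(p)
      rddQueue.synchronized { rddQueue += rdd }
    }

    ssc.awaitTermination()
  }
}
```

Because each file arrives as a single string, the parser never sees a file split on "\n". The trade-off is that every file must fit in memory as one record.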

Hope this helps. Apologies for the late response.


On Thu, Jan 30, 2014 at 5:55 AM, Mayur Rustagi <[hidden email]> wrote:
Hi,
I am using Spark Streaming for this, in Streaming I am trying to open the file as text file and Dstream.
Regards
Mayur
