multi-line elements


multi-line elements

Philip Ogren
I have a file that consists of multi-line records. Is it possible to read in multi-line records with a method such as SparkContext.newAPIHadoopFile? Or do I need to pre-process the data so that all the data for one element is in a single line?

Thanks,
Philip


Re: multi-line elements

suman bharadwaj
I'm new to Spark as well, but I was able to write a custom sentence input format and sentence record reader that reads multiple lines of text, with the record boundary being "[.?!]\s*", using the Hadoop APIs. I then plugged the SentenceTextInputFormat into the Spark API as shown below:

import org.apache.hadoop.io.{LongWritable, Text}

val inputRead = sc
  .hadoopFile("<path to the file in hdfs>",
    classOf[SentenceTextInputFormat], classOf[LongWritable], classOf[Text])
  .map(value => value._2.toString)
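
The SentenceTextInputFormat itself isn't posted in the thread. For reference, a minimal sketch of what such a format might look like (illustrative only, not Suman's actual code), written against the old mapred API that sc.hadoopFile expects; note it makes no attempt to handle records that span split boundaries:

// Sketch only: wraps Hadoop's LineRecordReader and concatenates lines
// until one ends at the sentence boundary "[.?!]\s*".
import java.util.regex.Pattern
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred._

class SentenceTextInputFormat extends FileInputFormat[LongWritable, Text] {
  override def getRecordReader(split: InputSplit, job: JobConf,
      reporter: Reporter): RecordReader[LongWritable, Text] =
    new SentenceRecordReader(new LineRecordReader(job, split.asInstanceOf[FileSplit]))
}

class SentenceRecordReader(lines: RecordReader[LongWritable, Text])
    extends RecordReader[LongWritable, Text] {
  private val boundary = Pattern.compile("[.?!]\\s*$")

  // Accumulate lines until one ends with sentence-final punctuation.
  override def next(key: LongWritable, value: Text): Boolean = {
    val lineKey = lines.createKey()
    val lineVal = lines.createValue()
    val sb = new StringBuilder
    var first = true
    while (lines.next(lineKey, lineVal)) {
      if (first) { key.set(lineKey.get()); first = false }
      if (sb.nonEmpty) sb.append(' ')
      sb.append(lineVal.toString)
      if (boundary.matcher(lineVal.toString).find()) {
        value.set(sb.toString)
        return true
      }
    }
    // Trailing text with no terminator still comes out as a final record.
    if (sb.nonEmpty) { value.set(sb.toString); true } else false
  }

  override def createKey(): LongWritable = new LongWritable()
  override def createValue(): Text = new Text()
  override def getPos(): Long = lines.getPos()
  override def getProgress(): Float = lines.getProgress()
  override def close(): Unit = lines.close()
}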

In your case, I guess you can use NLineInputFormat, which Hadoop provides, and pass it in as the input format the same way.

Maybe there are better ways to do it.

Regards,
Suman Bharadwaj S


Re: multi-line elements

suman bharadwaj
Just one correction: I think NLineInputFormat won't fit your use case after all. As far as I understand, it only controls how many lines go into each input split; each record is still a single line. You may have to write a custom record reader on top of TextInputFormat and plug it into Spark as shown above.

Regards,
Suman Bharadwaj S


Re: multi-line elements

Philip Ogren
Thank you for pointing me in the right direction! 


Re: multi-line elements

Christopher Nguyen
Philip, if there are easily detectable line groups, you might define your own InputFormat. Alternatively, you can consider using mapPartitions() to get access to the entire data partition instead of a row at a time; you'd still have to worry about what happens at the partition boundaries. A third approach is indeed to pre-process with an appropriate mapper/reducer.

Sent while mobile. Pls excuse typos etc.
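
A rough sketch of the mapPartitions() idea (an illustration, not Christopher's code, reusing the sentence boundary from Suman's messages); note that a record straddling two partitions comes out split in two, which is exactly the boundary caveat above:

// Sketch only: stitch lines into sentence records within each partition.
// Records that cross a partition boundary are NOT repaired here.
val records = sc.textFile("<path to the file in hdfs>").mapPartitions { lines =>
  val boundary = "[.?!]\\s*$".r
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  for (line <- lines) {
    if (current.nonEmpty) current.append(' ')
    current.append(line)
    if (boundary.findFirstIn(line).isDefined) {
      out += current.toString
      current.clear()
    }
  }
  if (current.nonEmpty) out += current.toString  // possible partial record at the boundary
  out.iterator
}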


Re: multi-line elements

Azuryy Yu
Hi Philip,
you can specify org.apache.hadoop.streaming.StreamInputFormat, which should fit your case. You just specify stream.recordreader.begin and stream.recordreader.end, and the reader then returns each block of text between a BEGIN and an END marker as a single record.
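
A sketch of how that might be wired up from Spark (untested; as far as I know you also need to point stream.recordreader.class at StreamXmlRecordReader, the reader in the hadoop-streaming jar that honors those markers, and the <record> strings below are placeholder delimiters):

// Sketch only: StreamInputFormat lives in the hadoop-streaming jar.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

val conf = new JobConf(sc.hadoopConfiguration)
conf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
conf.set("stream.recordreader.begin", "<record>")   // placeholder BEGIN marker
conf.set("stream.recordreader.end", "</record>")    // placeholder END marker
FileInputFormat.addInputPaths(conf, "<path to the file in hdfs>")

// StreamXmlRecordReader puts each delimited block in the key; the value is empty.
val blocks = sc.hadoopRDD(conf, classOf[StreamInputFormat],
    classOf[Text], classOf[Text]).map(_._1.toString)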

