RDD Manipulation in Scala.


RDD Manipulation in Scala.

trottdw
Hello, I am using Spark with Scala and I am attempting to understand the different filtering and mapping capabilities available.  I haven't found an example of the specific task I would like to do.

I am trying to read in a tab-separated text file and filter for specific entries.  I would like this filter to be applied to particular "columns", not whole lines.
I was using the following to split the data, but my attempts to filter by "column" afterwards are not working.
-----------------------------
   val data = sc.textFile("test_data.txt")
   var parsedData = data.map( _.split("\t").map(_.toString))
------------------------------

To try to give a more concrete example of my goal,
Suppose the data file is:
A1    A2     A3     A4
B1    B2     A3     A4
C1    A2     C2     C3


How would I filter the data based on the second column, so that only the entries with A2 in column two are returned?  The resulting RDD would then just be:

A1    A2     A3     A4
C1    A2     C2     C3

Is there a convenient way to do this?  Any suggestions or assistance would be appreciated.
Re: RDD Manipulation in Scala.

sowen
data.filter(_.split("\t")(1) == "A2")

?
--
Sean Owen | Director, Data Science | London
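Spelled out, Sean's one-liner looks like the sketch below. Since running it needs a SparkContext, the same logic is shown here on a plain Scala List standing in for the RDD (`List.filter` has the same shape as `RDD.filter`), using the sample rows from the question. The length guard is an extra safety check not in the original reply: `split("\t")(1)` throws `ArrayIndexOutOfBoundsException` on a line with fewer than two fields.

```scala
// Sample rows from the question, one tab-separated line per record.
val lines = List(
  "A1\tA2\tA3\tA4",
  "B1\tB2\tA3\tA4",
  "C1\tA2\tC2\tC3"
)

// Keep only the lines whose second tab-separated field is exactly "A2".
// The length check guards against short/malformed lines (an addition for safety).
val filtered = lines.filter { line =>
  val cols = line.split("\t")
  cols.length > 1 && cols(1) == "A2"
}
// filtered == List("A1\tA2\tA3\tA4", "C1\tA2\tC2\tC3")
```

On the real RDD it is the same call chain: `sc.textFile("test_data.txt").filter(...)` with an identical predicate, evaluated lazily across the cluster.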


On Tue, Mar 4, 2014 at 1:06 PM, trottdw <[hidden email]> wrote:

> Hello, I am using Spark with Scala and I am attempting to understand the
> different filtering and mapping capabilities available. [...]

Re: RDD Manipulation in Scala.

trottdw
Thanks Sean, I think that is doing what I needed.  It was much simpler than what I had been attempting.

Is it possible to filter with an OR condition, so that, for example, column 2 is filtered for "A2" appearances and column 3 for "A4" at the same time?

Or is it better to filter into separate RDDs and merge those?
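Both approaches can be sketched with the same filter pattern. The snippet below again uses a plain List in place of the RDD (same sample data as the question); the single-pass OR predicate and the filter-then-merge variant are shown side by side. On real RDDs the merge would be `rdd1.union(rdd2)`, and note that `union` does not deduplicate, so a row matching both conditions appears twice unless you call `.distinct()` (which costs a shuffle).

```scala
val lines = List(
  "A1\tA2\tA3\tA4",
  "B1\tB2\tA3\tA4",
  "C1\tA2\tC2\tC3"
)

// Single pass: keep a row if column 2 is "A2" OR column 3 is "A4".
val orFiltered = lines.filter { line =>
  val cols = line.split("\t")
  cols(1) == "A2" || cols(2) == "A4"
}

// Alternative: filter into two collections and merge them (RDD.union on real
// RDDs). distinct removes rows that matched both conditions; without it the
// merged result could contain duplicates, unlike the single-pass version.
val merged = (lines.filter(_.split("\t")(1) == "A2") ++
              lines.filter(_.split("\t")(2) == "A4")).distinct
```

The single-pass OR filter is usually preferable for a case like this: it reads the data once, while the filter-and-union route scans it twice and may need an extra `distinct()` pass.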