optimize multiple filter operations

optimize multiple filter operations

mrm
Hi,

My question is:

I have multiple filter operations where I split my initial RDD into two different groups. Together, the two groups cover the whole initial set. In code, it's something like:

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

By doing this, I am making two passes over the data. Is there any way to optimise this into a single pass?
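For what it's worth, here is a complete PySpark sketch of the pattern above (the SparkContext setup, the sample data, and the cache() call are illustrative assumptions, not part of the original question; cache() does not merge the two passes, but it keeps the second filter from recomputing the source):

from pyspark import SparkContext

sc = SparkContext("local", "filter-split")   # assumed local setup, for illustration
something = 3                                # placeholder for the value in the question
initial = sc.parallelize(range(10))          # illustrative sample data
initial.cache()                              # assumption: keep the RDD in memory so the
                                             # second filter reads it instead of recomputing it

set1 = initial.filter(lambda x: x == something)
set2 = initial.filter(lambda x: x != something)

print(set1.collect())   # [3]
print(set2.collect())   # [0, 1, 2, 4, 5, 6, 7, 8, 9]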

Note: I searched the mailing list to see if this question had been asked already, but could not find it.
Re: optimize multiple filter operations

Rishi Yadav
You can try the following (Scala version; you can convert it to Python):

val set = initial.groupBy(x => if (x == something) "key1" else "key2")

This would make only one pass over the original data.
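A direct Python translation of that one-liner (a sketch; the names initial and something follow the original post):

# one pass over the data, producing an RDD of (key, iterable-of-values) pairs
grouped = initial.groupBy(lambda x: "key1" if x == something else "key2")

# for illustration: materialize both groups on the driver
for key, values in grouped.collect():
    print(key, list(values))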


Re: optimize multiple filter operations

Imran Rashid
Rishi's approach will work, but it's worth mentioning that because all of the data goes into only two groups, you will process the resulting data with only two tasks and so lose almost all parallelism. Presumably you're processing a lot of data, since you only want to do one pass, so I doubt that would actually be helpful.

Unfortunately, I don't think there is currently a better approach than doing two passes. Given some more info about the downstream processing, there may be alternatives, but in general I think you are stuck.

E.g., here's a slight variation on Rishi's proposal that may or may not work:

initial.groupBy { x => (if (x == something) "key1" else "key2", util.Random.nextInt(500)) }

which splits the data by a compound key: first a label for whether or not the element matches, and then a random sub-key that subdivides each label into another 500 groups. This will result in nicely balanced tasks within each group, but it also shuffles all of the data, which can be pretty expensive. You might be better off just doing two passes over the raw data.
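For reference, a rough Python equivalent of that variation (a sketch under the same assumptions; the 500-bucket count follows the Scala example):

import random

# compound key: (match label, random bucket in [0, 500)); the random sub-key
# restores parallelism, at the cost of shuffling all of the data
grouped = initial.groupBy(
    lambda x: ("key1" if x == something else "key2", random.randint(0, 499))
)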

Imran
