How to deal with multidimensional keys?


How to deal with multidimensional keys?

Aureliano Buendia
Hi,

How is it possible to reduce by multidimensional keys?

For example, if every line is a tuple like:

(i, j, k, value)

or, alternatively:

((i, j, k), value)

how can Spark handle reducing over j or k?

Re: How to deal with multidimensional keys?

Andrew Ash
If you had an RDD of ((i, j, k), value) pairs, then you could reduce by j by essentially mapping j into the key slot, doing the reduce, and then mapping it back:

rdd.map { case ((i, j, k), v) => (j, (i, k, v)) }.reduceByKey( ... ).map { case (j, (i, k, v)) => ((i, j, k), v) }

It's not pretty, but I've had to use this pattern before too.
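This map / reduce-by-key / map-back pattern can be sketched with plain Scala collections, using groupBy plus reduce to stand in for Spark's reduceByKey. The sample data and the sum-the-values combiner below are made up for illustration:

```scala
// Illustrative data: ((i, j, k), value) pairs.
val data = List(
  ((1, 10, 100), 1.0),
  ((2, 10, 200), 2.0),
  ((3, 20, 300), 3.0)
)

// 1. Move j into the key slot; 2. reduce all values sharing the
// same j (here: sum them, keeping the first i and k seen);
// 3. restore the ((i, j, k), value) shape afterwards.
val reducedByJ = data
  .map { case ((i, j, k), v) => (j, (i, k, v)) }
  .groupBy { case (j, _) => j }
  .map { case (j, group) =>
    val (i, k, v) = group
      .map(_._2)
      .reduce { case ((i1, k1, v1), (_, _, v2)) => (i1, k1, v1 + v2) }
    ((i, j, k), v)
  }
// reducedByJ == Map((1,10,100) -> 3.0, (3,20,300) -> 3.0)
```

On a real RDD the middle step would be reduceByKey, which avoids materializing each group the way groupBy does here.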


On Thu, Jan 2, 2014 at 6:23 PM, Aureliano Buendia <[hidden email]> wrote:

Re: How to deal with multidimensional keys?

K. Shankari
I have had to use this as well.

Sometimes, I create a POJO to hold the multi-dimensional key to make things easier.

i.e.

case class MultiKey(i: Int, j: Int, k: Int)

then I can define a reduce function over the multikey, e.g.

def reduceByI(mkv1: (MultiKey, Value), mkv2: (MultiKey, Value)) =
  if (mkv1._1.i > mkv2._1.i) mkv1 else mkv2

and then I can do

rdd.reduce(reduceByI)
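A self-contained sketch of this named-key approach, using a plain Scala List in place of an RDD. The MultiKey fields, the Double value type, and the sample data are illustrative assumptions:

```scala
// A named multi-dimensional key instead of a bare (i, j, k) tuple.
case class MultiKey(i: Int, j: Int, k: Int)

// Keep whichever (key, value) pair has the larger i.
def reduceByI(a: (MultiKey, Double), b: (MultiKey, Double)): (MultiKey, Double) =
  if (a._1.i > b._1.i) a else b

val pairs = List(
  (MultiKey(1, 10, 100), 1.0),
  (MultiKey(3, 10, 200), 3.0),
  (MultiKey(2, 20, 300), 2.0)
)

// On an RDD this would be rdd.reduce(reduceByI); List.reduce
// behaves the same way here.
val winner = pairs.reduce(reduceByI)
// winner == (MultiKey(3, 10, 200), 3.0)
```

The case class makes the reduce function self-documenting (`.i` rather than `._1`) and gives structural equality for free when the key is used in reduceByKey or groupByKey.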

Thanks,
Shankari


On Thu, Jan 2, 2014 at 3:28 PM, Andrew Ash <[hidden email]> wrote: