Help needed. Not sure how to reduceByKey works in spark

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Help needed. Not sure how to reduceByKey works in spark

suman bharadwaj
Hi,

I'm new to spark. And i needed some help in understanding how reduceByKey works.

I have the following data:

col1                                col2   col3
1/11/2014 12:18:40 AM    123     143
1/11/2014 12:18:45 AM    123     143
1/11/2014 12:18:49 AM    123     143

the output i need is 

col2  col3    totaltime(currect value of col1 - prev val of col1)
123   143        9 

I'm doing the following:

map((col2,col3),col1).reduceByKey( <here i don't know how to perform the subtraction of dates > )

How to perform subtraction of dates ?
How does reduceByKey work when my map emits as follows ((col2,col3),(col1,col4))?


Thanks in advance.
Reply | Threaded
Open this post in threaded view
|

Re: Help needed. Not sure how to reduceByKey works in spark

Andrew Ash
So for each (col2, col3) pair, you want the difference between the earliest col1 value and the latest col1 value?

I'd suggest something like this:

val data = sc.textFile(...).map(l => l.split("\t"))
data.map(r => ((r(1), r(2)), r(0)) // produce an RDD of ((col2, col3), col1)
    .groupByKey() // now have ((col2, col3) [col1s])
    .map(p => (p._1, (max(p._2) - min(p._2)))) // now have ((col2, col3), diffInCol1s)

The downside of this approach is that if you have a (col2, col3) pair with tons of col1 values, you might OOM one of your executors in the groupByKey.

Andrew


On Fri, Jan 10, 2014 at 11:01 AM, suman bharadwaj <[hidden email]> wrote:
Hi,

I'm new to spark. And i needed some help in understanding how reduceByKey works.

I have the following data:

col1                                col2   col3
1/11/2014 12:18:40 AM    123     143
1/11/2014 12:18:45 AM    123     143
1/11/2014 12:18:49 AM    123     143

the output i need is 

col2  col3    totaltime(currect value of col1 - prev val of col1)
123   143        9 

I'm doing the following:

map((col2,col3),col1).reduceByKey( <here i don't know how to perform the subtraction of dates > )

How to perform subtraction of dates ?
How does reduceByKey work when my map emits as follows ((col2,col3),(col1,col4))?


Thanks in advance.