more complex analytics

more complex analytics

amoc

Hi

Are there any examples of how to do operations other than counting in Spark via map then reduceByKey?

It's pretty straightforward to do counts, but how do I plug in my own function (say, a conditional sum based on tuple fields, or a moving average)?

Here's my count example so we have some code to work with:

// sumTuples isn't defined in the post; an assumed definition that just adds the counts:
def sumTuples(a: Int, b: Int): Int = a + b

val inputList = List( ("name","1","11134"), ("name","2","11134"), ("name","1","11130"), ("name2","1","11133") )

sc.parallelize(inputList)
  .map(x => (x, 1))          // pair each tuple with a count of 1
  .reduceByKey(sumTuples)    // add the counts for identical tuples
  .foreach(x => println(x))

How would I add up field 2 from tuples whose "name" field and last field are the same?

In my example the result I want is:

"name","1+2","11134"

“name","1","11130”

“name2","1","11133”

 

Thanks

-A

Re: more complex analytics

Andrew Ash
I would key by the fields that should be the same and then reduce by summing.

sc.parallelize(inputList)
  .map(x => (x._1, x._2.toLong, x._3.toLong)) // parse to numeric values from String
  .map(x => ((x._1, x._3), x._2))             // key by the name and final number field
  .reduceByKey(_ + _)
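
For the sample inputList above, this should print pairs along these lines (sketched by hand, not verified output):

((name,11134),3)
((name,11130),1)
((name2,11133),1)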

Andrew


Re: more complex analytics

purav aggarwal
In reply to this post by amoc
sc.parallelize(inputList)
  .map(x => ((x._1, x._3), x._2.toLong)) // parse field 2 so it sums numerically rather than concatenating Strings
  .reduceByKey(_ + _)

You need to understand what's happening when you say .map(x => (x, 1)): for every x (a tuple of 3 fields in your case), you map it to a pair with key = x and value = 1.
In .map(x => ((x._1, x._3), x._2)), you instead set the key to your first and third fields and the value to your second field.
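
Building on both answers, here is a sketch that also maps the summed pairs back into the original three-field shape. The filter line is only a stand-in for the "conditional sum" the original question mentions; the condition itself is made up for illustration:

sc.parallelize(inputList)
  .filter(x => x._1.startsWith("name"))   // hypothetical condition for a "conditional sum"
  .map(x => ((x._1, x._3), x._2.toLong))  // key by the fields that must match
  .reduceByKey(_ + _)                     // sum field 2 within each key
  .map { case ((name, last), sum) => (name, sum.toString, last) } // back to a 3-tuple
  .foreach(println)                       // e.g. (name,3,11134)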

