Generic types and pair RDDs

Generic types and pair RDDs

dsiegmann
When my tuple type includes a generic type parameter, the pair RDD functions aren't available. Take for example the following (a join on two RDDs, taking the sum of the values):

def joinTest(rddA: RDD[(String, Int)], rddB: RDD[(String, Int)]): RDD[(String, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
}


That works fine, but let's say I replace the key type with a generic type parameter:

def joinTest[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
}


This latter function gets the compiler error "value join is not a member of org.apache.spark.rdd.RDD[(K, Int)]".

The reason is probably obvious, but I don't have much Scala experience. Can anyone explain what I'm doing wrong?

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: [hidden email] W: www.velos.io

Re: Generic types and pair RDDs

Koert Kuipers
  // SparkContext._ brings into scope the implicit conversion that adds
  // the pair-RDD functions (join, reduceByKey, ...) to an RDD of tuples.
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // The context bound "K: ClassTag" supplies the ClassTag that the
  // implicit conversion requires for the key type.
  def joinTest[K: ClassTag](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)]): RDD[(K, Int)] = {
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
  }
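
A minimal usage sketch, assuming a SparkContext named sc is in scope (the sample data is made up):

  val a = sc.parallelize(Seq(("x", 1), ("y", 2)))
  val b = sc.parallelize(Seq(("x", 10), ("y", 20)))
  // K is inferred as String; yields ("x", 11) and ("y", 22), in some order
  joinTest(a, b).collect()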


Re: Generic types and pair RDDs

Aaron Davidson
Koert's answer is very likely correct. The implicit definition that converts an RDD[(K, V)] to PairRDDFunctions requires that a ClassTag be available for K: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1124

To fully understand what's going on as a Scala beginner, you'll have to look up ClassTags, context bounds (the "K: ClassTag" syntax), and implicit conversions. Fortunately, you don't have to understand monads...
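
For reference, a quick sketch of what the context bound desugars to (joinTestExplicit is a made-up name; the two forms are equivalent):

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag

  // "K: ClassTag" is shorthand for an extra implicit parameter list;
  // this explicit form compiles for the same reason Koert's version does.
  def joinTestExplicit[K](rddA: RDD[(K, Int)], rddB: RDD[(K, Int)])
                         (implicit kt: ClassTag[K]): RDD[(K, Int)] =
    rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }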


Re: Generic types and pair RDDs

dsiegmann
That worked, thank you both! Thanks also Aaron for the list of things I need to read up on - I hadn't heard of ClassTag before.

