Apache Spark - Subtract two datasets

Apache Spark-Subtract two datasets

shashikant.kulkarni@gmail.com
Hello,

I have two datasets, a Dataset&lt;Class1&gt; and a Dataset&lt;Class2&gt;. I want the list of records that are in Dataset&lt;Class1&gt; but not in Dataset&lt;Class2&gt;. How can I do this in Apache Spark using the Java API? I am using Apache Spark 2.2.0.

Thank you
---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Apache Spark-Subtract two datasets

Imran Rajjad
If the datasets hold objects of different classes, then you will have to convert both of them to RDDs and rename the columns before you call rdd1.subtract(rdd2).
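A minimal Java sketch of that approach, assuming both record classes can be reduced to a common key type (the stand-in key lists below are illustrative; with real beans you would call something like ds.javaRDD().map(Class1::getKey)):

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class SubtractExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("subtract-sketch").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Stand-ins for the two Datasets, already mapped down to a common
        // key type so subtract() can compare elements for equality.
        JavaRDD<String> keys1 = jsc.parallelize(Arrays.asList("a", "b", "c"));
        JavaRDD<String> keys2 = jsc.parallelize(Arrays.asList("b", "c", "d"));

        // Elements present in the first RDD but absent from the second.
        JavaRDD<String> onlyInFirst = keys1.subtract(keys2);
        onlyInFirst.collect().forEach(System.out::println); // prints "a"

        spark.stop();
    }
}
```

Note that subtract() compares whole elements, which is why the two sides must first be mapped to the same representation.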

--
I.R
Re: Apache Spark-Subtract two datasets

Nathan Kronenfeld-2
In reply to this post by shashikant.kulkarni@gmail.com
I think you want a join of type "left_anti". See the log below:

scala> import spark.implicits._
import spark.implicits._

scala> case class Foo (a: String, b: Int)
defined class Foo

scala> case class Bar (a: String, d: Double)
defined class Bar

scala> var fooDs = Seq(Foo("a", 1), Foo("b", 2), Foo("c", 3)).toDS
fooDs: org.apache.spark.sql.Dataset[Foo] = [a: string, b: int]

scala> var barDs = Seq(Bar("b", 2.1), Bar("c", 3.2), Bar("d", 4.3)).toDS
barDs: org.apache.spark.sql.Dataset[Bar] = [a: string, d: double]

scala> fooDs.join(barDs, Seq("a"), "left_anti").collect.foreach(println)
[a,1]
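For the Java API the original question asked about, the same left_anti join can be sketched as follows. The Foo/Bar bean classes here are hypothetical stand-ins that mirror the Scala case classes above:

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeftAntiExample {
    // Hypothetical bean mirroring the Scala Foo above.
    public static class Foo implements Serializable {
        private String a; private int b;
        public Foo() {}
        public Foo(String a, int b) { this.a = a; this.b = b; }
        public String getA() { return a; }
        public void setA(String a) { this.a = a; }
        public int getB() { return b; }
        public void setB(int b) { this.b = b; }
    }

    // Hypothetical bean mirroring the Scala Bar above.
    public static class Bar implements Serializable {
        private String a; private double d;
        public Bar() {}
        public Bar(String a, double d) { this.a = a; this.d = d; }
        public String getA() { return a; }
        public void setA(String a) { this.a = a; }
        public double getD() { return d; }
        public void setD(double d) { this.d = d; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("left-anti-sketch").getOrCreate();

        Dataset<Foo> fooDs = spark.createDataset(
                Arrays.asList(new Foo("a", 1), new Foo("b", 2), new Foo("c", 3)),
                Encoders.bean(Foo.class));
        Dataset<Bar> barDs = spark.createDataset(
                Arrays.asList(new Bar("b", 2.1), new Bar("c", 3.2), new Bar("d", 4.3)),
                Encoders.bean(Bar.class));

        // Rows of fooDs whose key "a" has no match in barDs; the result
        // keeps only the left side's columns.
        Dataset<Row> onlyInFoo = fooDs.join(barDs,
                fooDs.col("a").equalTo(barDs.col("a")), "left_anti");
        onlyInFoo.show(); // the single row with a = "a"

        spark.stop();
    }
}
```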

