dataset best practice question

dataset best practice question

Mohit Jaggi
Fellow Spark Coders,
I am trying to move a reasonably large code base from DataFrames to Datasets. Today the code looks like this:

val df_a = spark.read.csv(...)
val df_b = df_a.withColumn( some_transform_that_adds_more_columns )
// repeat the above several times

With Datasets, this will require defining:

case class A(f1, f2, f3)     // fields from the CSV file
case class B(f1, f2, f3, f4) // union of A plus the new field added by some_transform_that_adds_more_columns
// repeat this 10 times

Is there a better way? 

Mohit.
RE: dataset best practice question

kevin.r.mellott

Hi Mohit,

 

I’m not sure that there is a “correct” answer here, but I tend to use classes whenever the input or output data represents something meaningful (such as a domain model object). I would recommend against creating a temporary class for each and every transformation step, as that may become difficult to maintain over time.

 

Using withColumn statements will continue to work, and you don’t need to cast to your output class until you’ve set up all transformations. Therefore, you can do things like:

 

import spark.implicits._ // encoders for .as[A] / .as[B] and the $ column syntax

// field types assumed String for illustration
case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String, f4: String, f5: String, f6: String)

val ds_a = spark.read.csv("path").as[A]
val ds_b = ds_a
  .withColumn("f4", someUdf($"f1")) // input columns to someUdf are illustrative
  .withColumn("f5", someUdf($"f2"))
  .withColumn("f6", someUdf($"f3"))
  .as[B]
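A fully typed alternative, sketched under the same assumptions (derive4/derive5/derive6 are hypothetical stand-ins for whatever someUdf computes), is to map straight to B and skip the untyped withColumn steps:

val ds_b2 = ds_a.map { a =>
  // plain Scala per row; Catalyst cannot optimize inside this lambda
  B(a.f1, a.f2, a.f3, derive4(a), derive5(a), derive6(a))
}

The trade-off is that map runs opaque Scala functions, whereas column expressions stay visible to the optimizer.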

 

Kevin

 

Re: dataset best practice question

Mohit Jaggi
Thanks! I wanted to avoid repeating f1, f2, f3 in class B. I wonder whether the encoders still work if I use mixins.
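Sketching what I mean (field types and the names HasF123, Common, BNested are just illustrative): as far as I understand, Spark derives encoders from a case class's constructor parameters, so a mixin trait would not remove the repetition, while nesting a common case class would, at the cost of a nested struct column:

// sketch only; field types assumed String

// (1) Mixin: encoders still come from the constructor, so A and B
// repeat the fields; the trait only guarantees they exist.
trait HasF123 { def f1: String; def f2: String; def f3: String }
case class A(f1: String, f2: String, f3: String) extends HasF123
case class B(f1: String, f2: String, f3: String, f4: String) extends HasF123

// (2) Composition: encoders support nested case classes, so the shared
// fields live in one place, encoded as a nested struct column.
case class Common(f1: String, f2: String, f3: String)
case class BNested(common: Common, f4: String)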
