Fellow Spark Coders, I am trying to move from using DataFrames to Datasets for a reasonably large code base. Today the code looks like this:
df_a = read_csv(...)
df_b = df_a.withColumn(some_transform_that_adds_more_columns)
// repeat the above several times
With datasets, this will require defining
case class A(f1, f2, f3)      // fields from the csv file
case class B(f1, f2, f3, f4)  // union of A and the new field added by some_transform_that_adds_more_columns
// repeat this 10 times
Is there a better way?
Mohit.
Hi Mohit,

I'm not sure that there is a "correct" answer here, but I tend to use case classes whenever the input or output data represents something meaningful (such as a domain model object). I would recommend against creating many temporary classes for each and every transformation step, as that may become difficult to maintain over time. Using withColumn statements will continue to work, and you don't need to cast to your output class until you've set up all the transformations. Therefore, you can do things like:

case class A(f1, f2, f3)
case class B(f1, f2, f3, f4, f5, f6)

val ds_a = spark.read.csv("path").as[A]
val ds_b = ds_a
  .withColumn("f4", someUdf)
  .withColumn("f5", someUdf)
  .withColumn("f6", someUdf)
  .as[B]

Kevin
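For anyone following along, here is a minimal self-contained sketch of the pattern Kevin describes. The field names and types, the CSV path, and the lit(...) expressions standing in for the real UDFs are all placeholders, not the original code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Hypothetical schemas: A mirrors the CSV columns, B adds the derived fields.
case class A(f1: String, f2: String, f3: String)
case class B(f1: String, f2: String, f3: String, f4: Int, f5: Int, f6: Int)

val spark = SparkSession.builder().appName("dataset-example").getOrCreate()
import spark.implicits._

// Read the CSV as the typed input class ("path" and the header option are placeholders).
val ds_a = spark.read.option("header", "true").csv("path").as[A]

// withColumn returns an untyped DataFrame, so the intermediate steps stay untyped;
// cast back to a typed Dataset only once, after all columns have been added.
val ds_b = ds_a
  .withColumn("f4", lit(1))   // stand-ins for someUdf
  .withColumn("f5", lit(2))
  .withColumn("f6", lit(3))
  .as[B]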
Thanks! I wanted to avoid repeating f1, f2, f3 in class B. I wonder whether the encoders/decoders work if I use mixins.
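As far as I know, a mixin (trait) works fine with the built-in product encoders, but it does not remove the repetition by itself: the encoder is derived from the case class constructor parameters, so f1, f2, f3 still have to be declared in each constructor, and the trait only enforces the shared contract. A rough sketch, with hypothetical field types:

// Hypothetical trait capturing the fields shared by A and B.
trait CsvFields {
  def f1: String
  def f2: String
  def f3: String
}

// The fields still appear in each constructor, because that is what the
// encoder is derived from; the trait just keeps the two classes consistent.
case class A(f1: String, f2: String, f3: String) extends CsvFields
case class B(f1: String, f2: String, f3: String, f4: Int) extends CsvFields

A Dataset[A] or Dataset[B] built from these should encode and decode exactly as before; the trait is invisible to Spark.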