How to deal with Schema Evolution with the Dataset API


Jorge Machado-2
Hello everyone,

One question to the community.

Imagine I have this:

        case class Person(age: Int)

        spark.read.parquet("inputPath").as[Person]


After a few weeks of coding I change the class to:
        case class Person(age: Int, name: Option[String] = None)


Then when I run the new code on the same input, it fails, saying that it cannot find the name column in the schema of the Parquet file.

Spark version 2.3.3

What is the best way to guard against or fix this? Regenerating all the data is not an option for us.

Thx

Re: How to deal with Schema Evolution with the Dataset API

Jorge Machado-2
Ok, I found a way to solve it.

Just pass the schema like this:

val schema = Encoders.product[Person].schema

spark.read.schema(schema).parquet("input")...
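
For anyone hitting the same thing, here is a slightly fuller sketch of that idea (untested as written; the app name, master and input path are just placeholders):

        import org.apache.spark.sql.{Encoders, SparkSession}

        // Evolved case class: the new field is optional with a default,
        // so rows decoded from old files can leave it empty.
        case class Person(age: Int, name: Option[String] = None)

        object ReadOldParquet {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("schema-evolution-example") // placeholder
              .master("local[*]")                  // placeholder
              .getOrCreate()
            import spark.implicits._

            // Derive the schema from the case class encoder and force the
            // Parquet reader to use it. Columns that exist in the schema but
            // not in the files (here: name) come back as null instead of
            // making .as[Person] fail during analysis.
            val schema = Encoders.product[Person].schema

            val people = spark.read
              .schema(schema)
              .parquet("inputPath")                // placeholder path
              .as[Person]

            people.show()
            spark.stop()
          }
        }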


Re: How to deal with Schema Evolution with the Dataset API

Edgardo Szrajber
If you want to keep the Dataset, maybe you can try adding a constructor to the case class (through the companion object) that receives only the age, along the lines of the sketch below.
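
Something like this (untested; just to illustrate the idea, the field names mirror the example above):

        case class Person(age: Int, name: Option[String] = None)

        object Person {
          // Extra constructor for the old shape of the data: only the age.
          def apply(age: Int): Person = new Person(age, None)
        }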
Bentzi

