Spark 3.0.1 new Proleptic Gregorian calendar


Saurabh Gulati
Hello,
First of all, thanks for maintaining and improving Spark.

We just updated to Spark 3.0.1 and are facing some issues with the new Proleptic Gregorian calendar.

We have data from different sources in our platform, and we noticed some date/timestamp columns with values going back to years before 1500.

According to this post, data written with Spark 2.4 and read with 3.0 should show some differences in dates/timestamps, but we are not able to replicate this. We only encounter an exception that suggests setting the spark.sql.legacy.parquet.datetimeRebaseModeInRead/Write config options to make it work.

So, our main concern is:
  • How can we test/replicate this behavior? Since it's not very clear to us, and we don't see any docs for this change, we can't decide with certainty which parameters to set and why.
  • What config options should we set, given that we:
    • will always read old data written with Spark 2.4 using Spark 3.0, and
    • will always write newer data with Spark 3.0?
We couldn't make an informed choice ourselves, so we are asking the community which scenarios will be impacted and which will still work fine.
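[For context on why ancient dates can shift at all: Spark 2.4 used the hybrid Julian+Gregorian calendar, while Spark 3.0 uses the proleptic Gregorian calendar, and the same date label names different physical days in the two systems. A minimal stdlib-Python sketch of the difference, using standard Julian-day-number formulas (not Spark code):]

```python
from datetime import date

def julian_to_jdn(y: int, m: int, d: int) -> int:
    # Julian-calendar date -> Julian Day Number (integer-arithmetic formula)
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083

def gregorian_to_jdn(y: int, m: int, d: int) -> int:
    # Proleptic-Gregorian date -> Julian Day Number
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045

# Cross-check against Python's datetime, which is proleptic Gregorian like Spark 3.0:
# JDN and date.toordinal() differ by a fixed offset of 1721425.
assert gregorian_to_jdn(2000, 1, 1) == date(2000, 1, 1).toordinal() + 1721425

# The day labeled "1582-10-04" under Julian rules (the hybrid calendar applies
# Julian rules before the 1582 cutover) is the physical day labeled
# "1582-10-14" in the proleptic Gregorian calendar:
assert julian_to_jdn(1582, 10, 4) == gregorian_to_jdn(1582, 10, 14)

# So reinterpreting the same stored day count under the other calendar shifts
# the printed date by 10 days near the cutover (more, further back in time):
print(julian_to_jdn(1582, 10, 4) - gregorian_to_jdn(1582, 10, 4))  # 10
```

This is the shift the rebase config options exist to correct: without rebasing, a pre-1582 date written by Spark 2.4 and read by Spark 3.0 would be reinterpreted under the new calendar and come out days off.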

Thanks
Saurabh


Re: Spark 3.0.1 new Proleptic Gregorian calendar

Maxim Gekk
Hello Saurabh,

>  What config options should we set,
> - if we are always going to read old data written from Spark2.4 using Spark 3.0

You should set spark.sql.legacy.parquet.datetimeRebaseModeInRead to LEGACY when you read old data.
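[For illustration, a minimal PySpark config sketch of the two settings for this scenario; the session setup and the paths old_data/ and new_data/ are hypothetical, not from the thread:]

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading Parquet written by Spark 2.4 (hybrid Julian+Gregorian calendar):
# rebase ancient dates/timestamps from the legacy calendar on read.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
old_df = spark.read.parquet("old_data/")  # hypothetical path

# Writing new data that will only ever be read back by Spark 3.x:
# keep the new proleptic Gregorian calendar as-is.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
old_df.write.parquet("new_data/")  # hypothetical path
```

The supported values are EXCEPTION (the default, which raises the error you saw), LEGACY, and CORRECTED.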

You see this exception because Spark 3.0 cannot determine which system wrote the Parquet files and which calendar was used while saving them. Starting from version 2.4.6, Spark saves metadata to Parquet files, and Spark 3.0 can infer the mode automatically.

Maxim Gekk

Software Engineer

Databricks, Inc.



On Thu, Nov 19, 2020 at 8:10 PM Saurabh Gulati <[hidden email]> wrote: