DataSourceV2 producing wrong date value in Custom Data Writer


DataSourceV2 producing wrong date value in Custom Data Writer

Shubham Chaurasia
Hi All,

I am using a custom DataSourceV2 implementation (Spark version 2.3.2).

Here is how I am trying to pass in a date type from the spark shell:

scala> val df = sc.parallelize(Seq("2019-02-05")).toDF("datetype").withColumn("datetype", col("datetype").cast("date"))
scala> df.write.format("com.shubham.MyDataSource").save

Below is the minimal write() method of my DataWriter implementation.
@Override
public void write(InternalRow record) throws IOException {
    ByteArrayOutputStream format = streamingRecordFormatter.format(record);
    System.out.println("MyDataWriter.write: " + record.get(0, DataTypes.DateType));
}
It prints an integer as output: 
MyDataWriter.write: 17039

Is this a bug, or am I doing something wrong?

Thanks,
Shubham

Re: DataSourceV2 producing wrong date value in Custom Data Writer

Ryan Blue
Shubham,

DataSourceV2 passes Spark's internal representation to your source and expects Spark's internal representation back from the source. That's why you consume and produce InternalRow: "internal" indicates that Spark doesn't need to convert the values.
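
For illustration, here is a rough sketch of pulling internal values out of an InternalRow with the typed getters (the schema and ordinals here are made up for the example, not taken from your source):

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.unsafe.types.UTF8String;

public class InternalReprSketch {
    // Assumes a row with schema (d DATE, ts TIMESTAMP, s STRING).
    static void inspect(InternalRow row) {
        int days = row.getInt(0);             // DateType: days since 1970-01-01
        long micros = row.getLong(1);         // TimestampType: microseconds since the epoch
        UTF8String s = row.getUTF8String(2);  // StringType: UTF8String, not java.lang.String
        System.out.println(days + " / " + micros + " / " + s);
    }
}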

Spark's internal representation for a date is the number of days since the Unix epoch, so 1970-01-01 = 0.
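
So the integer your writer printed is that internal day count itself. If the writer needs a calendar date, it has to do the conversion, e.g. with java.time (a minimal sketch; the helper name is illustrative):

import java.time.LocalDate;

public class DateDecodeSketch {
    // Decode Spark's internal DateType value (days since 1970-01-01).
    static LocalDate decodeDate(int daysSinceEpoch) {
        return LocalDate.ofEpochDay(daysSinceEpoch);
    }

    public static void main(String[] args) {
        System.out.println(decodeDate(0));  // prints 1970-01-01
    }
}

In your write() method that would be LocalDate.ofEpochDay(record.getInt(0)). Spark itself does these conversions in org.apache.spark.sql.catalyst.util.DateTimeUtils.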

rb

--
Ryan Blue
Software Engineer
Netflix

Re: DataSourceV2 producing wrong date value in Custom Data Writer

Shubham Chaurasia
Thanks, Ryan
