persistent tables in DataSource api V2


fansparker
1. In DataSource API V1, we were able to create persistent tables over custom
data sources with SQL DDL by implementing "createRelation", "buildScan",
"schema", etc. Is there a way to achieve this in DataSource API V2?

2. In DataSource API V1, schema changes in the underlying custom data
source are not reflected in already persisted tables, even if
"schema()" is re-invoked with the updated schema. Is there a way to get the
persisted table's schema updated? Thanks.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: schema changes of custom data source in persistent tables DataSourceV1

fansparker
Does anybody know if there is a way to get the persisted table's schema
updated when the underlying custom data source schema is changed? Currently,
we have to drop and re-create the table.



Re: schema changes of custom data source in persistent tables DataSourceV1

Piyush Acharya
Do you want to merge the schema when incoming data is changed?

spark.conf.set("spark.sql.parquet.mergeSchema", "true")
https://kontext.tech/column/spark/381/schema-merging-evolution-with-parquet-in-spark-and-hive


On Mon, Jul 20, 2020 at 3:48 PM fansparker <[hidden email]> wrote:
Does anybody know if there is a way to get the persisted table's schema
updated when the underlying custom data source schema is changed? Currently,
we have to drop and re-create the table.



Re: schema changes of custom data source in persistent tables DataSourceV1

Russell Spitzer
In reply to this post by fansparker
The last time I looked into this, the answer was no. I believe that, since there is a relation cache internal to the Spark session, the only way to update a session's information was a full drop and create. That was my experience with a custom Hive metastore and entries read from it. I could change the entries in the metastore underneath the session, but since the session cached the relation lookup, I couldn't get it to reload the metadata.

DataSourceV2 does make this easy, though.
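
The caching behaviour described here can be illustrated with a small toy model. This is a plain Python sketch, not Spark code: the Metastore and Session classes and their methods are hypothetical names chosen for illustration, and the real session cache is more involved. It only shows why a change made underneath the session stays invisible until the table is dropped and re-created through the session.

```python
# Toy model only: plain Python, not Spark internals.
# Metastore and Session are hypothetical names used for illustration.

class Metastore:
    """Stands in for the external catalog that stores table schemas."""

    def __init__(self):
        self.tables = {}

    def create(self, name, schema):
        self.tables[name] = list(schema)

    def drop(self, name):
        self.tables.pop(name, None)


class Session:
    """Caches the first lookup of each table, like a per-session relation cache."""

    def __init__(self, metastore):
        self.metastore = metastore
        self._relation_cache = {}

    def lookup(self, name):
        # Only the first lookup reaches the metastore; later ones hit the cache.
        if name not in self._relation_cache:
            self._relation_cache[name] = list(self.metastore.tables[name])
        return self._relation_cache[name]

    def drop_and_create(self, name, schema):
        # Going through the session evicts the cached entry as well.
        self.metastore.drop(name)
        self._relation_cache.pop(name, None)
        self.metastore.create(name, schema)


metastore = Metastore()
metastore.create("events", ["id", "ts"])
session = Session(metastore)

before = session.lookup("events")

# Change the entry underneath the session: the cached lookup does not see it.
metastore.tables["events"] = ["id", "ts", "user"]
stale = session.lookup("events")

# A full drop and create through the session makes the new schema visible.
session.drop_and_create("events", ["id", "ts", "user"])
fresh = session.lookup("events")

print(before, stale, fresh)
```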

On Mon, Jul 20, 2020, 5:17 AM fansparker <[hidden email]> wrote:
Does anybody know if there is a way to get the persisted table's schema
updated when the underlying custom data source schema is changed? Currently,
we have to drop and re-create the table.



Re: schema changes of custom data source in persistent tables DataSourceV1

fansparker
Thanks Russell.  This
<https://gite.lirmm.fr/yagoubi/spark/commit/6463e0b9e8067cce70602c5c9006a2546856a9d6#fecff1a3ad108a52192ba9cd6dd7b11a3d18871b_0_141>  
shows that the "refreshTable" and "invalidateTable" could be used to reload
the metadata but they do not work in our case. I have tried to invoke the
"schema()" with the updated schema from the "buildScan()" as well.

It would be helpful to have this feature enabled for DataSourceV1 as the
schema evolves. I will check whether this is a change that can be made.

You mentioned that it works in DataSourceV2. Is there an implementation
sample for persistent tables DataSourceV2 that works with spark 2.4.4?
Thanks again.



Re: schema changes of custom data source in persistent tables DataSourceV1

Russell Spitzer
The code you linked to is very old, and I don't think that method works anymore (HiveContext no longer exists). My latest attempt at this was with Spark 2.2, and I ran into the issues I wrote about before.

In DSV2 it's done via a catalog implementation, so you can basically write a new catalog that creates tables and such with whatever metadata you like. I'm not sure there is a Hive Metastore catalog implemented in DSV2 yet. If there were, I think it would only be in Spark 3.0.
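
To illustrate the catalog idea, here is a toy model in plain Python. It is not the Spark interface: in Spark 3.0 a custom catalog would implement the Java/Scala interface org.apache.spark.sql.connector.catalog.TableCatalog and be registered under a spark.sql.catalog.<name> setting. The ToySource and ToyCatalog classes below are hypothetical stand-ins. The point of the sketch is only that a catalog which asks the source for its schema on every load never holds a stale snapshot.

```python
# Toy model only: plain Python, not the Spark 3.0 catalog interface.
# ToySource and ToyCatalog are hypothetical names used for illustration.

class ToySource:
    """Stands in for the custom data source whose schema evolves."""

    def __init__(self, schema):
        self.schema = list(schema)


class ToyCatalog:
    """Resolves a table by asking the source for its schema on every load."""

    def __init__(self):
        self._sources = {}

    def create_table(self, name, source):
        # Persist only a pointer to the source, not a snapshot of its schema.
        self._sources[name] = source

    def load_table(self, name):
        source = self._sources[name]
        # The schema is read from the source at load time.
        return {"name": name, "schema": list(source.schema)}


source = ToySource(["id", "ts"])
catalog = ToyCatalog()
catalog.create_table("events", source)

first = catalog.load_table("events")["schema"]

# The underlying source evolves; no drop and re-create is needed.
source.schema.append("user")
second = catalog.load_table("events")["schema"]

print(first, second)
```

Whether a real catalog should resolve the schema on every load or keep its own metadata is a design choice; the sketch assumes the former because it matches the schema-evolution use case discussed in this thread.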

On Mon, Jul 20, 2020 at 10:05 AM fansparker <[hidden email]> wrote:
Thanks Russell.  This
<https://gite.lirmm.fr/yagoubi/spark/commit/6463e0b9e8067cce70602c5c9006a2546856a9d6#fecff1a3ad108a52192ba9cd6dd7b11a3d18871b_0_141>
shows that the "refreshTable" and "invalidateTable" could be used to reload
the metadata but they do not work in our case. I have tried to invoke the
"schema()" with the updated schema from the "buildScan()" as well.

It would be helpful to have this feature enabled for DataSourceV1 as the
schema evolves. I will check whether this is a change that can be made.

You mentioned that it works in DataSourceV2. Is there an implementation
sample for persistent tables DataSourceV2 that works with spark 2.4.4?
Thanks again.



Re: schema changes of custom data source in persistent tables DataSourceV1

fansparker
Makes sense, Russell. I am trying to figure out if there is a way to enforce
metadata reload at "createRelation" if the provided schema in the new
sparkSession is different than the existing metadata schema.
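
Until such a reload hook exists, the drop-and-re-create workaround mentioned earlier in the thread can at least be made conditional on an actual schema difference. The following is a minimal sketch under stated assumptions: recreate_statements is a hypothetical helper, and the provider class com.example.MySource and the endpoint option are made-up placeholders. It only builds the SQL strings; in a real job the persisted schema could be read with spark.table(name).schema and each statement passed to spark.sql(...).

```python
# Sketch only: the helper name, provider class and option names are hypothetical.

def recreate_statements(table, provider, options, persisted_schema, current_schema):
    """Return the SQL to run when the persisted schema no longer matches the source."""
    if list(persisted_schema) == list(current_schema):
        return []  # schemas match, nothing to do

    # Render options as: key 'value', key 'value', ...
    opts = ", ".join(f"{key} '{value}'" for key, value in sorted(options.items()))
    return [
        f"DROP TABLE IF EXISTS {table}",
        # No column list: the source reports its own (updated) schema on creation.
        f"CREATE TABLE {table} USING {provider} OPTIONS ({opts})",
    ]


old_schema = [("id", "long"), ("ts", "timestamp")]
new_schema = [("id", "long"), ("ts", "timestamp"), ("user", "string")]

statements = recreate_statements(
    "events", "com.example.MySource", {"endpoint": "host:9000"}, old_schema, new_schema
)
unchanged = recreate_statements(
    "events", "com.example.MySource", {"endpoint": "host:9000"}, new_schema, new_schema
)

for statement in statements:
    print(statement)
```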


