Spark Structured Streaming and Kafka message schema evolution


Spark Structured Streaming and Kafka message schema evolution

Mich Talebzadeh
This is just a query.

In general, Kafka Connect requires a means of registering a schema so that producers and consumers understand it. It also allows schema evolution, i.e. changes to the metadata that describes the structure of the data sent through a topic.

When we stream a Kafka topic into Spark Structured Streaming (SSS), the assumption is that by the time Spark processes the data, its structure can be established. With foreachBatch, we create a DataFrame on top of the incoming batches of JSON messages, and that DataFrame can be interrogated. However, processing may fail if another column is added to the topic and the consumer (in this case SSS) is not aware of it. How can this change of schema be verified?
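For illustration, something along these lines is what I have in mind (broker address, topic name and column names are just placeholders). One way to check for a schema change is to re-infer the schema of each micro-batch's JSON payload inside foreachBatch and compare it against the schema the query was written with:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka_schema_check").getOrCreate()

# The schema the query was written against.
expected_schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
            .format("kafka")                                   # needs the spark-sql-kafka package
            .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
            .option("subscribe", "sales")                      # placeholder topic
            .load())

def process_batch(batch_df, batch_id):
    json_df = batch_df.selectExpr("CAST(value AS STRING) AS json")
    # Re-infer the schema actually present in this micro-batch and compare it
    # with the expected one; log or alert on any columns we don't know about.
    inferred = spark.read.json(json_df.rdd.map(lambda r: r.json)).schema
    extra = set(inferred.fieldNames()) - set(expected_schema.fieldNames())
    if extra:
        print(f"batch {batch_id}: unexpected new column(s) in topic: {extra}")
    parsed = (json_df
              .select(from_json(col("json"), expected_schema).alias("d"))
              .select("d.*"))
    # ... normal per-batch processing on `parsed` ...

query = raw.writeStream.foreachBatch(process_batch).start()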

Thanks

LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: Spark Structured Streaming and Kafka message schema evolution

Jungtaek Lim-2
If I understand correctly, SQL semantics are strict on column schema. Reading via the Kafka data source doesn't require you to specify a schema, since it provides the key and value as binary; but once you deserialize them, unless you keep the type primitive (e.g. String), you'll need to specify the schema, as from_json requires you to do.

This wouldn't change even if you leverage Schema Registry: you'll need to provide a schema that is compatible with all of the schemas the records are associated with. That is guaranteed if you use the latest version of the schema and you've only changed the schema in backward-compatible ways. I admit I haven't dealt with Schema Registry in SSS, but if you bake the schema into the query plan, the running query is unlikely to pick up the latest schema. That still wouldn't matter, as your query should only use the part of the schema you've integrated, and the latest schema is backward compatible with the integrated one.
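As a small illustration of what backward compatibility buys you with from_json (the field names below are made up): a record carrying a column your schema doesn't know about is simply dropped during parsing, and a record written before a new nullable column was added just comes back with null for it:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

old_schema = StructType([StructField("id", StringType()),
                         StructField("amount", DoubleType())])
# Schema after a backward-compatible evolution: a nullable column was added.
new_schema = StructType(old_schema.fields + [StructField("currency", StringType())])

rows = spark.createDataFrame(
    [('{"id":"a1","amount":10.0}',),                    # old-style record
     ('{"id":"a2","amount":20.0,"currency":"USD"}',)],  # new-style record
    ["value"])

# Parsing with the old schema: the extra "currency" field is silently ignored.
rows.select(from_json(col("value"), old_schema).alias("d")).select("d.*").show()

# Parsing with the new schema: the old record simply gets currency = null.
rows.select(from_json(col("value"), new_schema).alias("d")).select("d.*").show()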

Hope this helps.

Thanks
Jungtaek Lim (HeartSaVioR)


 


Re: Spark Structured Streaming and Kafka message schema evolution

Mich Talebzadeh
Thanks Jungtaek.

I have reasons for this, so I will bring it up in another thread.

Cheers,




 


