[Structured Streaming] File source, Parquet format: use of the mergeSchema option.


maasg
Hi,

I'm looking into the Parquet format support for the File source in Structured Streaming. 
The docs mention the use of the option 'mergeSchema' to merge the schemas of the part files found.[1]

What would be the practical use of that in a streaming context? 

In its batch counterpart, `mergeSchema` infers the superset of the schemas of the part files found.


When using the File source with the Parquet format in streaming mode, we must provide a schema through the readStream.schema(...) builder, and that schema is fixed for the duration of the stream.
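
To make the contrast concrete, here is a minimal PySpark sketch of the two read paths. It assumes an existing SparkSession passed in as `spark`; the schema string, its field names, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical schema (DDL string); the field names are illustrative only.
EVENT_SCHEMA = "id BIGINT, name STRING, score DOUBLE"


def read_batch_merged(spark, path):
    """Batch read: with mergeSchema enabled, Spark reconciles the schemas
    of the Parquet part files found under `path` into one schema."""
    return spark.read.option("mergeSchema", "true").parquet(path)


def read_stream_fixed(spark, path, schema=EVENT_SCHEMA):
    """Streaming read: the schema is declared up front and is applied to
    every file the stream picks up from `path`."""
    return spark.readStream.schema(schema).parquet(path)
```

In the batch case the resulting schema depends on the files present at read time; in the streaming case it is whatever was passed to `schema(...)`, regardless of what later files contain.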

My current understanding is that:

- Files containing a subset of the fields declared in the schema will yield null values for the missing fields.
- For files containing a superset of the fields, the additional fields will be dropped.
- Files that do not match the schema set on the streaming source at all will yield null for every field of each record in the file.
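
The three cases above can be mimicked with a toy model in plain Python. This only restates the understanding described in the list; it does not call Spark and does not confirm how the Parquet reader actually behaves. The field names and records are hypothetical.

```python
def project_onto_schema(record, schema_fields):
    """Keep only the declared fields; fields the record lacks become None."""
    return {field: record.get(field) for field in schema_fields}


# Hypothetical declared schema and sample records, for illustration only.
declared = ("id", "name", "score")

subset_record = {"id": 1, "name": "a"}                               # lacks "score"
superset_record = {"id": 2, "name": "b", "score": 0.5, "extra": "x"}  # has "extra"
unrelated_record = {"foo": 1, "bar": 2}                               # shares no field

for record in (subset_record, superset_record, unrelated_record):
    print(project_onto_schema(record, declared))
```

Under this model the third case is simply the extreme of the first: when no declared field is present, every field comes back as None.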

Does the 'mergeSchema' option play another role? Users may expect this option to help their job cope with schema evolution at runtime, but that does not seem to be the case.

What is the use of this option? 

-kr, Gerard.