structured streaming handling validation and json flattening

structured streaming handling validation and json flattening

Lian Jiang
Hi,

We have a Structured Streaming job that converts JSON into Parquet. We want to validate the JSON records: if a record is not valid, we want to log a message and refuse to write it to Parquet. Also, the JSON contains nested JSON objects, and we want to flatten those nested objects into other Parquet outputs using the same streaming job. My questions are:

1. How do we validate the JSON records in a Structured Streaming job?
2. How do we flatten the nested JSON in a Structured Streaming job?
3. Is it possible to use one Structured Streaming job to validate the JSON, convert it into one Parquet output, and convert the nested JSON into other Parquet outputs?

I think the older, non-structured (DStream) streaming API could achieve these goals, but Structured Streaming is what the Spark community recommends.

Appreciate your feedback!

Re: structured streaming handling validation and json flattening

Jacek Laskowski
Hi Lian,

"What have you tried?" would be a good starting point. Any help on this?

How do you read the JSONs? readStream.json? You could use readStream.text followed by a filter to include good JSONs and exclude bad ones.
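A rough Scala sketch of that idea (the input path and schema below are made up for illustration; from_json returns null for strings it cannot parse against the schema, so a null check works as the filter):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("json-validate").getOrCreate()
import spark.implicits._

// Hypothetical schema for the expected records.
val eventSchema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

// Read every record as raw text so malformed JSON never fails the query.
val raw = spark.readStream.text("in/events")   // single column: value

// from_json yields null when the string does not parse against the schema.
val parsed = raw.withColumn("json", from_json($"value", eventSchema))

val good = parsed.filter($"json".isNotNull).select("json.*")
val bad  = parsed.filter($"json".isNull).select("value")  // log or route these elsewhere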


Re: structured streaming handling validation and json flattening

Phillip Henry
Hi,

I'm in a somewhat similar situation. Here's what I do (it seems to be working so far):

1. Stream in the JSON as a plain string.
2. Feed this string into a JSON library to validate it (I use Circe).
3. Using the same library, parse the JSON and extract fields X, Y and Z.
4. Create a Dataset with fields X, Y, Z, and the JSON as a String.
5. Write this Dataset to HDFS as Parquet partitioned on X and sorted on Y.

Obviously, this is not exactly the same as your use case (for instance, I have no idea what your requirements are regarding "flattening the nested JSON"). Also, I extract only a few fields to use as columns in the resulting Dataset and store the full JSON alongside them as a string. However, the principle should be the same for you.
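For what it's worth, a rough sketch of those five steps in Scala with Circe (field names x/y/z, the paths, and the checkpoint location are made up; records that fail to parse are simply dropped here, where a real job would log them first):

import io.circe.parser.parse
import org.apache.spark.sql.SparkSession

case class Record(x: String, y: Long, z: String, raw: String)

val spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()
import spark.implicits._

// 1. Stream the JSON in as plain strings.
val lines = spark.readStream.textFile("in/events")   // Dataset[String]

// 2.-4. Validate and parse with Circe, pull out x/y/z, keep the raw JSON.
val records = lines.flatMap { json =>
  val parsed = for {
    doc <- parse(json)            // Either[ParsingFailure, Json]
    c    = doc.hcursor
    x   <- c.get[String]("x")
    y   <- c.get[Long]("y")
    z   <- c.get[String]("z")
  } yield Record(x, y, z, json)
  parsed.toOption.toSeq           // invalid records are dropped at this point
}

// 5. Write to HDFS as Parquet, partitioned on x. Sorting on y would need an
// extra step, e.g. sortWithinPartitions inside foreachBatch.
records.writeStream
  .format("parquet")
  .option("path", "out/records")
  .option("checkpointLocation", "out/_checkpoints/records")
  .partitionBy("x")
  .start()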

HTH.

Phillip