having trouble using structured streaming with file sink (parquet)


This post has NOT been accepted by the mailing list yet.

Hi all,


I recently started evaluating structured streaming and ran into a snag right at the start.


Basically, I want to read some data, do a basic aggregation, and write the result to a file:


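import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.streaming.ProcessingTime

// Streaming source: parquet files under /mytest, read with a predefined schema
val rawRecords = spark.readStream.schema(myschema).parquet("/mytest")

// Bucket the ids into 100 groups and average the id within each group
val q = rawRecords.withColumn("g", $"id" % 100).groupBy("g").agg(avg($"id"))

// Try to write the full aggregate to parquet every 10 seconds
val res = q.writeStream
  .outputMode("complete")
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()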


The problem is that Spark tells me the parquet sink does not support complete output mode (or update mode, for that matter).

So how would I write the result of a streaming aggregation to a file?
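
One possible direction, sketched here only under the assumption of Spark 2.4 or later (where DataStreamWriter.foreachBatch exists), is to keep complete mode and overwrite the parquet output on every trigger. The helper name writeAggregates, its parameters, and the avg_id alias are hypothetical and not part of my original code; the alias is there because parquet rejects column names containing parentheses, such as the default avg(id).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hypothetical helper. Assumes Spark 2.4+ and a streaming DataFrame
// with a numeric "id" column.
def writeAggregates(
    rawRecords: DataFrame,
    outPath: String,
    checkpointPath: String): StreamingQuery = {

  val aggregated = rawRecords
    .withColumn("g", col("id") % 100)
    .groupBy("g")
    .agg(avg(col("id")).as("avg_id")) // alias avoids "(" and ")" in the parquet column name

  // In complete mode each micro-batch carries the whole result table,
  // so the previous output is simply overwritten.
  val overwriteParquet: (DataFrame, Long) => Unit =
    (batch, _) => batch.write.mode("overwrite").parquet(outPath)

  aggregated.writeStream
    .outputMode("complete")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .option("checkpointLocation", checkpointPath)
    .foreachBatch(overwriteParquet)
    .start()
}
```

With the paths from my example this would be called as writeAggregates(rawRecords, "/test2", "/mycheckpoint"). Note that a reader of the output directory could observe a partially written result while an overwrite is in progress.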

In general, my goal is to have a single (slow) streaming process that writes a profile to a file, and a second streaming process that loads the current profile as a dataframe and uses it in a join. I would stop the second streaming process periodically and reload the dataframe.
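
To make the intent of the second process concrete, here is a rough sketch of what I have in mind. Every name in it (startJoinedQuery, runWithPeriodicReload, events, profilePath, reloadMs) is hypothetical, the join column "g" is assumed to exist on both sides, and the console sink is only a placeholder.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical: start the second streaming query against a fresh,
// static snapshot of the profile written by the first process.
def startJoinedQuery(
    spark: SparkSession,
    events: DataFrame,
    profilePath: String): StreamingQuery = {
  val profile = spark.read.parquet(profilePath) // static dataframe, read once
  events
    .join(profile, Seq("g")) // stream-static inner join on the assumed column "g"
    .writeStream
    .format("console") // placeholder sink; a real sink would also need a checkpointLocation
    .start()
}

// Hypothetical: restart the query every reloadMs milliseconds so that
// it picks up the latest profile snapshot.
def runWithPeriodicReload(
    spark: SparkSession,
    events: DataFrame,
    profilePath: String,
    reloadMs: Long): Unit = {
  var keepRunning = true
  while (keepRunning) {
    val query = startJoinedQuery(spark, events, profilePath)
    // awaitTermination(timeoutMs) returns true if the query ended
    // within the timeout, false if it is still running.
    if (query.awaitTermination(reloadMs)) keepRunning = false
    else query.stop()
  }
}
```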


Any help would be appreciated.