[Structured Streaming] Robust watermarking calculation with future timestamps

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[Structured Streaming] Robust watermarking calculation with future timestamps

Anastasios Zouzias
Hi all,

We currently have the following issue with a Spark Structured Streaming (SS) application. The application reads messages from thousands of source systems, stores them in Kafka and Spark aggregates them using SS and watermarking (15 minutes).

The root problem is that a few of the source systems have a wrong timezone setup that makes them emit messages from the future, i.e., +1 hour ahead of current time (mis-configuration or winter/summer timezone change (yeah!) ). Since watermarking is calculated as

(most latest timestamp value of all messages) - (watermarking threshold value, 15 mins),

most of the messages are dropped due to the fact that are delayed by more than 45 minutes. To an even more extreme scenario, even a single "future" / adversarial message can make the structured streaming application to report zero messages (per mini-batch).

Is there any user exposed SS API that allows a more robust calculation of watermarking, i.e., 95th percentile of timestamps instead of max timestamp? I understand that such calculation will be more expensive, but it will make the application more robust.

Any suggestions/ideas?

PS. Of course the best approach would be to fix the issue on all source systems but this might take time to do so (or perhaps drop future messages programmatically (yikes) ).

Best regards,
Anastasios