I will be streaming data and am trying to understand how to remove old data from a stream so that it does not become too large. I will stream in one large table of buying data and join it to a second table of different data. I only need the last 14 days from the second table; anything older than 14 days can be discarded.
The important part of the code is the WHERE clause in the SQL statement: "where t1.creation_time < current_timestamp() - interval 15 minutes"
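To make the direction of that comparison concrete, here is a minimal plain-Python sketch of the predicate. It is not Spark code: `passes_filter`, the fixed `now` value, and the two sample timestamps are hypothetical names and values chosen only for illustration, and `now` stands in for whatever `current_timestamp()` returns when the query runs.

```python
from datetime import datetime, timedelta


def passes_filter(creation_time: datetime, now: datetime) -> bool:
    """Plain-Python model of:
    creation_time < current_timestamp() - interval 15 minutes
    """
    cutoff = now - timedelta(minutes=15)
    return creation_time < cutoff


# Hypothetical stand-in for current_timestamp()
now = datetime(2024, 1, 1, 12, 0, 0)

recent_row = now - timedelta(minutes=5)   # created 5 minutes ago
old_row = now - timedelta(minutes=30)     # created 30 minutes ago

print(passes_filter(recent_row, now))  # False
print(passes_filter(old_row, now))     # True
```

As written, the comparison is true for rows created more than 15 minutes ago and false for newer ones, so it is worth checking that `<` is the intended operator.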
For this example, I am hoping that the stream will not contain any rows older than 15 minutes. Is this assumption correct? I am not sure how to test it. In addition, I have set a 2-minute watermark on the first stream. My understanding is that this watermark will make Spark wait an additional 2 minutes for any data that arrives late.
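For reference, the watermark rule can be sketched in plain Python as follows. This is a simplified model, not Spark itself: it assumes the watermark is the largest event time seen so far minus the configured delay, and that rows with an event time behind the watermark are candidates for being dropped by stateful operations. The names `WATERMARK_DELAY`, `watermark`, and `is_too_late` are hypothetical, and real Spark only advances the watermark at micro-batch boundaries.

```python
from datetime import datetime, timedelta

# Assumed delay, matching the 2-minute watermark in the question
WATERMARK_DELAY = timedelta(minutes=2)


def watermark(max_event_time_seen: datetime) -> datetime:
    """Simplified watermark: latest event time seen minus the delay."""
    return max_event_time_seen - WATERMARK_DELAY


def is_too_late(event_time: datetime, max_event_time_seen: datetime) -> bool:
    """True if the row's event time falls behind the watermark."""
    return event_time < watermark(max_event_time_seen)


# Hypothetical latest event time observed on the stream
latest = datetime(2024, 1, 1, 12, 0, 0)

slightly_late = latest - timedelta(minutes=1)  # within the 2-minute delay
very_late = latest - timedelta(minutes=3)      # behind the watermark

print(is_too_late(slightly_late, latest))  # False
print(is_too_late(very_late, latest))      # True
```

In this model the 2 minutes are measured against event times in the data rather than against the wall clock, so the watermark only moves forward when newer events arrive.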