[Spark Structured Streaming]: Read Kafka offset from a timestamp


[Spark Structured Streaming]: Read Kafka offset from a timestamp

puneetloya
I would like to request a feature for reading data from the Kafka source based
on a timestamp, so that an application that needs to process data from a
certain time can do so. I agree that checkpoints give us continuity of the
stream processing, but what if I want to rewind past the checkpoints?
According to Spark experts, editing checkpoints is not advised, and finding
the right offsets for Spark to replay is tricky. Replaying from a certain
timestamp is a lot easier, at least with a decent monitoring system: you know
the time from which things started to fall apart, like a buggy push or a bad
settings change.

The Kafka consumer API already supports this through the offsetsForTimes
method, which can easily return the right offsets:

https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/Consumer.html#offsetsForTimes
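
For reference, here is a minimal sketch in Scala of how offsetsForTimes
resolves a timestamp to per-partition offsets. The broker address
(localhost:9092) and topic name ("events") are placeholders, and the
two-hour rewind is just an example:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val topic = "events"  // placeholder topic
// Example: rewind to two hours ago (e.g. when the buggy push went out).
val rewindTo = java.lang.Long.valueOf(System.currentTimeMillis() - 2 * 60 * 60 * 1000L)

// Build one timestamp query per partition of the topic.
val query = consumer.partitionsFor(topic).asScala
  .map(p => new TopicPartition(topic, p.partition) -> rewindTo)
  .toMap.asJava

// For each partition the broker returns the earliest offset whose record
// timestamp is >= the requested timestamp, or null if no such record exists.
val offsets = consumer.offsetsForTimes(query).asScala
offsets.foreach { case (tp, oat) =>
  println(s"partition ${tp.partition}: offset ${Option(oat).map(_.offset)}")
}
consumer.close()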

Similar to the existing startingOffsets and endingOffsets options, the source
could support startTimestamp and endTimestamp options.
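
Until such options exist, one workaround is to resolve the offsets yourself
and pass them through the existing startingOffsets option, which accepts
per-partition offsets as JSON. A sketch, continuing the consumer example
above ("offsets" and "topic" come from there, and "spark" is assumed to be an
existing SparkSession):

// Render the resolved offsets as the JSON shape startingOffsets accepts:
// {"topic": {"partition": offset, ...}}
val startJson = offsets.collect { case (tp, oat) if oat != null =>
  s""""${tp.partition}": ${oat.offset}"""
}.mkString(s"""{"$topic": {""", ", ", "}}")

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", topic)
  // Note: startingOffsets only applies when the query starts fresh,
  // i.e. without an existing checkpoint.
  .option("startingOffsets", startJson)
  .load()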

In a SaaS environment where data flows continuously, small tweaks like this
help us repair our systems. Spark Structured Streaming is already great, but
features like these help keep things under control in a live production
processing environment.

Cheers,
Puneet




Re: [Spark Structured Streaming]: Read Kafka offset from a timestamp

Jungtaek Lim
It really depends on whether we would use it only for starting a query
(instead of restoring from a checkpoint), or whether we would want to restore
a previous batch from a specific time (restoring state as well).

The former makes sense, and I'll try to see whether I can address it. The
latter doesn't look possible, because we normally want to go back at least a
couple of hours, during which hundreds of batches could have been processed
and no information about those batches is left. It looks like you're talking
about the latter, but if we agree the former is helpful enough, I can take a
look at it.

By the way, it might help with batch queries as well; actually it sounds even
more helpful there.
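
To illustrate the batch case: a batch query can already bound both ends of
the range with explicit offsets, since endingOffsets is supported there
(unlike in streaming); a timestamp option would just generate these offsets
for you. A sketch, where startJson and endJson are assumed to be built from
offsetsForTimes for the two timestamps, as in the earlier example:

// Batch read over a closed offset range; endingOffsets is batch-only.
val batch = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "events")                        // placeholder topic
  .option("startingOffsets", startJson)  // offsets at the start timestamp
  .option("endingOffsets", endJson)      // offsets at the end timestamp
  .load()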

-Jungtaek Lim (HeartSaVioR)
