Structured Streaming: multiple sinks

aravias
1) We are consuming from Kafka using Structured Streaming and writing the processed data set to S3. Going forward we also want to write the processed data to Kafka. Is it possible to do this from the same streaming query? (Spark version 2.1.1)


2) In the logs I see the streaming query progress output, and I have a sample durations JSON from the log (shown below). Can someone please clarify the difference between addBatch and getBatch?

3) triggerExecution - is it the time taken to both process the fetched data and write it to the sink?




"durationMs" : {
    "addBatch" : 2263426,
    "getBatch" : 12,
    "getOffset" : 273,
    "queryPlanning" : 13,
    "triggerExecution" : 2264288,
    "walCommit" : 552
  },

regards
aravias

Re: Structured Streaming: multiple sinks

cbowden
1. Would it not be more natural to write the processed data to Kafka and then sink it from Kafka to S3? (See the sketch after this list.)
2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took, as measured by StreamExecution.
3. triggerExecution is effectively the end-to-end processing time for the micro-batch. Note that all the other durations sum closely to triggerExecution; the small remainder comes from book-keeping activities in StreamExecution.
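A minimal sketch of running two queries over the same processed stream (not the poster's actual code): the broker address, topic names, bucket paths, and the transformation are placeholders, and the built-in "kafka" sink used below requires Spark 2.2 or later, so on 2.1.1 a custom ForeachWriter would be needed to produce to Kafka. Each start() creates an independent streaming query with its own checkpoint location and its own source offsets.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("multi-sink-sketch").getOrCreate()

    // Read from Kafka and apply the processing (Kafka source available since Spark 2.1).
    val processed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder brokers
      .option("subscribe", "input-topic")                  // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value")
      // ... whatever transformation produces the processed data set ...

    // Query 1: write the processed data to S3, e.g. as Parquet.
    val toS3 = processed.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/processed/")                  // placeholder path
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/s3/")
      .start()

    // Query 2: write the same processed data to Kafka.
    // The "kafka" sink (Spark 2.2+) expects a string or binary "value" column.
    val toKafka = processed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "processed-topic")                            // placeholder topic
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/kafka/")
      .start()

    spark.streams.awaitAnyTermination()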
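The durationMs block in the question is part of the query's progress reporting, and the same numbers can be read programmatically, for example with a StreamingQueryListener. A sketch, assuming an active SparkSession:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    val spark = SparkSession.builder.getOrCreate()

    // Print the per-trigger durations for every completed micro-batch.
    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // durationMs holds the same keys seen in the log: addBatch, getBatch,
        // getOffset, queryPlanning, triggerExecution, walCommit.
        println(s"batch=${p.batchId} durations(ms)=${p.durationMs}")
      }
    })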