In-order processing using Spark Streaming


scorpio
I am reading messages from Kafka and processing them with Spark Streaming. The incoming messages belong to various sessions, and I am using a partitioned topic to ensure that messages belonging to the same session end up in the same partition of the Kafka topic.
The Spark Streaming job reads the messages from the Kafka topic using a direct consumer, and there may be n executors reading in parallel. Since the messages belonging to a session are ordered in Kafka, they will be read in that same order by the Kafka consumer run by the Spark engine. I believe Spark dedicates one core per executor to reading from Kafka and uses the remaining cores to process the messages that were read. So, if the messages are processed by more than one core in parallel, won't that break the in-order processing I need? If so, I could limit the number of cores per executor, but wouldn't that mean underutilizing my Spark cluster's resources?
Is there some way to ensure that the session messages are processed by Spark Streaming in the order they arrived at Kafka?
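For reference, here is a minimal, untested sketch of the setup described above, assuming Scala and the spark-streaming-kafka-0-10 direct stream API; the broker address, topic name, group id, and batch interval are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("SessionOrdering")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "session-consumer",        // placeholder group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// The direct stream creates one Spark partition per Kafka partition, so
// records keyed by session id arrive together and in Kafka offset order.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("sessions"), kafkaParams)   // placeholder topic
)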

Re: In-order processing using Spark Streaming

JayeshLalwani
Option A

If you can get all the messages for a session into the same Spark partition, you can use df.mapPartitions to process the whole partition. This allows you to control the order in which the messages are processed within the partition.
This works as long as the messages are posted to Kafka in order and Kafka guarantees they are delivered in order (which it does within a single partition).
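As a rough, untested sketch of Option A, assuming a Structured Streaming Kafka source (which by default creates one Spark partition per Kafka partition), the Kafka message key carrying the session id, and placeholder broker/topic names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SessionOrderingMapPartitions").getOrCreate()
import spark.implicits._

// key = session id, value = message payload
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "sessions")                      // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Each Kafka partition becomes one Spark partition in the source, so the
// iterator hands over a session's messages in Kafka offset order. Processing
// them sequentially inside mapPartitions preserves that order.
val processed = events.mapPartitions { records =>
  records.map { case (sessionId, payload) =>
    (sessionId, payload.length)   // placeholder per-message logic
  }
}

Note that the ordering only holds as long as nothing between the source and the mapPartitions call repartitions the data.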

Option B
If the messages can arrive out of order and have a timestamp associated with them, you can use window operations to sort the messages within a window. You will still need to make sure that messages in the same session land in the same Spark partition. This adds latency, though, because you won't process a window's messages until the watermark has passed it.
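A rough, untested sketch of Option B, assuming each record's Kafka timestamp is used as the event time (a timestamp parsed from the payload would work the same way); the broker, topic, window size, and watermark delay are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("SessionOrderingWindowed").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
  .option("subscribe", "sessions")                      // placeholder topic
  .load()
  .selectExpr(
    "CAST(key AS STRING) AS sessionId",
    "CAST(value AS STRING) AS payload",
    "timestamp AS eventTime")                           // Kafka record timestamp as event time

// Group each session's messages into 30-second event-time windows and sort
// them by timestamp inside the window. A window's result is only emitted
// after the 1-minute watermark passes the end of the window, which is where
// the extra latency comes from.
val ordered = events
  .withWatermark("eventTime", "1 minute")
  .groupBy(col("sessionId"), window(col("eventTime"), "30 seconds"))
  .agg(sort_array(collect_list(struct(col("eventTime"), col("payload"))))
         .as("orderedMessages"))

Each output row then holds one session's messages for that window, sorted by timestamp, ready to be processed in order by the sink.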