I'm having some trouble figuring out how receivers tie into Spark's driver-executor structure.
Does every executor have a receiver that blocks as soon as it receives some stream data?
Or can multiple streams of data be taken as input by a single executor?
I have stream data arriving every second from 5 different sources, and I want to aggregate data from each of them. Does this mean I need 5 executors, or does it have to do with threads on the executor?
I might be mixing in a few concepts here. Any help would be appreciated.
Note: none of this applies to the direct streaming approaches, only receiver-based DStreams.
You can think of a receiver as a long-running task that never finishes. Each receiver is submitted to an executor slot somewhere; it then runs indefinitely and internally has a method that passes records over to a block management system. A block interval that you configure decides when each block is "done", and records arriving after that time go into the next block (see the `spark.streaming.blockInterval` parameter).
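To make the block-interval behavior concrete, here is a toy model in plain Python (not Spark's actual API; the function name and signature are mine) of how a receiver's records get sliced into blocks by arrival time:

```python
from collections import defaultdict

def assign_blocks(record_times_ms, block_interval_ms=200):
    """Group record arrival times (ms) into blocks, one block per
    block interval. Spark's default blockInterval is 200ms."""
    blocks = defaultdict(list)
    for t in record_times_ms:
        # A record falls into the block covering its arrival window.
        blocks[t // block_interval_ms].append(t)
    return dict(blocks)

# Records arriving over half a second with a 100ms block interval
# land in one block per 100ms window.
print(assign_blocks([10, 120, 130, 250, 340, 499], block_interval_ms=100))
# {0: [10], 1: [120, 130], 2: [250], 3: [340], 4: [499]}
```

Each finished block later becomes one partition of the batch's RDD, which is why the block interval controls parallelism downstream.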
So if you have 5 different receivers, you need at minimum 6 executor cores: 1 core for each receiver, and 1 core to actually do your processing work. In a real-world case you probably want significantly more cores on the processing side than just 1. Without repartitioning, you will never have more partitions per batch than (batch interval / block interval) × number of receivers.
A quick example
I run 5 receivers with a block interval of 100 ms and a Spark batch interval of 1 second, and I use union to group them all together. I will most likely end up with one Spark job for each batch, every second, running with 50 partitions (1000 ms / 100 ms per partition per receiver × 5 receivers). If I have a total of 10 cores in the system, 5 of them are running receivers; the remaining 5 must process the 50 partitions of data generated by the last second of work.
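The sizing arithmetic above can be sketched as a small back-of-the-envelope helper (illustrative only; the parameter names are mine, not Spark config keys):

```python
def sizing(receivers, batch_ms, block_ms, total_cores):
    """Partitions per batch and cores left for processing under
    receiver-based DStreams, with no repartitioning."""
    blocks_per_receiver = batch_ms // block_ms   # partitions each receiver contributes per batch
    partitions = blocks_per_receiver * receivers  # partitions in each batch's job
    processing_cores = total_cores - receivers    # receivers occupy their cores permanently
    return partitions, processing_cores

# The example above: 5 receivers, 1s batches, 100ms blocks, 10 cores total.
partitions, cores = sizing(receivers=5, batch_ms=1000, block_ms=100, total_cores=10)
print(partitions, cores)  # 50 partitions, 5 cores left to process them
```

The useful intuition: receiver cores are a fixed tax paid every second, so only the leftover cores determine how fast each batch's 50 partitions drain.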
And again, just to reiterate, if you are doing a direct streaming approach or structured streaming, none of this applies.
This is super helpful. Thank you so much.
Can you elaborate on the differences between structured streaming and DStreams? How would the number of receivers required, etc., change?
On Sat, 8 Aug, 2020, 10:28 pm Russell Spitzer, <[hidden email]> wrote:
The direct approach, which is also available through DStreams, and structured streaming use a different model. Instead of being push-based streaming solutions, they are (in general) pull-based.
On every batch the driver uses the configuration to create a number of partitions, each is responsible for independently pulling a number of records. The exact number of records and guarantees around the pull are source and configuration dependent. Since the system is pull based, there is no need for a receiver or block management system taking up resources. Every task/partition contains all the information required to get the data that it describes.
As an example with Kafka: the driver might decide that batch 1 contains all the records between offsets 1 and 100. It checks and sees that there are 10 Kafka partitions, so it ends up making a Spark job which contains 10 tasks, each task dedicated to a single Kafka partition. Each task will then independently ask for its records from its Kafka partition. There will be no Spark resources used outside of those required for those 10 tasks.
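A sketch of that planning step (not the actual Kafka connector code; the class and function here are hypothetical) shows why no receiver is needed, since every task is fully described by its offset range:

```python
from dataclasses import dataclass

@dataclass
class OffsetRange:
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive

def plan_batch(topic, latest_offsets, committed_offsets):
    """The driver's side of the pull model: one self-contained task
    per Kafka partition that has new records to read."""
    return [
        OffsetRange(topic, p, committed_offsets[p], latest_offsets[p])
        for p in sorted(latest_offsets)
        if latest_offsets[p] > committed_offsets[p]
    ]

# 10 partitions, each with 10 new records: the driver builds 10
# independent tasks, and each task knows exactly what to pull.
latest = {p: 10 for p in range(10)}
committed = {p: 0 for p in range(10)}
tasks = plan_batch("events", latest, committed)
print(len(tasks), tasks[0])
```

Because the task itself carries the offset range, it can be retried anywhere in the cluster without any long-running receiver or block management state.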
On Sun, Aug 9, 2020, 10:44 PM Dark Crusader <[hidden email]> wrote: