Spark Stream analysys on actual data time-stamp for lagging stream joins
This post has NOT been accepted by the mailing list yet.
I am considering Spark streaming for our product. I have following streaming problem to solve:
1) We have 10K entities with an average 300 metric collected for each entity.
2) Each metric stream needs to be processed for Machine Learning analyses. Total = 3000000 streams in the system at a time and growing.
3) I have following different level of metric stream processing.
- Each metric across entities is processed as a stream.
- two or more metrics can combine and get processed together as metric groups (eg. (M1, M2) (M6, M7, M10)) streams.
4) Two metrics can be grouped if the have max time difference of 30 second based on the actual data time stamp. For Metric group analysys. They are related events. I need to bin the different metric data in streams so that they are within a REAL DATA time stamp(not absolute time) limits.
5) I am doing the following -
- Collecting each metric as a stream through Kafka subscriber running on spark node and doing single metric stream analysis.
- Then I want to join this two stream and create a new Dstream for metric groups analysis. But a window on the joined stream might not fetch me metric data for each metric which fall withing 30 second data time stamp limit (due to lagging streams). Metric data for few streams might be missing or lagging by more then 30 seconds on actual data time stamp. So, I am planing to implement a binning system which will wait for say n*30 seconds max for data for different metric to arrive and get grouped with other metrics to form a metric group. In case a bin get filled before the time it should move forward for processing.
1) How can I model this problem using spark streaming?
2) Can I have one stream processed entirely at one node instead of distributed batch processing?
Each metric (or group) require lot of metric specific models to be loaded to the processing nodes for analysis. I would like to load only those models which are required, not all models at all Spark nodes.
3) I am loading each metric stream data from kafka subscriber, can I have that same node which load the data do the stream processing for loaded metric locally and do not distribute it to other node?