I know this is a basic question, but someone enquired about it and I wanted to fill a gap in my own knowledge, so to speak.
Within the context of Spark Streaming, an RDD is created from the incoming topic; the RDD is partitioned, and each Spark node operates on one partition at a time. This series of operations is merged together to create a DAG. Does that mean the DAG keeps track of the operations performed?
If a node goes down, the driver (application master) knows about it and tries to assign another node to continue processing that partition of the RDD from the same place. This works, but with reduced performance. However, to be able to recompute the lost partition, the data has to be available to the other nodes from the beginning. Since we are talking about Spark Streaming here, not Spark reading from HDFS, a Hive table, etc., is the assumption that the streaming data is cached on every single node? Is that correct?
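To make the recomputation idea concrete, here is a minimal sketch in plain Python (not the Spark API — the names `source_partitions`, `lineage`, and `compute_partition` are all illustrative). It models an RDD partition as "source data plus a recorded chain of transformations", which is essentially what the DAG tracks: losing a partition's result is recoverable by replaying the lineage against the source, which is why the source must remain reachable.

```python
# Conceptual sketch of lineage-based recovery (plain Python, not Spark).
# All names here are hypothetical, for illustration only.

# Source data split into partitions (stands in for a replayable stream/log).
source_partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Lineage: the ordered chain of transformations the DAG records.
lineage = [
    lambda x: x * 10,   # first map step
    lambda x: x + 1,    # second map step
]

def compute_partition(pid):
    """Rebuild one partition by replaying the lineage against the source."""
    data = source_partitions[pid]
    for fn in lineage:
        data = [fn(x) for x in data]
    return data

# Normal run: each node computes its own partition.
results = {pid: compute_partition(pid) for pid in range(len(source_partitions))}

# The node holding partition 1 fails, so its result is lost...
del results[1]

# ...and another node recovers it by replaying the same lineage.
results[1] = compute_partition(1)

print(results[1])  # [41, 51, 61]
```

The key point the sketch illustrates: recovery does not require every node to hold a copy of the computed results, only that the lineage is known and the source data is still accessible to whichever node takes over.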