We are ingesting a high-volume change stream coming from an Oracle table.
Whenever there is a change in the Oracle table, a CDC process writes the change to a Kafka/Event Hubs stream, and that stream is consumed by a Spark Structured Streaming application.
Because of some challenges on the Oracle side, Oracle commits arrive in big bursts, regularly over a couple of million records at a time, especially for delete transactions. As a result, the stream consumed by the Spark app is very unevenly distributed.
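For reference, the consuming side is roughly the shape below. Broker address, topic name, and option values are placeholders, not our real setup:

```python
# Rough shape of today's consumer. Broker and topic are placeholders.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "broker1:9092",  # placeholder broker
    "subscribe": "oracle-cdc",                  # placeholder topic name
    "startingOffsets": "latest",
}

def build_stream(spark):
    """Return a streaming DataFrame over the CDC topic.

    spark is an active SparkSession; it is a parameter so the sketch
    can be read (and the options inspected) without a Spark install.
    """
    return (
        spark.readStream.format("kafka")
        .options(**KAFKA_OPTIONS)
        .load()
        # the 'value' column carries the CDC payload; parsing it depends
        # on the CDC tool's serialization format
    )
```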
a) Is there any special care that should be taken when writing this kind of Spark app?
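By "special care" we mainly mean throttling. One option we are considering is the Kafka source's maxOffsetsPerTrigger setting, which caps how much a single micro-batch pulls, so a multi-million-row commit is spread over several batches instead of landing in one. A sketch (the numbers are illustrative, not tuned):

```python
# Bound how much one micro-batch pulls from the Kafka source so a burst
# (e.g. a multi-million-row delete commit) is spread across several
# micro-batches. Values below are illustrative, not tuned.
THROTTLE_OPTIONS = {
    "maxOffsetsPerTrigger": "500000",  # max offsets consumed per micro-batch
    "minPartitions": "64",             # split skewed topic partitions for parallelism
}

def build_throttled_stream(spark, kafka_options):
    # spark is an active SparkSession; kafka_options holds the usual
    # bootstrap-servers / subscribe settings.
    return (
        spark.readStream.format("kafka")
        .options(**kafka_options)
        .options(**THROTTLE_OPTIONS)
        .load()
    )
```

Is this the right knob for bursty CDC traffic, or is there more to it (e.g. state-store sizing, shuffle partitions)?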
b) Would it be better to go with a Spark batch job running every hour or so? In that case we could use Event Hubs Capture to archive the stream to storage every 5 minutes and then consume it from HDFS/storage every hour.
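Concretely, the hourly batch idea looks roughly like this: Capture writes Avro files under its default date-partitioned layout ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/...), and the job reads one hour's worth. Paths are placeholders, and reading Avro needs the spark-avro package on the classpath:

```python
from datetime import datetime

# Placeholder storage location, not our real account/container.
CAPTURE_BASE = "abfss://capture@examplestore.dfs.core.windows.net/ns/hub"

def hour_prefix(base, ts):
    """Glob covering one hour of Capture output across all partitions,
    following Capture's default date-partitioned file naming."""
    return f"{base}/*/{ts:%Y/%m/%d/%H}"

def run_hourly_batch(spark, ts, output_path):
    # spark is an active SparkSession; Capture files are Avro, so this
    # requires the spark-avro package.
    df = spark.read.format("avro").load(hour_prefix(CAPTURE_BASE, ts))
    # 'Body' carries the raw event payload in Capture's Avro schema
    df.select("Body").write.mode("append").parquet(output_path)
```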
c) Other than a dedicated CDC tool, is there any Spark package that can listen to an Oracle change stream directly? In other words, can Spark itself be used as the CDC tool?