Big Burst of Streaming Changes

ayan guha

We are ingesting a high-volume stream of changes coming from an Oracle table.

The Requirement:
Whenever there is a change in the Oracle table, a CDC process writes the change to a Kafka or Event Hub stream, and the stream is consumed by a Spark Streaming application.

The Problem:
Because of some challenges on the Oracle side, we observe that commits in Oracle happen in big bursts, regularly over a couple of million records, and especially for delete transactions. Hence, the stream consumed by the Spark app is not evenly distributed.
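
To make the unevenness concrete, here is a rough sketch in plain Python (no Spark involved) of how micro-batch sizes would look with and without a per-trigger cap, similar in spirit to the Kafka source's maxOffsetsPerTrigger option in Structured Streaming. The function name, the arrival numbers, and the cap value are all made up for illustration; this is not a measurement of our actual workload.

```python
def drain(arrivals, cap=None):
    """Simulate micro-batch sizes for records arriving per trigger interval.

    arrivals: records landing in the stream during each interval (hypothetical).
    cap: max records a batch may take; None means "take everything pending".
    """
    if cap is not None and cap <= 0:
        raise ValueError("cap must be a positive number of records")
    batches, backlog, i = [], 0, 0
    # Keep triggering until all intervals are seen and the backlog is empty.
    while i < len(arrivals) or backlog > 0:
        if i < len(arrivals):
            backlog += arrivals[i]
        take = backlog if cap is None else min(backlog, cap)
        batches.append(take)
        backlog -= take
        i += 1
    return batches


# Assumed numbers: a steady trickle plus one burst of 2 million deletes.
arrivals = [1_000, 1_000, 2_000_000, 1_000]
uncapped = drain(arrivals)             # one huge batch when the burst lands
capped = drain(arrivals, cap=500_000)  # burst spread over several batches
print(uncapped)
print(capped)
```

Without a cap, one batch has to absorb the whole burst; with a cap, the same records are processed over several smaller batches at the cost of extra latency while the backlog drains.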

The Question:

a) Is there any special care that should be taken when writing this kind of Spark app?
b) Would it be better to go with a Spark batch job that runs every hour or so? In that case we could use the Event Hub archival process to write data to storage every 5 minutes and then consume from HDFS/storage every hour.
c) Other than a CDC tool, is there any Spark package that can actually listen to the Oracle change stream? In other words, can we use Spark as the CDC tool itself?

Best Regards,
Ayan Guha