Using Spark Structured Streaming as an ETL tool

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Using Spark Structured Streaming as an ETL tool

Mich Talebzadeh

Apologies for being long winded.

Came to my mind that after struggling to get a reliable connection from Apache Kafka to Google BigQuery, (BQ) that one can build such a pipeline using Spark Structured Streaming (SSS) with foreachBatch provider.

The primer for this is the ability to feed the BQ tables through an efficient and reliable API. There is a group in Google called dataproc team who are responsible for maintaining the API through Spark API for BigQuery. This works fine including with Spark 3.1.1 release.

Trying to use Kafka-BigQuery Connector opens additional complexities that a third party may not want. Examples, dependency on another vendor to provide it as a service at a cost with Schema Registry etc. If you want to develop that connector yourself in-house, you may end up developing your own schema registry or sending the schema and payload in a Json type message. This onus rests on you.

With SSS and foreachBatch one can easily achieve this through perhaps 
trigger(processingTime, xx) which will allow the rate on which you want to feed your BQ tables. Of course foreachBatch itself can be used for data analysis but essentially, that would be another work stream.

I feel that using SSS for batch feed is probably a very good fit. I have seen database vendors developing their own stream manipulation for Kafka (I don't mean vendors supporting Kafka) a sort of streaming work, but I suppose Spark can do all that.

In the attached diagram, I want to replace Box 10 with another box similar to 3 using SSS. Let me know your thoughts




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


To unsubscribe e-mail: [hidden email]

LambdaArchitecture_MD_BigQuery.pdf (700K) Download Attachment