This is an idea I have been meaning to turn into a Spark package.
Over time I have had to develop various Python functions to set Spark connection parameters and to read from and write to various sources and sinks. These let us reuse the existing Spark utilities in Python quickly without worrying about the details. Example contents are as follows (a sketch of a couple of these helpers follows the list):
Create or get Spark session local
Create or replace Spark session for a distributed environment
Create Spark context
Create Hive context
Load Spark configuration parameters for Structured Streaming, including settings for back pressure such as kafka.maxRatePerPartition, backpressure.pid.minRate, etc.
Load Spark configuration parameters for Hive
Load Spark configuration parameters for Google BigQuery
Load Spark configuration parameters for Redis
Load data from Google BigQuery into DF
Write data from DF to Google BigQuery
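To give a flavour of a couple of these helpers, here is a minimal sketch of the local session creation and the streaming configuration loader. The function names and signatures are illustrative only, not the actual package API; the configuration keys simply pass through whatever the user supplies.

from pyspark import SparkConf
from pyspark.sql import SparkSession

def create_or_get_spark_session_local(app_name="spark_utils_local"):
    # Create a local Spark session, or reuse one if it already exists
    return (SparkSession.builder
            .master("local[*]")
            .appName(app_name)
            .getOrCreate())

def load_streaming_conf(conf_params):
    # Build a SparkConf from a user-supplied dictionary of parameters,
    # e.g. keys such as spark.streaming.kafka.maxRatePerPartition and
    # spark.streaming.backpressure.pid.minRate
    conf = SparkConf()
    for key, value in conf_params.items():
        conf.set(key, str(value))
    return conf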
All parameter settings are user-driven and can be read from a YAML file into a Python dictionary, so it is pretty flexible.
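A minimal sketch of that flow, assuming a hypothetical spark_config.yml whose sparkConf section maps Spark parameter names to values (the file name and layout are illustrative, not the actual package's):

import yaml
from pyspark.sql import SparkSession

with open("spark_config.yml") as f:       # hypothetical file name
    params = yaml.safe_load(f)            # becomes a plain Python dictionary

# Apply every parameter from the YAML-backed dictionary to the session builder
builder = SparkSession.builder.appName(params.get("appName", "spark_utils"))
for key, value in params.get("sparkConf", {}).items():
    builder = builder.config(key, str(value))
spark = builder.getOrCreate()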
For example, to read from JDBC the code is as below.
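A minimal sketch of such a JDBC read, assuming the connection details (url, driver, dbtable, user, password) come from the same YAML-backed dictionary as above:

# Read a table over JDBC into a DataFrame using Spark's built-in jdbc source
df = (spark.read
      .format("jdbc")
      .option("url", params["url"])          # e.g. jdbc:postgresql://host:5432/mydb
      .option("driver", params["driver"])    # e.g. org.postgresql.Driver
      .option("dbtable", params["dbtable"])
      .option("user", params["user"])
      .option("password", params["password"])
      .load())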