Question about Spark, PySpark DataFrames, and JDBC connections to T-SQL databases
I am currently working on my first Spark app (v2.4.3, running standalone
on a single-node cluster configured with 6 workers). I am coding this app in
PySpark. The app is used to create a daily delta data set from a T-SQL database
(Sybase ASE). This data set returns new, updated, and deleted records.
The process is: load a reference Parquet file into a data frame, and load the
table from T-SQL into a second data frame; both are hashed, and three separate
data frames are then created to hold the three separate actions. These three
DFs are then unioned together, repartitioned into 1 partition, and written out
as Parquet.
I have noticed a bit of weirdness around the connections to the T-SQL
server/database/table. Occasionally, it will open anything up to 6 separate
connections to query the table (all running the same query - Select
<all tables> from....). At other times, it will make only 1 or 2 connections
(though if I restrict it back to 1 worker, it is consistently only 1
connection, as I would have expected).
How does Spark determine how many connections to make? Is it dependent only on
the number of workers (as I originally thought)? Or is there something else
determining how many times the data is queried?
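For reference, the JDBC read is wired up roughly like this. The URL, driver class, table name, and credentials are placeholders, not my real values; `numPartitions` is shown because, per the Spark JDBC data source docs, it is the option that caps how many partitions (and therefore parallel connections) a JDBC read uses:

```python
# Placeholder connection details for a Sybase ASE source via jConnect.
jdbc_options = {
    "url": "jdbc:sybase:Tds:myhost:5000/mydb",    # placeholder URL
    "dbtable": "my_table",                        # placeholder table name
    "driver": "com.sybase.jdbc4.jdbc.SybDriver",  # assumed jConnect driver class
    "user": "user",                               # placeholder credentials
    "password": "secret",
    # numPartitions is the maximum number of partitions Spark will use for
    # this read, which also caps the number of concurrent JDBC connections.
    "numPartitions": "1",
}

# The actual read would then be:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```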
Any insight would be appreciated. Ideally, I'd like to restrict the number
of database calls as much as I can.