[Spark SQL] Failure Scenarios involving JDBC and SQL databases
I’m writing a data source that shares similarities with Spark’s own JDBC implementation, and I’d like to ask a question about how Spark handles failure scenarios involving JDBC and SQL databases. To my understanding, if an executor dies
while it’s running a task, Spark will revive the executor and try to re-run that task. However, how does this play out in the context of data integrity and Spark’s JDBC data source API (e.g. df.write.format(“jdbc”).option(…).save())?
In the savePartition function of
JdbcUtils.scala, we see Spark calling the commit and rollback functions of the Java connection object generated from the database url/credentials provided by the user (screenshot below). Can someone provide some guidance on what exactly happens under certain
failure scenarios? For example, if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database? What
happens if the executor dies in the middle of calling commit() or rollback()?