[Spark SQL] Failure Scenarios involving JDBC and SQL databases

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[Spark SQL] Failure Scenarios involving JDBC and SQL databases

Ramon Tuason

Hi all,


I’m writing a data source that shares similarities with Spark’s own JDBC implementation, and I’d like to ask a question about how Spark handles failure scenarios involving JDBC and SQL databases. To my understanding, if an executor dies while it’s running a task, Spark will revive the executor and try to re-run that task. However, how does this play out in the context of data integrity and Spark’s JDBC data source API (e.g. df.write.format(“jdbc”).option(…).save())?


In the savePartition function of JdbcUtils.scala, we see Spark calling the commit and rollback functions of the Java connection object generated from the database url/credentials provided by the user (screenshot below). Can someone provide some guidance on what exactly happens under certain failure scenarios? For example, if an executor dies right after commit() finishes or before rollback() is called, does Spark try to re-run the task and write the same data partition again, essentially creating duplicate committed rows in the database? What happens if the executor dies in the middle of calling commit() or rollback()?


Thanks for your help!