Spark JDBC Connection for ETL

ravi1512
Requirement - Export data stored as parquet files on S3 to a SQL database. We will also have incremental data; in that case the job should simply append the new data to the target table in the database. The input data should be validated before it is pushed to the SQL database.

Solution - Basically, we can choose between two tools for this: Sqoop export and Spark's JDBC API.

My question and concern - Is a Spark JDBC connection worth it compared to a Sqoop export? There are a few scenarios where Spark has an advantage over Sqoop: when the file is growing every minute, Sqoop has to extract only the incremental changes, which takes time in MapReduce but should be smoother with Spark, I guess (lazy evaluation, filtering, querying the file directly). Plus, Spark JDBC is pretty straightforward. Will there be any performance difference between the two? What about validating the data before loading it in the case of Sqoop? Also, in the case of Spark JDBC, how can I open multiple JDBC connections to make the process run faster?
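For reference, here is a minimal sketch of the Spark JDBC approach I have in mind; the S3 path, table name, connection URL, and credentials below are placeholders, and the validation is just an illustrative null check:

import org.apache.spark.sql.{SaveMode, SparkSession}

object ParquetToJdbc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetToJdbc").getOrCreate()

    // Read the (incremental) parquet files from S3 -- placeholder bucket/path.
    val df = spark.read.parquet("s3a://my-bucket/incremental/")

    // Illustrative validation before pushing: drop rows missing required keys.
    val validated = df.filter("id IS NOT NULL AND event_ts IS NOT NULL")

    // Append to the target table over JDBC -- placeholder connection details.
    validated.write
      .format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "target_table")
      .option("user", "etl_user")
      .option("password", "etl_password")
      .mode(SaveMode.Append)
      .save()

    spark.stop()
  }
}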

Thanks,
Ravi Mishra

Re: Spark JDBC Connection for ETL

Van Halen
Hi,

We use Spark JDBC and Sqoop extensively. The performance of Spark JDBC depends on the performance of the JDBC driver and on whether it supports "batching". In my experience, the MySQL, Oracle, and SQL Server JDBC drivers support batching, and I've gotten decent performance with Spark JDBC on those platforms. The Azure SQL Data Warehouse JDBC driver, however, does not support batching - Spark JDBC performance there is somewhere around 1,000 rows per second, which is pathetically slow - so you will have to use an HDFS put, PolyBase, BCP, or the bulk API to copy data into Azure SQL DW.
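When the driver does support batching, you can get more throughput by writing from several partitions in parallel and tuning the batch size. Something roughly like the following; the partition count, batch size, and connection details are example values to tune for your own database:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch of a parallel, batched JDBC write; the values here are examples only.
def writeBatched(validated: DataFrame): Unit = {
  validated
    .repartition(8)                                   // roughly one JDBC connection per partition
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder connection details
    .option("dbtable", "target_table")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("batchsize", "10000")                     // rows per JDBC batch (driver must support batching)
    .option("numPartitions", "8")                     // cap on concurrent JDBC connections
    .mode(SaveMode.Append)
    .save()
}

Each write task opens its own connection, so the effective parallelism (and the load on the target database) is the number of partitions.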