I am trying to figure out whether Spark is a good tool for my use case.
I’m trying to ETL a subset of a customers/orders database from Oracle to JSON, roughly 3-5% of the overall customers table.
I tried to use the Spark JDBC data source, but it ends up fetching the entire customers and orders tables through a single executor. I’ve read about the partitionColumn, lowerBound and upperBound options. Could they be used to distribute the load across a set of executors while also filtering out, at the source, customers that are not part of my subset?
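For concreteness, here is roughly what I have in mind. The connection details are hypothetical, and the `segment = 'EXPORT'` predicate is a made-up stand-in for however my subset is actually defined:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oracle-export")
  .getOrCreate()

// Read from a subquery instead of the whole table so the subset
// filter is evaluated inside Oracle, and split the read across
// executors by ranges of CUSTOMER_ID.
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  // hypothetical
  .option("dbtable", "(SELECT * FROM customers WHERE segment = 'EXPORT') c")
  .option("user", "scott")
  .option("password", "tiger")
  .option("partitionColumn", "CUSTOMER_ID")  // must be numeric, date or timestamp
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()
```

As far as I understand, passing a subquery as `dbtable` pushes the filter down to Oracle, and the three bound options split the read into `numPartitions` parallel range queries on `CUSTOMER_ID`. Is that the right way to combine the two?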
Or would it be better to parallelize the subset of customer IDs to export and use a map operation that queries the Oracle database to turn each customer ID into a JSON object containing the customer and order details?
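Something like this sketch is what I mean, again with made-up connection details, column names, and naive hand-built JSON:

```scala
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-export").getOrCreate()

// Hypothetical: the 3-5% subset of customer IDs, however it is obtained.
val customerIds: Seq[Long] = 1L to 1000L

val json = spark.sparkContext
  .parallelize(customerIds, numSlices = 16)
  .mapPartitions { ids =>
    // One connection per partition rather than one per customer ID.
    val conn = DriverManager.getConnection(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL", "scott", "tiger")
    val stmt = conn.prepareStatement(
      "SELECT order_id FROM orders WHERE customer_id = ?")
    // Materialize with toList before closing the connection,
    // since Iterator.map is lazy.
    val out = ids.map { id =>
      stmt.setLong(1, id)
      val rs = stmt.executeQuery()
      val orders = Iterator.continually(rs)
        .takeWhile(_.next())
        .map(r => s"""{"orderId":${r.getLong("order_id")}}""")
        .mkString("[", ",", "]")
      rs.close()
      s"""{"customerId":$id,"orders":$orders}"""
    }.toList
    conn.close()
    out.iterator
  }

json.saveAsTextFile("/tmp/customer-export")
```

I’ve used mapPartitions here so each partition opens a single connection, rather than one per ID, but my worry is still the sheer number of single-row round trips to Oracle.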
Or is Spark simply not suitable for this kind of process?
Just asking for guidance so I don’t lose too much time going in the wrong direction. Thanks for your help!