Parallel read parquet file, write to postgresql


Parallel read parquet file, write to postgresql

James Starks
I am reading the Spark docs (https://spark.apache.org/docs/latest/sql-data-sources-parquet.html), but they don't mention how to read a parquet file in parallel with SparkSession. Would --num-executors just work? Do any additional parameters need to be added to SparkSession as well?

Also, if I want to write data to the database in parallel, would the options 'numPartitions' and 'batchsize' be enough to improve write performance? For example,

                 mydf.format("jdbc").
                     option("driver", "org.postgresql.Driver").
                     option("url", url).
                     option("dbtable", table_name).
                     option("user", username).
                     option("password", password).
                     option("numPartitions", N) .
                     option("batchsize", M)
                     save

From the Spark website (https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases), these are the only two parameters I can find that would have an impact on db write performance.

I appreciate any suggestions.

Re: Parallel read parquet file, write to postgresql

Shahab Yunus
Hi James.

--num-executors is used to control the number of executors available to your application; each executor then runs tasks in parallel. For reading and writing data in parallel, data partitioning is employed. It is worth reading a quick intro to how data partitioning works.
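
For the read side, here is a minimal sketch (assuming a SparkSession named spark and a hypothetical input path). spark.read.parquet is already a parallel read: Spark creates one task per file split, so the parallelism comes from the partitioning of the input rather than from --num-executors alone:

    // Parquet reads are split into one task per file chunk, so this read
    // is already parallel; --num-executors only controls how many executors
    // are available to run those tasks.
    val mydf = spark.read.parquet("/path/to/input.parquet")

    // Inspect how many partitions (i.e. parallel read tasks) Spark chose.
    println(mydf.rdd.getNumPartitions)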

You are right that numPartitions is the parameter that can be used to control that, though in general Spark itself decides, given the data in each stage, how to partition (i.e. how much to parallelize the read and write of data).
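
To tie the two together, here is a minimal sketch of a parallel JDBC write (reusing the url, table_name, username and password variables from your snippet; N and M are illustrative values). Per the JDBC data source docs, the write parallelism is the number of partitions of the DataFrame, with numPartitions as an upper bound on concurrent connections (Spark coalesces the DataFrame down to it if it has more partitions), and batchsize controlling rows per insert round trip:

    // Repartitioning first is one way to get N concurrent write tasks,
    // i.e. N simultaneous connections to PostgreSQL.
    mydf.repartition(N)
      .write
      .format("jdbc")
      .option("driver", "org.postgresql.Driver")
      .option("url", url)
      .option("dbtable", table_name)
      .option("user", username)
      .option("password", password)
      .option("numPartitions", N.toString) // cap on concurrent JDBC connections
      .option("batchsize", M.toString)     // rows per JDBC batch insert (default 1000)
      .save()

Beyond these two options, write throughput to PostgreSQL is usually bounded by the database side (indexes, WAL, connection limits), so N should not exceed what the server can comfortably handle.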


