[Spark SQL] Direct write on Hive and S3 while executing a CTAS in Spark SQL

francexo83
Hi all, 
I'm using Spark 2.4.0; my spark.sql.catalogImplementation is set to hive, and spark.sql.warehouse.dir points to a specific S3 bucket.

I want to execute a CTAS statement in Spark SQL like the one below:

CREATE TABLE db_name.table_name AS (SELECT ...)
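
For reference, a minimal sketch of the setup described above (session configuration plus the CTAS); the bucket, database, table, and source_table names are illustrative:

import org.apache.spark.sql.SparkSession

// enableHiveSupport() sets spark.sql.catalogImplementation=hive;
// the warehouse location points at the S3 bucket (illustrative name).
val spark = SparkSession.builder()
  .appName("ctas-example")
  .config("spark.sql.warehouse.dir", "s3a://my-bucket/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// CTAS: the write first lands in a Hive staging directory,
// then the files are moved to the table's final location.
spark.sql("CREATE TABLE db_name.table_name AS SELECT * FROM source_table")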

When writing, Spark always uses the Hive staging directory on S3 as a scratch dir. Once the executors finish their computation, Spark moves the files from the staging dir to the final location.
This causes performance degradation in the write phase because of the nature of object storage, where rename is not a native operation and is implemented as a copy followed by a delete.
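
To illustrate what I'm seeing, a small sketch that lists the table location and picks out the scratch directories; ".hive-staging" is the default prefix (the hive.exec.stagingdir property), and the path is illustrative:

import org.apache.hadoop.fs.Path

// List the table location while the CTAS runs and look for the
// ".hive-staging*" scratch directories that Spark later moves.
val tablePath = new Path("s3a://my-bucket/warehouse/db_name.db/table_name")
val fs = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(tablePath)
  .filter(_.getPath.getName.startsWith(".hive-staging"))
  .foreach(status => println(status.getPath))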

Is it possible to enable a direct write to the S3 bucket when executing a CTAS in the scenario described above?

As a workaround, I performed the write with the DataFrameWriter.saveAsTable API and obtained the desired result.
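
For completeness, a sketch of that workaround; I assume the difference is that saveAsTable creates a datasource table that goes through Spark's own commit protocol rather than Hive's staging-directory move (names are illustrative):

// Workaround sketch: write through the DataFrameWriter API instead of
// a SQL CTAS. "source_table" is an illustrative input table.
val df = spark.table("source_table")
df.write
  .format("parquet")       // creates a datasource table, not a Hive-SerDe table
  .mode("errorifexists")   // fail if db_name.table_name already exists
  .saveAsTable("db_name.table_name")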

Thank you in advance