Intermittently getting "Can not create the managed table error" while creating table from spark 2.4


abhijeet bedagkar

We are intermittently hitting the error below in Spark 2.4 when saving a managed table from Spark.

Error -
pyspark.sql.utils.AnalysisException: u"Can not create the managed table('`hive_issue`.`table`'). The associated location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') already exists.;"

Steps to reproduce--
1. Create a DataFrame in Spark from mid-size data (a 30 MB CSV file)
2. Save the DataFrame as a table
3. Terminate the session while the save operation is in progress

Session termination is just a way to reproduce the issue. In practice we hit it intermittently when running the same Spark jobs multiple times. We use both EMRFS and HDFS on an EMR cluster, and we see the same issue on both.
The only fix we have found is deleting the target folder where the table keeps its files. That is not an option for us: we need to keep historical information in the table, which is why we write to it in APPEND mode.

Sample code--
from pyspark.sql import SparkSession
sc = SparkSession.builder.enableHiveSupport().getOrCreate()
df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
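# Terminate the session using Ctrl+C once the df.write action below has started.
# (The write call is reconstructed from the description: APPEND mode, table name taken from the error above.)
df.write.mode("append").saveAsTable("hive_issue.table")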

We went through the Spark 2.4 documentation [1] and found that Spark no longer allows creating managed tables on non-empty folders.
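
For reference, the Spark 2.4 migration guide documents the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation as restoring the earlier behaviour, and notes that it is removed in Spark 3.0. Below is a minimal sketch of setting it on the session builder. The helper name allow_nonempty_managed_location is hypothetical and only for illustration. Whether the flag resolves the intermittent failure described here is not confirmed, and files left behind by a killed write may still sit in the table location.

```python
# Legacy flag named in the Spark 2.4 migration guide (removed in Spark 3.0).
LEGACY_FLAG = "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation"


def allow_nonempty_managed_location(builder):
    """Hypothetical helper: apply the legacy flag to a SparkSession builder.

    Works with any object that exposes a chainable .config(key, value),
    such as SparkSession.builder.
    """
    return builder.config(LEGACY_FLAG, "true")


# Usage with PySpark (assumes pyspark is installed and Hive support is available):
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.enableHiveSupport()
#   spark = allow_nonempty_managed_location(builder).getOrCreate()
```

The same key can also be set on a running session with spark.conf.set(LEGACY_FLAG, "true").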

1. Is there a reason behind this change in Spark's behaviour?
2. To us it looks like a breaking change: despite specifying the "overwrite" option, Spark is unable to wipe out the existing data and create the table.
3. Is there any solution for this issue?