Cannot create parquet with snappy output for hive external table

Dhimant
Hi Group,

I am not able to get snappy-compressed output when writing data into a partitioned external Hive table.

Trace:

1. create external table test(id int, name string) stored as parquet location 'hdfs://testcluster/user/abc/test' tblproperties ('PARQUET.COMPRESS'='SNAPPY');

2. Spark code:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()
    spark.sql("use default").show()
    // sc is not defined outside the shell; use the session's SparkContext
    val rdd = spark.sparkContext.parallelize(Seq((1, "one"), (2, "two")))
    val df = spark.createDataFrame(rdd).toDF("id", "name")
    df.write.mode(SaveMode.Overwrite).insertInto("test")

3. I can see a few *.snappy.parquet files in the table location, as expected.

4. create external table test(id int) partitioned by (name string) stored as parquet location 'hdfs://testcluster/user/abc/test' tblproperties ('PARQUET.COMPRESS'='SNAPPY');

5. Spark code (same as above, now writing to the partitioned table):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()
    spark.sql("use default").show()
    // sc is not defined outside the shell; use the session's SparkContext
    val rdd = spark.sparkContext.parallelize(Seq((1, "one"), (2, "two")))
    val df = spark.createDataFrame(rdd).toDF("id", "name")
    df.write.mode(SaveMode.Overwrite).insertInto("test")

6. I see uncompressed files without the .snappy.parquet extension. parquet-tools.jar also confirms that these are uncompressed Parquet files (see the footer-inspection sketch after this list).

7. I tried the following option as well, but no luck:

df.write.mode(SaveMode.Overwrite).format("parquet").option("compression", "snappy").insertInto("test")
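For reference, here is a minimal sketch of how the codec can be checked programmatically instead of with parquet-tools, using the Parquet Hadoop API. The part-file path below is hypothetical (substitute one of the files Spark actually wrote), and readFooter is deprecated in newer Parquet releases:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Hypothetical part file; point this at a file under the table location.
    val file = new Path("hdfs://testcluster/user/abc/test/name=one/part-00000.parquet")

    // Read only the footer and collect the codec of every column chunk.
    val footer = ParquetFileReader.readFooter(new Configuration(), file)
    val codecs = footer.getBlocks.asScala
      .flatMap(_.getColumns.asScala)
      .map(_.getCodec.toString)
      .distinct
    println(s"codecs: ${codecs.mkString(", ")}") // SNAPPY vs UNCOMPRESSED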


Thanks in advance.

Re: Cannot create parquet with snappy output for hive external table

abhimadav
Hi,

This goes back to the documentation of saveAsTable and insertInto. Assuming you are on Spark 2.0+:

1) insertInto - Because it inserts data into an existing table, the format and options will be ignored.
2) saveAsTable - In the case the table already exists, the behavior of this function depends on the save mode specified by the mode function (the default is to throw an exception).

You can refer to the DataFrameWriter API documentation for the full details.
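As a side note: since insertInto ignores per-write options, the codec for an insert generally has to come from configuration rather than from option(). A sketch of that alternative, untested here and assuming Spark uses its native Parquet writer for the table:

    // spark.sql.parquet.compression.codec applies when Spark writes the
    // table with its built-in Parquet datasource; set it before the insert.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
    df.write.mode(SaveMode.Overwrite).insertInto("test")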

For this case (partitioned target), could you try this?

df.write.partitionBy("name").mode("overwrite").format("parquet").option("compression","snappy").saveAsTable("test")

I tested it at my end.
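One caveat worth flagging (my addition, not from the docs quoted above): with mode("overwrite"), saveAsTable may drop and recreate the table definition, so it is worth confirming afterwards that the external location and partitions are what you expect, for example:

    // Quick sanity checks after the write; table name as in the example above.
    spark.sql("show partitions test").show()
    spark.sql("describe formatted test").show(100, truncate = false)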