Cannot read case-sensitive Glue table backed by Parquet

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Cannot read case-sensitive Glue table backed by Parquet

oripwk
This post was updated on .
Spark version: 2.4.2 on Amazon EMR 5.24.0

I have a Glue Catalog table backed by S3 Parquet directory. The Parquet
files have case-sensitive column names (like /lastModified/). It doesn't
matter what I do, I get lowercase column names (/lastmodified/) when reading
the Glue Catalog table with Spark:

        for {
          i <- Seq(false, true)
          j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
          k <- Seq(false, true)
        } {
          val spark = SparkSession.builder()
            .config("spark.sql.hive.convertMetastoreParquet", i)
            .config("spark.sql.hive.caseSensitiveInferenceMode", j)
            .config("spark.sql.parquet.mergeSchema", k)
            .enableHiveSupport()
            .getOrCreate()
       
          import spark.sql
       
          val df = sql("""SELECT * FROM ecs_db.test_small""")
          df.columns.foreach(println)
        }



[1]
https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

Reply | Threaded
Open this post in threaded view
|

Re: Cannot read case-sensitive Glue table backed by Parquet

oripwk


This bug happens because the Glue table's SERDEPROPERTIES is missing two
important properties:

    spark.sql.sources.schema.numParts
    spark.sql.sources.schema.part.0

To solve the problem, I had to add those two properties via the Glue console
(couldn't do it with ALTER TABLE …)

I guess this is a bug with Glue crawlers, which do not set these properties
when creating the table.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Cannot read case-sensitive Glue table backed by Parquet

oripwk
Sorry, but my original solution is incorrect

1. Glue Crawlers are not supposed to set the spark.sql.sources.schema.*
properties, but Spark SQL should. The default in Spark 2.4 for
spark.sql.hive.caseSensitiveInferenceMode is INFER_AND_SAVE which means that
Spark infers the schema from the underlying files and alters the tables to
add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our
case, Spark failed to do so, because of a I"llegalArgumentException: Can not
create a Path from an empty string" exception which is caused because the
Hive database class instance has an empty locationUri property string. This
is caused because the Glue database does not have a Location property enter
image description here. After the schema is saved, Spark reads it from the
table.
2. There could be a way around this, by setting INFER_ONLY, which should
only infer the schema from the files and not attempt to alter the table
SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where
the inferred schema is then lowercased [1].

[1]
https://github.com/apache/spark/blob/c1b6fe479649c482947dfce6b6db67b159bd78a3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L284




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]