Parquet Filter pushdown not working and statistics are not generating for any column with Spark 1.6 CDH 5.7


Rabin Banerjee
Hi all,


I am using CDH 5.7, which ships with Spark 1.6.0. I save my data set as Parquet and then query it. The query runs fine, but when I inspected the files Spark generated, I found that the min/max statistics are missing for every column. As a result, filters are not pushed down and Spark scans the entire file.


(1 to 30000).map(i => (i, i.toString)).toDF("a", "b").sort("a").write.parquet("/hdfs/path/to/store")



parquet-tools meta part-r-00186-03addad8-c19d-4812-b83b-a8708606183b.gz.parquet

creator:     parquet-mr version 1.5.0-cdh5.7.1 (build ${buildNumber}) 

extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}}]} 


file schema: spark_schema 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

a:           OPTIONAL INT32 R:0 D:1

b:           OPTIONAL BINARY O:UTF8 R:0 D:1


row group 1: RC:148 TS:2012 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

a:            INT32 GZIP DO:0 FPO:4 SZ:297/635/2.14 VC:148 ENC:BIT_PACKED,PLAIN,RLE

b:            BINARY GZIP DO:0 FPO:301 SZ:301/1377/4.57 VC:148 ENC:BIT_PACKED,PLAIN,RLE


As you can see from the Parquet metadata, the statistics (min/max) field is missing from every column chunk, and Spark scans all the data in every file.

Any suggestions?
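
To illustrate why the missing statistics matter, here is a minimal sketch of min/max row-group skipping in plain Scala. It is not the parquet-mr or Spark implementation: `ColumnStats`, `RowGroup`, `canSkip`, and `groupsToScan` are hypothetical names, and the value ranges are made up for the example. It only models the idea that a reader can skip a row group for a predicate such as `a = 200` when it knows the column's min and max, and must read the group when it does not.

```scala
// Hypothetical model of per-row-group statistics for one INT32 column.
// These are illustrative types, not parquet-mr classes.
final case class ColumnStats(min: Int, max: Int)
final case class RowGroup(id: Int, stats: Option[ColumnStats])

object RowGroupSkipping {
  // For the predicate `a == value`, a row group can be skipped only when
  // statistics are present and the value lies outside [min, max].
  def canSkip(group: RowGroup, value: Int): Boolean =
    group.stats.exists(s => value < s.min || value > s.max)

  // Ids of the row groups that still have to be read.
  def groupsToScan(groups: Seq[RowGroup], value: Int): Seq[Int] =
    groups.filterNot(canSkip(_, value)).map(_.id)

  def main(args: Array[String]): Unit = {
    // Made-up ranges resembling sorted data split into row groups.
    val withStats = Seq(
      RowGroup(1, Some(ColumnStats(1, 148))),
      RowGroup(2, Some(ColumnStats(149, 296))),
      RowGroup(3, Some(ColumnStats(297, 444)))
    )
    // Same row groups with the statistics absent, as in the files above.
    val withoutStats = withStats.map(_.copy(stats = None))

    println(groupsToScan(withStats, 200))    // List(2)
    println(groupsToScan(withoutStats, 200)) // List(1, 2, 3)
  }
}
```

With statistics present, only the row group whose range contains the value is read. With statistics absent, nothing can be ruled out, which matches the full scan I am seeing.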


Thanks //

RB