CBO not predicting cardinality on partition columns for Parquet tables

rajat mishra
Hi All,

I have an external table in Spark whose underlying data files are in Parquet format.
The table is partitioned. When I compute statistics and then look at a query whose WHERE clause filters on the partition column, the statistics returned contain only sizeInBytes and not the row count.

val ddl = """create external table test_p (Address String, Age String, CustomerID string, CustomerName string, CustomerSuffix string, Location string, Mobile String, Occupation String, Salary String) PARTITIONED BY (Country string) Stored as PARQUET LOCATION '/dev/test3'"""
spark.sql(ddl)
spark.sql("msck repair table test_p")

spark.sql("Analyze table test_p compute statistics for columns Address,Age,CustomerID,CustomerName,CustomerSuffix,Location,Mobile,Occupation,Salary,Country").show()
spark.sql("Analyze table test_p partition(Country) compute statistics").show()
println(spark.sql("select * from test_p where country='Korea'").queryExecution.toStringWithStats)

The output I get is:
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...
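In case it matters, here is how the recorded stats and the CBO flag can be inspected (a minimal check, assuming Spark 2.2+; note spark.sql.cbo.enabled defaults to false and has to be on for a row count to feed the cardinality estimates):

// CBO must be enabled for rowCount to feed the cardinality estimates;
// spark.sql.cbo.enabled defaults to false.
spark.conf.set("spark.sql.cbo.enabled", "true")

// Table-level statistics (including rowCount, if recorded) appear in
// the "Statistics" row of DESCRIBE EXTENDED output.
spark.sql("describe extended test_p").show(100, false)

// Partition-level statistics can be inspected per partition.
spark.sql("describe extended test_p partition (Country='Korea')").show(100, false)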

The same query works fine if the table's underlying data file format is TextFile (see the sketch below).
Am I missing a step above, or is this a known issue in Spark?
Any help would be appreciated.
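
For comparison, the TextFile variant I am referring to is the same flow with only the storage clause changed (a sketch; the table name test_t and location '/dev/test_text' are placeholders):

// Same columns and flow as above; only the storage format differs.
// With this table the optimized plan shows rowCount in Statistics(...).
val ddlText = """create external table test_t (Address String, Age String, CustomerID string, CustomerName string, CustomerSuffix string, Location string, Mobile String, Occupation String, Salary String) PARTITIONED BY (Country string) Stored as TEXTFILE LOCATION '/dev/test_text'"""
spark.sql(ddlText)
spark.sql("msck repair table test_t")
spark.sql("Analyze table test_t partition(Country) compute statistics").show()
println(spark.sql("select * from test_t where country='Korea'").queryExecution.toStringWithStats)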

Thanks,
Rajat