I have an external table in Spark whose underlying data files are in Parquet format.
The table is partitioned. When I compute statistics and then inspect a query that has the partition column in the WHERE clause, the statistics returned contain only sizeInBytes and not the row count.
val ddl = """create external table test_p (Address String, Age String, CustomerID string, CustomerName string, CustomerSuffix string, Location string, Mobile String, Occupation String, Salary String) PARTITIONED BY (Country string) STORED AS PARQUET LOCATION '/dev/test3'"""
spark.sql(ddl)
spark.sql("msck repair table test_p")
spark.sql("Analyze table test_p compute statistics for columns Address,Age,CustomerID,CustomerName,CustomerSuffix,Location,Mobile,Occupation,Salary,Country").show()
spark.sql("Analyze table test_p partition(Country) compute statistics").show()
println(spark.sql("select * from test_p where country='Korea'").queryExecution.toStringWithStats)
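One thing worth ruling out: for rowCount to show up in the optimized plan's statistics, the cost-based optimizer has to be enabled (it is off by default), and table-level stats must exist in addition to the column and partition stats. A minimal check, assuming the test_p table created above, might look like:

```scala
// Sketch, assuming an active spark session and the test_p table above.
// CBO is disabled by default; without it, plans often carry only sizeInBytes.
spark.sql("SET spark.sql.cbo.enabled=true")

// The plain ANALYZE form (no FOR COLUMNS, no PARTITION clause) writes
// table-level stats, including rowCount, to the metastore.
spark.sql("ANALYZE TABLE test_p COMPUTE STATISTICS")

// Verify what actually reached the metastore: look for the "Statistics" row.
spark.sql("DESCRIBE EXTENDED test_p")
  .where("col_name = 'Statistics'")
  .show(false)

// Re-check the plan statistics with the partition filter in place.
println(spark.sql("select * from test_p where country='Korea'")
  .queryExecution.toStringWithStats)
```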
rajat mishra wrote
> When I try to computed the statistics for a query where partition column
> is in where clause, the statistics returned contains only the sizeInBytes
> and not the no of rows count.
We are hitting the same issue. Our data is in partitioned Parquet
files and we were hoping to try out the CBO, but we haven't been able to get
it working: any query with a WHERE clause on the partition column(s), which
covers the majority of realistic queries, seems to lose or ignore the
rowCount statistics. We've generated both overall table stats (ANALYZE TABLE
db.table COMPUTE STATISTICS;) and partition-level stats (ANALYZE TABLE
db.table PARTITION (col1, col2) COMPUTE STATISTICS;), and have verified that
both are present in the metastore.
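For comparison, this is how we check a single partition's stats directly from the shell; the partition values below are placeholders, not our real data:

```scala
// Sketch: confirm per-partition stats landed in the metastore for one
// concrete partition. Partition values 'v1'/'v2' are placeholders.
spark.sql("""
  DESCRIBE EXTENDED db.table PARTITION (col1 = 'v1', col2 = 'v2')
""").show(100, false)
// In the detailed partition info, look for the statistics row; if it lists
// a row count there but the plan still shows only sizeInBytes, the stats
// exist but are not being picked up during filter estimation.
```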