CBO not working for Parquet Files

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

CBO not working for Parquet Files

rajat mishra
Hi All,

I have an external table in spark whose underlying data files are in parquet format.
The table is partitioned. When I try to computed the statistics for a query where partition column is in where clause, the statistics returned contains only the sizeInBytes and not the no of rows count. 

  val ddl = """create external table test_p (Address String, Age String, CustomerID string, CustomerName string, CustomerSuffix string, Location string, Mobile String, Occupation String, Salary String ) PARTITIONED BY (Country string) Stored as PARQUET LOCATION '/dev/test3'"""
spark.sql(ddl)
spark.sql("msck repair table test_p")

spark.sql("Analyze table test_p compute statistics for columns Address,Age,CustomerID,CustomerName,CustomerSuffix,Location,Mobile,Occupation,Salary,Country").show()
spark.sql("Analyze table test_p partition(Country) compute statistics").show()
println(spark.sql("select * from test_p where country='Korea'").queryExecution.toStringWithStats)

The output I get is :
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('country = Korea)
   +- 'UnresolvedRelation `test_p`

== Analyzed Logical Plan ==
Address: string, Age: string, CustomerID: string, CustomerName: string, CustomerSuffix: string, Location: string, Mobile: string, Occupation: string, Salary: string, Country: string
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9]
+- Filter (country#9 = Korea)
   +- SubqueryAlias test_p
      +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet

== Optimized Logical Plan ==
Project [Address#0, Age#1, CustomerID#2, CustomerName#3, CustomerSuffix#4, Location#5, Mobile#6, Occupation#7, Salary#8, Country#9], Statistics(sizeInBytes=2.2 KB, hints=none)
+- Filter (isnotnull(country#9) && (country#9 = Korea)), Statistics(sizeInBytes=2.2 KB, hints=none)
   +- Relation[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] parquet, Statistics(sizeInBytes=2.2 KB, hints=none)

== Physical Plan ==
*FileScan parquet default.test_p[Address#0,Age#1,CustomerID#2,CustomerName#3,CustomerSuffix#4,Location#5,Mobile#6,Occupation#7,Salary#8,Country#9] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/C:/dev/tests2/Country=Korea], PartitionCount: 1, PartitionFilters: [isnotnull(Country#9), (Country#9 = Korea)], PushedFilters: [], ReadSchema: struct<Address:string,Age:string,CustomerID:string,CustomerName:string,CustomerSuffix:string,Loca...

The same is working fine if I have an table whose underlying data file format is TextFile.
Am I missing any step above or is it a known thing in spark. 
Any help would be appreciated. 

Thanks,
Rajat
Reply | Threaded
Open this post in threaded view
|

Re: CBO not working for Parquet Files

emlyn
rajat mishra wrote
> When I try to computed the statistics for a query where partition column
> is in where clause, the statistics returned contains only the sizeInBytes
> and not the no of rows count.

We are also having the same issue. We have our data in partitioned parquet
files and were hoping to try out cbo but haven’t been able to get it
working: any query with a where clause on the partition column(s) (which is
the majority of realistic queries) seems to lose/ignore the rowCount stats.
We’ve generated both overall table stats (ANALYZE TABLE db.table PARTITION
COMPUTE STATISTICS;) and partitioned stats (ANALYZE TABLE db.table PARTITION
(col1, col2) COMPUTE STATISTICS;), and have verified that they are present
in the metastore.
 
I’ve also found this ticket:
https://issues.apache.org/jira/browse/SPARK-25185, but there it has no
response so far.
 
I suspect we must be missing something, as it seems that partitioned parquet
files would be a common use case, and if this is a bug in Spark I would have
expected it to have been picked up sooner.
 
Has anybody managed to get cbo working with partitioned parquet files? Is
this a known issue?
 
Thanks,
Emlyn



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]