[SPARK-SQL] Does Spark 3.0 support parquet predicate pushdown for array of structs?


Haijia Zhou-2
Hi,
I know Spark 3.0 added Parquet predicate pushdown for nested structures (SPARK-17636).
Does it also support predicate pushdown for an array of structs?
 
For example, say I have a Spark table 'individuals' (stored in Parquet format) with the following schema:
 
root
 |-- individual_id: string (nullable = true)
 |-- devices: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- carrier_name: string (nullable = true)
 |    |    |-- model: string (nullable = true)
 |    |    |-- vendor: string (nullable = true)
 |    |    |-- year_released: integer (nullable = true)
 |    |    |-- primary_hardware_type: string (nullable = true)
 |    |    |-- browser_name: string (nullable = true)
 |    |    |-- browser_version: string (nullable = true)
 |    |    |-- manufacturer: string (nullable = true)
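
For reference, here is a minimal sketch of how a table with this layout can be built and registered locally in a spark-shell session (the path, case class names and sample values are placeholders, not my real data):

import org.apache.spark.sql.SparkSession

// Hypothetical case classes mirroring the schema above.
case class Device(
  `type`: String, carrier_name: String, model: String, vendor: String,
  year_released: Integer, primary_hardware_type: String,
  browser_name: String, browser_version: String, manufacturer: String)
case class Individual(individual_id: String, devices: Seq[Device])

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a tiny sample as Parquet and expose it as the 'individuals' view.
Seq(
  Individual("i1", Seq(Device("smartphone", "c1", "m1", "v1", 2012, "phone", "b1", "1.0", "m1"))),
  Individual("i2", Seq(Device("smartphone", "c2", "m2", "v2", 2008, "phone", "b2", "2.0", "m2")))
).toDF().write.mode("overwrite").parquet("/tmp/individuals")

spark.read.parquet("/tmp/individuals").createOrReplaceTempView("individuals")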


I can then run the following query to count the individuals who have at least one device released after 2010:

select count(*) as total_count from individuals  where exists(devices, dev -> dev.year_released > 2010)
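
The same predicate can also be expressed with the exists higher-order function from org.apache.spark.sql.functions (added in Spark 3.0); a rough DataFrame-API sketch, assuming the 'individuals' view registered above:

import org.apache.spark.sql.functions.{col, exists}

// Same filter via the DataFrame API; explain() shows the FileScan's PushedFilters.
val matching = spark.table("individuals")
  .where(exists(col("devices"), dev => dev("year_released") > 2010))

matching.explain()
println(matching.count())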

The query runs fine on Spark 3.0, but it has to read all the columns of the nested 'devices' structure, and nothing is pushed down (PushedFilters is empty), as shown below.
res14: org.apache.spark.sql.execution.SparkPlan =
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[], functions=[finalmerge_count(merge count#59L) AS count(1)#55L], output=[total_count#54L])
   +- Exchange SinglePartition, true, [id=#35]
      +- HashAggregate(keys=[], functions=[partial_count(1) AS count#59L], output=[count#59L])
         +- Project
            +- Filter exists(devices#48, lambdafunction((lambda dev#56.year_released > 2018), lambda dev#56, false))
               +- FileScan parquet [ids#48] Batched: true, DataFilters: [exists(devices#48, lambdafunction((lambda dev#56.year_released > 2018), lambda dev#56, false))], Format: Parquet, Location: InMemoryFileIndex[s3://..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<devices:array<struct<id_last_seen:date,type:string,value:string,carrier_name:string,model:string,vendor:string,in...
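
In case it matters, I also looked at the two configs I know of that govern Parquet pushdown and nested-column pruning; a sketch of how I check them (whether they apply to array-of-struct predicates is exactly my question):

// Both keys are standard Spark SQL configs; reading them rules out a disabled
// feature as the cause of the empty PushedFilters above.
println(spark.conf.get("spark.sql.parquet.filterPushdown"))                // "true" by default
println(spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled")) // "true" by default in 3.0

// Re-run with both explicitly enabled and inspect ReadSchema / PushedFilters again.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql(
  """select count(*) as total_count
    |from individuals
    |where exists(devices, dev -> dev.year_released > 2010)""".stripMargin).explain()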


Any thoughts?

Thanks

Haijia