Does Spark support column scan pruning for array of structs?


Haijia Zhou
I have a data frame in following schema:
household
root
 |-- country_code: string (nullable = true)
 |-- region_code: string (nullable = true)
 |-- individuals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- individual_id: string (nullable = true)
 |    |    |-- ids: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id_last_seen: date (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- year_released: integer (nullable = true)

I can use the following code to find households that contain at least one device released after the year 2018:

val sql = """
select household_id
from household
where exists(individuals, id -> exists(id.ids, dev -> dev.year_released > 2018))
"""
val v = spark.sql(sql)

It works well. However, I found that the query planner is not able to prune the unneeded nested columns; instead, Spark has to read all columns of the nested structures.

I tested this with Spark 2.4.5 and 3.0.0 and got the same result.
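For what it's worth, one way to confirm what the scan actually reads is to look at the `ReadSchema` entry in the physical plan, and to check the nested-schema-pruning flag (this is a sketch of how I'd diagnose it, not something verified in this thread; `spark.sql.optimizer.nestedSchemaPruning.enabled` is off by default in 2.4 and on by default in 3.0, and it may still not prune through higher-order functions like `exists`):

```scala
// Assumes an existing SparkSession `spark` and the `sql` string from above.
// Enable nested schema pruning explicitly (relevant mainly on 2.4.x).
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val v = spark.sql(sql)
// Inspect the FileScan node: if pruning worked, ReadSchema lists only the
// nested fields the query touches (e.g. individuals.ids.year_released)
// rather than the full struct.
v.explain()
```

If `ReadSchema` still shows the complete `individuals` struct after enabling the flag, that would suggest pruning does not reach into the array-of-structs through the lambda.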

I'm just wondering whether Spark supports, or plans to add support for, column scan pruning for an array of structs.