[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

Patrick Duin
Hi Spark users,

I've got an issue where I wrote a filter on a Hive table using DataFrames, and despite setting spark.sql.hive.metastorePartitionPruning=true, no partitions are being pruned.
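For reference, this is roughly how the session is set up (a minimal sketch; the app, database and table names are placeholders, and part_col stands in for our real partition column):

  import org.apache.spark.sql.SparkSession

  // Minimal sketch of our setup (all names are placeholders).
  val spark = SparkSession.builder()
    .appName("partition-pruning-repro")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .enableHiveSupport()
    .getOrCreate()

  val table = spark.table("my_db.my_partitioned_table")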

In short:

Doing this: table.filter("partition=x or partition=y") results in Spark fetching the metadata of all partitions from the Hive metastore and doing the filtering itself after the fetch.
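To reproduce (a sketch; part_col and the values are placeholders for our real partition column and literals):

  // The OR across two partition values is not passed to the metastore:
  // Spark lists every partition of the table and filters afterwards.
  val slow = table.filter("part_col = 'x' OR part_col = 'y'")
  slow.explain(true) // inspect the physical plan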

On the other hand, if my filter is "simple":
table.filter("partition=x")
Spark makes a call to the metastore that passes the filter along and fetches only the partitions it needs.
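For comparison (same placeholder names):

  // A single equality predicate is pushed down to the metastore's
  // partition-filter call, so only matching partitions are listed.
  val fast = table.filter("part_col = 'x'")
  fast.explain(true)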

In our case the table has a lot of partitions, so the calls that fetch all of them take minutes and also cause us memory issues. Is this a bug, or is there a better way of writing the filter?
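The only workaround I can think of is to split the OR into separate simple filters and union the results, which seems clumsy for a long list of values (untested sketch, assuming each simple predicate really is pushed down; use unionAll instead of union on Spark 1.x):

  // Hypothetical workaround: one metastore-friendly filter per value,
  // unioned afterwards.
  val values = Seq("x", "y")
  val result = values
    .map(v => table.filter(s"part_col = '$v'"))
    .reduce(_ union _)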

Thanks,
 Patrick

PS:
Sorry for cross-posting; I wasn't sure if the user list was the correct place to ask, and I understood I should go via Stack Overflow first, so my question is also there in more detail: