[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly
Hi Spark users,
I've got an issue where I wrote a filter on a Hive table using DataFrames, and despite setting spark.sql.hive.metastorePartitionPruning=true, no partitions are being pruned.
Doing this: table.filter("partition=x or partition=y") results in Spark fetching the metadata for all partitions from the Hive metastore and applying the filter client-side, after the fetch.
On the other hand, if my filter is "simple" (a single equality predicate such as "partition=x"), Spark passes the filter along in the metastore call and fetches only the partitions it needs.
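To make the two cases concrete, here is a sketch of both filter shapes. The table name ("db.events") and partition column ("part") are hypothetical, just for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: assumes a Hive-backed table "db.events"
// partitioned by a string column "part" (both names are hypothetical).
val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  .enableHiveSupport()
  .getOrCreate()

val table = spark.table("db.events")

// Compound OR predicate: in our experience this fetches ALL partition
// metadata from the metastore and filters it client-side afterwards.
val slow = table.filter("part = 'x' or part = 'y'")

// Single equality predicate: the filter is passed along in the
// metastore call, so only the matching partitions are fetched.
val fast = table.filter("part = 'x'")
```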
In our case the table has a lot of partitions, so the calls that fetch all of them take minutes and also cause us memory problems. Is this a bug, or is there a better way to write the filter call?
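One workaround I've considered (again a sketch, with the same hypothetical table and column names) is rewriting the OR as a union of simple equality filters, so that each metastore call carries a single predicate that can be pushed down:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical workaround sketch: "db.events" and "part" are
// illustrative names, not from a real deployment.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val table = spark.table("db.events")

// Issue one simple equality filter per partition value, then union
// the results; each filter alone should prune at the metastore.
val values = Seq("x", "y")
val pruned = values
  .map(v => table.filter(s"part = '$v'"))
  .reduce(_ union _)
```

I haven't verified that this avoids the full metadata fetch in all cases, so I'd be interested to hear if anyone has tried something similar.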
Apologies for crossposting: I wasn't sure whether the user list was the right place to ask, and I understood I should go via Stack Overflow first, so my question is also there in more detail: