Hi folks, I have created a table in the following manner:
I am hoping someone can offer debugging tips / advice on what to look for in the logs. I'm on a pretty old version of Spark (1.5.2) but this seems like something that I'm doing wrong.
CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition (
.... list of columns
COMMENT 'User Infomation'
PARTITIONED BY (account_id String,
STORED AS PARQUET
I then ran MSCK REPAIR TABLE to generate the partition information.
"show table extended like 'rum_beacon_partition' partition(account_id='XXXX',product='rum',group_id='XXXX',year='2017',month='09',day='12')
However, it does appear that when I issue a SQL query, the predicates do not correctly limit the files touched:
explain extended select uri from rum_beacon_partition where account_id='XXXX' and product='rum' and group_id='XXXX' and year='2017' and month='09' limit 2
Produces output that seems to indicate that every file is being touched (unless I'm misreading the output). It also crashes my filesystem so I suspect there is some truth to it.
Optimized logical plan looks fine I think:
== Optimized Logical Plan == |
| Limit 2 |
| Project [uri#16519] |
| Filter (((((account_id#16511 = XXXX) && (product#16512 = rum)) && (group_id#16513 = XXXX)) && (year#16514 = 2017)) && (month#16515 = 09)) |
But in the physical plan it seems that a ton of files are touched (both in account and date partitions)
|Free forum by Nabble||Edit this page|