Fwd: Delta with intelligent upsett

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: Delta with intelligent upsett

ayan guha

Hi

we have a scenario where we have a large table  ie 5-6B records. The table is repository of data from past N years. It is possible that some updates take place on the data and thus er are using Delta table. 

As part of the business process we know updates can happen only within M years of past records where M is much smaller than N. Eg the table can hold 20 yrs of data but we know updates can happen only for last year not before that. 

Is there some way to indicate this additional intelligence to Delta so it can look into only last years data while running a merge or update? It seems to be an obvious performance booster. 

Any thoughts?
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
Reply | Threaded
Open this post in threaded view
|

Re: Delta with intelligent upsett

Gourav Sengupta
should not a where clause on the partition field help with that? I am obviously missing something in the question.

Regards,
Gourav

On Thu, Oct 31, 2019 at 9:15 PM ayan guha <[hidden email]> wrote:

Hi

we have a scenario where we have a large table  ie 5-6B records. The table is repository of data from past N years. It is possible that some updates take place on the data and thus er are using Delta table. 

As part of the business process we know updates can happen only within M years of past records where M is much smaller than N. Eg the table can hold 20 yrs of data but we know updates can happen only for last year not before that. 

Is there some way to indicate this additional intelligence to Delta so it can look into only last years data while running a merge or update? It seems to be an obvious performance booster. 

Any thoughts?
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha
Reply | Threaded
Open this post in threaded view
|

Re: Delta with intelligent upsett

Roland Johann
If the dataset contains a column like changed_at/created_at you can use this as watermark and filter out rows that have changed_at/created_at before the watermark. 

Best Regards

Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobil: +49 172 365 26 46
Mail: [hidden email]
Web: phenetic.io

Handelsregister: Amtsgericht Köln (HRB 92595)
Geschäftsführer: Roland Johann, Uwe Reimann



Am 01.11.2019 um 07:52 schrieb Gourav Sengupta <[hidden email]>:

should not a where clause on the partition field help with that? I am obviously missing something in the question.

Regards,
Gourav

On Thu, Oct 31, 2019 at 9:15 PM ayan guha <[hidden email]> wrote:

Hi

we have a scenario where we have a large table  ie 5-6B records. The table is repository of data from past N years. It is possible that some updates take place on the data and thus er are using Delta table. 

As part of the business process we know updates can happen only within M years of past records where M is much smaller than N. Eg the table can hold 20 yrs of data but we know updates can happen only for last year not before that. 

Is there some way to indicate this additional intelligence to Delta so it can look into only last years data while running a merge or update? It seems to be an obvious performance booster. 

Any thoughts?
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha

Reply | Threaded
Open this post in threaded view
|

Re: Delta with intelligent upsett

Burak Yavuz-2
In reply to this post by ayan guha
You can just add the target partitioning filter to your MERGE or UPDATE condition, e.g.

MERGE INTO target
USING source
ON target.key = source.key AND target.year = year(current_date())
...

Best,
Burak

On Thu, Oct 31, 2019, 10:15 PM ayan guha <[hidden email]> wrote:

Hi

we have a scenario where we have a large table  ie 5-6B records. The table is repository of data from past N years. It is possible that some updates take place on the data and thus er are using Delta table. 

As part of the business process we know updates can happen only within M years of past records where M is much smaller than N. Eg the table can hold 20 yrs of data but we know updates can happen only for last year not before that. 

Is there some way to indicate this additional intelligence to Delta so it can look into only last years data while running a merge or update? It seems to be an obvious performance booster. 

Any thoughts?
--
Best Regards,
Ayan Guha
--
Best Regards,
Ayan Guha