
[EPIC] Incremental Model Improvements #10624

Open · 13 of 22 tasks
QMalcolm opened this issue Aug 28, 2024 · 3 comments

QMalcolm (Contributor) commented Aug 28, 2024

Incremental models in dbt are a materialization strategy designed to efficiently update your data warehouse tables by only transforming and loading new or changed data since the last run. Instead of processing your entire dataset every time, incremental models append or update only the new rows, significantly reducing the time and resources required for your data transformations.

Even with all the benefits of incremental models as they exist today, there are limitations to this approach, such as:

  • the burden is on YOU to calculate what's "new" - what has already been loaded, what still needs to be loaded, etc.
  • runs can be slow if you have many partitions to process (like when running in full-refresh mode) because it's done in "one big" SQL statement - it can time out, and if it fails you end up needing to retry partitions that had already succeeded
  • if you want to name a specific partition for your incremental model to process, you have to add additional "hacky" logic, likely using vars
  • data tests run on your entire model, rather than just the "new" data

In this project we're aiming to make incremental models easier to implement and more efficient to run.
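The "calculate what's new" burden described above typically shows up as a hand-written `is_incremental()` filter. A minimal sketch (the model and column names here are made up for illustration; `is_incremental()`, `config()`, and `{{ this }}` are standard dbt):

```sql
-- models/events_summary.sql (hypothetical model)
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    created_at
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- YOU define what "new" means, usually by comparing against
  -- the high-water mark already present in the target table.
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```

Note that this one predicate carries all the correctness burden: if a row arrives late with an older `created_at`, this filter silently skips it.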

P0s

(17 tracked sub-issues; titles were not captured in this export. Assignees are MichelleArk and QMalcolm, with several tagged "enhancement".)

P1s

P2s

MaartenN1234 commented

Just so I understand: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as their source (or, if a model has other sources as well, we assume them to be stale)?

I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?

QMalcolm (Contributor, Author) commented

> Just so I understand: is it right that this issue seeks to address technical (performance/load) issues in models that take just a single ref as their source (or, if a model has other sources as well, we assume them to be stale)?
>
> I am looking for ways to support incremental processing of multi-table join models (e.g. https://discourse.getdbt.com/t/template-for-complex-incremental-models/10054, but I've seen many more similar help requests on community forums). To be sure, such features will not be in scope, right?

@MaartenN1234 I'm not sure that I fully understand the question being asked. For my clarity, is the question whether this new functionality will support more than one input to an incremental model? If so, the answer is yes!

For example, say we turn the jaffle-shop customers model into an incremental microbatch model. It'd look like the following:

{{ config(materialized='incremental', incremental_strategy='microbatch', unique_key='id', event_time='created_at', batch_size='day') }}

with

customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('orders') }}
),

customer_orders_summary as (
    select
        orders.customer_id,
        count(distinct orders.order_id) as count_lifetime_orders,
        count(distinct orders.order_id) > 1 as is_repeat_buyer,
        min(orders.ordered_at) as first_ordered_at,
        max(orders.ordered_at) as last_ordered_at,
        sum(orders.subtotal) as lifetime_spend_pretax,
        sum(orders.tax_paid) as lifetime_tax_paid,
        sum(orders.order_total) as lifetime_spend
    from orders
    group by 1
),

joined as (
    select
        customers.*,
        customer_orders_summary.count_lifetime_orders,
        customer_orders_summary.first_ordered_at,
        customer_orders_summary.last_ordered_at,
        customer_orders_summary.lifetime_spend_pretax,
        customer_orders_summary.lifetime_tax_paid,
        customer_orders_summary.lifetime_spend,
        case
            when customer_orders_summary.is_repeat_buyer then 'returning'
            else 'new'
        end as customer_type
    from customers

    left join customer_orders_summary
        on customers.customer_id = customer_orders_summary.customer_id
)

select * from joined

If the models orders and stg_customers both have an event_time defined (they don't need to be incremental themselves), then they will automatically be filtered and batched by the generated event time filters.
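To make that concrete: conceptually (the exact SQL dbt generates may differ by adapter, and the dates below are invented for illustration), for one day-sized batch each `event_time`-aware input is wrapped in a time-bounded subquery before the model's own SQL runs against it:

```sql
-- Illustrative only: the kind of filter the microbatch strategy
-- injects around each input for a single example batch.
select * from (
    select * from analytics.stg_customers  -- hypothetical schema
    where created_at >= '2024-08-28'       -- batch start
      and created_at <  '2024-08-29'       -- batch end
) as stg_customers
```

Each batch is then run as its own statement, which is what allows retrying only the failed batches instead of re-running "one big" query.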

MaartenN1234 commented Sep 17, 2024

> [quotes QMalcolm's reply above in full]

The critical requirement for me is that matching rows (on the join condition) in the two sources are not necessarily created in the same batch. So when the filter is applied to each source independently:

select * from {{ ref('stg_customers') }} where event_time > last_processed_event_time

and

select * from {{ ref('orders') }} where event_time > last_processed_event_time

things will go wrong (e.g. if we load one more order, we lose all previous orders from the aggregate; or, when the customer data is updated while no new orders for that customer are being processed, the update will not be propagated).

To get it right, it should become something like this:

select * from {{ ref('stg_customers') }}
where event_time > last_processed_event_time
   or customer_id in (select customer_id from {{ ref('orders') }}
                      where event_time > last_processed_event_time)

and

select * from {{ ref('orders') }}
where customer_id in (select customer_id from {{ ref('stg_customers') }}
                      where event_time > last_processed_event_time
                      union all
                      select customer_id from {{ ref('orders') }}
                      where event_time > last_processed_event_time)

So the join clause and the aggregation need to be incorporated into the change detection.
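The failure mode described above can be shown with a tiny Python simulation of an incrementally maintained per-customer order count (all names and data here are made up; this is not dbt's implementation, just the arithmetic of the problem):

```python
from collections import defaultdict

# Two orders for the same customer, arriving in different batches.
orders = [
    {"order_id": 1, "customer_id": "a", "event_time": 1},
    {"order_id": 2, "customer_id": "a", "event_time": 2},  # arrives in batch 2
]

def naive_rebuild(orders, last_processed):
    """Filter the source by event_time only, then aggregate per customer."""
    new_orders = [o for o in orders if o["event_time"] > last_processed]
    counts = defaultdict(int)
    for o in new_orders:
        counts[o["customer_id"]] += 1
    return dict(counts)

def join_aware_rebuild(orders, last_processed):
    """Re-read *all* orders for any customer that has at least one new row."""
    changed = {o["customer_id"] for o in orders if o["event_time"] > last_processed}
    counts = defaultdict(int)
    for o in orders:
        if o["customer_id"] in changed:
            counts[o["customer_id"]] += 1
    return dict(counts)

# Batch 2 (last_processed_event_time = 1):
print(naive_rebuild(orders, 1))       # {'a': 1} - order 1 is lost from the aggregate
print(join_aware_rebuild(orders, 1))  # {'a': 2} - correct lifetime count
```

The naive rebuild overwrites customer "a"'s lifetime count with a count of only the new rows, which is exactly the "we would lose all previous orders from the aggregate" problem; widening the filter to every customer touched by a new row restores correctness at the cost of re-reading those customers' history.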
