Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Create a flexible delete+insert incremental strategy without relying on primary/unique keys #10655

Open
3 tasks done
dlord opened this issue Sep 3, 2024 · 0 comments · May be fixed by dbt-labs/dbt-adapters#303
Open
3 tasks done
Labels
enhancement New feature or request incremental Incremental modeling with dbt performance triage

Comments

@dlord
Copy link

dlord commented Sep 3, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

The delete+insert incremental strategy creates a temporary table with the new dataset, then it executes a DELETE statement using the temporary table as the filter condition. Below is an example of the generated statement:

delete
from
  "source_db"."dbt"."source_table" using "source_table__dbt_tmp173111887980" 
where
  (
    "source_table__dbt_tmp173111887980".event_time = "source_db"."dbt"."source_table".event_time 
  );

The current approach uses the unique_key specified in the model, and it probably works fine for most relational databases. For data warehouses like Redshift, this approach is very slow due to lack of primary and unique keys. As a result, the delete operation becomes very slow due to sequential scanning.

In order to work around this, I created a custom incremental strategy for a customer using the approach described here. The goal was to be able to provide a flexible way to specify custom predicates that works well with the underlying data and storage. The custom incremental strategy I wrote uses the existing incremental_predicates to build the delete statement. The custom incremental strategy ignores the unique_key to keep the implementation simple, and easy to switch to other strategies.

CC: @goodjira and @schmiman

Describe alternatives you've considered

Currently, if the merge strategy is set to delete+insert but no unique_key was specified, it is as if you used an append strategy. This behavior can be seen in the macro default__get_delete_insert_merge_sql. I am not sure if this was a deliberate decision.

As such, an alternative approach to creating a separate incremental strategy is to modify the existing default__get_delete_insert_merge_sql to allow users to not specify a unique_key, and only rely on the incremental_predicates.

Who will this benefit?

Users who are working with data warehouses like Redshift. Date columns that are used as sort keys can significantly benefit from this.

Here is an example: consider a model in Redshift that was created with the following config:

{{ config(materialized="incremental",
            incremental_strategy='delete_insert_no_unique_key',
            incremental_predicates = [
                "DBT_INTERNAL_DEST.event_time >= (select min(event_time) from DBT_INTERNAL_SOURCE)"
            ],
) }}

The resulting delete SQL will be this:

delete
from
  "source_db"."dbt"."source_table"
where
  (
    "source_db"."dbt"."source_table".event_time >= (select min(event_time) from "source_table__dbt_tmp173111887980")
  );

The performance difference in Redshift compared to a delete join is huge. I had one model where deleting 10M records took 30 minutes using the join approach. With the custom delete+insert strategy, this was reduced to 10 seconds due to the optimizations we can do in incremental_predicates

Are you interested in contributing this feature?

Yes

Anything else?

No response

@dlord dlord added enhancement New feature or request triage labels Sep 3, 2024
@dbeatty10 dbeatty10 added the incremental Incremental modeling with dbt label Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request incremental Incremental modeling with dbt performance triage
Projects
None yet
3 participants