DBT Incremental Processing in Practice
What Is Incremental Processing
Tables in a data warehouse are typically updated in one of two ways:
- Full refresh: Recompute the entire table from scratch each time. Simple and reliable for small datasets, but costly for large ones — a table with 100 million order rows requires scanning all data on every refresh.
- Incremental processing: Process only "new or changed" data and merge the results into the existing table. The first run is a full load; subsequent runs process only the incremental data, which is much faster.
dbt's incremental materialization is the implementation of incremental processing. Its core logic is:
How does is_incremental() determine "first" vs "subsequent"?
dbt checks whether the target table already exists before running: if it does not exist, it is the first run; if it exists, it is a subsequent run. This means:
- Manually dropping the table and running again triggers a full rebuild
dbt run --full-refreshforces a full rebuild, ignoring the existing table
Prerequisites for incremental processing
Incremental processing requires that source data has a way to identify "what is new or changed." Common identification methods:
- Timestamps: Fields like
updated_at,created_atthat are automatically updated on each change - Auto-increment primary key: New rows always have a larger ID than existing rows
- Both combined: Both a timestamp and a primary key, complementing each other
If source data has delete operations but no updated_at field, the incremental model cannot detect deletions. In this case, use a full refresh, or switch to Table Stream to capture DELETE events.
The merge strategy is controlled by incremental_strategy. Singdata Lakehouse supports 4 strategies.
4 Incremental Strategies
merge (default)
Uses a MERGE INTO statement, matching on unique_key: matched rows are updated, unmatched rows are inserted. Suitable for scenarios with a primary key that need to handle data updates.
{{ this }} refers to the table corresponding to the current model. is_incremental() returns false on the first run and true on subsequent runs.
Verified data (from jaffle-shop-clickzetta): 61,948 order rows; the merge strategy takes approximately 20 seconds on the first run.
append
Only performs INSERT INTO, without deduplication. Suitable for log-type data — each record is naturally unique, no updates needed, only appending.
This is the actual implementation of the customer_cdc_stream model in snowflake-dbt2lakehouse-dbt — simulating a CDC data stream where each incremental run samples 10% of the data and appends it, without deduplication.
delete+insert
Deletes matching rows from the target table by unique_key first, then inserts the full result set. Suitable for scenarios that need to replace a batch of data by key — for example, backfilling historical partitions, correcting existing data, or when source data has no reliable updated_at field but does have a primary key.
This is the actual implementation of fct_orders_delete_insert in dbt-clickzetta examples.
insert_overwrite
INSERT OVERWRITE in dynamic partition mode. Suitable for data organized by partition, where only the target partitions are recomputed each time.
On incremental runs, only the partitions for the last 3 days are recomputed; historical partitions are untouched. This is the actual implementation of daily_revenue in dbt-clickzetta examples.
Strategy Selection
| Strategy | Use Cases | Notes |
|---|---|---|
merge | Has a primary key, needs to handle updates | Requires unique_key; source must not have multiple rows for the same key |
append | Logs, event streams — append only, no updates | No deduplication; repeated runs produce duplicate data |
delete+insert | Large data volume where merge performance is poor; replace by partition | Requires unique_key; delete then insert, with a brief data gap in between |
insert_overwrite | Data organized by partition, recompute a few partitions each time | Requires partition_by; the entire partition is replaced |
Incremental Filter Patterns
The key to incremental processing is: how to query only "new or changed" data. Three common patterns:
Filter by timestamp (most common):
Filter by primary key watermark (suitable for auto-increment IDs):
Both combined (from actual code in snowflake-dbt2lakehouse-dbt):
prune_days is a variable defined in dbt_project.yml (default 2), meaning reprocess data from the last N days while also processing new rows with a primary key larger than the current maximum. The two conditions are connected with OR to ensure no new data is missed.
Schema Change Handling
When a source table adds a new column, how does the incremental model handle it? This is controlled by on_schema_change:
| Value | Behavior |
|---|---|
ignore (default) | Ignore new columns, no error |
append_new_columns | Automatically add new columns to the target table; historical data has NULL for those columns |
sync_all_columns | Sync all column changes (add and remove columns) |
fail | Error immediately when there is a schema change |
append_new_columns is recommended for production — it neither errors on new columns nor silently loses new column data.
Complete Example: TPC-H Order Incremental Pipeline
The following is the actual incremental pipeline running in snowflake-dbt2lakehouse-dbt, from Bronze layer ingestion to Gold layer dimension table:
Bronze layer (stg_orders_incremental.sql):
Gold layer (dim_orders.sql):
Comparison of Three Incremental Processing Approaches
The incremental materialization described in this article is dbt's standard incremental processing approach, requiring manual dbt run to trigger. Singdata Lakehouse also supports two other incremental processing approaches for different scenarios:
| Approach | Trigger Mechanism | Use Cases |
|---|---|---|
incremental materialization (this article) | Manual dbt run | Need precise control over run timing; integration with scheduling systems |
| Dynamic Table | System auto-refreshes on refresh_interval | Need continuous automatic refresh; don't want to manage scheduling |
| Table Stream + incremental | Consumes CDC change stream; each dbt run processes new changes | Track row-level changes (INSERT/UPDATE/DELETE) in source tables |
The three approaches can be combined: use Table Stream to capture source table changes → use incremental to consume changes → use Dynamic Table to automatically aggregate results.
See DBT Real-Time Data Pipeline in Practice for details.
