DBT Snowflake Migration in Practice: TPC-H Data Warehouse Pipeline
If you have built a data warehouse pipeline with dbt on Snowflake, the core migration effort to Singdata Lakehouse is concentrated in 6 platform-specific features. Standard SQL models require zero changes.
This article demonstrates the complete migration process using a real migration project: migrating sfc-gh-dflippo/snowflake-dbt-demo (a dbt demo project maintained by a Snowflake official engineer) to Singdata Lakehouse. The data source uses the TPC-H standard dataset shared by both platforms. All models have been verified through actual execution — 24/24 passing.
Full code on GitHub: clickzetta/snowflake-dbt2lakehouse-dbt
Original Project
sfc-gh-dflippo/snowflake-dbt-demo is a dbt feature demonstration project maintained by a Snowflake official engineer. The business scenario is TPC-H customer order analysis: starting from raw order data, processing through Bronze (cleaning) → Silver (aggregation, segmentation) → Gold (dimension tables, metrics) layers, ultimately producing customer value segmentation, order fact tables, calendar dimensions, and other analytical models.
The project covers most of dbt's advanced features on Snowflake: Dynamic Tables, Streams (CDC), Sequences, Python models, Incremental strategies, etc. — making it an ideal subject for validating migration completeness.
The migrated code is in the 03_lakehouse/ directory. The original Snowflake code is preserved in 01_snowflake/ for comparison. Migration notes are in 02_migration/MIGRATION_NOTES.md.
Conclusion First
Your dbt project can be migrated; business logic does not need to be rewritten. This migration made 6 changes, all of which are platform configuration and function name replacements — not a single change touched the data processing logic itself. If your project only uses standard SQL (no Dynamic Tables, Streams, or Sequences), changing profiles.yml is all you need.
| Change | Effort | Description |
|---|---|---|
profiles.yml connection config | Very low | Field-by-field replacement, done in 5 minutes |
| Dynamic Table parameter names | Very low | target_lag → refresh_interval, snowflake_warehouse → refresh_vc |
| CDC Stream column names | Low | METADATA$ACTION → `__change_type`; use SELECT * EXCEPT(...) when consuming |
| Surrogate key | Low | SEQUENCE .nextval → IDENTITY column or row_number() over (...), note semantic differences (see Step 4) |
| Multi-column hash | Low | hash() → hash_combine(crc32(col), ...); requires understanding the Singdata Lakehouse function system |
| Calendar table row generation | Low | table(generator(...)) → explode(sequence(...)), a one-line replacement |
Snowflake-specific configs (transient, merge_exclude_columns, copy_grants) can be deleted directly — Singdata Lakehouse does not support them and does not need them.
Technology Stack Comparison
| Original Project (Snowflake) | After Migration (Lakehouse) | |
|---|---|---|
| dbt adapter | dbt-snowflake | dbt-clickzetta >= 1.7.10 (Python models require dbt-clickzetta[python] >= 1.7.10) |
| Compute resource config | snowflake_warehouse: target.warehouse | refresh_vc: default |
| Dynamic Table refresh schedule | target_lag: '1 hour' | refresh_interval: '1 HOUR' |
| Config change strategy | on_configuration_change: apply | Not supported; use ALTER DYNAMIC TABLE |
| Stream change type column | METADATA$ACTION | `__change_type` |
| Stream consumption mode | SELECT * | SELECT * EXCEPT(__change_type, __commit_timestamp, __commit_version) |
| Surrogate key | SEQUENCE .nextval | IDENTITY column (at table creation) or row_number() over (...) |
| Multi-column hash | hash(col1, col2, ...) | hash_combine(crc32(col1), crc32(col2), ...) |
| Row generation | table(generator(rowcount => N)) | explode(sequence(0, N-1)) |
| Sampling | sample (10) | TABLESAMPLE SYSTEM(10) |
| Data source | SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 | clickzetta_sample_data.tpch_100g |
| FX rate data | Cybersyn Marketplace (Snowflake-exclusive) | Mock seed CSV (2020–2024, 5 currencies) |
Standard SQL operations — SELECT, JOIN, GROUP BY, window functions, CTE, QUALIFY — have identical syntax and require no changes.
Prerequisites
Requires Python 3.10+ (3.12 recommended) and dbt-clickzetta >= 1.7.10.
Copy the connection config template and fill in your connection information:
Key field differences in profiles.yml: account → service (API address), warehouse → vcluster (compute cluster name). For full field descriptions, see the dbt ClickZetta Adapter Usage Guide.
Verify the connection:
Quick validation mode: TPC-H SF100 data is large (30 million customers, 150 million orders). A full first run takes about 10 minutes. To quickly validate migration correctness, use the sample_limit variable to limit the staging layer data volume:
Or use dbt directly:
Migration Steps
Step 1: Data Source Replacement
The original project depends on two Snowflake-exclusive data sources:
TPC-H data: SNOWFLAKE_SAMPLE_DATA.TPCH_SF1 → clickzetta_sample_data.tpch_100g
Singdata Lakehouse has a built-in shared dataset — no import needed. Just modify the schema in _sources.yml:
Change to (requires dbt-clickzetta >= 1.7.8):
Cybersyn FX rate data: Exclusive to Snowflake Marketplace; Singdata Lakehouse has no equivalent data source. The project already provides a mock seed replacement (2020–2024, USD base, 5 currencies, 9,135 rows):
Step 2: Dynamic Table Parameters
Only two parameter names need to be replaced:
Change to:
on_configuration_change='apply' can be deleted directly — Singdata Lakehouse does not support it. To modify Dynamic Table configuration, use ALTER DYNAMIC TABLE manually.
The original project also has a target_lag='DOWNSTREAM' (downstream-triggered refresh). Singdata Lakehouse does not have this concept; change it to an explicit refresh_interval:
Change to:
Step 3: CDC Stream
Snowflake Streams and Singdata Lakehouse Table Streams share the same concept, but differ in creation syntax and system column names:
| Snowflake Stream | Singdata Lakehouse Table Stream | |
|---|---|---|
| Creation syntax | CREATE STREAM ... ON TABLE t SHOW_INITIAL_ROWS = TRUE | CREATE TABLE STREAM ... ON TABLE t WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD') |
| Change type column | METADATA$ACTION (INSERT / DELETE) | `__change_type` (INSERT / UPDATE_BEFORE / UPDATE_AFTER / DELETE) |
| Is UPDATE | METADATA$ISUPDATE (boolean) | Determined by __change_type = 'UPDATE_BEFORE' |
| Row identifier | METADATA$ROW_ID | __commit_version |
| Commit time | No direct equivalent | __commit_timestamp |
Recommended pattern for consuming a Stream: Use SELECT * EXCEPT(...) to filter system columns without hardcoding business column names:
Macro to create a Table Stream (replacing Snowflake's get_stream() macro; full code in 03_lakehouse/macros/get_table_stream.sql):
Step 4: Surrogate Key
Snowflake uses SEQUENCE objects to generate auto-increment surrogate keys. Singdata Lakehouse supports IDENTITY auto-increment columns, which are semantically closer:
In Singdata Lakehouse, add identity directly to the column definition when creating the table:
However, dbt incremental models cannot directly declare identity columns in config. You need to create the table via pre_hook or use row_number() as an alternative. This project uses the row_number() approach:
Other Snowflake-specific configs that need to be deleted (Singdata Lakehouse does not support them and does not need them):
Step 5: Multi-Column Hash
Snowflake's hash() accepts multiple columns of any type. Singdata Lakehouse uses hash_combine(crc32(col), ...) as a replacement:
Change to:
hash_combine_commutative() only accepts bigint parameters. varchar columns need to be converted to integers using crc32() first before combining.
Step 6: Calendar Table Row Generation
Snowflake's table(generator(rowcount => N)) generates a sequence of N rows. Singdata Lakehouse uses explode(sequence(0, N-1)) as a replacement:
Change to:
Two date functions also need adjustment: to_char(date, 'YYYYMMDD')::number(8,0) changes to cast(date_format(date, 'yyyyMMdd') as int); last_day(date, 'YEAR') changes to date(concat(extract(year from date), '-12-31')).
Step 7: Python Model
The original project's async_bulk_operations.py deeply depends on Snowflake Stored Procedures (session.sproc.register), which Singdata Lakehouse does not support. This model needs to be skipped or redesigned.
The K-means clustering logic in customer_clustering.py has been successfully migrated — dbt-clickzetta 1.7.10+ supports Python models, executed via ZettaPark.
Installation:
Migration key points:
| Snowflake | Singdata Lakehouse |
|---|---|
import snowflake.snowpark as snowpark | Standard Python; ZettaPark session is compatible |
session.sproc.register(...) | Not supported — remove parallel stored procedure logic |
packages=['snowflake-snowpark-python', 'joblib'] | packages=['scikit-learn', 'pandas', 'numpy'] (auto-installed) |
dbt.ref(...).to_pandas() returns uppercase column names | ZettaPark returns lowercase column names; add df.columns = df.columns.str.upper() |
Usage (models/silver/run/customer_clustering.py):
Packages declared in dbt.config(packages=[...]) are automatically pip installed before the model runs, supported in both local and Studio environments.
End-to-End Verification
e2e.py runs 12 automated checks on the migration results, covering the full chain from seed to gold layer:
Actual run results:
12/12 checks passed.
Expected Run Time
The first run takes approximately 9-10 minutes (default minimum-size VCluster). Dynamic Tables require a full refresh on first creation (about 4 minutes) — this is normal behavior, not a hang. Subsequent incremental refreshes take only a few seconds.
