DBT Advanced Features in Practice


Indexes

Singdata Lakehouse supports three index types, declared in the config block of a dbt model and automatically created after the table is built:

Index TypeUse CasesExample Query
bloomfilterEquality queries (WHERE col = 'xxx')WHERE order_id = 'abc123'
invertedFull-text search (match_all, match_any functions)WHERE match_all(status, 'completed')
vectorVector similarity searchORDER BY cosine_distance(embedding, [...])

Why do you need indexes?

Singdata Lakehouse stores data as Parquet files on object storage. Without an index, WHERE order_id = 'abc123' requires scanning all rows in all files. A Bloomfilter index maintains a compact probabilistic data structure on each file. At query time, the index is checked first — if it says "this value definitely does not exist in this file," the entire file is skipped, dramatically reducing I/O.

An inverted index is designed for full-text search scenarios. It tokenizes text and builds a word-to-row mapping, so match_all(col, 'keyword') can directly locate rows containing the keyword without scanning the entire table.

A vector index is for AI scenarios, supporting approximate nearest neighbor (ANN) search on high-dimensional vectors — orders of magnitude faster than brute-force distance computation across all vectors.

Bloomfilter + Inverted Index

{{ config( materialized='table', indexes=[ {'type': 'bloomfilter', 'columns': ['order_id']}, {'type': 'bloomfilter', 'columns': ['customer_id']}, {'type': 'inverted', 'columns': ['status']} ] ) }} select order_id, customer_id, amount, status, region, dt from {{ ref('stg_orders') }}

Each index is a dictionary: type specifies the type, columns specifies the columns (currently each index supports only a single column). dbt automatically executes CREATE INDEX statements after CREATE TABLE completes — no manual operation needed.

Vector Index

{{ config( materialized='table', indexes=[ { 'type': 'vector', 'columns': ['embedding'], 'distance_function': 'cosine_distance', 'scalar_type': 'f32' } ] ) }} select id, embedding from {{ ref('vector_source') }}

A vector index requires additional parameters: distance_function (supports cosine_distance, l2_distance, dot_product, etc.) and scalar_type (f32, f16, b1).


Clone (Zero-Copy Clone)

Clone is Singdata Lakehouse's zero-copy cloning capability — cloning a table requires no data copying, only metadata copying, making it extremely fast and consuming no additional storage.

Why zero-copy?

Singdata Lakehouse uses a storage-compute separation architecture. Data is stored as Parquet files on object storage (OSS/S3/COS), and table metadata (which files belong to this table) is stored in the metadata service. A Clone operation simply creates a new "pointer" at the metadata layer pointing to the same batch of Parquet files — no data files need to be copied. The cloned table and the source table share the underlying files until one of them performs a write operation, at which point new files are created (Copy-on-Write).

This means cloning a 1 TB table takes the same time as cloning a 1 KB table — both are millisecond-level.

Basic Clone

{{ config( materialized='clone', source=target.database ~ '.' ~ target.schema ~ '.fct_orders_partitioned' ) }}

source specifies the table to clone. target.database ~ '.' ~ target.schema is Jinja concatenation of the current connection's database and schema.

Typical use: CI/CD environment isolation — quickly clone a production table in a test environment, delete it after testing, without affecting production data.

Time Travel Clone

Clone to a historical point in time, for recovering from accidental data operations or comparing historical versions:

{{ config( materialized='clone', source=target.database ~ '.' ~ target.schema ~ '.fct_orders_partitioned', at_timestamp="current_timestamp() - interval 1 hours" ) }}

A few notes (from comments in dbt-clickzetta examples):

  • The time point must be within the data_retention_days retention period (up to 90 days)
  • The source table must have existed at that time point
  • Timestamps use the server timezone (usually UTC+8)

Common time expressions:

current_timestamp() - interval 1 hours -- 1 hour ago current_timestamp() - interval 1 days -- 1 day ago '2024-01-05 15:00:00' -- fixed point in time date_sub(current_date(), 1) -- yesterday


Materialized View

A materialized view pre-computes and physically stores query results, suitable for accelerating high-frequency fixed aggregation queries. The difference from Dynamic Table: materialized views require manual refresh, while Dynamic Tables refresh automatically on a schedule.

{{ config(materialized='materialized_view') }} select region, dt, count(order_id) as order_count, sum(amount) as revenue from {{ ref('stg_orders') }} where status = 'completed' group by region, dt

Selection guidance:

  • Need automatic continuous refresh → use Dynamic Table
  • Fixed query pattern, can accept manual refresh → use materialized view
  • No physical storage needed, only logical encapsulation → use a regular view

VCluster per-model

Singdata Lakehouse supports assigning a compute cluster to individual models, enabling resource isolation between different types of models:

{{ config( materialized='dynamic_table', refresh_interval='1 HOUR', refresh_vc='default_ap' -- dynamic_table specific: specifies the cluster for refresh jobs ) }} select ... from {{ ref('stg_orders') }}

You can also configure by directory in bulk in dbt_project.yml (vcluster is the general execution cluster; refresh_vc is the Dynamic Table-specific refresh cluster):

models: my_project: marts: +vcluster: default_ap # aggregation query models use the analytics cluster staging: +vcluster: default # ETL write models use the general-purpose cluster

Cluster type selection:

ScenarioRecommended Cluster TypeReason
incremental writes, seedGeneral-purpose (DEFAULT)Supports automatic small file merging, suitable for frequent writes
Large-scale aggregation queries, Dynamic Table refreshAnalytics (DEFAULT_AP)Optimized for large-scale scans and aggregations
Data sync, CDCSync (PORTAL_SYNC_VC)Designed specifically for data integration scenarios

Grants

dbt supports automatically granting permissions after a model runs, allowing specified roles to query the table:

{{ config( materialized='table', grants={ 'select': [var('grant_role', 'workspace_analyst')] } ) }} select region, count(order_id) as order_count, sum(amount) as revenue from {{ ref('stg_orders') }} where status = 'completed' group by region

var('grant_role', 'workspace_analyst') uses a dbt variable, defaulting to granting access to the workspace_analyst role. This can be overridden at runtime:

dbt run --vars '{"grant_role": "my_role"}' --select regional_revenue_with_grants


Combined Usage Example

The following is a model that combines multiple advanced features:

{{ config( materialized='incremental', incremental_strategy='delete+insert', unique_key='order_id', vcluster='default', -- ETL writes use the general-purpose cluster indexes=[ {'type': 'bloomfilter', 'columns': ['order_id']}, {'type': 'bloomfilter', 'columns': ['customer_id']}, {'type': 'inverted', 'columns': ['status']} ], grants={ 'select': ['workspace_analyst'] } ) }} select order_id, customer_id, amount, status, region, dt, updated_at from {{ ref('stg_orders') }} {% if is_incremental() %} where updated_at > (select max(updated_at) from {{ this }}) {% endif %}

This model:

  • Uses the delete+insert strategy for incremental updates, running on the general-purpose cluster (default)
  • Automatically creates bloomfilter and inverted indexes after the table is built
  • Automatically grants access to the workspace_analyst role after running