Lakehouse CDC Change Data Capture Guide (Table Stream)

Overview

Change Data Capture (CDC) is the foundation of real-time data pipelines. Singdata Lakehouse provides the TABLE STREAM object, which automatically tracks INSERT, UPDATE, and DELETE operations on a table and presents them as an incremental data stream for downstream consumption. This guide is organized by business scenario to help you quickly master Table Stream creation and consumption methods.


SQL Commands Covered

CommandPurposeUse Case
CREATE TABLE STREAMCreate a change data streamBind to a source table and enable CDC tracking
SELECT * FROM stream_nameQuery incremental dataConsume change data into downstream tables
SHOW TABLE STREAMSView Stream listMonitor Stream status and lag
DROP TABLE STREAMDrop a StreamClean up CDC objects no longer needed

Prerequisites

The following examples use a simulated user table users_cdc:

-- Create source table
CREATE TABLE IF NOT EXISTS users_cdc (
    user_id INT,
    user_name STRING,
    status STRING
);

-- Insert initial data
INSERT INTO users_cdc VALUES
(1, 'Alice', 'active'),
(2, 'Bob', 'active');

Create Table Stream

Use CREATE TABLE STREAM to bind to a source table. The Stream records all changes since its creation or since the last consumption.

-- Create a Table Stream
CREATE TABLE STREAM users_cdc_stream ON TABLE users_cdc
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

Consume Change Data

Query the Stream to retrieve incremental change records. The rows returned by the Stream include all columns from the original table as well as the metadata column __change_type (operation type).

-- Simulate source table changes
INSERT INTO users_cdc VALUES (3, 'Carol', 'active');
UPDATE users_cdc SET status = 'inactive' WHERE user_id = 2;

-- Query the Stream to retrieve changes
SELECT *, __change_type 
FROM users_cdc_stream 
ORDER BY user_id;

Result Explanation:

user_iduser_namestatus__change_type
2BobactiveUPDATE_BEFORE
2BobinactiveUPDATE_AFTER
3CarolactiveINSERT

Offset Advancement After Consumption

The Table Stream offset automatically advances after downstream DML operations consume the Stream. This means that once you insert Stream data into a target table, the Stream's cursor advances, and the next query will only return new changes.

-- Create a target table
CREATE TABLE IF NOT EXISTS users_sync (
    user_id INT,
    user_name STRING,
    status STRING,
    sync_time TIMESTAMP
);

-- Consume the Stream and write to the target table (offset advances automatically)
INSERT INTO users_sync
SELECT user_id, user_name, status, CURRENT_TIMESTAMP()
FROM users_cdc_stream;

-- Query the Stream again (should be empty as the offset has advanced)
SELECT COUNT(*) FROM users_cdc_stream;

Result Explanation:

COUNT(*)
0

View Stream Information

Use SHOW TABLE STREAMS to view a Stream's status, bound table, and consumption lag.

-- View the Stream list
SHOW TABLE STREAMS LIKE 'users_cdc_stream';

Key Field Descriptions:

  • table_name: The bound source table
  • mode: Consumption mode (STANDARD / APPEND_ONLY)
  • stale_after: Stream expiration time (the Stream becomes invalid if not consumed beyond this time)

Drop Table Stream

Use DROP TABLE STREAM to remove a Stream object that is no longer needed.

-- Drop the Stream
DROP TABLE STREAM users_cdc_stream;

Clean Up Test Data

After completing CDC verification, it is recommended to clean up test tables:

-- Drop test tables
DROP TABLE IF EXISTS users_cdc;
DROP TABLE IF EXISTS users_cdc_stream;
DROP TABLE IF EXISTS users_sync;

Notes

  1. Offset Advancement Mechanism: The Stream offset only advances when consumed by DML statements (such as INSERT INTO ... SELECT FROM stream). Pure SELECT queries do not advance the offset.
  2. Expiration Time: Streams depend on the Time Travel retention period. If left unconsumed beyond data_retention_days, the Stream becomes STALE and cannot be read further.
  3. APPEND_ONLY Mode: If the source table is append-only (no UPDATE/DELETE), you can create a Stream in APPEND_ONLY mode for better performance: CREATE TABLE STREAM ... WITH PROPERTIES ('TABLE_STREAM_MODE' = 'APPEND_ONLY').
  4. Dynamic Table Consumption: Dynamic Tables can build incremental pipelines based on Table Streams, enabling end-to-end real-time data warehousing.