Table Stream Change Data Capture

Table Stream is a feature of Singdata Lakehouse for capturing changes to table objects. By defining a Table Stream on an existing table, you can record and track its data changes. The feature leverages the multi-version history of Lakehouse tables: when the stream is created, it records a specified version (or, by default, the latest version) of the source table as its initial read position. Querying the Table Stream with SQL returns all change records from that position up to the current latest version.

Below is a schematic diagram of how Table Stream works:

Table Stream Working Principle

Table Stream does not store actual data; it only records and maintains a data version point (offset) on the source table. The offset advances to the latest data version only when the Table Stream is consumed by a DML statement such as INSERT, DELETE, UPDATE, or MERGE that reads from it.

Application Scenarios

  1. SQL ETL Tasks: Using Table Stream can simplify the identification of incremental data and improve the processing efficiency of ETL tasks. Traditional ETL tasks need to filter incremental data through partition filtering or WHERE condition filtering, which may generate additional system pressure and resource consumption in real-time scenarios. Table Stream identifies change data based on data versions, capturing incremental data without additional operations and calculations, thereby improving ETL job efficiency and saving computational overhead.
  2. Real-time Data Provision: Similar to database CDC (Change Data Capture), a Table Stream on the Lakehouse data platform can capture change data and continuously synchronize it, as detailed records, to downstream systems.

Table Stream Types

Table Stream has two modes:

  1. STANDARD Mode: Tracks all DML changes to the source object, including inserts, updates, and deletes (including table truncation). This mode provides row-level deltas by joining and processing all changes that occur between two transaction time points. For example, if a row is inserted and then updated after the stream's offset, the delta is a single new row with the updated values; if a row is inserted and then deleted after the offset, the delta contains no row at all. In other words, delta changes reflect the net difference from the source object's state at the offset, not the full history of intermediate changes.
  2. APPEND_ONLY Mode: Records only INSERT operations on the source object; update and delete operations are not captured. For example, if 10 rows are inserted into the table and 5 of them are then deleted without the stream's offset advancing, the Table Stream still returns all 10 inserted rows.
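
The difference between the two modes can be illustrated with a sketch that creates one stream of each mode on the same source table. This assumes the same `CREATE TABLE STREAM ... WITH PROPERTIES` syntax used later in this document; the table and stream names are illustrative:

```sql
-- Illustrative names; assumes change tracking must be enabled on the source table
CREATE TABLE mode_demo (id INT, name STRING);
ALTER TABLE mode_demo SET PROPERTIES ('change_tracking' = 'true');

-- One stream per mode on the same source table
CREATE TABLE STREAM mode_demo_std ON TABLE mode_demo
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');
CREATE TABLE STREAM mode_demo_ao ON TABLE mode_demo
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'APPEND_ONLY');

-- Insert a row, then update it, both after the streams' offsets
INSERT INTO mode_demo VALUES (1, 'apple');
UPDATE mode_demo SET name = 'apricot' WHERE id = 1;

SELECT * FROM mode_demo_std; -- STANDARD: one net row reflecting the updated state
SELECT * FROM mode_demo_ao;  -- APPEND_ONLY: only the insert is recorded; the update is not captured
```

Neither SELECT advances an offset; only consuming a stream with a DML statement does.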

Using Table Stream

Creating a Table Stream

You can use the CREATE TABLE STREAM command to create a Table Stream. Here is an example:

-- Create test table
CREATE TABLE data_change_test (id INT, name STRING);
INSERT INTO data_change_test VALUES (1, 'apple');
ALTER TABLE data_change_test SET PROPERTIES ('change_tracking' = 'true');
-- Create a table stream on data_change_test to get incremental records inserted from the current time
CREATE TABLE STREAM data_change_test_stream ON TABLE data_change_test
WITH PROPERTIES('TABLE_STREAM_MODE' = 'APPEND_ONLY');

-- Insert test data
INSERT INTO data_change_test VALUES (2, 'banana');
SELECT * FROM data_change_test; -- Returns both records
SELECT * FROM data_change_test_stream; -- Returns only one record: the stream's initial offset was set after the row with id 1 was inserted
-- Use DML statements to consume records
CREATE TABLE data_change_test_offset (id INT, name STRING);
INSERT INTO data_change_test_offset SELECT id, name FROM data_change_test_stream; -- Consuming the stream advances its offset
SELECT id, name FROM data_change_test_stream; -- Returns no rows: the stream has been consumed
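
Because any DML statement that reads from a stream advances its offset, a stream can also be consumed with a MERGE to apply changes to a target table. The sketch below uses standard SQL MERGE syntax and an illustrative target table; Singdata's exact MERGE dialect may differ:

```sql
-- Illustrative target table kept in sync via the stream
CREATE TABLE data_change_target (id INT, name STRING);

-- Apply captured changes; reading the stream here advances its offset
MERGE INTO data_change_target t
USING data_change_test_stream s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);
```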

Deleting a Table Stream

You can use the DROP TABLE STREAM command to delete a Table Stream. Here is an example:

DROP TABLE STREAM IF EXISTS data_change_test_stream;

Timeliness of Table Stream Change Capture

Table Stream detects data changes based on the commit time of metadata modifications on the source object. The specific timeliness is as follows:

  • DML: After a DML task completes successfully, the changed data becomes visible in the Table Stream object.
  • Bulkload: After a bulkload task completes successfully, the changed data becomes visible in the Table Stream object.
  • Streaming ingestion: When data is written through the Ingestion Service streaming API, changes are committed every 1 minute by default, after which they become visible in the Table Stream object. Note: querying the target table of a streaming write directly shows the data in real time; the commit interval only affects when the changes become visible through the Table Stream.
  • Streaming ingestion: When writing data through the Ingestion Service streaming API, changes are submitted by default every 1 minute, and the changed data will be visible in the Table Stream object. Note: When performing SQL queries on the target table itself for streaming writes, the data is visible in real-time. This only constrains the timeliness of the Table Stream visibility based on the target table.