Develop Dynamic Tables for Near Real-Time Incremental Processing
[Preview Release] This feature is currently in an invite-only preview. To try it out, contact Singdata Technology through the official website.
Tutorial Overview
In this tutorial, you will learn how to use Lakehouse Dynamic Tables to build a complete near real-time ETL pipeline. The scenario is designed as follows:
The tutorial implements a streaming processing pipeline with minute-level latency, composed of a data import task, data cleaning and transformation based on dynamic tables, and a data aggregation task.
About Dynamic Tables
A Dynamic Table is an object type, similar to a materialized view, that supports processing only incremental change data. When creating a dynamic table, you define its data computation logic; performing a refresh triggers that logic and updates the dynamic table. Compared to traditional ETL tasks, a dynamic table only requires you to declare the computation logic with full (batch) semantics at definition time, and it automatically optimizes the computation incrementally during each refresh. With dynamic tables, Lakehouse can perform near real-time incremental processing of streaming data, providing data preparation for real-time analysis.
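To make the idea concrete, the hedged sketch below shows the general shape of a dynamic table definition: the query is declared once with full semantics, and each refresh only processes the incremental changes in the source. The DDL keywords and the names (dt_example, raw_events) are assumptions for illustration; see the tutorial scripts and the Lakehouse SQL reference for the exact syntax.

```sql
-- Conceptual sketch (keywords and names are illustrative, not exact Lakehouse DDL):
-- declare the query once; each refresh applies it incrementally to changed data.
CREATE DYNAMIC TABLE dt_example AS
SELECT user_id, COUNT(*) AS pv
FROM raw_events
GROUP BY user_id;
```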
Tutorial Steps
- Data Import: create the raw table and continuously write user behavior data;
- Develop Dynamic Table Models: build the data cleaning and data aggregation processing flows with Dynamic Tables;
- Verify Processing Results: monitor the refresh status of the pipeline built from dynamic tables, and check the changes in the consumption-layer table to confirm the real-time processing results.
Through the above steps, you will learn how to develop streaming processing tasks in Lakehouse using dynamic tables.
Preparation
First, create a behavior log table and write test data through a scheduled INSERT INTO task.
This tutorial provides SQL scripts for creating the data tables and inserting test data in the [Development] module. Open the [Tutorial_Working_With_Dynamic_Table->Step01.Preparation] script file, configure a 1-minute interval scheduling strategy, and submit it for deployment to simulate real-time data import.
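The actual DDL and test data are defined in the Step01.Preparation script; the sketch below only illustrates the kind of statements such a preparation step contains. The table name ods_user_behavior_log and its columns are hypothetical.

```sql
-- Illustrative shape of the preparation step (the real script may differ).
CREATE TABLE IF NOT EXISTS ods_user_behavior_log (
  user_id    BIGINT,
  event_type STRING,
  item_id    BIGINT,
  event_time TIMESTAMP
);

-- Scheduled at a 1-minute interval, each run appends a small batch of rows
-- to simulate user behavior data arriving in near real time.
INSERT INTO ods_user_behavior_log VALUES
  (1001, 'click',    2001, CURRENT_TIMESTAMP),
  (1002, 'purchase', 2002, CURRENT_TIMESTAMP);
```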
Transform Layer Dynamic Table Model Development
In the [Development] module, open the [Tutorial_Working_With_Dynamic_Table->Step02.Transformation_With_Dynamic_Table] sample dynamic table task file. This dynamic table defines the ETL logic for cleaning and transforming the raw table.
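The real cleaning logic is defined in the Step02 file; as a hedged illustration only, a transform-layer dynamic table might look roughly like the sketch below. The table names, columns, and DDL keywords are assumptions carried over from the earlier sketches.

```sql
-- Illustrative transform-layer dynamic table: filter malformed rows and
-- normalize fields from the raw behavior log.
CREATE DYNAMIC TABLE dwd_user_behavior AS
SELECT
  user_id,
  LOWER(event_type)                           AS event_type,    -- normalize event names
  item_id,
  event_time,
  DATE_FORMAT(event_time, 'yyyy-MM-dd HH:mm') AS event_minute   -- minute bucket for downstream aggregation
FROM ods_user_behavior_log
WHERE user_id IS NOT NULL
  AND event_time IS NOT NULL;
```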
Refer to the figure below to configure the "Run Cluster" and "Scheduling Parameters":
Submit Deployment
Click the [Submit] button to deploy the dynamic table model to the target Lakehouse environment. The system will then refresh the table automatically at the refresh frequency configured on the dynamic table.
Aggregate Layer Dynamic Table Model Development
In the [Development] module, open the [Tutorial_Working_With_Dynamic_Table->Step03.Aggregation_With_Dynamic_Table] sample dynamic table task file. This dynamic table defines the ETL logic for aggregating and analyzing the intermediate data model.
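The actual aggregation logic is in the Step03 file; as a rough sketch under the same hypothetical names, an aggregate-layer dynamic table built on the cleaned model could look like this:

```sql
-- Illustrative aggregate-layer dynamic table: per-minute event counts and
-- distinct users, computed incrementally on top of the transform layer.
CREATE DYNAMIC TABLE dws_user_behavior_agg AS
SELECT
  event_minute,
  event_type,
  COUNT(*)                AS event_cnt,
  COUNT(DISTINCT user_id) AS user_cnt
FROM dwd_user_behavior
GROUP BY event_minute, event_type;
```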
Refer to the image below to configure scheduling and submit the run:
Verify Incremental Update Results
In the [Development] module, open the Step04.Check_Data_Freshness file and execute the query to check the data freshness of the automatically refreshed dynamic tables.
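One simple way to check freshness (independent of the provided Step04 query, whose contents are not shown here) is to compare the newest processed minute in the aggregate table with the current time; the table and column names below are the same hypothetical ones used in the earlier sketches.

```sql
-- With minute-level refresh, latest_processed_minute should stay within a few
-- minutes of checked_at; re-run the query to watch new data arrive.
SELECT
  MAX(event_minute)  AS latest_processed_minute,
  CURRENT_TIMESTAMP  AS checked_at
FROM dws_user_behavior_agg;
```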
Environment Cleanup
In the [Task Operations] module, take the data import task offline from the periodic task list; take the two dynamic table model tasks offline from the dynamic table task list.