Create Dynamic Table
Description
A Dynamic Table is a special type of table that updates its data based on a query statement. Dynamic Tables let you handle data more flexibly and improve query efficiency. For more usage details, refer to the Dynamic Table Introduction and the Incremental Calculation section.
Syntax
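The syntax block is not reproduced here; the sketch below is assembled from the parameters described in the following sections, and the exact keywords, in particular the refresh clause, are assumptions that should be checked against the product reference.

```sql
CREATE [OR REPLACE] DYNAMIC TABLE [IF NOT EXISTS] <dtname>
  [(<column_list>)]
  [PARTITIONED BY (<col> [, ...])]
  [CLUSTERED BY (<col> [, ...]) [SORTED BY (<col> [, ...])] INTO <num_buckets> BUCKETS]
  [COMMENT '<comment>']
  [<refreshOption>]   -- e.g. a START WITH timestamp, a refresh interval, and a compute cluster
AS <query>;
```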
Required Parameters
- dtname: Specify the dynamic table name.
- AS query: The query statement contained in the dynamic table.
Optional Parameters
- IF NOT EXISTS: Optional. If a dynamic table with the specified name already exists, the system does not report an error, but the dynamic table is not created. Cannot be used together with OR REPLACE.
- OR REPLACE: In traditional databases, this option replaces the old object with a new one in the same transaction and deletes the old object's data. In Lakehouse, however, to support add, delete, and modify operations on Dynamic Tables, the existing data and metadata permission information are retained. This means that even when the table structure or SQL logic is modified, the original data is not lost. This feature is particularly suitable for adding or deleting columns, adjusting SQL processing logic, and changing data types. Note two restrictions: a newly added column can only be passed through directly from the SELECT results and cannot participate in calculations that affect other columns (for example, it cannot serve as a join key or group key); and if the change is not simply adding or deleting columns, the REFRESH task after CREATE OR REPLACE degrades to a full refresh.
- <column_list>: You can specify column names and add comment information for the columns of the dynamic table. Column names can be specified, but column types cannot; the types are inferred from the SELECT results in AS <query>. If you want a specific type, use an explicit CAST in the SELECT results. If any column in the table is based on an expression, it is recommended to provide a name for each column, or to use aliases in the <query>.
- PARTITIONED BY (<col>): Specifies partition columns, chosen from <column_list>. Partitioning speeds up queries by grouping similar rows together at write time; it enables data pruning and query optimization.
- CLUSTERED BY: Optional, specifies the Hash Key. Lakehouse will perform a hash operation on the specified column and distribute the data into various data buckets based on the hash value. To avoid data skew and hotspots, and to improve parallel execution efficiency, it is recommended to choose columns with a large range of values and few duplicate key values as the Hash Key. This usually has a noticeable effect when performing join operations. It is recommended to use CLUSTERED BY in scenarios with large amounts of data, generally with a bucket size between 128MB and 1GB. If no buckets are specified, the default is 256 buckets.
- SORTED BY: Optional, specifies the sorting method for fields within the Bucket. It is recommended to keep SORTED BY consistent with CLUSTERED BY for better performance. When the SORTED BY clause is specified, row data will be sorted according to the specified columns.
- comment: Specifies comment information for the dynamic table.
- refreshOption (optional): Refresh options.
  - START WITH timestamp_exp: Specifies the start time as a timestamp expression. If START WITH is not specified, the refresh starts from the current time.
    - The timestamp_expression must return a standard timestamp type. The earliest usable timestamp depends on the TIME TRAVEL (data_retention_days) parameter; if the specified version does not exist, an error is reported. If not specified, the data version at the current timestamp is used. Valid forms include:
      - '2023-11-07 14:49:18', a string that can be cast to a timestamp
      - cast('2023-11-07 14:49:18 Asia/Shanghai' as timestamp)
      - current_timestamp() - interval '12' hours
      - any other expression that is itself a timestamp or can be cast to a timestamp
  - interval_time: Specifies the refresh interval and supports interval types. If interval_time is not specified but START WITH is, the table is refreshed only once at the time specified by START WITH. The supported interval_time forms are as follows:
| Syntax | Description | Example |
|---|---|---|
| INTERVAL '[+\|-] d' DAY | Specifies only a DAY interval | INTERVAL '1' DAY means 1 day |
| INTERVAL '[+\|-] h' HOUR | Specifies only an HOUR interval | INTERVAL '23' HOUR means 23 hours |
| INTERVAL '[+\|-] m' MINUTE | Specifies only a MINUTE interval | INTERVAL '59' MINUTE means 59 minutes |
| INTERVAL '[+\|-] s' SECOND | Specifies only a SECOND interval | INTERVAL '59.999' SECOND means 59.999 seconds |
| INTERVAL '[+\|-] d h' DAY TO HOUR | Specifies DAY and HOUR intervals | INTERVAL '1 23' DAY TO HOUR means 1 day and 23 hours |
| INTERVAL '[+\|-] d h:m' DAY TO MINUTE | Specifies DAY, HOUR, and MINUTE intervals | INTERVAL '1 23:59' DAY TO MINUTE means 1 day, 23 hours, and 59 minutes |
| INTERVAL '[+\|-] d h:m:s' DAY TO SECOND | Specifies DAY, HOUR, MINUTE, and SECOND intervals | INTERVAL '1 23:59:59.999' DAY TO SECOND means 1 day, 23 hours, 59 minutes, and 59.999 seconds |
Value ranges: day [0, 2147483647]; hour [0, 23]; minute [0, 59]; second [0, 59.999999999].
- The minimum value for INTERVAL is 1 minute, which can be represented as 60 SECOND or 1 MINUTE.
- INTERVAL can be specified with or without quotes, and the following are equivalent:
- INTERVAL "60 SECOND"
- INTERVAL '60 SECOND'
- INTERVAL 60 SECOND
- Supported units for INTERVAL: SECOND, MINUTE, HOUR, DAY
- INTERVAL units are case-insensitive, so HOUR and hour are equivalent.
- Compute cluster: Specify the compute cluster in refreshOption. Automatic refresh consumes resources, so you need to explicitly specify the compute cluster; if not specified, the current session's compute cluster is used by default. You can check the current session's compute cluster with SELECT current_vcluster(). A complete creation example combining these options is shown below.
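A minimal end-to-end sketch combining the options above. Table, column, and cluster names are hypothetical, and the spelling of the refresh options (START WITH / INTERVAL / VCLUSTER) is an assumption to verify against your Lakehouse release.

```sql
CREATE DYNAMIC TABLE IF NOT EXISTS dt_sales_daily (
  sale_date  COMMENT 'partition column',
  shop_id,
  total_amt  COMMENT 'type fixed via CAST in the query'
)
PARTITIONED BY (sale_date)
CLUSTERED BY (shop_id) SORTED BY (shop_id) INTO 256 BUCKETS
COMMENT 'daily sales aggregation'
-- assumed refresh-option spelling: start now, refresh hourly, run on vcluster etl_vc
REFRESH START WITH current_timestamp() INTERVAL '1' HOUR VCLUSTER etl_vc
AS
SELECT sale_date,
       shop_id,
       CAST(SUM(amount) AS DECIMAL(18, 2)) AS total_amt
FROM   sales_orders
GROUP BY sale_date, shop_id;
```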
Notes
- The incremental refresh of dynamic tables is based on the historical versions of the base table. Historical versions depend on the TIME TRAVEL (data_retention_days) parameter, and an error occurs if the specified version does not exist. This parameter defines how long deleted data is retained; Lakehouse retains it for one day by default. Depending on your business needs, you can adjust the data_retention_days parameter to extend or shorten the data retention period. Note that adjusting the retention period may affect storage costs: extending it increases storage requirements and the associated costs.
Parameterized Dynamic Table Creation Syntax
When the dynamic table is a partitioned table, parameters can be passed in. Parameters are defined through SESSION_CONFIGS()['dt.arg.xx']. SESSION_CONFIGS() is a built-in system function. For non-partitioned tables, if parameterized definitions are used and the passed parameter values remain unchanged, the system will perform an incremental refresh. However, once the parameter values change, the system will perform a full refresh, as this is equivalent to changing the table definition. For partitioned tables, the partition key must be a parameter field, but these parameter fields are not required to be partition columns of the table.
Creation Statement
The parameterized definition of Dynamic Table allows users to define parameters through SESSION_CONFIGS()['dt.arg.xx']. These parameters are referenced when creating the table, thereby dynamically affecting the content of the table.
- SESSION_CONFIGS(): A built-in system function used to read parameters from the session configuration.
- 'dt.arg.xx': The name of the DT parameter, which must start with dt.arg. to avoid conflicts with internal system fields.
Example Code
In this example, dt.args.event_day is a DT parameter, which is read from the configuration using the SESSION_CONFIGS() function. If non-string type parameters are needed, the CAST function can be used for type conversion.
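The example code itself is not included above. The following hedged sketch reads the dt.args.event_day parameter from the session configuration and casts it to a date; the source table dwd_events is hypothetical.

```sql
CREATE DYNAMIC TABLE dt_event_stats
PARTITIONED BY (event_day)
AS
SELECT event_day,
       event_type,
       COUNT(*) AS pv
FROM   dwd_events
-- SESSION_CONFIGS() returns strings; CAST converts the parameter to the column type
WHERE  event_day = CAST(SESSION_CONFIGS()['dt.args.event_day'] AS DATE)
GROUP BY event_day, event_type;
```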
Refresh Statement
The refresh behavior of a parameterized Dynamic Table depends on whether the table is partitioned.
Non-Partitioned Table Refresh Syntax
- If the parameter values of a non-partitioned table remain unchanged, the refresh is incremental.
- If the parameter values of a non-partitioned table change, a full refresh is performed.
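A sketch of the two cases, assuming parameters are passed with a SET statement and the table is refreshed with REFRESH DYNAMIC TABLE (both spellings are assumptions):

```sql
-- same parameter value as the previous run: incremental refresh
SET dt.args.bizdate = '2024-05-11';
REFRESH DYNAMIC TABLE dt_report;

-- changed parameter value: equivalent to changing the table definition, full refresh
SET dt.args.bizdate = '2024-05-12';
REFRESH DYNAMIC TABLE dt_report;
```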
Partitioned Table Refresh Syntax
When refreshing a partitioned table, the partition_spec must be specified in the order of the table's partition hierarchy. This means that if the table is partitioned by multiple fields, these fields need to be specified from the highest level to the lowest level.
Example Description
Assuming a table is partitioned by day, hour, and min in three levels, the correct way to specify the partition_spec is as follows:
Valid Specification: You can specify higher-level and some lower-level partitions, but you cannot skip any intermediate level partitions.
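For a table partitioned by day, hour, and min, a valid specification might look like the following (the REFRESH spelling and names are assumptions):

```sql
-- valid: day and hour are given; the lowest level (min) may be omitted
REFRESH DYNAMIC TABLE dt_metrics PARTITION (day = '2024-05-11', hour = '10');
```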
In this example, day and hour are specified, while the min partition can be ignored.
Invalid Specification: Skipping the specification of any intermediate level partitions is not allowed.
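By contrast, skipping the intermediate hour level is rejected (same assumptions as above):

```sql
-- invalid: hour is skipped, breaking the partition hierarchy
REFRESH DYNAMIC TABLE dt_metrics PARTITION (day = '2024-05-11', min = '30');
```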
Precautions
- Parameter and Partition Consistency: When performing a refresh operation on a Dynamic Table, you must ensure that the parameter values passed in are consistent with the partition values to be refreshed. If they are inconsistent, the system reports an error during execution.
- Error Example:
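A hedged reconstruction of the error case described below; the SET and REFRESH spellings are assumptions.

```sql
-- the parameter says event_day = 11, but the target partition says event_day = 19
SET dt.args.event_day = '11';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '19');   -- error
```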
In the above situation, although the parameter event_day is set to 11, the refresh command attempts to write to partition event_day=19. This will cause the actual calculated partition result to be inconsistent with the specified partition, resulting in errors.
- Correct Example:
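A corresponding correct sketch, where the parameter value and the target partition agree:

```sql
SET dt.args.event_day = '11';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '11');
```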
- Concurrent Refresh Tasks: As long as there are no conflicts between partitions, the system allows multiple partition refresh tasks to be executed simultaneously. In the commands below, the parameter values match the partition values, so they can be executed concurrently without conflicts.
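For instance (same syntax assumptions), the two jobs below touch different partitions with matching parameter values and can run at the same time:

```sql
-- job 1
SET dt.args.event_day = '11';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '11');

-- job 2, submitted from another session: different partition, matching parameter
SET dt.args.event_day = '12';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '12');
```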
Full Refresh and Incremental Refresh
- Full Refresh
Full refresh occurs in the following situations:
- The parameter values of a non-partitioned table change.
- The parameter values of a partitioned table change, causing the partition conditions to change. Example code is shown below.
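A hedged sketch of the full-refresh cases (hypothetical names, assumed SET/REFRESH spelling):

```sql
-- non-partitioned table: a changed parameter value is treated as a changed definition
SET dt.args.bizdate = '2024-05-12';                                   -- previous run used '2024-05-11'
REFRESH DYNAMIC TABLE dt_report;                                      -- full refresh

-- partitioned table: a changed parameter value changes the partition condition
SET dt.args.event_day = '12';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '12');   -- full refresh
```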
- Incremental Refresh
Incremental refresh occurs in the following situations:
- The parameter values of the non-partitioned table remain unchanged.
- The parameter values of the partitioned table remain unchanged, and the partition conditions have not changed. Example code is shown below.
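A matching sketch of the incremental cases (same assumptions):

```sql
-- non-partitioned table: unchanged parameter value
SET dt.args.bizdate = '2024-05-11';
REFRESH DYNAMIC TABLE dt_report;                                      -- incremental refresh

-- partitioned table: unchanged parameter value, unchanged partition condition
SET dt.args.event_day = '11';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '11');   -- incremental refresh
```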
Scenario Case
Case 1: Converting Offline Tasks to Incremental Tasks. This section guides users through converting existing offline tasks into incremental tasks for more efficient data processing. The steps below are based on a "traditional database" style task and apply to scenarios where the business logic stays consistent and data is refreshed daily.
- Step 1: Parameterize the original SQL. The original SQL is as follows
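The original statement is not reproduced above; the hypothetical daily aggregation below, driven by the ${bizdate} scheduling parameter, stands in for it:

```sql
INSERT OVERWRITE TABLE ads_sales_daily PARTITION (bizdate = '${bizdate}')
SELECT shop_id,
       SUM(amount) AS total_amt
FROM   dwd_orders
WHERE  bizdate = '${bizdate}'
GROUP BY shop_id;
```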
First, all parameters ${bizdate} passed in by the scheduling engine in the original SQL need to be replaced with SESSION_CONFIGS()['dt.args.bizdate']. This step will allow the parameter values to be dynamically passed through configuration, rather than hard-coded in the SQL.
Original SQL Parameter Replacement:
Replace all ${bizdate} with SESSION_CONFIGS()['dt.args.bizdate']:
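Applied to the hypothetical statement above, the rewritten definition becomes a parameterized Dynamic Table (sketch):

```sql
CREATE DYNAMIC TABLE dt_sales_daily
PARTITIONED BY (bizdate)
AS
SELECT bizdate,
       shop_id,
       SUM(amount) AS total_amt
FROM   dwd_orders
WHERE  bizdate = SESSION_CONFIGS()['dt.args.bizdate']
GROUP BY bizdate, shop_id;
```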
- Step 2: Schedule the Refresh Command. During each scheduled run, the parameter dt.args.bizdate needs to be set to the specific date value, and the refresh command should be executed. Example of a scheduled refresh command:
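A sketch of what the scheduler submits on each run (the SET and REFRESH spellings are assumptions; the scheduler fills in the concrete date):

```sql
SET dt.args.bizdate = '2024-05-11';
REFRESH DYNAMIC TABLE dt_sales_daily PARTITION (bizdate = '2024-05-11');
```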
Case 2: Incremental Task Data Supplement. In some cases, users may need to supplement data in existing partitions.
- Method 1: Supplement data in the source table. Users can directly supplement data in the source table; the supplemented data is automatically reflected in the Dynamic Table (DT) through the corresponding REFRESH task.
Operation Steps
- Directly insert or update data in the source table.
- Execute the REFRESH task to synchronize the changes to the DT.
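A minimal sketch of Method 1, reusing the hypothetical dwd_orders source table and dt_sales_daily table from Case 1:

```sql
-- 1. supplement the missing rows in the source table
INSERT INTO dwd_orders VALUES ('2024-05-11', 1001, 99.90);

-- 2. refresh the affected DT partition so the change is picked up
SET dt.args.bizdate = '2024-05-11';
REFRESH DYNAMIC TABLE dt_sales_daily PARTITION (bizdate = '2024-05-11');
```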
- Method 2: Use DML statements to supplement data directly in the DT. Users can also use DML statements to insert data into specific partitions of the DT. Operation Steps:
- Use DML statements to insert data into specific partitions of the DT.
- Note that directly modifying the DT will cause that partition to be fully refreshed the next time it is refreshed. If users do not want a full refresh, they should avoid scheduling the REFRESH task for that partition. Example code is shown below.
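A sketch of Method 2, inserting directly into the DT (hypothetical names; the row targets the bizdate = '2024-05-11' partition):

```sql
INSERT INTO dt_sales_daily VALUES ('2024-05-11', 1001, 99.90);
-- note: the next REFRESH of this partition would be a full refresh,
-- so avoid scheduling it if a full refresh is not wanted
```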
Notes
- Data directly inserted into DT will participate in downstream calculations of DT. If the downstream old partitions do not need this data, please do not schedule REFRESH tasks for partitions involving this data.
- Other unaffected partitions can still be incrementally refreshed.
Case 3: Executing Incremental Tasks in Different VCs. For a parameterized partitioned DT, refresh tasks for different partitions can be executed simultaneously. Users can allocate different REFRESH tasks to different Virtual Clusters (VCs) as needed. Operation Steps:
- Allocate different REFRESH tasks to different VCs based on timeliness requirements and resource needs.
- For example, for new partitions with high timeliness requirements, their REFRESH tasks can be executed in large VCs with more resources.
- For supplementary tasks of other old partitions, their REFRESH tasks can be executed in smaller VCs with fewer resources.
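One possible arrangement, assuming the session compute cluster can be switched with a USE VCLUSTER-style statement before each REFRESH (this spelling, like SET and REFRESH above, is an assumption):

```sql
-- latency-sensitive new partition on a large cluster
USE VCLUSTER vc_large;
SET dt.args.event_day = '2024-05-12';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '2024-05-12');

-- backfill of an old partition on a small cluster (typically a separate session/schedule)
USE VCLUSTER vc_small;
SET dt.args.event_day = '2024-04-01';
REFRESH DYNAMIC TABLE dt_event_stats PARTITION (event_day = '2024-04-01');
```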
Reference Documents
- Dynamic Table Visualization Interface Development
- View Dynamic Table Details
- View All Dynamic Tables Under Schema
- Modify Dynamic Table
- Delete Dynamic Table
- View Dynamic Table Refresh History
- View Dynamic Table Creation Statement
- Restore Deleted Dynamic Table
- Restore Dynamic Table to Specified Version
- View Dynamic Table Version History
- View Data of Specified Version of Dynamic Table
- Introduction to Dynamic Table
