Materialized View Creation and Usage
Function Introduction
Based on the query of existing tables, create a new materialized view in the currently specified Schema and populate the view with query data.
For more detailed information, please refer to Materialized View.
Syntax Description
Required Parameters
- dtname: Specifies the name of the materialized view.
- AS query: The query statement contained in the materialized view.
Optional Parameters
-
IF NOT EXISTS: Optional. If the specified materialized view name exists, the system will not report an error, but the materialized view will not be created successfully. Cannot be used simultaneously with OR REPLACE.
-
OR REPLACE: In traditional databases, this option is used to replace the old object with a new one within the same transaction and delete the old object's data. However, in Lakehouse, to support the addition, deletion, and modification operations of materialized views, we retain data and metadata permission information. This means that even when modifying table structures or SQL logic, the original data will not be lost. This feature is particularly suitable for adding or deleting columns, adjusting SQL processing logic, and changing data types. Please note that you must use the BUILD DEFERRED ...DISABLE QUERY REWRITE syntax when modifying, which will disable the materialized view rewrite function. If the user is not simply deleting columns / adding columns, adding column definitions: it can only be passed through from the table via SELECT, and cannot participate in any calculations that affect other columns such as join key, group key, etc. Then after Create Or Replace occurs, the REFRESH task will degrade to a full refresh. If you want to experience incremental computing features, please use DYNAMIC TABLE
- <column_list>:
- You can specify the column names or add comment information to the columns of the materialized view. You can specify the column names but cannot specify the column types. The types are inferred from the SELECT results in
AS <query>
. If you want to specify the types, you can explicitly use CAST conversion in the SELECT results. - If any column in the table is based on an expression, it is recommended to provide a name for each column. Alternatively, use aliases in <query>.
-
partitioned by (<col> ): Specify partitioning, using the columns in <column_list> as partitions. Partitioning is a method to speed up queries by grouping similar rows together at the time of writing. Using partitions can achieve data pruning and optimize queries.
-
CLUSTERED BY: Optional, specifies the Hash Key. Lakehouse will perform a hash operation on the specified column and distribute the data into various data buckets based on the hash value. To avoid data skew and hotspots, and to improve parallel execution efficiency, it is recommended to choose columns with a large range of values and few duplicate key values as the Hash Key. This usually has a noticeable effect when performing join operations. It is recommended to use CLUSTERED BY in scenarios with large amounts of data, generally with a bucket size between 128MB and 1GB. If no bucketing is specified, the default is 256 buckets.
- SORTED BY: Optional, specifies the sorting method of fields within the Bucket. It is recommended to keep SORTED BY consistent with CLUSTERED BY for better performance. When the SORTED BY clause is specified, the row data will be sorted according to the specified columns.
-
COMMENT: Specify the comment information of the materialized view
-
BUILD DEFERRED: This is a way to create a materialized view. Unlike
BUILD IMMEDIATE
which generates data immediately,BUILD DEFERRED
allows the creation of the materialized view without generating data immediately. The default value is BUILD IMMEDIATE, and when using the CREATE OR REPLACE syntax, it must be BUILD DEFERRED -
refreshOption Optional, refresh option
- START WITH timestamp_exp Specifies the start time, supports specifying a timestamp expression. If START WITH is not written, the refresh starts from the current time
- The
timestamp_expression
returns a standard timestamp type expression. The earliest timestamp specified by TIMESTAMP AS OF depends on the TIME TRAVEL(data_retention_days) parameter. If the specified version does not exist, an error will be reported. If not specified, the version data of the current timestamp will be used, for example: *'2023-11-07 14:49:18'
, a string that can be forcibly converted to a timestamp. *cast('2023-11-07 14:49:18 Asia/Shanghai' as timestamp)
. *current_timestamp() - interval '12' hours
. * Any other expression that is itself a timestamp or can be forcibly converted to a timestamp.
- The
interval_time
specifies the time interval and supports interval types. Ifinterval_time
is not specified butSTART WITH
is provided, the refresh will only occur once at the time specified bySTART WITH
. Theinterval_time
intervals are as follows:
Syntax | Description | Example |
---|---|---|
INTERVAL '[+ | -]' DAY | Specifies only the DAY interval | INTERVAL '1' DAY means 1 day |
INTERVAL '[+ | -]' HOUR | Specifies only the HOUR interval | INTERVAL '23' HOUR means 23 hours |
INTERVAL '[+ | -]' MINUTE | Specifies only the MINUTE interval | INTERVAL '59' MINUTE means 59 minutes |
INTERVAL '[+ | -]' SECOND | Specifies only the SECOND interval | INTERVAL '59.999' SECOND means 59.999 seconds |
INTERVAL '[+ | -] ' DAY TO HOUR | Specifies both DAY and HOUR intervals | INTERVAL '1 23' DAY TO HOUR means 1 day and 23 hours |
INTERVAL '[+ | -] ' DAY TO MINUTE | Specifies DAY, HOUR, and MINUTE intervals | INTERVAL '1 23:59' DAY TO MINUTE means 1 day, 23 hours, and 59 minutes |
INTERVAL '[+ | -] ' DAY TO SECOND | Specifies DAY, HOUR, MINUTE, and SECOND intervals | INTERVAL '1 23:59:59.999' DAY TO SECOND means 1 day, 23 hours, 59 minutes, and 59.999 seconds |
day: The range is [0, 2147483647]. hour: The range is [0, 23]. minute: The range is [0, 59]. second: The range is [0, 59.999999999].
-
The minimum value for INTERVAL is 1 minute, which can be represented as 60 SECOND or 1 MINUTE.
- INTERVAL supports both quoted and unquoted formats. The following are equivalent:
- INTERVAL "60 SECOND"
- INTERVAL '60 SECOND'
- INTERVAL 60 SECOND
- Supported units for INTERVAL: SECOND, MINUTE, HOUR, DAY
- INTERVAL units are case-insensitive, so HOUR and hour are equivalent.
- INTERVAL supports both quoted and unquoted formats. The following are equivalent:
-
Specify the compute cluster in
refreshOption
. Automatic refresh consumes resources, so you need to explicitly specify the compute cluster. If not specified, the current session's compute cluster will be used by default. You can check the current session's compute cluster withSELECT current_vcluster()
.
- DISABLE QUERY REWRITE: This means that the materialized view does not support the query rewrite feature. Query rewrite refers to the database optimizer automatically determining whether the result can be obtained by querying the materialized view when querying the base table of the materialized view. If possible, it avoids aggregation or join operations and directly reads data from the already calculated materialized view.
Notes
- The incremental refresh of the materialized view is based on the historical version of the base table. The historical version depends on the TIME TRAVEL (data_retention_days) parameter. If the specified version does not exist, an error will be reported. This parameter defines the length of time that deleted data is retained. By default, Lakehouse retains data for one day. Depending on your business needs, you can extend or shorten the data retention period by adjusting the
data_retention_days
parameter. Please note that adjusting the data retention period may affect storage costs. Extending the retention period will increase storage requirements, which may increase related costs.
Examples
Example 1 Simple Materialized View
This example demonstrates how to create a table (inventory
) and how to create a materialized view (mv_inventory_basic
) based on that table. Then, a piece of data is inserted, data is selected from the materialized view, the materialized view is refreshed, and finally, data is selected from the materialized view again to observe the effect of manually refreshing the materialized view.
The above SQL statements demonstrate how to create and use materialized views. In practice, materialized views can significantly improve query performance, especially when dealing with complex calculations or large amounts of data. By refreshing the materialized view, you ensure that the view data remains consistent with the source table data, providing the latest query results. Note: Since no scheduling parameters were used when creating the materialized view, manual refresh is required here.
Example 2 Materialized View with Column List and Comments
Example 3 Materialized View with Partitioning and Clustering
Example 4 Materialized View with Refresh Option and Virtual Compute Cluster
Modify the refresh period of a materialized view with refresh options:
View and modify results:
You can see that the value of refresh_interval_second
is already 60, indicating that the table name modification refresh cycle was successful.
For materialized views that automatically refresh periodically, if you need to pause or resume automatic refreshing, please refer to:
Modify Materialized View
Example 5 Comprehensive Example
Example 6: Create a Materialized View and Add Comments
Rebuilding MATERIALIZED VIEW can reuse the last result. The scenario includes adding columns, which can reuse the previous mv, and the new columns will show old data as null.
Case