CREATE TABLE
Description
This statement creates a new table. In the Lakehouse, a table is the basic unit of data storage; by creating a table, you can organize and manage data according to a specified structure.
Syntax
column_definition Description
Syntax
Column Type (column_type)
- column_type: The column's data type. The supported types are as follows:
- NOT NULL: Indicates that the column does not allow NULL values. It can only be specified when the table is created and cannot be added later with ALTER. To remove a NOT NULL constraint, use the modify-column-type syntax, for example to remove the NOT NULL constraint from an int column.
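A sketch of the workflow above; the table and column names are illustrative, and the exact ALTER clause for re-declaring a column's type is an assumption that may differ in your Lakehouse version:

```sql
-- Column declared NOT NULL at creation time.
CREATE TABLE t_demo (
    id   INT NOT NULL,
    name STRING
);

-- Remove the NOT NULL constraint by re-declaring the column's type.
-- NOTE: assumed clause spelling; consult the modify-column-type documentation.
ALTER TABLE t_demo ALTER COLUMN id SET DATA TYPE INT;
```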
Primary Key (PRIMARY KEY)
- PRIMARY KEY: In the Lakehouse architecture, for tables with a defined primary key, the system automatically deduplicates data based on the primary key value during real-time data ingestion. Once a primary key is set, you cannot perform INSERT, DELETE, or UPDATE operations through SQL statements, nor add or remove columns; data must be handled through the real-time data interface. A primary key can only be specified when the table is created. For more details, refer to the Primary Key Introduction.
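A minimal sketch of a primary-key table, with illustrative table and column names:

```sql
-- Primary key must be declared at creation time; it cannot be added later.
CREATE TABLE user_profile (
    user_id BIGINT NOT NULL,
    name    STRING,
    city    STRING,
    PRIMARY KEY (user_id)
);
-- Rows ingested through the real-time interface are deduplicated on user_id.
-- SQL INSERT/UPDATE/DELETE and column changes are not allowed on this table.
```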
Auto-increment Column (IDENTITY[(seed)])
- IDENTITY[(seed)]: Specifies an auto-increment column. It does not guarantee that the generated values are continuous (gapless), nor that they are allocated in any particular order, because other concurrent inserts may occur on the table; these limitations are by design, to improve performance. For usage details, refer to the IDENTITY Column documentation.
Generated Columns (GENERATED ALWAYS AS)
- GENERATED ALWAYS AS (expr): Automatically generates the column's value from the expression expr. The expression can include constants and built-in deterministic scalar SQL functions. Non-deterministic functions (such as current_date, random, current_timestamp, and context functions) and non-deterministic operators are not supported, nor are aggregate functions, window functions, or table functions. Generated columns can be used as partition columns.
Default Values (DEFAULT)
- DEFAULT default_expression: Defines a default value for a newly added column. If the column's value is not specified in an INSERT, UPDATE, or MERGE operation, the default value is used automatically. For rows that existed before the column was added, the column is filled with NULL. Constant values and non-deterministic functions (such as current_date, random, current_timestamp, and context functions) are supported.
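An illustrative sketch of a DEFAULT column, using current_timestamp as one of the non-deterministic functions the text says DEFAULT supports (table and column names are made up):

```sql
CREATE TABLE events (
    id         BIGINT,
    event_name STRING,
    created_at TIMESTAMP DEFAULT current_timestamp()
);

-- An INSERT that omits created_at fills in the default automatically.
INSERT INTO events (id, event_name) VALUES (1, 'signup');
```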
Index Definition (index_definition_list)
Syntax
columns_definition: Defines the table's fields; a comma must separate the last field from the index definitions
INDEX: Keyword
index_name: Custom name of the index
column_name: The name of the field that needs to be indexed
index_type: Index type, currently supports bloomfilter, inverted, vector
COMMENT: Specifies the description information of the index
PROPERTIES: Specifies the parameters of the INDEX, different indexes support different parameters, refer to the corresponding index documentation for details
Partitioned By (PARTITIONED BY) {#partitioned-by}
Partitioning is a method of speeding up queries by grouping similar rows together at write time. Partitioning enables data pruning: when querying a table, use a WHERE clause on the partition column to read only the specified partitions, avoiding full table scans, improving processing efficiency, and reducing compute resources. For details, refer to the Partition Introduction. Two syntaxes are supported. In the first, the partition field and its type are declared together with the other columns when creating the table, and only the field name is listed in the PARTITIONED BY clause. In the second, the partition field and its type are declared directly in the PARTITIONED BY clause.
Bucketed Table (CLUSTERED BY)
- CLUSTERED BY: Specifies the hash key. Singdata performs a hash operation on the specified columns and distributes rows into buckets based on the hash value. To avoid data skew and hotspots and to improve parallel execution efficiency, choose columns with a wide range of values and few duplicate keys as the hash key; this is usually most effective for join operations. CLUSTERED BY is recommended for scenarios with large amounts of data, generally with a bucket size between 128MB and 1GB. If no bucket count is specified, the default is 256 buckets. For better performance, keep SORTED BY and CLUSTERED BY consistent. When the SORTED BY clause is specified, row data is sorted by the specified columns. For more information, refer to Bucketing.
SORTED BY
SORTED BY: Specifies how rows are sorted within a file. In Lakehouse, SORTED BY can be used independently; when used alone, it sorts rows within each file. Specifying SORTED BY can speed up data retrieval, but because sorting happens during writes, it may increase write time.
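A sketch combining bucketing and in-file sorting; the table name, the bucket count of 16, and the INTO ... BUCKETS clause spelling are assumptions for illustration:

```sql
-- Hash-bucketed table, sorted within each file on the same key.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10, 2)
)
CLUSTERED BY (customer_id)   -- hash key; pick a high-cardinality column
SORTED BY (customer_id)      -- keep consistent with CLUSTERED BY
INTO 16 BUCKETS;             -- bucket count sized to the data volume
```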
PROPERTIES
- Supports setting the lifecycle: 'data_lifecycle'='day_num'. Refer to the lifecycle introduction.
- Enabling TABLE STREAM: specifying 'change_tracking' = 'true' when creating a table does not take effect; enable it with ALTER instead: ALTER TABLE test_table SET PROPERTIES ('change_tracking' = 'true');
- Set the table's caching policy: partition.cache.policy.latest.count=num. The computing cluster can configure preload_table to pull the data of the tables specified in preload_table to the computing cluster's local SSD for caching, either periodically or when triggered. You can also set the caching policy on the table itself; for example, partition.cache.policy.latest.count=10 caches the latest 10 partitions, and when new partitions are added, the cache of older partitions is invalidated. This parameter can also be set via ALTER, for example: ALTER TABLE test_table SET PROPERTIES ('partition.cache.policy.latest.count' = '10');
- Set the table's retention period: 'data_retention_days'='num' configures the data retention period for Time Travel, determining how long data remains available in the time travel window.
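A sketch combining the properties above in one table definition (the table name and property values are illustrative):

```sql
CREATE TABLE metrics (
    dt     STRING,
    metric STRING,
    val    DOUBLE
)
PROPERTIES (
    'data_lifecycle' = '30',                       -- keep data for 30 days
    'partition.cache.policy.latest.count' = '10',  -- cache latest 10 partitions
    'data_retention_days' = '3'                    -- 3-day Time Travel window
);

-- change_tracking only takes effect via ALTER, not at creation time:
ALTER TABLE metrics SET PROPERTIES ('change_tracking' = 'true');
```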
Using LIKE Statement to Create a Table
When creating a new table using the LIKE statement, the target table will have the same table structure as the source table, but the data will not be copied.
Creating a Table Using the AS Statement
The CREATE TABLE AS SELECT (CTAS) statement can be used to query the original table synchronously or asynchronously, create a new table based on the query results, and then insert the query results into the new table. It should be noted that tables created in this way do not copy partition information.
Example
- Create a partitioned table
Syntax 1:
Syntax 2:
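A possible sketch of the first partition syntax, where the field is declared with the other columns and only its name appears in PARTITIONED BY (names illustrative):

```sql
CREATE TABLE logs_v1 (
    msg STRING,
    dt  STRING          -- partition field declared with the other columns
)
PARTITIONED BY (dt);    -- only the field name is listed here
```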
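A possible sketch of the second partition syntax, where the field and its type are declared directly in the PARTITIONED BY clause (names illustrative):

```sql
CREATE TABLE logs_v2 (
    msg STRING
)
PARTITIONED BY (dt STRING);  -- partition field and type declared here
```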
- Create a product table with an auto-increment column as a unique identifier
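A sketch of such a product table; the seed value 1 and all names are illustrative:

```sql
CREATE TABLE product (
    product_id BIGINT IDENTITY(1),  -- auto-increment unique identifier
    name       STRING,
    price      DECIMAL(10, 2)
);
-- product_id values are unique but not guaranteed gapless or ordered.
```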
- Create a Timestamp Conversion Table with Generated Columns
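A sketch of a timestamp-conversion table with a generated column; the deterministic conversion function name (to_date) is an assumption:

```sql
CREATE TABLE events_ts (
    event_ts  TIMESTAMP,
    -- derived from event_ts with a deterministic built-in function
    event_day DATE GENERATED ALWAYS AS (to_date(event_ts))
);
```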
- Create a Search Optimized Table with Bloom Filter Index
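A sketch of a table with a bloom filter index on a high-cardinality column; the USING BLOOMFILTER spelling is assumed from the index_type list above, and index PROPERTIES are omitted (see the index documentation):

```sql
CREATE TABLE access_log (
    request_id STRING,
    url        STRING,
    INDEX idx_request_id (request_id) USING BLOOMFILTER COMMENT 'point lookups'
);
```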
- Create a new table similar to the existing table structure
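A sketch using LIKE; the new table copies the structure, but not the data, of the source table (names illustrative):

```sql
CREATE TABLE orders_backup LIKE orders;
```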
- Create a Table Initialized by Query Results
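A sketch using CTAS; the new table is created from the query's schema and filled with its results, and partition information is not copied (names illustrative):

```sql
CREATE TABLE big_orders AS
SELECT order_id, customer_id, amount
FROM orders
WHERE amount > 1000;
```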
- Create an Order Details Table Containing Array Types
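A sketch of an order-details table with an array-typed column; the ARRAY&lt;BIGINT&gt; type spelling is an assumption:

```sql
CREATE TABLE order_detail (
    order_id BIGINT,
    item_ids ARRAY<BIGINT>   -- list of item identifiers in the order
);
```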
User Guide
Partitioning and Bucketing
- Choosing a Partitioning Strategy: Select partition fields based on the query pattern. In general, choose columns that can effectively narrow the scan range as partition keys; for example, a timestamp column is well suited as a partition key for date-range queries.
- Setting the Number of Buckets: Adjust the number of buckets based on the expected data volume and hardware resource configuration. Typically, each bucket should hold between 256MB and 1GB. Too many or too few buckets will affect the overall performance of the system; to achieve the best results, test the performance under different configurations and adjust accordingly.
- Choosing Sorting Fields: When using the SORTED BY clause, select columns that frequently appear in filter conditions as the sort keys. Good sorting can speed up point queries and range queries, but it also increases the write cost, so weigh the pros and cons before deciding.
Indexing
Users can create indexes on multiple columns when creating a table. Indexes can also be added after the table is created. If indexes are added during later use and the table already contains data, all data needs to be rewritten, so the time to create the index depends on the current data volume.
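A sketch of adding an index to an existing table; the ADD INDEX clause spelling is an assumption, and on a populated table this triggers a rewrite of all existing data:

```sql
-- Adding an index after creation; duration depends on current data volume.
ALTER TABLE access_log ADD INDEX idx_url (url) USING INVERTED;
```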
Best Practices for Table Property Settings
- Enable Data Lifecycle Management: By setting the data_lifecycle property, the system can automatically clean up historical data that is no longer needed. This is very useful for saving storage, especially for datasets with a clear retention period, such as logs or transaction records.
- Configure Change Tracking: If your application scenario involves capturing data changes in the table, enable change_tracking. A common scenario is TABLE STREAM.
- Set the Data Retention Period for Deleted Data: This parameter defines how long deleted data is retained, which is very important for scenarios that query historical data; features such as table stream, restore, and dynamic table rely on this retention period. By default, Lakehouse retains data for one day. Depending on your business needs, you can adjust the data_retention_days parameter to extend or shorten the retention period. Note that adjusting the retention period may affect storage costs: extending it increases storage requirements, which may increase related costs.