Lakehouse SQL DML Statement Usage Guide

1. INSERT Statement Specification

1.1 Basic Syntax

INSERT INTO|OVERWRITE [TABLE] table_name [ PARTITION partition_spec] [ (column1, column2, ...)] {VALUES(value1 [,...],(value2 [,...]),...) | subquery}

Bulk Data Import - Preferred Approach

  • Recommended: INSERT INTO...SELECT statement
  • Recommended: COPY INTO command with Volume storage
  • Recommended: Professional data import tools

-- ✅ Recommended: Use SELECT method INSERT INTO target_table SELECT col1, col2, col3 FROM source_table WHERE condition; -- ✅ Recommended: Use COPY INTO command COPY INTO target_table FROM VOLUME my_volume USING CSV OPTIONS ('header' = 'true');

Small Data Import - VALUES Method

-- ✅ Suitable for: Small data volume (recommended within 100 rows) INSERT INTO table_name VALUES (1, 'data1'), (2, 'data2'), (3, 'data3');

Advantages of Recommended Methods:

  • Higher import performance and throughput
  • Better resource utilization
  • Transactional guarantee support
  • Reduced network transmission overhead

1.3 Data Type Literal Syntax

Types Requiring Prefixes

date'2023-12-25' -- DATE type timestamp'2023-12-25 15:30:45' -- TIMESTAMP type timestamp'2023-12-25 15:30:45.123' -- Millisecond precision supported json'{"key": "value", "num": 123}' -- JSON type X'48656C6C6F' -- BINARY type (hexadecimal)

Types with Optional Suffixes

-- Numeric type suffixes are optional; both forms are valid 1 -- or 1l (BIGINT) 100 -- or 100s (SMALLINT) 200 -- INT type 89.5 -- or 89.5f (FLOAT) 3.14159 -- or 3.14159d (DOUBLE) 99.99 -- or 99.99bd (DECIMAL)

Composite Type Syntax

ARRAY(1,2,3) -- ARRAY type MAP('k1','v1','k2','v2') -- MAP type STRUCT(1, 'hello', 3.14) -- STRUCT type

1.4 INSERT OVERWRITE Behavior

  • Partitioned Table: Overwrites matching partition data
  • Non-partitioned Table: Overwrites entire table data
  • Prerequisite: Target table must exist

1.5 Partition Operation Limits

  • Partition Count Limit: Maximum 2048 partitions per task
  • Exceeding Limit: Import in batches or optimize partition strategy
  • Recommended Check: Count partitions before bulk import

1.6 Column Mapping and Type Matching

  • Explicit Specification: Recommended to explicitly specify target column names
  • Type Matching: Ensure precise data type correspondence
  • NULL Handling: Unspecified columns will be filled with NULL values

2. UPDATE Statement Specification

2.1 Basic Syntax

UPDATE target_table SET column_name1 = new_value1 [, column_name2 = new_value2, ...] [ WHERE condition ] [ORDER BY ...] [LIMIT row_count]

2.2 WHERE Condition Requirements

  • Necessity: Strongly recommended to use WHERE conditions to limit the update scope
  • Precision: Use precise conditions to avoid mistaken operations
  • Complex Queries: Subqueries and expressions are supported

2.3 Batch Update Optimization

  • Batch Processing: Use ORDER BY + LIMIT for batched updates
  • Determinism: ORDER BY ensures consistent update ordering
  • Performance Control: LIMIT controls the number of rows updated per batch

2.4 Safe Operation Recommendations

  • Test First: Validate in a test environment before production
  • Data Backup: Create backups before important updates
  • Rollback Preparation: Prepare a data recovery plan

3. DELETE Statement Specification

3.1 Basic Syntax

DELETE FROM table_name WHERE condition;

3.2 Safety Requirements

  • WHERE Condition: Avoid omitting WHERE to prevent full table deletion
  • Condition Validation: Verify condition accuracy before deletion
  • Backup Protection: Back up important data before deletion

3.3 Performance Optimization

  • Index Utilization: Fully leverage indexes in WHERE conditions
  • Partition Filtering: Use partition columns for filtering on partitioned tables
  • Batch Deletion: Consider batched execution for large-scale deletions

4. MERGE INTO Statement Specification

4.1 Basic Syntax

MERGE INTO target_table USING source_table ON merge_condition { WHEN MATCHED [AND matched_condition] THEN matched_action | WHEN NOT MATCHED [AND not_matched_condition] THEN not_matched_action } ...

4.2 Statement Order Requirements

WHEN MATCHED must precede WHEN NOT MATCHED

-- ✅ Correct order MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN UPDATE SET target.col1 = source.col1 WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (source.col1, source.col2);

4.3 Match Condition Design

  • Uniqueness: Ensure the ON condition produces one-to-one matching
  • Determinism: Avoid multiple source rows matching the same target row
  • Filter Support: AND conditions are supported for additional filtering

4.4 Operation Types

  • MATCHED Operations: UPDATE SET or DELETE
  • NOT MATCHED Operations: INSERT statement
  • Conditional Execution: Multiple WHEN clauses are executed in the specified order

5. TRUNCATE Statement Specification

5.1 Basic Syntax

TRUNCATE TABLE [IF EXISTS] table_name;

5.2 Operational Characteristics

  • Data Clearing: Deletes all records but retains the table structure
  • Performance Advantage: More efficient than DELETE FROM
  • Irrecoverable: Data cannot be directly recovered after the operation

5.3 Usage Recommendations

  • IF EXISTS: Use the IF EXISTS clause to avoid errors
  • Permission Check: Ensure you have the appropriate operation permissions
  • Backup Protection: Back up data before operating on important tables

6. Dynamic Table DML Specification

6.1 Parameter Configuration

-- Recommended to explicitly enable DML operations set cz.optimizer.incremental.backfill.enabled=true;

6.2 Supported Operations

-- ✅ Fully supported INSERT INTO dynamic_table VALUES (1, 'data', 100); INSERT OVERWRITE dynamic_table SELECT * FROM source; DELETE FROM dynamic_table WHERE condition; TRUNCATE TABLE dynamic_table;

6.3 Operation Limitations

  • UPDATE Limitation: UPDATE operations have technical limitations; using DELETE + INSERT as a workaround is recommended
  • Refresh Impact: DML operations may cause the next refresh to switch to full mode
  • Performance Consideration: Full refresh incurs higher overhead than incremental refresh

7. Performance Optimization Strategies

7.1 Partition Design Principles

  • Partition Size: Follow industry standards; avoid overly small partitions that affect query performance
  • Partition Count: Control the total number of partitions, balancing storage and query efficiency
  • Filter Optimization: Partition columns should be commonly used filter conditions

7.2 Bucketing Configuration Strategy

  • Bucket Column Selection: Choose high-cardinality, evenly-distributed columns
  • Bucket Count: Determine a reasonable number based on data volume and query patterns
  • Sort Optimization: Choose frequently queried columns for SORTED BY

7.3 Index Optimization

  • BLOOM FILTER Index: Suitable for equality queries and high-cardinality columns
  • INVERTED Index: Suitable for full-text search; requires specifying an analyzer
  • VECTOR Index: Suitable for vector similarity search scenarios

7.4 Small File Management

-- Auto-compaction configuration SET cz.sql.compaction.after.commit = true; -- Manual compaction command OPTIMIZE table_name [WHERE predicate] [OPTIONS ('key' = 'value')];

8. Data Type Conversion

8.1 Conversion Methods

-- CAST function CAST(expression AS type) -- Conversion operator expression::type -- TYPE function (returns NULL on conversion failure) TYPE(expr)

8.2 Conversion Rules

  • Numeric Widening: Conversions that expand precision are supported
  • String Conversion: Conversions that increase length are supported
  • Date Conversion: Bidirectional conversion between strings and date types
  • Overflow Handling: Be aware of numeric conversion overflow risks

8.3 TIMESTAMP Handling

  • Format Support: Standard format, millisecond precision, ISO 8601
  • Timezone Handling: Default TIMESTAMP_LTZ type
  • Precision Support: Up to microsecond precision supported

9. Transaction and Version Control

9.1 Historical Version Query

View Table History

-- View complete operation history of a table DESCRIBE HISTORY table_name;

Returned information includes:

  • version: Version number
  • time: Operation time
  • total_rows: Total row count for that version
  • operation: Operation type (CREATE, INSERT_INTO, UPDATE, DELETE, TRUNCATE, etc.)
  • user: Executing user
  • job_id: Job ID

Time Travel Queries

-- Query historical data using relative time SELECT * FROM table_name TIMESTAMP AS OF (CURRENT_TIMESTAMP() - INTERVAL '1' HOUR); -- Query using absolute time (requires a precise timestamp) SELECT * FROM table_name TIMESTAMP AS OF '2025-06-18 10:30:45.123'; -- Use CAST function to specify timezone SELECT * FROM table_name TIMESTAMP AS OF CAST('2025-06-18 10:30:45 Asia/Shanghai' AS TIMESTAMP);

9.2 Data Recovery Operations

Table Data Recovery

-- Restore table to a specified point in time RESTORE TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45'; -- Restore Dynamic Table RESTORE DYNAMIC TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45';

Supported time formats:

  • Full timestamp: '2025-06-18 10:30:45.123'
  • Second-level precision: '2025-06-18 10:30:45'
  • With timezone: '2025-06-18 10:30:45 Asia/Shanghai'
  • Relative time: CURRENT_TIMESTAMP() - INTERVAL '1' DAY

Recover Dropped Objects

-- Recover a dropped table UNDROP TABLE table_name; -- Recover a dropped Dynamic Table UNDROP DYNAMIC TABLE table_name; -- Recover a dropped Materialized View UNDROP MATERIALIZED VIEW view_name;

9.3 Change Tracking Configuration

Enable Change Tracking

-- Enable change tracking for a table ALTER TABLE table_name SET PROPERTIES('change_tracking' = 'true');

Create Table Stream

-- Create a Table Stream in standard mode CREATE TABLE STREAM stream_name ON TABLE table_name WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD'); -- Create a Table Stream in append-only mode CREATE TABLE STREAM stream_name ON TABLE table_name WITH PROPERTIES ('TABLE_STREAM_MODE' = 'APPEND_ONLY');

9.4 Data Retention Policy

Set Data Retention Period

-- Set the Time Travel data retention period (unit: days) ALTER TABLE table_name SET PROPERTIES('data_retention_days' = '7'); -- Set the data lifecycle (auto-clean historical data) ALTER TABLE table_name SET PROPERTIES('data_lifecycle' = '365');

Query Historical Load Records

-- View historical load records for a table SELECT * FROM load_history('schema.table_name');

10. System Parameter Configuration

-- Enable DML for Dynamic Tables set cz.optimizer.incremental.backfill.enabled=true; -- Auto-compaction for small files SET cz.sql.compaction.after.commit = true; -- Query tag setting SET query_tag = 'dml_operation'; -- Session timezone configuration SET timezone = 'Asia/Shanghai';

10.2 Workspace-Level Configuration

Auto Index Recommendation

-- Enable workspace-level auto index recommendation ALTER WORKSPACE workspace_name SET properties (auto_index='day[,150,5,100]');

Parameter description:

  • day: Recommendation frequency (daily)
  • 150: Query count threshold
  • 5: Query duration threshold (seconds)
  • 100: Index recommendation count limit

11. Error Handling Guide

11.1 Common Error Types

Data Type Conversion Error

Error message: implicit cast not allowed for 'colX': string not null to date/timestamp/json/binary Solution: Use the correct type prefix syntax

Partition Count Exceeded Error

Error message: The count of dynamic partitions exceeds the maximum number 2048 Solution: Import in batches or optimize the partition strategy

MERGE Statement Order Error

Error message: Syntax error at or near 'WHEN' Solution: Adjust the order of WHEN clauses

Dynamic Table UPDATE Limitation

Error message: Not support hidden column :MV__KEY Solution: Use DELETE + INSERT instead of UPDATE

11.2 Performance Diagnosis

-- Query execution plan EXPLAIN SELECT * FROM table_name WHERE condition; -- Check partition information SHOW PARTITIONS EXTENDED table_name;

12. Best Practices

12.1 Data Type Usage Specification

Data TypePrefix RequiredSyntax Example
DATERequireddate'2023-12-25'
TIMESTAMPRequiredtimestamp'2023-12-25 15:30:45'
JSONRequiredjson'{"key": "value"}'
BINARYRequiredX'48656C6C6F'
BIGINTOptional1 or 1l
DECIMALOptional99.99 or 99.99bd
FLOATOptional89.5 or 89.5f
DOUBLEOptional3.14 or 3.14d

12.2 INSERT Statement Template

-- Recommended bulk data import INSERT INTO target_table SELECT col1, col2, col3 FROM source_table WHERE condition; -- Type-safe VALUES insertion INSERT INTO table_name ( bigint_col, decimal_col, date_col, timestamp_col, json_col, binary_col ) VALUES ( 1, 99.99, date'2023-12-25', timestamp'2023-12-25 15:30:45', json'{"key": "value"}', X'48656C6C6F' );

12.3 MERGE Statement Template

MERGE INTO target_table AS target USING source_table AS source ON target.key_column = source.key_column WHEN MATCHED THEN UPDATE SET target.col1 = source.col1, target.col2 = source.col2 WHEN NOT MATCHED THEN INSERT (key_column, col1, col2) VALUES (source.key_column, source.col1, source.col2);

12.4 Dynamic Table DML Template

-- Session configuration set cz.optimizer.incremental.backfill.enabled=true; -- Supported operations INSERT INTO dynamic_table VALUES (1, 'data', 100); DELETE FROM dynamic_table WHERE condition; -- Workaround for UPDATE DELETE FROM dynamic_table WHERE key_column = target_value; INSERT INTO dynamic_table VALUES (new_key, new_col1, new_col2);

12.5 Safe Operation Principles

  1. Test First: Complete test validation before production operations
  2. Backup Protection: Create backups before critical data operations
  3. Least Privilege: Use the minimum necessary permissions for operations
  4. Precise Conditions: Use precise WHERE conditions to limit operation scope
  5. Monitor and Audit: Record execution logs of important DML operations

12.6 Performance Optimization Principles

  1. Batch First: Prioritize batch operations for efficiency
  2. Index Utilization: Fully leverage indexes to accelerate queries and DML
  3. Partition Filtering: Use partition pruning to reduce data scanning
  4. Resource Management: Properly configure compute resources and concurrency
  5. File Management: Regularly perform small file compaction optimization

12.7 Version Control and Data Recovery Principles

  1. Set Data Retention: Configure the data retention period based on business requirements
  2. Enable Change Tracking: Enable change tracking for important tables to facilitate data auditing
  3. Regular History Review: Periodically review table operation history to detect anomalies
  4. Verify Recovery Operations: Validate recovery results in a test environment before executing
  5. Time Travel Queries: Use relative time queries to avoid timezone issues

12.8 Table Stream Usage Principles

-- Recommended Stream creation and usage pattern -- 1. Enable change tracking ALTER TABLE source_table SET PROPERTIES('change_tracking' = 'true'); -- 2. Create Stream CREATE TABLE STREAM change_stream ON TABLE source_table WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD'); -- 3. Query change data SELECT * FROM change_stream WHERE cz_stream_action IN ('INSERT', 'UPDATE', 'DELETE');

12.9 Historical Version and Data Recovery Template

-- View table history DESCRIBE HISTORY table_name; -- Time Travel query SELECT * FROM table_name TIMESTAMP AS OF (CURRENT_TIMESTAMP() - INTERVAL '1' HOUR); -- Data recovery RESTORE TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45'; -- Recover dropped table UNDROP TABLE table_name; -- Enable change tracking ALTER TABLE table_name SET PROPERTIES('change_tracking' = 'true'); -- Create Table Stream CREATE TABLE STREAM stream_name ON TABLE table_name WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD'); -- Set data retention period ALTER TABLE table_name SET PROPERTIES('data_retention_days' = '7');


Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, always verify the correctness and performance impact of all operations in a test environment.