Lakehouse SQL DML Statement Usage Guide

1. INSERT Statement Specification

1.1 Basic Syntax


INSERT {INTO | OVERWRITE} [TABLE] table_name
    [PARTITION partition_spec]
    [(column1, column2, ...)]
    {VALUES (value1 [, ...]) [, (value2 [, ...]), ...] | subquery}

1.2 Recommended Data Import Methods

Bulk Data Import - Preferred Approach

Recommended: INSERT INTO...SELECT statement
Recommended: COPY INTO command with Volume storage
Recommended: Professional data import tools


-- ✅ Recommended: Use SELECT method
INSERT INTO target_table 
SELECT col1, col2, col3 FROM source_table WHERE condition;

-- ✅ Recommended: Use COPY INTO command
COPY INTO target_table 
FROM VOLUME my_volume 
USING CSV OPTIONS ('header' = 'true');

Small Data Import - VALUES Method


-- ✅ Suitable for small data volumes (100 rows or fewer recommended)
INSERT INTO table_name VALUES 
(1, 'data1'), (2, 'data2'), (3, 'data3');

Advantages of Recommended Methods:

Higher import performance and throughput
Better resource utilization
Transactional guarantee support
Reduced network transmission overhead

1.3 Data Type Literal Syntax

Types Requiring Prefixes


date'2023-12-25'                       -- DATE type
timestamp'2023-12-25 15:30:45'         -- TIMESTAMP type
timestamp'2023-12-25 15:30:45.123'     -- Millisecond precision supported
json'{"key": "value", "num": 123}'     -- JSON type
X'48656C6C6F'                          -- BINARY type (hexadecimal)

Types with Optional Suffixes


-- Numeric type suffixes are optional; both forms are valid
1       -- or 1l (BIGINT)
100     -- or 100s (SMALLINT)  
200     -- INT type
89.5    -- or 89.5f (FLOAT)
3.14159 -- or 3.14159d (DOUBLE)
99.99   -- or 99.99bd (DECIMAL)

Composite Type Syntax


ARRAY(1,2,3)                          -- ARRAY type
MAP('k1','v1','k2','v2')             -- MAP type
STRUCT(1, 'hello', 3.14)             -- STRUCT type
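
As a rough sketch of how these constructors might appear inside a statement, the following inserts one row containing composite values. The table user_events and its columns are hypothetical, and the example assumes the column types match the constructed values.

```sql
-- Hypothetical table: user_events(id, scores ARRAY, attrs MAP, detail STRUCT)
INSERT INTO user_events (id, scores, attrs, detail) VALUES
(1, ARRAY(1,2,3), MAP('k1','v1','k2','v2'), STRUCT(1, 'hello', 3.14));
```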

1.4 INSERT OVERWRITE Behavior

Partitioned Table: Overwrites matching partition data
Non-partitioned Table: Overwrites entire table data
Prerequisite: Target table must exist
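
The two overwrite behaviors can be sketched as follows. The tables daily_sales, sales_summary, and staging_sales are hypothetical, dt is assumed to be a STRING partition column, and the form of partition_spec shown here, (partition_column = value), is an assumption that should be checked against the product documentation.

```sql
-- Partitioned table: only the dt = '2025-06-18' partition is replaced.
-- Assumption: partition_spec is written as (partition_column = value).
INSERT OVERWRITE TABLE daily_sales PARTITION (dt = '2025-06-18')
SELECT order_id, amount FROM staging_sales WHERE dt = '2025-06-18';

-- Non-partitioned table: all existing rows are replaced.
INSERT OVERWRITE TABLE sales_summary
SELECT region, SUM(amount) AS total_amount
FROM staging_sales
GROUP BY region;
```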

1.5 Partition Operation Limits

Partition Count Limit: Maximum 2048 partitions per task
Exceeding Limit: Import in batches or optimize partition strategy
Recommended Check: Count partitions before bulk import
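
A minimal sketch of the check-then-batch approach, assuming a hypothetical source table staging_events and target table events with the same columns, both using a STRING partition column dt in 'YYYY-MM-DD' format with one partition per day:

```sql
-- Step 1: count the distinct partition values the import would write.
SELECT COUNT(DISTINCT dt) AS partition_count FROM staging_events;

-- Step 2: if the count exceeds 2048, import one date range per statement
-- so that each statement writes fewer partitions (about 1,096 daily
-- partitions per three-year range here).
INSERT INTO events
SELECT * FROM staging_events WHERE dt >= '2020-01-01' AND dt < '2023-01-01';

INSERT INTO events
SELECT * FROM staging_events WHERE dt >= '2023-01-01' AND dt < '2026-01-01';
```

The range boundaries are illustrative; choose them so that each statement stays below the limit for the actual data.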

1.6 Column Mapping and Type Matching

Explicit Specification: Specify target column names explicitly rather than relying on column order
Type Matching: Ensure each value's data type matches its target column
NULL Handling: Unspecified columns will be filled with NULL values
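
To illustrate these three points, here is a short sketch against a hypothetical table customers(id, name, email, signup_date). The table and column names are assumptions made for the example.

```sql
-- Explicit column list: values are mapped by name, not by position.
-- email is not listed, so it is filled with NULL.
INSERT INTO customers (id, name, signup_date)
VALUES (1, 'Alice', date'2023-12-25');

-- With INSERT ... SELECT, cast source columns so the types line up.
-- staging_customers is a hypothetical source whose signup_date is a string.
INSERT INTO customers (id, name, signup_date)
SELECT id, name, CAST(signup_date AS DATE)
FROM staging_customers;
```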

2. UPDATE Statement Specification

2.1 Basic Syntax


UPDATE target_table 
SET column_name1 = new_value1 [, column_name2 = new_value2, ...] 
[ WHERE condition ] 
[ORDER BY ...] 
[LIMIT row_count]

2.2 WHERE Condition Requirements

Necessity: Always use a WHERE clause to limit the update scope; without one, every row in the table is updated
Precision: Use precise conditions to avoid unintended updates
Complex Queries: Subqueries and expressions are supported in the condition

2.3 Batch Update Optimization

Batch Processing: Use ORDER BY + LIMIT for batched updates
Determinism: ORDER BY ensures consistent update ordering
Performance Control: LIMIT controls the number of rows updated per batch
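
The batching pattern above might look like the following sketch, which uses the ORDER BY and LIMIT clauses from the syntax in 2.1. The table orders and its columns are hypothetical, and the batch size of 1000 is an arbitrary choice.

```sql
-- Each run updates at most 1000 rows, lowest order_id first.
-- Updated rows no longer satisfy the WHERE clause, so repeating
-- the statement moves on to the next batch.
UPDATE orders
SET status = 'archived'
WHERE status = 'closed'
ORDER BY order_id
LIMIT 1000;

-- Repeat the UPDATE until this count reaches zero.
SELECT COUNT(*) AS remaining FROM orders WHERE status = 'closed';
```

The pattern only terminates if the SET clause changes a column that the WHERE clause filters on; otherwise the same rows would be selected again on every run.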

2.4 Safe Operation Recommendations

Test First: Validate in a test environment before production
Data Backup: Create backups before important updates
Rollback Preparation: Prepare a data recovery plan

3. DELETE Statement Specification

3.1 Basic Syntax


DELETE FROM table_name WHERE condition;

3.2 Safety Requirements

WHERE Condition: Always include a WHERE clause; omitting it deletes every row in the table
Condition Validation: Verify condition accuracy before deletion
Backup Protection: Back up important data before deletion
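
One way to validate a condition before deleting is to run it as a SELECT first. The sketch below assumes a hypothetical table orders(order_id, status, created_at) where created_at is a TIMESTAMP column.

```sql
-- Step 1: preview how many rows the condition matches.
SELECT COUNT(*) AS rows_to_delete
FROM orders
WHERE status = 'cancelled'
  AND created_at < timestamp'2024-01-01 00:00:00';

-- Step 2: if the count is as expected, delete with the identical condition.
DELETE FROM orders
WHERE status = 'cancelled'
  AND created_at < timestamp'2024-01-01 00:00:00';
```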

3.3 Performance Optimization

Index Utilization: Fully leverage indexes in WHERE conditions
Partition Filtering: Use partition columns for filtering on partitioned tables
Batch Deletion: Consider batched execution for large-scale deletions
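
Partition filtering and batched deletion can be combined, as in this sketch. It assumes a hypothetical table events partitioned by a STRING column dt in 'YYYY-MM-DD' format; the one-month batch size is illustrative.

```sql
-- Each statement filters on the partition column and covers one month,
-- so only the matching partitions need to be processed.
DELETE FROM events WHERE dt >= '2023-01-01' AND dt < '2023-02-01';
DELETE FROM events WHERE dt >= '2023-02-01' AND dt < '2023-03-01';
DELETE FROM events WHERE dt >= '2023-03-01' AND dt < '2023-04-01';
```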

4. MERGE INTO Statement Specification

4.1 Basic Syntax


MERGE INTO target_table USING source_table ON merge_condition 
{ WHEN MATCHED [AND matched_condition] THEN matched_action |
  WHEN NOT MATCHED [AND not_matched_condition] THEN not_matched_action } ...

4.2 Statement Order Requirements

WHEN MATCHED must precede WHEN NOT MATCHED


-- ✅ Correct order
MERGE INTO target USING source ON target.key = source.key 
WHEN MATCHED THEN UPDATE SET target.col1 = source.col1
WHEN NOT MATCHED THEN INSERT (col1, col2) VALUES (source.col1, source.col2);

4.3 Match Condition Design

Uniqueness: Ensure the ON condition produces one-to-one matching
Determinism: Avoid multiple source rows matching the same target row
Filter Support: AND conditions are supported for additional filtering
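
A simple way to check the uniqueness requirement before running a MERGE is to look for duplicate keys in the source. The names source_table and key_column are placeholders, as in the syntax above.

```sql
-- Every row returned is a key that appears more than once in the source
-- and would therefore match the same target row multiple times.
SELECT key_column, COUNT(*) AS occurrences
FROM source_table
GROUP BY key_column
HAVING COUNT(*) > 1;
```

If the query returns rows, deduplicate the source (for example, by selecting one row per key in a subquery) before merging.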

4.4 Operation Types

MATCHED Operations: UPDATE SET or DELETE
NOT MATCHED Operations: INSERT statement
Conditional Execution: Multiple WHEN clauses are evaluated in the order they are written
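
The operation types can be combined in one statement, as in the following sketch. The column is_deleted is a hypothetical flag on the source table, and the example assumes that several WHEN MATCHED clauses may be listed, as the repetition in the basic syntax suggests.

```sql
MERGE INTO target_table AS target
USING source_table AS source
ON target.key_column = source.key_column
-- Checked first: remove target rows that the source marks as deleted.
WHEN MATCHED AND source.is_deleted = true THEN
    DELETE
-- Remaining matched rows are updated.
WHEN MATCHED THEN
    UPDATE SET target.col1 = source.col1
-- Source rows without a match are inserted.
WHEN NOT MATCHED THEN
    INSERT (key_column, col1) VALUES (source.key_column, source.col1);
```

Because clause order matters, the more specific condition is placed before the unconditional WHEN MATCHED clause.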

5. TRUNCATE Statement Specification

5.1 Basic Syntax


TRUNCATE TABLE [IF EXISTS] table_name;

5.2 Operational Characteristics

Data Clearing: Deletes all records but retains the table structure
Performance Advantage: More efficient than DELETE FROM
Irrecoverable: Data cannot be directly recovered after the operation

5.3 Usage Recommendations

IF EXISTS: Use the IF EXISTS clause to avoid an error when the table does not exist
Permission Check: Ensure you have the appropriate operation permissions
Backup Protection: Back up data before operating on important tables
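
A cautious truncate sequence might look like the sketch below. It assumes a hypothetical table staging_orders and an already existing, empty table staging_orders_backup with the same columns.

```sql
-- Step 1: copy the current rows into the backup table.
INSERT INTO staging_orders_backup
SELECT * FROM staging_orders;

-- Step 2: confirm that both tables report the same row count.
SELECT COUNT(*) AS source_rows FROM staging_orders;
SELECT COUNT(*) AS backup_rows FROM staging_orders_backup;

-- Step 3: clear the table; IF EXISTS avoids an error if it is missing.
TRUNCATE TABLE IF EXISTS staging_orders;
```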

6. Dynamic Table DML Specification

6.1 Parameter Configuration


-- Explicitly enable DML operations on Dynamic Tables (recommended)
SET cz.optimizer.incremental.backfill.enabled = true;

6.2 Supported Operations


-- ✅ Fully supported
INSERT INTO dynamic_table VALUES (1, 'data', 100);
INSERT OVERWRITE dynamic_table SELECT * FROM source;
DELETE FROM dynamic_table WHERE condition;
TRUNCATE TABLE dynamic_table;

6.3 Operation Limitations

UPDATE Limitation: UPDATE operations have technical limitations; using DELETE + INSERT as a workaround is recommended
Refresh Impact: DML operations may cause the next refresh to switch to full mode
Performance Consideration: Full refresh incurs higher overhead than incremental refresh

7. Performance Optimization Strategies

7.1 Partition Design Principles

Partition Size: Avoid overly small partitions, which degrade query performance; follow common sizing practice for your data volume
Partition Count: Control the total number of partitions, balancing storage and query efficiency
Filter Optimization: Partition columns should be commonly used filter conditions

7.2 Bucketing Configuration Strategy

Bucket Column Selection: Choose high-cardinality, evenly-distributed columns
Bucket Count: Determine a reasonable number based on data volume and query patterns
Sort Optimization: Choose frequently queried columns for SORTED BY

7.3 Index Optimization

BLOOM FILTER Index: Suitable for equality queries and high-cardinality columns
INVERTED Index: Suitable for full-text search; requires specifying an analyzer
VECTOR Index: Suitable for vector similarity search scenarios

7.4 Small File Management


-- Auto-compaction configuration
SET cz.sql.compaction.after.commit = true;

-- Manual compaction command
OPTIMIZE table_name [WHERE predicate] [OPTIONS ('key' = 'value')];
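
As an illustration of the optional WHERE predicate, the following sketch compacts a single partition rather than the whole table. The table daily_sales and its STRING partition column dt are hypothetical; no OPTIONS keys are shown because the document does not list any.

```sql
-- Compact only the files belonging to one partition.
OPTIMIZE daily_sales WHERE dt = '2025-06-18';
```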

8. Data Type Conversion

8.1 Conversion Methods


-- CAST function
CAST(expression AS type)

-- Conversion operator
expression::type

-- TYPE function (returns NULL on conversion failure)
TYPE(expr)

8.2 Conversion Rules

Numeric Widening: Conversions that expand precision are supported
String Conversion: Conversions that increase length are supported
Date Conversion: Bidirectional conversion between strings and date types
Overflow Handling: Be aware of numeric conversion overflow risks
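
The rules above can be tried out with a few one-line queries. This sketch assumes that SELECT without a FROM clause is accepted and that STRING is the name of the string type.

```sql
SELECT CAST(100 AS BIGINT);               -- numeric widening to BIGINT
SELECT CAST('2023-12-25' AS DATE);        -- string to DATE
SELECT CAST(date'2023-12-25' AS STRING);  -- DATE back to string
SELECT '42'::INT;                         -- conversion operator form
```

For narrowing conversions such as BIGINT to SMALLINT, test with boundary values first, since the overflow behavior is not described here.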

8.3 TIMESTAMP Handling

Format Support: Standard format, millisecond precision, ISO 8601
Timezone Handling: TIMESTAMP defaults to the TIMESTAMP_LTZ (local time zone) type
Precision Support: Up to microsecond precision supported

9. Transaction and Version Control

9.1 Historical Version Query

View Table History


-- View complete operation history of a table
DESCRIBE HISTORY table_name;

Returned information includes:

version: Version number
time: Operation time
total_rows: Total row count for that version
operation: Operation type (CREATE, INSERT_INTO, UPDATE, DELETE, TRUNCATE, etc.)
user: Executing user
job_id: Job ID

Time Travel Queries


-- Query historical data using relative time
SELECT * FROM table_name 
TIMESTAMP AS OF (CURRENT_TIMESTAMP() - INTERVAL '1' HOUR);

-- Query using absolute time (requires a precise timestamp)
SELECT * FROM table_name 
TIMESTAMP AS OF '2025-06-18 10:30:45.123';

-- Use CAST function to specify timezone
SELECT * FROM table_name 
TIMESTAMP AS OF CAST('2025-06-18 10:30:45 Asia/Shanghai' AS TIMESTAMP);
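
One practical use of these queries is to measure how much a table has changed over a period. The sketch below compares row counts for a hypothetical table orders; it reuses the relative-time form shown above.

```sql
-- Row count now
SELECT COUNT(*) AS rows_now FROM orders;

-- Row count one hour ago
SELECT COUNT(*) AS rows_one_hour_ago
FROM orders
TIMESTAMP AS OF (CURRENT_TIMESTAMP() - INTERVAL '1' HOUR);
```

A large unexpected difference between the two counts is a signal to inspect the table history before making further changes.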

9.2 Data Recovery Operations

Table Data Recovery


-- Restore table to a specified point in time
RESTORE TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45';

-- Restore Dynamic Table
RESTORE DYNAMIC TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45';

Supported time formats:

Full timestamp: '2025-06-18 10:30:45.123'
Second-level precision: '2025-06-18 10:30:45'
With timezone: '2025-06-18 10:30:45 Asia/Shanghai'
Relative time: CURRENT_TIMESTAMP() - INTERVAL '1' DAY
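
Putting the pieces together, recovery from an accidental change might proceed as in this sketch. The table orders and the timestamp are placeholders; the timestamp should be taken from the history output, just before the unwanted operation.

```sql
-- Step 1: find the time of the unwanted operation.
DESCRIBE HISTORY orders;

-- Step 2: inspect the data as of a point just before that operation.
SELECT COUNT(*) AS rows_before
FROM orders
TIMESTAMP AS OF '2025-06-18 10:30:45';

-- Step 3: if the historical data looks right, restore the table.
RESTORE TABLE orders TO TIMESTAMP AS OF '2025-06-18 10:30:45';

-- Step 4: confirm that the restore appears in the history.
DESCRIBE HISTORY orders;
```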

Recover Dropped Objects


-- Recover a dropped table
UNDROP TABLE table_name;

-- Recover a dropped Dynamic Table
UNDROP DYNAMIC TABLE table_name;

-- Recover a dropped Materialized View
UNDROP MATERIALIZED VIEW view_name;

9.3 Change Tracking Configuration

Enable Change Tracking


-- Enable change tracking for a table
ALTER TABLE table_name SET PROPERTIES('change_tracking' = 'true');

Create Table Stream


-- Create a Table Stream in standard mode
CREATE TABLE STREAM stream_name 
ON TABLE table_name
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

-- Create a Table Stream in append-only mode
CREATE TABLE STREAM stream_name 
ON TABLE table_name
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'APPEND_ONLY');

9.4 Data Retention Policy

Set Data Retention Period


-- Set the Time Travel data retention period (unit: days)
ALTER TABLE table_name SET PROPERTIES('data_retention_days' = '7');

-- Set the data lifecycle (auto-clean historical data)
ALTER TABLE table_name SET PROPERTIES('data_lifecycle' = '365');

Query Historical Load Records


-- View historical load records for a table
SELECT * FROM load_history('schema.table_name');

10. System Parameter Configuration

10.1 Session-Level Configuration


-- Enable DML for Dynamic Tables
SET cz.optimizer.incremental.backfill.enabled = true;

-- Auto-compaction for small files
SET cz.sql.compaction.after.commit = true;

-- Query tag setting
SET query_tag = 'dml_operation';

-- Session timezone configuration
SET timezone = 'Asia/Shanghai';

10.2 Workspace-Level Configuration

Auto Index Recommendation


-- Enable workspace-level auto index recommendation
ALTER WORKSPACE workspace_name SET properties (auto_index='day[,150,5,100]');

Parameter description:

day: Recommendation frequency (daily)
150: Query count threshold
5: Query duration threshold (seconds)
100: Index recommendation count limit

11. Error Handling Guide

11.1 Common Error Types

Data Type Conversion Error


Error message: implicit cast not allowed for 'colX': string not null to date/timestamp/json/binary
Solution: Use the correct type prefix syntax
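
For example, with a hypothetical table bookings(id, booked_on) where booked_on is a DATE column, the failing and corrected statements would look roughly like this:

```sql
-- Fails: a plain string literal is not implicitly cast to DATE.
-- INSERT INTO bookings (id, booked_on) VALUES (1, '2023-12-25');

-- Works: the date prefix makes the literal a DATE value.
INSERT INTO bookings (id, booked_on) VALUES (1, date'2023-12-25');
```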

Partition Count Exceeded Error


Error message: The count of dynamic partitions exceeds the maximum number 2048
Solution: Import in batches or optimize the partition strategy

MERGE Statement Order Error


Error message: Syntax error at or near 'WHEN'
Solution: Adjust the order of WHEN clauses

Dynamic Table UPDATE Limitation


Error message: Not support hidden column :MV__KEY
Solution: Use DELETE + INSERT instead of UPDATE

11.2 Performance Diagnosis


-- Query execution plan
EXPLAIN SELECT * FROM table_name WHERE condition;

-- Check partition information
SHOW PARTITIONS EXTENDED table_name;

12. Best Practices

12.1 Data Type Usage Specification

Data Type	Prefix Required	Syntax Example
DATE	Required	`date'2023-12-25'`
TIMESTAMP	Required	`timestamp'2023-12-25 15:30:45'`
JSON	Required	`json'{"key": "value"}'`
BINARY	Required	`X'48656C6C6F'`
BIGINT	Optional	`1` or `1l`
DECIMAL	Optional	`99.99` or `99.99bd`
FLOAT	Optional	`89.5` or `89.5f`
DOUBLE	Optional	`3.14` or `3.14d`

12.2 INSERT Statement Template


-- Recommended bulk data import
INSERT INTO target_table 
SELECT col1, col2, col3 FROM source_table WHERE condition;

-- Type-safe VALUES insertion
INSERT INTO table_name (
    bigint_col, decimal_col, date_col, 
    timestamp_col, json_col, binary_col
) VALUES (
    1, 99.99, date'2023-12-25',
    timestamp'2023-12-25 15:30:45',
    json'{"key": "value"}', X'48656C6C6F'
);

12.3 MERGE Statement Template


MERGE INTO target_table AS target 
USING source_table AS source 
ON target.key_column = source.key_column 
WHEN MATCHED THEN 
    UPDATE SET target.col1 = source.col1, target.col2 = source.col2
WHEN NOT MATCHED THEN 
    INSERT (key_column, col1, col2) 
    VALUES (source.key_column, source.col1, source.col2);

12.4 Dynamic Table DML Template


-- Session configuration
SET cz.optimizer.incremental.backfill.enabled = true;

-- Supported operations
INSERT INTO dynamic_table VALUES (1, 'data', 100);
DELETE FROM dynamic_table WHERE condition;

-- Workaround for UPDATE
DELETE FROM dynamic_table WHERE key_column = target_value;
INSERT INTO dynamic_table VALUES (new_key, new_col1, new_col2);

12.5 Safe Operation Principles

Test First: Complete test validation before production operations
Backup Protection: Create backups before critical data operations
Least Privilege: Use the minimum necessary permissions for operations
Precise Conditions: Use precise WHERE conditions to limit operation scope
Monitor and Audit: Record execution logs of important DML operations

12.6 Performance Optimization Principles

Batch First: Prioritize batch operations for efficiency
Index Utilization: Fully leverage indexes to accelerate queries and DML
Partition Filtering: Use partition pruning to reduce data scanning
Resource Management: Properly configure compute resources and concurrency
File Management: Regularly perform small file compaction optimization

12.7 Version Control and Data Recovery Principles

Set Data Retention: Configure the data retention period based on business requirements
Enable Change Tracking: Enable change tracking for important tables to facilitate data auditing
Regular History Review: Periodically review table operation history to detect anomalies
Verify Recovery Operations: Validate recovery results in a test environment before executing
Time Travel Queries: Use relative time queries to avoid timezone issues

12.8 Table Stream Usage Principles


-- Recommended Stream creation and usage pattern
-- 1. Enable change tracking
ALTER TABLE source_table SET PROPERTIES('change_tracking' = 'true');

-- 2. Create Stream
CREATE TABLE STREAM change_stream 
ON TABLE source_table
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

-- 3. Query change data
SELECT * FROM change_stream WHERE cz_stream_action IN ('INSERT', 'UPDATE', 'DELETE');

12.9 Historical Version and Data Recovery Template


-- View table history
DESCRIBE HISTORY table_name;

-- Time Travel query
SELECT * FROM table_name 
TIMESTAMP AS OF (CURRENT_TIMESTAMP() - INTERVAL '1' HOUR);

-- Data recovery
RESTORE TABLE table_name TO TIMESTAMP AS OF '2025-06-18 10:30:45';

-- Recover dropped table
UNDROP TABLE table_name;

-- Enable change tracking
ALTER TABLE table_name SET PROPERTIES('change_tracking' = 'true');

-- Create Table Stream
CREATE TABLE STREAM stream_name 
ON TABLE table_name
WITH PROPERTIES ('TABLE_STREAM_MODE' = 'STANDARD');

-- Set data retention period
ALTER TABLE table_name SET PROPERTIES('data_retention_days' = '7');

Note: This document is based on the Lakehouse product documentation as of June 2025. Check the official documentation regularly for updates, and verify the correctness and performance impact of every operation in a test environment before using it in production.