Lakehouse SQL DML Statement Usage Guide
1. INSERT Statement Specification
1.1 Basic Syntax
1.2 Recommended Data Import Methods
Bulk Data Import - Preferred Approach
- Recommended:
INSERT INTO...SELECTstatement - Recommended: COPY INTO command with Volume storage
- Recommended: Professional data import tools
Small Data Import - VALUES Method
Advantages of Recommended Methods:
- Higher import performance and throughput
- Better resource utilization
- Transactional guarantee support
- Reduced network transmission overhead
1.3 Data Type Literal Syntax
Types Requiring Prefixes
Types with Optional Suffixes
Composite Type Syntax
1.4 INSERT OVERWRITE Behavior
- Partitioned Table: Overwrites matching partition data
- Non-partitioned Table: Overwrites entire table data
- Prerequisite: Target table must exist
1.5 Partition Operation Limits
- Partition Count Limit: Maximum 2048 partitions per task
- Exceeding Limit: Import in batches or optimize partition strategy
- Recommended Check: Count partitions before bulk import
1.6 Column Mapping and Type Matching
- Explicit Specification: Recommended to explicitly specify target column names
- Type Matching: Ensure precise data type correspondence
- NULL Handling: Unspecified columns will be filled with NULL values
2. UPDATE Statement Specification
2.1 Basic Syntax
2.2 WHERE Condition Requirements
- Necessity: Strongly recommended to use WHERE conditions to limit the update scope
- Precision: Use precise conditions to avoid mistaken operations
- Complex Queries: Subqueries and expressions are supported
2.3 Batch Update Optimization
- Batch Processing: Use ORDER BY + LIMIT for batched updates
- Determinism: ORDER BY ensures consistent update ordering
- Performance Control: LIMIT controls the number of rows updated per batch
2.4 Safe Operation Recommendations
- Test First: Validate in a test environment before production
- Data Backup: Create backups before important updates
- Rollback Preparation: Prepare a data recovery plan
3. DELETE Statement Specification
3.1 Basic Syntax
3.2 Safety Requirements
- WHERE Condition: Avoid omitting WHERE to prevent full table deletion
- Condition Validation: Verify condition accuracy before deletion
- Backup Protection: Back up important data before deletion
3.3 Performance Optimization
- Index Utilization: Fully leverage indexes in WHERE conditions
- Partition Filtering: Use partition columns for filtering on partitioned tables
- Batch Deletion: Consider batched execution for large-scale deletions
4. MERGE INTO Statement Specification
4.1 Basic Syntax
4.2 Statement Order Requirements
WHEN MATCHED must precede WHEN NOT MATCHED
4.3 Match Condition Design
- Uniqueness: Ensure the ON condition produces one-to-one matching
- Determinism: Avoid multiple source rows matching the same target row
- Filter Support: AND conditions are supported for additional filtering
4.4 Operation Types
- MATCHED Operations: UPDATE SET or DELETE
- NOT MATCHED Operations: INSERT statement
- Conditional Execution: Multiple WHEN clauses are executed in the specified order
5. TRUNCATE Statement Specification
5.1 Basic Syntax
5.2 Operational Characteristics
- Data Clearing: Deletes all records but retains the table structure
- Performance Advantage: More efficient than DELETE FROM
- Irrecoverable: Data cannot be directly recovered after the operation
5.3 Usage Recommendations
- IF EXISTS: Use the IF EXISTS clause to avoid errors
- Permission Check: Ensure you have the appropriate operation permissions
- Backup Protection: Back up data before operating on important tables
6. Dynamic Table DML Specification
6.1 Parameter Configuration
6.2 Supported Operations
6.3 Operation Limitations
- UPDATE Limitation: UPDATE operations have technical limitations; using DELETE + INSERT as a workaround is recommended
- Refresh Impact: DML operations may cause the next refresh to switch to full mode
- Performance Consideration: Full refresh incurs higher overhead than incremental refresh
7. Performance Optimization Strategies
7.1 Partition Design Principles
- Partition Size: Follow industry standards; avoid overly small partitions that affect query performance
- Partition Count: Control the total number of partitions, balancing storage and query efficiency
- Filter Optimization: Partition columns should be commonly used filter conditions
7.2 Bucketing Configuration Strategy
- Bucket Column Selection: Choose high-cardinality, evenly-distributed columns
- Bucket Count: Determine a reasonable number based on data volume and query patterns
- Sort Optimization: Choose frequently queried columns for SORTED BY
7.3 Index Optimization
- BLOOM FILTER Index: Suitable for equality queries and high-cardinality columns
- INVERTED Index: Suitable for full-text search; requires specifying an analyzer
- VECTOR Index: Suitable for vector similarity search scenarios
7.4 Small File Management
8. Data Type Conversion
8.1 Conversion Methods
8.2 Conversion Rules
- Numeric Widening: Conversions that expand precision are supported
- String Conversion: Conversions that increase length are supported
- Date Conversion: Bidirectional conversion between strings and date types
- Overflow Handling: Be aware of numeric conversion overflow risks
8.3 TIMESTAMP Handling
- Format Support: Standard format, millisecond precision, ISO 8601
- Timezone Handling: Default TIMESTAMP_LTZ type
- Precision Support: Up to microsecond precision supported
9. Transaction and Version Control
9.1 Historical Version Query
View Table History
Returned information includes:
- version: Version number
- time: Operation time
- total_rows: Total row count for that version
- operation: Operation type (CREATE, INSERT_INTO, UPDATE, DELETE, TRUNCATE, etc.)
- user: Executing user
- job_id: Job ID
Time Travel Queries
9.2 Data Recovery Operations
Table Data Recovery
Supported time formats:
- Full timestamp: '2025-06-18 10:30:45.123'
- Second-level precision: '2025-06-18 10:30:45'
- With timezone: '2025-06-18 10:30:45 Asia/Shanghai'
- Relative time: CURRENT_TIMESTAMP() - INTERVAL '1' DAY
Recover Dropped Objects
9.3 Change Tracking Configuration
Enable Change Tracking
Create Table Stream
9.4 Data Retention Policy
Set Data Retention Period
Query Historical Load Records
10. System Parameter Configuration
10.1 DML-Related Parameters
10.2 Workspace-Level Configuration
Auto Index Recommendation
Parameter description:
- day: Recommendation frequency (daily)
- 150: Query count threshold
- 5: Query duration threshold (seconds)
- 100: Index recommendation count limit
11. Error Handling Guide
11.1 Common Error Types
Data Type Conversion Error
Partition Count Exceeded Error
MERGE Statement Order Error
Dynamic Table UPDATE Limitation
11.2 Performance Diagnosis
12. Best Practices
12.1 Data Type Usage Specification
| Data Type | Prefix Required | Syntax Example |
|---|---|---|
| DATE | Required | date'2023-12-25' |
| TIMESTAMP | Required | timestamp'2023-12-25 15:30:45' |
| JSON | Required | json'{"key": "value"}' |
| BINARY | Required | X'48656C6C6F' |
| BIGINT | Optional | 1 or 1l |
| DECIMAL | Optional | 99.99 or 99.99bd |
| FLOAT | Optional | 89.5 or 89.5f |
| DOUBLE | Optional | 3.14 or 3.14d |
12.2 INSERT Statement Template
12.3 MERGE Statement Template
12.4 Dynamic Table DML Template
12.5 Safe Operation Principles
- Test First: Complete test validation before production operations
- Backup Protection: Create backups before critical data operations
- Least Privilege: Use the minimum necessary permissions for operations
- Precise Conditions: Use precise WHERE conditions to limit operation scope
- Monitor and Audit: Record execution logs of important DML operations
12.6 Performance Optimization Principles
- Batch First: Prioritize batch operations for efficiency
- Index Utilization: Fully leverage indexes to accelerate queries and DML
- Partition Filtering: Use partition pruning to reduce data scanning
- Resource Management: Properly configure compute resources and concurrency
- File Management: Regularly perform small file compaction optimization
12.7 Version Control and Data Recovery Principles
- Set Data Retention: Configure the data retention period based on business requirements
- Enable Change Tracking: Enable change tracking for important tables to facilitate data auditing
- Regular History Review: Periodically review table operation history to detect anomalies
- Verify Recovery Operations: Validate recovery results in a test environment before executing
- Time Travel Queries: Use relative time queries to avoid timezone issues
12.8 Table Stream Usage Principles
12.9 Historical Version and Data Recovery Template
Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, always verify the correctness and performance impact of all operations in a test environment.
