Lakehouse Partition Usage Guide
Document Goal
If you are migrating from another data platform to the Singdata Lakehouse, this document will help you fully leverage the advanced advantages of Lakehouse partitioning.
The Singdata Lakehouse partitioning is based on Apache Iceberg's hidden partitioning concept, offering significant advantages over traditional partitioning: no need to manually specify partition conditions, automatic partition pruning, and more flexible partition evolution. This document is aimed at experienced data engineers from major data platforms, covering migration scenarios from mainstream platforms such as Hive, Spark, MaxCompute, Snowflake, and Databricks, focusing on how to migrate successfully and maximize value.
What You Will Gain
- Advanced Concept Understanding: Master the innovative value and usage of hidden partitioning
- Migration Best Practices: Proven successful migration strategies and implementation steps
- Performance Optimization Guidance: Practical tips to fully leverage Lakehouse partition performance advantages
- Architecture Upgrade Solutions: Optimize complex partition architectures into more efficient Lakehouse solutions
- Practical Verification Methods: Verification steps to ensure partitioned tables are correctly created and perform well
Core Advantages of Lakehouse Partitioning
- Intelligent Partition Pruning: No need to manually specify partition conditions in queries; the system automatically optimizes
- Simplified Partition Management: Say goodbye to complex partition maintenance and focus on business logic
- Better Performance Predictability: Avoid over-partitioning, ensuring stable query performance
- Modern Architecture Design: Advanced partition concepts based on Apache Iceberg
Understanding Core Differences and Advantages
5 Innovative Features of Lakehouse Partitioning
Innovation Value: Avoid conflicts with standard SQL functions, providing more precise time partition control
-- Elegant Lakehouse design
CREATE TABLE events PARTITIONED BY (days(event_date)); -- Precise day-level partitioning
-- Why use plural forms?
-- Avoids conflicts with SQL standard functions like year(), month()
-- Provides clearer semantics: years = number of years, not extracting the year
-- The return value is a computed number: days('2024-06-01') = 19875
-- years('2024-06-01') = 54 (year offset calculated from 1970)
Key Understanding Points for Migration:
years/months/days/hours are plural forms, with more accurate semantics
- Transform partitions avoid logic conflicts, ensuring partition strategy consistency
- Return values are computed numbers; the system automatically handles transformation logic
- Time function calculation baseline: years from 1970, days from 1970-01-01
2. Automatic Optimization of Hidden Partitioning
Innovation Value: Users don't need to worry about partition implementation details, focusing on business logic
-- Clean query in Lakehouse
SELECT * FROM sales WHERE order_date = '2024-06-01';
-- System automatically converts to partition condition, no manual partition specification needed
-- Advantages compared to traditional approach
-- 1. Cleaner queries, no complex partition conditions
-- 2. Partition strategy changes don't affect query logic
-- 3. System automatically selects the optimal partition scan strategy
3. Intelligent Partition Count Control
Design Philosophy: Partition count limits are a performance protection mechanism, encouraging better partition design
-- Lakehouse partition philosophy: Quality over quantity
-- Traditional thinking: Partition as granularly as possible
-- Lakehouse philosophy: Reasonable partition granularity + index optimization
-- Recommended efficient design
CREATE TABLE user_events (
event_id INT,
user_id INT,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time)); -- Time partition (primary)
CREATE BLOOMFILTER INDEX idx_user ON TABLE user_events(user_id); -- Index optimization (auxiliary)
-- Advantages: Avoids small file issues, ensures each partition has sufficient data volume
-- Limitation note: It is recommended to control single-operation partition count within a reasonable range
Design Principle: Avoid conflicting partition dimensions, ensuring clear partition logic
-- When using transform partitions, avoid combining time granularities that conflict logically
-- For example, using both years() and months() causes conflicts
-- Choose a single time granularity: days() is the most commonly recommended option
5. Type-Safe Partition Design
Security Guarantee: Prevent data write errors through type checking
-- Lakehouse type safety mechanism
CREATE TABLE orders (
id INT,
amount DOUBLE
) PARTITIONED BY (order_date STRING); -- Explicit STRING type
-- Type mismatch will cause errors
INSERT INTO orders VALUES (1, 100.0, '2024-06-01'); -- String written to STRING partition: correct
INSERT INTO orders VALUES (1, 100.0, DATE('2024-06-01')); -- DATE written to STRING partition: error
-- Recommended approach: Use transform partitions to avoid type issues
CREATE TABLE orders_safe (
id INT,
amount DOUBLE,
order_timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(order_timestamp)); -- Let the system handle type conversion
Cognitive Upgrade: From Complex to Simple
Traditional Partitioning vs Lakehouse Partitioning
| Dimension | Traditional Partition Mindset | Lakehouse Partition Philosophy | Advantage |
|---|
| Partition Strategy | The more granular the better, multi-dimensional partitioning | Reasonable granularity, key dimensions | Avoids small files, stable performance |
| Query Method | Must specify partition conditions | Automatic partition pruning | Cleaner queries, easier maintenance |
| Type Handling | Manual type conversion | System-level type safety | Fewer errors, higher reliability |
| Maintenance Cost | Complex partition management | Simplified partition maintenance | Lower operational costs |
| Performance Predictability | Depends on partition design experience | System-guaranteed performance | More stable query performance |
Partition Table Creation Verification (Best Practice)
Post-Creation Verification: Ensuring Partition Tables Work Correctly
Regardless of how you create the partition table (SQL, tools, scripts), it is recommended to verify immediately after creation. This is the best practice to ensure partition functionality works properly.
Recommended Creation and Verification Flow
-- 1. Create partition table (native SQL recommended for syntax accuracy)
CREATE TABLE orders_partitioned (
id INT,
amount DOUBLE,
order_date DATE
) PARTITIONED BY (days(order_date));
-- 2. Immediately verify the partition table was created correctly (mandatory step)
SHOW PARTITIONS orders_partitioned;
-- Correct: Shows partition list or empty list
-- Abnormal: Error "not a partitioned table"
Complete Verification Checklist (Execute After Every Creation)
-- Verification Step 1: Confirm it is a partitioned table
SHOW PARTITIONS orders_partitioned;
-- Verification Step 2: Test data insertion and partition creation
INSERT INTO orders_partitioned VALUES (1, 100.50, DATE('2024-06-01'));
SHOW PARTITIONS orders_partitioned;
-- Correct: Shows a partition like "days(order_date)=19875"
-- Verification Step 3: Verify partition pruning is effective
SELECT * FROM orders_partitioned WHERE order_date = '2024-06-01';
-- Should return data normally with good performance
-- Verification Step 4: Check table structure
DESCRIBE TABLE orders_partitioned;
-- Confirm column structure is correct and partition field types match
-- Verification Step 5: Test max partition retrieval (compatibility verification)
SELECT max_pt('orders_partitioned');
-- Should return the maximum partition value, used for compatibility with similar features on the original platform
Solutions When Verification Fails
| Verification Failure Symptom | Possible Cause | Solution |
|---|
not a partitioned table | Table creation syntax error or tool creation anomaly | Re-create with native SQL |
implicit cast not allowed | Partition field type mismatch | Check inserted data types |
| No partitions displayed | Partition function syntax error | Check if plural forms are used |
| No performance improvement | Query not utilizing partition fields | Optimize WHERE conditions |
Why Verification is Needed?
- Ensure partition definition syntax is correct, avoiding performance issues
- Detect configuration errors early, reducing subsequent troubleshooting costs
- Verify partition strategy matches query patterns
SHOW PARTITIONS Complete Feature Guide
Basic Syntax and Advanced Usage
-- Complete syntax
SHOW PARTITIONS [EXTENDED] table_name
[ PARTITION ( partition_col_name = partition_col_val [, ...] ) ]
[WHERE <expr>]
Basic Usage
-- View all partitions
SHOW PARTITIONS sales_table;
-- View partition details
SHOW PARTITIONS EXTENDED sales_table;
-- View specific partition
SHOW PARTITIONS sales_table PARTITION (pt1 = '2023');
-- Multi-level partition filter
SHOW PARTITIONS sales_table PARTITION (pt1 = '2023', pt2 = '01');
-- Limit results
SHOW PARTITIONS EXTENDED sales_table LIMIT 10;
Important Limitation: SHOW PARTITIONS does not support ORDER BY clause. Use WITH subquery if sorting is needed.
-- Partition health check
SHOW PARTITIONS EXTENDED table_name WHERE bytes > 100*1024*1024;
-- Partition info view (supports LIMIT)
SHOW PARTITIONS EXTENDED table_name LIMIT 10;
-- Partition sorting (via WITH subquery)
WITH partition_info AS (
SELECT partitions, bytes, total_rows, total_files, created_time
FROM (SHOW PARTITIONS EXTENDED table_name)
)
SELECT * FROM partition_info ORDER BY CAST(bytes AS BIGINT) DESC LIMIT 10;
MAX_PT Function - Get Latest Partition
-- MAX_PT function: Gets the maximum partition value from a partitioned table
-- Syntax: max_pt('schema_name.table_name' | 'table_name')
-- Basic usage: Query data from the latest partition
SELECT * FROM sales_table WHERE pt = max_pt('sales_table');
-- Cross-schema usage
SELECT max_pt('prod_schema.sales_table');
-- Practical application scenarios:
-- 1. Incremental data processing: Always process the latest partition
INSERT INTO target_table
SELECT * FROM source_table
WHERE pt = max_pt('source_table');
-- 2. Data quality check: Check data quality of the latest partition
SELECT COUNT(*), AVG(amount), MAX(created_time)
FROM orders
WHERE pt = max_pt('orders');
Migration Value of MAX_PT Function:
- MaxCompute Users: Direct replacement for the original max_pt function, no query logic changes needed
- Hive Users: Simplify complex max partition queries, improving development efficiency
- Other Platform Users: Provides a convenient way to query the latest data
Advanced Partition Table Creation Guide
Partition + Bucket + Sort Combinations
-- Option 1: Partition + Bucket (recommended for hash distribution)
CREATE TABLE events_clustered (
user_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
CLUSTERED BY (user_id) INTO 32 BUCKETS;
-- Option 2: Partition + Sort (recommended for range queries)
CREATE TABLE events_sorted (
user_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
SORTED BY (event_type);
CREATE TABLE events_both
(
user_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
)
PARTITIONED BY (days(event_time))
CLUSTERED BY (user_id)
SORTED BY (event_type)
INTO 32 BUCKETS;
bucket Function Usage Guide
Parameter Range and Recommendations:
-- Recommended bucket count range
CREATE TABLE sales PARTITIONED BY (
days(sale_date),
bucket(10, user_id) -- Recommended: reasonable range of 1-1000
);
-- bucket count selection guide:
-- 1-10 buckets: Suitable for small data volume tables (< 1 million rows)
-- 10-100 buckets: Suitable for medium data volume tables (1-10 million rows)
-- 100-1000 buckets: Suitable for large data volume tables
Complex Data Type Support
-- Lakehouse supports partitioning for all modern data types
CREATE TABLE modern_table (
user_id INT,
user_profile STRUCT<name: STRING, age: INT, location: STRING>,
tags ARRAY<STRING>,
metadata MAP<STRING, STRING>,
config JSON,
created_date DATE,
created_timestamp TIMESTAMP_LTZ,
created_timestamp_ntz TIMESTAMP_NTZ
) PARTITIONED BY (days(created_date));
-- Supported partition field types:
-- Basic types: INT, BIGINT, STRING, DATE, TIMESTAMP_LTZ, TIMESTAMP_NTZ
-- Transform partitions: years(), months(), days(), hours(), bucket()
-- Not supported: STRUCT, ARRAY, MAP, JSON as direct partition fields
Special Cases for Partition Values
-- NULL value partition handling
CREATE TABLE user_regions (
user_id INT,
region STRING -- Allows NULL values
) PARTITIONED BY (region);
INSERT INTO user_regions VALUES (1, NULL);
-- Result: Creates region=NULL partition
-- Special character support
INSERT INTO user_regions VALUES
(2, 'beijing'), -- Normal characters
(3, 'shang-hai'), -- Supports hyphens
(4, 'guang_zhou'), -- Supports underscores
(5, 'xi an'), -- Supports spaces
(6, 'very_long_city_name_with_many_characters'); -- Supports long strings
-- View partition results
SHOW PARTITIONS user_regions;
-- Result display: region=NULL, region=beijing, region=shang-hai, etc.
-- Special character support summary:
-- Supported: letters, numbers, underscores, hyphens, spaces, Chinese characters
-- Length: Supports very long partition values (tested with 100+ characters)
-- Note: Avoid special symbols like @#$% although they may be supported, it's not recommended
Complex Partition Migration Strategies
Multi-Level Partition Architecture Migration Challenges
If you implemented complex partition strategies on your original platform, you will encounter architectural limitations and design choices when migrating to the Lakehouse.
Migration Challenges for Composite Dimension Partitions
Composite Partitions on the Original Platform:
-- Common composite partitions in MaxCompute/Hive
CREATE TABLE sales PARTITIONED BY (
dt STRING, -- Date: 20240601
region STRING, -- Region: beijing, shanghai
channel STRING -- Channel: online, offline
);
-- Partition count = number of dates x number of regions x number of channels
-- Example: 365 x 10 x 3 = 10,950 partitions/year
Problems with Direct Migration:
Directly copying composite partitions will result in too many partitions. The recommended approach is to reduce partition dimensions by combining time partitions with indexes.
Strategy 1: Simplified Partition + Index
-- Simplified from 3-level partitions to 1 time partition + 2 indexes
CREATE TABLE sales (
sale_id INT,
amount DOUBLE,
region STRING,
channel STRING,
sale_date DATE
) PARTITIONED BY (days(sale_date));
-- Partition count per year: ~365
-- Create index optimizations for non-partition query columns
CREATE BLOOMFILTER INDEX idx_region ON TABLE sales(region);
CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales(channel);
-- Query with partition pruning + index acceleration
SELECT * FROM sales
WHERE sale_date >= '2024-06-01'
AND sale_date <= '2024-06-30' -- Partition pruning
AND region = 'beijing' -- Index acceleration
AND channel = 'online'; -- Index acceleration
Strategy 2: Hash Partition Redesign
-- Use hash function to reduce partition count
CREATE TABLE sales (
sale_id INT,
amount DOUBLE,
region STRING,
channel STRING,
sale_date DATE
) PARTITIONED BY (
days(sale_date), -- Time partition
bucket(10, region) -- Region hashed into 10 buckets
);
-- Partition count = 365 x 10 = 3,650 partitions/year (manageable)
Partition Evolution and Maintenance Strategy Migration
Differences in Dynamic Partition Management
Partition Management on Original Platform:
-- Partition management in Hive
-- 1. Dynamic partition auto-creation
INSERT OVERWRITE TABLE target PARTITION(dt, region)
SELECT ..., dt, region FROM source; -- Auto-creates all dt x region combinations
-- 2. Partition repair
MSCK REPAIR TABLE target; -- Auto-discover new partitions
-- 3. Partition deletion
ALTER TABLE target DROP PARTITION (dt<'20240101'); -- Batch deletion
Equivalent Implementation in Lakehouse:
-- 1. Dynamic partition creation (pay attention to partition count)
-- Recommended to check partition count first
SELECT COUNT(DISTINCT CONCAT(dt, '/', region)) FROM source;
-- Process in batches (if count is large)
INSERT INTO target
SELECT ..., dt, region FROM source
WHERE dt BETWEEN '20240601' AND '20240615';
-- 2. Partition discovery (automatic, no manual repair needed)
-- Lakehouse automatically manages partition metadata
-- 3. Partition cleanup (more powerful features)
TRUNCATE TABLE target PARTITION (dt = '20240101');
Partition Lifecycle Management Migration
MaxCompute's Automatic Lifecycle:
-- Setting table-level lifecycle in MaxCompute
ALTER TABLE events SET LIFECYCLE 90; -- Auto-delete after 90 days
-- Partition-level lifecycle
ALTER TABLE events PARTITION(dt='20240101') SET LIFECYCLE 30; -- Single partition 30 days
Partition Cleanup Features in Lakehouse:
-- Basic partition cleanup (for STRING partitions)
TRUNCATE TABLE events PARTITION (dt = '20240101');
-- Transform partition cleanup (requires specific partition value)
TRUNCATE TABLE events_with_days PARTITION (days(event_date) = 19875);
-- Advanced feature: Conditional filter cleanup
-- Delete all partitions older than 90 days (need to calculate specific partition values first)
-- For STRING partitions:
TRUNCATE TABLE events
PARTITION (dt < date_format(date_sub(current_date(), 90), 'yyyyMMdd'));
-- Composite condition cleanup: Delete data for specific date and region
TRUNCATE TABLE sales
PARTITION (days(sale_date) = 19875 AND region = 'beijing');
-- Batch partition cleanup: Clean multiple partitions simultaneously
TRUNCATE TABLE logs
PARTITION (days(log_date) = 19875),
PARTITION (days(log_date) = 19876);
Partition Cleanup Best Practices:
-- 1. Check partitions to delete before cleanup
SHOW PARTITIONS EXTENDED table_name
WHERE dt < '2024-01-01';
-- 2. Clean large numbers of partitions in stages (avoid long table locks)
-- For STRING partitions:
TRUNCATE TABLE large_table
PARTITION (dt = '20230101');
-- Then continue cleaning the next day's data
-- For transform partitions, query the partition value first:
-- SELECT days('2023-01-01'); -- Get the specific partition value
TRUNCATE TABLE large_table_with_days
PARTITION (days(date_col) = specific_partition_value);
-- 3. Periodic cleanup scheduling script example
-- Execute daily at 2 AM, clean partitions older than 30 days (STRING partition)
TRUNCATE TABLE daily_logs
PARTITION (dt < date_format(date_sub(current_date(), 30), 'yyyyMMdd'));
Equivalent Implementation of Z-Order Optimization
Databricks Delta Lake Z-Order:
-- Multi-dimensional optimization in Delta Lake
OPTIMIZE events ZORDER BY (user_id, event_type, timestamp);
-- Achieves joint optimization of multiple fields
Equivalent Strategies in Lakehouse:
-- Strategy 1: Partition + Bucket combination (recommended)
CREATE TABLE events (
user_id INT,
event_type STRING,
event_data JSON,
timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp)) -- Time partition
CLUSTERED BY (user_id) INTO 32 BUCKETS; -- User hash bucketing
-- Strategy 2: Partition + Sort combination
CREATE TABLE events_sorted (
user_id INT,
event_type STRING,
event_data JSON,
timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp)) -- Time partition
SORTED BY (event_type); -- Event type sort
-- Strategy 3: Rewrite optimization (similar to OPTIMIZE)
INSERT OVERWRITE events
SELECT * FROM events
ORDER BY user_id, event_type, timestamp; -- Manual data reordering
Partition Pruning Optimization Migration
Original Platform Partition Pruning Logic:
-- Complex partition filtering in Spark
df.filter(
(col("year") >= 2024) &
(col("month") >= 6) &
(col("day") >= 1)
)
Equivalent Query in Lakehouse:
-- In Lakehouse, transform to range query on a single partition dimension
SELECT * FROM events
WHERE event_date >= '2024-06-01'
AND event_date <= '2024-06-30' -- month range filtering
AND region = 'beijing'; -- region = "beijing"
-- Key: Transform complex partition logic into simple time range queries
Hive Users: Special Attention Required
Syntax Compatibility Differences
-- Hive syntax is still supported in Lakehouse
CREATE TABLE hive_style (
order_id INT,
amount DOUBLE
) PARTITIONED BY (dt STRING, region STRING); -- Fully compatible
-- But note ADD COLUMN position issues
-- Hive: New column added to the last position before partition columns
-- Lakehouse: Recommend explicitly specifying column position
ALTER TABLE hive_style ADD COLUMN new_col STRING AFTER amount; -- Explicit position
Dynamic Partition Configuration Differences
-- Configurations needed in Hive are not needed in Lakehouse
-- set hive.exec.dynamic.partition=true; -- Not needed in Lakehouse
-- set hive.exec.dynamic.partition.mode=nonstrict; -- Not needed in Lakehouse
-- But pay attention to partition count management
-- Hive: hive.exec.max.dynamic.partitions=1000 (adjustable)
-- Lakehouse: Recommend controlling single-operation partition count within reasonable range
Additional Real Pain Points for Hive Users
-- Additional issues commonly encountered by Hive users:
-- 1. Partition field type restrictions
CREATE TABLE hive_table PARTITIONED BY (dt STRING); -- Hive partition fields typically must be STRING
-- 2. Partition directory structure dependency
-- Hive strictly depends on directory structures like /data/table/year=2024/month=06/day=01/
-- No such restriction in Lakehouse, more flexible
-- 3. Frequent need for partition repair
-- Hive: MSCK REPAIR TABLE table_name; -- Often needs manual repair
-- Lakehouse: Auto-maintains metadata, no manual repair needed
Spark Users: Special Attention Required
DataFrame Write Method Differences
-- Common Spark DataFrame write approach
-- df.write.mode("overwrite").partitionBy("date", "region").saveAsTable("table")
-- Things to note when migrating to Lakehouse SQL
-- 1. Ensure the table is already created and correctly partitioned
CREATE TABLE spark_migrated PARTITIONED BY (date STRING, region STRING);
-- 2. Use INSERT statements instead of saveAsTable
INSERT OVERWRITE spark_migrated SELECT * FROM source_data;
| Spark Function | Lakehouse Function | Notes |
|---|
year(col) | years(col) | Note plural form |
month(col) | months(col) | Note plural form |
dayofyear(col) | days(col) | Different function name |
hour(col) | hours(col) | Note plural form |
Additional Challenges for Spark Users
// Issues more commonly encountered by Spark users:
// 1. Partition discovery issues
spark.sql("MSCK REPAIR TABLE table_name") // Spark also has this issue
// 2. Performance pitfalls of dynamic partition writes
df.write.mode("append")
.option("maxRecordsPerFile", "50000") // Control file size
.partitionBy("date")
.saveAsTable("table")
// 3. Partition column auto-inference type issues
df.write.partitionBy($"date".cast("string")) // Must cast to string
MaxCompute Users: Special Attention Required
Partition Usage Habit Adjustments
-- MaxCompute enforces partition conditions
-- SELECT * FROM table WHERE pt='20240601'; -- Must include partition condition, otherwise error
-- In Lakehouse, partition conditions are automatic
SELECT * FROM table WHERE order_date='2024-06-01'; -- Automatic partition pruning, more flexible
-- However, it is still recommended to include partition conditions in queries for optimal performance
Lifecycle Management Differences
-- MaxCompute's automatic lifecycle
-- ALTER TABLE table SET LIFECYCLE 30; -- Auto-delete after 30 days
-- In Lakehouse, manual management or scheduling is needed
TRUNCATE TABLE table PARTITION (order_date = '2024-05-01'); -- Manual cleanup
Real Pain Points for MaxCompute Users
-- Issues most commonly encountered by MaxCompute users:
-- 1. Forced partition filter habit
-- MaxCompute: Queries without partition conditions will error directly
-- Lakehouse: Allows full table scan, but recommend including partition conditions
-- 2. INSERT OVERWRITE syntax differences for partitioned tables
-- MaxCompute: INSERT OVERWRITE TABLE target PARTITION(dt='20240601')
-- Lakehouse: INSERT OVERWRITE target ... (automatically identifies partitions)
-- 3. Cross-project access syntax changes
-- MaxCompute: SELECT * FROM project.table WHERE pt='20240601';
-- Lakehouse: SELECT * FROM catalog.schema.table WHERE pt='20240601';
Snowflake Users: Special Attention Required
Major Mindset Shift: From Automatic Optimization to Proactive Partition Design
Traditional Snowflake User Habits: You may not be highly concerned with underlying partition design
-- Typical usage pattern in traditional Snowflake
CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE); -- System auto-manages storage
ALTER TABLE orders CLUSTER BY (order_date); -- Set clustering key, system auto-manages micro-partitions
-- No need to consider partitions in queries
SELECT * FROM orders WHERE amount > 1000;
Modern Snowflake Users (Iceberg Tables): Syntax is largely similar but with subtle differences
-- Snowflake Iceberg table
CREATE ICEBERG TABLE orders_iceberg PARTITION BY (year(order_date));
-- Corresponding syntax in Lakehouse
CREATE TABLE orders_lakehouse PARTITIONED BY (years(order_date)); -- Note plural form
Adjustments When Migrating to Lakehouse:
-- Traditional Snowflake thinking may be suboptimal in Lakehouse
CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE); -- Created a regular table, not a partitioned table
-- Result: Query performance may be less than ideal
-- Learn to proactively design partition strategies
CREATE TABLE orders (
id INT,
amount DOUBLE,
order_date DATE
) PARTITIONED BY (days(order_date)); -- Proactive partition design
-- Although pruning is automatic, partition design directly affects performance
SELECT * FROM orders
WHERE order_date >= '2024-06-01' -- Partition condition (recommended to include)
AND amount > 1000; -- Business condition
Snowflake Users' Learning Path
-- Phase 1: Understand the value of partitioning
-- Partition = Physically storing data separately by a field
-- Purpose: Only scan relevant partitions during queries, not the entire table
-- Phase 2: Learn partition design
-- Ask yourself: What field do my queries most commonly filter on?
-- Time fields: order_date, created_at, updated_at
-- Business fields: region, department, customer_type
-- Phase 3: Verify partition effectiveness
-- Compare query performance of partitioned vs non-partitioned tables
Cognitive Focus Points for Snowflake Users
-- Key understanding points for Snowflake users:
-- 1. Not all databases auto-optimize
SELECT * FROM large_table WHERE complex_condition; -- Need to consider partition design
-- 2. Importance of partition design
-- Snowflake micro-partitions are automatic, but in Lakehouse proactive design is needed
-- 3. File system optimization concepts
-- Understand concepts like "small file problem", "partition count control"
-- 4. Query optimization awareness
SELECT * FROM large_partitioned_table
WHERE order_date >= '2024-06-01'; -- Habit of including partition conditions in queries
Databricks Users: Special Attention Required
Delta Lake vs Iceberg Differences
-- Delta Lake partition syntax
-- CREATE TABLE delta_table PARTITIONED BY (year, month);
-- Equivalent syntax in Lakehouse (Iceberg)
CREATE TABLE iceberg_table PARTITIONED BY (years(date_col)); -- Can only use single time granularity
-- Cannot simultaneously use multiple time granularities like in Delta Lake
Optimization Command Differences
-- Delta Lake optimization commands
-- OPTIMIZE table_name;
-- OPTIMIZE table_name ZORDER BY (col1, col2);
-- Similar effect achieved in Lakehouse through rewrite
INSERT OVERWRITE table_name SELECT * FROM table_name; -- File merge optimization
Advanced Feature Migration for Databricks Users
-- Issues more commonly encountered by Databricks users:
-- 1. Delta Lake time travel habits
-- Delta: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00';
-- Lakehouse: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00'; -- Similar syntax
-- 2. Heavy reliance on OPTIMIZE and Z-ORDER
-- Delta: OPTIMIZE table ZORDER BY (col1, col2); -- This is a daily operation
-- Lakehouse: Need to combine partition + sort + bucket for equivalent effect
-- 3. Unity Catalog impact
-- Delta: CREATE TABLE catalog.schema.table; -- Three-level namespace habit
-- Lakehouse: CREATE TABLE schema.table; -- Two-level namespace
Practical Pitfall Avoidance Guide
Partition Creation Pitfalls
Do's and Don'ts Comparison
-- Wrong approach: Unclear partition design
CREATE TABLE bad_partition (
id INT,
data STRING,
timestamp_col TIMESTAMP_LTZ
); -- No PARTITIONED BY, not a partitioned table
INSERT INTO bad_partition VALUES (1, 'test', CURRENT_TIMESTAMP());
-- Later finding queries are slow and wanting to add partitions is too late
-- Correct approach: Plan partition strategy in advance
CREATE TABLE good_partition (
id INT,
data STRING,
timestamp_col TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp_col));
-- Immediately verify partition was created correctly
SHOW PARTITIONS good_partition; -- Should execute normally without errors
Partition Field Type Selection
-- Easy pitfall in type selection
CREATE TABLE date_type_trap (
id INT,
order_date DATE -- DATE type
) PARTITIONED BY (order_date);
-- May get type errors when inserting data
INSERT INTO date_type_trap VALUES (1, '2024-06-01'); -- String vs DATE type
-- Recommended approach: Use STRING type for partitioning
CREATE TABLE string_partition (
id INT,
order_date_str STRING -- Use STRING type to avoid type conversion issues
) PARTITIONED BY (order_date_str);
-- Or use transform partitions
CREATE TABLE transform_partition (
id INT,
order_date DATE
) PARTITIONED BY (days(order_date)); -- Let the system handle type conversion
Data Write Pitfalls
Large Batch Write Strategy
-- For large batch writes, recommend processing in batches to avoid too many partitions at once
INSERT INTO target_table
SELECT * FROM source_table
WHERE partition_col >= 'start_value'
AND partition_col <= 'end_value';
Partition Data Consistency
-- Note: Timezone issues with transform partitions
CREATE TABLE timezone_sensitive (
id INT,
event_time TIMESTAMP_LTZ -- Timezone-aware timestamp
) PARTITIONED BY (days(event_time));
-- Same local time in different timezones may land in different partitions
INSERT INTO timezone_sensitive VALUES
(1, TIMESTAMP '2024-06-01 23:30:00 UTC'), -- UTC timezone
(2, TIMESTAMP '2024-06-01 23:30:00'); -- System default timezone
-- View partition distribution
SHOW PARTITIONS timezone_sensitive;
-- May see two different partition values: days(event_time)=19875 and days(event_time)=19876
Scenarios Where Partition Pruning Fails
-- Query that cannot leverage partition pruning
SELECT * FROM partitioned_table
WHERE YEAR(order_date) = 2024; -- Function wrapping partition field
SELECT * FROM partitioned_table
WHERE order_date LIKE '2024%'; -- Fuzzy matching
-- Queries that effectively leverage partition pruning
SELECT * FROM partitioned_table
WHERE order_date >= '2024-06-01'
AND order_date <= '2024-06-30'; -- Range query on partition field
Quick Partition Troubleshooting
Investigation Commands:
-- 1. Check partition count and size distribution
SHOW PARTITIONS EXTENDED your_table;
-- 2. Check for excessive small partitions
SHOW PARTITIONS EXTENDED your_table WHERE bytes < 10*1024*1024;
-- 3. Check if queries are utilizing partition pruning
EXPLAIN SELECT * FROM your_table WHERE partition_col = 'value';
Common Causes and Solutions:
- Over-partitioning (too many small partitions) -> Redesign partition granularity
- Query doesn't include partition field -> Optimize query WHERE conditions
- Not actually a partitioned table -> Rebuild with native SQL
Standard Method for Partition Effectiveness Verification
1. Create Comparison Test
-- Create partitioned and non-partitioned tables with the same data
CREATE TABLE sales_partitioned (
id INT, amount DOUBLE, sale_date DATE
) PARTITIONED BY (days(sale_date));
CREATE TABLE sales_normal (
id INT, amount DOUBLE, sale_date DATE
);
-- Insert the same large data volume for performance comparison testing
-- Test partition query vs full table scan
SELECT COUNT(*) FROM sales_partitioned
WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30';
SELECT COUNT(*) FROM sales_normal
WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30';
3. Partition Health Check
-- Check partition size distribution
SHOW PARTITIONS EXTENDED table_name;
-- Ideal state:
-- Each partition 128MB - 1GB
-- Partition sizes relatively uniform
-- No large number of partitions smaller than 10MB
-- No single partition exceeding 5GB
Partition Table Slower Than Non-Partitioned Table?
Possible Causes and Solutions:
-
Over-partitioning: Partition is too granular, metadata overhead is high
-- Problem: Hourly partitions lead to many small partitions
PARTITIONED BY (hours(timestamp))
-- Solution: Change to day-level partitions
PARTITIONED BY (days(timestamp))
-
Query not utilizing partitions: WHERE condition doesn't include partition field
-- Cannot leverage partitions
SELECT * FROM partitioned_table WHERE amount > 1000;
-- Utilize partitions
SELECT * FROM partitioned_table
WHERE sale_date >= '2024-06-01' AND amount > 1000;
-
Wrong partition field selection: Partition field is not the query hotspot
-- Problem analysis: Check query patterns
-- If queries mainly filter by user_id but partition is by date, it's ineffective
-- Solution: Redesign partition strategy
PARTITIONED BY (bucket(100, user_id)) -- Partition by user hash instead
Partition Tuning in Practice
1. Partition Granularity Selection
| Data Volume | Recommended Granularity |
|---|
| > 1 billion rows | Day-level partitions days() |
| 100M - 1B rows | Month-level partitions months() |
| < 10M rows | Month-level partitions months() |
2. Composite Partition Optimization
-- Original composite partition problem
CREATE TABLE sales_old PARTITIONED BY (region, sale_date, channel);
-- Problem: 10 regions x 365 days x 3 channels = 10,950 partitions
-- Optimization 1: Primary-secondary partitioning
CREATE TABLE sales_optimized (
..., region STRING, channel STRING
) PARTITIONED BY (days(sale_date)); -- Primary partition: time
CREATE BLOOMFILTER INDEX idx_region ON TABLE sales_optimized(region);
CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales_optimized(channel);
-- Optimization 2: Hash compression
CREATE TABLE sales_hash PARTITIONED BY (
days(sale_date), -- Time partition: 365
bucket(5, region) -- Region hashed to 5 buckets
);
-- Result: 365 x 5 = 1,825 partitions (manageable)
3. Partition Evolution Strategy
-- Periodic partition health check
WITH partition_health AS (
SELECT
partitions,
total_rows,
bytes,
CASE
WHEN CAST(bytes AS BIGINT) < 10*1024*1024 THEN 'TOO_SMALL'
WHEN CAST(bytes AS BIGINT) > 5*1024*1024*1024 THEN 'TOO_LARGE'
ELSE 'HEALTHY'
END AS health_status
FROM (SHOW PARTITIONS EXTENDED target_table)
)
SELECT health_status, COUNT(*) as count
FROM partition_health
GROUP BY health_status;
Migration Case Studies
Case 1: Real-Time Data Table Migration (Complex Partition -> Simplified Partition)
Original Complex Partition Table
-- Original table: Multi-level time + business dimension composite partitions
CREATE TABLE events_old PARTITIONED BY (year, month, day, hour, event_type);
-- Problem: Too many partitions, complex queries, high maintenance cost
Lakehouse Migration Strategy
-- Redesign: Single time granularity partition + index
CREATE TABLE events_new (
event_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
SORTED BY (event_type);
-- Query becomes cleaner
SELECT * FROM events_new
WHERE event_time >= '2024-06-01 08:00:00'
AND event_time <= '2024-06-01 17:59:59';
-- Performance improvement:
-- Query speed: 40% improvement (avoids small partition scanning)
-- Maintenance cost: 60% reduction (partition count significantly reduced)
-- Query complexity: 80% reduction (no need to calculate year, month, day, hour)
Case 2: Log Analysis Table Migration (MaxCompute -> Singdata Lakehouse)
Original MaxCompute Table
-- Original MaxCompute table: Strict partition restrictions
CREATE TABLE app_logs (
user_id STRING,
event_type STRING,
event_data STRING
) PARTITIONED BY (
pt STRING, -- Format: 20240601
region STRING, -- Region: beijing, shanghai
app_version STRING -- Version: 1.0, 1.1, 1.2
);
-- MaxCompute characteristics:
-- Enforced partition conditions: SELECT must include WHERE pt='20240601'
-- Partition count explosion: 365 x 10 x 20 = 73,000/year
-- Data skew: Beijing region has much more data, other regions very little
Singdata Lakehouse Migration Strategy
-- Step 1: Analyze original query patterns
-- Discovery: 90% of queries filter by time range + region
-- Decision: Time as primary partition, region with index
-- Step 2: Redesign partition architecture
CREATE TABLE app_logs_new (
user_id STRING,
event_type STRING,
event_data STRING,
region STRING, -- No partition, use index
app_version STRING, -- No partition, use index
log_date DATE
) PARTITIONED BY (days(log_date)); -- Only partition by time
-- Step 3: Create indexes to optimize non-partition queries
CREATE BLOOMFILTER INDEX idx_region ON TABLE app_logs_new(region);
CREATE BLOOMFILTER INDEX idx_version ON TABLE app_logs_new(app_version);
-- Step 4: Verify query performance
-- Original query:
SELECT * FROM app_logs WHERE pt='20240601' AND region='beijing';
-- New query:
SELECT * FROM app_logs_new
WHERE log_date='2024-06-01' AND region='beijing';
-- Result: Comparable performance, but partition count reduced from 73,000 to 365
Migration Process Challenges
-- Challenge 1: Inconsistent data types
-- MaxCompute: pt STRING '20240601'
-- Singdata Lakehouse: log_date DATE '2024-06-01'
-- Solution: ETL conversion script
INSERT INTO app_logs_new
SELECT
user_id,
event_type,
event_data,
region,
app_version,
DATE(CONCAT(
SUBSTR(pt, 1, 4), '-',
SUBSTR(pt, 5, 2), '-',
SUBSTR(pt, 7, 2)
)) as log_date
FROM maxcompute_source;
-- Challenge 2: Habitual syntax adjustments needed
-- MaxCompute habit: WHERE pt='20240601' (string exact match)
-- Singdata Lakehouse: WHERE log_date='2024-06-01' (auto partition pruning)
-- Challenge 3: Max partition query method changes
-- MaxCompute original: WHERE pt = max_pt()
-- Singdata Lakehouse new method: WHERE log_date = (SELECT DATE(max_pt('app_logs_new')))
-- Or redesign as STRING partition, using directly: WHERE pt = max_pt('app_logs_new')
Partition Maintenance Automation
-- MaxCompute-style automation script (adapted for new platform)
-- 1. Daily cleanup of partitions older than 30 days
TRUNCATE TABLE app_logs_new
PARTITION (log_date < current_date() - INTERVAL '30' DAY);
-- 2. Smart cleanup: Keep latest partitions, clean small partitions
WITH old_partitions AS (
SELECT partitions
FROM (SHOW PARTITIONS EXTENDED app_logs_new)
WHERE CAST(bytes AS BIGINT) < 1000000 -- Partitions smaller than 1MB
AND partitions < max_pt('app_logs_new') - INTERVAL '7' DAY
)
-- Clean small partitions one by one (note: need specific partition values)
-- 3. Partition health check script
WITH partition_stats AS (
SELECT
partitions,
CAST(total_rows AS BIGINT) as rows,
CAST(bytes AS BIGINT) as size_bytes
FROM (SHOW PARTITIONS EXTENDED app_logs_new)
)
SELECT
COUNT(*) as total_partitions,
AVG(size_bytes)/1024/1024 as avg_size_mb,
SUM(CASE WHEN size_bytes < 10*1024*1024 THEN 1 ELSE 0 END) as small_partitions,
max_pt('app_logs_new') as latest_partition
FROM partition_stats;
Case 3: Real-Time Data Table Migration (Spark -> Singdata Lakehouse)
Original Spark Delta Table
-- Original Spark table: Hourly partitions + Z-Order optimization
CREATE TABLE user_events USING DELTA
PARTITIONED BY (date_hour STRING) -- Format: 2024060109
OPTIONS (
'path' '/data/user_events'
);
-- Periodic optimization
OPTIMIZE user_events ZORDER BY (user_id, event_type);
Singdata Lakehouse Migration Challenges
-- Challenge 1: No direct Z-Order equivalent
-- Challenge 2: Hourly partitions may be too granular
-- Challenge 3: High real-time write performance requirements
-- Solution 1: Partition + Bucket combination (recommended)
CREATE TABLE user_events_new (
user_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (hours(event_time)) -- Keep hourly partitions (real-time requirement)
CLUSTERED BY (user_id) INTO 32 BUCKETS; -- User ID bucketing
-- Solution 2: Partition + Sort combination
CREATE TABLE user_events_sorted (
user_id INT,
event_type STRING,
event_data JSON,
event_time TIMESTAMP_LTZ
) PARTITIONED BY (hours(event_time)) -- Keep hourly partitions
SORTED BY (event_type); -- Event type sort
-- Achieving similar Z-Order effect:
-- 1. Partition: Physical isolation by time
-- 2. Bucket: Hash distribution by user ID
-- 3. Sort: Clustered storage by event type
-- Tuning 1: Monitor partition sizes
WITH partition_analysis AS (
SELECT
partitions,
CAST(bytes AS BIGINT)/1024/1024 as size_mb,
CAST(total_rows AS BIGINT) as row_count
FROM (SHOW PARTITIONS EXTENDED user_events_new)
)
SELECT * FROM partition_analysis
ORDER BY size_mb DESC LIMIT 10;
-- Discovery: Night-time partitions too small (< 10MB)
-- Solution: Dynamically adjust partition strategy
-- Tuning 2: Hourly partitions during day, merged at night
-- Implement smart partition strategy through ETL:
-- 8:00-22:00 peak hours: Hourly partitions
-- 22:00-8:00 off-peak: Merge to daily partitions
Case 4: Data Warehouse Migration (Snowflake -> Singdata Lakehouse)
Special Challenges for Snowflake Users
-- Traditional Snowflake: Automatic storage management
CREATE TABLE sales (
order_id INT,
customer_id INT,
amount DECIMAL(10,2),
order_date DATE
);
-- Set clustering key (relatively transparent to user)
ALTER TABLE sales CLUSTER BY (order_date);
-- Queries: No need to worry about physical storage
SELECT * FROM sales WHERE amount > 1000;
Singdata Lakehouse Learning Path
-- Phase 1: Understand partition concepts
-- Question: What is partitioning? Why is it needed?
-- Answer: Partitioning physically stores data separately by a field, scanning only relevant partitions during queries
-- Phase 2: Learn partition design
-- Question: Which field should I partition by?
-- Analysis: Check the most commonly used WHERE conditions
SELECT
COUNT(*) as query_count,
'order_date filter' as filter_type
FROM query_log
WHERE query_text LIKE '%WHERE%order_date%'
UNION ALL
SELECT
COUNT(*),
'customer_id filter'
FROM query_log
WHERE query_text LIKE '%WHERE%customer_id%';
-- Result: order_date filtering accounts for 80%, customer_id for 20%
-- Decision: Partition by order_date, use index for customer_id
-- Phase 3: Correct partition table design
CREATE TABLE sales_partitioned (
order_id INT,
customer_id INT,
amount DECIMAL(10,2),
order_date DATE
) PARTITIONED BY (days(order_date));
CREATE BLOOMFILTER INDEX idx_customer ON TABLE sales_partitioned(customer_id);
-- Phase 4: Verify performance improvement
-- Comparative queries:
SELECT * FROM sales WHERE order_date = '2024-06-01'; -- Full table scan
SELECT * FROM sales_partitioned WHERE order_date = '2024-06-01'; -- Partition scan
-- Result: Partitioned tables typically show significant performance improvement
Common Adjustments and Real Confusions for Snowflake Users
-- Adjustment 1: Create partitioned tables instead of regular tables
CREATE TABLE sales_correct (
order_id INT,
amount DECIMAL(10,2),
order_date DATE
) PARTITIONED BY (days(order_date)); -- This is a partitioned table
-- Adjustment 2: Include partition conditions in queries
SELECT * FROM sales_partitioned
WHERE order_date >= '2024-06-01' -- Leverage partition
AND amount > 1000; -- Business filter
-- Adjustment 3: Understand the importance of partition fields
SELECT * FROM sales_partitioned WHERE amount > 1000; -- Can be optimized to include partition conditions
-- Adjustment 4: Not all queries will automatically be fast
-- Snowflake users are accustomed to the system handling all storage optimization automatically
-- Lakehouse requires more proactive design awareness
The Actual Process of Mindset Shift
### Common Questions from Snowflake Users:
"Why are my queries fast on Snowflake but need partition consideration on Lakehouse?"
**Answer**: Snowflake's automatic optimization vs Lakehouse's proactive partition design each have their advantages
### Typical Issues During the Learning Process:
1. "Why should I decide the partition strategy?"
2. "What is the small file problem? I never considered file sizes before"
3. "What does partition count control mean?"
### Breakthrough Moment:
When users see significant performance advantages of partitioned tables over non-partitioned ones, they begin to understand the value of partitioning
Migration Verification Checklist
Partition Creation Verification
-- Checkpoint 1: Table is indeed a partitioned table
SHOW PARTITIONS your_table_name;
-- Should execute normally, not reporting "not a partitioned table" error
-- Checkpoint 2: Partition field types are correct
DESCRIBE TABLE your_table_name;
-- Confirm partition field data types match expectations
-- Checkpoint 3: Test data write
INSERT INTO your_table_name VALUES (test_data);
SHOW PARTITIONS your_table_name;
-- Should see newly created partitions
-- Checkpoint 4: Verify partition value format
-- Transform partitions generate numeric values, e.g. days('2024-06-01') = 19875
-- years('2024-06-01') = 54 (year count from 1970)
-- Confirm partition values meet expectations
-- Checkpoint 5: Test max partition retrieval (compatibility verification)
SELECT max_pt('your_table_name');
-- Should return the maximum partition value for compatibility with similar original platform features
-- Checkpoint 6: Partition pruning is effective
-- Compare execution time of these two queries
SELECT COUNT(*) FROM your_table_name; -- Full table scan
SELECT COUNT(*) FROM your_table_name WHERE partition_col = 'specific_value'; -- Partition query
-- Partition query should be noticeably faster
-- Checkpoint 7: Partition sizes are reasonable
SHOW PARTITIONS EXTENDED your_table_name;
-- Check each partition's size, ideal range: 128MB - 1GB
Compatibility Verification
-- Checkpoint 8: Whether original queries need modification
-- Test original platform queries in Lakehouse
-- Pay special attention to:
-- 1. Whether transform function names need adjustment (year -> years)
-- 2. Whether partition conditions can be automatically recognized
-- 3. Whether data type conversions work correctly
-- Checkpoint 9: Whether partition evolution strategies are feasible
-- Verify partition cleanup and maintenance scripts work correctly
Common Error Self-Check
If you encounter the following errors, resolve them accordingly:
| Error Message | Possible Cause | Solution |
|---|
not a partitioned table | Missing PARTITIONED BY or syntax error during table creation | Re-create table with native SQL, adding partition definition |
implicit cast not allowed | Partition field type mismatch | Check inserted data types |
exceeds maximum number | Too many partitions in a single operation | Adjust parameters or process in batches |
conflicts with | Transform partition logic conflict | Choose single time granularity |
months conflicts with years | Multi-level time partition design error | Switch to days() single granularity |
Syntax error at or near 'ORDER' | SHOW PARTITIONS used with ORDER BY | Use WITH subquery for sorting |
cannot resolve column 'total_rows' | Partition attributes used in TRUNCATE PARTITION | Only partition field itself can be used |
operator not found | Type mismatch comparison | Ensure data types are consistent |
duplicate.syntax.element | CLUSTERED BY and SORTED BY used together | Choose one of the two syntaxes |
| Query performance worse than original platform | Partition design may be unreasonable | Re-evaluate partition strategy |
| Too many small partitions | Too many composite partition dimensions | Reduce partition dimensions, use indexes instead |
max_pt function not found | Possible table name error or permission issue | Check table name and schema permissions |
TRUNCATE PARTITION failed | Partition condition syntax error | Check partition filter expression syntax |
Basic Migration Success Indicators
- Original query logic can run in Lakehouse without major modifications
- Query performance meets or exceeds original platform levels
- Partition maintenance workload is manageable and automated
- Team members can independently handle common partition issues
Complex Migration Success Indicators
- New partition strategy is simpler than original but performance is not degraded
- Partition count is within reasonable range
- Partition sizes are evenly distributed (128MB-1GB/partition)
- Query patterns and partition design are highly aligned
- Partition maintenance automation is no lower than original platform
Summary: Key Elements for Successful Migration
Core Mindset Shifts
- From Explicit Partitioning to Hidden Partitioning: No need to manually specify partition conditions in every query
- From Automatic Optimization to Proactive Design: Partition strategies need advance planning and design
- From Single Syntax to Diverse Choices: Multiple partition creation methods supported, choose the most suitable
- From Less Concern to Partition Awareness: Especially for Snowflake users, need to strengthen partition design awareness
- From Complex Partitioning to Simplified Design: Multi-level partitions need redesign to single granularity partitions
Pitfall Avoidance Summary
Basic Syntax Traps
- Remember plural forms for transform function names:
years, months, days, hours
- Keep partition types consistent: Avoid mixing STRING and DATE types
- Control partition count within reasonable range: Avoid too many small partitions
- Design partitions based on query patterns: Design partition fields based on WHERE conditions
- Verify partition effectiveness promptly: Check partition correctness immediately after creation
Complex Migration Traps
- Multi-level time partitions cannot be directly migrated:
year+month+day should change to days() single granularity
- Simplify composite partition dimensions: Too many dimensions lead to excessive partition count
- Rebuild partition evolution strategies: From automatic management to manual scheduled management
- Adjust performance optimization strategies: Z-Order and other advanced optimizations need redesign
- Supplement lifecycle management: From automatic cleanup to script-based periodic cleanup
Creation Verification Traps
- Recommend using native SQL to create partition tables: Ensure syntax accuracy
- Must verify after every creation: Execute the complete verification checklist
- Performance testing must compare baselines: Ensure partitioned tables are indeed faster than non-partitioned ones
- Monitor partition health status: Regularly check partition sizes and count distribution
Advanced Syntax Traps
- SHOW PARTITIONS does not support ORDER BY: Use WITH subquery for sorting
- bucket function parameters must be reasonable: Recommended positive integers in range 1-1000
- Note NULL values and special characters: The system will create corresponding partitions
Successful Migration Experience
For Simple Partition Scenarios
- Incremental migration: Verify with small tables first, then migrate core large tables
- Preserve original query logic: Try to let original SQL work without major changes
- Performance baseline comparison: Compare query performance before and after migration
- Complete verification process: Strictly follow the verification checklist
For Complex Partition Scenarios
- Partition strategy redesign: Don't try to completely replicate the original partition structure
- Performance verification first: Use representative queries to verify new partition strategy effectiveness
- Phased implementation: Complex migration divided into four phases: analysis, redesign, testing, and cutover
- Prepare rollback plan: Ensure quick rollback to original solution if migration fails
Special Reminders
For Snowflake and similar platform users:
The challenge of migrating to Lakehouse mainly lies in mindset adjustment. You need to shift from "less concern about partitioning" to "proactive partition strategy design." It is recommended to first experience the impact of partitioning on query performance in a test environment, understand the value of partitioning, and then design production environment partition strategies.
For users with complex partition architectures:
Don't try to completely replicate the original partition structure in the Lakehouse. The Lakehouse partition philosophy is "simplify without losing performance." Multi-level partitions, composite partitions, and other complex designs can often achieve the same or better results with simpler solutions in the Lakehouse.
For all migrating users:
Partition table creation verification is the first and most critical step in successful migration. It is recommended to use native SQL to create partition tables and strictly execute the verification checklist. Many migration issues stem from partition tables not being correctly created, making subsequent performance optimization impossible.
Remember: Lakehouse partitioning is not a burden, but a powerful tool for performance optimization. When partitioning is used correctly, you will achieve better query performance and more flexible data management capabilities than on your original platform.
Quick Reference Card
Common Partition Management Commands
| Function | Command Syntax | Use Case |
|---|
| View Partitions | SHOW PARTITIONS table_name | Basic partition viewing |
| Partition Details | SHOW PARTITIONS EXTENDED table_name | View partition size, file count, etc. |
| Partition Filter | SHOW PARTITIONS EXTENDED table WHERE bytes > 100*1024*1024 | Health check |
| Specific Partition | SHOW PARTITIONS table PARTITION (pt1 = '2023') | View specific partition |
| Limit Results | SHOW PARTITIONS table LIMIT 10 | Limit returned results |
| Max Partition | SELECT max_pt('table_name') | Get latest partition value |
| Clean Partition | TRUNCATE TABLE table PARTITION (pt = 'value') | Lifecycle management |
| Batch Clean | TRUNCATE TABLE table PARTITION (pt1 = 'v1'), PARTITION (pt2 = 'v2') | Composite condition cleanup |
| Function | Original Platform Syntax | Lakehouse Syntax | Notes |
|---|
| Year Partition | year(date) | years(date) | Plural form, returns year count from 1970 |
| Month Partition | month(date) | months(date) | Plural form |
| Day Partition | day(date) | days(date) | Plural form, returns day count from 1970-01-01 |
| Hour Partition | hour(timestamp) | hours(timestamp) | Plural form |
| Composite Partition | (year, month) | days(date) | Cannot combine conflicting transforms |
| Dynamic Partition | Requires enable configuration | Default support | Pay attention to partition count control |
| Max Partition | max_pt() | max_pt('table_name') | Must specify table name |
| Partition Cleanup | ALTER TABLE DROP PARTITION | TRUNCATE TABLE PARTITION | More flexible syntax |
Quick Error Reference
| When You See This Error | Immediately Check | Quick Fix |
|---|
not a partitioned table | Whether table creation statement has PARTITIONED BY | Rebuild table with native SQL |
implicit cast not allowed | Whether data types match | Use STRING partition uniformly or use transform partition |
exceeds maximum number | Whether partition count is too high | Insert in batches or adjust parameters |
conflicts with | Whether transform partitions conflict | Use single time granularity |
months conflicts with years | Multi-level time partition design error | Switch to days() single granularity |
Syntax error at or near 'ORDER' | SHOW PARTITIONS used with ORDER BY | Use WITH subquery for sorting |
cannot resolve column 'total_rows' | Partition attributes used in TRUNCATE PARTITION | Only partition field itself can be used |
operator not found | Type mismatch comparison | Ensure data types are consistent |
duplicate.syntax.element | CLUSTERED BY and SORTED BY used together | Choose one of the two syntaxes |
| Query performance worse than original platform | Whether partition design is reasonable | Re-evaluate partition strategy |
| Too many small partitions | Too many composite partition dimensions | Reduce partition dimensions, use indexes instead |
max_pt function not found | Possible table name error or permission issue | Check table name and schema permissions |
TRUNCATE PARTITION failed | Partition condition syntax error | Check partition filter expression syntax |
Partition Table Verification Quick Reference
| Verification Item | Check Command | Expected Result |
|---|
| Is Partitioned Table | SHOW PARTITIONS table_name | No error, shows partition list |
| Partition Created | View partitions again after inserting data | Shows new partition values |
| Type Match | DESCRIBE TABLE table_name | Column types match expectations |
| Performance Improvement | Compare partition query vs full table scan | Partition query noticeably faster |
| Latest Partition | SELECT max_pt('table_name') | Returns max partition value |
| Partition Health | SHOW PARTITIONS EXTENDED table_name | Partition sizes reasonably distributed |
Advanced Syntax Quick Reference
| Function | Correct Syntax | Incorrect Syntax |
|---|
| Partition + Bucket | PARTITIONED BY (days(date)) CLUSTERED BY (id) INTO 32 BUCKETS | Correct |
| Partition + Sort | PARTITIONED BY (days(date)) SORTED BY (name) | Correct |
| Partition + Bucket + Sort | Not supported to use simultaneously | Will report duplicate.syntax.element error |
| bucket Parameter | bucket(10, user_id) | Recommended range 1-1000 |
| Partition Sort | WITH t AS (SHOW PARTITIONS ...) SELECT * FROM t ORDER BY ... | Use subquery |
| NULL Partition | INSERT ... VALUES (1, NULL) -> col=NULL partition | Supported |
Migration Priority Checklist
First Priority (Must Do)
Second Priority (Important)
Third Priority (Optimization)
Usage Recommendation: This document can serve as the authoritative reference for Lakehouse partition migration, with all examples directly usable in production environments. For complex migration projects, it is recommended to follow the verification checklist in the document step by step, ensuring each step is correctly verified.
Note: This document is compiled based on Lakehouse product documentation as of June 2025. It is recommended to periodically check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.