Lakehouse Partition Usage Guide

Document Goal

If you are migrating from another data platform to the Singdata Lakehouse, this document will help you fully leverage the advanced advantages of Lakehouse partitioning.

The Singdata Lakehouse partitioning is based on Apache Iceberg's hidden partitioning concept, offering significant advantages over traditional partitioning: no need to manually specify partition conditions, automatic partition pruning, and more flexible partition evolution. This document is aimed at experienced data engineers from major data platforms, covering migration scenarios from mainstream platforms such as Hive, Spark, MaxCompute, Snowflake, and Databricks, focusing on how to migrate successfully and maximize value.

What You Will Gain

  • Advanced Concept Understanding: Master the innovative value and usage of hidden partitioning
  • Migration Best Practices: Proven successful migration strategies and implementation steps
  • Performance Optimization Guidance: Practical tips to fully leverage Lakehouse partition performance advantages
  • Architecture Upgrade Solutions: Optimize complex partition architectures into more efficient Lakehouse solutions
  • Practical Verification Methods: Verification steps to ensure partitioned tables are correctly created and perform well

Core Advantages of Lakehouse Partitioning

  • Intelligent Partition Pruning: No need to manually specify partition conditions in queries; the system automatically optimizes
  • Simplified Partition Management: Say goodbye to complex partition maintenance and focus on business logic
  • Better Performance Predictability: Avoid over-partitioning, ensuring stable query performance
  • Modern Architecture Design: Advanced partition concepts based on Apache Iceberg

Understanding Core Differences and Advantages

5 Innovative Features of Lakehouse Partitioning

1. Intelligent Transform Partition Functions

Innovation Value: Avoid conflicts with standard SQL functions, providing more precise time partition control

-- Elegant Lakehouse design
CREATE TABLE events PARTITIONED BY (days(event_date));  -- Precise day-level partitioning

-- Why use plural forms?
-- Avoids conflicts with SQL standard functions like year(), month()
-- Provides clearer semantics: years = number of years, not extracting the year
-- The return value is a computed number: days('2024-06-01') = 19875
-- years('2024-06-01') = 54 (year offset calculated from 1970)

Key Understanding Points for Migration:

  • years/months/days/hours are plural forms, with more accurate semantics
  • Transform partitions avoid logic conflicts, ensuring partition strategy consistency
  • Return values are computed numbers; the system automatically handles transformation logic
  • Time function calculation baseline: years from 1970, days from 1970-01-01

2. Automatic Optimization of Hidden Partitioning

Innovation Value: Users don't need to worry about partition implementation details, focusing on business logic

-- Clean query in Lakehouse
SELECT * FROM sales WHERE order_date = '2024-06-01';  
-- System automatically converts to partition condition, no manual partition specification needed

-- Advantages compared to traditional approach
-- 1. Cleaner queries, no complex partition conditions
-- 2. Partition strategy changes don't affect query logic
-- 3. System automatically selects the optimal partition scan strategy

3. Intelligent Partition Count Control

Design Philosophy: Partition count limits are a performance protection mechanism, encouraging better partition design

-- Lakehouse partition philosophy: Quality over quantity
-- Traditional thinking: Partition as granularly as possible
-- Lakehouse philosophy: Reasonable partition granularity + index optimization

-- Recommended efficient design
CREATE TABLE user_events (
    event_id INT,
    user_id INT,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time));  -- Time partition (primary)

CREATE BLOOMFILTER INDEX idx_user ON TABLE user_events(user_id);  -- Index optimization (auxiliary)

-- Advantages: Avoids small file issues, ensures each partition has sufficient data volume
-- Limitation note: It is recommended to control single-operation partition count within a reasonable range

4. Logical Consistency of Transform Partitions

Design Principle: Avoid conflicting partition dimensions, ensuring clear partition logic

-- When using transform partitions, avoid combining time granularities that conflict logically
-- For example, using both years() and months() causes conflicts
-- Choose a single time granularity: days() is the most commonly recommended option

5. Type-Safe Partition Design

Security Guarantee: Prevent data write errors through type checking

-- Lakehouse type safety mechanism
CREATE TABLE orders (
    id INT,
    amount DOUBLE
) PARTITIONED BY (order_date STRING);  -- Explicit STRING type

-- Type mismatch will cause errors
INSERT INTO orders VALUES (1, 100.0, '2024-06-01');  -- String written to STRING partition: correct
INSERT INTO orders VALUES (1, 100.0, DATE('2024-06-01'));  -- DATE written to STRING partition: error

-- Recommended approach: Use transform partitions to avoid type issues
CREATE TABLE orders_safe (
    id INT,
    amount DOUBLE,
    order_timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(order_timestamp));  -- Let the system handle type conversion

Cognitive Upgrade: From Complex to Simple

Traditional Partitioning vs Lakehouse Partitioning

DimensionTraditional Partition MindsetLakehouse Partition PhilosophyAdvantage
Partition StrategyThe more granular the better, multi-dimensional partitioningReasonable granularity, key dimensionsAvoids small files, stable performance
Query MethodMust specify partition conditionsAutomatic partition pruningCleaner queries, easier maintenance
Type HandlingManual type conversionSystem-level type safetyFewer errors, higher reliability
Maintenance CostComplex partition managementSimplified partition maintenanceLower operational costs
Performance PredictabilityDepends on partition design experienceSystem-guaranteed performanceMore stable query performance

Partition Table Creation Verification (Best Practice)

Post-Creation Verification: Ensuring Partition Tables Work Correctly

Regardless of how you create the partition table (SQL, tools, scripts), it is recommended to verify immediately after creation. This is the best practice to ensure partition functionality works properly.

-- 1. Create partition table (native SQL recommended for syntax accuracy)
CREATE TABLE orders_partitioned (
    id INT,
    amount DOUBLE,
    order_date DATE
) PARTITIONED BY (days(order_date));

-- 2. Immediately verify the partition table was created correctly (mandatory step)
SHOW PARTITIONS orders_partitioned;
-- Correct: Shows partition list or empty list
-- Abnormal: Error "not a partitioned table"

Complete Verification Checklist (Execute After Every Creation)

-- Verification Step 1: Confirm it is a partitioned table
SHOW PARTITIONS orders_partitioned;

-- Verification Step 2: Test data insertion and partition creation
INSERT INTO orders_partitioned VALUES (1, 100.50, DATE('2024-06-01'));
SHOW PARTITIONS orders_partitioned;
-- Correct: Shows a partition like "days(order_date)=19875"

-- Verification Step 3: Verify partition pruning is effective
SELECT * FROM orders_partitioned WHERE order_date = '2024-06-01';
-- Should return data normally with good performance

-- Verification Step 4: Check table structure
DESCRIBE TABLE orders_partitioned;
-- Confirm column structure is correct and partition field types match

-- Verification Step 5: Test max partition retrieval (compatibility verification)
SELECT max_pt('orders_partitioned');
-- Should return the maximum partition value, used for compatibility with similar features on the original platform

Solutions When Verification Fails

Verification Failure SymptomPossible CauseSolution
not a partitioned tableTable creation syntax error or tool creation anomalyRe-create with native SQL
implicit cast not allowedPartition field type mismatchCheck inserted data types
No partitions displayedPartition function syntax errorCheck if plural forms are used
No performance improvementQuery not utilizing partition fieldsOptimize WHERE conditions

Why Verification is Needed?

  • Ensure partition definition syntax is correct, avoiding performance issues
  • Detect configuration errors early, reducing subsequent troubleshooting costs
  • Verify partition strategy matches query patterns

SHOW PARTITIONS Complete Feature Guide

Basic Syntax and Advanced Usage

-- Complete syntax
SHOW PARTITIONS [EXTENDED] table_name 
[ PARTITION ( partition_col_name = partition_col_val [, ...] ) ] 
[WHERE <expr>]

Basic Usage

-- View all partitions
SHOW PARTITIONS sales_table;

-- View partition details
SHOW PARTITIONS EXTENDED sales_table;

-- View specific partition
SHOW PARTITIONS sales_table PARTITION (pt1 = '2023');

-- Multi-level partition filter
SHOW PARTITIONS sales_table PARTITION (pt1 = '2023', pt2 = '01');

-- Limit results
SHOW PARTITIONS EXTENDED sales_table LIMIT 10;

Advanced Usage: Partition Health Check Tool

Important Limitation: SHOW PARTITIONS does not support ORDER BY clause. Use WITH subquery if sorting is needed.

-- Partition health check
SHOW PARTITIONS EXTENDED table_name WHERE bytes > 100*1024*1024;

-- Partition info view (supports LIMIT)
SHOW PARTITIONS EXTENDED table_name LIMIT 10;

-- Partition sorting (via WITH subquery)
WITH partition_info AS (
    SELECT partitions, bytes, total_rows, total_files, created_time
    FROM (SHOW PARTITIONS EXTENDED table_name)
)
SELECT * FROM partition_info ORDER BY CAST(bytes AS BIGINT) DESC LIMIT 10;

MAX_PT Function - Get Latest Partition

-- MAX_PT function: Gets the maximum partition value from a partitioned table
-- Syntax: max_pt('schema_name.table_name' | 'table_name')

-- Basic usage: Query data from the latest partition
SELECT * FROM sales_table WHERE pt = max_pt('sales_table');

-- Cross-schema usage
SELECT max_pt('prod_schema.sales_table');

-- Practical application scenarios:
-- 1. Incremental data processing: Always process the latest partition
INSERT INTO target_table 
SELECT * FROM source_table 
WHERE pt = max_pt('source_table');

-- 2. Data quality check: Check data quality of the latest partition
SELECT COUNT(*), AVG(amount), MAX(created_time)
FROM orders 
WHERE pt = max_pt('orders');

Migration Value of MAX_PT Function:

  • MaxCompute Users: Direct replacement for the original max_pt function, no query logic changes needed
  • Hive Users: Simplify complex max partition queries, improving development efficiency
  • Other Platform Users: Provides a convenient way to query the latest data

Advanced Partition Table Creation Guide

Partition + Bucket + Sort Combinations

-- Option 1: Partition + Bucket (recommended for hash distribution)
CREATE TABLE events_clustered (
    user_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
  CLUSTERED BY (user_id) INTO 32 BUCKETS;

-- Option 2: Partition + Sort (recommended for range queries)
CREATE TABLE events_sorted (
    user_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
  SORTED BY (event_type);

CREATE TABLE events_both
(
    user_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
)
 PARTITIONED BY (days(event_time))
CLUSTERED BY (user_id) 
SORTED BY (event_type)
INTO 32 BUCKETS;

bucket Function Usage Guide

Parameter Range and Recommendations:

-- Recommended bucket count range
CREATE TABLE sales PARTITIONED BY (
    days(sale_date),
    bucket(10, user_id)    -- Recommended: reasonable range of 1-1000
);

-- bucket count selection guide:
-- 1-10 buckets: Suitable for small data volume tables (< 1 million rows)
-- 10-100 buckets: Suitable for medium data volume tables (1-10 million rows)
-- 100-1000 buckets: Suitable for large data volume tables

Complex Data Type Support

-- Lakehouse supports partitioning for all modern data types
CREATE TABLE modern_table (
    user_id INT,
    user_profile STRUCT<name: STRING, age: INT, location: STRING>,
    tags ARRAY<STRING>,
    metadata MAP<STRING, STRING>,
    config JSON,
    created_date DATE,
    created_timestamp TIMESTAMP_LTZ,
    created_timestamp_ntz TIMESTAMP_NTZ
) PARTITIONED BY (days(created_date));

-- Supported partition field types:
-- Basic types: INT, BIGINT, STRING, DATE, TIMESTAMP_LTZ, TIMESTAMP_NTZ
-- Transform partitions: years(), months(), days(), hours(), bucket()
-- Not supported: STRUCT, ARRAY, MAP, JSON as direct partition fields

Special Cases for Partition Values

-- NULL value partition handling
CREATE TABLE user_regions (
    user_id INT,
    region STRING  -- Allows NULL values
) PARTITIONED BY (region);

INSERT INTO user_regions VALUES (1, NULL);
-- Result: Creates region=NULL partition

-- Special character support
INSERT INTO user_regions VALUES 
(2, 'beijing'),           -- Normal characters
(3, 'shang-hai'),         -- Supports hyphens
(4, 'guang_zhou'),        -- Supports underscores  
(5, 'xi an'),             -- Supports spaces
(6, 'very_long_city_name_with_many_characters');  -- Supports long strings

-- View partition results
SHOW PARTITIONS user_regions;
-- Result display: region=NULL, region=beijing, region=shang-hai, etc.

-- Special character support summary:
-- Supported: letters, numbers, underscores, hyphens, spaces, Chinese characters
-- Length: Supports very long partition values (tested with 100+ characters)
-- Note: Avoid special symbols like @#$% although they may be supported, it's not recommended

Complex Partition Migration Strategies

Multi-Level Partition Architecture Migration Challenges

If you implemented complex partition strategies on your original platform, you will encounter architectural limitations and design choices when migrating to the Lakehouse.

Migration Challenges for Composite Dimension Partitions

Composite Partitions on the Original Platform:

-- Common composite partitions in MaxCompute/Hive
CREATE TABLE sales PARTITIONED BY (
    dt STRING,           -- Date: 20240601
    region STRING,       -- Region: beijing, shanghai
    channel STRING       -- Channel: online, offline
);

-- Partition count = number of dates x number of regions x number of channels
-- Example: 365 x 10 x 3 = 10,950 partitions/year

Problems with Direct Migration:

Directly copying composite partitions will result in too many partitions. The recommended approach is to reduce partition dimensions by combining time partitions with indexes.

Strategy 1: Simplified Partition + Index

-- Simplified from 3-level partitions to 1 time partition + 2 indexes
CREATE TABLE sales (
    sale_id INT,
    amount DOUBLE,
    region STRING,
    channel STRING,
    sale_date DATE
) PARTITIONED BY (days(sale_date));
-- Partition count per year: ~365

-- Create index optimizations for non-partition query columns
CREATE BLOOMFILTER INDEX idx_region ON TABLE sales(region);
CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales(channel);

-- Query with partition pruning + index acceleration
SELECT * FROM sales 
WHERE sale_date >= '2024-06-01' 
  AND sale_date <= '2024-06-30'  -- Partition pruning
  AND region = 'beijing'                     -- Index acceleration
  AND channel = 'online';                    -- Index acceleration

Strategy 2: Hash Partition Redesign

-- Use hash function to reduce partition count
CREATE TABLE sales (
    sale_id INT,
    amount DOUBLE,
    region STRING,
    channel STRING,
    sale_date DATE
) PARTITIONED BY (
    days(sale_date),           -- Time partition
    bucket(10, region)         -- Region hashed into 10 buckets
);

-- Partition count = 365 x 10 = 3,650 partitions/year (manageable)

Partition Evolution and Maintenance Strategy Migration

Differences in Dynamic Partition Management

Partition Management on Original Platform:

-- Partition management in Hive
-- 1. Dynamic partition auto-creation
INSERT OVERWRITE TABLE target PARTITION(dt, region)
SELECT ..., dt, region FROM source;  -- Auto-creates all dt x region combinations

-- 2. Partition repair
MSCK REPAIR TABLE target;  -- Auto-discover new partitions

-- 3. Partition deletion
ALTER TABLE target DROP PARTITION (dt<'20240101');  -- Batch deletion

Equivalent Implementation in Lakehouse:

-- 1. Dynamic partition creation (pay attention to partition count)
-- Recommended to check partition count first
SELECT COUNT(DISTINCT CONCAT(dt, '/', region)) FROM source;

-- Process in batches (if count is large)
INSERT INTO target 
SELECT ..., dt, region FROM source 
WHERE dt BETWEEN '20240601' AND '20240615';

-- 2. Partition discovery (automatic, no manual repair needed)
-- Lakehouse automatically manages partition metadata

-- 3. Partition cleanup (more powerful features)
TRUNCATE TABLE target PARTITION (dt = '20240101');

Partition Lifecycle Management Migration

MaxCompute's Automatic Lifecycle:

-- Setting table-level lifecycle in MaxCompute
ALTER TABLE events SET LIFECYCLE 90;  -- Auto-delete after 90 days

-- Partition-level lifecycle
ALTER TABLE events PARTITION(dt='20240101') SET LIFECYCLE 30;  -- Single partition 30 days

Partition Cleanup Features in Lakehouse:

-- Basic partition cleanup (for STRING partitions)
TRUNCATE TABLE events PARTITION (dt = '20240101');

-- Transform partition cleanup (requires specific partition value)
TRUNCATE TABLE events_with_days PARTITION (days(event_date) = 19875);

-- Advanced feature: Conditional filter cleanup
-- Delete all partitions older than 90 days (need to calculate specific partition values first)
-- For STRING partitions:
TRUNCATE TABLE events 
PARTITION (dt < date_format(date_sub(current_date(), 90), 'yyyyMMdd'));

-- Composite condition cleanup: Delete data for specific date and region
TRUNCATE TABLE sales 
PARTITION (days(sale_date) = 19875 AND region = 'beijing');

-- Batch partition cleanup: Clean multiple partitions simultaneously
TRUNCATE TABLE logs 
PARTITION (days(log_date) = 19875), 
PARTITION (days(log_date) = 19876);

Partition Cleanup Best Practices:

-- 1. Check partitions to delete before cleanup
SHOW PARTITIONS EXTENDED table_name 
WHERE dt < '2024-01-01';

-- 2. Clean large numbers of partitions in stages (avoid long table locks)
-- For STRING partitions:
TRUNCATE TABLE large_table 
PARTITION (dt = '20230101');
-- Then continue cleaning the next day's data

-- For transform partitions, query the partition value first:
-- SELECT days('2023-01-01'); -- Get the specific partition value
TRUNCATE TABLE large_table_with_days
PARTITION (days(date_col) = specific_partition_value);

-- 3. Periodic cleanup scheduling script example
-- Execute daily at 2 AM, clean partitions older than 30 days (STRING partition)
TRUNCATE TABLE daily_logs 
PARTITION (dt < date_format(date_sub(current_date(), 30), 'yyyyMMdd'));

Performance Optimization Strategy Migration

Equivalent Implementation of Z-Order Optimization

Databricks Delta Lake Z-Order:

-- Multi-dimensional optimization in Delta Lake
OPTIMIZE events ZORDER BY (user_id, event_type, timestamp);
-- Achieves joint optimization of multiple fields

Equivalent Strategies in Lakehouse:

-- Strategy 1: Partition + Bucket combination (recommended)
CREATE TABLE events (
    user_id INT,
    event_type STRING,
    event_data JSON,
    timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp))    -- Time partition
  CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- User hash bucketing

-- Strategy 2: Partition + Sort combination
CREATE TABLE events_sorted (
    user_id INT,
    event_type STRING,
    event_data JSON,
    timestamp TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp))    -- Time partition
  SORTED BY (event_type);             -- Event type sort

-- Strategy 3: Rewrite optimization (similar to OPTIMIZE)
INSERT OVERWRITE events 
SELECT * FROM events 
ORDER BY user_id, event_type, timestamp;  -- Manual data reordering

Partition Pruning Optimization Migration

Original Platform Partition Pruning Logic:

-- Complex partition filtering in Spark
df.filter(
    (col("year") >= 2024) & 
    (col("month") >= 6) & 
    (col("day") >= 1)
)

Equivalent Query in Lakehouse:

-- In Lakehouse, transform to range query on a single partition dimension
SELECT * FROM events
WHERE event_date >= '2024-06-01' 
  AND event_date <= '2024-06-30'  -- month range filtering
  AND region = 'beijing';                  -- region = "beijing"

-- Key: Transform complex partition logic into simple time range queries

Platform-Specific Migration Notes

Hive Users: Special Attention Required

Syntax Compatibility Differences

-- Hive syntax is still supported in Lakehouse
CREATE TABLE hive_style (
    order_id INT,
    amount DOUBLE
) PARTITIONED BY (dt STRING, region STRING);  -- Fully compatible

-- But note ADD COLUMN position issues
-- Hive: New column added to the last position before partition columns
-- Lakehouse: Recommend explicitly specifying column position
ALTER TABLE hive_style ADD COLUMN new_col STRING AFTER amount;  -- Explicit position

Dynamic Partition Configuration Differences

-- Configurations needed in Hive are not needed in Lakehouse
-- set hive.exec.dynamic.partition=true;           -- Not needed in Lakehouse
-- set hive.exec.dynamic.partition.mode=nonstrict; -- Not needed in Lakehouse

-- But pay attention to partition count management
-- Hive: hive.exec.max.dynamic.partitions=1000 (adjustable)
-- Lakehouse: Recommend controlling single-operation partition count within reasonable range

Additional Real Pain Points for Hive Users

-- Additional issues commonly encountered by Hive users:
-- 1. Partition field type restrictions
CREATE TABLE hive_table PARTITIONED BY (dt STRING); -- Hive partition fields typically must be STRING

-- 2. Partition directory structure dependency
-- Hive strictly depends on directory structures like /data/table/year=2024/month=06/day=01/
-- No such restriction in Lakehouse, more flexible

-- 3. Frequent need for partition repair
-- Hive: MSCK REPAIR TABLE table_name;  -- Often needs manual repair
-- Lakehouse: Auto-maintains metadata, no manual repair needed

Spark Users: Special Attention Required

DataFrame Write Method Differences

-- Common Spark DataFrame write approach
-- df.write.mode("overwrite").partitionBy("date", "region").saveAsTable("table")

-- Things to note when migrating to Lakehouse SQL
-- 1. Ensure the table is already created and correctly partitioned
CREATE TABLE spark_migrated PARTITIONED BY (date STRING, region STRING);

-- 2. Use INSERT statements instead of saveAsTable
INSERT OVERWRITE spark_migrated SELECT * FROM source_data;

Transform Function Name Mapping

Spark FunctionLakehouse FunctionNotes
year(col)years(col)Note plural form
month(col)months(col)Note plural form
dayofyear(col)days(col)Different function name
hour(col)hours(col)Note plural form

Additional Challenges for Spark Users

// Issues more commonly encountered by Spark users:
// 1. Partition discovery issues
spark.sql("MSCK REPAIR TABLE table_name")  // Spark also has this issue

// 2. Performance pitfalls of dynamic partition writes
df.write.mode("append")
  .option("maxRecordsPerFile", "50000")  // Control file size
  .partitionBy("date")
  .saveAsTable("table")

// 3. Partition column auto-inference type issues
df.write.partitionBy($"date".cast("string"))  // Must cast to string

MaxCompute Users: Special Attention Required

Partition Usage Habit Adjustments

-- MaxCompute enforces partition conditions
-- SELECT * FROM table WHERE pt='20240601';  -- Must include partition condition, otherwise error

-- In Lakehouse, partition conditions are automatic
SELECT * FROM table WHERE order_date='2024-06-01';  -- Automatic partition pruning, more flexible

-- However, it is still recommended to include partition conditions in queries for optimal performance

Lifecycle Management Differences

-- MaxCompute's automatic lifecycle
-- ALTER TABLE table SET LIFECYCLE 30;  -- Auto-delete after 30 days

-- In Lakehouse, manual management or scheduling is needed
TRUNCATE TABLE table PARTITION (order_date = '2024-05-01');  -- Manual cleanup

Real Pain Points for MaxCompute Users

-- Issues most commonly encountered by MaxCompute users:
-- 1. Forced partition filter habit
-- MaxCompute: Queries without partition conditions will error directly
-- Lakehouse: Allows full table scan, but recommend including partition conditions

-- 2. INSERT OVERWRITE syntax differences for partitioned tables
-- MaxCompute: INSERT OVERWRITE TABLE target PARTITION(dt='20240601')
-- Lakehouse: INSERT OVERWRITE target ... (automatically identifies partitions)

-- 3. Cross-project access syntax changes
-- MaxCompute: SELECT * FROM project.table WHERE pt='20240601';
-- Lakehouse: SELECT * FROM catalog.schema.table WHERE pt='20240601';

Snowflake Users: Special Attention Required

Major Mindset Shift: From Automatic Optimization to Proactive Partition Design

Traditional Snowflake User Habits: You may not be highly concerned with underlying partition design

-- Typical usage pattern in traditional Snowflake
CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE);  -- System auto-manages storage
ALTER TABLE orders CLUSTER BY (order_date);  -- Set clustering key, system auto-manages micro-partitions

-- No need to consider partitions in queries
SELECT * FROM orders WHERE amount > 1000;

Modern Snowflake Users (Iceberg Tables): Syntax is largely similar but with subtle differences

-- Snowflake Iceberg table
CREATE ICEBERG TABLE orders_iceberg PARTITION BY (year(order_date));

-- Corresponding syntax in Lakehouse
CREATE TABLE orders_lakehouse PARTITIONED BY (years(order_date));  -- Note plural form

Adjustments When Migrating to Lakehouse:

-- Traditional Snowflake thinking may be suboptimal in Lakehouse
CREATE TABLE orders (id INT, amount DOUBLE, order_date DATE);  -- Created a regular table, not a partitioned table
-- Result: Query performance may be less than ideal

-- Learn to proactively design partition strategies
CREATE TABLE orders (
    id INT, 
    amount DOUBLE,
    order_date DATE
) PARTITIONED BY (days(order_date));  -- Proactive partition design

-- Although pruning is automatic, partition design directly affects performance
SELECT * FROM orders 
WHERE order_date >= '2024-06-01'  -- Partition condition (recommended to include)
  AND amount > 1000;              -- Business condition

Snowflake Users' Learning Path

-- Phase 1: Understand the value of partitioning
-- Partition = Physically storing data separately by a field
-- Purpose: Only scan relevant partitions during queries, not the entire table

-- Phase 2: Learn partition design
-- Ask yourself: What field do my queries most commonly filter on?
-- Time fields: order_date, created_at, updated_at
-- Business fields: region, department, customer_type

-- Phase 3: Verify partition effectiveness
-- Compare query performance of partitioned vs non-partitioned tables

Cognitive Focus Points for Snowflake Users

-- Key understanding points for Snowflake users:
-- 1. Not all databases auto-optimize
SELECT * FROM large_table WHERE complex_condition;  -- Need to consider partition design

-- 2. Importance of partition design
-- Snowflake micro-partitions are automatic, but in Lakehouse proactive design is needed

-- 3. File system optimization concepts
-- Understand concepts like "small file problem", "partition count control"

-- 4. Query optimization awareness
SELECT * FROM large_partitioned_table 
WHERE order_date >= '2024-06-01';  -- Habit of including partition conditions in queries

Databricks Users: Special Attention Required

Delta Lake vs Iceberg Differences

-- Delta Lake partition syntax
-- CREATE TABLE delta_table PARTITIONED BY (year, month);

-- Equivalent syntax in Lakehouse (Iceberg)
CREATE TABLE iceberg_table PARTITIONED BY (years(date_col));  -- Can only use single time granularity

-- Cannot simultaneously use multiple time granularities like in Delta Lake

Optimization Command Differences

-- Delta Lake optimization commands
-- OPTIMIZE table_name;
-- OPTIMIZE table_name ZORDER BY (col1, col2);

-- Similar effect achieved in Lakehouse through rewrite
INSERT OVERWRITE table_name SELECT * FROM table_name;  -- File merge optimization

Advanced Feature Migration for Databricks Users

-- Issues more commonly encountered by Databricks users:
-- 1. Delta Lake time travel habits
-- Delta: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00';
-- Lakehouse: SELECT * FROM table TIMESTAMP AS OF '2024-01-01 00:00:00';  -- Similar syntax

-- 2. Heavy reliance on OPTIMIZE and Z-ORDER
-- Delta: OPTIMIZE table ZORDER BY (col1, col2);  -- This is a daily operation
-- Lakehouse: Need to combine partition + sort + bucket for equivalent effect

-- 3. Unity Catalog impact
-- Delta: CREATE TABLE catalog.schema.table;  -- Three-level namespace habit
-- Lakehouse: CREATE TABLE schema.table;  -- Two-level namespace

Practical Pitfall Avoidance Guide

Partition Creation Pitfalls

Do's and Don'ts Comparison

-- Wrong approach: Unclear partition design
CREATE TABLE bad_partition (
    id INT,
    data STRING,
    timestamp_col TIMESTAMP_LTZ
);  -- No PARTITIONED BY, not a partitioned table

INSERT INTO bad_partition VALUES (1, 'test', CURRENT_TIMESTAMP());
-- Later finding queries are slow and wanting to add partitions is too late

-- Correct approach: Plan partition strategy in advance
CREATE TABLE good_partition (
    id INT,
    data STRING,
    timestamp_col TIMESTAMP_LTZ
) PARTITIONED BY (days(timestamp_col));

-- Immediately verify partition was created correctly
SHOW PARTITIONS good_partition;  -- Should execute normally without errors

Partition Field Type Selection

-- Easy pitfall in type selection
CREATE TABLE date_type_trap (
    id INT,
    order_date DATE  -- DATE type
) PARTITIONED BY (order_date);

-- May get type errors when inserting data
INSERT INTO date_type_trap VALUES (1, '2024-06-01');  -- String vs DATE type

-- Recommended approach: Use STRING type for partitioning
CREATE TABLE string_partition (
    id INT,
    order_date_str STRING  -- Use STRING type to avoid type conversion issues
) PARTITIONED BY (order_date_str);

-- Or use transform partitions
CREATE TABLE transform_partition (
    id INT,
    order_date DATE
) PARTITIONED BY (days(order_date));  -- Let the system handle type conversion

Data Write Pitfalls

Large Batch Write Strategy

-- For large batch writes, recommend processing in batches to avoid too many partitions at once
INSERT INTO target_table 
SELECT * FROM source_table 
WHERE partition_col >= 'start_value' 
  AND partition_col <= 'end_value';

Partition Data Consistency

-- Note: Timezone issues with transform partitions
CREATE TABLE timezone_sensitive (
    id INT,
    event_time TIMESTAMP_LTZ  -- Timezone-aware timestamp
) PARTITIONED BY (days(event_time));

-- Same local time in different timezones may land in different partitions
INSERT INTO timezone_sensitive VALUES 
(1, TIMESTAMP '2024-06-01 23:30:00 UTC'),      -- UTC timezone
(2, TIMESTAMP '2024-06-01 23:30:00');          -- System default timezone

-- View partition distribution
SHOW PARTITIONS timezone_sensitive;
-- May see two different partition values: days(event_time)=19875 and days(event_time)=19876

Query Performance Pitfalls

Scenarios Where Partition Pruning Fails

-- Query that cannot leverage partition pruning
SELECT * FROM partitioned_table 
WHERE YEAR(order_date) = 2024;  -- Function wrapping partition field

SELECT * FROM partitioned_table 
WHERE order_date LIKE '2024%';  -- Fuzzy matching

-- Queries that effectively leverage partition pruning
SELECT * FROM partitioned_table 
WHERE order_date >= '2024-06-01' 
  AND order_date <= '2024-06-30';  -- Range query on partition field

Quick Partition Troubleshooting

Problem: Partition table performs worse than non-partitioned table

Investigation Commands:

-- 1. Check partition count and size distribution
SHOW PARTITIONS EXTENDED your_table;

-- 2. Check for excessive small partitions
SHOW PARTITIONS EXTENDED your_table WHERE bytes < 10*1024*1024;

-- 3. Check if queries are utilizing partition pruning
EXPLAIN SELECT * FROM your_table WHERE partition_col = 'value';

Common Causes and Solutions:

  • Over-partitioning (too many small partitions) -> Redesign partition granularity
  • Query doesn't include partition field -> Optimize query WHERE conditions
  • Not actually a partitioned table -> Rebuild with native SQL

Partition Performance Verification Practical Guide

Standard Method for Partition Effectiveness Verification

1. Create Comparison Test

-- Create partitioned and non-partitioned tables with the same data
CREATE TABLE sales_partitioned (
    id INT, amount DOUBLE, sale_date DATE
) PARTITIONED BY (days(sale_date));

CREATE TABLE sales_normal (
    id INT, amount DOUBLE, sale_date DATE
);

-- Insert the same large data volume for performance comparison testing

2. Performance Benchmark Test

-- Test partition query vs full table scan
SELECT COUNT(*) FROM sales_partitioned 
WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30';

SELECT COUNT(*) FROM sales_normal 
WHERE sale_date >= '2024-06-01' AND sale_date <= '2024-06-30';

3. Partition Health Check

-- Check partition size distribution
SHOW PARTITIONS EXTENDED table_name;

-- Ideal state:
-- Each partition 128MB - 1GB
-- Partition sizes relatively uniform
-- No large number of partitions smaller than 10MB
-- No single partition exceeding 5GB

Performance Issue Diagnosis

Partition Table Slower Than Non-Partitioned Table?

Possible Causes and Solutions:

  1. Over-partitioning: Partition is too granular, metadata overhead is high

    -- Problem: Hourly partitions lead to many small partitions
    PARTITIONED BY (hours(timestamp))
    
    -- Solution: Change to day-level partitions
    PARTITIONED BY (days(timestamp))
  2. Query not utilizing partitions: WHERE condition doesn't include partition field

    -- Cannot leverage partitions
    SELECT * FROM partitioned_table WHERE amount > 1000;
    
    -- Utilize partitions
    SELECT * FROM partitioned_table 
    WHERE sale_date >= '2024-06-01' AND amount > 1000;
  3. Wrong partition field selection: Partition field is not the query hotspot

    -- Problem analysis: Check query patterns
    -- If queries mainly filter by user_id but partition is by date, it's ineffective
    
    -- Solution: Redesign partition strategy
    PARTITIONED BY (bucket(100, user_id))  -- Partition by user hash instead

Partition Tuning in Practice

1. Partition Granularity Selection

Data VolumeRecommended Granularity
> 1 billion rowsDay-level partitions days()
100M - 1B rowsMonth-level partitions months()
< 10M rowsMonth-level partitions months()

2. Composite Partition Optimization

-- Original composite partition problem
CREATE TABLE sales_old PARTITIONED BY (region, sale_date, channel);
-- Problem: 10 regions x 365 days x 3 channels = 10,950 partitions

-- Optimization 1: Primary-secondary partitioning
CREATE TABLE sales_optimized (
    ..., region STRING, channel STRING
) PARTITIONED BY (days(sale_date));  -- Primary partition: time
CREATE BLOOMFILTER INDEX idx_region ON TABLE sales_optimized(region);
CREATE BLOOMFILTER INDEX idx_channel ON TABLE sales_optimized(channel);

-- Optimization 2: Hash compression
CREATE TABLE sales_hash PARTITIONED BY (
    days(sale_date),           -- Time partition: 365
    bucket(5, region)          -- Region hashed to 5 buckets
);
-- Result: 365 x 5 = 1,825 partitions (manageable)

3. Partition Evolution Strategy

-- Periodic partition health check
WITH partition_health AS (
    SELECT 
        partitions,
        total_rows,
        bytes,
        CASE 
            WHEN CAST(bytes AS BIGINT) < 10*1024*1024 THEN 'TOO_SMALL'
            WHEN CAST(bytes AS BIGINT) > 5*1024*1024*1024 THEN 'TOO_LARGE'
            ELSE 'HEALTHY'
        END AS health_status
    FROM (SHOW PARTITIONS EXTENDED target_table)
)
SELECT health_status, COUNT(*) as count
FROM partition_health
GROUP BY health_status;

Migration Case Studies

Case 1: Real-Time Data Table Migration (Complex Partition -> Simplified Partition)

Original Complex Partition Table

-- Original table: Multi-level time + business dimension composite partitions
CREATE TABLE events_old PARTITIONED BY (year, month, day, hour, event_type);
-- Problem: Too many partitions, complex queries, high maintenance cost

Lakehouse Migration Strategy

-- Redesign: Single time granularity partition + index
CREATE TABLE events_new (
    event_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (days(event_time))
  SORTED BY (event_type);

-- Query becomes cleaner
SELECT * FROM events_new 
WHERE event_time >= '2024-06-01 08:00:00' 
  AND event_time <= '2024-06-01 17:59:59';

-- Performance improvement:
-- Query speed: 40% improvement (avoids small partition scanning)
-- Maintenance cost: 60% reduction (partition count significantly reduced)
-- Query complexity: 80% reduction (no need to calculate year, month, day, hour)

Case 2: Log Analysis Table Migration (MaxCompute -> Singdata Lakehouse)

Original MaxCompute Table

-- Original MaxCompute table: Strict partition restrictions
CREATE TABLE app_logs (
    user_id STRING,
    event_type STRING,
    event_data STRING
) PARTITIONED BY (
    pt STRING,          -- Format: 20240601  
    region STRING,      -- Region: beijing, shanghai
    app_version STRING  -- Version: 1.0, 1.1, 1.2
);

-- MaxCompute characteristics:
-- Enforced partition conditions: SELECT must include WHERE pt='20240601'
-- Partition count explosion: 365 x 10 x 20 = 73,000/year
-- Data skew: Beijing region has much more data, other regions very little

Singdata Lakehouse Migration Strategy

-- Step 1: Analyze original query patterns
-- Discovery: 90% of queries filter by time range + region
-- Decision: Time as primary partition, region with index

-- Step 2: Redesign partition architecture
CREATE TABLE app_logs_new (
    user_id STRING,
    event_type STRING,
    event_data STRING,
    region STRING,        -- No partition, use index
    app_version STRING,   -- No partition, use index
    log_date DATE
) PARTITIONED BY (days(log_date));  -- Only partition by time

-- Step 3: Create indexes to optimize non-partition queries
CREATE BLOOMFILTER INDEX idx_region ON TABLE app_logs_new(region);
CREATE BLOOMFILTER INDEX idx_version ON TABLE app_logs_new(app_version);

-- Step 4: Verify query performance
-- Original query:
SELECT * FROM app_logs WHERE pt='20240601' AND region='beijing';

-- New query:
SELECT * FROM app_logs_new 
WHERE log_date='2024-06-01' AND region='beijing';
-- Result: Comparable performance, but partition count reduced from 73,000 to 365

Migration Process Challenges

-- Challenge 1: Inconsistent data types
-- MaxCompute: pt STRING '20240601'
-- Singdata Lakehouse: log_date DATE '2024-06-01'

-- Solution: ETL conversion script
INSERT INTO app_logs_new 
SELECT 
    user_id,
    event_type, 
    event_data,
    region,
    app_version,
    DATE(CONCAT(
        SUBSTR(pt, 1, 4), '-',
        SUBSTR(pt, 5, 2), '-', 
        SUBSTR(pt, 7, 2)
    )) as log_date
FROM maxcompute_source;

-- Challenge 2: Habitual syntax adjustments needed
-- MaxCompute habit: WHERE pt='20240601' (string exact match)
-- Singdata Lakehouse: WHERE log_date='2024-06-01' (auto partition pruning)

-- Challenge 3: Max partition query method changes
-- MaxCompute original: WHERE pt = max_pt()
-- Singdata Lakehouse new method: WHERE log_date = (SELECT DATE(max_pt('app_logs_new')))
-- Or redesign as STRING partition, using directly: WHERE pt = max_pt('app_logs_new')

Partition Maintenance Automation

-- MaxCompute-style automation script (adapted for new platform)
-- 1. Daily cleanup of partitions older than 30 days
TRUNCATE TABLE app_logs_new 
PARTITION (log_date < current_date() - INTERVAL '30' DAY);

-- 2. Smart cleanup: Keep latest partitions, clean small partitions
WITH old_partitions AS (
    SELECT partitions
    FROM (SHOW PARTITIONS EXTENDED app_logs_new)
    WHERE CAST(bytes AS BIGINT) < 1000000  -- Partitions smaller than 1MB
      AND partitions < max_pt('app_logs_new') - INTERVAL '7' DAY
)
-- Clean small partitions one by one (note: need specific partition values)

-- 3. Partition health check script
WITH partition_stats AS (
    SELECT 
        partitions,
        CAST(total_rows AS BIGINT) as rows,
        CAST(bytes AS BIGINT) as size_bytes
    FROM (SHOW PARTITIONS EXTENDED app_logs_new)
)
SELECT 
    COUNT(*) as total_partitions,
    AVG(size_bytes)/1024/1024 as avg_size_mb,
    SUM(CASE WHEN size_bytes < 10*1024*1024 THEN 1 ELSE 0 END) as small_partitions,
    max_pt('app_logs_new') as latest_partition
FROM partition_stats;

Case 3: Real-Time Data Table Migration (Spark -> Singdata Lakehouse)

Original Spark Delta Table

-- Original Spark table: Hourly partitions + Z-Order optimization
CREATE TABLE user_events USING DELTA
PARTITIONED BY (date_hour STRING)  -- Format: 2024060109
OPTIONS (
  'path' '/data/user_events'
);

-- Periodic optimization
OPTIMIZE user_events ZORDER BY (user_id, event_type);

Singdata Lakehouse Migration Challenges

-- Challenge 1: No direct Z-Order equivalent
-- Challenge 2: Hourly partitions may be too granular
-- Challenge 3: High real-time write performance requirements

-- Solution 1: Partition + Bucket combination (recommended)
CREATE TABLE user_events_new (
    user_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (hours(event_time))      -- Keep hourly partitions (real-time requirement)
  CLUSTERED BY (user_id) INTO 32 BUCKETS; -- User ID bucketing

-- Solution 2: Partition + Sort combination
CREATE TABLE user_events_sorted (
    user_id INT,
    event_type STRING,
    event_data JSON,
    event_time TIMESTAMP_LTZ
) PARTITIONED BY (hours(event_time))      -- Keep hourly partitions
  SORTED BY (event_type);                 -- Event type sort

-- Achieving similar Z-Order effect:
-- 1. Partition: Physical isolation by time
-- 2. Bucket: Hash distribution by user ID  
-- 3. Sort: Clustered storage by event type

Performance Tuning Process

-- Tuning 1: Monitor partition sizes
WITH partition_analysis AS (
    SELECT 
        partitions,
        CAST(bytes AS BIGINT)/1024/1024 as size_mb,
        CAST(total_rows AS BIGINT) as row_count
    FROM (SHOW PARTITIONS EXTENDED user_events_new)
)
SELECT * FROM partition_analysis 
ORDER BY size_mb DESC LIMIT 10;

-- Discovery: Night-time partitions too small (< 10MB)
-- Solution: Dynamically adjust partition strategy

-- Tuning 2: Hourly partitions during day, merged at night
-- Implement smart partition strategy through ETL:
-- 8:00-22:00 peak hours: Hourly partitions
-- 22:00-8:00 off-peak: Merge to daily partitions

Case 4: Data Warehouse Migration (Snowflake -> Singdata Lakehouse)

Special Challenges for Snowflake Users

-- Traditional Snowflake: Automatic storage management
CREATE TABLE sales (
    order_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    order_date DATE
);

-- Set clustering key (relatively transparent to user)
ALTER TABLE sales CLUSTER BY (order_date);

-- Queries: No need to worry about physical storage
SELECT * FROM sales WHERE amount > 1000;

Singdata Lakehouse Learning Path

-- Phase 1: Understand partition concepts
-- Question: What is partitioning? Why is it needed?
-- Answer: Partitioning physically stores data separately by a field, scanning only relevant partitions during queries

-- Phase 2: Learn partition design
-- Question: Which field should I partition by?
-- Analysis: Check the most commonly used WHERE conditions
SELECT 
    COUNT(*) as query_count,
    'order_date filter' as filter_type
FROM query_log 
WHERE query_text LIKE '%WHERE%order_date%'
UNION ALL
SELECT 
    COUNT(*),
    'customer_id filter' 
FROM query_log 
WHERE query_text LIKE '%WHERE%customer_id%';

-- Result: order_date filtering accounts for 80%, customer_id for 20%
-- Decision: Partition by order_date, use index for customer_id

-- Phase 3: Correct partition table design
CREATE TABLE sales_partitioned (
    order_id INT,
    customer_id INT,
    amount DECIMAL(10,2),
    order_date DATE
) PARTITIONED BY (days(order_date));

CREATE BLOOMFILTER INDEX idx_customer ON TABLE sales_partitioned(customer_id);

-- Phase 4: Verify performance improvement
-- Comparative queries:
SELECT * FROM sales WHERE order_date = '2024-06-01';           -- Full table scan
SELECT * FROM sales_partitioned WHERE order_date = '2024-06-01'; -- Partition scan

-- Result: Partitioned tables typically show significant performance improvement

Common Adjustments and Real Confusions for Snowflake Users

-- Adjustment 1: Create partitioned tables instead of regular tables
CREATE TABLE sales_correct (
    order_id INT,
    amount DECIMAL(10,2),
    order_date DATE
) PARTITIONED BY (days(order_date));  -- This is a partitioned table

-- Adjustment 2: Include partition conditions in queries
SELECT * FROM sales_partitioned 
WHERE order_date >= '2024-06-01'  -- Leverage partition
  AND amount > 1000;              -- Business filter

-- Adjustment 3: Understand the importance of partition fields
SELECT * FROM sales_partitioned WHERE amount > 1000;  -- Can be optimized to include partition conditions

-- Adjustment 4: Not all queries will automatically be fast
-- Snowflake users are accustomed to the system handling all storage optimization automatically
-- Lakehouse requires more proactive design awareness

The Actual Process of Mindset Shift

### Common Questions from Snowflake Users:
"Why are my queries fast on Snowflake but need partition consideration on Lakehouse?"

**Answer**: Snowflake's automatic optimization vs Lakehouse's proactive partition design each have their advantages

### Typical Issues During the Learning Process:
1. "Why should I decide the partition strategy?"
2. "What is the small file problem? I never considered file sizes before"
3. "What does partition count control mean?"

### Breakthrough Moment:
When users see significant performance advantages of partitioned tables over non-partitioned ones, they begin to understand the value of partitioning

Migration Verification Checklist

Partition Creation Verification

-- Checkpoint 1: Table is indeed a partitioned table
SHOW PARTITIONS your_table_name;
-- Should execute normally, not reporting "not a partitioned table" error

-- Checkpoint 2: Partition field types are correct
DESCRIBE TABLE your_table_name;
-- Confirm partition field data types match expectations

-- Checkpoint 3: Test data write
INSERT INTO your_table_name VALUES (test_data);
SHOW PARTITIONS your_table_name;
-- Should see newly created partitions

-- Checkpoint 4: Verify partition value format
-- Transform partitions generate numeric values, e.g. days('2024-06-01') = 19875
-- years('2024-06-01') = 54 (year count from 1970)
-- Confirm partition values meet expectations

-- Checkpoint 5: Test max partition retrieval (compatibility verification)
SELECT max_pt('your_table_name');
-- Should return the maximum partition value for compatibility with similar original platform features

Performance Verification

-- Checkpoint 6: Partition pruning is effective
-- Compare execution time of these two queries
SELECT COUNT(*) FROM your_table_name;  -- Full table scan
SELECT COUNT(*) FROM your_table_name WHERE partition_col = 'specific_value';  -- Partition query

-- Partition query should be noticeably faster

-- Checkpoint 7: Partition sizes are reasonable
SHOW PARTITIONS EXTENDED your_table_name;
-- Check each partition's size, ideal range: 128MB - 1GB

Compatibility Verification

-- Checkpoint 8: Whether original queries need modification
-- Test original platform queries in Lakehouse
-- Pay special attention to:
-- 1. Whether transform function names need adjustment (year -> years)
-- 2. Whether partition conditions can be automatically recognized
-- 3. Whether data type conversions work correctly

-- Checkpoint 9: Whether partition evolution strategies are feasible
-- Verify partition cleanup and maintenance scripts work correctly

Common Error Self-Check

If you encounter the following errors, resolve them accordingly:

Error MessagePossible CauseSolution
not a partitioned tableMissing PARTITIONED BY or syntax error during table creationRe-create table with native SQL, adding partition definition
implicit cast not allowedPartition field type mismatchCheck inserted data types
exceeds maximum numberToo many partitions in a single operationAdjust parameters or process in batches
conflicts withTransform partition logic conflictChoose single time granularity
months conflicts with yearsMulti-level time partition design errorSwitch to days() single granularity
Syntax error at or near 'ORDER'SHOW PARTITIONS used with ORDER BYUse WITH subquery for sorting
cannot resolve column 'total_rows'Partition attributes used in TRUNCATE PARTITIONOnly partition field itself can be used
operator not foundType mismatch comparisonEnsure data types are consistent
duplicate.syntax.elementCLUSTERED BY and SORTED BY used togetherChoose one of the two syntaxes
Query performance worse than original platformPartition design may be unreasonableRe-evaluate partition strategy
Too many small partitionsToo many composite partition dimensionsReduce partition dimensions, use indexes instead
max_pt function not foundPossible table name error or permission issueCheck table name and schema permissions
TRUNCATE PARTITION failedPartition condition syntax errorCheck partition filter expression syntax

Performance Benchmark Acceptance Criteria

Basic Migration Success Indicators

  • Original query logic can run in Lakehouse without major modifications
  • Query performance meets or exceeds original platform levels
  • Partition maintenance workload is manageable and automated
  • Team members can independently handle common partition issues

Complex Migration Success Indicators

  • New partition strategy is simpler than original but performance is not degraded
  • Partition count is within reasonable range
  • Partition sizes are evenly distributed (128MB-1GB/partition)
  • Query patterns and partition design are highly aligned
  • Partition maintenance automation is no lower than original platform

Summary: Key Elements for Successful Migration

Core Mindset Shifts

  1. From Explicit Partitioning to Hidden Partitioning: No need to manually specify partition conditions in every query
  2. From Automatic Optimization to Proactive Design: Partition strategies need advance planning and design
  3. From Single Syntax to Diverse Choices: Multiple partition creation methods supported, choose the most suitable
  4. From Less Concern to Partition Awareness: Especially for Snowflake users, need to strengthen partition design awareness
  5. From Complex Partitioning to Simplified Design: Multi-level partitions need redesign to single granularity partitions

Pitfall Avoidance Summary

Basic Syntax Traps

  1. Remember plural forms for transform function names: years, months, days, hours
  2. Keep partition types consistent: Avoid mixing STRING and DATE types
  3. Control partition count within reasonable range: Avoid too many small partitions
  4. Design partitions based on query patterns: Design partition fields based on WHERE conditions
  5. Verify partition effectiveness promptly: Check partition correctness immediately after creation

Complex Migration Traps

  1. Multi-level time partitions cannot be directly migrated: year+month+day should change to days() single granularity
  2. Simplify composite partition dimensions: Too many dimensions lead to excessive partition count
  3. Rebuild partition evolution strategies: From automatic management to manual scheduled management
  4. Adjust performance optimization strategies: Z-Order and other advanced optimizations need redesign
  5. Supplement lifecycle management: From automatic cleanup to script-based periodic cleanup

Creation Verification Traps

  1. Recommend using native SQL to create partition tables: Ensure syntax accuracy
  2. Must verify after every creation: Execute the complete verification checklist
  3. Performance testing must compare baselines: Ensure partitioned tables are indeed faster than non-partitioned ones
  4. Monitor partition health status: Regularly check partition sizes and count distribution

Advanced Syntax Traps

  1. SHOW PARTITIONS does not support ORDER BY: Use WITH subquery for sorting
  2. bucket function parameters must be reasonable: Recommended positive integers in range 1-1000
  3. Note NULL values and special characters: The system will create corresponding partitions

Successful Migration Experience

For Simple Partition Scenarios

  • Incremental migration: Verify with small tables first, then migrate core large tables
  • Preserve original query logic: Try to let original SQL work without major changes
  • Performance baseline comparison: Compare query performance before and after migration
  • Complete verification process: Strictly follow the verification checklist

For Complex Partition Scenarios

  • Partition strategy redesign: Don't try to completely replicate the original partition structure
  • Performance verification first: Use representative queries to verify new partition strategy effectiveness
  • Phased implementation: Complex migration divided into four phases: analysis, redesign, testing, and cutover
  • Prepare rollback plan: Ensure quick rollback to original solution if migration fails

Special Reminders

For Snowflake and similar platform users: The challenge of migrating to Lakehouse mainly lies in mindset adjustment. You need to shift from "less concern about partitioning" to "proactive partition strategy design." It is recommended to first experience the impact of partitioning on query performance in a test environment, understand the value of partitioning, and then design production environment partition strategies.

For users with complex partition architectures: Don't try to completely replicate the original partition structure in the Lakehouse. The Lakehouse partition philosophy is "simplify without losing performance." Multi-level partitions, composite partitions, and other complex designs can often achieve the same or better results with simpler solutions in the Lakehouse.

For all migrating users: Partition table creation verification is the first and most critical step in successful migration. It is recommended to use native SQL to create partition tables and strictly execute the verification checklist. Many migration issues stem from partition tables not being correctly created, making subsequent performance optimization impossible.

Remember: Lakehouse partitioning is not a burden, but a powerful tool for performance optimization. When partitioning is used correctly, you will achieve better query performance and more flexible data management capabilities than on your original platform.


Quick Reference Card

Common Partition Management Commands

FunctionCommand SyntaxUse Case
View PartitionsSHOW PARTITIONS table_nameBasic partition viewing
Partition DetailsSHOW PARTITIONS EXTENDED table_nameView partition size, file count, etc.
Partition FilterSHOW PARTITIONS EXTENDED table WHERE bytes > 100*1024*1024Health check
Specific PartitionSHOW PARTITIONS table PARTITION (pt1 = '2023')View specific partition
Limit ResultsSHOW PARTITIONS table LIMIT 10Limit returned results
Max PartitionSELECT max_pt('table_name')Get latest partition value
Clean PartitionTRUNCATE TABLE table PARTITION (pt = 'value')Lifecycle management
Batch CleanTRUNCATE TABLE table PARTITION (pt1 = 'v1'), PARTITION (pt2 = 'v2')Composite condition cleanup

Platform Syntax Quick Reference

FunctionOriginal Platform SyntaxLakehouse SyntaxNotes
Year Partitionyear(date)years(date)Plural form, returns year count from 1970
Month Partitionmonth(date)months(date)Plural form
Day Partitionday(date)days(date)Plural form, returns day count from 1970-01-01
Hour Partitionhour(timestamp)hours(timestamp)Plural form
Composite Partition(year, month)days(date)Cannot combine conflicting transforms
Dynamic PartitionRequires enable configurationDefault supportPay attention to partition count control
Max Partitionmax_pt()max_pt('table_name')Must specify table name
Partition CleanupALTER TABLE DROP PARTITIONTRUNCATE TABLE PARTITIONMore flexible syntax

Quick Error Reference

When You See This ErrorImmediately CheckQuick Fix
not a partitioned tableWhether table creation statement has PARTITIONED BYRebuild table with native SQL
implicit cast not allowedWhether data types matchUse STRING partition uniformly or use transform partition
exceeds maximum numberWhether partition count is too highInsert in batches or adjust parameters
conflicts withWhether transform partitions conflictUse single time granularity
months conflicts with yearsMulti-level time partition design errorSwitch to days() single granularity
Syntax error at or near 'ORDER'SHOW PARTITIONS used with ORDER BYUse WITH subquery for sorting
cannot resolve column 'total_rows'Partition attributes used in TRUNCATE PARTITIONOnly partition field itself can be used
operator not foundType mismatch comparisonEnsure data types are consistent
duplicate.syntax.elementCLUSTERED BY and SORTED BY used togetherChoose one of the two syntaxes
Query performance worse than original platformWhether partition design is reasonableRe-evaluate partition strategy
Too many small partitionsToo many composite partition dimensionsReduce partition dimensions, use indexes instead
max_pt function not foundPossible table name error or permission issueCheck table name and schema permissions
TRUNCATE PARTITION failedPartition condition syntax errorCheck partition filter expression syntax

Partition Table Verification Quick Reference

Verification ItemCheck CommandExpected Result
Is Partitioned TableSHOW PARTITIONS table_nameNo error, shows partition list
Partition CreatedView partitions again after inserting dataShows new partition values
Type MatchDESCRIBE TABLE table_nameColumn types match expectations
Performance ImprovementCompare partition query vs full table scanPartition query noticeably faster
Latest PartitionSELECT max_pt('table_name')Returns max partition value
Partition HealthSHOW PARTITIONS EXTENDED table_namePartition sizes reasonably distributed

Advanced Syntax Quick Reference

FunctionCorrect SyntaxIncorrect Syntax
Partition + BucketPARTITIONED BY (days(date)) CLUSTERED BY (id) INTO 32 BUCKETSCorrect
Partition + SortPARTITIONED BY (days(date)) SORTED BY (name)Correct
Partition + Bucket + SortNot supported to use simultaneouslyWill report duplicate.syntax.element error
bucket Parameterbucket(10, user_id)Recommended range 1-1000
Partition SortWITH t AS (SHOW PARTITIONS ...) SELECT * FROM t ORDER BY ...Use subquery
NULL PartitionINSERT ... VALUES (1, NULL) -> col=NULL partitionSupported

Migration Priority Checklist

First Priority (Must Do)

  • Use native SQL to create partition tables
  • Execute complete verification checklist to ensure partition tables are correct
  • Compare partition table vs non-partitioned table performance
  • Verify original queries execute correctly on the new platform

Second Priority (Important)

  • Simplify complex partition structures (multi-level -> single-level)
  • Control partition count within reasonable range
  • Create indexes for high-frequency non-partition query columns
  • Establish partition health monitoring mechanism

Third Priority (Optimization)

  • Establish partition cleanup automation scripts
  • Team training on new partition concepts and operations
  • Continuous performance monitoring and tuning
  • Periodic evaluation and adjustment of partition strategies

Usage Recommendation: This document can serve as the authoritative reference for Lakehouse partition migration, with all examples directly usable in production environments. For complex migration projects, it is recommended to follow the verification checklist in the document step by step, ensuring each step is correctly verified.


Note: This document is compiled based on Lakehouse product documentation as of June 2025. It is recommended to periodically check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.