Lakehouse CREATE TABLE Usage Guide

Overview

Singdata Lakehouse provides full-featured, high-performance table creation capabilities, perfectly supporting smooth migration from Spark, Hive, MaxCompute, Snowflake, Databricks, and traditional databases. This guide is based on official documentation and production environment validation, providing professional migration paths and best practices for users from different technology backgrounds.


CREATE TABLE Complete Syntax

Syntax Structure

CREATE TABLE [ IF NOT EXISTS ] table_name 
(
    column_definition [, column_definition, ...]
    [, index_definition_list]
)
[ PARTITIONED BY (partition_spec) ]
[ CLUSTERED BY (column_list) [SORTED BY (column_list)] INTO num_buckets BUCKETS ]
[ COMMENT 'table_comment' ]
[ PROPERTIES ('key' = 'value', ...) ];

Column Definition Syntax

column_name column_type 
[ NOT NULL ]
[ PRIMARY KEY ]
[ IDENTITY[(seed)] ]
[ GENERATED ALWAYS AS (expr) ]
[ DEFAULT default_expression ]
[ COMMENT 'column_comment' ]

Supported Data Types

CategoryData TypeDescriptionCounterpart in Other Systems
NumericTINYINT1-byte integer (-128 to 127)Spark/Hive: TINYINT
SMALLINT2-byte integerSpark/Hive: SMALLINT
INT4-byte integerSpark/Hive: INT/INTEGER
BIGINT8-byte integerSpark/Hive: BIGINT/LONG
FLOAT4-byte floating pointSpark/Hive: FLOAT
DOUBLE8-byte floating pointSpark/Hive: DOUBLE
DECIMAL(p,s)Exact numericSpark/Hive: DECIMAL
StringVARCHAR(n)Variable-length (max 1048576)Snowflake: VARCHAR
CHAR(n)Fixed-length (1-255)Oracle: CHAR
STRINGDefault 16 MB, adjustableHive/Spark: STRING
TemporalDATEDate (YYYY-MM-DD)Common across all systems
TIMESTAMPTimestamp (local time)Spark: TIMESTAMP
TIMESTAMP_NTZTimestamp without timezoneSnowflake: TIMESTAMP_NTZ
BinaryBINARYBinary data, default 16 MBHive: BINARY
BooleanBOOLEANTrue/FalseCommon across all systems
ComplexARRAYArraySpark/Hive: ARRAY
MAP<K,V>Key-value pairsSpark/Hive: MAP
STRUCT<...>StructSpark/Hive: STRUCT
JSONJSON data, default 16 MBSnowflake: VARIANT
VectorVECTOR(dimension)Vector typeFor vector search scenarios
VECTOR(type, dimension)Vector with element typetype: tinyint/int/float

Spark User Migration Guide

Core Advantages Comparison

FeatureSparkLakehouseAdvantage
Table CreationDataFrame API-centricSQL DDL standardizedLower learning curve
Partition MgmtManual managementHidden partition + transform partitionAutomatic optimization
Data OrganizationFile-levelTable-level managementMore efficient metadata
Schema EvolutionRequires rewriteIn-place evolutionGreater flexibility

DataFrame to DDL Conversion

Spark DataFrame Approach:

# Spark DataFrame table creation
df = spark.read.parquet("path/to/data")
df.write.partitionBy("year", "month") \
  .bucketBy(10, "user_id") \
  .sortBy("timestamp") \
  .saveAsTable("user_events")

Lakehouse DDL Approach:

-- Corresponding Lakehouse CREATE TABLE statement
CREATE TABLE user_events (
    user_id BIGINT,
    event_type STRING,
    timestamp TIMESTAMP,
    properties MAP<STRING, STRING>,
    year INT GENERATED ALWAYS AS (year(timestamp)),
    month INT GENERATED ALWAYS AS (month(timestamp))
) 
PARTITIONED BY (year, month)
CLUSTERED BY (user_id) SORTED BY (timestamp) INTO 10 BUCKETS
COMMENT 'User events table';

Advanced Feature Mapping

1. Dynamic Partition Handling

-- Spark requires manual dynamic partition handling
-- Lakehouse handles it automatically with transform partitioning
CREATE TABLE events (
    event_time TIMESTAMP,
    user_id STRING,
    event_data STRING
) PARTITIONED BY (
    days(event_time),        -- Auto partition by day
    bucket(100, user_id)     -- Auto bucket by user ID
);

2. Schema Merge and Evolution

-- Lakehouse natively supports schema evolution
ALTER TABLE events ADD COLUMN new_field STRING;
ALTER TABLE events CHANGE COLUMN event_data TYPE JSON;

3. Complex Data Type Handling

-- Fully compatible with Spark complex types
CREATE TABLE user_profiles (
    user_id BIGINT,
    tags ARRAY<STRING>,                          -- Tag array
    attributes MAP<STRING, STRING>,              -- Attribute mapping
    address STRUCT<                              -- Nested struct
        street: STRING,
        city: STRING,
        coordinates: STRUCT<lat: DOUBLE, lon: DOUBLE>
    >,
    preferences JSON                             -- Flexible JSON field
);

Performance Optimization Recommendations

-- Optimized table creation pattern for Spark users
CREATE TABLE optimized_fact_table (
    -- Business fields
    transaction_id BIGINT IDENTITY(1),
    user_id BIGINT,
    product_id INT,
    amount DECIMAL(18, 2),
    transaction_time TIMESTAMP,
    
    -- Generated partition columns (replaces Spark manual partitioning)
    tx_date STRING GENERATED ALWAYS AS (date_format(transaction_time, 'yyyy-MM-dd')),
    tx_hour INT GENERATED ALWAYS AS (hour(transaction_time))
) 
PARTITIONED BY (tx_date, tx_hour)              -- Two-level partitioning
CLUSTERED BY (user_id) INTO 256 BUCKETS        -- Bucket by user dimension
PROPERTIES (
    'data_lifecycle' = '730',                   -- 2-year lifecycle
    'partition.cache.policy.latest.count' = '7' -- Cache last 7 days
);

Hive User Migration Guide

Perfect Syntax Compatibility

Singdata Lakehouse is fully compatible with Hive's partition syntax while offering more powerful features:

Traditional Hive Table Creation:

-- Hive style (fully supported)
CREATE TABLE IF NOT EXISTS sales_fact (
    order_id BIGINT,
    product_id INT,
    quantity INT,
    price DECIMAL(10,2)
) PARTITIONED BY (
    dt STRING,
    region STRING
)
STORED AS PARQUET
TBLPROPERTIES ('transactional'='true');

Lakehouse Enhanced Version:

-- Same syntax, better performance
CREATE TABLE IF NOT EXISTS sales_fact (
    order_id BIGINT,
    product_id INT,
    quantity INT,
    price DECIMAL(10,2),
    order_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP  -- New: default value
) PARTITIONED BY (
    dt STRING,
    region STRING
)
COMMENT 'Sales fact table with enhanced features'
PROPERTIES (
    'data_lifecycle' = '1095',      -- New: 3-year lifecycle management
    'change_tracking' = 'false'     -- New: CDC capability reserved
);

Partition Strategy Upgrade

Evolution from Static Partitioning to Intelligent Partitioning:

-- Hive traditional approach: manual partition management required
INSERT OVERWRITE TABLE sales_fact PARTITION(dt='2024-06-20', region='CN')
SELECT ...;

-- Lakehouse intelligent approach: automatic partition management
CREATE TABLE sales_fact_smart (
    order_id BIGINT,
    order_time TIMESTAMP,
    region_code STRING,
    amount DECIMAL(10,2),
    -- Auto-generated partition column
    dt STRING GENERATED ALWAYS AS (date_format(order_time, 'yyyy-MM-dd'))
) PARTITIONED BY (dt, region_code);

-- Partition is automatically handled on insert
INSERT INTO sales_fact_smart (order_id, order_time, region_code, amount)
VALUES (1001, CURRENT_TIMESTAMP, 'CN', 99.99);

Transactional Table Capability Enhancement

-- Hive transactional tables have many limitations; Lakehouse natively supports
CREATE TABLE transaction_table (
    id BIGINT PRIMARY KEY,           -- Native primary key support
    data STRING,
    version INT,
    updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) 
CLUSTERED BY (id) INTO 16 BUCKETS   -- Required bucketing optimization
PROPERTIES (
    'data_retention_days' = '7'      -- Time Travel capability
);

Performance Improvement Comparison

Operation TypeHive (MapReduce)LakehouseImprovement
Table Creation SpeedSecondsMilliseconds10x
Partition PruningFile-level scanMetadata pruning100x
Schema ChangesRequires rebuildOnline changeInfinite
Small File CompactionManual maintenanceAutomatic optimizationAutomated

MaxCompute User Migration Guide

Syntax Comparison and Enhancement

MaxCompute users will find Singdata Lakehouse's syntax very familiar, while the functionality is even more powerful:

MaxCompute Table Creation:

-- MaxCompute syntax
CREATE TABLE IF NOT EXISTS sale_detail(
    shop_name STRING,
    customer_id STRING,
    total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING)
LIFECYCLE 730;

Lakehouse Equivalent Syntax:

-- Nearly identical syntax, more features
CREATE TABLE IF NOT EXISTS sale_detail(
    shop_name STRING,
    customer_id STRING,
    total_price DOUBLE,
    -- New capabilities
    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    row_id BIGINT IDENTITY(1)  -- Auto-increment primary key
)
PARTITIONED BY (sale_date STRING, region STRING)
PROPERTIES (
    'data_lifecycle' = '730',  -- Lifecycle (days)
    'data_retention_days' = '7' -- Time Travel retention period
);

Function Compatibility

-- MaxCompute function -> Lakehouse function mapping
CREATE TABLE datetime_example (
    id INT,
    -- MaxCompute: GETDATE() -> Lakehouse: CURRENT_TIMESTAMP
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    -- Date functions fully compatible
    year_part INT GENERATED ALWAYS AS (YEAR(created_at)),
    month_part INT GENERATED ALWAYS AS (MONTH(created_at)),
    
    -- String functions are the same
    user_name STRING,
    name_length INT GENERATED ALWAYS AS (LENGTH(user_name))
);

Resource Management Comparison

FeatureMaxComputeLakehouseAdvantage
Compute ResourcesPrepaid CUElastic VCLUSTERPay-per-use
Storage LimitProject quotaUnlimited expansionMore flexible
Concurrency CtrlJob queueDynamic schedulingMore efficient
Cross-regionRequires migrationNative supportGlobal

Snowflake User Migration Guide

Architectural Philosophy Comparison

Both are cloud-native data warehouses, but Lakehouse provides more control:

Snowflake-Style Table Creation:

-- Snowflake
CREATE OR REPLACE TABLE customer_transactions (
    transaction_id NUMBER AUTOINCREMENT,
    customer_id VARCHAR,
    amount NUMBER(18,2),
    transaction_date TIMESTAMP_NTZ,
    status VARCHAR DEFAULT 'PENDING'
) CLUSTER BY (customer_id);

Lakehouse Enhanced Version:

-- Lakehouse: more explicit optimization control
CREATE TABLE IF NOT EXISTS customer_transactions (
    transaction_id BIGINT IDENTITY(1),  -- Corresponds to AUTOINCREMENT
    customer_id VARCHAR(100),
    amount DECIMAL(18,2),
    transaction_date TIMESTAMP_NTZ,
    status VARCHAR(20) DEFAULT 'PENDING',
    
    -- Additional optimization columns
    tx_date STRING GENERATED ALWAYS AS (date_format(transaction_date, 'yyyy-MM-dd')),
    INDEX idx_customer (customer_id) BLOOMFILTER  -- Explicit index
) 
PARTITIONED BY (tx_date)  -- Explicit partition strategy
CLUSTERED BY (customer_id) INTO 128 BUCKETS  -- Explicit bucket count
COMMENT 'Customer transaction table with explicit optimizations';

Time Travel and Cloning

-- Snowflake Time Travel -> Lakehouse Time Travel
CREATE TABLE orders_backup AS
SELECT * FROM orders;  -- Immediate copy

-- Lakehouse also supports:
ALTER TABLE orders SET PROPERTIES ('data_retention_days' = '30');  -- 30-day history
-- Query historical data
SELECT * FROM orders TIMESTAMP AS OF '2024-06-01 00:00:00';

Performance Tuning Differences

-- Snowflake: auto-optimization
-- Lakehouse: provides more tuning options

CREATE TABLE large_fact_table (
    -- Column definitions
    fact_id BIGINT,
    dim1_id INT,
    dim2_id INT,
    measure1 DECIMAL(18,4),
    measure2 DECIMAL(18,4),
    fact_time TIMESTAMP,
    
    -- Performance optimization
    fact_date STRING GENERATED ALWAYS AS (date_format(fact_time, 'yyyy-MM-dd'))
) 
PARTITIONED BY (fact_date)
CLUSTERED BY (dim1_id, dim2_id) 
    SORTED BY (fact_time DESC)    -- Additional sort optimization
    INTO 256 BUCKETS              -- Precise parallelism control
PROPERTIES (
    'partition.cache.policy.latest.count' = '30'  -- Cache policy
);

Databricks User Migration Guide

Deep Integration of Delta Lake Concepts

Lakehouse draws from Delta Lake's design philosophy, offering similar but more concise syntax:

Databricks Delta Table:

# Databricks Python API
(spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id LONG,
        event_time TIMESTAMP,
        user_id STRING,
        event_type STRING,
        properties MAP<STRING, STRING>
    ) USING DELTA
    PARTITIONED BY (date(event_time))
    TBLPROPERTIES (
        'delta.deletedFileRetentionDuration' = '7 days',
        'delta.optimizeWrite' = 'true'
    )
"""))

Lakehouse Native SQL:

-- Pure SQL implementation, no Python wrapper needed
CREATE TABLE IF NOT EXISTS events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    event_type STRING,
    properties MAP<STRING, STRING>,
    -- Auto partition column
    event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
) 
PARTITIONED BY (event_date)
PROPERTIES (
    'data_retention_days' = '7',      -- Corresponds to deletedFileRetentionDuration
    'data_lifecycle' = '365'          -- Data lifecycle management
);

Streaming and Batch Unified Architecture

-- Table design supporting real-time writes
CREATE TABLE streaming_events (
    event_id BIGINT PRIMARY KEY,      -- Supports real-time deduplication
    event_data JSON,
    received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP,
    
    -- Partition design supports streaming writes
    event_hour INT GENERATED ALWAYS AS (hour(received_at)),
    event_date STRING GENERATED ALWAYS AS (date_format(received_at, 'yyyy-MM-dd'))
) 
PARTITIONED BY (event_date, event_hour)
CLUSTERED BY (event_id) INTO 32 BUCKETS
PROPERTIES (
    'change_tracking' = 'false'  -- Reserved for Table Stream
);

Z-Order Optimization Mapping

-- Databricks: OPTIMIZE table ZORDER BY (col1, col2)
-- Lakehouse: achieve similar effect through CLUSTERED BY + SORTED BY

CREATE TABLE optimized_table (
    col1 INT,
    col2 STRING,
    col3 DECIMAL(10,2),
    col4 TIMESTAMP
)
CLUSTERED BY (col1, col2)      -- Data clustering
SORTED BY (col1, col2)          -- Sort optimization
INTO 64 BUCKETS;

Traditional Database User Migration Guide

Mindset Transformation

Transitioning from OLTP to OLAP requires adjusting table design thinking:

Traditional OLTP Design:

-- MySQL/PostgreSQL/Oracle style
CREATE TABLE orders (
    order_id INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    order_date DATETIME DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'NEW',
    FOREIGN KEY (customer_id) REFERENCES customers(id),
    INDEX idx_date (order_date),
    INDEX idx_customer (customer_id)
);

Lakehouse OLAP Design:

-- Analytics-optimized design
CREATE TABLE orders (
    order_id BIGINT IDENTITY(1),     -- Auto-increment primary key
    customer_id INT NOT NULL,
    order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status VARCHAR(20) DEFAULT 'NEW',
    
    -- Analytics optimization columns
    order_year INT GENERATED ALWAYS AS (year(order_date)),
    order_month INT GENERATED ALWAYS AS (month(order_date)),
    order_day INT GENERATED ALWAYS AS (day(order_date)),
    
    -- Index definitions
    INDEX bloom_customer (customer_id) BLOOMFILTER,
    INDEX bloom_status (status) BLOOMFILTER
) 
PARTITIONED BY (order_year, order_month)  -- Monthly partitioning
CLUSTERED BY (customer_id) INTO 256 BUCKETS  -- Customer dimension optimization
PROPERTIES (
    'data_lifecycle' = '2555'  -- 7-year data retention
);

Data Type Mapping

Traditional DBLakehouseMigration Advice
INT AUTO_INCREMENTBIGINT IDENTITY(1)Use BIGINT to avoid overflow
DATETIMETIMESTAMPUnified time type
TEXTSTRINGDefault 16 MB, adjustable
ENUMVARCHAR + CHECKUse constraints instead
JSON (MySQL)JSONNative JSON support
SERIAL (PG)BIGINT IDENTITY(1)Fully compatible

Index Strategy Transformation

-- Traditional B-Tree index -> Columnar storage optimization
CREATE TABLE user_activity (
    user_id BIGINT,
    activity_time TIMESTAMP,
    activity_type VARCHAR(50),
    details JSON,
    
    -- Bloom filter replaces B-Tree (for point queries)
    INDEX bloom_user (user_id) BLOOMFILTER,
    
    -- Inverted index replaces full-text index
    INDEX inv_details (details) INVERTED PROPERTIES ('analyzer' = 'english'),
    
    -- Generated column optimizes time queries
    activity_date STRING GENERATED ALWAYS AS (date_format(activity_time, 'yyyy-MM-dd'))
) 
PARTITIONED BY (activity_date)  -- Time partitioning accelerates range queries
CLUSTERED BY (user_id) INTO 128 BUCKETS;  -- User dimension clustering

Advanced Features in Detail

1. Auto-Increment Column (IDENTITY)

-- Complete example: Orders table with auto-increment primary key
CREATE TABLE order_master (
    order_id BIGINT IDENTITY(1000),  -- Starts from 1000
    order_no VARCHAR(50) NOT NULL,
    customer_id BIGINT NOT NULL,
    total_amount DECIMAL(18,2),
    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) COMMENT 'Order master table';

-- Notes:
-- 1. Only supports BIGINT type
-- 2. Continuity is not guaranteed (may skip under high concurrency)
-- 3. Cannot be added via ALTER TABLE

2. Generated Column (GENERATED ALWAYS AS)

-- Automatic time dimension generation
CREATE TABLE sales_detail (
    sale_time TIMESTAMP,
    product_id INT,
    quantity INT,
    unit_price DECIMAL(10,2),
    
    -- Auto-calculated fields
    total_amount DECIMAL(18,2) GENERATED ALWAYS AS (quantity * unit_price),
    
    -- Time dimension expansion
    sale_date STRING GENERATED ALWAYS AS (date_format(sale_time, 'yyyy-MM-dd')),
    sale_year INT GENERATED ALWAYS AS (year(sale_time)),
    sale_month INT GENERATED ALWAYS AS (month(sale_time)),
    sale_week INT GENERATED ALWAYS AS (weekofyear(sale_time)),
    sale_hour INT GENERATED ALWAYS AS (hour(sale_time))
) 
PARTITIONED BY (sale_date)  -- Generated columns can be used for partitioning
COMMENT 'Sales detail table with auto-calculated fields';

Generated Column Usage Restrictions:

  • Generated columns cannot be used in CLUSTERED BY or SORTED BY
  • Primary key tables cannot use generated columns as partition keys
  • Only deterministic functions are supported; non-deterministic functions like CURRENT_DATE, RANDOM are not allowed

3. Default Values (DEFAULT)

-- Audit table design
CREATE TABLE audit_log (
    log_id BIGINT IDENTITY(1),
    table_name VARCHAR(100) NOT NULL,
    operation VARCHAR(20) NOT NULL,
    user_id VARCHAR(50) DEFAULT CURRENT_USER(),  -- Current user
    operation_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,  -- Current time
    client_ip VARCHAR(50) DEFAULT '0.0.0.0',
    status VARCHAR(20) DEFAULT 'SUCCESS',
    details JSON
) 
PARTITIONED BY (date_format(operation_time, 'yyyy-MM-dd'))
COMMENT 'Unified audit log table';

4. Primary Keys and Unique Constraints

-- Primary key table design for CDC scenarios
CREATE TABLE customer_snapshot (
    customer_id BIGINT,
    snapshot_date DATE,
    customer_name VARCHAR(200),
    customer_level VARCHAR(20),
    total_purchase DECIMAL(18,2),
    last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    -- Composite primary key
    PRIMARY KEY (customer_id, snapshot_date)
) 
PARTITIONED BY (snapshot_date)
CLUSTERED BY (customer_id, snapshot_date) 
SORTED BY (customer_id, snapshot_date) INTO 64 BUCKETS  -- Note: primary key tables must use ASC sorting
COMMENT 'Customer snapshot table for CDC';

Important Primary Key Table Restrictions:

  • Primary key columns must be included in CLUSTERED BY and SORTED BY
  • SORTED BY must use ascending order (ASC), descending (DESC) is not allowed
  • Partition columns must be a subset of the primary key
  • Generated columns cannot be used as partition columns for primary key tables

5. Complex Index Strategies

-- Multi-dimensional query optimization table
CREATE TABLE product_search (
    product_id BIGINT PRIMARY KEY,
    product_name VARCHAR(500),
    description STRING,
    category_path VARCHAR(200),
    brand VARCHAR(100),
    price DECIMAL(10,2),
    tags ARRAY<STRING>,
    attributes MAP<STRING, STRING>,
    
    -- Multiple index combinations (note: index names must be globally unique)
    INDEX bloom_prod_brand (brand) BLOOMFILTER,  -- Fast brand filtering
    INDEX bloom_prod_category (category_path) BLOOMFILTER,  -- Category filtering
    INDEX inv_prod_name (product_name) INVERTED PROPERTIES ('analyzer' = 'chinese'),  -- Product name search
    INDEX inv_prod_desc (description) INVERTED PROPERTIES ('analyzer' = 'chinese')  -- Description search
)
CLUSTERED BY (product_id) SORTED BY (product_id) INTO 128 BUCKETS  -- Primary key table restriction
COMMENT 'Product search optimization table';

Index Naming Best Practices:

  • Index names must be unique across the entire database
  • Recommended format: table_name_column_name or idx_table_name_column_name
  • Avoid overly generic names such as idx_id

6. Vector Index (VECTOR)

-- Vector search scenario table design
CREATE TABLE embeddings_search (
    doc_id BIGINT PRIMARY KEY,
    content STRING,
    embedding VECTOR(512),  -- 512-dimensional vector, default float type
    vec_small VECTOR(128),  -- 128-dimensional vector
    vec_int VECTOR(int, 256),  -- 256-dimensional vector with int type
    
    -- Vector index
    INDEX vec_idx (embedding) USING VECTOR PROPERTIES (
        'scalar.type' = 'f32',
        'distance.function' = 'cosine_distance',  -- Cosine distance
        'm' = '16',  -- HNSW algorithm parameter
        'ef.construction' = '200'
    )
)
CLUSTERED BY (doc_id) SORTED BY (doc_id) INTO 64 BUCKETS
COMMENT 'Vector search optimization table';

Vector Type Usage Restrictions:

  • Cannot be used in ORDER BY or GROUP BY
  • Supported element types: tinyint, int, float (default)
  • Can be mutually converted with array types

Partition Strategy Best Practices

1. Time-Based Partition Design

-- Multi-granularity time partitioning
CREATE TABLE event_log (
    event_id BIGINT,
    event_time TIMESTAMP,
    event_type VARCHAR(50),
    event_data JSON,
    
    -- Multi-level partition column generation
    event_date STRING GENERATED ALWAYS AS (date_format(event_time, 'yyyy-MM-dd')),
    event_hour INT GENERATED ALWAYS AS (hour(event_time))
) 
PARTITIONED BY (event_date, event_hour)  -- Two-level partitioning
COMMENT 'Event log table with hourly partitions';

-- Transform partition function example
CREATE TABLE metric_data (
    metric_time TIMESTAMP,
    metric_name VARCHAR(100),
    metric_value DOUBLE
) 
PARTITIONED BY (
    days(metric_time)  -- Partition by number of days (since 1970-01-01)
);

2. Business Dimension Partitioning

-- Multi-dimension combined partitioning
CREATE TABLE transaction_fact (
    tx_id BIGINT IDENTITY(1),
    tx_time TIMESTAMP,
    region_code VARCHAR(10),
    channel_code VARCHAR(10),
    amount DECIMAL(18,2),
    
    tx_date STRING GENERATED ALWAYS AS (date_format(tx_time, 'yyyy-MM-dd'))
) 
PARTITIONED BY (tx_date, region_code, channel_code)  -- Three-dimensional partitioning
CLUSTERED BY (tx_id) INTO 256 BUCKETS
COMMENT 'Transaction fact table with multi-dimension partitioning';

3. Partition Count Control

-- Properly control partition granularity
CREATE TABLE log_archive (
    log_time TIMESTAMP,
    log_level VARCHAR(10),
    log_content STRING,
    
    -- Monthly partitioning to avoid too many partitions
    log_month STRING GENERATED ALWAYS AS (date_format(log_time, 'yyyy-MM'))
) 
PARTITIONED BY (log_month)
PROPERTIES (
    'data_lifecycle' = '1095'  -- Auto-clean after 3 years
);

Bucketing Optimization Guide

1. Bucket Count Selection

-- Bucket count calculation formula: total data size / 128 MB ~ 1 GB

-- Small table (<10 GB): small number of buckets
CREATE TABLE small_dim (
    dim_id INT,
    dim_name VARCHAR(100)
) CLUSTERED BY (dim_id) INTO 8 BUCKETS;

-- Medium table (10 GB ~ 1 TB): moderate number of buckets
CREATE TABLE medium_fact (
    fact_id BIGINT,
    measure DECIMAL(18,4)
) CLUSTERED BY (fact_id) INTO 128 BUCKETS;

-- Large table (>1 TB): large number of buckets
CREATE TABLE large_fact (
    id BIGINT,
    data STRING
) CLUSTERED BY (id) INTO 1024 BUCKETS;

2. Bucket Key Selection Strategy

-- JOIN optimization: use the JOIN key as the bucket key
CREATE TABLE order_items (
    order_id BIGINT,
    item_id INT,
    quantity INT,
    price DECIMAL(10,2)
) CLUSTERED BY (order_id) INTO 256 BUCKETS;  -- Same bucketing as orders table

CREATE TABLE orders (
    order_id BIGINT,
    customer_id INT,
    order_date DATE
) CLUSTERED BY (order_id) INTO 256 BUCKETS;  -- Same bucketing strategy

3. Sort Optimization

-- SORTED BY accelerates range queries
CREATE TABLE time_series_data (
    device_id VARCHAR(50),
    metric_time TIMESTAMP,
    metric_value DOUBLE
) 
CLUSTERED BY (device_id) 
SORTED BY (metric_time DESC)  -- Latest data first
INTO 128 BUCKETS;

Table Properties (PROPERTIES) In-Depth Analysis

Officially Supported Properties

CREATE TABLE data_management (
    id BIGINT,
    data STRING,
    create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) 
PROPERTIES (
    -- Lifecycle management
    'data_lifecycle' = '1095',           -- Auto-delete after 3 years (days)
    'data_retention_days' = '30',        -- 30-day Time Travel history
    
    -- Cache policy
    'partition.cache.policy.latest.count' = '7',  -- Cache the most recent 7 partitions
    
    -- Change tracking
    'change_tracking' = 'true',          -- Enable Table Stream functionality
    
    -- Field length limits (bytes)
    'cz.storage.write.max.string.bytes' = '33554432',  -- STRING max 32 MB
    'cz.storage.write.max.binary.bytes' = '33554432',  -- BINARY max 32 MB
    'cz.storage.write.max.json.bytes' = '33554432'     -- JSON max 32 MB
);

Property Usage Reference

Property NameDescriptionValue RangeDefault Value
data_lifecycleData lifecycle (days)>0 or -1 (disabled)-1
data_retention_daysTime Travel retention period0-901
partition.cache.policy.latest.countCache the most recent N partitions>=00
change_trackingWhether to enable change trackingtrue/falsefalse
cz.storage.write.max.string.bytesSTRING maximum length>016777216 (16 MB)
cz.storage.write.max.binary.bytesBINARY maximum length>016777216 (16 MB)
cz.storage.write.max.json.bytesJSON maximum length>016777216 (16 MB)

1. Dimension Table Pattern

-- Standard dimension table design
CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_code VARCHAR(50) NOT NULL,
    product_name VARCHAR(200) NOT NULL,
    category_id INT,
    brand_id INT,
    status VARCHAR(20) DEFAULT 'ACTIVE',
    created_date DATE,
    modified_date DATE,
    
    -- Audit columns
    created_by VARCHAR(50),
    modified_by VARCHAR(50),
    
    -- Index optimization
    INDEX bloom_code (product_code) BLOOMFILTER,
    INDEX bloom_category (category_id) BLOOMFILTER,
    INDEX inv_name (product_name) INVERTED PROPERTIES ('analyzer' = 'chinese')
) 
CLUSTERED BY (product_id) INTO 32 BUCKETS  -- Dimension tables are usually small
COMMENT 'Product dimension table';

2. Fact Table Pattern

-- Large fact table design
CREATE TABLE fact_sales (
    -- Business primary key
    sale_id BIGINT IDENTITY(1),
    
    -- Dimension foreign keys
    date_id INT NOT NULL,
    product_id INT NOT NULL,
    customer_id BIGINT NOT NULL,
    store_id INT NOT NULL,
    
    -- Measures
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) DEFAULT 0,
    tax_amount DECIMAL(10,2) DEFAULT 0,
    total_amount DECIMAL(18,2) GENERATED ALWAYS AS 
        (quantity * unit_price - discount_amount + tax_amount),
    
    -- Timestamp
    sale_time TIMESTAMP NOT NULL,
    
    -- Partition column
    sale_date STRING GENERATED ALWAYS AS (date_format(sale_time, 'yyyy-MM-dd')),
    
    -- Indexes
    INDEX bloom_product (product_id) BLOOMFILTER,
    INDEX bloom_customer (customer_id) BLOOMFILTER
) 
PARTITIONED BY (sale_date)
CLUSTERED BY (customer_id) SORTED BY (sale_time DESC) INTO 256 BUCKETS
PROPERTIES (
    'data_lifecycle' = '2555',  -- 7-year retention
    'partition.cache.policy.latest.count' = '30'  -- Cache 30 days
)
COMMENT 'Sales fact table';

3. Wide Table Pattern

-- Wide table design (star schema denormalization)
CREATE TABLE user_behavior_wide (
    -- User dimension
    user_id BIGINT,
    user_name VARCHAR(100),
    user_level VARCHAR(20),
    register_date DATE,
    
    -- Behavior facts
    behavior_id BIGINT IDENTITY(1),
    behavior_type VARCHAR(50),
    behavior_time TIMESTAMP,
    
    -- Product dimension (denormalized)
    product_id INT,
    product_name VARCHAR(200),
    category_name VARCHAR(100),
    brand_name VARCHAR(100),
    
    -- Computed columns
    behavior_date STRING GENERATED ALWAYS AS (date_format(behavior_time, 'yyyy-MM-dd')),
    behavior_hour INT GENERATED ALWAYS AS (hour(behavior_time))
) 
PARTITIONED BY (behavior_date, behavior_hour)
CLUSTERED BY (user_id) INTO 512 BUCKETS
COMMENT 'User behavior wide table';

Performance Tuning Checklist

Pre-Creation Assessment

  • Data Volume Assessment: Estimate the total table size and growth rate
  • Query Patterns: Identify the primary query dimensions and filter conditions
  • Update Frequency: Determine whether append-only or updates/deletes are needed
  • Concurrency Requirements: Assess query concurrency and response time requirements

Partition Design Checklist

  • Partition Key Selection: Choose the most commonly used filter columns in queries
  • Partition Granularity: Ensure single partition size is between 100 MB and 1 GB
  • Partition Count: Avoid creating too many partitions (recommended < 10000)
  • Transform Partitioning: Consider using transform functions to optimize partitioning

Bucketing Optimization Checklist

  • Bucket Key Selection: Prioritize JOIN keys or high-cardinality columns
  • Bucket Count: Choose based on data volume (total / 128 MB ~ 1 GB)
  • Sort Strategy: Add SORTED BY for range queries
  • Data Skew: Avoid selecting heavily skewed columns as bucket keys

Index Strategy Checklist

  • Bloom Filter: Create for high-cardinality point-query columns
  • Inverted Index: Create for text search columns
  • Vector Index: Create for vector search scenarios
  • Index Maintenance: Periodically execute BUILD INDEX
  • Index Evaluation: Monitor index usage rate and effectiveness

FAQ and Solutions

Q1: How to handle very large tables (>10 TB)?

-- Multi-level partitioning strategy
CREATE TABLE huge_table (
    id BIGINT,
    event_time TIMESTAMP,
    region VARCHAR(10),
    data STRING,
    
    -- Three-level partitioning
    year INT GENERATED ALWAYS AS (year(event_time)),
    month INT GENERATED ALWAYS AS (month(event_time)),
    day INT GENERATED ALWAYS AS (day(event_time))
) 
PARTITIONED BY (year, month, day, region)
CLUSTERED BY (id) INTO 2048 BUCKETS;

Q2: How to optimize real-time writes?

-- Real-time write optimized table design
CREATE TABLE realtime_events (
    event_id BIGINT PRIMARY KEY,  -- Supports deduplication
    event_data JSON,
    ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    
    -- Hourly partitioning reduces write conflicts
    ingest_hour INT GENERATED ALWAYS AS (hour(ingest_time)),
    ingest_date STRING GENERATED ALWAYS AS (date_format(ingest_time, 'yyyy-MM-dd'))
) 
PARTITIONED BY (ingest_date, ingest_hour)
CLUSTERED BY (event_id) INTO 64 BUCKETS;  -- Moderate bucketing

Q3: How to design SCD (Slowly Changing Dimension)?

-- Type 2 SCD design
CREATE TABLE dim_customer_scd (
    customer_id BIGINT,
    customer_name VARCHAR(200),
    customer_level VARCHAR(20),
    effective_date DATE,
    expiry_date DATE,
    is_current BOOLEAN,
    
    PRIMARY KEY (customer_id, effective_date)
) 
CLUSTERED BY (customer_id, effective_date) 
SORTED BY (customer_id, effective_date)  -- Primary key tables must use ASC sorting
INTO 128 BUCKETS
COMMENT 'Customer dimension with SCD Type 2';

Q4: How to adjust field length limits?

-- Adjust maximum length of STRING/BINARY/JSON fields
ALTER TABLE large_content_table SET PROPERTIES (
    'cz.storage.write.max.string.bytes' = '67108864',  -- 64 MB
    'cz.storage.write.max.json.bytes' = '33554432'     -- 32 MB
);

Migration Success Factors Summary

Technology Stack Migration Quick Reference

Source SystemKey Migration PointsLakehouse Advantage
SparkDataFrame -> SQL DDLStandard SQL lowers the barrier
HivePartition syntax fully compatible100x performance improvement
MaxComputeFunctions largely consistentElastic resources, more flexible
SnowflakeAdd explicit optimizationControllable costs
DatabricksDelta concepts alignedPure SQL, more concise
Traditional DBOLTP -> OLAP mindsetUnlimited scalability

Table Creation Best Practices Summary

  1. Proper Partitioning: Choose low-cardinality, frequently queried columns
  2. Moderate Bucketing: Choose based on data volume, avoid over-bucketing
  3. Precise Indexing: Only create indexes for necessary columns
  4. Smart Use of Generated Columns: Reduce ETL complexity
  5. Lifecycle Management: Set a reasonable retention period
  6. Cautious Primary Keys: Only use in CDC scenarios
  7. Test First: Validate on small datasets before large-scale application

Expected Performance Optimization Results

By following the table creation best practices in this guide:

  • Query Performance: 10-100x improvement over Hive
  • Storage Efficiency: Columnar compression saves 60-80% space
  • Maintenance Cost: Automated management reduces manual effort by 90%
  • Scalability: Supports petabyte-scale data volumes

Important Constraints and Limitations Summary

Primary Key Table Constraints

  • CLUSTERED BY: Must include all primary key columns
  • SORTED BY: Must include all primary key columns and can only use ascending (ASC) order
  • PARTITIONED BY: Partition columns must be a subset of the primary key
  • Generated Column Restriction: Primary key tables cannot use generated columns as partition keys

Generated Column Restrictions

  • Cannot be used in CLUSTERED BY or SORTED BY
  • Non-deterministic functions are not supported (e.g., CURRENT_DATE, RANDOM, CURRENT_TIMESTAMP)
  • Aggregate functions, window functions, or table functions are not supported
  • Values cannot be specified for generated columns during INSERT

Index Constraints

  • Index names must be unique across the entire database
  • BLOOMFILTER indexes do not support BUILD INDEX after creation
  • Inverted indexes must specify an analyzer
  • Vector indexes cannot be used in ORDER BY or GROUP BY

Data Type Restrictions

  • Auto-increment columns (IDENTITY) only support the BIGINT type
  • VARCHAR maximum length is 1048576 (approximately 1 MB)
  • STRING/BINARY/JSON default maximum is 16 MB; adjustable via properties
  • Partition columns do not support floating-point types such as FLOAT, DOUBLE, DECIMAL
  • VECTOR type elements support tinyint, int, float

Partition Restrictions

  • A single task can write to a maximum of 2048 partitions (adjustable via parameters)
  • Transform partition functions use years, months, days, hours (note the plural form)
  • Partition columns do not support complex data types (ARRAY, MAP, STRUCT)

Summary

Singdata Lakehouse's CREATE TABLE functionality provides enterprises with powerful, flexible, high-performance table management capabilities. Regardless of your technology background, you can quickly get started and unlock the system's maximum value. Through the detailed explanations and rich examples in this guide, you should now have mastered all the skills needed to create efficient data tables in Lakehouse.

Start your Lakehouse journey now and experience the powerful capabilities of the next-generation data warehouse!


Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, always verify the correctness and performance impact of all operations in a test environment.