Lakehouse CREATE TABLE Usage Guide
Overview
Singdata Lakehouse provides full-featured, high-performance table creation capabilities, perfectly supporting smooth migration from Spark, Hive, MaxCompute, Snowflake, Databricks, and traditional databases. This guide is based on official documentation and production environment validation, providing professional migration paths and best practices for users from different technology backgrounds.
Quick Navigation
CREATE TABLE Complete Syntax
Syntax Structure
CREATE TABLE [ IF NOT EXISTS ] table_name
(
column_definition [, column_definition, ...]
[, index_definition_list]
)
[ PARTITIONED BY (partition_spec) ]
[ CLUSTERED BY (column_list) [SORTED BY (column_list)] INTO num_buckets BUCKETS ]
[ COMMENT 'table_comment' ]
[ PROPERTIES ('key' = 'value', ...) ];
Column Definition Syntax
column_name column_type
[ NOT NULL ]
[ PRIMARY KEY ]
[ IDENTITY[(seed)] ]
[ GENERATED ALWAYS AS (expr) ]
[ DEFAULT default_expression ]
[ COMMENT 'column_comment' ]
Supported Data Types
| Category | Data Type | Description | Counterpart in Other Systems |
|---|
| Numeric | TINYINT | 1-byte integer (-128 to 127) | Spark/Hive: TINYINT |
| SMALLINT | 2-byte integer | Spark/Hive: SMALLINT |
| INT | 4-byte integer | Spark/Hive: INT/INTEGER |
| BIGINT | 8-byte integer | Spark/Hive: BIGINT/LONG |
| FLOAT | 4-byte floating point | Spark/Hive: FLOAT |
| DOUBLE | 8-byte floating point | Spark/Hive: DOUBLE |
| DECIMAL(p,s) | Exact numeric | Spark/Hive: DECIMAL |
| String | VARCHAR(n) | Variable-length (max 1048576) | Snowflake: VARCHAR |
| CHAR(n) | Fixed-length (1-255) | Oracle: CHAR |
| STRING | Default 16 MB, adjustable | Hive/Spark: STRING |
| Temporal | DATE | Date (YYYY-MM-DD) | Common across all systems |
| TIMESTAMP | Timestamp (local time) | Spark: TIMESTAMP |
| TIMESTAMP_NTZ | Timestamp without timezone | Snowflake: TIMESTAMP_NTZ |
| Binary | BINARY | Binary data, default 16 MB | Hive: BINARY |
| Boolean | BOOLEAN | True/False | Common across all systems |
| Complex | ARRAY | Array | Spark/Hive: ARRAY |
| MAP<K,V> | Key-value pairs | Spark/Hive: MAP |
| STRUCT<...> | Struct | Spark/Hive: STRUCT |
| JSON | JSON data, default 16 MB | Snowflake: VARIANT |
| Vector | VECTOR(dimension) | Vector type | For vector search scenarios |
| VECTOR(type, dimension) | Vector with element type | type: tinyint/int/float |
Spark User Migration Guide
Core Advantages Comparison
| Feature | Spark | Lakehouse | Advantage |
|---|
| Table Creation | DataFrame API-centric | SQL DDL standardized | Lower learning curve |
| Partition Mgmt | Manual management | Hidden partition + transform partition | Automatic optimization |
| Data Organization | File-level | Table-level management | More efficient metadata |
| Schema Evolution | Requires rewrite | In-place evolution | Greater flexibility |
DataFrame to DDL Conversion
Spark DataFrame Approach:
# Spark DataFrame table creation
df = spark.read.parquet("path/to/data")
df.write.partitionBy("year", "month") \
.bucketBy(10, "user_id") \
.sortBy("timestamp") \
.saveAsTable("user_events")
Lakehouse DDL Approach:
-- Corresponding Lakehouse CREATE TABLE statement
CREATE TABLE user_events (
user_id BIGINT,
event_type STRING,
timestamp TIMESTAMP,
properties MAP<STRING, STRING>,
year INT GENERATED ALWAYS AS (year(timestamp)),
month INT GENERATED ALWAYS AS (month(timestamp))
)
PARTITIONED BY (year, month)
CLUSTERED BY (user_id) SORTED BY (timestamp) INTO 10 BUCKETS
COMMENT 'User events table';
Advanced Feature Mapping
1. Dynamic Partition Handling
-- Spark requires manual dynamic partition handling
-- Lakehouse handles it automatically with transform partitioning
CREATE TABLE events (
event_time TIMESTAMP,
user_id STRING,
event_data STRING
) PARTITIONED BY (
days(event_time), -- Auto partition by day
bucket(100, user_id) -- Auto bucket by user ID
);
2. Schema Merge and Evolution
-- Lakehouse natively supports schema evolution
ALTER TABLE events ADD COLUMN new_field STRING;
ALTER TABLE events CHANGE COLUMN event_data TYPE JSON;
3. Complex Data Type Handling
-- Fully compatible with Spark complex types
CREATE TABLE user_profiles (
user_id BIGINT,
tags ARRAY<STRING>, -- Tag array
attributes MAP<STRING, STRING>, -- Attribute mapping
address STRUCT< -- Nested struct
street: STRING,
city: STRING,
coordinates: STRUCT<lat: DOUBLE, lon: DOUBLE>
>,
preferences JSON -- Flexible JSON field
);
-- Optimized table creation pattern for Spark users
CREATE TABLE optimized_fact_table (
-- Business fields
transaction_id BIGINT IDENTITY(1),
user_id BIGINT,
product_id INT,
amount DECIMAL(18, 2),
transaction_time TIMESTAMP,
-- Generated partition columns (replaces Spark manual partitioning)
tx_date STRING GENERATED ALWAYS AS (date_format(transaction_time, 'yyyy-MM-dd')),
tx_hour INT GENERATED ALWAYS AS (hour(transaction_time))
)
PARTITIONED BY (tx_date, tx_hour) -- Two-level partitioning
CLUSTERED BY (user_id) INTO 256 BUCKETS -- Bucket by user dimension
PROPERTIES (
'data_lifecycle' = '730', -- 2-year lifecycle
'partition.cache.policy.latest.count' = '7' -- Cache last 7 days
);
Hive User Migration Guide
Perfect Syntax Compatibility
Singdata Lakehouse is fully compatible with Hive's partition syntax while offering more powerful features:
Traditional Hive Table Creation:
-- Hive style (fully supported)
CREATE TABLE IF NOT EXISTS sales_fact (
order_id BIGINT,
product_id INT,
quantity INT,
price DECIMAL(10,2)
) PARTITIONED BY (
dt STRING,
region STRING
)
STORED AS PARQUET
TBLPROPERTIES ('transactional'='true');
Lakehouse Enhanced Version:
-- Same syntax, better performance
CREATE TABLE IF NOT EXISTS sales_fact (
order_id BIGINT,
product_id INT,
quantity INT,
price DECIMAL(10,2),
order_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP -- New: default value
) PARTITIONED BY (
dt STRING,
region STRING
)
COMMENT 'Sales fact table with enhanced features'
PROPERTIES (
'data_lifecycle' = '1095', -- New: 3-year lifecycle management
'change_tracking' = 'false' -- New: CDC capability reserved
);
Partition Strategy Upgrade
Evolution from Static Partitioning to Intelligent Partitioning:
-- Hive traditional approach: manual partition management required
INSERT OVERWRITE TABLE sales_fact PARTITION(dt='2024-06-20', region='CN')
SELECT ...;
-- Lakehouse intelligent approach: automatic partition management
CREATE TABLE sales_fact_smart (
order_id BIGINT,
order_time TIMESTAMP,
region_code STRING,
amount DECIMAL(10,2),
-- Auto-generated partition column
dt STRING GENERATED ALWAYS AS (date_format(order_time, 'yyyy-MM-dd'))
) PARTITIONED BY (dt, region_code);
-- Partition is automatically handled on insert
INSERT INTO sales_fact_smart (order_id, order_time, region_code, amount)
VALUES (1001, CURRENT_TIMESTAMP, 'CN', 99.99);
Transactional Table Capability Enhancement
-- Hive transactional tables have many limitations; Lakehouse natively supports
CREATE TABLE transaction_table (
id BIGINT PRIMARY KEY, -- Native primary key support
data STRING,
version INT,
updated_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
CLUSTERED BY (id) INTO 16 BUCKETS -- Required bucketing optimization
PROPERTIES (
'data_retention_days' = '7' -- Time Travel capability
);
| Operation Type | Hive (MapReduce) | Lakehouse | Improvement |
|---|
| Table Creation Speed | Seconds | Milliseconds | 10x |
| Partition Pruning | File-level scan | Metadata pruning | 100x |
| Schema Changes | Requires rebuild | Online change | Infinite |
| Small File Compaction | Manual maintenance | Automatic optimization | Automated |
MaxCompute User Migration Guide
Syntax Comparison and Enhancement
MaxCompute users will find Singdata Lakehouse's syntax very familiar, while the functionality is even more powerful:
MaxCompute Table Creation:
-- MaxCompute syntax
CREATE TABLE IF NOT EXISTS sale_detail(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING)
LIFECYCLE 730;
Lakehouse Equivalent Syntax:
-- Nearly identical syntax, more features
CREATE TABLE IF NOT EXISTS sale_detail(
shop_name STRING,
customer_id STRING,
total_price DOUBLE,
-- New capabilities
create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
row_id BIGINT IDENTITY(1) -- Auto-increment primary key
)
PARTITIONED BY (sale_date STRING, region STRING)
PROPERTIES (
'data_lifecycle' = '730', -- Lifecycle (days)
'data_retention_days' = '7' -- Time Travel retention period
);
Function Compatibility
-- MaxCompute function -> Lakehouse function mapping
CREATE TABLE datetime_example (
id INT,
-- MaxCompute: GETDATE() -> Lakehouse: CURRENT_TIMESTAMP
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Date functions fully compatible
year_part INT GENERATED ALWAYS AS (YEAR(created_at)),
month_part INT GENERATED ALWAYS AS (MONTH(created_at)),
-- String functions are the same
user_name STRING,
name_length INT GENERATED ALWAYS AS (LENGTH(user_name))
);
Resource Management Comparison
| Feature | MaxCompute | Lakehouse | Advantage |
|---|
| Compute Resources | Prepaid CU | Elastic VCLUSTER | Pay-per-use |
| Storage Limit | Project quota | Unlimited expansion | More flexible |
| Concurrency Ctrl | Job queue | Dynamic scheduling | More efficient |
| Cross-region | Requires migration | Native support | Global |
Snowflake User Migration Guide
Architectural Philosophy Comparison
Both are cloud-native data warehouses, but Lakehouse provides more control:
Snowflake-Style Table Creation:
-- Snowflake
CREATE OR REPLACE TABLE customer_transactions (
transaction_id NUMBER AUTOINCREMENT,
customer_id VARCHAR,
amount NUMBER(18,2),
transaction_date TIMESTAMP_NTZ,
status VARCHAR DEFAULT 'PENDING'
) CLUSTER BY (customer_id);
Lakehouse Enhanced Version:
-- Lakehouse: more explicit optimization control
CREATE TABLE IF NOT EXISTS customer_transactions (
transaction_id BIGINT IDENTITY(1), -- Corresponds to AUTOINCREMENT
customer_id VARCHAR(100),
amount DECIMAL(18,2),
transaction_date TIMESTAMP_NTZ,
status VARCHAR(20) DEFAULT 'PENDING',
-- Additional optimization columns
tx_date STRING GENERATED ALWAYS AS (date_format(transaction_date, 'yyyy-MM-dd')),
INDEX idx_customer (customer_id) BLOOMFILTER -- Explicit index
)
PARTITIONED BY (tx_date) -- Explicit partition strategy
CLUSTERED BY (customer_id) INTO 128 BUCKETS -- Explicit bucket count
COMMENT 'Customer transaction table with explicit optimizations';
Time Travel and Cloning
-- Snowflake Time Travel -> Lakehouse Time Travel
CREATE TABLE orders_backup AS
SELECT * FROM orders; -- Immediate copy
-- Lakehouse also supports:
ALTER TABLE orders SET PROPERTIES ('data_retention_days' = '30'); -- 30-day history
-- Query historical data
SELECT * FROM orders TIMESTAMP AS OF '2024-06-01 00:00:00';
-- Snowflake: auto-optimization
-- Lakehouse: provides more tuning options
CREATE TABLE large_fact_table (
-- Column definitions
fact_id BIGINT,
dim1_id INT,
dim2_id INT,
measure1 DECIMAL(18,4),
measure2 DECIMAL(18,4),
fact_time TIMESTAMP,
-- Performance optimization
fact_date STRING GENERATED ALWAYS AS (date_format(fact_time, 'yyyy-MM-dd'))
)
PARTITIONED BY (fact_date)
CLUSTERED BY (dim1_id, dim2_id)
SORTED BY (fact_time DESC) -- Additional sort optimization
INTO 256 BUCKETS -- Precise parallelism control
PROPERTIES (
'partition.cache.policy.latest.count' = '30' -- Cache policy
);
Databricks User Migration Guide
Deep Integration of Delta Lake Concepts
Lakehouse draws from Delta Lake's design philosophy, offering similar but more concise syntax:
Databricks Delta Table:
# Databricks Python API
(spark.sql("""
CREATE TABLE IF NOT EXISTS events (
event_id LONG,
event_time TIMESTAMP,
user_id STRING,
event_type STRING,
properties MAP<STRING, STRING>
) USING DELTA
PARTITIONED BY (date(event_time))
TBLPROPERTIES (
'delta.deletedFileRetentionDuration' = '7 days',
'delta.optimizeWrite' = 'true'
)
"""))
Lakehouse Native SQL:
-- Pure SQL implementation, no Python wrapper needed
CREATE TABLE IF NOT EXISTS events (
event_id BIGINT,
event_time TIMESTAMP,
user_id STRING,
event_type STRING,
properties MAP<STRING, STRING>,
-- Auto partition column
event_date DATE GENERATED ALWAYS AS (CAST(event_time AS DATE))
)
PARTITIONED BY (event_date)
PROPERTIES (
'data_retention_days' = '7', -- Corresponds to deletedFileRetentionDuration
'data_lifecycle' = '365' -- Data lifecycle management
);
Streaming and Batch Unified Architecture
-- Table design supporting real-time writes
CREATE TABLE streaming_events (
event_id BIGINT PRIMARY KEY, -- Supports real-time deduplication
event_data JSON,
received_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
processed_at TIMESTAMP,
-- Partition design supports streaming writes
event_hour INT GENERATED ALWAYS AS (hour(received_at)),
event_date STRING GENERATED ALWAYS AS (date_format(received_at, 'yyyy-MM-dd'))
)
PARTITIONED BY (event_date, event_hour)
CLUSTERED BY (event_id) INTO 32 BUCKETS
PROPERTIES (
'change_tracking' = 'false' -- Reserved for Table Stream
);
Z-Order Optimization Mapping
-- Databricks: OPTIMIZE table ZORDER BY (col1, col2)
-- Lakehouse: achieve similar effect through CLUSTERED BY + SORTED BY
CREATE TABLE optimized_table (
col1 INT,
col2 STRING,
col3 DECIMAL(10,2),
col4 TIMESTAMP
)
CLUSTERED BY (col1, col2) -- Data clustering
SORTED BY (col1, col2) -- Sort optimization
INTO 64 BUCKETS;
Traditional Database User Migration Guide
Transitioning from OLTP to OLAP requires adjusting table design thinking:
Traditional OLTP Design:
-- MySQL/PostgreSQL/Oracle style
CREATE TABLE orders (
order_id INT PRIMARY KEY AUTO_INCREMENT,
customer_id INT NOT NULL,
order_date DATETIME DEFAULT NOW(),
status VARCHAR(20) DEFAULT 'NEW',
FOREIGN KEY (customer_id) REFERENCES customers(id),
INDEX idx_date (order_date),
INDEX idx_customer (customer_id)
);
Lakehouse OLAP Design:
-- Analytics-optimized design
CREATE TABLE orders (
order_id BIGINT IDENTITY(1), -- Auto-increment primary key
customer_id INT NOT NULL,
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20) DEFAULT 'NEW',
-- Analytics optimization columns
order_year INT GENERATED ALWAYS AS (year(order_date)),
order_month INT GENERATED ALWAYS AS (month(order_date)),
order_day INT GENERATED ALWAYS AS (day(order_date)),
-- Index definitions
INDEX bloom_customer (customer_id) BLOOMFILTER,
INDEX bloom_status (status) BLOOMFILTER
)
PARTITIONED BY (order_year, order_month) -- Monthly partitioning
CLUSTERED BY (customer_id) INTO 256 BUCKETS -- Customer dimension optimization
PROPERTIES (
'data_lifecycle' = '2555' -- 7-year data retention
);
Data Type Mapping
| Traditional DB | Lakehouse | Migration Advice |
|---|
| INT AUTO_INCREMENT | BIGINT IDENTITY(1) | Use BIGINT to avoid overflow |
| DATETIME | TIMESTAMP | Unified time type |
| TEXT | STRING | Default 16 MB, adjustable |
| ENUM | VARCHAR + CHECK | Use constraints instead |
| JSON (MySQL) | JSON | Native JSON support |
| SERIAL (PG) | BIGINT IDENTITY(1) | Fully compatible |
-- Traditional B-Tree index -> Columnar storage optimization
CREATE TABLE user_activity (
user_id BIGINT,
activity_time TIMESTAMP,
activity_type VARCHAR(50),
details JSON,
-- Bloom filter replaces B-Tree (for point queries)
INDEX bloom_user (user_id) BLOOMFILTER,
-- Inverted index replaces full-text index
INDEX inv_details (details) INVERTED PROPERTIES ('analyzer' = 'english'),
-- Generated column optimizes time queries
activity_date STRING GENERATED ALWAYS AS (date_format(activity_time, 'yyyy-MM-dd'))
)
PARTITIONED BY (activity_date) -- Time partitioning accelerates range queries
CLUSTERED BY (user_id) INTO 128 BUCKETS; -- User dimension clustering
Advanced Features in Detail
1. Auto-Increment Column (IDENTITY)
-- Complete example: Orders table with auto-increment primary key
CREATE TABLE order_master (
order_id BIGINT IDENTITY(1000), -- Starts from 1000
order_no VARCHAR(50) NOT NULL,
customer_id BIGINT NOT NULL,
total_amount DECIMAL(18,2),
create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
) COMMENT 'Order master table';
-- Notes:
-- 1. Only supports BIGINT type
-- 2. Continuity is not guaranteed (may skip under high concurrency)
-- 3. Cannot be added via ALTER TABLE
2. Generated Column (GENERATED ALWAYS AS)
-- Automatic time dimension generation
CREATE TABLE sales_detail (
sale_time TIMESTAMP,
product_id INT,
quantity INT,
unit_price DECIMAL(10,2),
-- Auto-calculated fields
total_amount DECIMAL(18,2) GENERATED ALWAYS AS (quantity * unit_price),
-- Time dimension expansion
sale_date STRING GENERATED ALWAYS AS (date_format(sale_time, 'yyyy-MM-dd')),
sale_year INT GENERATED ALWAYS AS (year(sale_time)),
sale_month INT GENERATED ALWAYS AS (month(sale_time)),
sale_week INT GENERATED ALWAYS AS (weekofyear(sale_time)),
sale_hour INT GENERATED ALWAYS AS (hour(sale_time))
)
PARTITIONED BY (sale_date) -- Generated columns can be used for partitioning
COMMENT 'Sales detail table with auto-calculated fields';
Generated Column Usage Restrictions:
- Generated columns cannot be used in
CLUSTERED BY or SORTED BY
- Primary key tables cannot use generated columns as partition keys
- Only deterministic functions are supported; non-deterministic functions like
CURRENT_DATE, RANDOM are not allowed
3. Default Values (DEFAULT)
-- Audit table design
CREATE TABLE audit_log (
log_id BIGINT IDENTITY(1),
table_name VARCHAR(100) NOT NULL,
operation VARCHAR(20) NOT NULL,
user_id VARCHAR(50) DEFAULT CURRENT_USER(), -- Current user
operation_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- Current time
client_ip VARCHAR(50) DEFAULT '0.0.0.0',
status VARCHAR(20) DEFAULT 'SUCCESS',
details JSON
)
PARTITIONED BY (date_format(operation_time, 'yyyy-MM-dd'))
COMMENT 'Unified audit log table';
4. Primary Keys and Unique Constraints
-- Primary key table design for CDC scenarios
CREATE TABLE customer_snapshot (
customer_id BIGINT,
snapshot_date DATE,
customer_name VARCHAR(200),
customer_level VARCHAR(20),
total_purchase DECIMAL(18,2),
last_update TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Composite primary key
PRIMARY KEY (customer_id, snapshot_date)
)
PARTITIONED BY (snapshot_date)
CLUSTERED BY (customer_id, snapshot_date)
SORTED BY (customer_id, snapshot_date) INTO 64 BUCKETS -- Note: primary key tables must use ASC sorting
COMMENT 'Customer snapshot table for CDC';
Important Primary Key Table Restrictions:
- Primary key columns must be included in
CLUSTERED BY and SORTED BY
SORTED BY must use ascending order (ASC), descending (DESC) is not allowed
- Partition columns must be a subset of the primary key
- Generated columns cannot be used as partition columns for primary key tables
5. Complex Index Strategies
-- Multi-dimensional query optimization table
CREATE TABLE product_search (
product_id BIGINT PRIMARY KEY,
product_name VARCHAR(500),
description STRING,
category_path VARCHAR(200),
brand VARCHAR(100),
price DECIMAL(10,2),
tags ARRAY<STRING>,
attributes MAP<STRING, STRING>,
-- Multiple index combinations (note: index names must be globally unique)
INDEX bloom_prod_brand (brand) BLOOMFILTER, -- Fast brand filtering
INDEX bloom_prod_category (category_path) BLOOMFILTER, -- Category filtering
INDEX inv_prod_name (product_name) INVERTED PROPERTIES ('analyzer' = 'chinese'), -- Product name search
INDEX inv_prod_desc (description) INVERTED PROPERTIES ('analyzer' = 'chinese') -- Description search
)
CLUSTERED BY (product_id) SORTED BY (product_id) INTO 128 BUCKETS -- Primary key table restriction
COMMENT 'Product search optimization table';
Index Naming Best Practices:
- Index names must be unique across the entire database
- Recommended format:
table_name_column_name or idx_table_name_column_name
- Avoid overly generic names such as
idx_id
6. Vector Index (VECTOR)
-- Vector search scenario table design
CREATE TABLE embeddings_search (
doc_id BIGINT PRIMARY KEY,
content STRING,
embedding VECTOR(512), -- 512-dimensional vector, default float type
vec_small VECTOR(128), -- 128-dimensional vector
vec_int VECTOR(int, 256), -- 256-dimensional vector with int type
-- Vector index
INDEX vec_idx (embedding) USING VECTOR PROPERTIES (
'scalar.type' = 'f32',
'distance.function' = 'cosine_distance', -- Cosine distance
'm' = '16', -- HNSW algorithm parameter
'ef.construction' = '200'
)
)
CLUSTERED BY (doc_id) SORTED BY (doc_id) INTO 64 BUCKETS
COMMENT 'Vector search optimization table';
Vector Type Usage Restrictions:
- Cannot be used in ORDER BY or GROUP BY
- Supported element types: tinyint, int, float (default)
- Can be mutually converted with array types
Partition Strategy Best Practices
1. Time-Based Partition Design
-- Multi-granularity time partitioning
CREATE TABLE event_log (
event_id BIGINT,
event_time TIMESTAMP,
event_type VARCHAR(50),
event_data JSON,
-- Multi-level partition column generation
event_date STRING GENERATED ALWAYS AS (date_format(event_time, 'yyyy-MM-dd')),
event_hour INT GENERATED ALWAYS AS (hour(event_time))
)
PARTITIONED BY (event_date, event_hour) -- Two-level partitioning
COMMENT 'Event log table with hourly partitions';
-- Transform partition function example
CREATE TABLE metric_data (
metric_time TIMESTAMP,
metric_name VARCHAR(100),
metric_value DOUBLE
)
PARTITIONED BY (
days(metric_time) -- Partition by number of days (since 1970-01-01)
);
2. Business Dimension Partitioning
-- Multi-dimension combined partitioning
CREATE TABLE transaction_fact (
tx_id BIGINT IDENTITY(1),
tx_time TIMESTAMP,
region_code VARCHAR(10),
channel_code VARCHAR(10),
amount DECIMAL(18,2),
tx_date STRING GENERATED ALWAYS AS (date_format(tx_time, 'yyyy-MM-dd'))
)
PARTITIONED BY (tx_date, region_code, channel_code) -- Three-dimensional partitioning
CLUSTERED BY (tx_id) INTO 256 BUCKETS
COMMENT 'Transaction fact table with multi-dimension partitioning';
3. Partition Count Control
-- Properly control partition granularity
CREATE TABLE log_archive (
log_time TIMESTAMP,
log_level VARCHAR(10),
log_content STRING,
-- Monthly partitioning to avoid too many partitions
log_month STRING GENERATED ALWAYS AS (date_format(log_time, 'yyyy-MM'))
)
PARTITIONED BY (log_month)
PROPERTIES (
'data_lifecycle' = '1095' -- Auto-clean after 3 years
);
Bucketing Optimization Guide
1. Bucket Count Selection
-- Bucket count calculation formula: total data size / 128 MB ~ 1 GB
-- Small table (<10 GB): small number of buckets
CREATE TABLE small_dim (
dim_id INT,
dim_name VARCHAR(100)
) CLUSTERED BY (dim_id) INTO 8 BUCKETS;
-- Medium table (10 GB ~ 1 TB): moderate number of buckets
CREATE TABLE medium_fact (
fact_id BIGINT,
measure DECIMAL(18,4)
) CLUSTERED BY (fact_id) INTO 128 BUCKETS;
-- Large table (>1 TB): large number of buckets
CREATE TABLE large_fact (
id BIGINT,
data STRING
) CLUSTERED BY (id) INTO 1024 BUCKETS;
2. Bucket Key Selection Strategy
-- JOIN optimization: use the JOIN key as the bucket key
CREATE TABLE order_items (
order_id BIGINT,
item_id INT,
quantity INT,
price DECIMAL(10,2)
) CLUSTERED BY (order_id) INTO 256 BUCKETS; -- Same bucketing as orders table
CREATE TABLE orders (
order_id BIGINT,
customer_id INT,
order_date DATE
) CLUSTERED BY (order_id) INTO 256 BUCKETS; -- Same bucketing strategy
3. Sort Optimization
-- SORTED BY accelerates range queries
CREATE TABLE time_series_data (
device_id VARCHAR(50),
metric_time TIMESTAMP,
metric_value DOUBLE
)
CLUSTERED BY (device_id)
SORTED BY (metric_time DESC) -- Latest data first
INTO 128 BUCKETS;
Table Properties (PROPERTIES) In-Depth Analysis
Officially Supported Properties
CREATE TABLE data_management (
id BIGINT,
data STRING,
create_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
PROPERTIES (
-- Lifecycle management
'data_lifecycle' = '1095', -- Auto-delete after 3 years (days)
'data_retention_days' = '30', -- 30-day Time Travel history
-- Cache policy
'partition.cache.policy.latest.count' = '7', -- Cache the most recent 7 partitions
-- Change tracking
'change_tracking' = 'true', -- Enable Table Stream functionality
-- Field length limits (bytes)
'cz.storage.write.max.string.bytes' = '33554432', -- STRING max 32 MB
'cz.storage.write.max.binary.bytes' = '33554432', -- BINARY max 32 MB
'cz.storage.write.max.json.bytes' = '33554432' -- JSON max 32 MB
);
Property Usage Reference
| Property Name | Description | Value Range | Default Value |
|---|
| data_lifecycle | Data lifecycle (days) | >0 or -1 (disabled) | -1 |
| data_retention_days | Time Travel retention period | 0-90 | 1 |
| partition.cache.policy.latest.count | Cache the most recent N partitions | >=0 | 0 |
| change_tracking | Whether to enable change tracking | true/false | false |
| cz.storage.write.max.string.bytes | STRING maximum length | >0 | 16777216 (16 MB) |
| cz.storage.write.max.binary.bytes | BINARY maximum length | >0 | 16777216 (16 MB) |
| cz.storage.write.max.json.bytes | JSON maximum length | >0 | 16777216 (16 MB) |
Recommended Table Creation Patterns
1. Dimension Table Pattern
-- Standard dimension table design
CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
product_code VARCHAR(50) NOT NULL,
product_name VARCHAR(200) NOT NULL,
category_id INT,
brand_id INT,
status VARCHAR(20) DEFAULT 'ACTIVE',
created_date DATE,
modified_date DATE,
-- Audit columns
created_by VARCHAR(50),
modified_by VARCHAR(50),
-- Index optimization
INDEX bloom_code (product_code) BLOOMFILTER,
INDEX bloom_category (category_id) BLOOMFILTER,
INDEX inv_name (product_name) INVERTED PROPERTIES ('analyzer' = 'chinese')
)
CLUSTERED BY (product_id) INTO 32 BUCKETS -- Dimension tables are usually small
COMMENT 'Product dimension table';
2. Fact Table Pattern
-- Large fact table design
CREATE TABLE fact_sales (
-- Business primary key
sale_id BIGINT IDENTITY(1),
-- Dimension foreign keys
date_id INT NOT NULL,
product_id INT NOT NULL,
customer_id BIGINT NOT NULL,
store_id INT NOT NULL,
-- Measures
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
discount_amount DECIMAL(10,2) DEFAULT 0,
tax_amount DECIMAL(10,2) DEFAULT 0,
total_amount DECIMAL(18,2) GENERATED ALWAYS AS
(quantity * unit_price - discount_amount + tax_amount),
-- Timestamp
sale_time TIMESTAMP NOT NULL,
-- Partition column
sale_date STRING GENERATED ALWAYS AS (date_format(sale_time, 'yyyy-MM-dd')),
-- Indexes
INDEX bloom_product (product_id) BLOOMFILTER,
INDEX bloom_customer (customer_id) BLOOMFILTER
)
PARTITIONED BY (sale_date)
CLUSTERED BY (customer_id) SORTED BY (sale_time DESC) INTO 256 BUCKETS
PROPERTIES (
'data_lifecycle' = '2555', -- 7-year retention
'partition.cache.policy.latest.count' = '30' -- Cache 30 days
)
COMMENT 'Sales fact table';
3. Wide Table Pattern
-- Wide table design (star schema denormalization)
CREATE TABLE user_behavior_wide (
-- User dimension
user_id BIGINT,
user_name VARCHAR(100),
user_level VARCHAR(20),
register_date DATE,
-- Behavior facts
behavior_id BIGINT IDENTITY(1),
behavior_type VARCHAR(50),
behavior_time TIMESTAMP,
-- Product dimension (denormalized)
product_id INT,
product_name VARCHAR(200),
category_name VARCHAR(100),
brand_name VARCHAR(100),
-- Computed columns
behavior_date STRING GENERATED ALWAYS AS (date_format(behavior_time, 'yyyy-MM-dd')),
behavior_hour INT GENERATED ALWAYS AS (hour(behavior_time))
)
PARTITIONED BY (behavior_date, behavior_hour)
CLUSTERED BY (user_id) INTO 512 BUCKETS
COMMENT 'User behavior wide table';
Pre-Creation Assessment
Partition Design Checklist
Bucketing Optimization Checklist
Index Strategy Checklist
FAQ and Solutions
Q1: How to handle very large tables (>10 TB)?
-- Multi-level partitioning strategy
CREATE TABLE huge_table (
id BIGINT,
event_time TIMESTAMP,
region VARCHAR(10),
data STRING,
-- Three-level partitioning
year INT GENERATED ALWAYS AS (year(event_time)),
month INT GENERATED ALWAYS AS (month(event_time)),
day INT GENERATED ALWAYS AS (day(event_time))
)
PARTITIONED BY (year, month, day, region)
CLUSTERED BY (id) INTO 2048 BUCKETS;
Q2: How to optimize real-time writes?
-- Real-time write optimized table design
CREATE TABLE realtime_events (
event_id BIGINT PRIMARY KEY, -- Supports deduplication
event_data JSON,
ingest_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Hourly partitioning reduces write conflicts
ingest_hour INT GENERATED ALWAYS AS (hour(ingest_time)),
ingest_date STRING GENERATED ALWAYS AS (date_format(ingest_time, 'yyyy-MM-dd'))
)
PARTITIONED BY (ingest_date, ingest_hour)
CLUSTERED BY (event_id) INTO 64 BUCKETS; -- Moderate bucketing
Q3: How to design SCD (Slowly Changing Dimension)?
-- Type 2 SCD design
CREATE TABLE dim_customer_scd (
customer_id BIGINT,
customer_name VARCHAR(200),
customer_level VARCHAR(20),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN,
PRIMARY KEY (customer_id, effective_date)
)
CLUSTERED BY (customer_id, effective_date)
SORTED BY (customer_id, effective_date) -- Primary key tables must use ASC sorting
INTO 128 BUCKETS
COMMENT 'Customer dimension with SCD Type 2';
Q4: How to adjust field length limits?
-- Adjust maximum length of STRING/BINARY/JSON fields
ALTER TABLE large_content_table SET PROPERTIES (
'cz.storage.write.max.string.bytes' = '67108864', -- 64 MB
'cz.storage.write.max.json.bytes' = '33554432' -- 32 MB
);
Migration Success Factors Summary
Technology Stack Migration Quick Reference
| Source System | Key Migration Points | Lakehouse Advantage |
|---|
| Spark | DataFrame -> SQL DDL | Standard SQL lowers the barrier |
| Hive | Partition syntax fully compatible | 100x performance improvement |
| MaxCompute | Functions largely consistent | Elastic resources, more flexible |
| Snowflake | Add explicit optimization | Controllable costs |
| Databricks | Delta concepts aligned | Pure SQL, more concise |
| Traditional DB | OLTP -> OLAP mindset | Unlimited scalability |
Table Creation Best Practices Summary
- Proper Partitioning: Choose low-cardinality, frequently queried columns
- Moderate Bucketing: Choose based on data volume, avoid over-bucketing
- Precise Indexing: Only create indexes for necessary columns
- Smart Use of Generated Columns: Reduce ETL complexity
- Lifecycle Management: Set a reasonable retention period
- Cautious Primary Keys: Only use in CDC scenarios
- Test First: Validate on small datasets before large-scale application
By following the table creation best practices in this guide:
- Query Performance: 10-100x improvement over Hive
- Storage Efficiency: Columnar compression saves 60-80% space
- Maintenance Cost: Automated management reduces manual effort by 90%
- Scalability: Supports petabyte-scale data volumes
Important Constraints and Limitations Summary
Primary Key Table Constraints
- CLUSTERED BY: Must include all primary key columns
- SORTED BY: Must include all primary key columns and can only use ascending (ASC) order
- PARTITIONED BY: Partition columns must be a subset of the primary key
- Generated Column Restriction: Primary key tables cannot use generated columns as partition keys
Generated Column Restrictions
- Cannot be used in
CLUSTERED BY or SORTED BY
- Non-deterministic functions are not supported (e.g.,
CURRENT_DATE, RANDOM, CURRENT_TIMESTAMP)
- Aggregate functions, window functions, or table functions are not supported
- Values cannot be specified for generated columns during INSERT
Index Constraints
- Index names must be unique across the entire database
- BLOOMFILTER indexes do not support
BUILD INDEX after creation
- Inverted indexes must specify an analyzer
- Vector indexes cannot be used in ORDER BY or GROUP BY
Data Type Restrictions
- Auto-increment columns (IDENTITY) only support the
BIGINT type
- VARCHAR maximum length is 1048576 (approximately 1 MB)
- STRING/BINARY/JSON default maximum is 16 MB; adjustable via properties
- Partition columns do not support floating-point types such as
FLOAT, DOUBLE, DECIMAL
- VECTOR type elements support tinyint, int, float
Partition Restrictions
- A single task can write to a maximum of 2048 partitions (adjustable via parameters)
- Transform partition functions use
years, months, days, hours (note the plural form)
- Partition columns do not support complex data types (ARRAY, MAP, STRUCT)
Summary
Singdata Lakehouse's CREATE TABLE functionality provides enterprises with powerful, flexible, high-performance table management capabilities. Regardless of your technology background, you can quickly get started and unlock the system's maximum value. Through the detailed explanations and rich examples in this guide, you should now have mastered all the skills needed to create efficient data tables in Lakehouse.
Start your Lakehouse journey now and experience the powerful capabilities of the next-generation data warehouse!
Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, always verify the correctness and performance impact of all operations in a test environment.