Singdata Lakehouse SELECT Statement Guide

Document Introduction

Welcome to Singdata Lakehouse! This guide will help you quickly master the use of SELECT statements in Lakehouse. Whether you are a data analyst, engineer, or data scientist, you will find the technical guidance you need here.

📖 What This Document Contains

  • Quick Start Guide: If you are accustomed to other database systems, we will help you adapt quickly
  • Complete Syntax Reference: Detailed explanations from basic queries to advanced features
  • Practical Examples: All code has been verified in real environments and can be used directly
  • Performance Optimization: Columnar storage, indexes, partitioning, and other optimization techniques
  • Best Practices: Avoid common mistakes and improve development efficiency

🎯 Target Audience

  • Data Analysts: Need to perform complex data queries and analysis
  • Data Engineers: Building ETL pipelines and data workflows
  • Data Scientists: Performing feature engineering and model training data preparation
  • System Administrators: Database management and performance optimization

💡 How to Use This Document

  1. New Users: Start with the "Migration User Guide" to understand differences from your familiar system
  2. Quick Reference: Jump directly to relevant chapters for specific syntax
  3. In-Depth Learning: Read sequentially to systematically master all Lakehouse query features
  4. Troubleshooting: Check the "Common Questions" section for solutions

Overview

Singdata Lakehouse is based on the ANSI SQL 2003 standard, highly compatible with modern SQL syntax, while providing rich extension features for big data and AI scenarios. Regardless of which database system you are used to, you can quickly get started and leverage the powerful capabilities of Lakehouse.

🚀 Core Features

  • Standard SQL Compatibility: Supports ANSI SQL 2003 core features, compatible with mainstream database syntax
  • Columnar Storage Optimization: High-performance storage engine designed for analytical queries
  • Modern Data Types: Native support for complex data types such as JSON and VECTOR
  • Intelligent Index System: Three index types: Bloom filter, full-text search, and vector index
  • AI/ML Integration: Built-in vector computation and full-text search capabilities

Migration User Guide

If you are accustomed to Spark, MySQL, PostgreSQL, or other database systems, this section will help you quickly adapt to Singdata Lakehouse and avoid common mistakes. We understand the challenge of switching systems and have prepared this detailed comparison guide.

🔄 If You Are Accustomed to Spark SQL

Skills You Can Directly Reuse

-- The following Spark SQL code can run directly in Lakehouse
SELECT customer_id, COUNT(*), SUM(amount)
FROM orders 
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5;

-- Window functions are fully compatible
SELECT 
    employee_id,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees;

⚠️ Syntax That Needs Adjustment

1. LATERAL VIEW Syntax

-- ❌ Spark syntax (not supported)
SELECT user_id, event 
FROM user_events LATERAL VIEW explode(event_array) AS event;

-- ✅ Lakehouse syntax
SELECT user_id, explode(event_array) as event 
FROM user_events;

2. Regular Expression Functions

-- ❌ Spark syntax (not supported)
WHERE regexp_like(column_name, 'pattern')

-- ✅ Lakehouse alternative
WHERE regexp_extract(column_name, 'pattern', 0) != ''
-- Or use LIKE
WHERE column_name LIKE '%pattern%'

3. Type Conversion Differences

-- ⚠️ Note: failed conversions return a special value, not NULL
SELECT 
    TRY_CAST('abc' AS INT) as result,  -- Returns nan, not NULL
    -- Safe conversion method
    CASE 
        WHEN TRY_CAST('abc' AS INT) IS NOT NULL 
             AND CAST(TRY_CAST('abc' AS INT) AS STRING) != 'nan'
        THEN TRY_CAST('abc' AS INT)
        ELSE 0
    END as safe_result;

4. NULL Value Display Characteristics

-- ⚠️ Numeric and temporal type NULLs have special display formats
SELECT 
    LAG(order_date) OVER (...) as prev_date,     -- Temporal NULL displayed as "NaT"
    LAG(customer_id) OVER (...) as prev_id,      -- Numeric NULL displayed as "nan"
    CASE 
        WHEN LAG(order_date) OVER (...) IS NULL  -- But IS NULL check is still valid
        THEN 'First Record'
        ELSE 'Has Previous'
    END as status
FROM orders;

Complex Data Type Advanced Applications

-- Spark SQL array operations (directly compatible)
WITH user_events AS (
    SELECT 
        user_id,
        array('login', 'view_product', 'purchase') as event_sequence
    FROM user_activity
)
SELECT 
    user_id,
    event_sequence[0] as first_event,                    -- Index starts from 0 (consistent with Spark)
    size(event_sequence) as event_count,
    array_contains(event_sequence, 'purchase') as converted,
    explode(event_sequence) as individual_event          -- Note: LATERAL VIEW not needed
FROM user_events;

-- Spark SQL struct operations (fully compatible)
WITH customer_profiles AS (
    SELECT 
        customer_id,
        named_struct(
            'name', customer_name,
            'contact', named_struct(
                'email', email,
                'phone', phone
            )
        ) as profile
    FROM customers
)
SELECT 
    customer_id,
    profile.name as customer_name,           -- Dot access syntax consistent with Spark
    profile.contact.email as email,
    profile.contact.phone as phone
FROM customer_profiles;

🗄️ If You Are Accustomed to MySQL/PostgreSQL

Skills You Can Directly Reuse

-- Standard SQL queries are fully compatible
SELECT c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.registration_date >= '2024-01-01'
GROUP BY c.customer_id, c.customer_name;

🔌 Extra Option for MySQL Users: MySQL Protocol Support

[Preview Release] Singdata Lakehouse supports MySQL protocol connections!

# ✅ You can use familiar MySQL connection methods
# MySQL 8.x connection method
jdbc:mysql://cn-shanghai-alicloud-mysql.api.singdata.com:3306/public?useSSL=true

# MySQL 5.x connection method  
jdbc:mysql://cn-shanghai-alicloud-mysql.api.singdata.com:3306/public?useSSL=false

# Command-line connection
mysql -h cn-shanghai-alicloud-mysql.api.singdata.com -P 3306 -u username -p database_name

📊 Primary Use Cases

  • BI Reporting Tools: Reporting tools that cannot use custom JDBC drivers
  • Legacy System Integration: Quick integration for existing MySQL client applications
  • Development Tool Compatibility: Use familiar MySQL management tools

⚠️ Important Limitations

  • Syntax is still Lakehouse: Although the protocol is compatible, SQL syntax must use Lakehouse standards
  • Data Type Limitations: MySQL-specific types (mediumint, text, blob, enum, etc.) are not supported
  • Feature Limitations: MySQL-specific commands like mysqldump, LOAD are not supported
-- ❌ MySQL-specific syntax not supported
CREATE TABLE test (col MEDIUMINT);  -- Error
SELECT CAST(value AS TEXT);         -- Error

-- ✅ Use Lakehouse corresponding types
CREATE TABLE test (col INT);        -- Correct
SELECT CAST(value AS STRING);       -- Correct

💡 Selection Advice

  • Priority Recommendation: Use the Lakehouse native driver for full functionality
  • Compatibility Scenario: Choose MySQL protocol when native driver cannot be used

⚠️ Key Differences and Tool Limitations

1. Query Result Return Tool Limitations

Tool/ClientQuery Result LimitUse CaseWorkaround
Lakehouse Studio Web UIMax 10000 rowsData exploration, debuggingAdd WHERE filter conditions
JDBC ClientNo hard limit*Application development, ETLBatch processing, streaming reads
Python SDKNo hard limit*Data science, scriptingBatch processing, pandas chunking
BI ToolsVaries by toolReport displayCheck specific tool documentation
Command-line ToolsMemory/config limitsData exportAdjust config or export in batches

*Note: "No hard limit" means no limit at the SQL engine level, but still subject to memory, network, and other resource constraints.

2. Pagination Strategy Selection

-- 📱 Web UI data exploration strategy (subject to 10000-row limit)
SELECT customer_id, order_count, total_amount
FROM customer_summary
WHERE registration_date >= '2024-01-01'  -- Filter first
  AND customer_type = 'premium'
ORDER BY total_amount DESC
LIMIT 100;  -- Small batch viewing

-- 💻 Programming interface big data processing strategy (no hard limit)
WITH batch_data AS (
    SELECT customer_id, order_id, amount
    FROM orders 
    WHERE order_date = '2024-06-01'
      AND customer_id BETWEEN :start_id AND :end_id  -- Batch range
)
SELECT * FROM batch_data;

-- 🔄 Recommended: cursor-based pagination (applicable to all tools)
SELECT customer_id, order_id, order_date, amount
FROM orders 
WHERE customer_id > :last_customer_id  -- Cursor pagination
  AND order_date >= '2024-01-01'
ORDER BY customer_id, order_id
LIMIT 1000;  -- Moderate batch size

3. String Function Compatibility

-- ✅ Most MySQL-style functions are compatible
SELECT 
    SUBSTRING(name, 1, 5) as substr_mysql,    -- Supported
    SUBSTR(name, 1, 5) as substr_standard,    -- Supported
    CONCAT_WS(',', field1, field2) as concat_ws,  -- Supported
    CONCAT(field1, ',', field2) as concat_std     -- Supported
FROM users;

4. Index Concept Differences

-- ❌ Traditional database thinking
CREATE INDEX idx_name ON table(column);  -- Expect immediate effect

-- ✅ Lakehouse correct approach
CREATE BLOOMFILTER INDEX idx_name ON TABLE table(column);  -- Only effective for new data
-- To make effective for existing data (only supported by some index types)
BUILD INDEX idx_name ON table;

5. Transactions and Locking

-- ⚠️ Lakehouse does not support traditional transaction syntax
-- ❌ Not supported
BEGIN TRANSACTION;
UPDATE products SET price = price * 1.1;
COMMIT;

-- ✅ Use batch updates or MERGE statements
MERGE INTO products p
USING (SELECT product_id, price * 1.1 as new_price FROM products) src
ON p.product_id = src.product_id
WHEN MATCHED THEN UPDATE SET price = src.new_price;

📊 Columnar Storage Optimization Thinking

Traditional Row-Based vs Columnar Storage

-- ❌ Row-based database thinking: avoid SELECT *
SELECT * FROM large_table WHERE id = 123;

-- ✅ Columnar storage optimization: explicitly specify columns
SELECT customer_id, order_amount, order_date 
FROM large_table 
WHERE id = 123;

-- 💡 Advantages of columnar storage
SELECT 
    product_category,
    AVG(price) as avg_price,  -- Only reads the price column, excellent performance
    COUNT(*) as product_count
FROM products 
GROUP BY product_category;

Understanding the Partition Concept

-- 💡 Partitioned table query optimization
-- ✅ Correct: directly use partition columns
SELECT * FROM orders 
WHERE order_date >= '2024-06-01' 
  AND order_date <= '2024-06-30';

-- ❌ Incorrect: using functions on partition columns
SELECT * FROM orders 
WHERE YEAR(order_date) = 2024;  -- Cannot leverage partition pruning

🚨 Common Mistakes and Solutions

1. Data Type Selection Errors

-- ❌ Wrong choice: storing JSON data as string
CREATE TABLE user_profiles (
    user_id INT,
    profile_data STRING  -- Requires type conversion to use JSON functions
);

-- ✅ Correct choice: use native JSON type
CREATE TABLE user_profiles (
    user_id INT,
    profile_data JSON    -- Directly supports JSON functions, better performance
);

-- Query syntax comparison
-- STRING type: requires conversion
SELECT user_id, json_extract_string(CAST(profile_data AS JSON), '$.age') as age
FROM user_profiles_string;

-- JSON type: direct use
SELECT user_id, json_extract_string(profile_data, '$.age') as age
FROM user_profiles_json;

2. JOIN Optimization Pitfalls

-- ❌ Traditional database thinking: large table on the left
SELECT l.*, s.name
FROM large_table l
JOIN small_table s ON l.id = s.id;

-- ✅ Lakehouse optimization: use MAPJOIN hint
SELECT /*+ MAPJOIN(small_table) */ l.*, s.name
FROM large_table l
JOIN small_table s ON l.id = s.id;

3. Aggregation Query Optimization

-- ❌ Inefficient: multiple scans of large table
SELECT 
    (SELECT COUNT(*) FROM large_table WHERE status = 'active') as active_count,
    (SELECT COUNT(*) FROM large_table WHERE status = 'inactive') as inactive_count;

-- ✅ Efficient: single scan
SELECT 
    SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) as active_count,
    SUM(CASE WHEN status = 'inactive' THEN 1 ELSE 0 END) as inactive_count
FROM large_table;

💡 Quick Adaptation Suggestions

1. Features to Prioritize

  • JSON Data Type: Replace string storage for semi-structured data
  • VECTOR Data Type: Native vector search support
  • Partitioned Tables: Partition by time or business dimensions
  • Columnar Indexes: BLOOMFILTER, INVERTED, VECTOR
  • MySQL Protocol Connection: Compatibility option when native driver cannot be used

2. Performance Optimization Habits

-- Good habit: partition + column selection + index
SELECT customer_id, order_amount
FROM orders 
WHERE order_date >= '2024-06-01'     -- Partition pruning
  AND customer_type = 'premium'      -- Bloom filter index
  AND match_any(description, 'phone', map('analyzer', 'chinese'))  -- Full-text index
ORDER BY order_amount DESC
LIMIT 100;

3. Anti-Patterns to Avoid

-- ❌ Patterns to avoid
-- 1. Using functions on partition columns
WHERE YEAR(partition_date) = 2024

-- 2. Overusing SELECT *
SELECT * FROM large_table

-- 3. Ignoring data type optimization
profile_json STRING  -- Should use JSON type

-- 4. Deep pagination ignoring tool limits (Web UI)
-- Not adding filter conditions when querying large data volumes in Studio

-- 5. Full-table sorted pagination
SELECT * FROM huge_table ORDER BY random_column LIMIT 100 OFFSET 5000  -- Extremely poor performance

📚 Learning Path Suggestions

Regardless of which database system you previously used, we recommend the following learning path:

  1. Week 1: Familiarize yourself with basic query syntax and understand differences from your familiar system
  2. Week 2: Master the data type system, especially modern types like JSON and VECTOR
  3. Week 3: Learn the index system and performance optimization techniques
  4. Week 4: Dive into partitioned tables and advanced features

This learning pace will help you fully master Lakehouse core features within one month.

🔗 Quick Reference

Feature CategoryTraditional DatabaseLakehouse Support StatusNotes
Connection ProtocolMySQL Protocol✅ Supported (Preview)Suitable for BI reports, etc.
Pagination QueryLIMIT n OFFSET m✅ Fully supportedTool display limits vary
Regex Functionsregexp_like()❌ Use regexp_extract() != ''Function name difference
String FunctionsSUBSTRING()✅ Fully supportedMySQL syntax compatible
String ConcatenationCONCAT_WS()✅ Fully supportedMySQL syntax compatible
Array ExpansionLATERAL VIEW explode()❌ Use explode()Simplified syntax
Transaction SyntaxBEGIN/COMMIT❌ Use batch operationsColumnar storage characteristics
Query OptimizationSELECT *⚠️ Explicit column names recommendedColumnar storage optimization

Basic Syntax

SELECT Statement Structure

Lakehouse SELECT statements follow standard SQL syntax and support full query functionality:

SELECT [ALL | DISTINCT] select_expr [, select_expr ...]
[FROM table_reference [, table_reference ...]]
[WHERE where_condition]
[GROUP BY grouping_element [, grouping_element ...]]
[HAVING having_condition]
[ORDER BY sort_item [, sort_item ...]]
[LIMIT row_count]

Basic Query Examples

-- Basic query
SELECT 
    customer_id,
    customer_name,
    registration_date,
    total_orders
FROM customers
WHERE registration_date >= '2024-01-01'
ORDER BY total_orders DESC
LIMIT 100;

-- Aggregation query
SELECT 
    product_category,
    COUNT(*) as product_count,
    AVG(price) as avg_price,
    SUM(sales_amount) as total_sales
FROM products
GROUP BY product_category
HAVING COUNT(*) > 10;

Enhanced Syntax Features

EXCEPT Clause

Lakehouse provides the EXCEPT clause to exclude specified columns:

-- Exclude sensitive fields
SELECT * EXCEPT(password, credit_card, ssn)
FROM user_accounts;

-- Exclude multiple fields
SELECT * EXCEPT(internal_id, created_by, updated_by)
FROM public_data;

Data Type System

Core Data Types

Type CategoryData TypeDescriptionExample
NumericTINYINT, SMALLINT, INT, BIGINTInteger typesSELECT 123::INT
FLOAT, DOUBLEFloating-point typesSELECT 3.14::DOUBLE
DECIMAL(p,s)Exact numeric typeSELECT 999.99::DECIMAL(10,2)
StringSTRINGVariable-length string, max 16MBSELECT 'Hello'::STRING
VARCHAR(n), CHAR(n)Fixed/variable-length stringSELECT 'Text'::VARCHAR(50)
Date/TimeDATEDate typeSELECT '2024-06-01'::DATE
TIMESTAMP_LTZTimestamp with time zoneSELECT CURRENT_TIMESTAMP()
TIMESTAMP_NTZTimestamp without time zoneSELECT '2024-06-01 12:00:00'::TIMESTAMP_NTZ
BooleanBOOLEANBoolean valueSELECT true::BOOLEAN
BinaryBINARYBinary dataSELECT X'48656C6C6F'::BINARY

Complex Data Types

ARRAY Type

-- Create and manipulate arrays
SELECT 
    array(1, 2, 3, 4, 5) as numbers,
    array('apple', 'banana', 'orange') as fruits,
    array_contains(array(1, 2, 3), 2) as contains_check,
    size(array(1, 2, 3, 4)) as array_length;

-- Array access and processing
WITH data AS (
    SELECT array('a', 'b', 'c', 'd') as letters
)
SELECT 
    letters[0] as first_letter,    -- Index starts from 0
    letters[3] as last_letter,
    slice(letters, 1, 2) as middle_slice,  -- Start from index 1, take 2 elements
    array_sort(letters) as sorted_letters
FROM data;

⚠️ Important Note: slice(array, start_index, length) function parameter meanings:

  • start_index: Starting index position (0-based)
  • length: Number of elements to extract
  • Example: slice(array('a','b','c','d'), 1, 2) returns ['b', 'c']

STRUCT Type

-- Create and access structs
SELECT 
    named_struct('name', 'Alice', 'age', 30, 'city', 'Seattle') as person,
    named_struct(
        'street', '123 Main St',
        'city', 'Seattle',
        'zipcode', '98101'
    ) as address;

-- Struct field access
WITH customer_data AS (
    SELECT 
        1 as id,
        named_struct(
            'name', 'Alice Johnson',
            'contact', named_struct(
                'email', 'alice@example.com',
                'phone', '555-1234'
            )
        ) as profile
)
SELECT 
    id,
    profile.name as customer_name,
    profile.contact.email as email,
    profile.contact.phone as phone
FROM customer_data;

JSON Type

JSON is natively supported as a first-class citizen with excellent query performance:

-- JSON literals and functions
SELECT 
    JSON '{"name": "Alice", "age": 30, "skills": ["Python", "SQL"]}' as profile,
    json_extract_string(JSON '{"name": "Alice"}', '$.name') as name,
    json_extract_int(JSON '{"age": 30}', '$.age') as age,
    json_extract(JSON '{"skills": ["Python", "SQL"]}', '$.skills') as skills;

-- JSON query and filtering
WITH user_profiles AS (
    SELECT 
        1 as user_id,
        JSON '{"name": "Alice", "department": "Engineering", "skills": ["Python", "Java"]}' as profile
    UNION ALL
    SELECT 2, JSON '{"name": "Bob", "department": "Sales", "skills": ["Excel", "CRM"]}'
)
SELECT 
    user_id,
    json_extract_string(profile, '$.name') as name,
    json_extract_string(profile, '$.department') as department,
    json_extract(profile, '$.skills[0]') as primary_skill
FROM user_profiles
WHERE json_extract_string(profile, '$.department') = 'Engineering';

VECTOR Type

Vector data type designed for AI/ML scenarios:

-- Vector creation and computation
SELECT 
    VECTOR(0.1, 0.2, 0.3, 0.4) as embedding_vector,
    l2_distance(VECTOR(0.1, 0.2, 0.3), VECTOR(0.4, 0.5, 0.6)) as l2_dist,
    cosine_distance(VECTOR(0.1, 0.2, 0.3), VECTOR(0.4, 0.5, 0.6)) as cosine_sim,
    dot_product(VECTOR(1.0, 2.0, 3.0), VECTOR(2.0, 3.0, 4.0)) as dot_prod;

Type Conversion

Basic Conversion Syntax

Lakehouse supports multiple type conversion syntaxes:

-- Three conversion methods
SELECT 
    CAST(123 AS STRING) as cast_syntax,
    123::STRING as double_colon_syntax,
    STRING(123) as function_syntax;

Safe Type Conversion

Handling mechanism when conversion fails:

-- TRY_CAST safe conversion
SELECT 
    '123' as original,
    TRY_CAST('123' AS INT) as valid_conversion,      -- Returns 123
    TRY_CAST('abc' AS INT) as invalid_conversion,    -- Returns nan
    
    -- Conversion validation
    CASE 
        WHEN TRY_CAST('123' AS INT) IS NOT NULL 
             AND CAST(TRY_CAST('123' AS INT) AS STRING) != 'nan'
        THEN 'VALID'
        ELSE 'INVALID'
    END as conversion_status;

🚨 NULL Value Display Characteristics

Important Note: The display of NULL values in Lakehouse has special characteristics:

-- ⚠️ Numeric and temporal type NULLs have special display formats
SELECT 
    customer_id,
    LAG(customer_id) OVER (ORDER BY registration_date) as prev_id,        -- Numeric NULL displayed as "nan"
    LAG(registration_date) OVER (ORDER BY registration_date) as prev_date, -- Temporal NULL displayed as "NaT"
    
    -- But IS NULL/IS NOT NULL logic checks work perfectly fine
    CASE 
        WHEN LAG(customer_id) OVER (ORDER BY registration_date) IS NULL 
        THEN 'First Customer'
        ELSE 'Has Previous'
    END as position_status
FROM customers;

Key Points:

  • Numeric types (INT, DOUBLE, etc.) NULL displayed as "nan"
  • Temporal types (DATE, TIMESTAMP, etc.) NULL displayed as "NaT"
  • String types (STRING, VARCHAR, etc.) NULL displayed normally
  • All types IS NULL and IS NOT NULL checks work correctly

Function Library

String Functions

SELECT 
    -- Basic string operations
    upper('hello world') as uppercase,
    lower('HELLO WORLD') as lowercase,
    length('hello') as str_length,
    substr('hello world', 1, 5) as substring,
    trim('  hello  ') as trimmed,
    
    -- String concatenation and replacement
    concat('hello', ' ', 'world') as concatenated,
    replace('hello world', 'world', 'lakehouse') as replaced,
    
    -- String splitting and extraction
    split('a,b,c,d', ',') as split_array,
    regexp_extract('abc123def', '[0-9]+', 0) as extracted_number,
    regexp_replace('hello123world', '[0-9]+', 'XXX') as pattern_replaced;

Mathematical Functions

SELECT 
    -- Basic math operations
    abs(-10) as absolute_value,
    round(3.14159, 2) as rounded,
    ceil(3.1) as ceiling,
    floor(3.9) as floor,
    
    -- Advanced math functions
    sqrt(16) as square_root,
    pow(2, 3) as power,
    mod(10, 3) as modulo,
    greatest(1, 5, 3, 8, 2) as max_value,
    least(1, 5, 3, 8, 2) as min_value;

Date/Time Functions

SELECT 
    -- Current time
    current_date() as today,
    current_timestamp() as now,
    localtimestamp() as local_now,
    
    -- Date arithmetic
    date_add('2024-06-01', 30) as add_days,
    date_sub('2024-06-01', 7) as subtract_days,
    datediff('2024-06-10', '2024-06-01') as date_difference,
    
    -- Date extraction
    year('2024-06-01') as extract_year,
    month('2024-06-01') as extract_month,
    day('2024-06-01') as extract_day,
    
    -- Formatting
    date_format('2024-06-01', 'yyyy-MM-dd') as formatted_date;

Aggregate Functions

-- Basic aggregation
SELECT 
    product_category,
    COUNT(*) as total_count,
    COUNT(DISTINCT customer_id) as unique_customers,
    SUM(sales_amount) as total_sales,
    AVG(sales_amount) as average_sales,
    MIN(sales_amount) as min_sales,
    MAX(sales_amount) as max_sales,
    STDDEV(sales_amount) as sales_stddev
FROM sales_data
GROUP BY product_category;

-- Advanced aggregation
SELECT 
    product_category,
    collect_list(product_name) as product_names,
    collect_set(brand) as unique_brands,
    percentile(sales_amount, 0.5) as median_sales,
    percentile(sales_amount, array(0.25, 0.5, 0.75)) as quartiles
FROM sales_data
GROUP BY product_category;

Window Functions

-- Ranking functions
SELECT 
    employee_name,
    department,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) as row_num,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dense_rank,
    PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) as percentile_rank,
    CUME_DIST() OVER (PARTITION BY department ORDER BY salary) as cumulative_dist
FROM employees;

-- Offset functions
SELECT 
    employee_name,
    salary,
    hire_date,
    LAG(salary, 1) OVER (ORDER BY hire_date) as prev_salary,
    LEAD(salary, 1) OVER (ORDER BY hire_date) as next_salary,
    FIRST_VALUE(salary) OVER (ORDER BY hire_date) as first_salary,
    LAST_VALUE(salary) OVER (
        ORDER BY hire_date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_salary
FROM employees;

-- Aggregate window functions
SELECT 
    employee_name,
    department,
    salary,
    SUM(salary) OVER (PARTITION BY department) as dept_total,
    AVG(salary) OVER (PARTITION BY department) as dept_average,
    COUNT(*) OVER (PARTITION BY department) as dept_count,
    
    -- Moving window
    AVG(salary) OVER (
        ORDER BY hire_date 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) as moving_avg_3months
FROM employees;

Advanced Queries

JOIN Operations

Basic JOIN Types

-- INNER JOIN
SELECT c.customer_name, o.order_id, o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

-- LEFT JOIN
SELECT c.customer_name, o.order_id, o.total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;

-- RIGHT JOIN
SELECT c.customer_name, o.order_id, o.total_amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;

-- FULL OUTER JOIN
SELECT c.customer_name, o.order_id, o.total_amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;

JOIN Optimization Hints

-- Broadcast JOIN hint
SELECT /*+ MAPJOIN(small_table) */
    l.large_table_id,
    s.small_table_value
FROM large_table l
JOIN small_table s ON l.join_key = s.join_key;

-- Sort merge JOIN hint
SELECT /*+ SORTMERGEJOIN(table1, table2) */
    t1.column1,
    t2.column2
FROM table1 t1
JOIN table2 t2 ON t1.key = t2.key;

Subqueries and CTEs

Common Table Expressions (CTE)

-- Basic CTE
WITH monthly_sales AS (
    SELECT 
        date_format(order_date, 'yyyy-MM') as month,
        SUM(total_amount) as monthly_total
    FROM orders
    GROUP BY date_format(order_date, 'yyyy-MM')
),
avg_monthly_sales AS (
    SELECT AVG(monthly_total) as avg_monthly
    FROM monthly_sales
)
SELECT 
    ms.month,
    ms.monthly_total,
    ams.avg_monthly,
    ms.monthly_total - ams.avg_monthly as variance
FROM monthly_sales ms
CROSS JOIN avg_monthly_sales ams
ORDER BY ms.month;

Hierarchical Queries

For hierarchical data processing, you can use multi-level JOIN or self-join approaches:

-- Organizational structure query example (using self-join approach)
WITH employee_levels AS (
    -- Level 1: top-level managers
    SELECT employee_id, employee_name, manager_id, 1 as level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Level 2: direct reports
    SELECT e.employee_id, e.employee_name, e.manager_id, 2 as level
    FROM employees e
    INNER JOIN employees m ON e.manager_id = m.employee_id
    WHERE m.manager_id IS NULL
    
    UNION ALL
    
    -- Level 3: deeper reports
    SELECT e.employee_id, e.employee_name, e.manager_id, 3 as level
    FROM employees e
    INNER JOIN employees m1 ON e.manager_id = m1.employee_id
    INNER JOIN employees m2 ON m1.manager_id = m2.employee_id
    WHERE m2.manager_id IS NULL
)
SELECT * FROM employee_levels
ORDER BY level, employee_name;

Correlated Subqueries

-- Correlated subquery example
SELECT 
    product_id,
    product_name,
    price,
    (SELECT AVG(price) 
     FROM products p2 
     WHERE p2.category = p1.category) as category_avg_price
FROM products p1
WHERE price > (
    SELECT AVG(price) 
    FROM products p2 
    WHERE p2.category = p1.category
);

Advanced Features

Built-in full-text search functionality supporting multiple match modes:

-- Full-text search functions
WITH documents AS (
    SELECT 1 as doc_id, 'Artificial intelligence and machine learning technologies are developing rapidly' as content
    UNION ALL SELECT 2, 'Deep learning is an important branch of machine learning'
    UNION ALL SELECT 3, 'Cloud computing provides powerful computing capabilities for artificial intelligence'
)
SELECT 
    doc_id,
    content,
    -- Match all keywords
    match_all(content, 'machine learning artificial intelligence', map('analyzer', 'chinese')) as matches_all,
    -- Match any keyword
    match_any(content, 'machine learning deep learning', map('analyzer', 'chinese')) as matches_any,
    -- Phrase matching
    match_phrase(content, 'machine learning technology', map('analyzer', 'chinese')) as phrase_match
FROM documents
WHERE match_all(content, 'machine learning', map('analyzer', 'chinese'));

Supports efficient vector similarity search:

-- Vector similarity search
WITH document_embeddings AS (
    SELECT 1 as doc_id, 'AI Technical Documentation' as title, VECTOR(0.1, 0.3, 0.7, 0.2) as embedding
    UNION ALL SELECT 2, 'Machine Learning Guide', VECTOR(0.2, 0.4, 0.6, 0.3)
    UNION ALL SELECT 3, 'Database Tutorial', VECTOR(0.8, 0.1, 0.2, 0.4)
),
search_query AS (
    SELECT VECTOR(0.1, 0.3, 0.7, 0.2) as query_vector
)
SELECT 
    d.doc_id,
    d.title,
    cosine_distance(d.embedding, s.query_vector) as similarity_score
FROM document_embeddings d
CROSS JOIN search_query s
WHERE cosine_distance(d.embedding, s.query_vector) < 0.5
ORDER BY similarity_score
LIMIT 10;

Historical Data Queries

For scenarios requiring historical data queries, it is recommended to use time fields for filtering:

-- Query data within a specific time range
SELECT * FROM orders 
WHERE created_time >= '2024-06-01 00:00:00'
  AND created_time <= '2024-06-01 23:59:59';

-- Query data relative to current time
SELECT * FROM orders 
WHERE created_time >= CURRENT_TIMESTAMP() - INTERVAL '1' HOUR;

Index Optimization

Singdata Lakehouse supports three types of indexes to optimize query performance, each targeting different query scenarios:

Index Type Overview

Index TypeUse CaseSupported Data TypesQuery Functions
BLOOMFILTEREquality filterBasic types except interval, struct, map, arrayStandard WHERE conditions
INVERTEDFull-text searchSTRING typematch_all, match_any, match_phrase
VECTORVector similarity searchVECTOR typecosine_distance, l2_distance, etc.

1. Bloom Filter Index (BLOOMFILTER INDEX)

Bloom filter indexes are used to quickly filter equality queries, especially suitable for high-cardinality columns.

Creation Syntax

-- Basic syntax
CREATE BLOOMFILTER INDEX [IF NOT EXISTS] index_name 
ON TABLE [schema].table_name(column_name) 
[COMMENT 'comment'] 
[PROPERTIES ('key'='value')];

-- Example
CREATE BLOOMFILTER INDEX idx_customer_id 
ON TABLE orders(customer_id) 
COMMENT 'Customer ID bloom filter index';

Usage Example

-- Create test table
CREATE TABLE product_sales (
    product_id INT,
    category STRING,
    sales_amount DOUBLE,
    sale_date DATE
);

-- Create bloom filter index
CREATE BLOOMFILTER INDEX idx_category_bloom 
ON TABLE product_sales(category) 
COMMENT 'Product category bloom filter index';

-- Use index for equality query (index applies automatically)
SELECT product_id, sales_amount
FROM product_sales
WHERE category = 'Electronics';  -- Bloom filter accelerates this query

Limitations

  • Only effective for new data: Only effective for newly written data after creation; existing data needs to be rewritten
  • BUILD INDEX not supported: Bloom filter indexes do not support building indexes on existing data
  • Data type limitations: Does not support complex types such as interval, struct, map, array

2. Inverted Index (INVERTED INDEX)

Inverted indexes are used for full-text search functionality, supporting multiple text analyzers.

Creation Syntax

-- Create during table creation
CREATE TABLE documents (
    doc_id INT,
    title STRING,
    content STRING,
    INDEX content_idx (content) INVERTED COMMENT 'Content inverted index'
);

-- Add index to existing table
CREATE INVERTED INDEX [IF NOT EXISTS] index_name 
ON TABLE [schema].table_name(column_name) 
PROPERTIES('analyzer'='analyzer_type');

Analyzer Types

AnalyzerUse CaseDescription
chineseChinese textSupports Chinese word segmentation
englishEnglish textEnglish word segmentation and stemming
keywordKeyword matchingNo segmentation, exact match
unicodeGeneral textUnicode character processing

Usage Example

-- Create inverted index
CREATE INVERTED INDEX IF NOT EXISTS idx_content_search 
ON TABLE documents(content) 
PROPERTIES('analyzer'='chinese');

-- Build index (effective on existing data)
BUILD INDEX idx_content_search ON documents;

-- Full-text search queries
-- Match all keywords
SELECT doc_id, title, content
FROM documents
WHERE match_all(content, 'artificial intelligence machine learning', map('analyzer', 'chinese'));

-- Match any keyword
SELECT doc_id, title, content
FROM documents
WHERE match_any(content, 'artificial intelligence deep learning', map('analyzer', 'chinese'));

-- Phrase matching
SELECT doc_id, title, content
FROM documents
WHERE match_phrase(content, 'machine learning technology', map('analyzer', 'chinese'));

Support for Partition Build

-- Build index by partition
BUILD INDEX idx_content_search ON documents 
WHERE partition_date >= '2024-06-01' AND partition_date <= '2024-06-30';

3. Vector Index (VECTOR INDEX)

Vector indexes are designed for AI/ML scenarios, supporting efficient vector similarity search.

Creation Syntax

CREATE VECTOR INDEX [IF NOT EXISTS] index_name 
ON TABLE [schema].table_name(column_name) 
PROPERTIES(
    "scalar.type" = "scalar_type",
    "distance.function" = "distance_function",
    "ef_construction" = "ef_value",
    "M" = "M_value"
);

Parameter Descriptions

ParameterOptionsDefaultDescription
scalar.typef32, f16, i8, b1f32Vector scalar type
distance.functioncosine_distance, l2_distance, jaccard_distance, hamming_distancecosine_distanceDistance computation function
ef_constructionInteger200Build parameter, affects recall rate and build time
MInteger16Graph connectivity, affects query performance

Usage Example

-- Create table with vector column
CREATE TABLE document_embeddings (
    doc_id INT,
    title STRING,
    embedding VECTOR(512),  -- 512-dimensional vector
    created_at TIMESTAMP_LTZ
);

-- Create vector index
CREATE VECTOR INDEX IF NOT EXISTS idx_doc_embedding 
ON TABLE document_embeddings(embedding) 
PROPERTIES(
    "scalar.type" = "f32",
    "distance.function" = "cosine_distance",
    "ef_construction" = "200",
    "M" = "16"
) COMMENT 'Document vector embedding index';

-- Build index
BUILD INDEX idx_doc_embedding ON document_embeddings;

-- Vector similarity search
WITH query_vector AS (
    SELECT VECTOR(0.1, 0.2, 0.3, ...) as search_embedding  -- Query vector
)
SELECT 
    d.doc_id,
    d.title,
    cosine_distance(d.embedding, q.search_embedding) as similarity_score
FROM document_embeddings d
CROSS JOIN query_vector q
WHERE cosine_distance(d.embedding, q.search_embedding) < 0.5  -- Similarity threshold
ORDER BY similarity_score
LIMIT 10;

-- Using other distance functions
SELECT 
    doc_id,
    title,
    l2_distance(embedding, VECTOR(0.1, 0.2, 0.3)) as l2_dist,
    dot_product(embedding, VECTOR(0.1, 0.2, 0.3)) as dot_prod
FROM document_embeddings
ORDER BY l2_dist
LIMIT 5;

Index Management

IF NOT EXISTS Support for Index Creation

✅ Syntax Support: All three index types support the IF NOT EXISTS syntax:

-- Bloom filter index
CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);

-- Inverted index
CREATE INVERTED INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);

-- Vector index
CREATE VECTOR INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);

⚠️ Current Implementation Note: Although the syntax is supported, creating duplicate indexes of the same type on the same column may still result in an error. This indicates that syntax parsing is correct, but logic checks are still being refined.

Building Indexes (BUILD INDEX)

Build indexes on existing data, only supported for INVERTED and VECTOR indexes:

-- Full table build
BUILD INDEX index_name ON [schema].table_name;

-- Partition build
BUILD INDEX index_name ON table_name 
WHERE partition_col = 'value' AND other_col > 100;

-- Multi-partition build
BUILD INDEX index_name ON table_name 
WHERE partition_date >= '2024-06-01' 
  AND partition_date <= '2024-06-30';

Viewing Index Information

-- View all indexes on a table
DESCRIBE TABLE table_name;

-- View specific index details
DESCRIBE INDEX index_name;

-- Extended index information
DESCRIBE INDEX EXTENDED index_name;

Dropping Indexes

DROP INDEX [IF EXISTS] index_name;

Query Optimization Suggestions

Composite Search: Combining Multiple Indexes

-- Combine bloom filter and full-text search
SELECT doc_id, title, content
FROM documents
WHERE category = 'technology'  -- Bloom filter index acceleration
  AND match_any(content, 'artificial intelligence', map('analyzer', 'chinese'));  -- Inverted index search

-- Vector search combined with filtering
SELECT doc_id, title, embedding,
       cosine_distance(embedding, VECTOR(0.1, 0.2, 0.3)) as score
FROM document_embeddings
WHERE doc_type = 'article'  -- Filter first, then vector search
  AND cosine_distance(embedding, VECTOR(0.1, 0.2, 0.3)) < 0.3
ORDER BY score
LIMIT 20;

Performance Optimization

Query Optimization Principles

Partition Pruning Optimization

-- Correct partition query approach
SELECT *
FROM partitioned_table
WHERE date_partition >= '2024-06-01'
  AND date_partition <= '2024-06-30'
  AND status = 'active';

-- Anti-pattern to avoid
-- WHERE year(date_partition) = 2024  -- Cannot leverage partition pruning

Predicate Pushdown Optimization

-- Filter data before JOIN
WITH filtered_orders AS (
    SELECT customer_id, order_id, total_amount
    FROM orders
    WHERE order_date >= '2024-06-01'
      AND total_amount > 100
),
active_customers AS (
    SELECT customer_id, customer_name
    FROM customers
    WHERE status = 'active'
)
SELECT c.customer_name, o.order_id, o.total_amount
FROM active_customers c
JOIN filtered_orders o ON c.customer_id = o.customer_id;

Column Pruning Optimization

-- Select only the needed columns
SELECT customer_id, customer_name, total_orders
FROM customers
WHERE registration_date >= '2024-01-01';

-- Use EXCEPT to exclude unnecessary columns
SELECT * EXCEPT(internal_notes, created_by, updated_by)
FROM customer_details;

Pagination Query Optimization Strategy

Choose the appropriate pagination strategy based on the tool you are using:

📱 Web UI Pagination Strategy

Applicable to: Lakehouse Studio interface exploration (subject to 10000-row display limit)

-- Strategy 1: conditional filtering + small batch viewing
SELECT customer_id, order_count, last_order_date
FROM customer_summary
WHERE customer_type = 'premium'     -- Filter first
  AND last_order_date >= '2024-01-01'
ORDER BY order_count DESC
LIMIT 100;  -- Within display limit

-- Strategy 2: time window pagination
SELECT order_id, customer_id, amount
FROM orders
WHERE order_date >= '2024-06-01 00:00:00'
  AND order_date < '2024-06-01 06:00:00'  -- 6-hour window
ORDER BY order_date
LIMIT 1000;

-- Strategy 3: categorical viewing
SELECT product_id, product_name, price
FROM products
WHERE category = 'electronics'  -- Single category
ORDER BY price DESC
LIMIT 500;

💻 Programming Interface Pagination Strategy

Applicable to: JDBC, Python SDK, and other clients (no hard row limit)

-- Strategy 1: cursor-based pagination (recommended)
-- First page
SELECT customer_id, order_id, order_date, amount
FROM orders 
WHERE customer_id >= 0  -- Starting point
ORDER BY customer_id, order_id
LIMIT 1000;

-- Next page (based on the last record of the previous page)
SELECT customer_id, order_id, order_date, amount
FROM orders 
WHERE customer_id > :last_customer_id  -- Cursor position
   OR (customer_id = :last_customer_id AND order_id > :last_order_id)
ORDER BY customer_id, order_id
LIMIT 1000;

-- Strategy 2: range pagination
-- First batch: ID 0-9999
SELECT * FROM large_table 
WHERE id BETWEEN 0 AND 9999
ORDER BY id;

-- Second batch: ID 10000-19999
SELECT * FROM large_table 
WHERE id BETWEEN 10000 AND 19999
ORDER BY id;

-- Strategy 3: time-partitioned pagination
-- Process June 2024 data, in daily batches
SELECT * FROM orders 
WHERE order_date >= '2024-06-01'
  AND order_date < '2024-06-02'
ORDER BY order_id;

🔄 Streaming Processing Strategy

Applicable to: Big data ETL, data export scenarios

-- Strategy 1: batch streaming processing
-- Python pseudocode example
"""
batch_size = 50000
offset = 0
while True:
    sql = f'''
    SELECT * FROM large_table 
    WHERE process_date = '2024-06-01'
    ORDER BY id 
    LIMIT {batch_size} OFFSET {offset}
    '''
    rows = execute_query(sql)
    if not rows:
        break
    process_batch(rows)
    offset += batch_size
"""

-- Strategy 2: conditional iterative processing  
-- Suitable for tables with ordered primary keys
"""
last_id = 0
batch_size = 50000
while True:
    sql = f'''
    SELECT * FROM large_table 
    WHERE id > {last_id}
    ORDER BY id 
    LIMIT {batch_size}
    '''
    rows = execute_query(sql)
    if not rows:
        break
    process_batch(rows)
    last_id = rows[-1]['id']
"""

Performance Tuning Suggestions

1. Index Selection Strategy

  • High-cardinality equality queries: Use BLOOMFILTER index
  • Text search requirements: Use INVERTED index, choose the appropriate analyzer
  • Vector similarity search: Use VECTOR index, adjust ef_construction and M parameters

2. Build Strategy

  • Large table partition build: Build indexes in batches to avoid processing too much data at once
  • Build during off-peak hours: BUILD INDEX is a synchronous operation that consumes compute resources
  • Monitor build progress: Check build status via Job Profile

3. Query Optimization

-- Recommended: partition pruning + index filtering
SELECT *
FROM partitioned_table
WHERE partition_date >= '2024-06-01'  -- Partition pruning
  AND category = 'electronics'        -- Bloom filter index
  AND match_any(description, 'phone', map('analyzer', 'chinese'));  -- Inverted index

-- Avoid: using functions on indexed columns
-- Incorrect example
WHERE UPPER(category) = 'ELECTRONICS'  -- Cannot leverage index

-- Correct example
WHERE category = 'electronics'  -- Can leverage index

Best Practices

Query Writing Suggestions

  1. Explicitly specify column names: Avoid using SELECT *, explicitly specify needed columns
  2. Use indexes appropriately: Create appropriate indexes for frequently queried filter conditions
  3. Optimize JOIN order: Place small tables on the right side of JOINs, use MAPJOIN hints
  4. Use partition pruning: Directly use partition columns in WHERE conditions
  5. Avoid wrapping partition columns in functions: Do not use functions on partition columns
  6. Understand tool limitations: Choose appropriate query strategies based on the client being used

Data Type Selection Suggestions

  1. Prioritize JSON for semi-structured data: Better performance than STRING storage
  2. Choose TIMESTAMP_LTZ or TIMESTAMP_NTZ based on time data scenario
  3. Use VECTOR type for vector data: Native performance optimization
  4. Consider full-text search requirements for large text data: Can be combined with corresponding search features
  5. Understand NULL value display characteristics: Numeric and temporal types have special NULL display formats

Index Best Practices

  1. Choose the right index type: Select appropriate indexes based on query patterns
  2. Avoid over-indexing: Indexes increase storage costs and write overhead
  3. Maintain indexes regularly: Monitor index usage and performance impact
  4. Test and verify: Verify query performance improvement after creating indexes
  5. Consider data characteristics: High-cardinality columns suit bloom filters; text columns suit inverted indexes

Error Handling Suggestions

  1. Use TRY_CAST for safe type conversion
  2. Check if conversion result is "nan"
  3. Provide appropriate default value handling for NULL results in window functions
  4. Identify and handle outliers in data quality checks

Toolchain Selection Suggestions

ScenarioRecommended ToolNotesOptimization Strategy
Data ExplorationWeb Studio10000-row display limitAdd filter conditions, focus on analysis logic
Application DevelopmentJDBC/Python SDKMemory and network limitsBatch processing, reasonable batch sizes
Batch ProcessingProgramming InterfaceAvoid loading large datasets at onceImplement batch processing logic
Report DisplayBI ToolsVaries by toolUnderstand specific tool limits, design appropriate data sources
Performance TestingMulti-tool ComparisonVerify in real environmentsTest actual performance of different tools

Common Questions

Q: Is recursive CTE (WITH RECURSIVE) supported?

A: Recursive CTE syntax is not supported in the current version. For hierarchical data processing, you can use multi-level JOIN or self-join approaches:

-- Alternative: use multi-level JOIN to process hierarchical data
WITH level_1 AS (
    SELECT employee_id, employee_name, 1 as level
    FROM employees WHERE manager_id IS NULL
),
level_2 AS (
    SELECT e.employee_id, e.employee_name, 2 as level
    FROM employees e
    JOIN level_1 l1 ON e.manager_id = l1.employee_id
)
SELECT * FROM level_1
UNION ALL
SELECT * FROM level_2;

Q: How to query historical version data?

A: It is currently recommended to use time fields for historical data queries:

-- Use creation time field to query historical data
SELECT * FROM orders 
WHERE created_time >= '2024-06-01 00:00:00'
  AND created_time <= '2024-06-01 23:59:59';

-- Query data relative to current time
SELECT * FROM orders 
WHERE created_time >= CURRENT_TIMESTAMP() - INTERVAL '1' HOUR;

Q: Why does type conversion failure return "nan" instead of NULL?

A: Lakehouse adopts a more lenient type conversion strategy, returning the special value "nan" when conversion fails. You can use the following approach for safe conversion:

-- Check if conversion succeeded
SELECT 
    CASE 
        WHEN TRY_CAST(column AS INT) IS NOT NULL 
             AND CAST(TRY_CAST(column AS INT) AS STRING) != 'nan'
        THEN TRY_CAST(column AS INT)
        ELSE 0  -- Or other default value
    END as safe_converted_value
FROM table_name;

Q: Why are NULL values in window functions displayed in special formats?

A: This is the special display format for numeric and temporal NULL values in Lakehouse:

-- Numeric NULL displayed as "nan", temporal NULL displayed as "NaT"
SELECT 
    LAG(customer_id) OVER (...) as prev_id,      -- Displayed as "nan"
    LAG(timestamp_column) OVER (...) as prev_time, -- Displayed as "NaT"
    CASE 
        WHEN LAG(timestamp_column) OVER (...) IS NULL   -- IS NULL check still valid
        THEN 'No Previous Value'
        ELSE 'Has Previous Value'
    END as status
FROM table_name;

Q: How to migrate Spark's regexp_like function?

A: Lakehouse does not support the regexp_like function. You can use the following alternatives:

-- Alternative 1: use LIKE
WHERE column_name LIKE '%pattern%'

-- Alternative 2: use regexp_extract to check for matches
WHERE regexp_extract(column_name, 'pattern', 0) != ''

-- Alternative 3: combine multiple conditions
WHERE column_name LIKE '%pattern1%' OR column_name LIKE '%pattern2%'

Q: LATERAL VIEW explode syntax is not supported, what should I do?

A: Use the explode function directly, without LATERAL VIEW:

-- Spark syntax
-- SELECT id, item FROM table LATERAL VIEW explode(array_column) AS item;

-- Lakehouse syntax
SELECT id, explode(array_column) as item FROM table;

Q: Which indexes support BUILD INDEX?

A: Only INVERTED indexes and VECTOR indexes support the BUILD INDEX command:

-- Support BUILD INDEX
BUILD INDEX inverted_idx ON table_name;   -- Inverted index
BUILD INDEX vector_idx ON table_name;     -- Vector index

-- Do not support BUILD INDEX
-- BUILD INDEX bloom_idx ON table_name;   -- Bloom filter index not supported

Bloom filter indexes are only effective for newly written data. To make them effective for existing data, the data needs to be rewritten.

Q: The Web UI shows only 10000 rows, but I have more data. How can I view everything?

A: This is a display limitation of the Web UI, not a database limitation. Recommendations:

-- Web UI strategy: add filter conditions
SELECT * FROM large_table 
WHERE category = 'electronics'  -- Filter first
  AND created_date >= '2024-06-01'
LIMIT 1000;

-- Or use the Web UI's full download feature to download and then view

-- Or use programming interface to process full data
-- Python example:
"""
import singdata
conn = singdata.connect(...)
cursor = conn.cursor()
cursor.execute("SELECT * FROM large_table")
for row in cursor.fetchall():  -- Can process more than 10000 rows
    process(row)
"""

Q: How can I confirm the query limitations of different tools?

A: It is recommended to test before actual use:

  1. Web UI: Known limit of 10000 rows
  2. JDBC Client: Test actual limits through simple queries
  3. Python SDK: Check official documentation or test to verify
  4. BI Tools: Check tool documentation or contact vendor
  5. Custom Applications: Perform stress testing in development environment

Q: Does OFFSET have limitations?

A: After actual verification, Lakehouse's OFFSET feature has no hard limit:

-- ✅ These queries can all execute normally
SELECT * FROM table ORDER BY id LIMIT 10 OFFSET 10000;   -- Normal
SELECT * FROM table ORDER BY id LIMIT 10 OFFSET 15000;   -- Normal
SELECT * FROM table ORDER BY id LIMIT 10 OFFSET 50000;   -- Normal

-- Limitations mainly come from:
-- 1. Display limits of the client tool being used
-- 2. Actual constraints of network transmission and memory
-- 3. Query performance considerations (deep pagination has poor performance)

Q: Does the IF NOT EXISTS syntax for indexes have any special behavior?

A: The syntax is fully supported, but the current implementation has some peculiarities:

-- ✅ Syntax support
CREATE BLOOMFILTER INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);
CREATE INVERTED INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);
CREATE VECTOR INDEX IF NOT EXISTS idx_name ON TABLE table_name(column);

-- ⚠️ Current behavior: creating duplicate indexes of the same type on the same column may still result in an error
-- This indicates that syntax parsing is correct, but logic checks are still being refined

It is recommended to use cautiously in production environments and check whether the index already exists beforehand.


Summary

Singdata Lakehouse, as a modern data lake solution, maintains SQL standard compatibility while providing powerful extension capabilities for big data analysis and AI applications. Through this document, you should be able to:

🎯 Core Skills Mastered

  • Standard SQL Queries: Proficient use of basic features such as SELECT, JOIN, and window functions
  • Modern Data Types: Full utilization of complex data types such as JSON, VECTOR, and ARRAY
  • Performance Optimization: Improve query efficiency through indexes, partitions, and query hints
  • Full-Text Search: Intelligent text retrieval using inverted indexes
  • Vector Search: Support for similarity search needs in AI/ML scenarios
  • Toolchain Awareness: Understanding limitations and best practices of different client tools
  • NULL Value Handling: Understanding the special display formats for numeric and temporal NULL values

🚀 Advanced Development Suggestions

  1. Deep Index Optimization: Choose appropriate index strategies based on business scenarios
  2. Partitioned Table Design: Master creation and query optimization techniques for partitioned tables
  3. Data Pipeline Construction: Combine advanced features such as dynamic tables and stream processing
  4. AI/ML Integration: Explore vector search and knowledge graph applications
  5. Multi-Tool Collaboration: Choose the most suitable query tool for different scenarios

🔑 Important Insights

  • Tool Limitations != System Limitations: Always distinguish between client limitations and database engine capabilities
  • Strategy Selection: Choose appropriate query and pagination strategies based on the tool being used
  • Performance Trade-offs: Find the best balance between functional requirements and performance
  • Data Type Understanding: Master NULL value display characteristics and handle data conversions correctly

💡 Continuous Learning

  • Follow Singdata official documentation updates to learn about new features
  • Participate in community discussions, share usage experiences and best practices
  • Regularly review query performance and continuously optimize data processing workflows
  • Verify the performance of different tools in real projects and accumulate experience

🤝 Getting Help

If you encounter issues during use, we recommend:

  1. First check the "Common Questions" section of this document
  2. Refer to official technical documentation for detailed explanations
  3. Verify tool limitations and capabilities in real environments
  4. Contact the technical support team for professional assistance

Singdata Lakehouse will continue to evolve, and we look forward to your success in the journey of data analysis and AI exploration!


Reference Resources


Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.