Lakehouse JOIN Usage Guide: Multi-Technology Migration Handbook

Overview

Singdata Lakehouse provides complete, high-performance JOIN functionality, supporting seamless migration from Spark, Hive, MaxCompute, Snowflake, and traditional databases. This guide is based on real production environment experience, providing professional migration guidance and best practices for users from different technology backgrounds.


JOIN Types and Syntax

Complete JOIN Type Support

Singdata Lakehouse supports the complete SQL JOIN standard, offering 7 JOIN types:

JOIN TypeDescriptionTypical Use Cases
INNER JOINReturns matching records from both tablesStandard business association queries
LEFT [OUTER] JOINPreserves all records from the left tableEnsuring master table data integrity
RIGHT [OUTER] JOINPreserves all records from the right tableComplete dimension table display
FULL [OUTER] JOINPreserves all records from both tablesComplete data audit analysis
SEMI JOINReturns records from the left table that have matchesData existence verification
ANTI JOINReturns records from the left table without matchesData differential analysis
CROSS JOINReturns the Cartesian product of both tablesData combination generation

Basic JOIN Syntax

-- Standard INNER JOIN
SELECT e.emp_name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- LEFT JOIN ensures left table integrity
SELECT e.emp_name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

-- SEMI JOIN for existence check
SELECT e.emp_name, e.salary
FROM employees e
SEMI JOIN departments d ON e.dept_id = d.dept_id;

-- ANTI JOIN for identifying orphan data
SELECT e.emp_name, e.dept_id
FROM employees e
ANTI JOIN departments d ON e.dept_id = d.dept_id;

Performance Optimization Strategies

MAPJOIN Broadcast Optimization

MAPJOIN is a core optimization feature of Singdata Lakehouse that significantly improves JOIN performance by broadcasting small tables to all compute nodes.

Optimization Principles:

  • Eliminates expensive Shuffle operations
  • Mitigates data skew issues
  • Improves query execution speed
-- Single-table broadcast optimization
SELECT /*+ MAPJOIN(departments) */ 
    e.emp_name, d.dept_name, d.budget
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id;

-- Multi-table broadcast optimization
SELECT /*+ MAPJOIN(employees, departments) */
    o.order_id, e.emp_name, d.dept_name
FROM orders o
JOIN employees e ON o.emp_id = e.emp_id
JOIN departments d ON e.dept_id = d.dept_id;

Usage Recommendations:

  • It is recommended to keep small table sizes within 1 GB
  • Suitable for dimension table and fact table joins
  • Prioritize broadcasting small tables over large-table-to-large-table JOINs

SORTMERGEJOIN Sort-Merge Optimization

Suitable for large-table-to-large-table JOIN scenarios, especially when data is already sorted by the JOIN key.

-- Large table JOIN optimization
SELECT /*+ SORTMERGEJOIN(table1, table2) */
    t1.customer_id, t2.order_amount
FROM large_customer_table t1
JOIN large_order_table t2 ON t1.customer_id = t2.customer_id;

Query Structure Optimization

Follow the "filter-join-aggregate" best practice pattern:

-- Recommended query structure
WITH filtered_facts AS (
    SELECT fact_id, dimension_id, amount, date_key
    FROM fact_table
    WHERE date_key >= '20240101'          -- Filter first
      AND amount > 1000
),
enriched_data AS (
    SELECT /*+ MAPJOIN(dim_table) */
        f.fact_id, f.amount, d.category
    FROM filtered_facts f
    JOIN dim_table d ON f.dimension_id = d.dimension_id  -- Then join
)
SELECT 
    category,
    COUNT(*) as record_count,
    SUM(amount) as total_amount                          -- Aggregate last
FROM enriched_data
GROUP BY category;

Spark User Migration Guide

Skills That Transfer Directly

Spark users can seamlessly use the following features:

-- DataFrame broadcast JOIN -> MAPJOIN hint
SELECT /*+ MAPJOIN(small_table) */
    l.order_id, s.product_name
FROM large_orders l
JOIN small_products s ON l.product_id = s.product_id;

-- Window functions are fully compatible
SELECT 
    emp_name,
    salary,
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rank
FROM employees;

-- Complex data type handling
SELECT 
    user_id,
    event_array[0] as first_event,           -- Array index starts from 0
    size(event_array) as event_count,
    explode(event_array) as individual_event -- Use directly, no LATERAL VIEW needed
FROM user_events;

Syntax Adjustment Points

1. Simplified Array Expansion Syntax

-- Spark syntax (not supported)
SELECT user_id, tag
FROM users LATERAL VIEW explode(tags) t AS tag;

-- Lakehouse syntax
SELECT user_id, explode(tags) as tag
FROM users;

2. Struct Operations Remain Consistent

-- Spark-compatible struct operations
WITH customer_profiles AS (
    SELECT 
        customer_id,
        named_struct(
            'name', customer_name,
            'contact', named_struct('email', email, 'phone', phone)
        ) as profile
    FROM customers
)
SELECT 
    customer_id,
    profile.name as customer_name,
    profile.contact.email as email
FROM customer_profiles;

Data Skew Handling Strategy

-- Salted JOIN to resolve data skew
WITH salted_large AS (
    SELECT *, 
           CONCAT(join_key, '_', ABS(HASH(join_key) % 10)) as salted_key
    FROM large_table
),
salted_small AS (
    SELECT *, 
           CONCAT(join_key, '_', sequence) as salted_key
    FROM small_table 
    CROSS JOIN (SELECT EXPLODE(SEQUENCE(0, 9)) as sequence)
)
SELECT l.data, s.info
FROM salted_large l
JOIN salted_small s ON l.salted_key = s.salted_key;

Hive User Migration Guide

Concept Continuity and Performance Upgrade

The partitioned table and batch processing concepts familiar to Hive users are preserved and enhanced in Lakehouse:

-- Partition pruning concept remains consistent
SELECT /*+ MAPJOIN(dim_table) */
    fact.order_id, dim.product_name
FROM fact_orders fact
JOIN dim_products dim ON fact.product_id = dim.product_id
WHERE fact.dt = '2024-06-01';           -- Partition column filtering

The Leap from MapReduce to Columnar Computing

-- Hive multi-stage Job -> Lakehouse unified query
WITH order_aggregation AS (
    SELECT 
        customer_id,
        SUM(amount) as total_amount,
        COUNT(*) as order_count
    FROM orders 
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
)
SELECT /*+ MAPJOIN(customers) */
    c.customer_name,
    a.total_amount,
    a.order_count
FROM order_aggregation a
JOIN customers c ON a.customer_id = c.customer_id
WHERE a.total_amount > 10000;

MAPJOIN Feature Enhancement

-- Hive manual control -> Lakehouse intelligent optimization
SELECT /*+ MAPJOIN(employees, departments) */  -- Supports multi-table broadcast
    o.order_id, e.emp_name, d.dept_name
FROM orders o
JOIN employees e ON o.emp_id = e.emp_id
JOIN departments d ON e.dept_id = d.dept_id;

MaxCompute User Migration Guide

Syntax Compatibility

MaxCompute users can directly use familiar syntax:

-- MAPJOIN syntax is directly compatible
SELECT /*+ MAPJOIN(small_table) */
    large.order_id, small.product_name
FROM large_orders large
JOIN small_products small ON large.product_id = small.product_id;

-- Partition query approach remains the same
SELECT * FROM orders 
WHERE order_date = '2024-06-01'
  AND status = 'completed';

Function Migration Mapping

-- Time function correspondence
SELECT 
    order_id,
    CURRENT_TIMESTAMP() as process_time,        -- MaxCompute: GETDATE()
    DATE_FORMAT(order_date, 'yyyy-MM') as month -- Syntax compatible
FROM orders;

SEMI JOIN Optimization

-- EXISTS query -> SEMI JOIN performance optimization
SELECT a.customer_id, a.customer_name
FROM customers a
SEMI JOIN orders b ON a.customer_id = b.customer_id;

Snowflake User Migration Guide

Cloud-Native Feature Mapping

-- Auto-optimization -> Explicit optimization control
SELECT /*+ MAPJOIN(customer_dim, product_dim) */
    cd.customer_name, pd.product_name, sf.sales_amount
FROM sales_fact sf
JOIN customer_dim cd ON sf.customer_id = cd.customer_id
JOIN product_dim pd ON sf.product_id = pd.product_id
WHERE sf.sale_date >= '2024-06-01';

Resource Management Comparison

  • Snowflake WAREHOUSE -> Lakehouse VCLUSTER
  • Auto-scaling -> Elastic compute resources
  • Pay-per-use -> Usage-based billing

Historical Data Queries

-- Time Travel queries
SELECT * FROM orders 
TIMESTAMP AS OF '2024-06-01 00:00:00';

-- Query within a time range
SELECT * FROM orders 
WHERE order_date >= '2024-06-01' 
  AND order_date <= '2024-06-01 23:59:59';

Traditional Database User Migration Guide

Mindset Transformation

Architecture Upgrade from OLTP to OLAP

-- OLTP mindset: single-record queries
-- Transition to:
-- OLAP mindset: batch analytical queries

SELECT /*+ MAPJOIN(customers) */
    c.customer_segment,
    COUNT(*) as order_count,
    AVG(o.order_amount) as avg_amount,
    SUM(o.order_amount) as total_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.customer_segment;

Index Strategy Adjustment

-- Traditional indexes -> Columnar storage + Bloom filters
CREATE BLOOMFILTER INDEX idx_customer_bloom 
ON TABLE orders(customer_id);

-- Queries automatically apply optimization
SELECT /*+ MAPJOIN(customers) */
    c.customer_name, COUNT(o.order_id) as order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

Transaction Processing Transformation

-- Traditional transactions -> Batch MERGE operations
MERGE INTO customers c
USING (
    SELECT customer_id
    FROM order_summary 
    WHERE total_amount > 50000
) high_value ON c.customer_id = high_value.customer_id
WHEN MATCHED THEN UPDATE SET status = 'premium';

JOIN Conditions and Syntax Specification

Supported Condition Syntax

-- ON condition expression
SELECT * FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id;

-- USING simplified syntax
SELECT * FROM table1 t1
JOIN table2 t2 USING (id);

-- Compound conditions
SELECT * FROM orders o
JOIN employees e ON o.emp_id = e.emp_id 
                 AND o.order_date >= e.hire_date;

Syntax Limitations and Alternatives

Avoid Using Subqueries in JOIN Conditions

-- Not recommended
SELECT e.emp_name
FROM employees e
JOIN departments d ON e.dept_id = (
    SELECT dept_id FROM departments WHERE dept_name = 'Engineering'
);

-- Recommended alternative
SELECT e.emp_name
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
WHERE d.dept_name = 'Engineering';

-- Or use CTE
WITH target_dept AS (
    SELECT dept_id FROM departments WHERE dept_name = 'Engineering'
)
SELECT e.emp_name
FROM employees e
JOIN target_dept t ON e.dept_id = t.dept_id;

Data Types and NULL Value Handling

NULL Value Display Characteristics

In Singdata Lakehouse, NULL values of different data types have specific display formats:

  • String types (STRING, VARCHAR): NULL displays as empty
  • Numeric types (INT, DOUBLE, DECIMAL): NULL displays as "nan"
  • Time types (DATE, TIMESTAMP): NULL displays as "NaT"
  • Logical checks: IS NULL and IS NOT NULL work correctly on all types

Safe NULL Value Handling

-- Safe NULL value handling
SELECT 
    emp_name,
    CASE 
        WHEN salary IS NULL OR CAST(salary AS STRING) = 'nan' 
        THEN 0.0 
        ELSE salary 
    END as safe_salary,
    COALESCE(dept_name, 'Unknown Department') as safe_dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;

Complex JOIN Scenarios

Multi-Table Join Analysis

-- Typical dimensional modeling query
SELECT /*+ MAPJOIN(dim_customer, dim_product, dim_date) */
    dd.year,
    dd.quarter,
    dc.customer_tier,
    dp.product_line,
    COUNT(*) as transaction_count,
    SUM(ft.amount) as total_revenue,
    AVG(ft.amount) as avg_transaction_value
FROM fact_transactions ft
JOIN dim_customer dc ON ft.customer_id = dc.customer_id
JOIN dim_product dp ON ft.product_id = dp.product_id  
JOIN dim_date dd ON ft.date_id = dd.date_id
WHERE dd.year >= 2024
GROUP BY dd.year, dd.quarter, dc.customer_tier, dp.product_line;

Hierarchical Data Processing

-- Multi-level org structure query
WITH employee_hierarchy AS (
    -- Level 1: Senior management
    SELECT employee_id, employee_name, 1 as level
    FROM employees WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Level 2: Middle management
    SELECT e.employee_id, e.employee_name, 2 as level
    FROM employees e
    JOIN employees m ON e.manager_id = m.employee_id
    WHERE m.manager_id IS NULL
    
    UNION ALL
    
    -- Level 3: Front-line employees
    SELECT e.employee_id, e.employee_name, 3 as level
    FROM employees e
    JOIN employees m1 ON e.manager_id = m1.employee_id
    JOIN employees m2 ON m1.manager_id = m2.employee_id
    WHERE m2.manager_id IS NULL
)
SELECT level, COUNT(*) as employee_count
FROM employee_hierarchy
GROUP BY level
ORDER BY level;

Time Series JOIN

-- Time window join analysis
SELECT o.order_id, l.line_item_id, o.order_date, l.ship_date
FROM orders o
JOIN line_items l ON o.order_id = l.order_id
                  AND o.order_date BETWEEN l.ship_date - INTERVAL '5' DAY 
                                       AND l.ship_date + INTERVAL '5' DAY
WHERE o.order_date >= '2024-01-01';

Performance Tuning Practice

Query Optimization Checklist

1. JOIN Strategy Selection

-- Small table x Large table: MAPJOIN preferred
SELECT /*+ MAPJOIN(dim_table) */
    fact.*, dim.dimension_name
FROM fact_table fact
JOIN dim_table dim ON fact.dim_id = dim.dim_id;

-- Large table x Large table: SORTMERGEJOIN
SELECT /*+ SORTMERGEJOIN(table1, table2) */
    t1.*, t2.*
FROM large_table1 t1
JOIN large_table2 t2 ON t1.join_key = t2.join_key;

2. Predicate Pushdown Optimization

-- Place filter conditions before the JOIN
WITH filtered_base AS (
    SELECT customer_id, order_id, amount
    FROM orders
    WHERE order_date >= '2024-01-01'      -- Filter first
      AND status = 'completed'
)
SELECT /*+ MAPJOIN(customers) */
    c.customer_name, f.amount
FROM filtered_base f
JOIN customers c ON f.customer_id = c.customer_id;

3. Column Pruning Strategy

-- Explicitly specify required columns
SELECT customer_id, customer_name, order_count
FROM customer_summary
WHERE registration_date >= '2024-01-01';

-- Use EXCEPT to exclude unnecessary columns
SELECT * EXCEPT(internal_notes, created_by, updated_by)
FROM customer_details;

Pagination Query Strategy

Cursor-Based Efficient Pagination

-- Recommended pagination approach
SELECT customer_id, order_id, order_date, amount
FROM orders 
WHERE customer_id > :last_customer_id  -- Cursor position
   OR (customer_id = :last_customer_id AND order_id > :last_order_id)
ORDER BY customer_id, order_id
LIMIT 1000;

Range Partition Pagination

-- Business-logic-based partition processing
SELECT * FROM orders 
WHERE order_date >= '2024-06-01'
  AND order_date < '2024-06-02'  -- Process in daily batches
ORDER BY order_id;

Best Practices Summary

General Optimization Principles

  1. Prioritize Small Table Broadcast: Prefer MAPJOIN for dimension-to-fact table joins
  2. Push Filters Upstream: Complete data filtering before the JOIN
  3. Choose the Right JOIN Type: Select the most appropriate JOIN type based on business requirements
  4. Avoid Deep Pagination: Use cursor-based pagination instead of OFFSET pagination
  5. Monitor Execution Plans: Pay attention to query performance and resource usage

Migration Success Factors

Technology BackgroundMigration FocusExpected Benefits
SparkMAPJOIN hint syntax adaptationMore refined JOIN optimization control
HiveColumnar storage mindset shiftSignificant query performance improvement
MaxComputeFunction syntax mappingReal-time interactive experience
SnowflakeExplicit optimization controlMore flexible performance tuning
Traditional DBEstablish OLAP thinkingLeap in big data processing capability

Performance Optimization Results

By properly applying the optimization strategies in this guide:

  • Query Performance: 10-100x improvement over traditional databases
  • Resource Efficiency: Columnar storage reduces I/O overhead by 60-80%
  • Development Efficiency: Unified SQL interface reduces learning costs
  • Operational Complexity: Cloud-native architecture simplifies management

Summary

Singdata Lakehouse's JOIN functionality provides enterprises with powerful and flexible data association capabilities. By following the migration strategies and best practices in this guide, users from different technology backgrounds can quickly unlock the system's maximum value, achieving a successful transition from traditional data processing to modern big data analytics.

Key Advantages

  • Complete Compatibility: Supports standard SQL JOIN syntax and semantics
  • Performance Optimization: Advanced optimization techniques such as MAPJOIN and SORTMERGEJOIN
  • Smooth Migration: Professional migration paths tailored for different technology backgrounds
  • Enterprise-Grade Features: Production-ready features including NULL value handling and type conversion

Get Started Now

Choose the migration guide that fits your technology background and begin your journey exploring Singdata Lakehouse's JOIN functionality. By putting the recommendations in this guide into practice, you will be able to build efficient, reliable data analytics solutions that meet various enterprise data processing needs.


Note: This document is compiled based on the Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, always verify the correctness and performance impact of all operations in a test environment.