Lakehouse Data Comparison and Merging Guide (Set Operations)

Overview

In data analysis, it is often necessary to merge, compare, or find differences between multiple query results. Singdata Lakehouse provides complete set operation support, including UNION (merge), INTERSECT (intersection), and EXCEPT (difference). This guide categorizes usage by business scenario to help you quickly master efficient data comparison and merging methods.


SQL Commands Covered

CommandPurposeApplicable Scenario
UNION ALLMerge result sets (preserve duplicates)Quickly merge multiple tables or partition data
UNIONMerge result sets (deduplicate)Need unique records after merging
INTERSECTReturn the intersection of two result setsFind common users, common orders, etc.
EXCEPTReturn data in first result set but not in secondFind new, churned, or different data

Prerequisites

The following examples use two simulated sales tables sales_2023 and sales_2024:

-- Create 2023 sales table
CREATE TABLE IF NOT EXISTS sales_2023 (
    customer_id INT,
    product STRING,
    amount DOUBLE
);

-- Create 2024 sales table
CREATE TABLE IF NOT EXISTS sales_2024 (
    customer_id INT,
    product STRING,
    amount DOUBLE
);

-- Insert 2023 data
INSERT INTO sales_2023 VALUES
(1, 'Phone', 5000),
(2, 'Laptop', 8000),
(3, 'Tablet', 3000);

-- Insert 2024 data
INSERT INTO sales_2024 VALUES
(2, 'Laptop', 8000),
(3, 'Tablet', 3000),
(4, 'Watch', 2000);

Merge Data

Use UNION ALL to merge multiple query results into one result set, preserving all rows (including duplicates). Best performance.

-- Merge two years of sales data
SELECT customer_id, product, amount FROM sales_2023
UNION ALL
SELECT customer_id, product, amount FROM sales_2024
ORDER BY customer_id;

Result:

customer_idproductamount
1Phone5000
2Laptop8000
2Laptop8000
3Tablet3000
3Tablet3000
4Watch2000

Deduplicated Merge

Use UNION to merge result sets and automatically remove duplicate rows.

-- Merge two years of sales data and deduplicate
SELECT customer_id, product, amount FROM sales_2023
UNION
SELECT customer_id, product, amount FROM sales_2024
ORDER BY customer_id;

Result:

customer_idproductamount
1Phone5000
2Laptop8000
3Tablet3000
4Watch2000

Find Common Data

Use INTERSECT to return records that exist in both result sets (intersection).

-- Find customers and products present in both years
SELECT customer_id, product, amount FROM sales_2023
INTERSECT
SELECT customer_id, product, amount FROM sales_2024
ORDER BY customer_id;

Result:

customer_idproductamount
2Laptop8000
3Tablet3000

Find Different Data

Use EXCEPT to return records present in the first result set but not in the second (difference).

-- Find customers in 2023 but not in 2024 (churned customers)
SELECT customer_id, product FROM sales_2023
EXCEPT
SELECT customer_id, product FROM sales_2024
ORDER BY customer_id;

Result:

customer_idproduct
1Phone
-- Find newly added customers in 2024
SELECT customer_id, product FROM sales_2024
EXCEPT
SELECT customer_id, product FROM sales_2023
ORDER BY customer_id;

Result:

customer_idproduct
4Watch

Multi-Table Comparison

Combine set operations for more complex data comparisons.

-- Find customers appearing in only one year (symmetric difference)
(SELECT customer_id FROM sales_2023
 EXCEPT
 SELECT customer_id FROM sales_2024)
UNION ALL
(SELECT customer_id FROM sales_2024
 EXCEPT
 SELECT customer_id FROM sales_2023)
ORDER BY customer_id;

Result:

customer_id
1
4

Clean Up Test Data

After completing set operation verification, it is recommended to clean up test tables:

-- Drop test tables
DROP TABLE IF EXISTS sales_2023;
DROP TABLE IF EXISTS sales_2024;

Important Notes

  1. Column Count and Type Matching: Set operations require both queries to have the same number of columns, with compatible data types for corresponding columns.
  2. Column Names from First Query: The result set's column names use those from the first SELECT.
  3. ORDER BY Position: ORDER BY can only be placed after the last query and applies to the entire result set.
  4. Performance Differences: UNION ALL has the best performance (no deduplication). UNION and INTERSECT require deduplication. EXCEPT requires hash matching. For large data volumes, prefer UNION ALL + post-processing.