Lakehouse Data Comparison and Merging Guide (Set Operations)
Overview
In data analysis, it is often necessary to merge, compare, or find differences between multiple query results. Singdata Lakehouse provides complete set operation support, including UNION (merge), INTERSECT (intersection), and EXCEPT (difference). This guide categorizes usage by business scenario to help you quickly master efficient data comparison and merging methods.
Quick Navigation
- Merge Data -- Use UNION ALL to merge multiple query results
- Deduplicated Merge -- Use UNION to merge and remove duplicate rows
- Find Common Data -- Use INTERSECT to find the intersection
- Find Different Data -- Use EXCEPT to find the difference
- Multi-Table Comparison -- Combine set operations for complex comparisons
SQL Commands Covered
| Command | Purpose | Applicable Scenario |
|---|---|---|
UNION ALL | Merge result sets (preserve duplicates) | Quickly merge multiple tables or partition data |
UNION | Merge result sets (deduplicate) | Need unique records after merging |
INTERSECT | Return the intersection of two result sets | Find common users, common orders, etc. |
EXCEPT | Return data in first result set but not in second | Find new, churned, or different data |
Prerequisites
The following examples use two simulated sales tables sales_2023 and sales_2024:
Merge Data
Use UNION ALL to merge multiple query results into one result set, preserving all rows (including duplicates). Best performance.
Result:
| customer_id | product | amount |
|---|---|---|
| 1 | Phone | 5000 |
| 2 | Laptop | 8000 |
| 2 | Laptop | 8000 |
| 3 | Tablet | 3000 |
| 3 | Tablet | 3000 |
| 4 | Watch | 2000 |
Deduplicated Merge
Use UNION to merge result sets and automatically remove duplicate rows.
Result:
| customer_id | product | amount |
|---|---|---|
| 1 | Phone | 5000 |
| 2 | Laptop | 8000 |
| 3 | Tablet | 3000 |
| 4 | Watch | 2000 |
Find Common Data
Use INTERSECT to return records that exist in both result sets (intersection).
Result:
| customer_id | product | amount |
|---|---|---|
| 2 | Laptop | 8000 |
| 3 | Tablet | 3000 |
Find Different Data
Use EXCEPT to return records present in the first result set but not in the second (difference).
Result:
| customer_id | product |
|---|---|
| 1 | Phone |
Result:
| customer_id | product |
|---|---|
| 4 | Watch |
Multi-Table Comparison
Combine set operations for more complex data comparisons.
Result:
| customer_id |
|---|
| 1 |
| 4 |
Clean Up Test Data
After completing set operation verification, it is recommended to clean up test tables:
Important Notes
- Column Count and Type Matching: Set operations require both queries to have the same number of columns, with compatible data types for corresponding columns.
- Column Names from First Query: The result set's column names use those from the first
SELECT. - ORDER BY Position:
ORDER BYcan only be placed after the last query and applies to the entire result set. - Performance Differences:
UNION ALLhas the best performance (no deduplication).UNIONandINTERSECTrequire deduplication.EXCEPTrequires hash matching. For large data volumes, preferUNION ALL+ post-processing.
