Lakehouse Data Deduplication Guide
Overview
In data warehouse construction, data deduplication is one of the most common requirements. Singdata Lakehouse provides multiple deduplication approaches, from simple DISTINCT to complex window function-based deduplication, meeting the needs of various business scenarios. This guide categorizes usage by scenario to help you quickly select and implement the optimal deduplication approach.
Quick Navigation
- Completely Duplicate Row Deduplication -- Use
DISTINCTfor quick deduplication - Keep the Latest Record -- Use
ROW_NUMBER()+QUALIFYto keep the latest data - Multi-Field Combination Deduplication -- Use
GROUP BY+ aggregate functions - Physical Deduplication (Rewrite Table) -- Write deduplicated results to a new table or overwrite the original
SQL Commands Covered
| Command/Function | Purpose | Applicable Scenario |
|---|---|---|
SELECT DISTINCT | Remove completely duplicate rows | Duplicate data where all fields are identical |
ROW_NUMBER() OVER (...) | Generate unique sequence numbers for each group | Need to keep specific rows by time/priority |
QUALIFY | Directly filter window function results | Substitute for subqueries, simplifies SQL |
GROUP BY + MAX/MIN/ANY_VALUE | Group aggregation deduplication | Only need to keep some fields or any one row |
INSERT OVERWRITE | Overwrite with deduplicated data | Physically clean duplicate data |
Prerequisites
The following examples use a simulated order table orders_with_dupes containing duplicate data:
Completely Duplicate Row Deduplication
When all fields in a row are exactly the same, using DISTINCT is the simplest and most efficient approach.
Result:
| order_id | customer_id | amount | order_time |
|---|---|---|---|
| O001 | 101 | 100 | 2024-06-01 10:00:00 |
| O002 | 102 | 200 | 2024-06-01 11:00:00 |
| O002 | 102 | 200 | 2024-06-01 11:05:00 |
| O003 | 103 | 300 | 2024-06-01 12:00:00 |
Keep the Latest Record
In business scenarios, it is often necessary to keep the latest record based on a timestamp (e.g., duplicate log reports, order statuses updated multiple times). Lakehouse recommends using ROW_NUMBER() with the QUALIFY clause for cleaner syntax.
Result:
| order_id | customer_id | amount | order_time |
|---|---|---|---|
| O001 | 101 | 100 | 2024-06-01 10:00:00 |
| O002 | 102 | 200 | 2024-06-01 11:05:00 |
| O003 | 103 | 300 | 2024-06-01 12:00:00 |
Syntax Advantages
Using QUALIFY avoids nested subqueries, making SQL more readable:
Multi-Field Combination Deduplication
When duplicate data differs only in some fields (e.g., different timestamps) and you only need to keep key fields, use GROUP BY with aggregate functions.
Applicable Scenarios:
- Only need dimensional statistics, not concerned with which specific record.
- Data volume is very large and window function performance is worse than aggregate functions.
Physical Deduplication (Rewrite Table)
Query-level deduplication is only logical. To physically clean duplicate data, write deduplicated results back to the original table or a new table.
Option 1: Write to a New Table (Recommended)
Option 2: Overwrite Original Table
Clean Up Test Data
After completing deduplication verification, it is recommended to clean up test tables to free storage space:
Important Notes
-
Performance Optimization:
- For large table deduplication, first use
WHEREto filter out obviously invalid data. - If the table is partitioned, deduplicating by partition can significantly reduce computation.
- For large table deduplication, first use
-
NULL Value Handling:
DISTINCTtreats multipleNULLvalues as the same, keeping only one row.- When sorting with
ROW_NUMBER(),NULLvalues default to last (inDESC) or first (inASC), controllable viaNULLS FIRST/LAST.
-
Dynamic Table Deduplication:
- If deduplication logic is used for a Dynamic Table, prefer the
GROUP BYapproach, as incremental computation support is more complete. QUALIFYis fully supported in Dynamic Tables, but complex window functions may cause the refresh mode to switch to full.
- If deduplication logic is used for a Dynamic Table, prefer the
