Lakehouse Data Deduplication Guide

Overview

In data warehouse construction, data deduplication is one of the most common requirements. Singdata Lakehouse provides multiple deduplication approaches, from simple DISTINCT to complex window function-based deduplication, meeting the needs of various business scenarios. This guide categorizes usage by scenario to help you quickly select and implement the optimal deduplication approach.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
SELECT DISTINCTRemove completely duplicate rowsDuplicate data where all fields are identical
ROW_NUMBER() OVER (...)Generate unique sequence numbers for each groupNeed to keep specific rows by time/priority
QUALIFYDirectly filter window function resultsSubstitute for subqueries, simplifies SQL
GROUP BY + MAX/MIN/ANY_VALUEGroup aggregation deduplicationOnly need to keep some fields or any one row
INSERT OVERWRITEOverwrite with deduplicated dataPhysically clean duplicate data

Prerequisites

The following examples use a simulated order table orders_with_dupes containing duplicate data:

-- Create test table
CREATE TABLE IF NOT EXISTS orders_with_dupes (
    order_id STRING,
    customer_id INT,
    amount DOUBLE,
    order_time TIMESTAMP
);

-- Insert test data (including duplicates)
INSERT INTO orders_with_dupes VALUES
('O001', 101, 100.0, '2024-06-01 10:00:00'),
('O001', 101, 100.0, '2024-06-01 10:00:00'),  -- Completely duplicate
('O002', 102, 200.0, '2024-06-01 11:00:00'),
('O002', 102, 200.0, '2024-06-01 11:05:00'),  -- Same order ID, different time
('O003', 103, 300.0, '2024-06-01 12:00:00');

Completely Duplicate Row Deduplication

When all fields in a row are exactly the same, using DISTINCT is the simplest and most efficient approach.

-- Query deduplicated results
SELECT DISTINCT *
FROM orders_with_dupes;

Result:

order_idcustomer_idamountorder_time
O0011011002024-06-01 10:00:00
O0021022002024-06-01 11:00:00
O0021022002024-06-01 11:05:00
O0031033002024-06-01 12:00:00

Keep the Latest Record

In business scenarios, it is often necessary to keep the latest record based on a timestamp (e.g., duplicate log reports, order statuses updated multiple times). Lakehouse recommends using ROW_NUMBER() with the QUALIFY clause for cleaner syntax.

-- Keep the record with the latest order_time for each order_id
SELECT 
    order_id,
    customer_id,
    amount,
    order_time
FROM orders_with_dupes
WINDOW w AS (PARTITION BY order_id ORDER BY order_time DESC)
QUALIFY ROW_NUMBER() OVER w = 1;

Result:

order_idcustomer_idamountorder_time
O0011011002024-06-01 10:00:00
O0021022002024-06-01 11:05:00
O0031033002024-06-01 12:00:00

Syntax Advantages

Using QUALIFY avoids nested subqueries, making SQL more readable:

-- Traditional approach (nested subquery)
SELECT order_id, customer_id, amount, order_time
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_time DESC) as rn
    FROM orders_with_dupes
) t
WHERE rn = 1;

-- Lakehouse recommended approach (QUALIFY)
SELECT order_id, customer_id, amount, order_time
FROM orders_with_dupes
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_time DESC) = 1;

Multi-Field Combination Deduplication

When duplicate data differs only in some fields (e.g., different timestamps) and you only need to keep key fields, use GROUP BY with aggregate functions.

-- Group by order_id, keep the max amount and latest time
SELECT 
    order_id,
    customer_id,
    MAX(amount) as max_amount,
    MAX(order_time) as latest_time
FROM orders_with_dupes
GROUP BY order_id, customer_id;

Applicable Scenarios:

  • Only need dimensional statistics, not concerned with which specific record.
  • Data volume is very large and window function performance is worse than aggregate functions.

Physical Deduplication (Rewrite Table)

Query-level deduplication is only logical. To physically clean duplicate data, write deduplicated results back to the original table or a new table.

Option 1: Write to a New Table (Recommended)

-- Create a new deduplicated table
CREATE TABLE orders_deduped AS
SELECT *
FROM orders_with_dupes
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_time DESC) = 1;

Option 2: Overwrite Original Table

-- Overwrite original table data (use with caution)
INSERT OVERWRITE TABLE orders_with_dupes
SELECT *
FROM orders_with_dupes
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_time DESC) = 1;

Clean Up Test Data

After completing deduplication verification, it is recommended to clean up test tables to free storage space:

-- Drop test tables
DROP TABLE IF EXISTS orders_with_dupes;
DROP TABLE IF EXISTS orders_deduped;

Important Notes

  1. Performance Optimization:

    • For large table deduplication, first use WHERE to filter out obviously invalid data.
    • If the table is partitioned, deduplicating by partition can significantly reduce computation.
  2. NULL Value Handling:

    • DISTINCT treats multiple NULL values as the same, keeping only one row.
    • When sorting with ROW_NUMBER(), NULL values default to last (in DESC) or first (in ASC), controllable via NULLS FIRST/LAST.
  3. Dynamic Table Deduplication:

    • If deduplication logic is used for a Dynamic Table, prefer the GROUP BY approach, as incremental computation support is more complete.
    • QUALIFY is fully supported in Dynamic Tables, but complex window functions may cause the refresh mode to switch to full.