Lakehouse Data Update and Cleanup Guide

Overview

In data warehouse operations, it is often necessary to correct erroneous data, clean up expired records, or reset table data. Singdata Lakehouse provides three data modification methods: UPDATE, DELETE, and TRUNCATE, each suitable for different scenarios. This guide categorizes usage by business scenario to help you quickly master safe and efficient data update and cleanup methods.


SQL Commands Covered

CommandPurposeApplicable Scenario
UPDATE ... SET ... WHEREUpdate rows by conditionFix erroneous data, status changes
UPDATE ... ORDER BY ... LIMITBatch updateSafe updates for large data volumes
DELETE FROM ... WHEREDelete rows by conditionClean expired or invalid data
TRUNCATE TABLEClear entire tableQuickly reset table data

Prerequisites

The following examples use a simulated employee table employees_update:

-- Create test table
CREATE TABLE IF NOT EXISTS employees_update (
    emp_id INT,
    emp_name STRING,
    dept STRING,
    salary DOUBLE,
    status STRING
);

-- Insert test data
INSERT INTO employees_update VALUES
(1, 'Alice', 'Engineering', 12000, 'active'),
(2, 'Bob', 'Engineering', 9500, 'active'),
(3, 'Carol', 'Marketing', 8500, 'active'),
(4, 'David', 'Marketing', 7800, 'inactive'),
(5, 'Eve', 'HR', 6000, 'active');

Conditional Data Update

Use the UPDATE statement to modify rows that meet conditions. Always use the WHERE clause to limit the update scope.

-- Give Engineering department employees a 10% raise
UPDATE employees_update 
SET salary = salary * 1.1
WHERE dept = 'Engineering';

Verify Result:

SELECT emp_id, emp_name, dept, salary FROM employees_update ORDER BY emp_id;
emp_idemp_namedeptsalary
1AliceEngineering13200
2BobEngineering10450
3CarolMarketing8500
4DavidMarketing7800
5EveHR6000

Batch Update

When the number of rows to update is very large, use ORDER BY + LIMIT for batch updates to avoid long locks.

-- Update 2 rows at a time, ordered by emp_id for consistency
UPDATE employees_update 
SET status = 'reviewed'
WHERE status = 'active'
ORDER BY emp_id
LIMIT 2;

Verify Result:

emp_idemp_namestatus
1Alicereviewed
2Bobreviewed
3Carolactive
5Eveactive

Conditional Data Deletion

Use the DELETE statement to delete rows that meet conditions. Again, use the WHERE clause to limit the deletion scope.

-- Delete employees with inactive status
DELETE FROM employees_update 
WHERE status = 'inactive';

Verify Result:

SELECT * FROM employees_update ORDER BY emp_id;
emp_idemp_namedeptsalarystatus
1AliceEngineering13200reviewed
2BobEngineering10450reviewed
3CarolMarketing8500active
5EveHR6000active

Truncate Table Data

Use TRUNCATE TABLE to quickly clear all table data while preserving the table structure. More efficient than DELETE.

-- Clear table data
TRUNCATE TABLE employees_update;

Verify Result:

SELECT COUNT(*) FROM employees_update;
COUNT(*)
0

When updating a table based on data from another table, use subqueries or MERGE INTO.

-- Re-insert data for demonstration
INSERT INTO employees_update VALUES
(1, 'Alice', 'Engineering', 13200, 'reviewed'),
(2, 'Bob', 'Engineering', 10450, 'reviewed');

-- Create salary adjustment table
CREATE TABLE IF NOT EXISTS salary_adjustments (
    emp_id INT,
    adjust_amount DOUBLE
);

INSERT INTO salary_adjustments VALUES
(1, 500),
(2, 300);

-- Use subquery to update salaries
UPDATE employees_update 
SET salary = salary + (
    SELECT adjust_amount 
    FROM salary_adjustments 
    WHERE salary_adjustments.emp_id = employees_update.emp_id
)
WHERE emp_id IN (SELECT emp_id FROM salary_adjustments);

Verify Result:

emp_idemp_namesalary
1Alice13700
2Bob10750

Clean Up Test Data

After completing update and cleanup verification, it is recommended to clean up test tables:

-- Drop test tables
DROP TABLE IF EXISTS employees_update;
DROP TABLE IF EXISTS salary_adjustments;

Important Notes

  1. WHERE Clause: Always use WHERE with UPDATE and DELETE to avoid accidental full-table operations.
  2. Transactionality: A single UPDATE or DELETE is an atomic operation -- either all rows succeed or all fail.
  3. TRUNCATE Is Irrecoverable: TRUNCATE operations do not retain historical versions and cannot be recovered through Time Travel.
  4. Dynamic Table Limitation: Dynamic Tables do not support direct UPDATE or DELETE; data is refreshed automatically by upstream table changes.
  5. Batch Operations: For large-volume updates/deletes, use ORDER BY + LIMIT to execute in batches to avoid prolonged resource usage.