Lakehouse Small File Compaction Optimization Guide (OPTIMIZE)

Overview

In data warehouses, frequent small-batch writes (such as real-time CDC, streaming ingestion) can cause numerous small files to accumulate at the table storage layer, severely impacting query performance. Singdata Lakehouse provides the OPTIMIZE command, which automatically merges small files into larger files, reducing I/O overhead and improving query speed. This guide is organized by business scenario to help you quickly master small file compaction methods.


SQL Commands Covered

CommandPurposeUse Case
OPTIMIZE table_nameFull table small file compactionRegular maintenance to improve overall query performance
OPTIMIZE table_name WHERE ...Partition-level compactionCompact specific partitions only, saving resources
SET cz.sql.compaction.after.commit = trueEnable auto-compactionAutomatic file health maintenance for real-time write scenarios

Prerequisites

The following examples use a simulated log table logs_optimize:

-- Create test table
CREATE TABLE IF NOT EXISTS logs_optimize (
    log_id INT,
    message STRING,
    log_date DATE
);

-- Simulate multiple small-batch writes (producing small files)
INSERT INTO logs_optimize VALUES (1, 'Log A', '2024-06-01');
INSERT INTO logs_optimize VALUES (2, 'Log B', '2024-06-01');
INSERT INTO logs_optimize VALUES (3, 'Log C', '2024-06-02');

Full Table Small File Compaction

Use the OPTIMIZE command to compact small files across the entire table.

-- Execute full table optimization
OPTIMIZE logs_optimize;

Execution Notes:

  • The system scans all data files in the table and merges files smaller than the threshold into larger files.
  • If the table already has few files or is in a healthy state, it may return No compaction job generated, indicating no compaction is needed.
  • The compaction process does not affect ongoing queries; old and new files are smoothly switched via MVCC.

Partition-level Compaction

If the table is partitioned by date, you can compact only specific partitions to avoid the resource consumption of a full table scan.

-- Compact only the 2024-06-01 partition (requires table to be partitioned by log_date)
OPTIMIZE logs_optimize WHERE log_date = '2024-06-01';

View Compaction Results

After compaction completes, you can view optimization results via DESC HISTORY or file statistics.

-- View table history (including OPTIMIZE operation records)
DESC HISTORY logs_optimize;

Returned Information:

  • operation: Operation type (OPTIMIZE)
  • total_rows: Total row count (unchanged)
  • total_bytes: Storage size (typically reduced due to compression)

Auto-compaction Configuration

For real-time write scenarios, manual OPTIMIZE execution may not be timely enough. You can enable auto-compaction.

-- Enable session-level auto-compaction
SET cz.sql.compaction.after.commit = true;

Effect:

  • After each INSERT or DELETE commit, the system automatically triggers background small file compaction.
  • Suitable for streaming data ingestion scenarios (such as Kafka Pipe, CDC synchronization).

Clean Up Test Data

After completing optimization verification, it is recommended to clean up the test table:

-- Drop test table
DROP TABLE IF EXISTS logs_optimize;

Notes

  1. Execution Timing: It is recommended to execute OPTIMIZE during off-peak hours to avoid competing for compute resources with high-priority queries.
  2. Frequency Control: Daily execution is unnecessary. Usually, weekly or monthly execution is sufficient to maintain file health.
  3. Storage Changes: After compaction, the number of files decreases, but total data volume may vary slightly due to compression algorithms.
  4. Dynamic Tables: The refresh process of Dynamic Tables automatically handles underlying files; manual OPTIMIZE is usually unnecessary.
  5. Resource Consumption: OPTIMIZE is a compute-intensive operation that consumes VCluster CPU and memory resources.