Lakehouse Small File Compaction Optimization Guide (OPTIMIZE)
Overview
In data warehouses, frequent small-batch writes (such as real-time CDC, streaming ingestion) can cause numerous small files to accumulate at the table storage layer, severely impacting query performance. Singdata Lakehouse provides the OPTIMIZE command, which automatically merges small files into larger files, reducing I/O overhead and improving query speed. This guide is organized by business scenario to help you quickly master small file compaction methods.
Quick Navigation
- Full Table Small File Compaction -- Merge small files across all partitions
- Partition-level Compaction -- Merge small files in specific partitions only
- View Compaction Results -- Compare file counts before and after compaction
- Auto-compaction Configuration -- Enable automatic compaction after commit
SQL Commands Covered
| Command | Purpose | Use Case |
|---|---|---|
OPTIMIZE table_name | Full table small file compaction | Regular maintenance to improve overall query performance |
OPTIMIZE table_name WHERE ... | Partition-level compaction | Compact specific partitions only, saving resources |
SET cz.sql.compaction.after.commit = true | Enable auto-compaction | Automatic file health maintenance for real-time write scenarios |
Prerequisites
The following examples use a simulated log table logs_optimize:
Full Table Small File Compaction
Use the OPTIMIZE command to compact small files across the entire table.
Execution Notes:
- The system scans all data files in the table and merges files smaller than the threshold into larger files.
- If the table already has few files or is in a healthy state, it may return
No compaction job generated, indicating no compaction is needed. - The compaction process does not affect ongoing queries; old and new files are smoothly switched via MVCC.
Partition-level Compaction
If the table is partitioned by date, you can compact only specific partitions to avoid the resource consumption of a full table scan.
View Compaction Results
After compaction completes, you can view optimization results via DESC HISTORY or file statistics.
Returned Information:
operation: Operation type (OPTIMIZE)total_rows: Total row count (unchanged)total_bytes: Storage size (typically reduced due to compression)
Auto-compaction Configuration
For real-time write scenarios, manual OPTIMIZE execution may not be timely enough. You can enable auto-compaction.
Effect:
- After each
INSERTorDELETEcommit, the system automatically triggers background small file compaction. - Suitable for streaming data ingestion scenarios (such as Kafka Pipe, CDC synchronization).
Clean Up Test Data
After completing optimization verification, it is recommended to clean up the test table:
Notes
- Execution Timing: It is recommended to execute
OPTIMIZEduring off-peak hours to avoid competing for compute resources with high-priority queries. - Frequency Control: Daily execution is unnecessary. Usually, weekly or monthly execution is sufficient to maintain file health.
- Storage Changes: After compaction, the number of files decreases, but total data volume may vary slightly due to compression algorithms.
- Dynamic Tables: The refresh process of Dynamic Tables automatically handles underlying files; manual
OPTIMIZEis usually unnecessary. - Resource Consumption:
OPTIMIZEis a compute-intensive operation that consumes VCluster CPU and memory resources.
