Automatic Small File Compaction Feature
During the data writing process, Lakehouse offers an automatic small file compaction feature designed to enhance storage efficiency and query performance. This feature is particularly effective for handling a large number of small files generated by high-frequency write operations, as these small files can cause frequent I/O operations during queries, thereby affecting overall performance. It is important to note that enabling this feature may slow down write speeds, as the compaction process requires additional time to execute.
Advantages of the Feature:
- Reduced I/O Operations: Compacting small files can reduce the number of I/O operations during queries, thereby accelerating data access speeds.
- Automatic Triggering: The compaction process is automatically executed during data insertion, without requiring any additional manual operations from the user.
Notes
- This command can only be executed on a General Purpose Virtual Cluster. It will not take effect on an analytical cluster.
Triggering Small File Compaction During DML Statements
To trigger small file compaction together with DML statements, add the following setting:
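The actual setting name is not given in the source and varies by Lakehouse version, so the sketch below uses a placeholder to show the pattern: enable the setting in the session, then run the DML statement as usual.

```sql
-- Placeholder name, not the literal parameter: substitute the compaction
-- setting documented for your Lakehouse version.
SET <compaction_setting_name> = true;

-- The subsequent DML statement triggers small file compaction as part of its commit.
INSERT INTO target_table VALUES (1, 'a');
```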
Use Cases:
- High-Frequency Writing: In scenarios where data needs to be written frequently, such as dynamic table refreshes, it is recommended to enable this feature. High-frequency writing often generates a large number of small files, and automatic compaction can effectively manage these files.
Example Usage:
Suppose you have a table named `test_sql_compaction` for storing partitioned data.
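The source does not define the table, so the definition below is an assumption for illustration: two data columns plus a string partition column `pt` (the exact partition syntax depends on your SQL dialect).

```sql
-- Hypothetical definition: id and name as data columns, pt as the partition column.
CREATE TABLE test_sql_compaction (
    id   INT,
    name STRING,
    pt   STRING
) PARTITIONED BY (pt);
```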
Before enabling the automatic compaction feature, each insert operation may generate multiple small files. For example, inserting 10 records:
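A sketch of such an insert, assuming the hypothetical columns `id`, `name`, and partition column `pt`:

```sql
-- Small batches written at high frequency can each land in their own file.
INSERT INTO test_sql_compaction VALUES
    (1, 'a', '2023-01-01'),
    (2, 'b', '2023-01-01'),
    -- ... records 3 through 9 ...
    (10, 'j', '2023-01-01');
```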
After insertion, you can check the number of generated files:
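The exact metadata command is product-specific; a hypothetical check might look like:

```sql
-- Hypothetical command; consult your Lakehouse documentation for the
-- actual file-listing syntax (e.g. a SHOW FILES or metadata query).
SHOW FILES IN test_sql_compaction;
```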
Insert data again with compaction enabled:
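With the compaction setting enabled (shown with a placeholder name, as the source omits it), the next insert both writes the new rows and merges the existing small files:

```sql
-- Placeholder setting name; substitute the documented parameter.
SET <compaction_setting_name> = true;

INSERT INTO test_sql_compaction VALUES (11, 'k', '2023-01-01');
```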
After compaction, check the file count again and you will find that the small files have been merged:
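Re-running the same hypothetical file listing should now report fewer, larger files:

```sql
-- Hypothetical command, as before; expect a smaller file count.
SHOW FILES IN test_sql_compaction;
```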
Using the OPTIMIZE Command to Trigger Small File Compaction
Overview
`OPTIMIZE` is an asynchronous file optimization command provided by Lakehouse, used to merge small files, optimize storage layout, and enhance query performance. This command only supports execution on a GP (General Purpose) compute cluster. After execution, it immediately returns a task ID, while the actual optimization operation is executed asynchronously in the background.
Syntax
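The syntax block is missing from the source; reconstructed from the parameter description that follows, it has roughly this shape (treat it as a sketch, not the authoritative grammar):

```sql
OPTIMIZE [schema_name.]table_name
WHERE <partition_predicate>
[OPTIONS (<reserved_parameters>)];
```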
Parameter Description
- table_name (Required)
  - The name of the target table to be optimized, in the format `[schema_name.]table_name`.
- WHERE predicate (Required)
  - Partition filter condition, which must include a complete partition column match condition.
  - Supported formats: `partition_column = 'value'`, or composite partitions such as `dt='2023-01-01' AND region='us'`.
- OPTIONS: Reserved parameters for Lakehouse.
Example Usage:
Suppose you have a table named `test_sql_compaction` for storing partitioned data.
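The source does not define the table, so the definition below is a hypothetical sketch: two data columns plus a string partition column `pt` (partition syntax varies by dialect).

```sql
-- Hypothetical definition for the walkthrough below.
CREATE TABLE test_sql_compaction (
    id   INT,
    name STRING,
    pt   STRING
) PARTITIONED BY (pt);
```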
Before enabling the automatic compaction feature, each insert operation may generate multiple small files. For example, inserting 10 records:
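A sketch of such an insert, assuming the hypothetical columns `id`, `name`, and partition column `pt`:

```sql
-- Small batches written at high frequency can each land in their own file.
INSERT INTO test_sql_compaction VALUES
    (1, 'a', '2023-01-01'),
    (2, 'b', '2023-01-01'),
    -- ... records 3 through 9 ...
    (10, 'j', '2023-01-01');
```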
After insertion, you can check the number of generated files:
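As before, the exact metadata command is product-specific; a hypothetical check might look like:

```sql
-- Hypothetical command; consult your Lakehouse documentation for the
-- actual file-listing syntax.
SHOW FILES IN test_sql_compaction;
```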
Execute the small file compaction command:
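Per the parameter description above, `OPTIMIZE` takes the table name and a complete partition predicate. Assuming the hypothetical partition column `pt`:

```sql
-- pt is a hypothetical partition column; the command returns a task ID
-- immediately and the merge runs asynchronously in the background.
OPTIMIZE test_sql_compaction WHERE pt = '2023-01-01';
```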
After compaction, check the file count again and you will find that the small files have been merged:
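Once the asynchronous task completes, re-running the hypothetical file listing should report fewer, larger files:

```sql
-- Hypothetical command, as before; expect a smaller file count.
SHOW FILES IN test_sql_compaction;
```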