Automatic Small File Compaction Feature

During the data writing process, Lakehouse offers an automatic small file compaction feature designed to enhance storage efficiency and query performance. This feature is particularly effective for handling a large number of small files generated by high-frequency write operations, as these small files can cause frequent I/O operations during queries, thereby affecting overall performance. It is important to note that enabling this feature may slow down write speeds, as the compaction process requires additional time to execute.

Advantages of the Feature:

  • Reduced I/O Operations: Compacting small files can reduce the number of I/O operations during queries, thereby accelerating data access speeds.
  • Automatic Triggering: The compaction process is automatically executed during data insertion, without requiring any additional manual operations from the user.

Notes

  • This command can only be executed on a General Purpose Virtual Cluster. It will not take effect on an analytical cluster.

Triggering Small File Compaction During DML Statements

To trigger small file compaction together with DML statements, add the following setting:

SET cz.sql.compaction.after.commit = true;

Use Cases:

  • High-Frequency Writing: In scenarios where data needs to be written frequently, such as dynamic table refreshes, it is recommended to enable this feature. High-frequency writing often generates a large number of small files, and automatic compaction can effectively manage these files.

Example Usage:

Suppose you have a table named test_sql_compaction for storing partitioned data.

CREATE TABLE test_sql_compaction (
    id BIGINT,
    name STRING
) PARTITIONED BY (ds STRING);

Before enabling the automatic compaction feature, each insert operation may generate multiple small files. For example, inserting 10 records:

-- Insert 10 records
INSERT INTO test_sql_compaction PARTITION (ds ='1') VALUES(1, 'a');
-- Repeat this INSERT operation 10 times

After insertion, you can check the number of generated files:

SHOW PARTITIONS EXTENDED test_sql_compaction;

Insert data again with compaction enabled:

SET cz.sql.compaction.after.commit = true;
INSERT INTO test_sql_compaction PARTITION (ds ='1') VALUES(1, 'a');

After compaction, check the file count again and you will find that the small files have been merged:

SHOW PARTITIONS EXTENDED test_sql_compaction;

Using the OPTIMIZE Command to Trigger Small File Compaction

Overview

OPTIMIZE is an asynchronous file optimization command provided by Lakehouse, used to merge small files, optimize storage layout, and enhance query performance. This command only supports execution on a GP (General Purpose) compute cluster. After execution, it immediately returns a task ID, while the actual optimization operation is executed asynchronously in the background.

Syntax

OPTIMIZE table_name
[WHERE predicate]  -- Optional partition filter condition
[OPTIONS ('key' = 'value')]  -- Optional configuration parameters

Parameter Description

  1. table_name (Required)
    • The name of the target table to be optimized, in the format [schema_name.]table_name.
  2. WHERE predicate (Required)
    • Partition filter condition, which must include a complete partition column match condition.
    • Supported formats: partition_column = 'value' or composite partitions like dt='2023-01-01' AND region='us'.
    • OPTIONS: Reserved parameters for Lakehouse.

Example Usage:

Suppose you have a table named test_sql_compaction for storing partitioned data.

DROP TABLE test_sql_compaction;
CREATE TABLE test_sql_compaction (
    id BIGINT,
    name STRING
) PARTITIONED BY (ds STRING);

Before enabling the automatic compaction feature, each insert operation may generate multiple small files. For example, inserting 10 records:

-- Insert 10 records
INSERT INTO test_sql_compaction PARTITION (ds ='1') VALUES(1, 'a');
-- Repeat this INSERT operation 10 times

After insertion, you can check the number of generated files:

SHOW PARTITIONS EXTENDED test_sql_compaction;

Execute the small file compaction command:

OPTIMIZE test_sql_compaction;

After compaction, check the file count again and you will find that the small files have been merged:

SHOW PARTITIONS EXTENDED test_sql_compaction;