OPTIMIZE Command

Overview

OPTIMIZE is the core operation command in Singdata Lakehouse for table data optimization and compaction. By consolidating small files, cleaning up delete markers, and reorganizing data layout, it can significantly improve query performance and storage efficiency. Although Lakehouse automatically performs file compaction periodically in the background by default, in scenarios with frequent updates or where fine-grained control over compaction frequency is needed, users can manually invoke this command to meet specific business requirements. This command supports both asynchronous and synchronous execution modes, providing flexible optimization solutions for different scenarios.

Syntax

OPTIMIZE table_name 
[WHERE predicate]  -- Optional partition filter condition
[OPTIONS(...)]
  1. table_name (Required)
  • The name of the target table to optimize, in the format [schema_name.]table_name.
  1. WHERE predicate (Optional)
  • Partition filter condition, must include a complete partition column match condition.
  • Supported formats: partition_column = 'value' or compound partition dt='2023-01-01' AND region='us'.
  1. OPTIONS (Optional)
  • Lakehouse reserved parameters for controlling optimization behavior.

Notes

  • This feature can only run in a General Purpose Virtual Cluster (GENERAL PURPOSE VIRTUAL CLUSTER); it will not take effect in an analytical cluster.

Core Features

  • Small File Compaction: Merges multiple small data files into larger files, reducing file metadata overhead.
  • Delete Marker Cleanup: Cleans up delete markers generated by UPDATE/DELETE operations, reclaiming storage space.
  • Data Reorganization: Rearranges data layout to improve query performance.

Execution Modes

1. Asynchronous Execution Mode (Default)

Asynchronous execution is the default behavior of OPTIMIZE. The operation runs in the background without blocking the current connection.

Characteristics

  • Non-blocking: Returns a Job ID immediately, and the operation executes in the background.

Syntax

-- Default asynchronous execution
OPTIMIZE table_name;

-- Explicitly specify asynchronous execution
OPTIMIZE table_name OPTIONS('cz.sql.optimize.table.async' = 'true');

2. Synchronous Execution Mode

Synchronous execution blocks the current connection until the optimization operation is fully completed before returning the result.

Characteristics

  • Blocking: Only returns after the operation completes, during which the connection is occupied.
  • Real-time Feedback: Immediately obtains detailed execution statistics and success status.
  • Applicable Scenarios: Development and testing, small table optimization, verifying optimization effects.
  • Deterministic: Ensures the operation is fully completed.

Syntax

OPTIMIZE table_name OPTIONS('cz.sql.optimize.table.async' = 'false');

Applicable Scenarios

  • Storage cleanup after extensive UPDATE/DELETE operations
  • Regular maintenance tasks to free temporary storage space
  • Optimizing the overall storage layout of tables