OPTIMIZE Command

Overview

OPTIMIZE is similar to a VACUUM operation in traditional databases — it merges multiple small files into larger files to accelerate subsequent queries without changing the actual data content in the table.

OPTIMIZE is the core operation command in Singdata Lakehouse for table data optimization and compaction. By consolidating small files, cleaning up delete markers, and reorganizing data layout, it can significantly improve query performance and storage efficiency. Although Lakehouse automatically performs file compaction periodically in the background by default, in scenarios with frequent updates or where fine-grained control over compaction frequency is needed, you can manually invoke this command to meet specific business requirements. This command supports both asynchronous and synchronous execution modes, providing flexible optimization solutions for different scenarios.

Syntax

OPTIMIZE table_name [WHERE predicate] -- Optional partition filter condition [OPTIONS(...)]

  1. table_name (Required)
  • The name of the target table to optimize, in the format [schema_name.]table_name.
  1. WHERE predicate (Optional)
  • Partition filter condition; must include a complete partition column match condition.
  • Supported formats: partition_column = 'value' or compound partition dt='2023-01-01' AND region='us'.
  1. OPTIONS (Optional)
  • Lakehouse reserved parameters for controlling optimization behavior.

Notes

  • This feature can only run in a General Purpose Virtual Cluster (GENERAL PURPOSE VIRTUAL CLUSTER); it will not take effect in an analytical cluster.

Core Features

  • Small File Compaction: Merges multiple small data files into larger files, reducing file metadata overhead.
  • Delete Marker Cleanup: Cleans up delete markers generated by UPDATE/DELETE operations, reclaiming storage space.
  • Data Reorganization: Rearranges data layout to improve query performance.

Execution Modes

1. Asynchronous Execution Mode (Default)

Asynchronous execution is the default behavior of OPTIMIZE. The operation runs in the background without blocking the current connection.

Characteristics

  • Non-blocking: Returns a Job ID immediately; the operation executes in the background.

Syntax

-- Default asynchronous execution OPTIMIZE table_name; -- Explicitly specify asynchronous execution OPTIMIZE table_name OPTIONS('cz.sql.optimize.table.async' = 'true');

2. Synchronous Execution Mode

Synchronous execution blocks the current connection until the optimization operation is fully completed before returning the result.

Characteristics

  • Blocking: Only returns after the operation completes; the connection is occupied during this time.
  • Real-time Feedback: Immediately obtains detailed execution statistics and success status.
  • Applicable Scenarios: Development and testing, small table optimization, verifying optimization effects.
  • Deterministic: Ensures the operation is fully completed.

Syntax

OPTIMIZE table_name OPTIONS('cz.sql.optimize.table.async' = 'false');

Usage Examples

Example 1: Asynchronous Optimization of an Entire Table (Default)

OPTIMIZE doc_test.orders;

Asynchronous mode returns immediately; the optimization task runs in the background. If the table files are already compact enough, the response is:

No compaction job generated

If a compaction job is triggered, a Job ID is returned. You can check progress via SHOW JOBS.

Example 2: Synchronous Optimization of a Specific Partition

doc_test.orders is partitioned by order_date. You can run synchronous optimization on a single partition:

OPTIMIZE doc_test.orders WHERE order_date = '2024-02-05' OPTIONS('cz.sql.optimize.table.async' = 'false');

Synchronous mode blocks until the operation completes. Example response:

No compaction job generated

Example 3: Synchronous Optimization of an Entire Table

OPTIMIZE doc_test.orders OPTIONS('cz.sql.optimize.table.async' = 'false');

Applicable Scenarios

  • Storage cleanup after extensive UPDATE/DELETE operations
  • Regular maintenance tasks to free temporary storage space
  • Optimizing the overall storage layout of tables