Volume + Pipe + Dynamic Table End-to-End Practice
"Data lake acceleration" refers to using the three capabilities of object storage mounting (Volume), continuous data ingestion (Pipe), and incremental computation (Dynamic Table) to directly query, process, and consume file data in object storage using Serverless compute—without migrating data—replacing traditional Spark/Hive ETL and Presto/Trino ad hoc queries.
Applicable scenarios:
- Automatic file ingestion: CSV/Parquet files periodically uploaded to OSS/COS/S3 are automatically detected and ingested by Pipe, no manual trigger required
- Incremental ETL: After files are ingested, Dynamic Table automatically computes aggregated metrics incrementally, T+1 reports generated without delay
- Legacy data activation: Large volumes of historical files in object storage can be queried directly via Volume mount, no data migration required
Core data flow:
Core Concepts
| Object | Description | Analogy |
|---|---|---|
| External Volume | Mounts OSS/COS/S3 path for zero-copy access | "Filesystem" of Lakehouse |
| Pipe | Continuously running data ingestion pipeline, automatically detects new files | Conveyor belt—files are ingested as soon as they are uploaded |
| Dynamic Table | Materialized aggregation table that automatically refreshes incrementally | Replaces scheduled ETL jobs |
The three work together to form a self-driving data pipeline: file upload → automatic ingestion → automatic aggregation, fully automated with no manual scheduling.
SQL Commands Involved
| Command / Function | Purpose | Use Case |
|---|---|---|
CREATE STORAGE CONNECTION | Establish object storage authentication channel | One-time setup, shared by all Volumes |
CREATE EXTERNAL VOLUME | Mount object storage path to a Schema | Configure once per Bucket subdirectory |
COPY INTO VOLUME | Export data to Volume | Generate files for downstream consumption |
SELECT FROM VOLUME | Directly query files in Volume | Ad hoc queries, data exploration |
DIRECTORY() | List files in a Volume | View file list, validate exports |
ALTER VOLUME REFRESH | Manually refresh Volume directory cache | Use when AUTO_REFRESH=FALSE |
CREATE PIPE | Create continuous data ingestion pipeline | Automatic file ingestion |
ALTER PIPE | Pause/resume Pipe | Operations management |
DESC PIPE EXTENDED | View Pipe status and configuration | Monitoring, troubleshooting |
load_history() | Query table's historical load records | Validate Pipe loading, troubleshoot deduplication |
CREATE DYNAMIC TABLE | Create auto-incrementally refreshing aggregation table | Replace scheduled ETL jobs |
REFRESH DYNAMIC TABLE | Manually trigger Dynamic Table refresh | Immediately refresh after initial creation |
SHOW DYNAMIC TABLE REFRESH HISTORY | View refresh history | Monitor incremental refresh status |
Prerequisites
The following uses Alibaba Cloud OSS as an example, completing the full pipeline using the semantic_model_test schema and DEFAULT Virtual Cluster.
End-to-End Practice
Step 1: Create Storage Connection
Establish the authentication channel between Lakehouse and OSS.
Step 2: Create External Volume
Mount the OSS Bucket subdirectory as a Lakehouse Volume.
Key parameter descriptions:
| Parameter | Description |
|---|---|
LOCATION | OSS path, must point to a specific subdirectory, not the bucket root path |
USING CONNECTION | References the Storage Connection created in step 1 |
DIRECTORY.ENABLE | Enables directory metadata index; allows using DIRECTORY() function to query file list |
AUTO_REFRESH | Set to TRUE for auto-refresh; when set to FALSE, manual ALTER VOLUME REFRESH is required |
RECURSIVE | Recursively scan subdirectories |
Step 3: Create Source Table and Export to Volume
Verify bidirectional read/write capability of the Volume.
Step 4: Validate Volume File Read/Write
Step 5: Create Pipe for Continuous Ingestion
Pipe continuously monitors the Volume for new files and automatically ingests them into the target table.
Pipe Management
Trigger Pipe Loading
Pipe starts running immediately after creation (polls approximately every 30 seconds). Writing new files to the Volume path triggers loading:
Step 6: Create Dynamic Table for Incremental Consumption
Based on the Pipe-ingested table, create a Dynamic Table for automatic incremental aggregation.
View DT Refresh History
Full Data Flow Validation
| Stage | Data Volume | Description |
|---|---|---|
| Source | 5 rows | Raw data (INSERT) |
| ODS | 5 rows | Pipe ingestion (CSV → table) |
| Summary | 3 rows | Dynamic Table aggregation (3 category groups) |
Best Practices
File Size Recommendations
| Format | Recommended Size | Description |
|---|---|---|
| gzip compressed | ~50 MB | Files that are too large reduce parallelism |
| CSV uncompressed | 128-256 MB | Balance between scan speed and file count |
| Parquet uncompressed | 128-256 MB | Columnar storage, more efficient for queries |
Volume and Pipe Design Principles
- Each Pipe has its own Volume: Different Pipes cannot share the same Volume to avoid interference
- Volume points to a subdirectory: Do not point to the bucket root path, as this will cause Pipe creation errors
- LIST_PURGE vs EVENT_NOTIFICATION:
LIST_PURGE: Simple configuration, suitable for most scenarios, deletes source files after loadingEVENT_NOTIFICATION: Low latency, retains source files, but only supports OSS+S3, and requires additional MNS/SQS configuration
Dynamic Table Design Principles
- Use GP type Virtual Cluster (such as
DEFAULT): GP type supports small file merging; AP type does not - Enable change_tracking: If the source table does not have it enabled, DT performs a full refresh every time with no incremental support
- REFRESH immediately after creation: Ensures first data availability and resets the refresh baseline time
Data Lifecycle
Test Validation Results
The following results are from actual testing on an Alibaba Cloud Shanghai instance (f8866243):
| Test Item | Result | Details |
|---|---|---|
| Storage Connection creation | ✅ | OSS connection normal |
| External Volume mount | ✅ | Directory access normal; AUTO_REFRESH=FALSE requires manual refresh |
| SELECT FROM VOLUME (CSV) | ✅ | Without header, column names are f0-f4; Parquet preserves column names |
| SELECT FROM VOLUME (Parquet) | ✅ | Column names and types both preserved |
| COPY INTO TABLE (CSV) | ✅ | 5 rows correctly imported |
| COPY INTO TABLE (Parquet) | ✅ | 5 rows correctly imported |
| COPY INTO VOLUME export | ⚠️ | Must include SUBDIRECTORY, otherwise syntax error |
| Pipe LIST_PURGE creation | ✅ | Status immediately becomes RUNNING |
| Pipe load trigger | ✅ | Auto-loaded in ~30 seconds; load_history records complete |
| Pipe PURGE deletion | ✅ | Source files auto-deleted after successful load |
| Pipe pause/resume | ✅ | Files not loaded during pause; re-scanned after resume |
| Pipe deduplication | ✅ | Same-name files correctly blocked by load_history (7-day retention) |
| Dynamic Table incremental refresh | ✅ | INCREMENTAL mode, aggregation completed in 346ms |
Notes
| Note | Impact | Recommendation |
|---|---|---|
COPY INTO VOLUME requires SUBDIRECTORY | Without it, syntax error | Use SUBDIRECTORY '/' for root path |
| Generic CSV column names | Without header, column names are f0-f4 | Use OPTIONS('header'='true') or switch to Parquet |
Manual refresh needed when AUTO_REFRESH=FALSE | Directory does not update | Execute ALTER VOLUME name REFRESH |
| Pipe same-name file deduplication | Same-name files not loaded after pause/resume | Rename file on re-upload, or wait 7 days for expiration |
load_history column name | last_copy_time not last_load_time | Pay attention to column name when querying |
| Virtual Cluster auto-sleep | Suspends after 60s without queries | Serverless mode pays on-demand, no concern needed |
| Pipe COPY statement is immutable | When logic adjustment is needed | DROP PIPE then CREATE again |
| AP type Virtual Cluster does not support small file merging | Query performance degrades over time | Always use GP type (DEFAULT) |
Related Documents
- Multi-Cloud Unified Data Lake Acceleration — Alibaba Cloud/Tencent Cloud/AWS real-world comparison
- Volume Overview — Volume concepts, types, and file operations
- Object Storage Pipe — LIST_PURGE and EVENT_NOTIFICATION complete configuration
- Pipe Overview — Pipe vs Table Stream comparison
- Dynamic Table Overview — Incremental computation mechanism
- Create External Volume — Complete DDL syntax
- Import Data from Volume — COPY INTO syntax
- Export Data to Volume — Export syntax
- Query SHOW JOBS — Filter Pipe jobs by query_tag
