Multi-Cloud Unified Data Lake Acceleration

The core concept of "data lake acceleration" is no data migration—directly querying and processing files in existing object storage using Singdata Lakehouse's Serverless compute, replacing traditional Spark/Hive ETL and Presto/Trino ad hoc queries.

Alibaba Cloud OSS, Tencent Cloud COS, and AWS S3—the three mainstream object storage services—are unified through a single approach: Volume mount → Pipe continuous ingestion → Dynamic Table incremental aggregation. Aside from different parameter names when creating Storage Connections, all other SQL syntax is identical.


Why a Multi-Cloud Unified Approach

  • Enterprise data is spread across multiple cloud providers and needs a unified query entry point
  • Different cloud object storage APIs differ, but Lakehouse data processing logic should remain consistent
  • Reduces the mental overhead and operational costs of switching between cloud environments

Singdata Lakehouse's abstraction layer solves this perfectly: the same SQL, with just a Connection parameter change, runs on different clouds.


SQL Commands Involved

Command / FunctionPurposeMulti-Cloud Differences
CREATE STORAGE CONNECTIONEstablish object storage authentication channelThe only step with differences (different parameter names)
CREATE EXTERNAL VOLUMEMount object storage pathSyntax fully unified (only change protocol prefix)
COPY INTO VOLUMEExport data to VolumeFully unified
SELECT FROM VOLUMEDirectly query Volume filesFully unified
DIRECTORY()List files in a VolumeFully unified
COPY INTOImport data from Volume to tableFully unified
CREATE PIPECreate continuous ingestion pipelineFully unified
ALTER PIPEPause/resume PipeFully unified
load_history()View historical load recordsFully unified
CREATE DYNAMIC TABLECreate incremental refresh aggregation tableFully unified
REFRESH DYNAMIC TABLEManually trigger refreshFully unified

Core Architecture


Cross-Cloud Comparison

The exact same Volume + Pipe test cases were executed on Alibaba Cloud Shanghai (f8866243) and Tencent Cloud Shanghai (0c3c358d):

Configuration Differences

ConfigurationAlibaba Cloud OSSTencent Cloud COS
Connection typeTYPE OSSTYPE COS
Auth parameter namesaccess_id / access_key (lowercase)ACCESS_KEY / SECRET_KEY (uppercase)
EndpointENDPOINT = 'oss-cn-shanghai.aliyuncs.com'No ENDPOINT required
Region parameterEmbedded in ENDPOINTREGION = 'ap-shanghai'
APP_IDNot applicableAPP_ID = '1253896122'
Volume location syntaxLOCATION 'oss://bucket/path/'LOCATION 'cos://bucket-appid/path/'
Recommended Endpointoss-cn-shanghai-internal.aliyuncs.com (internal)Auto-resolved, no configuration needed

Feature Consistency

Test ItemAlibaba Cloud OSSTencent Cloud COSConclusion
Storage Connection creationDifferent parameter names, otherwise identical
External Volume creationSyntax fully identical
COPY INTO VOLUME export CSVIdentical
COPY INTO VOLUME export ParquetIdentical
DIRECTORY() file listingIdentical
SELECT FROM VOLUME (CSV)✅ f0-f4 column names✅ f0-f4 column namesIdentical
SELECT FROM VOLUME (Parquet)✅ preserves column names✅ preserves column namesIdentical
COPY INTO TABLE from VolumeIdentical
PIPE LIST_PURGE creationIdentical
PIPE load trigger✅ ~30s✅ ~30sIdentical
PIPE PURGE delete source filesIdentical
PIPE load_history dedupIdentical
PIPE pause/resumeIdentical
Dynamic Table incremental refreshIdentical

Key finding: Aside from different parameter names when creating the Connection, all other 12 test items are completely identical. Volume, Pipe, and Dynamic Table SQL syntax has no differences.


Unified Implementation

Step 1: Create Storage Connection (the only step with differences)

-- ============ Alibaba Cloud OSS ============ CREATE STORAGE CONNECTION IF NOT EXISTS my_storage_conn TYPE OSS access_id = '<AccessKey ID>' access_key = '<AccessKey Secret>' ENDPOINT = 'oss-cn-shanghai-internal.aliyuncs.com'; -- ============ Tencent Cloud COS ============ CREATE STORAGE CONNECTION IF NOT EXISTS my_storage_conn TYPE COS ACCESS_KEY = '<SecretId>' SECRET_KEY = '<SecretKey>' REGION = 'ap-shanghai' APP_ID = '<APP_ID>'; -- ============ AWS S3 ============ CREATE STORAGE CONNECTION IF NOT EXISTS my_storage_conn TYPE S3 ACCESS_KEY = '<Access Key ID>' SECRET_KEY = '<Secret Access Key>' REGION = 'us-east-1';

Step 2: Create External Volume (unified syntax across all three clouds)

-- ✅ Unified syntax for all three clouds: only change the protocol prefix in LOCATION CREATE EXTERNAL VOLUME my_data_vol LOCATION 'oss://my-bucket/data/' -- Alibaba Cloud: oss:// -- LOCATION 'cos://my-bucket-appid/data/' -- Tencent Cloud: cos:// -- LOCATION 's3://my-bucket/data/' -- AWS: s3:// USING CONNECTION my_storage_conn DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = FALSE) RECURSIVE = TRUE COMMENT 'Multi-cloud unified data lake acceleration Volume';

Step 3: Data Import/Export (fully unified)

-- Export to Volume (unified across all three clouds) COPY INTO VOLUME my_data_vol SUBDIRECTORY 'export/' FROM TABLE source_table FILE_FORMAT = (TYPE = PARQUET); -- Directly query from Volume (unified across all three clouds) SELECT * FROM VOLUME my_data_vol USING PARQUET FILES('export/part00001.parquet'); -- Import to table (unified across all three clouds) COPY INTO target_table FROM VOLUME my_data_vol USING PARQUET SUBDIRECTORY 'export/';

Step 4: Pipe Continuous Ingestion (fully unified)

-- Create dedicated Volume for Pipe CREATE EXTERNAL VOLUME pipe_vol LOCATION 'oss://my-bucket/incoming/' -- change only the protocol prefix across clouds USING CONNECTION my_storage_conn DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = TRUE) RECURSIVE = TRUE; -- Create Pipe (unified across all three clouds) CREATE PIPE my_pipe INGEST_MODE = 'LIST_PURGE' VIRTUAL_CLUSTER = 'DEFAULT' COMMENT 'Multi-cloud unified continuous ingestion pipeline' AS COPY INTO target_table FROM VOLUME pipe_vol USING CSV PURGE = TRUE;

Step 5: Dynamic Table Incremental Aggregation (fully unified)

CREATE OR REPLACE DYNAMIC TABLE summary_table REFRESH INTERVAL 1 DAY vcluster DEFAULT COMMENT 'Multi-cloud unified aggregated metrics' AS SELECT category, COUNT(*) AS cnt, SUM(amount) AS total FROM target_table GROUP BY category;


Best Practices for Multi-Cloud Unification

1. Code Reuse Strategy

Project directory structure: ├── connections/ │ ├── aliyun_oss.sql ← only this file differs │ ├── tencent_cos.sql ← only this file differs │ └── aws_s3.sql ← only this file differs ├── volumes/ │ └── create_volumes.sql ← universal across all three clouds (protocol prefix can be replaced via variable) ├── tables/ │ └── ddl.sql ← fully unified ├── pipes/ │ └── create_pipes.sql ← fully unified └── dynamic_tables/ └── aggregates.sql ← fully unified

Only the Connection creation SQL needs to be written per cloud. The remaining 90% of the code can be reused directly.

2. Naming Convention Recommendations

ObjectConventionExample
Storage Connection<cloud>_<purpose>oss_prod_conn, cos_archive_conn
External Volume<source_system>_volorders_vol, logs_vol
Pipe<source_system>_pipeorders_pipe, logs_pipe

Do not embed cloud provider information in Volume/Pipe names—these objects may appear in multi-cloud reuse scenarios.

3. Cost Optimization

StrategyDescription
Use internal EndpointsAlibaba Cloud: *-internal.aliyuncs.com, no egress fees, lower latency
T+1 refresh frequencyMost analytics scenarios do not need minute-level refresh; 1 DAY is sufficient
PURGE=trueLIST_PURGE mode auto-deletes source files, prevents OSS/COS storage accumulation
GP type Virtual ClusterUse DEFAULT (GENERAL type), Serverless on-demand billing
Keep file size at 128-256MBLarge CSV/Parquet files are 3-5x more efficient than many small files

4. Security Recommendations

  • Do not use root account AK/SK; create sub-accounts with minimum permissions (Bucket read + specific directory write)
  • Internal Endpoints can bind VPC policies to restrict access sources
  • AK/SK in Storage Connection is not visible in SHOW STORAGE CONNECTIONS (masked)
  • When using AWS S3, prefer IAM Role (ROLE_ARN) over long-lived AK/SK

FAQ

Q: How is cross-cloud data transfer latency calculated?

Volume does not migrate data—files always remain in the original object storage. Query network latency depends on the link from the Lakehouse instance's region to the object storage. Same-region internal Endpoint is recommended (such as oss-cn-shanghai-internal.aliyuncs.com), with latency typically under 10ms.

Q: Can you access Tencent Cloud COS from an Alibaba Cloud instance?

Not supported. Storage Connection must be in the same cloud provider as the Lakehouse instance. For cross-cloud queries, the following alternatives are available:

  1. Create a Lakehouse instance in the target cloud, use External Catalog for federated queries
  2. Sync data cross-cloud to Lakehouse internal tables via data integration (Studio Sync)
  3. Use Private Link to bridge the network, then access via External Schema

Q: Is EVENT_NOTIFICATION mode for Pipe supported on all three clouds?

CloudLIST_PURGEEVENT_NOTIFICATION
Alibaba Cloud OSS✅ (requires MNS queue configuration)
Tencent Cloud COS
AWS S3✅ (requires SQS queue configuration)

EVENT_NOTIFICATION mode supports Alibaba Cloud OSS and AWS S3, with lower latency (near real-time) and without deleting source files. Tencent Cloud COS does not support it yet.

Q: Why does COPY INTO VOLUME require SUBDIRECTORY?

COPY INTO VOLUME without the SUBDIRECTORY clause will throw Syntax error at or near 'FROM'. This is a mandatory SQL parser requirement, unrelated to the cloud platform. To export to the Volume root path, use SUBDIRECTORY '/'.