Multi-Engine Iceberg Data Lake Federation Query Pipeline Best Practices
Upstream Spark, Flink, or PyIceberg writes data in Iceberg format to OSS/S3/COS. Singdata Lakehouse federates queries directly through an External Catalog (Iceberg REST) without copying data. Dynamic Tables then handle incremental Silver/Gold layer processing and output report metrics. This guide uses 30 simulated orders and 10 product dimension records as the dataset, demonstrating the complete end-to-end setup of this architecture.
Overview
The core challenge with multi-engine Iceberg data lakes is: multiple write engines each maintain their own Iceberg tables, and the analytics layer needs to join this data with internal tables while correctly handling Iceberg DELETE/UPDATE semantics.
Singdata Lakehouse addresses these core problems with the following combination:
| Problem | Solution |
|---|---|
| Iceberg tables written by Spark have DELETE files that PIPE cannot recognize | External Catalog reads snapshots via REST API and correctly applies deletion vectors |
| Multiple engines write separately, schema may evolve | External Catalog automatically tracks Iceberg schema versions with no manual column mapping required |
| Large data volumes, no desire to fully copy into Lakehouse | External Catalog zero-copy federation, data files remain in OSS/S3 |
| Analytics layer needs multi-layer Silver/Gold processing | Dynamic Tables use External Catalog tables as upstream and auto-incrementally refresh |
| Downstream Spark/Trino needs to read Lakehouse internal tables | Lakehouse itself provides a standard Iceberg REST Catalog interface for bidirectional interoperability |
SQL Commands Used
| Command / Feature | Purpose | Notes |
|---|---|---|
CREATE STORAGE CONNECTION | Declare credentials for accessing OSS/S3/COS | Used by External Catalog when reading Parquet data files |
CREATE CATALOG CONNECTION TYPE ICEBERG_REST | Connect to the Iceberg REST Catalog service | Stores authentication info (URI, OAuth, etc.) |
CREATE EXTERNAL CATALOG | Mount an Iceberg Catalog, mapping to three-level catalog.schema.table naming | Federation query entry point |
SELECT catalog.schema.table | Federated query of Iceberg data without landing it | Supports snapshot skipping and delete file merging |
CREATE DYNAMIC TABLE | Use External Catalog tables as upstream, define Silver/Gold processing logic | Declarative SQL, system auto-incrementally refreshes |
REFRESH DYNAMIC TABLE | Trigger a manual refresh | Used for initial builds or debugging |
PIPE vs External Catalog (Iceberg REST): Choosing the Right Approach
When ingesting Iceberg data, two common approaches have different applicable scenarios:
| Dimension | PIPE (LIST_PURGE / EVENT_NOTIFICATION) | External Catalog (Iceberg REST) |
|---|---|---|
| File understanding | Scans Parquet files, does not read Iceberg metadata | Reads snapshots/manifests via REST API |
| DELETE/UPDATE handling | Cannot identify delete files, only sees data files | Correctly applies deletion vectors, results are accurate |
| Schema evolution | Requires manual column mapping maintenance, error-prone | Auto-detects column changes, follows Iceberg schema versions |
| Data landing | Data written to Lakehouse internal tables | Data stays in OSS/S3, zero-copy federation |
| Applicable scenarios | One-time historical file import, append-only writes | Multi-engine shared Iceberg, scenarios with UPDATE/DELETE |
| Prerequisites | Requires Volume + Storage Connection | Requires Iceberg REST Catalog service |
Prerequisites
All examples in this guide run under the best_practice_iceberg_fed schema.
External Catalog Layer: Connecting to an Iceberg REST Catalog
Prerequisites
External Catalog (Iceberg REST) requires the following environment to be set up in advance:
- Write side: An engine capable of writing Iceberg format (Apache Spark, Flink, PyIceberg)
- Iceberg REST Catalog service: Choose any of the following:
- Open-source self-hosted: Apache Polaris, Apache Gravitino, Project Nessie
- Cloud-managed: Snowflake Open Catalog, AWS Glue (Iceberg REST mode)
- Object storage: OSS (Alibaba Cloud), S3 (AWS), or COS (Tencent Cloud) to store Parquet data files
Step 1: Create a Storage Connection
A Storage Connection stores the credentials needed to access Parquet data files. The External Catalog uses this connection for authentication when reading data files.
OSS (Alibaba Cloud) example:
S3 (AWS) example:
Step 2: Create a Catalog Connection (TYPE ICEBERG_REST)
A Catalog Connection stores the API endpoint and authentication credentials for the Iceberg REST Catalog service.
Generic Iceberg REST Catalog (no auth, e.g. Nessie / self-hosted Gravitino):
With OAuth authentication (e.g. Apache Polaris / Snowflake Open Catalog):
Step 3: Create an External Catalog
Create an External Catalog based on the Catalog Connection, mapped to three-level catalog.schema.table naming:
After successful creation, you can view the schemas and tables in the Catalog:
Simulated Data Layer: Local Tables as Iceberg External Table Substitutes
When there is no actual Iceberg REST Catalog environment, use Lakehouse internal tables to simulate the effect of reading from Iceberg, and verify the downstream Dynamic Table processing logic.
Create the Product Dimension Table
Import from a local CSV file (recommended):
You can also directly insert a small batch of test data inline (no CSV file needed):
Create the Order Fact Table (Simulating Iceberg External Table Reads)
Import from a local CSV file (recommended):
You can also directly insert a small batch of test data inline (no CSV file needed):
Verify data was written:
Result:
| order_count |
|---|
| 30 |
Silver Layer: Dynamic Table Cleansing + Dimension Join
The Silver layer joins the order table (corresponding to the Iceberg external table) with the product dimension table, filters out cancelled orders, and calculates actual revenue and gross profit.
In an actual Iceberg federation environment, replace doc_orders_local with iceberg_catalog.ecommerce.orders to use the same DDL.
Trigger the first manual refresh and then query the Silver layer results:
Result:
| order_id | customer_id | product_name | category | region | order_date | net_revenue | gross_profit |
|---|---|---|---|---|---|---|---|
| ORD001 | C101 | Smartphone X1 | Electronics | East | 2024-01-05 | 3299.0 | 1499.0 |
| ORD002 | C102 | Laptop Pro 14 | Electronics | West | 2024-01-06 | 7599.05 | 3609.05 |
| ORD003 | C103 | Wireless Earbuds | Electronics | East | 2024-01-07 | 998.0 | 638.0 |
| ORD004 | C104 | Cotton T-Shirt | Apparel | South | 2024-01-08 | 348.3 | 267.3 |
| ORD005 | C105 | Running Shoes | Apparel | North | 2024-01-09 | 699.0 | 479.0 |
| ORD006 | C101 | Coffee Maker | Kitchen | East | 2024-01-10 | 899.0 | 549.0 |
| ORD007 | C106 | Yoga Mat | Sports | West | 2024-01-11 | 398.0 | 308.0 |
| ORD008 | C107 | Backpack 30L | Accessories | South | 2024-01-12 | 299.0 | 219.0 |
| ORD009 | C108 | Smartphone X1 | Electronics | North | 2024-01-13 | 5938.2 | 2698.2 |
| ORD010 | C109 | LED Desk Lamp | Furniture | East | 2024-01-14 | 199.0 | 144.0 |
The Silver layer filters out ORD026 (status='cancelled', 3,299 yuan), retaining only 29 completed orders. net_revenue is net of discount and gross_profit is net of cost.
Gold Layer: Dynamic Table Aggregated Metrics
The Gold layer aggregates on top of Silver by region, category, and year-month, outputting order count, total revenue, gross profit, and profit margin per dimension for direct BI tool queries.
Result:
| region | category | order_year | order_month | order_count | total_revenue | total_profit | profit_margin_pct |
|---|---|---|---|---|---|---|---|
| North | Electronics | 2024 | 1 | 2 | 13937.2 | 6497.2 | 46.62 |
| East | Electronics | 2024 | 2 | 1 | 7999.0 | 3799.0 | 47.49 |
| West | Electronics | 2024 | 1 | 1 | 7599.05 | 3609.05 | 47.49 |
| North | Electronics | 2024 | 2 | 1 | 7599.05 | 3609.05 | 47.49 |
| East | Electronics | 2024 | 1 | 3 | 4796.0 | 2456.0 | 51.21 |
| South | Electronics | 2024 | 1 | 1 | 3299.0 | 1499.0 | 45.44 |
| West | Electronics | 2024 | 2 | 1 | 1347.3 | 861.3 | 63.93 |
| South | Health | 2024 | 2 | 1 | 1047.0 | 687.0 | 65.62 |
| East | Kitchen | 2024 | 1 | 1 | 899.0 | 549.0 | 61.07 |
| West | Kitchen | 2024 | 2 | 1 | 899.0 | 549.0 | 61.07 |
Electronics has the highest revenue (North January: 13,937 yuan), while Health and Kitchen categories have relatively higher profit margins (60%+). BI tools can connect directly to this table to power regional sales dashboards.
Configure Refresh Scheduling
Dynamic Table periodic refresh is managed through Lakehouse Studio tasks, not written in the DDL. The advantage is that you can attach monitoring alerts and data quality check rules to the same task, providing a unified operations entry point.
The following tasks have been created under the best_practices/iceberg_fed/ path:
| Task Name | SQL Content | Schedule |
|---|---|---|
refresh_dt_silver_orders | REFRESH DYNAMIC TABLE best_practice_iceberg_fed.dt_silver_orders | Every hour (0 0/1 * * ?) |
refresh_dt_gold_metrics | REFRESH DYNAMIC TABLE best_practice_iceberg_fed.dt_gold_regional_metrics | Every hour (0 0/1 * * ?) |
cz-cli workflow for creating tasks:
Bidirectional Interoperability: Lakehouse as an Iceberg REST Provider
In addition to reading external Iceberg tables, Singdata Lakehouse itself provides a standard Iceberg REST Catalog interface externally. External engines such as Spark and Trino can reversely read Lakehouse internal tables, enabling bidirectional data sharing:
- Direction 1 (main flow of this guide): External Spark writes Iceberg → Lakehouse External Catalog federated read
- Direction 2 (reverse): Lakehouse internal tables → expose Iceberg REST API externally → external Spark/Trino reads
For configuration to provide Iceberg REST API externally, see: Accessing Lakehouse from Spark via Iceberg REST Catalog
Notes
- External Catalog and Catalog Connection creation validates REST API reachability; if the Iceberg REST Catalog service is not running or the network is unavailable, DDL will fail
- Currently, only the
instance_adminrole can query External Catalogs; after writing results to internal tables via Dynamic Tables, downstream access can be controlled with normal table permissions - When a Dynamic Table references an External Catalog table, use three-level naming in the DDL (
catalog.schema.table); use two-level naming for internal tables (schema.table) REFRESH DYNAMIC TABLEon a Dynamic Table that references an External Catalog table triggers a complete Iceberg snapshot read; if the upstream Iceberg table changes frequently, the scheduling interval should be no shorter than 5 minutes to avoid excessive REST API calls- In
CREATE CATALOG CONNECTION TYPE ICEBERG_REST, there is no=afterTYPEand no commas between parameters — this is a common syntax error
Related Documentation
- Create Catalog Connection — Full syntax and parameter reference for TYPE ICEBERG_REST
- Create External Catalog — CREATE EXTERNAL CATALOG command reference
- External Catalog Overview — External Catalog vs External Schema selection guide
- Dynamic Table — Dynamic Table core concepts and usage guide
- Accessing Lakehouse from Spark via Iceberg REST Catalog — Bidirectional interoperability configuration
