Connecting Spark to Lakehouse (Iceberg REST)
Overview
Lakehouse exposes a standard Apache Iceberg REST Catalog API, so external compute engines such as Apache Spark can access and query Iceberg tables stored in the Lakehouse data lake (for example, on OSS object storage) through a unified REST protocol. You can pick a different compute engine for each analysis workload while the data stays in one unified store.
Core Features
- Standard Compatibility: Compatible with Apache Iceberg REST Catalog specification
- Engine Support: Supports the Spark compute engine
- Credential Delegation: Manages storage access permissions via vended-credentials mode
- Multi-Cloud Support: Supports Alibaba Cloud OSS (future versions will support AWS S3, Tencent Cloud COS, etc.)
Usage Limitations
Data Type Compatibility
When accessing Singdata Lakehouse tables via the Spark engine, the following data type limitations exist:
Currently unsupported data types:
- Integer types: SMALLINT, TINYINT
- Semi-structured types: JSON
- Vector types: VECTOR
Quick Start
Prerequisites
- Account and password for a Singdata Lakehouse instance
- Target compute engine environment: Spark 3.5+
- Required dependency packages:
  - Apache Iceberg library (Scala 2.12 / Spark 3.5.x): org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1
  - Corresponding cloud object storage SDK (e.g., Alibaba Cloud OSS: com.aliyun.oss:aliyun-sdk-oss:3.18.1)
PySpark Integration Example
Environment Preparation
Set SPARK_HOME environment variable (adjust to actual installation path):
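A minimal sketch of this step from Python, assuming Spark is unpacked under /opt/spark (a placeholder path, not taken from this guide). Using setdefault keeps any value already exported in the shell.

```python
import os

# Placeholder path: replace with the directory where Spark 3.5+ is installed.
# setdefault leaves an existing SPARK_HOME untouched.
os.environ.setdefault("SPARK_HOME", "/opt/spark")

print("SPARK_HOME =", os.environ["SPARK_HOME"])
```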
Authentication configuration for connecting to Singdata Lakehouse
Configure authentication information:
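The settings for this step might be collected as follows. This is only a sketch: the variable names and the LAKEHOUSE_* environment variable names are illustrative assumptions, not names defined by Singdata Lakehouse, and every default is a placeholder to replace.

```python
import os

# Read credentials from the environment so they stay out of source control.
# The LAKEHOUSE_* names are hypothetical; the fallbacks are placeholders.
username = os.environ.get("LAKEHOUSE_USERNAME", "your_username")
password = os.environ.get("LAKEHOUSE_PASSWORD", "your_password")
instance_id = os.environ.get("LAKEHOUSE_INSTANCE", "your_instance_id")
workspace = os.environ.get("LAKEHOUSE_WORKSPACE", "your_workspace")
```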
Generate Basic Authentication header:
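A Basic authentication header is the word Basic followed by the Base64 encoding of username:password. The sketch below shows one way to build it with the Python standard library; the function name basic_auth_header and the credentials are placeholders chosen for illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Return an HTTP Basic Authorization header value for the credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials: replace with your Lakehouse account and password.
auth_header = basic_auth_header("your_username", "your_password")
```

The resulting string has the same shape as the example value shown for header.Authorization in the parameter table.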
Create Spark Session
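The following sketch assembles the settings from the Detailed Configuration Parameters table into a dictionary and applies them to a SparkSession builder. The helper names build_catalog_conf and create_spark_session, the application name, and all argument values are assumptions for illustration; the keys and fixed values are the ones listed in the table.

```python
def build_catalog_conf(endpoint, instance_id, workspace, auth_header,
                       oss_endpoint="oss-cn-hangzhou.aliyuncs.com",
                       catalog="clickzetta_catalog"):
    """Collect the Spark settings for the Lakehouse Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Spark and Iceberg basics
        "spark.jars.packages": ",".join([
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1",
            "com.aliyun.oss:aliyun-sdk-oss:3.18.1",
        ]),
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # REST catalog core settings
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://{endpoint}/api/v1/catalog/iceberg-rest",
        f"{prefix}.header.instanceName": instance_id,
        f"{prefix}.header.Workspace": workspace,
        f"{prefix}.header.Authorization": auth_header,
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        # OSS storage settings
        f"{prefix}.io-impl": "org.apache.iceberg.aliyun.oss.OSSFileIO",
        f"{prefix}.oss.endpoint": oss_endpoint,
        # Optional settings
        "spark.sql.defaultCatalog": catalog,
        f"{prefix}.default-namespace": "public",
        f"{prefix}.metrics-reporter-impl":
            "org.apache.iceberg.metrics.LoggingMetricsReporter",
    }


def create_spark_session(conf, app_name="lakehouse-iceberg-rest"):
    """Build (or reuse) a SparkSession with the given settings applied."""
    # Imported here so the settings can be built and inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With real values in hand, a call such as create_spark_session(build_catalog_conf(endpoint, instance_id, workspace, auth_header)) would return the session. Keeping the settings in a plain dictionary makes them easy to print and compare against the table when something fails.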
Usage Example
View all namespaces (schemas):
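A small sketch, assuming a SparkSession named spark that is already configured with the clickzetta_catalog catalog; the helper name show_namespaces is illustrative.

```python
def show_namespaces(spark, catalog="clickzetta_catalog"):
    """Return a DataFrame listing the namespaces (schemas) in the catalog."""
    return spark.sql(f"SHOW NAMESPACES IN {catalog}")
```

Calling show_namespaces(spark).show() prints the result.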
View tables in a specified namespace:
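As a sketch under the same assumption (a configured SparkSession named spark), with the public namespace used as an example and show_tables as an illustrative helper name:

```python
def show_tables(spark, namespace="public", catalog="clickzetta_catalog"):
    """Return a DataFrame listing the tables in catalog.namespace."""
    return spark.sql(f"SHOW TABLES IN {catalog}.{namespace}")
```

Calling show_tables(spark).show() prints the tables in the public namespace.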
View table structure:
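A sketch assuming a configured SparkSession named spark; my_table is a placeholder table name and describe_table is an illustrative helper name.

```python
def describe_table(spark, table="clickzetta_catalog.public.my_table"):
    """Return a DataFrame with the column names and types of the table."""
    return spark.sql(f"DESCRIBE TABLE {table}")
```

Calling describe_table(spark).show(truncate=False) prints the full column definitions.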
Query data:
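A sketch assuming a configured SparkSession named spark; the table name is a placeholder and sample_rows is an illustrative helper name. The LIMIT keeps a first exploratory query small.

```python
def sample_rows(spark, table="clickzetta_catalog.public.my_table", limit=10):
    """Return a DataFrame with at most `limit` rows from the table."""
    return spark.sql(f"SELECT * FROM {table} LIMIT {int(limit)}")
```

Calling sample_rows(spark).show() prints the sampled rows.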
Use DataFrame API:
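The same read can be expressed without SQL. This sketch assumes a configured SparkSession named spark; the table name is a placeholder and preview_table is an illustrative helper name.

```python
def preview_table(spark, table="clickzetta_catalog.public.my_table",
                  columns=None, n=10):
    """Return a DataFrame with up to n rows, optionally restricted to columns."""
    df = spark.table(table)  # resolves the Iceberg table through the catalog
    if columns:
        df = df.select(*columns)
    return df.limit(n)
```

Calling preview_table(spark).show() prints the rows; nothing is read from storage until an action such as show() runs.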
Detailed Configuration Parameters
| Parameter | Description | Example Value | Required? |
|---|---|---|---|
| Spark & Iceberg Basic Configuration | |||
| spark.jars.packages | Specifies the dependency packages to be automatically downloaded from Maven Central when the Spark session starts. Includes Iceberg Spark runtime and the SDK for interacting with Alibaba Cloud OSS. | org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,com.aliyun.oss:aliyun-sdk-oss:3.18.1 | Yes |
| spark.sql.extensions | Injects Iceberg extension features into Spark SQL. This allows Spark to parse and execute Iceberg-specific DDL and DML statements (e.g., CREATE TABLE ... USING iceberg). | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions | Yes |
| Lakehouse REST Catalog Core Configuration | |||
| spark.sql.catalog.clickzetta_catalog | Fixed value. Registers a new catalog named clickzetta_catalog with the Iceberg SparkCatalog implementation. This is the entry point for defining an Iceberg Catalog. | org.apache.iceberg.spark.SparkCatalog | Yes |
| spark.sql.catalog.clickzetta_catalog.type | Fixed value. Specifies the type of clickzetta_catalog as rest. This tells Iceberg that the Catalog is a remote service communicating via REST API. | rest | Yes |
| spark.sql.catalog.clickzetta_catalog.uri | The API endpoint address of the REST Catalog service. Spark sends all metadata management requests (e.g., create table, get table info) to this URL. | https://{endpoint}/api/v1/catalog/iceberg-rest. For the endpoint value, refer to the documentation | Yes |
| spark.sql.catalog.clickzetta_catalog.header.instanceName | Custom HTTP request header sent to the REST Catalog. Used to identify your specific instance to the Singdata service. | your_instance_id (replace with your instance ID) | Yes |
| spark.sql.catalog.clickzetta_catalog.header.Workspace | Custom HTTP request header sent to the REST Catalog. Used to specify the workspace to operate on within your Singdata instance. | your_workspace (replace with your workspace name) | Yes |
| spark.sql.catalog.clickzetta_catalog.header.Authorization | Authorization header used for API authentication and client identity verification. In this guide it carries a Basic authentication value: the word Basic followed by the Base64 encoding of username:password. This value should be obtained and passed securely. | auth_header (a variable containing authentication information), e.g.: "Basic VUFUX1RFU1Q6QWJjZDEyMzQ1Ng==" | Yes |
| spark.sql.catalog.clickzetta_catalog.header.X-Iceberg-Access-Delegation | A special request header used to enable the vended credentials mode. Setting it to vended-credentials indicates that the client (Spark) expects the Catalog service to return temporary security credentials for accessing the underlying storage (OSS). This is a more secure access mode that avoids exposing long-term cloud storage credentials on the client side. | vended-credentials | Yes |
| Data Storage (OSS) Configuration | |||
| spark.sql.catalog.clickzetta_catalog.io-impl | Specifies the FileIO implementation for reading and writing data files (e.g., Parquet, ORC). Uses OSSFileIO to interact with Alibaba Cloud OSS. | org.apache.iceberg.aliyun.oss.OSSFileIO | Yes |
| spark.sql.catalog.clickzetta_catalog.oss.endpoint | The regional endpoint of Alibaba Cloud Object Storage Service (OSS). The client accesses OSS buckets through this address. | oss-cn-hangzhou.aliyuncs.com (modify based on your OSS bucket region; refer to the documentation) | Yes |
| Optional/Auxiliary Configuration | |||
| spark.sql.defaultCatalog | Sets the default Catalog for Spark SQL. When set, the Catalog name does not need to be explicitly specified before table names in SQL queries (e.g., you can use SELECT * FROM my_table instead of SELECT * FROM clickzetta_catalog.public.my_table). | clickzetta_catalog | No |
| spark.sql.catalog.clickzetta_catalog.default-namespace | Sets the default namespace (or database/Schema) within clickzetta_catalog. If set, table operations will default to this namespace when none is specified. | public | No (but recommended) |
| spark.sql.catalog.clickzetta_catalog.metrics-reporter-impl | Configures the Iceberg metrics reporter implementation. LoggingMetricsReporter outputs operation metrics (e.g., scan duration, file count) to Spark logs, helpful for debugging and performance analysis. | org.apache.iceberg.metrics.LoggingMetricsReporter | No |
Troubleshooting
Common Issues and Solutions
- Authentication Failure
  - Check whether the username and password are correct.
  - Confirm whether the Base64 encoding is correct.
  - Verify whether the account has the appropriate permissions.
- Connection Timeout
  - Check network connectivity.
  - Confirm the API endpoint address is correct.
  - Adjust timeout parameters.
- Table Not Found
  - Confirm workspace and namespace settings are correct.
  - Use SHOW TABLES to confirm the table name.
  - Check user permissions.
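For the Base64 check under Authentication Failure, a header value can be decoded locally before it is sent. The sketch below uses only the Python standard library; the function name check_basic_auth_header is an illustrative assumption, and it validates only the format, not whether the service accepts the credentials.

```python
import base64
import binascii


def check_basic_auth_header(header: str) -> str:
    """Return the username encoded in a Basic header, or raise ValueError."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic" or not token:
        raise ValueError("header must look like 'Basic <base64>'")
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError) as exc:
        raise ValueError("token is not valid Base64-encoded UTF-8") from exc
    username, sep, _password = decoded.partition(":")
    if not sep or not username:
        raise ValueError("decoded token must have the form username:password")
    return username
```

If the function raises, the header is malformed; if it returns an unexpected username, the wrong credentials were encoded.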
