External Object (Catalog, Schema, Table) Usage Guide
1. Overview
The Lakehouse architecture provides powerful external object federation capabilities, allowing users to access and analyze data stored across multiple heterogeneous data sources without moving data. This guide covers the use cases, configuration methods, and best practices for external Catalogs, external Schemas, and external tables.
1.1 External Object Hierarchy
- External Catalog: Top-level container that maps to an external data system
- External Schema: Intermediate container, similar to a database
- External Table: Bottom-level object that directly accesses data files from external data sources
2. Use Cases
2.1 External Catalog Use Cases
1. Unified Multi-Source Management
- Scenario: An enterprise has multiple data platforms (Hive, Databricks, etc.)
- Benefit: Access all data sources directly in Lakehouse without data migration
- Application: Data governance, unified metadata management
2. Cross-Platform Federation Queries
- Scenario: Need to analyze data stored in different systems simultaneously
- Benefit: Real-time queries across multiple data sources without ETL
- Application: Cross-system reporting, comprehensive analysis
3. Unified Lakehouse Architecture
- Scenario: Building a unified data analytics platform
- Benefit: Combines the flexibility of a data lake with the performance of a data warehouse
- Application: Full-chain analysis from historical to real-time data
4. Incremental Data Migration
- Scenario: Migrating data from legacy systems to new systems in phases
- Benefit: Maintains business continuity during migration
- Application: System upgrades, architecture transformation
2.2 External Schema Use Cases
1. Hive Metadata Integration
- Scenario: Connecting to an existing Hive Metastore Service (HMS)
- Benefit: Reuse existing metadata without redefining table schemas
- Application: Big data platform integration
2. Schema-Level Access Control
- Scenario: Assigning permissions at the Schema level
- Benefit: Simplifies permission management and improves security
- Application: Multi-department data sharing
3. External Database Mapping
- Scenario: Importing an external database as a whole into Lakehouse
- Benefit: Preserves the original data organization structure
- Application: Database migration, cross-database analysis
2.3 External Table Use Cases
1. Direct Object Storage Query
- Scenario: Analyzing data files stored in object storage (S3, OSS, COS, etc.)
- Benefit: Avoids data copying and saves storage space
- Application: Log analysis, large file processing
2. Streaming Data Ingestion
- Scenario: Connecting to message queue systems such as Kafka
- Benefit: Real-time data querying and processing
- Application: Real-time monitoring, event processing
3. Data Lake Format Support
- Scenario: Accessing open-source data lake formats such as Delta and Hudi
- Benefit: Leverages the open-source ecosystem and avoids data silos
- Application: Data lake construction, open-source compatibility
4. Hot-Cold Data Tiering
- Scenario: Storing cold data in lower-cost external storage
- Benefit: Optimizes storage costs and query performance
- Application: Data archiving, cost optimization
3. Configuration Guide
3.1 External Catalog Configuration
Create a Catalog Connection
Create an External Catalog
3.2 External Schema Configuration
3.3 External Table Configuration
4. Query Usage
4.1 Querying Tables in an External Catalog
4.2 Querying Tables in an External Schema
4.3 Querying External Tables
5. Performance Optimization Best Practices
5.1 Data Import Strategy
Since data in external tables is stored outside Lakehouse, query performance may be lower than internal tables. For frequently queried data, consider importing it into an internal table:
5.2 Partitioning and Filtering
Use partition information and filter conditions to reduce data scan volume:
6. Management and Monitoring
6.1 Viewing External Objects
7. Typical Use Cases
7.1 Unified Query Across Data Lake and Data Warehouse
Scenario: An enterprise has both a Hive data lake and a Databricks data warehouse and needs cross-platform analysis.
Solution:
- Create external Catalogs connected to both Hive and Databricks
- Use federation queries to join data from both platforms
- Build a unified view to provide a consistent data access layer
7.2 Historical Data Archiving and Querying
Scenario: Archive historical data to object storage while still needing occasional queries.
Solution:
- Store historical data in Delta or Parquet format in object storage
- Create external tables mapped to the archived data
- Query on demand without occupying primary storage space
7.3 Integrating Real-Time and Batch Data
Scenario: Need to analyze both real-time data from Kafka and historical data from the data warehouse simultaneously.
Solution:
- Create a Kafka external table to handle real-time data
- Use federation queries to join real-time data with historical data
- Build real-time dashboards to display integrated analysis results
8. Troubleshooting
8.1 Connection Issues
- Symptom: Unable to connect to an external data source
- Solution:
- Check network connectivity and firewall settings
- Verify that connection credentials are valid
- Confirm that the external service is available
8.2 Performance Issues
- Symptom: External table queries are slow
- Solution:
- Use partition filtering to reduce data scan volume
- Consider importing frequently queried data into an internal table
9. Summary
The external object feature in Lakehouse provides powerful data federation capabilities, enabling enterprises to integrate multiple heterogeneous data sources without moving data. By leveraging external Catalogs, external Schemas, and external tables appropriately, you can build a unified data analytics platform, achieve a Lakehouse architecture, and maximize data value.
Related Documentation
- Lakehouse On-Site Acceleration Solution Implementation Guide — Rapid POC validation, replace Spark/Hive and Presto/Trino without moving data
- CREATE EXTERNAL SCHEMA — Complete External Schema syntax reference
- External Catalog Federation Queries — External Catalog usage guide
