Overview
In a data lake architecture, a Hive Catalog links the data lake to an external metadata store such as Hive Metastore. Creating a Hive Catalog gives users unified management of, and access to, that metadata, so they can read data stored in external systems directly. Apache Hive is a core part of the data warehouse ecosystem: it serves both as a SQL engine for big data analysis and ETL and as a data management platform for discovering, defining, and evolving data. Lakehouse supports both reading and writing Hive data.
Usage Restrictions
- Before use, make sure the Lakehouse can reach the Hive cluster over the network.
- Currently, Singdata Lakehouse's external catalog feature supports the following external data sources:
  - Hive on OSS (Alibaba Cloud Object Storage Service)
  - Hive on COS (Tencent Cloud Object Storage)
  - Hive on S3 (Amazon Simple Storage Service)
  - Hive on GCS (Google Cloud Storage)
- Both reading and writing are supported. Writes support the Parquet, ORC, and text file formats.
Create External Catalog
Steps to Create a Hive Catalog
- Create Storage Connection: Create a storage connection to access the object storage service.
- Create Catalog Connection: Use the storage connection information and Hive Metastore address to create a Catalog Connection.
- Create External Catalog: Use the Catalog Connection to create an external Catalog to access external data in the data lake.
Create Storage Connection
To create a storage connection, see the document Create STORAGE CONNECTION.
Create Catalog Connection
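As described in the steps above, a Catalog Connection combines a storage connection with the Hive Metastore address. The statement below is a minimal sketch of what that might look like. The exact syntax and option names are an assumption and should be checked against the Lakehouse SQL reference; `catalog_api_connection` and `hive_oss_connection` are hypothetical names, and the metastore host is a placeholder (9083 is the default Hive Metastore Thrift port).

```sql
-- Assumed syntax: verify against the Lakehouse SQL reference before use.
-- catalog_api_connection and hive_oss_connection are hypothetical names.
-- Replace <metastore-host> with the address of your Hive Metastore.
CREATE CATALOG CONNECTION IF NOT EXISTS catalog_api_connection
    TYPE hms
    hive_metastore_uris = '<metastore-host>:9083'
    storage_connection = 'hive_oss_connection';
```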
Create External Catalog
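An external Catalog is created on top of a Catalog Connection. The following is a sketch under the assumption that the statement takes a catalog name and a connection name; confirm the syntax in the Lakehouse SQL reference. `test_external_catalog` is the catalog name used later on this page, and `catalog_api_connection` is a hypothetical connection name.

```sql
-- Assumed syntax: verify against the Lakehouse SQL reference before use.
-- catalog_api_connection is a hypothetical Catalog Connection name.
CREATE EXTERNAL CATALOG test_external_catalog
CONNECTION catalog_api_connection;
```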
Using Catalog
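Once the external Catalog exists, Hive tables can be addressed with a three-part name of the form catalog.schema.table. The query below is an illustrative sketch that reuses the table name mentioned later on this page; the `LIMIT` is only there to keep the result small.

```sql
-- Read a Hive table through the external catalog.
-- Name format: catalog.schema.table
SELECT *
FROM test_external_catalog.my_external_test.test
LIMIT 10;
```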
Use Hive Tables and Lakehouse Tables for Join Queries
A join query can reference both kinds of table: test_external_catalog.my_external_test.test is a table in Hive, and public.test is an internal table in Lakehouse.
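A join across the two tables named above might look like the following sketch. The column names `id`, `name`, and `amount` are hypothetical, since the table schemas are not given here; substitute the actual join key and columns.

```sql
-- Sketch only: the columns id, name, and amount are hypothetical.
SELECT h.id, h.name, l.amount
FROM test_external_catalog.my_external_test.test AS h  -- Hive table via the external catalog
JOIN public.test AS l                                   -- internal Lakehouse table
  ON h.id = l.id;
```

The Hive table is referenced by its full catalog.schema.table name, while the internal table only needs its schema and table name.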