Usage Workflow: External Function
Document Objective:
Through this usage workflow, you can achieve:
- Invoke a JAVA NLP offline model (see website) to parse the sentiment of strings in Singdata Lakehouse tables
- Invoke Alibaba Cloud Visual Intelligence Open Platform services (see website) to parse image data pointed to by URLs in Singdata Lakehouse tables
(This best practice uses an Alibaba Cloud-based Singdata Lakehouse environment.)
Operation Steps:
Step0: Preparation (Authorization Operations)
The goal of this step is to allow the Singdata Lakehouse cluster to access the customer's Function Compute (FC) and Object Storage Service (OSS) on Alibaba Cloud. To achieve this, you need to create a role so that Singdata Lakehouse can assume this role to access FC and OSS services on Alibaba Cloud.
1. Alibaba Cloud Console: Create a permission policy (e.g., CzUdfOssAccess) in the Alibaba Cloud Resource Access Management (RAM) console:
- Enter the Alibaba Cloud RAM console
- In the left navigation bar: Permissions -> Policies, on the Policies page, select Create Policy
- On the Create Policy page, select the Script Editor tab (replace the bucket name in the brackets [] below).
2. Alibaba Cloud Console: Create a role in Alibaba Cloud RAM (e.g., CzUDFRole):
- In the RAM console left navigation bar: Identities -> Roles, click Create Role
- On the Create Role page, select the type as Alibaba Cloud Account, fill in a custom Role Name (e.g., CzUDFRole) in the role configuration, choose Other Cloud Account under Select Trusted Cloud Account, and enter: 1384322691904283 (the Singdata Lakehouse Shanghai main cloud account), click Finish.
- Edit the AliyunFCFullAccess Permission Policy, adding the "acs:Service": "fc.aliyuncs.com" section below.
- After creation, click Grant Permission to Role:
- Under System Policies, grant the AliyunFCFullAccess policy to the role CzUDFRole.
- Under Custom Policies, grant the newly created policy (CzUdfOssAccess) to this role.
3. Obtain the RoleARN information for the role CzUDFRole from its details page:
- Modify CzUDFRole's Trust Policy:
Scenario 1: Invoking the JAVA NLP Offline Model:
1. Write Code
- Write a UDF based on the Hive UDF API. Below is sample code for implementing case conversion:
- Compile the code to generate a Jar package and other dependency files, then package them into a zip archive.
2. Upload Function Package to the Specified Path
For example: oss://hz-oss-lakehouse/functions/sentiment/UDF_code/SentimentAnalysis.zip
Main function class: com.clickzetta.nlp.GenericUDFSentiment
There are two ways to upload files to the specified path:
- Upload directly via an OSS client
- In the Lakehouse JDBC client (the Lakehouse Web UI does not support file uploads via the PUT command), upload the package to a Volume object using the PUT command, and reference the volume path in the function creation DDL. For example:
You can also specify an internal volume. Although you can use an internal volume, the code_bucket parameter in the API CONNECTION creation must be filled with an external address.
- User Volume address format:
volume:user://~/upper.jar-
userindicates using the User Volume protocol. -
~represents the current user and is a fixed value. -
upper.jarrepresents the target file name.
-
- Table Volume address format:
volume:table://table_name/upper.jartableindicates using the Table Volume protocol.table_namerepresents the table name and should be filled in based on the actual situation.upper.jarrepresents the target file name.
3. Create a Connection
Parameter Explanation:
-
api_connection: Creates an API-type Connection for invoking third-party service interfaces;
-
type: The connection type is cloud function: cloud_function, where the specific properties are:
- provider: The cloud function provider, e.g., aliyun
- region: The region where the cloud function is located, e.g., 'cn-shanghai'
- role_arn: The role assumed when creating the cloud function, e.g., acs:ram::12228000000000000:role/czudfrole
- code_bucket: The object storage bucket name where the cloud function program files are located
4. Create External Function in Lakehouse
Parameter Explanation:
- as: Followed by the main class name of the Java function
- using: Only supports compiled Java programs. The parameter archive indicates the package is a zip-format file; jar indicates a Java Jar package file. You can directly reference the file's OSS path; if the file has been uploaded to a Volume object via the PUT command, you can also directly reference the function file via the Volume path, for example:
USING ARCHIVE 'volume://fc_volume/udfs/SentimentAnalysis.zip' - connection: Represents the connection object used in the program, e.g., udf_sentiment_bj; with the following property:
- remote.udf.api: For Java UDF, fill in java8.hive2.v0
5. Execute Semantic Sentiment Analysis:
Construct Test Data:
Execute Semantic Analysis:
Scenario 2: Python UDF: Invoking Third-Party Vision Processing Platform API for Image Recognition
1. Code file is video_contents.py:
2. Reuse the connection from Scenario 1
3. Create a New External Function
Parameter Explanation:
- After AS, specify the Python module name + main class name. For example, if the main program file is video_contents.py and the main class name is image_to_text, the parameter after AS is
'video_contents.image_to_text' - using archive / file: Python files must be packaged into a zip file; single-file scripts are also supported, specified using the file parameter
-
connection: Represents the connection object used in the program, e.g., udf_sentiment_bj; with the following property:
- remote.udf.api: For Python language functions, fill in python3.mc.v0
4. Create Test Data for Verification
Import the following 4 images into OSS and generate public URLs stored in the Lakehouse table. You can directly construct the test table using the following SQL:
Execute the query:
