Storage Connection + API Connection + External Function: Combined Practice

External Functions can look daunting at first — you need to configure an OSS Bucket, a serverless function runtime, a RAM role, write Python code, build a zip, and write DDL... three separate objects, many concepts, many steps, easy to get discouraged.

Once you get everything running, you'll find that the complexity is mostly concentrated in one-time environment setup. After that initial setup, the cost of adding new functions is extremely low: write Python logic → package → one DDL, done in three minutes. This guide walks you through four progressive scenarios to cover the full workflow across Alibaba Cloud, Tencent Cloud, and AWS.

By the end you'll realize: External Functions are not complex. You just configure "where to store code" and "where to run code" once, and every new function after that is just code + one SQL statement.

Full code on GitHub: clickzetta_external_function


The Iron Triangle: How the Three Objects Work Together

An External Function is not just one object — it is the result of three objects working together. Understanding their roles makes the entire mechanism clear:

ObjectAnalogyWhat it doesHow often configured
Storage ConnectionParking lotAuthenticates object storage (OSS/COS/S3), allows Lakehouse to read and write code packagesOnce per Schema
API ConnectionWorkshopAuthenticates cloud function runtime (FC/SCF/Lambda), defines where code runsOnce per Region
External VolumeShelfMounts the object storage Bucket to the Schema, enabling the PUT command to upload filesOnce per Bucket
CREATE EXTERNAL FUNCTIONRegistryMaps a function name → entry class → zip packageOnce per function

The first three are one-time setups — once you configure Storage Connection + API Connection + External Volume, all functions share them. Adding a new function afterwards only requires one CREATE EXTERNAL FUNCTION statement.


Prerequisites (One-time, Shared Across All Four Scenarios)

All four scenarios share the same cloud environment configuration. This step is done only once, and all four scenarios can use it directly.

Step 1: Choose your cloud, configure config.json

Confirm which cloud your Lakehouse is on. The external function runtime (FC/SCF/Lambda) must be on the same cloud and in the same region:

cz-cli profile list # service column: # alicloud.api.clickzetta.com → Alibaba Cloud # tencentcloud.api.clickzetta.com → Tencent Cloud # aws.api.clickzetta.com → AWS

git clone https://github.com/clickzetta/clickzetta_external_function.git cd clickzetta_external_function cp config.example.json config.json

Open config.json and change only the platform field:

"platform": "aliyun" // or "tencent" or "aws"

Then follow SETUP.md to complete the cloud-specific environment setup (OSS/COS/S3 Bucket, FC/SCF/Lambda, RAM/CAM/IAM role, Bailian API Key) and fill in the details in config.json.

Step 2: Install cz-cli and verify

cz-cli profile use <your-profile> cz-cli sql "SELECT current_schema()" # Fill the output schema name into config.json → schema field

Step 3: Universal steps for all four scenarios

The execution flow for all four scenarios is identical:

Fill config.json → check (validate config) → package (build code) → render (generate SQL) → deploy (execute deployment) → verify (call and verify)

Corresponding commands:

python ../1-check-config.py # ① Validate configuration python 2-package.py # ② Package code (add --deps for AI functions) python ../3-render-sql.py # ③ Replace placeholders, generate SQL cz-cli sql -f dist/4-deploy_generated.sql --write # ④ Deploy


Scenario 1: Python External Function Quick Start

Deploy

cd python_quickstart python ../1-check-config.py python 2-package.py python ../3-render-sql.py cz-cli sql -f dist/4-deploy_generated.sql --write

What 4-deploy_generated.sql does:

-- 1. Storage Connection (OSS authentication) CREATE STORAGE CONNECTION IF NOT EXISTS oss_sh_conn TYPE OSS access_id = '<your-id>' access_key = '<your-key>' ENDPOINT = 'oss-cn-shanghai.aliyuncs.com'; -- 2. API Connection (Function Compute FC authentication) CREATE API CONNECTION IF NOT EXISTS shanghai_func_conn TYPE CLOUD_FUNCTION PROVIDER = 'aliyun' REGION = 'cn-shanghai' ROLE_ARN = '<your-role-arn>' CODE_BUCKET = '<your-bucket>'; -- 3. External Volume (mount Bucket) CREATE EXTERNAL VOLUME IF NOT EXISTS external_functions_prod LOCATION 'oss://<bucket>/' USING CONNECTION oss_sh_conn; -- 4. Upload zip PUT '<project>/dist/my_upper.zip' TO VOLUME external_functions_prod FILE 'my_upper.zip'; -- 5. Register function CREATE EXTERNAL FUNCTION IF NOT EXISTS <schema>.my_upper AS 'my_upper.my_upper' USING ARCHIVE 'volume://external_functions_prod/my_upper.zip' CONNECTION shanghai_func_conn WITH PROPERTIES ('remote.udf.api'='python3.mc.v0','remote.udf.protocol'='http.arrow.v0'); -- 6. Verify SELECT <schema>.my_upper('hello'); -- returns HELLO

Function Source Code

src/my_upper.py — one class, one evaluate method:

@annotate("*->string") class my_upper(object): def evaluate(self, s): return s.upper() if s else s

Local Testing

FC environments have no stdout and no stack traces. Always test locally before each deployment:

python3 -c " import sys; sys.path.insert(0, 'src') from my_upper import my_upper print(my_upper().evaluate('hello')) # HELLO "

Key Takeaways

  • remote.udf.api='python3.mc.v0' specifies the Python 3.10 runtime for FC
  • Calling the function requires the schema prefix: SELECT <schema>.my_upper('hello') — omitting it results in function not found
  • The first call may take 5-10 seconds (FC cold start); subsequent calls are normal

Scenario 2: Python ML Functions + Third-Party Dependency Packaging

Function List

FunctionLibraries usedPurpose
pii_maskrePhone/email/ID masking
feature_normalizenumpy + sklearnNumeric column normalization (minmax/zscore)
anomaly_detectnumpy + sklearnIsolation Forest anomaly detection
sentiment_scorejiebaChinese sentiment scoring (0-1)
tfidf_keywordssklearnTF-IDF keyword extraction

The Core Problem: FC Runs Linux, macOS Packages Won't Work

The FC runtime is Linux x86_64 + Python 3.10. Running pip install scikit-learn on macOS produces .dylib files that cannot be loaded on FC.

Solution: use two separate requirements files with two different install methods.

FileContentsInstall method
requirements.txtPackages with C extensions (scikit-learn, numpy)pip install --platform manylinux2014_x86_64 --only-binary :all:
requirements_pure.txtPure Python packages (jieba)Normal pip install

Why can't they be in the same file? Putting jieba (pure Python) into requirements.txt with --only-binary :all: causes pip to throw No matching distribution found — pure Python packages don't have binary wheels.

Deploy

cd python_advanced python 2-package.py # dual-mode packaging (~100 MB) python ../1-check-config.py python ../3-render-sql.py cz-cli sql -f dist/4-deploy_generated.sql --write

Local Testing

pip install -r requirements.txt -r requirements_pure.txt python3 -c " import sys; sys.path.insert(0, 'src') from ml_toolkit import pii_mask, feature_normalize, sentiment_score print(pii_mask().evaluate('My phone is 13812345678')) print(feature_normalize().evaluate('[1,2,3,4,5]', 'minmax')) print(sentiment_score().evaluate('The product quality is excellent')) "

Using in SQL

SELECT <schema>.pii_mask('Phone 13812345678, email alice@example.com'); SELECT <schema>.feature_normalize('[10,20,30,40,50]', 'minmax'); SELECT <schema>.anomaly_detect('[1,2,3,4,100]'); SELECT <schema>.sentiment_score('The product quality is excellent, shipping was fast, very satisfied!'); SELECT <schema>.tfidf_keywords('["AI and machine learning are the future","Deep learning achieves breakthroughs in image recognition"]', 3);

Key Takeaways

  • C extension packages require Linux binary wheels — macOS .dylib files cannot run on FC
  • Pure Python packages must be separated — they cannot be mixed with binary packages in the same requirements file
  • Zip size determines cold start time: scikit-learn + numpy is about 100MB; the first call takes 5-10 seconds

Scenario 3: 30 AI SQL Functions — Package Once, Call Anywhere

Design Highlights

One zip, 30 functions: All 30 functions share a single clickzetta_ai_functions_full.zip. The DDL only differs in the class name AS 'ai_functions_complete.ai_xxx'. Packaging and uploading a separate zip for each function would mean 30 zip files and exploding management complexity.

API Key as a SQL parameter: The Bailian API Key is not hardcoded in the source code or bundled into the zip — it is passed as a function parameter:

SELECT <schema>.ai_text_summarize('Artificial intelligence is changing the world.', '<your-api-key>');

If the API Key were hardcoded in the zip, a zip leak would mean a Key leak. Passing it as a parameter lets callers manage their own keys.

Function Categories

CategoryCountTypical functions
Text processing8Summarization, translation, sentiment analysis, entity extraction, keywords, classification, cleaning, tagging
Vector processing5Embedding, similarity, clustering preparation, similar search, document search
Multimodal8Image description, OCR, image analysis, image embedding, image similarity, video summarization, chart analysis, document parsing
Business scenarios9Customer intent, sales scoring, review analysis, risk detection, contract extraction, resume parsing, customer segmentation, product description, industry classification

Deploy

cd python_ai_function pip install -r requirements.txt # for local testing only python 2-package.py --deps # package code + dashscope Linux dependencies python 1-check-config.py # standalone validation (different config structure) python 3-render-sql.py cz-cli sql -f dist/4-deploy_generated.sql --write

Local Testing

FC has no logs and no stdout. Test locally before deploying to save a lot of time:

python3 -c " import sys; sys.path.insert(0, 'src') from ai_functions_complete import ai_text_summarize, ai_text_translate print(ai_text_summarize().evaluate('Hello world', '<your-api-key>')) print(ai_text_translate().evaluate('Hello', 'Chinese', '<your-api-key>')) "

Using in SQL

-- Summarization SELECT <schema>.ai_text_summarize('Artificial intelligence is changing the world...', '<key>'); -- Translation SELECT <schema>.ai_text_translate('Hello, how are you?', 'Chinese', '<key>'); -- Sentiment analysis SELECT <schema>.ai_text_sentiment_analyze('The product quality is excellent!', '<key>'); -- Embedding + similarity search SELECT <schema>.ai_semantic_similarity('Apples are tasty', 'Apples are a healthy fruit', '<key>'); -- Image description SELECT <schema>.ai_image_describe('<image-url>', '<key>'); -- Contract extraction SELECT <schema>.ai_contract_extract('<contract text>', '<key>');

Returns JSON; use JSON_EXTRACT to retrieve values:

SELECT JSON_EXTRACT( <schema>.ai_text_summarize('Artificial intelligence is changing the world...', '<key>'), '$.summary' );

Key Takeaways

  • 30 functions share one zip — adding a new function = 20 lines of prompt + one DDL
  • API Key is not hardcoded — passed as a SQL parameter, so a zip leak does not compromise security
  • config.json serves dual roles: embedded in the zip at build time (for runtime model config), and used to replace SQL placeholders at render time

Scenario 4: Java UDF/UDAF/UDTF

Quick Overview of the Three Types

TypeBase classInput → OutputDDL special propertyExample function
UDFGenericUDF1 row → 1 rowpii_mask PII masking
UDAFGenericUDAFResolver2Multiple rows → 1 rowAGGREGATORagg_stats SUM/AVG/MIN/MAX/COUNT
UDTFGenericUDTF1 row → N rowsTABLE_VALUEDlog_explode log row expansion

Differences from Python External Functions

PythonJava
RuntimePython 3.10Java 8 (Java 9+ not supported)
DDL propertypython3.mc.v0java8.hive2.v0
Function typesUDFUDF / UDAF / UDTF
PackagingzipMaven jar-with-dependencies → zip
Dependenciespip --platform manylinuxMaven scope=provided (Hive runtime included)

Deploy

cd java_udf python 2-package.py # Maven compile + zip packaging python ../1-check-config.py python ../3-render-sql.py cz-cli sql -f dist/4-deploy_generated.sql --write

UDF Example

SELECT <schema>.pii_mask('My phone is 13812345678, email alice@example.com');

UDAF Example

INSERT INTO <schema>.java_udf_test_scores VALUES (3.5), (4.2), (2.8), (5.0), (3.9); SELECT <schema>.agg_stats(val) FROM <schema>.java_udf_test_scores; -- → [sum, avg, min, max, count]

UDTF Example

UDTF requires the LATERAL syntax and cannot be used with SELECT func(x):

SELECT t.ts, t.event FROM ( SELECT '[2025-01-15 10:30:00] User login [2025-01-15 10:35:00] Query order' AS log ) s, LATERAL <schema>.log_explode(s.log) t; -- → each log line is expanded into one row

Key Takeaways

  • Java version must be 8 — FC only has a Java 8 runtime
  • Hive dependency scope=provided: FC includes hive-exec.jar; do not bundle it in the zip to avoid conflicts
  • UDAF DDL must include AGGREGATOR, UDTF must include TABLE_VALUED — omitting them causes creation to succeed but calls to fail
  • UDTF must use LATERAL syntax; SELECT func(x) directly is not supported

Function Quick Reference

Python (quickstart + advanced + AI function)

FunctionPurposeSource
my_upperString to uppercasequickstart
pii_maskPhone/email/ID maskingadvanced
feature_normalizeNumeric column normalizationadvanced
anomaly_detectIsolation Forest anomaly detectionadvanced
sentiment_scoreChinese sentiment scoringadvanced
tfidf_keywordsTF-IDF keyword extractionadvanced
ai_text_summarizeText summarizationAI function
ai_text_translateText translationAI function
ai_text_sentiment_analyzeSentiment analysisAI function
ai_text_extract_entitiesEntity extractionAI function
ai_text_extract_keywordsKeyword extractionAI function
ai_text_classifyText classificationAI function
ai_text_clean_normalizeText cleaningAI function
ai_auto_tag_generateAuto taggingAI function
ai_text_to_embeddingText embeddingAI function
ai_semantic_similaritySemantic similarityAI function
ai_text_clustering_prepareClustering preparationAI function
ai_find_similar_textSimilar text searchAI function
ai_document_searchDocument searchAI function
ai_image_describeImage descriptionAI function
ai_image_ocrImage OCRAI function
ai_image_analyzeImage analysisAI function
ai_image_to_embeddingImage embeddingAI function
ai_image_similarityImage similarityAI function
ai_video_summarizeVideo summarizationAI function
ai_chart_analyzeChart analysisAI function
ai_document_parseDocument parsingAI function
ai_customer_intent_analyzeCustomer intent analysisAI function
ai_sales_lead_scoreSales lead scoringAI function
ai_review_analyzeReview analysisAI function
ai_risk_text_detectRisk text detectionAI function
ai_contract_extractContract extractionAI function
ai_resume_parseResume parsingAI function
ai_customer_segmentCustomer segmentationAI function
ai_product_description_generateProduct description generationAI function
ai_industry_classificationIndustry classificationAI function

Java

FunctionTypePurpose
pii_maskUDFPII masking
agg_statsUDAFSUM/AVG/MIN/MAX/COUNT
log_explodeUDTFLog row expansion

Troubleshooting

ErrorCauseSolution
function not foundMissing schema prefixAdd <schema>. prefix when calling
HTTP_GENERAL_ERROR(640)RAM trust policy not configured / Bucket in a different regionCheck RAM role trust policy (must include 1384322691904283); confirm Bucket and FC are in the same region
AccessDeniedRAM role missing OSS permissionsAdd AliyunOSSFullAccess or a custom OSS policy
ImportError: No module named 'sklearn'Dependencies not packaged in zipRe-run python 2-package.py (advanced) / python 2-package.py --deps (AI function)
OSError: cannot open shared object filemacOS .dylib was packagedConfirm --platform manylinux2014_x86_64 was used
ClassNotFoundExceptionWrong Java class name or package nameCheck that the AS path matches the actual class name inside the jar
UDAF created successfully but call failsDDL missing AGGREGATORCheck WITH PROPERTIES ('remote.udf.category'='AGGREGATOR')
UDTF created successfully but not a table functionDDL missing TABLE_VALUEDCheck WITH PROPERTIES ('remote.udf.category'='TABLE_VALUED')
First call takes a long time to returnFC cold startWait 5-10 seconds; it has not hung
Changes to config.json not reflected after deploymentForgot to re-renderRe-run python ../3-render-sql.py