Singdata Lakehouse AI Functions Overview

AI Functions work like "an AI assistant built into SQL" — you write a SELECT, and the system automatically calls a large language model for each row of data. Results appear directly in the query result set, with no Python code, no external service setup, and no need to export data to external systems.

When to use AI Functions vs. other approaches:

ScenarioRecommended approach
Classify, extract, translate, or analyze sentiment on text in a table row by rowAI Functions (this page)
Custom complex logic or multi-step AI workflowsPython UDF + AI SDK
One-off calls or interactive Q&ACall the model API directly
Existing external AI service, just need to call it from SQLAPI Connection + AI Functions

The core mechanism: AI computation runs inside the SQL execution engine. Each row's model call is completed within the platform — no data leaves the system, ensuring both data security and batch processing efficiency.


AI ETL Pipeline Architecture

AI ETL Pipeline Architecture

AI Functions process various data sources (text, images, audio, structured data) through AI at the SQL engine layer and output structured results, writing directly back to the data warehouse or flowing to downstream applications — forming a complete AI ETL pipeline:

Data Sources → SQL + AI Functions → Structured Output → Data Warehouse / Vector Index / BI / Recommendation Systems


Quick Start

The examples below show typical usage patterns. Replace endpoint: with your actual configured Endpoint name and the table name with your actual business table.

-- Sentiment analysis on user reviews SELECT review_id, review_text, AI_SENTIMENT('endpoint:qwen3-max-preview', review_text) AS sentiment FROM user_reviews;

-- Extract structured fields from unstructured text SELECT order_id, AI_EXTRACT( 'endpoint:qwen3.5-plus', remark, JSON'{"product":"product name", "qty":"quantity", "issue":"issue description"}' ) AS extracted FROM orders;


Function Categories

Text Understanding and Generation

FunctionDescription
AI_COMPLETEGeneral-purpose LLM completion with custom prompts; suitable for complex reasoning, code generation, and more
AI_SUMMARIZEGenerate text summaries; supports max_words to control summary length
AI_TRANSLATEMulti-language translation with automatic source language detection; supports 20+ languages
AI_FIX_GRAMMARGrammar and spelling correction; supports Chinese, English, and mixed-language text

Text Analysis and Classification

FunctionDescription
AI_CLASSIFYClassify text or images into user-defined categories without writing prompts
AI_SENTIMENTSentiment analysis returning positive / negative / neutral / mixed
AI_EXTRACTExtract structured JSON fields from unstructured text or images
AI_MASKIdentify and mask PII sensitive information in text, replacing it with [MASKED]
FunctionDescription
AI_EMBEDDINGConvert text to high-dimensional vectors for semantic retrieval, recommendations, and more
AI_SIMILARITYCompute cosine similarity between two texts based on embeddings; returns a score in [0, 1]

Multimodal Processing

FunctionDescription
AI_TRANSCRIBETranscribe audio files in a Volume to text (ASR)
AI_CLASSIFYSupports image input for classifying image content
AI_EXTRACTSupports image input for extracting structured information from images
AI_COMPLETESupports image input for generating responses combining images and text prompts

Choose a Function by Use Case

Business scenarioRecommended function
Product/content classification, ticket routingAI_CLASSIFY
Contract/invoice/shipping label information extractionAI_EXTRACT
User review sentiment analysis, public opinion monitoringAI_SENTIMENT
News summarization, conversation summarizationAI_SUMMARIZE
Multi-language content translationAI_TRANSLATE
UGC content cleaning, text correctionAI_FIX_GRAMMAR
Data masking, compliance processingAI_MASK
Semantic search, similarity recommendationsAI_EMBEDDING + AI_SIMILARITY
Customer service recording transcription and analysisAI_TRANSCRIBE + AI_CLASSIFY / AI_EXTRACT
Image content recognition and structuringAI_CLASSIFY / AI_EXTRACT (image mode)
Complex reasoning, code generation, custom tasksAI_COMPLETE

Model Connection Methods

The first parameter of all AI Functions is model, which supports two connection methods:

Method 1: API Gateway Endpoint (Recommended)

A platform administrator pre-configures model services in the API Gateway. Regular users reference them with the endpoint: prefix, without needing to know the underlying connection details.

'endpoint:qwen3-max-preview' -- general text tasks 'endpoint:qwen3.5-plus' -- classification / extraction tasks 'endpoint:text-embedding-v4' -- vector tasks 'endpoint:qwen3-asr-flash' -- speech transcription

Method 2: API Connection Object

Users create their own connection objects via CREATE API CONNECTION, suitable for custom service addresses, private deployment models, and similar scenarios.

CREATE API CONNECTION conn_bailian TYPE ai_function PROVIDER = 'bailian' BASE_URL = 'https://dashscope.aliyuncs.com/api/v1' API_KEY = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx'; -- Reference format: <connection_name>:<model_name> SELECT AI_SENTIMENT('conn_bailian:qwen3.5-plus', 'This product is great!');


Typical Pipeline Examples

Customer Service Recording Analysis Pipeline

-- Audio transcription → sentiment analysis → classification routing SELECT call_id, transcript, AI_SENTIMENT('endpoint:qwen3-max-preview', transcript) AS sentiment, AI_CLASSIFY('endpoint:qwen3.5-plus', transcript, ARRAY('complaint', 'inquiry', 'suggestion', 'praise')) AS category FROM ( SELECT call_id, AI_TRANSCRIBE('endpoint:qwen3-asr-flash', GET_PRESIGNED_URL(USER VOLUME, audio_path, 3600)) AS transcript FROM call_records );

Document Structuring and Ingestion Pipeline

-- Mask → extract → write to structured table INSERT INTO structured_contracts (id, party_a, party_b, amount, masked_text) SELECT id, JSON_EXTRACT_STRING(info, '$.party_a') AS party_a, JSON_EXTRACT_STRING(info, '$.party_b') AS party_b, JSON_EXTRACT_STRING(info, '$.amount') AS amount, AI_MASK('endpoint:qwen3-max-preview', content, ARRAY('name', 'ID number', 'bank account')) AS masked_text FROM ( SELECT id, content, AI_EXTRACT('endpoint:qwen3.5-plus', content, JSON'{"party_a":"party A name","party_b":"party B name","amount":"contract amount"}') AS info FROM raw_contracts WHERE content IS NOT NULL );

-- Step 1: Pre-generate and store vectors INSERT INTO product_vectors (product_id, embedding) SELECT product_id, AI_EMBEDDING('cz_bailian:text-embedding-v4', description, JSON '{"input": "document"}') FROM products WHERE description IS NOT NULL AND LENGTH(description) > 0; -- Step 2: Semantic search (sorted by similarity) SELECT p.product_name, AI_SIMILARITY('cz_bailian:text-embedding-v4', 'lightweight laptop', p.description) AS score FROM products p ORDER BY score DESC LIMIT 10;


Common options Parameter

AI_CLASSIFY, AI_EXTRACT, AI_SIMILARITY, AI_TRANSCRIBE, and other functions support an optional options JSON parameter:

KeyTypeDescription
response.timeoutSTRINGPer-request timeout in seconds, e.g. "300"
task.concurrencySTRINGBatch processing concurrency, e.g. "12"
model.paramsJSONParameters passed through to the model, e.g. {"enable_thinking": false}

SELECT AI_CLASSIFY( 'endpoint:qwen3.5-plus', product_desc, ARRAY('electronics', 'clothing', 'food'), JSON'{"model.params":{"enable_thinking":false},"response.timeout":"300","task.concurrency":"12"}' ) AS category FROM products;


Usage Notes

  • Model selection: Use qwen3.5-plus or qwen3-max-preview for text understanding tasks; use a dedicated embedding model (e.g. text-embedding-v4) for vector tasks; use an ASR model (e.g. qwen3-asr-flash) for speech transcription.
  • Thinking mode: Some models (e.g. the qwen3 series) enable thinking mode by default, which increases latency and token consumption. For batch processing, disable it via model.params: {"enable_thinking": false}.
  • NULL behavior: When a model cannot process input (e.g. empty content, exceeds length limit), most functions return NULL without affecting other rows. See the Error Behavior section of each function's documentation for specifics.
  • Image input: Images must first be uploaded to a Volume, then accessed via GET_PRESIGNED_URL() to generate a pre-signed URL, passed to the function using (url AS image) syntax.
  • Prefer specialized functions: When a task can be completed with a specialized function (e.g. AI_TRANSLATE, AI_SENTIMENT), use it — these functions have built-in prompts optimized for specific tasks, producing more stable results at lower cost.

Prerequisites

  1. A model Endpoint has been configured in the API Gateway, or a connection object has been created via CREATE API CONNECTION.
  2. The current user has permission to call the relevant Endpoint or Connection.
  3. For image and audio processing, files must be uploaded to a Volume and accessed via GET_PRESIGNED_URL().