AI Functions

Singdata Lakehouse AI Functions are a set of AI-enhanced functions built directly into the SQL engine. Without leaving the data platform or writing Python scripts, you can call large language models (LLMs) and embedding models directly in standard SQL queries to intelligently process text, images, audio, and more.

For a detailed introduction, see AI Functions Overview.


Model Access Methods

The first parameter of every AI Function is model, which supports two access methods:

Method 1: API Gateway Endpoint (recommended)

A platform administrator pre-configures model services in the API Gateway, and you reference them using the endpoint: prefix:

'endpoint:qwen3-max-preview' 'endpoint:qwen3.5-plus' 'endpoint:text-embedding-v4' 'endpoint:qwen3-asr-flash'

Method 2: API Connection object

You create a connection object yourself using CREATE API CONNECTION, which is suitable for custom service addresses or private deployment scenarios:

CREATE API CONNECTION conn_bailian TYPE ai_function PROVIDER = 'bailian' BASE_URL = 'https://dashscope.aliyuncs.com/api/v1' API_KEY = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx'; -- Reference format: <connection_name>:<model_name> SELECT AI_SENTIMENT('conn_bailian:qwen3.5-plus', 'This product is great!');


Function Reference

Text Understanding and Generation

AI_COMPLETE

A general-purpose LLM completion function that supports custom prompts. Use it for complex reasoning, code generation, and other custom scenarios that specialized functions cannot cover.

ai_complete(<model>, <prompt> [, json '{}']) -- Image mode ai_complete(<model>, (<prompt> AS prompt, <image_url> AS image) [, json '{}'])

Return value: STRING — the response text generated by the model.

Usage notes:

  • When a task can be handled by a specialized function (such as AI_TRANSLATE or AI_SENTIMENT), prefer the specialized function for more stable results and lower cost.
  • Image mode requires a multimodal-capable model (e.g., doubao-seed-2-0-pro-260215); passing an image to a text-only model will not raise an error but the image content will be ignored.
  • The qwen3 series models have thinking mode enabled by default. For batch processing, it is recommended to disable it via options: json '{"model.params":{"enable_thinking":false}}'
  • Returns NULL when prompt is NULL or an empty string.

Full documentation: AI_COMPLETE


AI_SUMMARIZE

Generates a concise summary of the input text. Supports controlling summary length via max_words.

AI_SUMMARIZE(model, content [, max_words])

Return value: STRING — a summary of the input text.

Usage notes:

  • max_words defaults to 50. It is a target value, not a hard limit; actual output may vary by ±20%.
  • When max_words=0, the original text is returned without calling the model.
  • A negative max_words value raises an error.
  • Returns NULL for NULL input; returns an empty string for empty string input.
  • The output language automatically follows the input language.

Full documentation: AI_SUMMARIZE


AI_TRANSLATE

Translates input text into a specified language. The source language is detected automatically, and 20+ languages are supported.

AI_TRANSLATE(model, content, to_lang)

Return value: STRING — the translated text.

Usage notes:

  • to_lang must be a valid ISO-639 language code (e.g., 'zh', 'en', 'ja'). Full language names are not supported.
  • Manual specification of the source language is not supported; it is always detected automatically.
  • If the input and target languages are the same, the original text is returned unchanged.
  • Can be combined with AI_SUMMARIZE (summarize first, then translate).

Full documentation: AI_TRANSLATE


AI_FIX_GRAMMAR

Automatically corrects grammar, spelling, and punctuation errors in input text. Supports Chinese, English, and mixed-language text.

AI_FIX_GRAMMAR(<model>, <content>)

Return value: STRING — the corrected text. If the input contains no grammar errors, the original text is returned unchanged.

Usage notes:

  • Returns NULL for NULL input; returns an empty string for empty string input.
  • Mixed Chinese-English text is intelligently unified to the dominant language.
  • The focus is on grammar correction; in rare cases the meaning may be slightly altered. Manual spot-checking is recommended for semantically sensitive scenarios.
  • It is recommended to use this function to clean text before calling AI_SENTIMENT or AI_SUMMARIZE.

Full documentation: AI_FIX_GRAMMAR


Text Analysis and Classification

AI_CLASSIFY

Classifies text or images into user-defined categories without writing a prompt. Supports 29+ languages and cross-language classification.

AI_CLASSIFY(model, content, labels [, options])

Return value: STRING — the name of the best-matching category (a plain string, not JSON).

Usage notes:

  • labels is an ARRAY(STRING). 2–10 categories are recommended, and category names should have descriptive semantics.
  • For image input, use the GET_PRESIGNED_URL(USER VOLUME, path, expiry) AS image syntax.
  • For batch classification, setting enable_thinking:false is recommended to speed up responses.
  • Returns NULL for NULL input; returns an empty string "" for empty string input.
  • Labels can all be in English and will still classify correctly even when the input is in another language.

Full documentation: AI_CLASSIFY


AI_SENTIMENT

Performs sentiment analysis on input text. Supports multiple languages including Chinese, English, and Japanese.

AI_SENTIMENT(<model>, <text>)

Return value: STRING — one of the following values:

ValueMeaning
positivePositive sentiment
negativeNegative sentiment
neutralNo clear sentiment
mixedBoth positive and negative sentiment present

Usage notes:

  • Returns NULL for NULL or empty string input.
  • Can recognize complex semantics such as sarcasm/irony, double negatives, and emoji sentiment.
  • mixed means the text contains both positive and negative evaluations; neutral means the text carries no sentiment.
  • Results are non-deterministic; the same input may produce slightly different results across multiple executions.

Full documentation: AI_SENTIMENT


AI_EXTRACT

Extracts structured JSON data from unstructured text or images according to specified fields, without writing a prompt.

AI_EXTRACT(model, content, schema [, options])

Return value: STRING (JSON format) — contains all fields defined in schema and their extracted values. All field values are of string type.

Usage notes:

  • schema format is JSON'{"key":"field description", ...}'. Values support simple labels or question-style descriptions.
  • No more than 10 fields are recommended; up to approximately 20 are supported.
  • If a field cannot be found in the text, its value is an empty string "".
  • Returns NULL for NULL or empty string input.
  • The return value is a JSON string and can be further parsed with json_extract_string.

Full documentation: AI_EXTRACT


AI_MASK

Identifies and masks PII sensitive information in text, replacing it with a [MASKED] placeholder. Labels are user-defined.

AI_MASK(<model>, <content>, <labels>)

Return value: STRING — the masked text. If the text does not contain information matching the specified labels, the original text is returned unchanged.

Usage notes:

  • labels is an ARRAY(STRING). The count must be between 1 and 20; an empty array is not allowed.
  • More specific labels yield more accurate results (e.g., "national ID number" is more precise than "document").
  • Matching the label language to the text language produces better results.
  • All masked information is uniformly replaced with [MASKED], regardless of type.
  • AI masking is LLM-based and may have missed or false detections. Manual spot-checking is recommended for compliance scenarios.

Full documentation: AI_MASK


AI_EMBEDDING

Converts text into high-dimensional embedding vectors for downstream tasks such as semantic retrieval, recommendations, and clustering.

AI_EMBEDDING(<model>, <input> [, <model_parameters>])

Return value: ARRAY<FLOAT> — the embedding vector of the input text.

Usage notes:

  • An embedding model (e.g., text-embedding-v4) must be used; chat completion models cannot be used.
  • Results are deterministic; the same input always returns the same vector.
  • In retrieval scenarios, use "input":"document" when indexing documents and "input":"query" for user queries.
  • Returns NULL for NULL input; raises an error for empty string input — filter before calling.
  • Input length limit is 8,192 tokens; a maximum of 10 items can be processed per call.

Full documentation: AI_EMBEDDING


AI_SIMILARITY

Calculates the cosine similarity between two pieces of text based on an embedding model, returning a score in [0, 1].

AI_SIMILARITY(<model>, <text1>, <text2> [, <options>])

Return value: FLOAT — cosine similarity, which in practice typically falls in the [0, 1] range.

RangeMeaning
1.0Semantically identical
> 0.7Highly similar
0.3 ~ 0.7Somewhat related
< 0.3Largely unrelated

Usage notes:

  • An embedding model must be used; LLMs cannot be used.
  • Results are deterministic, and the function is symmetric (a,b and b,a return the same result).
  • Returns 0 when either parameter is NULL.
  • Supports cross-language similarity calculation (e.g., comparing Chinese and English semantics).
  • In batch CROSS JOIN scenarios, token consumption = rows² × average token count. Evaluate your quota in advance.

Full documentation: AI_SIMILARITY


Multimodal Processing

AI_TRANSCRIBE

Transcribes audio files in a Volume to plain text (ASR). Supports multiple languages including Chinese and English.

AI_TRANSCRIBE(<model>, <audio_url> [, <options>])

Return value: STRING — the plain-text transcription of the audio content, without timestamps or speaker information.

Usage notes:

  • audio_url must be a URL starting with http:// or https://, typically obtained via GET_PRESIGNED_URL().
  • Supports WAV, MP3, FLAC, and M4A formats. 16kHz mono WAV is recommended.
  • A presigned URL validity of 36,000 seconds (10 hours) is recommended to avoid URL expiration during batch processing.
  • Returns plain text that can be used directly as input to functions like AI_CLASSIFY or AI_EXTRACT.
  • Silent or near-silent files may produce a small amount of hallucinated text; downstream length filtering is recommended.

Full documentation: AI_TRANSCRIBE


Common options Parameter

Functions such as AI_CLASSIFY, AI_EXTRACT, AI_SIMILARITY, and AI_TRANSCRIBE support an optional options JSON parameter:

Parameter keyTypeDescription
response.timeoutSTRINGPer-request timeout in seconds, e.g., "300"
task.concurrencySTRINGConcurrency for batch processing, e.g., "12"
model.paramsJSONParameters passed through to the model, e.g., {"enable_thinking": false}

JSON'{"model.params":{"enable_thinking":false},"response.timeout":"300","task.concurrency":"12"}'


Prerequisites

  1. A model endpoint has been configured in the API Gateway, or a connection object has been created via CREATE API CONNECTION.
  2. The current user has permission to call the relevant endpoint or connection.
  3. For image and audio processing scenarios, files must be uploaded to a Volume and an access URL obtained via GET_PRESIGNED_URL().


Customizing AI Functions via External Functions

In addition to the built-in AI Functions, Singdata Lakehouse also supports developing custom AI functions through the External Function framework. This is suitable for scenarios that require connecting to private models, offline inference packages, or custom business logic.

External functions (also known as Remote Functions) are a special type of user-defined function (UDF) that allows you to define function logic in Python or Java, with the core computation offloaded to an external remote service (Alibaba Cloud Function Compute FC and Tencent Cloud SCF are supported). During execution, the following can be called:

  • Online services: Online services provided externally via API, such as large language model APIs and online AI API services provided by cloud platforms.
  • Offline features: Offline service packages that bundle specific function code, dependency libraries, models, and data files, such as image recognition models downloaded from Hugging Face.

Singdata Lakehouse stores the connection and access information for external function compute services in metadata by creating an API Connection. External functions call the external compute service via HTTP protocol, process data, and return results.

After receiving prior user authorization, the Lakehouse platform automatically deploys the function to the function compute service under the customer's account when the external function is created. When you use an external function in a SQL query, the external function handles the secure connection to the external compute service, data processing, and returns the query results.

Main process for creating external functions: Usage Process: External Function

Development guides: