AI_CLASSIFY

AI_CLASSIFY is an AI text/image classification function provided by Singdata Lakehouse. It automatically assigns input content to user-defined categories — no model training, no prompt writing. One line of SQL is all it takes.


Syntax

AI_CLASSIFY(model, content, labels [, options])

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | STRING | Yes | Model identifier; supports endpoint: and connection: sources |
| content | STRING or image reference | Yes | Text to classify, or GET_PRESIGNED_URL(...) AS image |
| labels | ARRAY | Yes | Category array: ARRAY('category1', 'category2', ...) |
| options | JSON literal | No | Optional parameters (timeout, concurrency, model params) |

Return value: STRING — the best-matching category name (plain string, not JSON).
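
Because the return value is a plain string rather than JSON, it can be grouped or filtered like any other column. The following is a minimal sketch, assuming a products table with a product_desc column (the same names used in the use cases in this document) and standard subquery and GROUP BY support. It is an illustration, not an official example.

```sql
-- Count rows per predicted category.
-- Assumption: products(product_desc) exists; adjust names to your schema.
SELECT
  category,
  COUNT(*) AS product_count
FROM (
  SELECT
    AI_CLASSIFY(
      'endpoint:qwen3.5-plus',
      product_desc,
      ARRAY('electronics', 'clothing', 'food')
    ) AS category
  FROM products
) t
GROUP BY category;
```

Classifying inside the subquery keeps the function call in one place, so the outer query only works with the resulting string.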


model Parameter

Method 1: API Gateway Endpoint (Recommended)

A platform administrator pre-configures model services in the API Gateway. Regular users reference them with the endpoint: prefix, without needing to know the underlying connection details.

'endpoint:<endpoint_name>'

-- Examples
'endpoint:qwen3.5-plus'
'endpoint:qwen3-max-preview'

Method 2: API Connection Object

Users create their own connection objects via CREATE API CONNECTION, suitable for custom service addresses, authentication keys, or private deployment models.

-- Create a connection object
CREATE API CONNECTION conn_bailian
  TYPE ai_function
  PROVIDER = 'bailian'
  BASE_URL = 'https://dashscope.aliyuncs.com/api/v1'
  API_KEY = 'sk-xxxxxxxxxxxxxxxxxxxxxxxx';

-- Reference with connection: prefix
SELECT AI_CLASSIFY(
  'conn_bailian:qwen3.5-plus',
  'iPhone',
  ARRAY('electronics', 'clothing', 'food')
);

CREATE API CONNECTION field descriptions:

| Field | Description |
| --- | --- |
| TYPE | Fixed as ai_function |
| PROVIDER | Model provider identifier, e.g. 'bailian', 'openai', 'anthropic' |
| BASE_URL | Base API URL of the model service |
| API_KEY | Authentication key for calling the service |

Quick Start

-- Text classification
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  'iPhone',
  ARRAY('electronics', 'clothing', 'food')
);
-- Returns: electronics


Use Cases

Case 1: Product Classification

SELECT
  product_name,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    product_desc,
    ARRAY('electronics', 'clothing', 'food')
  ) AS category
FROM products;

| product_name | category |
| --- | --- |
| iPhone | electronics |
| Dior dress | clothing |
| Oreo cookies | food |

Case 2: Image Classification

SELECT
  relative_path,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    (GET_PRESIGNED_URL(USER VOLUME, relative_path, 36000) AS image),
    ARRAY('electronics', 'menswear', 'womenswear', 'food', 'automotive')
  ) AS classification
FROM (SHOW USER VOLUME DIRECTORY SUBDIRECTORY 'images/products');
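
For a quick check on one file before scanning a whole directory, the same image-reference syntax can likely be used with a literal path. This is a sketch under assumptions: the path 'images/products/phone.jpg' is hypothetical, the 3600-second expiry is an arbitrary choice, and it assumes GET_PRESIGNED_URL accepts a string literal where the example above passes the relative_path column.

```sql
-- Classify a single image from the user volume.
-- Assumption: the file path below is a placeholder; replace it with a real file.
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  (GET_PRESIGNED_URL(USER VOLUME, 'images/products/phone.jpg', 3600) AS image),
  ARRAY('electronics', 'menswear', 'womenswear', 'food', 'automotive')
) AS classification;
```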

Case 3: News Classification

SELECT
  headline,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    headline,
    ARRAY('tech', 'sports', 'finance', 'entertainment')
  ) AS topic
FROM news_articles;

Case 4: Customer Support Ticket Routing

SELECT
  ticket_id,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    description,
    ARRAY('payment issue', 'shipping issue', 'product quality', 'account issue', 'feature request')
  ) AS department
FROM support_tickets;

Case 5: Batch Classification with options

SELECT
  product_name,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    product_desc,
    ARRAY('electronics', 'clothing', 'food'),
    JSON '{"model.params":{"enable_thinking":false},"response.timeout":"300","task.concurrency":"12"}'
  ) AS category
FROM products;


Multilingual Support

AI_CLASSIFY supports classification in 29+ languages; actual coverage depends on the model you choose. Supported languages include:

| Language family | Supported languages |
| --- | --- |
| CJK | Chinese, Japanese, Korean |
| Latin | English, French, Spanish, Portuguese, German, Italian |
| Southeast Asian | Vietnamese, Thai, Indonesian |
| Other | Arabic, Russian, Polish, Dutch, Turkish, and more |

Same-language classification

Input and labels in the same language:

-- Japanese
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  '東京オリンピックで日本は金メダル27個を獲得しました',
  ARRAY('テクノロジー', 'スポーツ', '金融', 'エンタメ')
);
-- Returns: スポーツ

-- Arabic
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  'أعلن البنك المركزي عن رفع أسعار الفائدة',
  ARRAY('تقنية', 'مالية', 'رياضة', 'ترفيه')
);
-- Returns: مالية

Cross-language classification

Input and labels can be in different languages:

-- Chinese input + English labels
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  '特斯拉发布了全新的自动驾驶系统',
  ARRAY('technology', 'sports', 'finance', 'entertainment')
);
-- Returns: technology

-- English input + Chinese labels
SELECT AI_CLASSIFY(
  'endpoint:qwen3.5-plus',
  'Bitcoin surged past 150000 as institutional investors poured billions',
  ARRAY('科技', '体育', '金融', '娱乐')
);
-- Returns: 金融


options Parameter

JSON '{"model.params":{"enable_thinking":false},"response.timeout":"300","task.concurrency":"12"}'

| Parameter | Type | Description |
| --- | --- | --- |
| model.params.enable_thinking | boolean | Set to false to disable thinking mode for faster responses (recommended for batch classification) |
| response.timeout | string (seconds) | Per-call timeout |
| task.concurrency | string (integer) | Batch processing concurrency |

NULL and Empty Input Behavior

| Input | Return value | Notes |
| --- | --- | --- |
| content is NULL | NULL | NULL is passed through |
| content is empty string | "" | Returns empty string (not NULL) |
| Normal text | Matching category name | Plain string |
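
Since NULL and empty-string inputs produce different results, it can help to normalize them into one fallback value. The sketch below assumes the standard NULLIF, TRIM, and COALESCE functions are available, reuses the support_tickets table from Case 4, and uses 'unclassified' as a hypothetical placeholder label chosen for this example.

```sql
-- Treat NULL, empty, and whitespace-only descriptions the same way.
-- NULLIF(TRIM(...), '') turns blank text into NULL; per the table above,
-- a NULL input yields NULL, which COALESCE then replaces with a fallback.
SELECT
  ticket_id,
  COALESCE(
    AI_CLASSIFY(
      'endpoint:qwen3.5-plus',
      NULLIF(TRIM(description), ''),
      ARRAY('payment issue', 'shipping issue', 'product quality', 'account issue', 'feature request')
    ),
    'unclassified'  -- hypothetical fallback label, not a model output
  ) AS department
FROM support_tickets;
```

This keeps every ticket in the result set, in contrast to filtering blank rows out with WHERE.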

Best Practices

  1. Use descriptive category names — Use meaningful names (e.g. "electronics" rather than "cat_1"). The model understands categories through semantic meaning.

  2. Keep the number of categories reasonable — 2–10 categories works best. Too many categories may reduce accuracy.

  3. Disable thinking for speed — For batch classification, set enable_thinking:false to significantly reduce response time.

  4. Filter before classifying — For large tables, use WHERE to narrow the scope first, avoiding unnecessary model calls.

  5. Leverage cross-language capability — Labels can be in English even when input is in another language, making downstream processing consistent.

  6. Image classification — Pass images via GET_PRESIGNED_URL(USER VOLUME, path, expiry) AS image; the model classifies based on image content.

  7. Guard against empty strings — For columns that may contain empty strings, add WHERE content IS NOT NULL AND content != '' before classifying.
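
Several of the practices above can be combined in one query. The following is a rough sketch rather than a recommended template: it reuses the news_articles table from Case 3 and the options literal from Case 5, and it assumes a publish_date column, which is hypothetical and only serves to illustrate narrowing the scope.

```sql
-- Filter first, skip NULL/empty input, and disable thinking for batch speed.
SELECT
  headline,
  AI_CLASSIFY(
    'endpoint:qwen3.5-plus',
    headline,
    -- English labels keep downstream processing consistent across input languages
    ARRAY('tech', 'sports', 'finance', 'entertainment'),
    JSON '{"model.params":{"enable_thinking":false},"response.timeout":"300","task.concurrency":"12"}'
  ) AS topic
FROM news_articles
WHERE publish_date >= DATE '2025-01-01'  -- hypothetical column; narrows the scan
  AND headline IS NOT NULL
  AND headline != '';
```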


Limitations

| Item | Description |
| --- | --- |
| Model parameter | An endpoint must be specified |
| Minimum labels | 1 (recommended ≥ 2; with a single label, that label is always returned) |
| Maximum labels | Recommended ≤ 20; too many reduces accuracy |
| Return value | Single label (one category name string) |
| Image input | Must use GET_PRESIGNED_URL(...) AS image syntax |
| Quota | Subject to AI Gateway tenant token quota limits |

Error Handling

| Error scenario | Error message | Resolution |
| --- | --- | --- |
| Endpoint does not exist | CZLH-67000 No available endpoints found | Check that the endpoint name is correct |
| Quota exceeded | Tenant quota exceeded | Contact your administrator to increase quota |
| Image not found | Failed to fetch image from URL | Check the Volume file path |