AI_SIMILARITY
Overview
AI_SIMILARITY is a semantic similarity function provided by Singdata Lakehouse. It converts two text inputs into vectors using an embedding model and computes their cosine similarity, returning a FLOAT value. Use it for semantic search, product recommendations, text deduplication, content matching, and similar scenarios.
Unlike LLM functions such as AI_COMPLETE, AI_SIMILARITY is based on an embedding model, so its results are deterministic: the same input always returns the same result. It also runs faster than an LLM-based function.
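To make the cosine similarity step concrete, here is a minimal Python sketch of the computation on two vectors. It is an illustration only: the function name `cosine_similarity` and the toy 3-dimensional vectors are assumptions, and the platform's internal implementation is not documented here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values, not real model output).
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])        # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])   # opposite directions
```

Parallel vectors score close to 1, orthogonal vectors score 0, and opposite vectors score -1. The computation has no random component, so repeated calls on the same vectors give the same value.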
Singdata pushes AI computation down to the storage and execution engine layer. Data is processed intelligently within the platform without leaving the system, ensuring data security while significantly reducing task latency.
Syntax
Parameters
Required Parameters
model
Specifies the embedding model to use. Supports two sources:
Source 1: API Gateway Endpoint (Recommended)
A platform administrator pre-configures model services in the API Gateway. Regular users reference them with the endpoint: prefix, without needing to know the underlying connection details.
Source 2: API Connection Object
Users create their own connection objects via CREATE API CONNECTION, suitable for custom service addresses, authentication keys, or private deployment models.
CREATE API CONNECTION field descriptions:
| Field | Description |
|---|---|
| TYPE | Fixed as ai_function |
| PROVIDER | Model provider identifier, e.g. 'bailian', 'openai', 'anthropic' |
| BASE_URL | Base API URL of the model service |
| API_KEY | Authentication key for calling the service |
text1
The first input text, type STRING. Supports Chinese, English, and other languages.
text2
The second input text, type STRING. Supports Chinese, English, and other languages.
Optional Parameters
options
JSON literal for controlling model parameters, timeout, and concurrency.
| Parameter | Description |
|---|---|
| model.params.dimensions | Embedding vector dimensions (default 1024; can be set to 2048, etc., depending on model support) |
| response.timeout | HTTP request timeout in seconds |
| task.concurrency | Concurrency for batch processing |
Return Value
FLOAT type. Based on cosine similarity, the theoretical range is [-1, 1]; in practice it typically falls in [0, 1].
| Range | Meaning |
|---|---|
| 1.0 | The two texts are identical (or semantically equivalent) |
| > 0.7 | Highly similar |
| 0.3 ~ 0.7 | Somewhat related |
| < 0.3 | Largely unrelated |
| 0 | Either input is NULL, or one is an empty string and the other is not |
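The ranges in the table above can be applied on the client side once scores have been fetched. The sketch below is a hypothetical helper (`interpret_score` is not part of the platform). How it treats the exact boundaries 0.3 and 0.7 is an assumption, since the table does not say which bucket they belong to.

```python
def interpret_score(score):
    """Map a similarity score to the label used in the range table.

    Assumption: 0.7 itself counts as "somewhat related" and 0.3 itself
    counts as "somewhat related"; the table leaves these edges open.
    """
    if score >= 1.0:
        return "identical or semantically equivalent"
    if score > 0.7:
        return "highly similar"
    if score >= 0.3:
        return "somewhat related"
    return "largely unrelated"

labels = [interpret_score(s) for s in (1.0, 0.85, 0.5, 0.1, 0.0)]
```

A score of 0 lands in the "largely unrelated" bucket here, but it can also mean that an input was NULL or empty, so it may be worth checking the inputs before interpreting it.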
Error Behavior
By default, if the function cannot process the input, it returns 0 without raising an error. Specific boundary behaviors:
| Input condition | Return value |
|---|---|
| Either parameter is NULL | 0 |
| Both are empty strings '' | 1 |
| One empty string, one non-empty | 0 |
| Two identical non-empty texts | 1.0 |
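The boundary rules in the table above can be written as a small reference model, which may help when writing tests or pre-checking data before a query. This is a sketch, not the platform's code: `similarity_with_edge_cases` and the `embed_similarity` callable are hypothetical names, and the callable stands in for the real embedding-based comparison.

```python
def similarity_with_edge_cases(text1, text2, embed_similarity):
    """Apply the documented boundary rules, then defer to embed_similarity.

    embed_similarity is a stand-in (assumption) for the model-backed
    comparison of two non-empty strings.
    """
    if text1 is None or text2 is None:
        return 0.0  # either parameter is NULL
    if text1 == "" and text2 == "":
        return 1.0  # both are empty strings
    if text1 == "" or text2 == "":
        return 0.0  # one empty string, one non-empty
    return embed_similarity(text1, text2)

def fake_similarity(a, b):
    """Placeholder scorer for demonstration only: 1.0 for equal strings."""
    return 1.0 if a == b else 0.5

results = [
    similarity_with_edge_cases(None, "apple", fake_similarity),
    similarity_with_edge_cases("", "", fake_similarity),
    similarity_with_edge_cases("", "apple", fake_similarity),
    similarity_with_edge_cases("apple", "apple", fake_similarity),
]
```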
Usage Notes
- Results are deterministic: The same input always returns the same result — suitable for business scenarios requiring stable ordering (e.g. search result ranking).
- The function is symmetric: `AI_SIMILARITY(model, a, b)` and `AI_SIMILARITY(model, b, a)` return identical results.
- Supports multilingual and cross-lingual: Supports Chinese, English, and other languages, including cross-language similarity (e.g. comparing Chinese and English semantics).
- Text input only: `AI_SIMILARITY` does not support image input; use `AI_EXTRACT` for image processing.
- Set thresholds appropriately: Adjust filter thresholds based on your use case: > 0.9 for exact matches, > 0.7 for highly related, > 0.5 for somewhat related.
- Be aware of quota consumption: Each call consumes tokens for both text1 and text2. In CROSS JOIN scenarios, token consumption = rows² × average token count; estimate before running.
- Filter before computing: For large tables, use `WHERE` to narrow the scope first, then compute similarity, avoiding unnecessary API calls.
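The quota formula in the notes above is easy to evaluate before running a query. The sketch below is a rough estimator with hypothetical names. It assumes that "average token count" means the average tokens consumed per call (text1 plus text2 together); if your figure is per text, double it.

```python
def estimate_cross_join_tokens(rows, avg_tokens_per_call):
    """Estimate tokens for a self CROSS JOIN: rows squared times average tokens.

    Assumption: avg_tokens_per_call covers both text1 and text2 of one call.
    """
    if rows < 0 or avg_tokens_per_call < 0:
        raise ValueError("rows and avg_tokens_per_call must be non-negative")
    return rows * rows * avg_tokens_per_call

# Hypothetical table of 1,000 rows at about 40 tokens per call.
full_table = estimate_cross_join_tokens(1000, 40)

# The same table narrowed to 100 rows by a filter before the join.
filtered = estimate_cross_join_tokens(100, 40)

savings_factor = full_table // filtered
```

Because the cost grows with the square of the row count, cutting the input to one tenth of its rows cuts the estimated consumption to one hundredth.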
Examples
Basic Usage
Cross-Language Similarity
Semantic Search (Sorted by Similarity)
Similarity Threshold Filtering
Text Deduplication (Find Near-Duplicates)
Using a CTE to Avoid Redundant Calls
Using an API Connection
Limitations
- `model` parameter is required: Omitting it causes the error `AI function must have at least two arguments`.
- Invalid `model` format causes an error: `model` must use `'endpoint:<name>'` or `'<connection_name>:<model_name>'` format; an incorrect format causes `Invalid model coordinates`.
- Text input only: Image input is not supported; use `AI_EXTRACT` for image processing.
- Input length is model-limited: Input text length is limited by the underlying embedding model's context window.
- Quota limits: Subject to AI Gateway tenant monthly token quota limits; when quota is exceeded, the entire query fails with `Tenant quota exceeded: Monthly quota limit...`.
- Non-existent Endpoint causes an error: Error message is `No available endpoints found`; check that the endpoint name is correct.
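When model strings are built in application code, a client-side check of the two documented formats can catch mistakes before a query is submitted. The helper below is a hypothetical sketch: `parse_model_coordinate`, `my_embedding`, `my_conn`, and `my_model` are illustrative names. It only checks the overall shape, so the platform's own validation may be stricter.

```python
def parse_model_coordinate(model):
    """Split a model string into its documented parts.

    Accepts 'endpoint:<name>' or '<connection_name>:<model_name>'.
    This mirrors the documented formats only; it is not the platform's parser.
    """
    if not isinstance(model, str):
        raise ValueError("Invalid model coordinates")
    prefix, sep, rest = model.partition(":")
    if not sep or not prefix or not rest:
        raise ValueError("Invalid model coordinates")
    if prefix == "endpoint":
        return {"source": "endpoint", "endpoint": rest}
    return {"source": "connection", "connection": prefix, "model": rest}

# Hypothetical names used purely for illustration.
via_endpoint = parse_model_coordinate("endpoint:my_embedding")
via_connection = parse_model_coordinate("my_conn:my_model")
```

A check like this cannot tell whether the endpoint or connection actually exists; that is still reported by the platform at query time.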
