Scenario: Use vector similarity search to implement semantic matching, recommendation systems, and knowledge base retrieval (RAG). Convert unstructured data such as text and images into high-dimensional vectors, then find the most similar results by computing vector distances.
All examples in this guide use the following test data:
-- Product semantic vector table
-- In production, use the ai_embedding function to convert product descriptions to high-dimensional vectors:
-- UPDATE product_embeddings
-- SET embedding = ai_embedding('endpoint:my_embedding_model', description);
CREATE TABLE product_embeddings (
id BIGINT,
name VARCHAR(100),
description VARCHAR(500),
category VARCHAR(50),
embedding VECTOR(768),
-- Create a vector index (HNSW algorithm) to accelerate cosine distance retrieval
INDEX idx_embedding(embedding) USING VECTOR PROPERTIES (
'scalar.type' = 'f32',
'distance.function' = 'cosine_distance'
)
);
-- Sample data: in production, vectors are generated by an embedding model.
-- Simplified vectors are used here for demonstration.
-- Note: real-world vector dimensions are typically 768, 1024, or 1536.
INSERT INTO product_embeddings (id, name, description, category, embedding) VALUES
(1, 'Apple', 'Fresh red apple from organic farm, sweet and juicy.',
'Fruit', vector(0.9, 0.1, 0.0, 0.0) || fill_vector(764, 0.0)),
(2, 'Banana', 'Ripe yellow banana, rich in potassium and energy.',
'Fruit', vector(0.1, 0.9, 0.0, 0.0) || fill_vector(764, 0.0)),
(3, 'Carrot', 'Organic carrot, crunchy and full of vitamins.',
'Vegetable',vector(0.0, 0.0, 0.9, 0.1) || fill_vector(764, 0.0)),
(4, 'Dog', 'Friendly golden retriever, great family pet.',
'Animal', vector(0.1, 0.0, 0.1, 0.8) || fill_vector(764, 0.0)),
(5, 'Elephant', 'Majestic elephant, largest land animal.',
'Animal', vector(0.0, 0.1, 0.0, 0.9) || fill_vector(764, 0.0));
⚠️ Note:
The VECTOR type cannot be output directly as text in query results; use it with distance functions instead.
In production, use the ai_embedding function to convert text to high-dimensional vectors:
SELECT ai_embedding('endpoint:my_embedding_model', 'Fresh apple',
JSON '{"dimensions": "768"}') AS embedding;
The vector index uses the HNSW algorithm to accelerate retrieval and automatically indexes data as it is written.
Scenario 0: Generating Vectors with ai_embedding
Problem
How to convert text descriptions into high-dimensional vectors for subsequent similarity search.
SQL Implementation
-- Use ai_embedding to convert product descriptions to 768-dimensional vectors
SELECT
id,
name,
description,
ai_embedding('endpoint:my_embedding_model', description,
JSON '{"dimensions": "768"}') AS embedding
FROM products
WHERE id <= 5;
Notes:
ai_embedding requires a pre-configured embedding model endpoint (e.g., OpenAI text-embedding-3-small, Alibaba Cloud text-embedding-v4, etc.)
Common dimensions: 256 (lightweight), 768 (standard), 1024/1536 (high precision)
After generating vectors, store them in a VECTOR type column for subsequent similarity search
Batch Vector Generation
-- Batch-generate vectors for existing data
UPDATE product_embeddings_768
SET embedding = ai_embedding('endpoint:my_embedding_model', description,
JSON '{"dimensions": "768"}')
WHERE embedding IS NULL;
Scenario 1: Semantic Similarity Search
Problem
Given a query vector (e.g., the embedding for "fruit"), find the most similar products.
SQL Implementation
-- Given a query text, convert it to a vector with ai_embedding and then search
-- In production, the query vector is also generated by ai_embedding:
-- ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit')
SELECT
id,
name,
category,
description,
cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit')) AS similarity
FROM product_embeddings
ORDER BY similarity ASC
LIMIT 3;
Output:
id
name
category
description
similarity
1
Apple
Fruit
Fresh red apple...
0.009
2
Banana
Fruit
Ripe yellow banana...
0.652
4
Dog
Animal
Friendly golden retriever...
0.881
Key notes:
A smaller cosine_distance value means greater similarity (0 = identical, 2 = completely opposite)
The query vector is generated in real time by ai_embedding and must have the same dimension as the stored vectors (e.g., 768)
Scenario 2: Comparing Distance Functions
Problem
Different distance metrics suit different business scenarios; you need to compare their results.
SQL Implementation
-- Compare the effect of different distance metrics
-- The query vector is generated in real time by ai_embedding
SELECT
id,
name,
cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS cos_dist,
l2_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS l2_dist,
dot_product(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS dot_prod
FROM product_embeddings
ORDER BY cos_dist ASC;
Output:
id
name
cos_dist
l2_dist
dot_prod
1
Apple
0.009
0.141
0.74
2
Banana
0.652
0.990
0.26
3
Carrot
0.957
1.273
0.00
4
Dog
0.881
1.086
0.10
5
Elephant
0.949
1.225
0.02
Selection guide:
Cosine distance: focuses on direction rather than magnitude (text semantics, recommendations)
Euclidean distance: focuses on absolute difference (image features, spatial coordinates)
Dot product: recommendation system scoring (higher value = more relevant)
Scenario 3: Vector Normalization and Baseline Vectors
Problem
Create a zero vector as a baseline, or compute similarity after normalizing vectors.
SQL Implementation
-- Create a zero vector and compute distances
SELECT
id,
name,
l2_distance(embedding, fill_vector(768, 0.0)) AS dist_to_zero,
l2_norm(embedding) AS vector_length
FROM product_embeddings
ORDER BY dist_to_zero ASC
LIMIT 3;
Output:
id
name
dist_to_zero
vector_length
4
Dog
0.812
0.812
1
Apple
0.906
0.906
2
Banana
0.906
0.906
⚠️ Note: l2_distance(v, fill_vector(4, 0.0)) is equivalent to l2_norm(v).
l2_normalize(v) returns the normalized vector, which can be used to convert dot product to cosine similarity, but the VECTOR type cannot be output as text directly.
Scenario 4: Binary Quantization and Hamming Distance
Problem
Quantize floating-point vectors into 0/1 binary vectors for fast approximate search.
SQL Implementation
-- Compute Hamming distance after binary quantization (avoid outputting vector columns directly)
SELECT
a.id AS id_a,
a.name AS name_a,
b.id AS id_b,
b.name AS name_b,
hamming_distance(
binary_quantize(a.embedding),
binary_quantize(b.embedding)
) AS hamming_dist
FROM product_embeddings a
JOIN product_embeddings b ON a.id < b.id
ORDER BY hamming_dist ASC
LIMIT 5;
Output:
id_a
name_a
id_b
name_b
hamming_dist
3
Carrot
4
Dog
1.0
1
Apple
2
Banana
2.0
...
...
...
...
...
Notes:
binary_quantize returns a vector(tinyint, N) type, which cannot be output as text directly
hamming_distance is designed for fast distance computation on binary vectors
Hamming distance = the number of positions where the two vectors differ
Scenario 5: Jaccard Similarity (Set Overlap)
Problem
Compute the Jaccard distance between two binary vectors (representing sets) to measure set overlap.
SQL Implementation
-- Jaccard distance requires vectors of tinyint type
-- Distance = 0 means identical; distance = 1 means completely different
SELECT
jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_dist,
-- Jaccard similarity = 1 - distance
1 - jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_sim
Output:
jaccard_dist
jaccard_sim
0.667
0.333
Notes:
Jaccard distance = 1 - (intersection size / union size)
Suitable for tag overlap, user interest similarity, and similar scenarios
Scenario 7: Accelerating Retrieval with a Vector Index
Problem
When data reaches millions of rows or more, brute-force computation of all vector distances leads to extremely slow queries. A vector index is essential for large-scale vector retrieval and can reduce query time from minutes to milliseconds.
Why Do You Need a Vector Index?
Scenario
No Index (Brute Force)
With Index (HNSW)
Query method
Full table scan, computes all vector distances
Approximate nearest neighbor (ANN), computes only a subset of vectors
Time complexity
O(N), grows linearly with data volume
O(log N), grows logarithmically with data volume
Small dataset (< 10,000 rows)
Performance difference is negligible
Performance difference is negligible
Large dataset (> 100,000 rows)
Query time increases significantly
Query time remains roughly constant
Conclusion: When data exceeds 10,000 rows, creating a vector index is strongly recommended.
SQL Implementation
Option 1: Declare the vector index at table creation (recommended)
CREATE TABLE product_embeddings (
id BIGINT,
name VARCHAR(100),
embedding VECTOR(768),
INDEX idx_embedding(embedding) USING VECTOR PROPERTIES (
'scalar.type' = 'f32',
'distance.function' = 'cosine_distance'
)
);
Option 2: Add a vector index to an existing table
-- Create the index after the table and data already exist
CREATE VECTOR INDEX idx_embedding ON TABLE product_embeddings(embedding) PROPERTIES (
'scalar.type' = 'f32',
'distance.function' = 'cosine_distance'
);
-- Existing data requires a manual build
BUILD INDEX idx_embedding ON product_embeddings;
Verifying the Index
-- View indexes on the table
SHOW INDEX IN product_embeddings;
Output:
index_name
index_type
idx_embedding
vector
Key notes:
Index declared at table creation: data is indexed automatically on insert; no BUILD is needed
Index added to an existing table: the new index only applies to data written after creation; existing data requires BUILD INDEX
The vector index uses the HNSW algorithm, suitable for large-scale vector retrieval
Vector Index + Text Filtering
-- Issue: combining a vector index with WHERE filters on other columns degrades to brute-force retrieval
-- Option 1: use a subquery to do vector retrieval first, then filter in the outer query
SELECT id, name, dist FROM (
SELECT id, name, cosine_distance(embedding, query_vec) as dist
FROM product_embeddings
WHERE cosine_distance(embedding, query_vec) < 0.5
ORDER BY dist LIMIT 100
) WHERE name LIKE '%Apple%';
-- Option 2: combine vector index with inverted index
SET cz.sql.index.prewhere.enabled=true;
SELECT id, name, cosine_distance(embedding, query_vec) as dist
FROM product_embeddings
WHERE match_regexp(name, '.*Apple.*', map('analyzer', 'keyword'))
AND cosine_distance(embedding, query_vec) < 0.5
ORDER BY dist LIMIT 100;
Scenario 8: End-to-End Example — RAG Knowledge Base Retrieval
Problem
Build a simple RAG (Retrieval-Augmented Generation) knowledge base retrieval flow:
Convert the user query to a vector
Retrieve the Top K most relevant documents
Filter results by category
SQL Implementation
-- RAG retrieval flow:
-- 1. User query → ai_embedding converts it to a vector
-- 2. Vector retrieval → Top K relevant documents
-- 3. Retrieved results + original query → LLM generates an answer
SELECT
id,
name,
category,
description,
cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'large land animal')) AS relevance
FROM product_embeddings
WHERE category = 'Animal' -- optional: filter by category
ORDER BY relevance ASC
LIMIT 2;
Output:
id
name
category
relevance
4
Dog
Animal
0.000
5
Elephant
Animal
0.134
RAG flow summary:
Query text → Embedding model → Query vector
SQL vector retrieval → Top K relevant documents
Retrieved results + original query → LLM generates an answer
Common Issues
1. Vector type cannot be output directly
-- Wrong: querying a vector column directly will error
SELECT embedding FROM product_embeddings;
-- Error: data type vector(float,4) is not currently supported by textfile writer
-- Correct: use with a distance function, or output only scalar results
SELECT
id,
name,
cosine_distance(embedding, vector(0.5, 0.5, 0.5, 0.5)) AS dist
FROM product_embeddings;
-- Correct: compute distance after binary quantization
SELECT l2_distance(binary_quantize(embedding), binary_quantize(vector(0.5, 0.5, 0.5, 0.5))) AS bin_dist
FROM product_embeddings;
2. Dimension consistency for ai_embedding-generated vectors
-- Wrong: query vector dimension does not match stored vector dimension
cosine_distance(embedding, ai_embedding('endpoint:my_model', 'query'))
-- embedding is 768-dimensional; ai_embedding default may be 1536-dimensional
-- Correct: specify the dimensions parameter to ensure consistency
ai_embedding('endpoint:my_model', 'query', JSON '{"dimensions": "768"}')
-- cosine_distance returns a distance value (0–2), not a similarity score
-- Similarity = 1 - distance
SELECT
name,
1 - cosine_distance(embedding, query_vec) AS cosine_similarity
FROM product_embeddings;
-- Verification: identical vectors → distance=0, orthogonal → distance=1, opposite → distance=2
SELECT
cosine_distance(vector(1.0, 0.0), vector(1.0, 0.0)) AS same, -- 0
cosine_distance(vector(1.0, 0.0), vector(0.0, 1.0)) AS ortho, -- 1
cosine_distance(vector(1.0, 0.0), vector(-1.0, 0.0)) AS opp; -- 2
5. Vector dimensions must match
-- Wrong: dimension mismatch will error
cosine_distance(vector(1.0, 2.0), vector(1.0, 2.0, 3.0))
-- Correct: ensure the query vector has the same dimension as the stored vector
cosine_distance(embedding, vector(0.8, 0.2, 0.0, 0.0)) -- 4-dim vs 4-dim
6. fill_vector parameter order
-- Wrong: parameters are in the wrong order
fill_vector(vector(0.0, 0.0), 4, 0.5)
-- Correct: fill_vector(dimension, fill_value)
fill_vector(4, 0.0) -- create a 4-dimensional zero vector
7. Vector arithmetic functions
-- vector_add_scalar: add a scalar to each element of a vector (second parameter must be an integer)
SELECT l2_distance(
vector_add_scalar(vector(1.0, 2.0, 3.0), 2),
vector(3.0, 4.0, 5.0)
) AS dist; -- result: 0, confirming correct computation
-- l2_normalize: L2 norm ≈ 1 after normalization
SELECT l2_norm(l2_normalize(vector(1.0, 2.0, 3.0, 4.0))) AS norm; -- result: 0.99999994
Performance Optimization Tips
Scenario
Optimization Strategy
Large-scale vector retrieval
Create a vector index (HNSW) to avoid full table scans
Binary quantization speedup
Use binary_quantize to reduce dimensionality; Hamming distance is faster to compute
Pre-filtering
Use WHERE to filter by category or time range before computing vector distances
Batch queries
Use JOIN instead of multiple single-row queries
Vector index tuning
Adjust the cz.vector.index.search.ef parameter to balance performance and accuracy
-- Recommended: filter first, then compute distances
SELECT id, name, cosine_distance(embedding, query_vec) AS dist
FROM product_embeddings
WHERE category = 'Fruit' -- narrow the scope first
ORDER BY dist ASC
LIMIT 3;
-- Not recommended: compute distances across the full table, then filter
SELECT id, name, cosine_distance(embedding, query_vec) AS dist
FROM product_embeddings
ORDER BY dist ASC
LIMIT 100; -- then filter by category in the application layer
-- Recommended: use vector index to accelerate
SET cz.vector.index.search.ef=64; -- adjust the exploration factor
SELECT id, name, cosine_distance(embedding, query_vec) AS dist
FROM product_embeddings
WHERE cosine_distance(embedding, query_vec) < 0.5 -- distance threshold triggers the index
ORDER BY dist ASC
LIMIT 10;