Vector Search and RAG Applications Guide

Scenario: Use vector similarity search to implement semantic matching, recommendation systems, and knowledge base retrieval (RAG). Convert unstructured data such as text and images into high-dimensional vectors, then find the most similar results by computing vector distances.

Core functions: vector (create a vector), cosine_distance (cosine distance), l2_distance (Euclidean distance), dot_product (dot product), l2_norm (L2 norm), l2_normalize (L2 normalization), fill_vector (fill a vector), binary_quantize (binary quantization).

Quick Reference

Function	Purpose	Return Value	Use Case
`ai_embedding(model, text)`	Convert text to a high-dimensional vector	VECTOR(N)	RAG retrieval, semantic search, vector storage
`vector(v1, v2, ...)`	Create a vector	VECTOR type	Constructing query vectors, test data
`cosine_distance(a, b)`	Cosine distance (0–2)	FLOAT	Text semantic similarity, recommendation systems
`l2_distance(a, b)`	Euclidean distance (≥0)	FLOAT	Image feature matching, spatial distance
`dot_product(a, b)`	Vector dot product	FLOAT	Recommendation system scoring
`l2_norm(v)`	L2 norm (vector length)	FLOAT	Vector normalization preprocessing
`l2_normalize(v)`	L2 normalization	VECTOR type	Convert dot product to cosine similarity
`fill_vector(n, val)`	Create a filled vector	VECTOR type	Zero vector / unit vector baseline
`binary_quantize(v)`	Binary quantization of a vector	VECTOR(tinyint, N)	Dimensionality reduction, Hamming distance search

Prerequisites

All examples in this guide use the following test data:

-- Product semantic vector table -- In production, use the ai_embedding function to convert product descriptions to high-dimensional vectors: -- UPDATE product_embeddings -- SET embedding = ai_embedding('endpoint:my_embedding_model', description); CREATE TABLE product_embeddings ( id BIGINT, name VARCHAR(100), description VARCHAR(500), category VARCHAR(50), embedding VECTOR(768), -- Create a vector index (HNSW algorithm) to accelerate cosine distance retrieval INDEX idx_embedding(embedding) USING VECTOR PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ) ); -- Sample data: in production, vectors are generated by an embedding model. -- Simplified vectors are used here for demonstration. -- Note: real-world vector dimensions are typically 768, 1024, or 1536. INSERT INTO product_embeddings (id, name, description, category, embedding) VALUES (1, 'Apple', 'Fresh red apple from organic farm, sweet and juicy.', 'Fruit', vector(0.9, 0.1, 0.0, 0.0) || fill_vector(764, 0.0)), (2, 'Banana', 'Ripe yellow banana, rich in potassium and energy.', 'Fruit', vector(0.1, 0.9, 0.0, 0.0) || fill_vector(764, 0.0)), (3, 'Carrot', 'Organic carrot, crunchy and full of vitamins.', 'Vegetable',vector(0.0, 0.0, 0.9, 0.1) || fill_vector(764, 0.0)), (4, 'Dog', 'Friendly golden retriever, great family pet.', 'Animal', vector(0.1, 0.0, 0.1, 0.8) || fill_vector(764, 0.0)), (5, 'Elephant', 'Majestic elephant, largest land animal.', 'Animal', vector(0.0, 0.1, 0.0, 0.9) || fill_vector(764, 0.0));

⚠️ Note:

The VECTOR type cannot be output directly as text in query results; use it with distance functions instead.
In production, use the ai_embedding function to convert text to high-dimensional vectors:
SELECT ai_embedding('endpoint:my_embedding_model', 'Fresh apple', JSON '{"dimensions": "768"}') AS embedding;
The vector index uses the HNSW algorithm to accelerate retrieval and automatically indexes data as it is written.

Scenario 0: Generating Vectors with ai_embedding

Problem

How to convert text descriptions into high-dimensional vectors for subsequent similarity search.

SQL Implementation

-- Use ai_embedding to convert product descriptions to 768-dimensional vectors SELECT id, name, description, ai_embedding('endpoint:my_embedding_model', description, JSON '{"dimensions": "768"}') AS embedding FROM products WHERE id <= 5;

Notes:

ai_embedding requires a pre-configured embedding model endpoint (e.g., OpenAI text-embedding-3-small, Alibaba Cloud text-embedding-v4, etc.)
Common dimensions: 256 (lightweight), 768 (standard), 1024/1536 (high precision)
After generating vectors, store them in a VECTOR type column for subsequent similarity search

Batch Vector Generation

-- Batch-generate vectors for existing data UPDATE product_embeddings_768 SET embedding = ai_embedding('endpoint:my_embedding_model', description, JSON '{"dimensions": "768"}') WHERE embedding IS NULL;

Scenario 1: Semantic Similarity Search

Problem

Given a query vector (e.g., the embedding for "fruit"), find the most similar products.

SQL Implementation

-- Given a query text, convert it to a vector with ai_embedding and then search -- In production, the query vector is also generated by ai_embedding: -- ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit') SELECT id, name, category, description, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit')) AS similarity FROM product_embeddings ORDER BY similarity ASC LIMIT 3;

Output:

id	name	category	description	similarity
1	Apple	Fruit	Fresh red apple...	0.009
2	Banana	Fruit	Ripe yellow banana...	0.652
4	Dog	Animal	Friendly golden retriever...	0.881

Key notes:

A smaller cosine_distance value means greater similarity (0 = identical, 2 = completely opposite)
The query vector is generated in real time by ai_embedding and must have the same dimension as the stored vectors (e.g., 768)

Scenario 2: Comparing Distance Functions

Problem

Different distance metrics suit different business scenarios; you need to compare their results.

SQL Implementation

-- Compare the effect of different distance metrics -- The query vector is generated in real time by ai_embedding SELECT id, name, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS cos_dist, l2_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS l2_dist, dot_product(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS dot_prod FROM product_embeddings ORDER BY cos_dist ASC;

Output:

id	name	cos_dist	l2_dist	dot_prod
1	Apple	0.009	0.141	0.74
2	Banana	0.652	0.990	0.26
3	Carrot	0.957	1.273	0.00
4	Dog	0.881	1.086	0.10
5	Elephant	0.949	1.225	0.02

Selection guide:

Cosine distance: focuses on direction rather than magnitude (text semantics, recommendations)
Euclidean distance: focuses on absolute difference (image features, spatial coordinates)
Dot product: recommendation system scoring (higher value = more relevant)

Scenario 3: Vector Normalization and Baseline Vectors

Problem

Create a zero vector as a baseline, or compute similarity after normalizing vectors.

SQL Implementation

-- Create a zero vector and compute distances SELECT id, name, l2_distance(embedding, fill_vector(768, 0.0)) AS dist_to_zero, l2_norm(embedding) AS vector_length FROM product_embeddings ORDER BY dist_to_zero ASC LIMIT 3;

Output:

id	name	dist_to_zero	vector_length
4	Dog	0.812	0.812
1	Apple	0.906	0.906
2	Banana	0.906	0.906

⚠️ Note: l2_distance(v, fill_vector(4, 0.0)) is equivalent to l2_norm(v). l2_normalize(v) returns the normalized vector, which can be used to convert dot product to cosine similarity, but the VECTOR type cannot be output as text directly.

Scenario 4: Binary Quantization and Hamming Distance

Problem

Quantize floating-point vectors into 0/1 binary vectors for fast approximate search.

SQL Implementation

-- Compute Hamming distance after binary quantization (avoid outputting vector columns directly) SELECT a.id AS id_a, a.name AS name_a, b.id AS id_b, b.name AS name_b, hamming_distance( binary_quantize(a.embedding), binary_quantize(b.embedding) ) AS hamming_dist FROM product_embeddings a JOIN product_embeddings b ON a.id < b.id ORDER BY hamming_dist ASC LIMIT 5;

Output:

id_a	name_a	id_b	name_b	hamming_dist
3	Carrot	4	Dog	1.0
1	Apple	2	Banana	2.0
...	...	...	...	...

Notes:

binary_quantize returns a vector(tinyint, N) type, which cannot be output as text directly
hamming_distance is designed for fast distance computation on binary vectors
Hamming distance = the number of positions where the two vectors differ

Scenario 5: Jaccard Similarity (Set Overlap)

Problem

Compute the Jaccard distance between two binary vectors (representing sets) to measure set overlap.

SQL Implementation

-- Jaccard distance requires vectors of tinyint type -- Distance = 0 means identical; distance = 1 means completely different SELECT jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_dist, -- Jaccard similarity = 1 - distance 1 - jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_sim

Output:

jaccard_dist	jaccard_sim
0.667	0.333

Notes:

Jaccard distance = 1 - (intersection size / union size)
Suitable for tag overlap, user interest similarity, and similar scenarios

Scenario 7: Accelerating Retrieval with a Vector Index

Problem

When data reaches millions of rows or more, brute-force computation of all vector distances leads to extremely slow queries. A vector index is essential for large-scale vector retrieval and can reduce query time from minutes to milliseconds.

Why Do You Need a Vector Index?

Scenario	No Index (Brute Force)	With Index (HNSW)
Query method	Full table scan, computes all vector distances	Approximate nearest neighbor (ANN), computes only a subset of vectors
Time complexity	O(N), grows linearly with data volume	O(log N), grows logarithmically with data volume
Small dataset (< 10,000 rows)	Performance difference is negligible	Performance difference is negligible
Large dataset (> 100,000 rows)	Query time increases significantly	Query time remains roughly constant

Conclusion: When data exceeds 10,000 rows, creating a vector index is strongly recommended.

SQL Implementation

Option 1: Declare the vector index at table creation (recommended)

CREATE TABLE product_embeddings ( id BIGINT, name VARCHAR(100), embedding VECTOR(768), INDEX idx_embedding(embedding) USING VECTOR PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ) );

Option 2: Add a vector index to an existing table

-- Create the index after the table and data already exist CREATE VECTOR INDEX idx_embedding ON TABLE product_embeddings(embedding) PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ); -- Existing data requires a manual build BUILD INDEX idx_embedding ON product_embeddings;

Verifying the Index

-- View indexes on the table SHOW INDEX IN product_embeddings;

Output:

index_name	index_type
idx_embedding	vector

Key notes:

Index declared at table creation: data is indexed automatically on insert; no BUILD is needed
Index added to an existing table: the new index only applies to data written after creation; existing data requires BUILD INDEX
The vector index uses the HNSW algorithm, suitable for large-scale vector retrieval

Vector Index + Text Filtering

-- Issue: combining a vector index with WHERE filters on other columns degrades to brute-force retrieval -- Option 1: use a subquery to do vector retrieval first, then filter in the outer query SELECT id, name, dist FROM ( SELECT id, name, cosine_distance(embedding, query_vec) as dist FROM product_embeddings WHERE cosine_distance(embedding, query_vec) < 0.5 ORDER BY dist LIMIT 100 ) WHERE name LIKE '%Apple%'; -- Option 2: combine vector index with inverted index SET cz.sql.index.prewhere.enabled=true; SELECT id, name, cosine_distance(embedding, query_vec) as dist FROM product_embeddings WHERE match_regexp(name, '.*Apple.*', map('analyzer', 'keyword')) AND cosine_distance(embedding, query_vec) < 0.5 ORDER BY dist LIMIT 100;

Scenario 8: End-to-End Example — RAG Knowledge Base Retrieval

Problem

Build a simple RAG (Retrieval-Augmented Generation) knowledge base retrieval flow:

Convert the user query to a vector
Retrieve the Top K most relevant documents
Filter results by category

SQL Implementation

-- RAG retrieval flow: -- 1. User query → ai_embedding converts it to a vector -- 2. Vector retrieval → Top K relevant documents -- 3. Retrieved results + original query → LLM generates an answer SELECT id, name, category, description, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'large land animal')) AS relevance FROM product_embeddings WHERE category = 'Animal' -- optional: filter by category ORDER BY relevance ASC LIMIT 2;

Output:

id	name	category	relevance
4	Dog	Animal	0.000
5	Elephant	Animal	0.134

RAG flow summary:

Query text → Embedding model → Query vector
SQL vector retrieval → Top K relevant documents
Retrieved results + original query → LLM generates an answer

Common Issues

1. Vector type cannot be output directly

-- Wrong: querying a vector column directly will error SELECT embedding FROM product_embeddings; -- Error: data type vector(float,4) is not currently supported by textfile writer -- Correct: use with a distance function, or output only scalar results SELECT id, name, cosine_distance(embedding, vector(0.5, 0.5, 0.5, 0.5)) AS dist FROM product_embeddings; -- Correct: compute distance after binary quantization SELECT l2_distance(binary_quantize(embedding), binary_quantize(vector(0.5, 0.5, 0.5, 0.5))) AS bin_dist FROM product_embeddings;

2. Dimension consistency for `ai_embedding`-generated vectors

-- Wrong: query vector dimension does not match stored vector dimension cosine_distance(embedding, ai_embedding('endpoint:my_model', 'query')) -- embedding is 768-dimensional; ai_embedding default may be 1536-dimensional -- Correct: specify the dimensions parameter to ensure consistency ai_embedding('endpoint:my_model', 'query', JSON '{"dimensions": "768"}')

3. `vector()` function parameter format

-- Wrong: passing a string array vector('[0.9, 0.1, 0.0, 0.0]') -- Correct: pass individual numeric arguments vector(0.9, 0.1, 0.0, 0.0)

4. Cosine distance vs. cosine similarity

-- cosine_distance returns a distance value (0–2), not a similarity score -- Similarity = 1 - distance SELECT name, 1 - cosine_distance(embedding, query_vec) AS cosine_similarity FROM product_embeddings; -- Verification: identical vectors → distance=0, orthogonal → distance=1, opposite → distance=2 SELECT cosine_distance(vector(1.0, 0.0), vector(1.0, 0.0)) AS same, -- 0 cosine_distance(vector(1.0, 0.0), vector(0.0, 1.0)) AS ortho, -- 1 cosine_distance(vector(1.0, 0.0), vector(-1.0, 0.0)) AS opp; -- 2

5. Vector dimensions must match

-- Wrong: dimension mismatch will error cosine_distance(vector(1.0, 2.0), vector(1.0, 2.0, 3.0)) -- Correct: ensure the query vector has the same dimension as the stored vector cosine_distance(embedding, vector(0.8, 0.2, 0.0, 0.0)) -- 4-dim vs 4-dim

6. `fill_vector` parameter order

-- Wrong: parameters are in the wrong order fill_vector(vector(0.0, 0.0), 4, 0.5) -- Correct: fill_vector(dimension, fill_value) fill_vector(4, 0.0) -- create a 4-dimensional zero vector

7. Vector arithmetic functions

-- vector_add_scalar: add a scalar to each element of a vector (second parameter must be an integer) SELECT l2_distance( vector_add_scalar(vector(1.0, 2.0, 3.0), 2), vector(3.0, 4.0, 5.0) ) AS dist; -- result: 0, confirming correct computation -- l2_normalize: L2 norm ≈ 1 after normalization SELECT l2_norm(l2_normalize(vector(1.0, 2.0, 3.0, 4.0))) AS norm; -- result: 0.99999994

Performance Optimization Tips

Scenario	Optimization Strategy
Large-scale vector retrieval	Create a vector index (HNSW) to avoid full table scans
Binary quantization speedup	Use `binary_quantize` to reduce dimensionality; Hamming distance is faster to compute
Pre-filtering	Use `WHERE` to filter by category or time range before computing vector distances
Batch queries	Use `JOIN` instead of multiple single-row queries
Vector index tuning	Adjust the `cz.vector.index.search.ef` parameter to balance performance and accuracy

-- Recommended: filter first, then compute distances SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings WHERE category = 'Fruit' -- narrow the scope first ORDER BY dist ASC LIMIT 3; -- Not recommended: compute distances across the full table, then filter SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings ORDER BY dist ASC LIMIT 100; -- then filter by category in the application layer -- Recommended: use vector index to accelerate SET cz.vector.index.search.ef=64; -- adjust the exploration factor SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings WHERE cosine_distance(embedding, query_vec) < 0.5 -- distance threshold triggers the index ORDER BY dist ASC LIMIT 10;

Vector Search and RAG Applications Guide

Quick Reference

Prerequisites

Scenario 0: Generating Vectors with ai_embedding

Problem

SQL Implementation

Batch Vector Generation

Scenario 1: Semantic Similarity Search

Problem

SQL Implementation

Scenario 2: Comparing Distance Functions

Problem

SQL Implementation

Scenario 3: Vector Normalization and Baseline Vectors

Problem

SQL Implementation

Scenario 4: Binary Quantization and Hamming Distance

Problem

SQL Implementation

Scenario 5: Jaccard Similarity (Set Overlap)

Problem

SQL Implementation

Scenario 7: Accelerating Retrieval with a Vector Index

Problem

Why Do You Need a Vector Index?

SQL Implementation

Verifying the Index

Vector Index + Text Filtering

Scenario 8: End-to-End Example — RAG Knowledge Base Retrieval

Problem

SQL Implementation

Common Issues

1. Vector type cannot be output directly

2. Dimension consistency for ai_embedding-generated vectors

3. vector() function parameter format

4. Cosine distance vs. cosine similarity

5. Vector dimensions must match

6. fill_vector parameter order

7. Vector arithmetic functions

Performance Optimization Tips

Related Documentation

2. Dimension consistency for `ai_embedding`-generated vectors

3. `vector()` function parameter format

6. `fill_vector` parameter order