Vector Search and RAG Applications Guide


Quick Reference

FunctionPurposeReturn ValueUse Case
ai_embedding(model, text)Convert text to a high-dimensional vectorVECTOR(N)RAG retrieval, semantic search, vector storage
vector(v1, v2, ...)Create a vectorVECTOR typeConstructing query vectors, test data
cosine_distance(a, b)Cosine distance (0–2)FLOATText semantic similarity, recommendation systems
l2_distance(a, b)Euclidean distance (≥0)FLOATImage feature matching, spatial distance
dot_product(a, b)Vector dot productFLOATRecommendation system scoring
l2_norm(v)L2 norm (vector length)FLOATVector normalization preprocessing
l2_normalize(v)L2 normalizationVECTOR typeConvert dot product to cosine similarity
fill_vector(n, val)Create a filled vectorVECTOR typeZero vector / unit vector baseline
binary_quantize(v)Binary quantization of a vectorVECTOR(tinyint, N)Dimensionality reduction, Hamming distance search

Prerequisites

All examples in this guide use the following test data:

-- Product semantic vector table -- In production, use the ai_embedding function to convert product descriptions to high-dimensional vectors: -- UPDATE product_embeddings -- SET embedding = ai_embedding('endpoint:my_embedding_model', description); CREATE TABLE product_embeddings ( id BIGINT, name VARCHAR(100), description VARCHAR(500), category VARCHAR(50), embedding VECTOR(768), -- Create a vector index (HNSW algorithm) to accelerate cosine distance retrieval INDEX idx_embedding(embedding) USING VECTOR PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ) ); -- Sample data: in production, vectors are generated by an embedding model. -- Simplified vectors are used here for demonstration. -- Note: real-world vector dimensions are typically 768, 1024, or 1536. INSERT INTO product_embeddings (id, name, description, category, embedding) VALUES (1, 'Apple', 'Fresh red apple from organic farm, sweet and juicy.', 'Fruit', vector(0.9, 0.1, 0.0, 0.0) || fill_vector(764, 0.0)), (2, 'Banana', 'Ripe yellow banana, rich in potassium and energy.', 'Fruit', vector(0.1, 0.9, 0.0, 0.0) || fill_vector(764, 0.0)), (3, 'Carrot', 'Organic carrot, crunchy and full of vitamins.', 'Vegetable',vector(0.0, 0.0, 0.9, 0.1) || fill_vector(764, 0.0)), (4, 'Dog', 'Friendly golden retriever, great family pet.', 'Animal', vector(0.1, 0.0, 0.1, 0.8) || fill_vector(764, 0.0)), (5, 'Elephant', 'Majestic elephant, largest land animal.', 'Animal', vector(0.0, 0.1, 0.0, 0.9) || fill_vector(764, 0.0));


Scenario 0: Generating Vectors with ai_embedding

Problem

How to convert text descriptions into high-dimensional vectors for subsequent similarity search.

SQL Implementation

-- Use ai_embedding to convert product descriptions to 768-dimensional vectors SELECT id, name, description, ai_embedding('endpoint:my_embedding_model', description, JSON '{"dimensions": "768"}') AS embedding FROM products WHERE id <= 5;

Batch Vector Generation

-- Batch-generate vectors for existing data UPDATE product_embeddings_768 SET embedding = ai_embedding('endpoint:my_embedding_model', description, JSON '{"dimensions": "768"}') WHERE embedding IS NULL;


Problem

Given a query vector (e.g., the embedding for "fruit"), find the most similar products.

SQL Implementation

-- Given a query text, convert it to a vector with ai_embedding and then search -- In production, the query vector is also generated by ai_embedding: -- ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit') SELECT id, name, category, description, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh sweet fruit')) AS similarity FROM product_embeddings ORDER BY similarity ASC LIMIT 3;

Output:

idnamecategorydescriptionsimilarity
1AppleFruitFresh red apple...0.009
2BananaFruitRipe yellow banana...0.652
4DogAnimalFriendly golden retriever...0.881

Scenario 2: Comparing Distance Functions

Problem

Different distance metrics suit different business scenarios; you need to compare their results.

SQL Implementation

-- Compare the effect of different distance metrics -- The query vector is generated in real time by ai_embedding SELECT id, name, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS cos_dist, l2_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS l2_dist, dot_product(embedding, ai_embedding('endpoint:my_embedding_model', 'fresh fruit')) AS dot_prod FROM product_embeddings ORDER BY cos_dist ASC;

Output:

idnamecos_distl2_distdot_prod
1Apple0.0090.1410.74
2Banana0.6520.9900.26
3Carrot0.9571.2730.00
4Dog0.8811.0860.10
5Elephant0.9491.2250.02

Scenario 3: Vector Normalization and Baseline Vectors

Problem

Create a zero vector as a baseline, or compute similarity after normalizing vectors.

SQL Implementation

-- Create a zero vector and compute distances SELECT id, name, l2_distance(embedding, fill_vector(768, 0.0)) AS dist_to_zero, l2_norm(embedding) AS vector_length FROM product_embeddings ORDER BY dist_to_zero ASC LIMIT 3;

Output:

idnamedist_to_zerovector_length
4Dog0.8120.812
1Apple0.9060.906
2Banana0.9060.906

Scenario 4: Binary Quantization and Hamming Distance

Problem

Quantize floating-point vectors into 0/1 binary vectors for fast approximate search.

SQL Implementation

-- Compute Hamming distance after binary quantization (avoid outputting vector columns directly) SELECT a.id AS id_a, a.name AS name_a, b.id AS id_b, b.name AS name_b, hamming_distance( binary_quantize(a.embedding), binary_quantize(b.embedding) ) AS hamming_dist FROM product_embeddings a JOIN product_embeddings b ON a.id < b.id ORDER BY hamming_dist ASC LIMIT 5;

Output:

id_aname_aid_bname_bhamming_dist
3Carrot4Dog1.0
1Apple2Banana2.0
...............

Scenario 5: Jaccard Similarity (Set Overlap)

Problem

Compute the Jaccard distance between two binary vectors (representing sets) to measure set overlap.

SQL Implementation

-- Jaccard distance requires vectors of tinyint type -- Distance = 0 means identical; distance = 1 means completely different SELECT jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_dist, -- Jaccard similarity = 1 - distance 1 - jaccard_distance(vector(1Y, 1Y, 0Y, 0Y), vector(1Y, 0Y, 1Y, 0Y)) AS jaccard_sim

Output:

jaccard_distjaccard_sim
0.6670.333

Scenario 7: Accelerating Retrieval with a Vector Index

Problem

When data reaches millions of rows or more, brute-force computation of all vector distances leads to extremely slow queries. A vector index is essential for large-scale vector retrieval and can reduce query time from minutes to milliseconds.

Why Do You Need a Vector Index?

ScenarioNo Index (Brute Force)With Index (HNSW)
Query methodFull table scan, computes all vector distancesApproximate nearest neighbor (ANN), computes only a subset of vectors
Time complexityO(N), grows linearly with data volumeO(log N), grows logarithmically with data volume
Small dataset (< 10,000 rows)Performance difference is negligiblePerformance difference is negligible
Large dataset (> 100,000 rows)Query time increases significantlyQuery time remains roughly constant

SQL Implementation

Option 1: Declare the vector index at table creation (recommended)

CREATE TABLE product_embeddings ( id BIGINT, name VARCHAR(100), embedding VECTOR(768), INDEX idx_embedding(embedding) USING VECTOR PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ) );

Option 2: Add a vector index to an existing table

-- Create the index after the table and data already exist CREATE VECTOR INDEX idx_embedding ON TABLE product_embeddings(embedding) PROPERTIES ( 'scalar.type' = 'f32', 'distance.function' = 'cosine_distance' ); -- Existing data requires a manual build BUILD INDEX idx_embedding ON product_embeddings;

Verifying the Index

-- View indexes on the table SHOW INDEX IN product_embeddings;

Output:

index_nameindex_type
idx_embeddingvector

Vector Index + Text Filtering

-- Issue: combining a vector index with WHERE filters on other columns degrades to brute-force retrieval -- Option 1: use a subquery to do vector retrieval first, then filter in the outer query SELECT id, name, dist FROM ( SELECT id, name, cosine_distance(embedding, query_vec) as dist FROM product_embeddings WHERE cosine_distance(embedding, query_vec) < 0.5 ORDER BY dist LIMIT 100 ) WHERE name LIKE '%Apple%'; -- Option 2: combine vector index with inverted index SET cz.sql.index.prewhere.enabled=true; SELECT id, name, cosine_distance(embedding, query_vec) as dist FROM product_embeddings WHERE match_regexp(name, '.*Apple.*', map('analyzer', 'keyword')) AND cosine_distance(embedding, query_vec) < 0.5 ORDER BY dist LIMIT 100;


Scenario 8: End-to-End Example — RAG Knowledge Base Retrieval

Problem

Build a simple RAG (Retrieval-Augmented Generation) knowledge base retrieval flow:

  1. Convert the user query to a vector
  2. Retrieve the Top K most relevant documents
  3. Filter results by category

SQL Implementation

-- RAG retrieval flow: -- 1. User query → ai_embedding converts it to a vector -- 2. Vector retrieval → Top K relevant documents -- 3. Retrieved results + original query → LLM generates an answer SELECT id, name, category, description, cosine_distance(embedding, ai_embedding('endpoint:my_embedding_model', 'large land animal')) AS relevance FROM product_embeddings WHERE category = 'Animal' -- optional: filter by category ORDER BY relevance ASC LIMIT 2;

Output:

idnamecategoryrelevance
4DogAnimal0.000
5ElephantAnimal0.134

Common Issues

1. Vector type cannot be output directly

-- Wrong: querying a vector column directly will error SELECT embedding FROM product_embeddings; -- Error: data type vector(float,4) is not currently supported by textfile writer -- Correct: use with a distance function, or output only scalar results SELECT id, name, cosine_distance(embedding, vector(0.5, 0.5, 0.5, 0.5)) AS dist FROM product_embeddings; -- Correct: compute distance after binary quantization SELECT l2_distance(binary_quantize(embedding), binary_quantize(vector(0.5, 0.5, 0.5, 0.5))) AS bin_dist FROM product_embeddings;

2. Dimension consistency for ai_embedding-generated vectors

-- Wrong: query vector dimension does not match stored vector dimension cosine_distance(embedding, ai_embedding('endpoint:my_model', 'query')) -- embedding is 768-dimensional; ai_embedding default may be 1536-dimensional -- Correct: specify the dimensions parameter to ensure consistency ai_embedding('endpoint:my_model', 'query', JSON '{"dimensions": "768"}')

3. vector() function parameter format

-- Wrong: passing a string array vector('[0.9, 0.1, 0.0, 0.0]') -- Correct: pass individual numeric arguments vector(0.9, 0.1, 0.0, 0.0)

4. Cosine distance vs. cosine similarity

-- cosine_distance returns a distance value (0–2), not a similarity score -- Similarity = 1 - distance SELECT name, 1 - cosine_distance(embedding, query_vec) AS cosine_similarity FROM product_embeddings; -- Verification: identical vectors → distance=0, orthogonal → distance=1, opposite → distance=2 SELECT cosine_distance(vector(1.0, 0.0), vector(1.0, 0.0)) AS same, -- 0 cosine_distance(vector(1.0, 0.0), vector(0.0, 1.0)) AS ortho, -- 1 cosine_distance(vector(1.0, 0.0), vector(-1.0, 0.0)) AS opp; -- 2

5. Vector dimensions must match

-- Wrong: dimension mismatch will error cosine_distance(vector(1.0, 2.0), vector(1.0, 2.0, 3.0)) -- Correct: ensure the query vector has the same dimension as the stored vector cosine_distance(embedding, vector(0.8, 0.2, 0.0, 0.0)) -- 4-dim vs 4-dim

6. fill_vector parameter order

-- Wrong: parameters are in the wrong order fill_vector(vector(0.0, 0.0), 4, 0.5) -- Correct: fill_vector(dimension, fill_value) fill_vector(4, 0.0) -- create a 4-dimensional zero vector

7. Vector arithmetic functions

-- vector_add_scalar: add a scalar to each element of a vector (second parameter must be an integer) SELECT l2_distance( vector_add_scalar(vector(1.0, 2.0, 3.0), 2), vector(3.0, 4.0, 5.0) ) AS dist; -- result: 0, confirming correct computation -- l2_normalize: L2 norm ≈ 1 after normalization SELECT l2_norm(l2_normalize(vector(1.0, 2.0, 3.0, 4.0))) AS norm; -- result: 0.99999994


Performance Optimization Tips

ScenarioOptimization Strategy
Large-scale vector retrievalCreate a vector index (HNSW) to avoid full table scans
Binary quantization speedupUse binary_quantize to reduce dimensionality; Hamming distance is faster to compute
Pre-filteringUse WHERE to filter by category or time range before computing vector distances
Batch queriesUse JOIN instead of multiple single-row queries
Vector index tuningAdjust the cz.vector.index.search.ef parameter to balance performance and accuracy

-- Recommended: filter first, then compute distances SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings WHERE category = 'Fruit' -- narrow the scope first ORDER BY dist ASC LIMIT 3; -- Not recommended: compute distances across the full table, then filter SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings ORDER BY dist ASC LIMIT 100; -- then filter by category in the application layer -- Recommended: use vector index to accelerate SET cz.vector.index.search.ef=64; -- adjust the exploration factor SELECT id, name, cosine_distance(embedding, query_vec) AS dist FROM product_embeddings WHERE cosine_distance(embedding, query_vec) < 0.5 -- distance threshold triggers the index ORDER BY dist ASC LIMIT 10;