Lakehouse Vector Index Creation Guide

Overview

Lakehouse provides comprehensive vector search capabilities, including vector data types, search functions, and dedicated indexes. Vectors, as fixed-dimensional numerical collections, can represent various types of information such as LLM-generated text embeddings, image features, and time-series data. The system natively supports the VECTOR type, enabling efficient storage and querying of vector embeddings generated by large language models (LLMs) and other deep learning models. These vector embeddings convert unstructured data (such as text and images) into numerical representations that preserve semantic similarity. By building specialized vector indexes, Lakehouse can achieve millisecond-level similarity searches across massive datasets, providing high-performance infrastructure support for LLM-based RAG applications, intelligent Q&A, semantic search, and other AI scenarios.

1. Preparation

1.1 Create Test Schema

-- Create a schema dedicated to vector index testing
CREATE SCHEMA IF NOT EXISTS vector_demo COMMENT 'Vector index demo';

-- Switch to the newly created schema
USE SCHEMA vector_demo;

1.2 Prepare Vector Generation Function

Lakehouse supports creating AI functions that can convert text into vector representations. See: Create Embedding Function

-- Function signature
-- fc_embeddings(mode, text, api_key, model, dimension)
-- Parameter description:
-- mode: 'text' - text mode
-- text: the text content to convert
-- api_key: API key (please replace with your actual key)
-- model: model name, e.g., 'text-embedding-v4'
-- dimension: vector dimension, e.g., 1024

-- Test the vector generation function
SELECT public.fc_embeddings(
    'text', 
    'hello world', 
    '${API_KEY}', 
    'text-embedding-v4',
    1024
) as embedding;

2. Create a Table with Vector Columns

Create Product Description Table

-- Create product table with vector column (simplified version, avoids IDENTITY issues)
CREATE TABLE IF NOT EXISTS vector_demo.product_embeddings (
    product_id INT COMMENT 'Product ID',
    product_name VARCHAR(255) NOT NULL COMMENT 'Product name',
    category VARCHAR(100) COMMENT 'Product category',
    description STRING COMMENT 'Product description',
    embedding VECTOR(FLOAT, 1024) COMMENT 'Vector representation of product description',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP() COMMENT 'Creation time',
    
    -- Create vector index (simplified syntax)
    INDEX idx_product_embedding (embedding) USING VECTOR
) 
COMMENT 'Product vector table';

3. Insert Vector Data

3.1 Insert Data Using Vector Generation Function

Important Note: Vectors generated using the fc_embeddings function need to be explicitly cast to the VECTOR type.

-- Insert vector data, note that CAST conversion is required
INSERT INTO vector_demo.product_embeddings (product_id, product_name, category, description, embedding)
VALUES 
-- Smartphone category
(1, 'iPhone 15 Pro', 'Smartphone', 
    'Apple latest flagship phone, titanium design, A17 Pro chip, ray tracing gaming support, 48MP main camera, USB-C port',
    CAST(public.fc_embeddings('text', 
        'Apple latest flagship phone, titanium design, A17 Pro chip, ray tracing gaming support, 48MP main camera, USB-C port', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(2, 'Xiaomi 14 Pro', 'Smartphone', 
    'Xiaomi high-end flagship, Snapdragon 8 Gen3 processor, Leica imaging system, 120W fast charging, 2K AMOLED screen',
    CAST(public.fc_embeddings('text', 
        'Xiaomi high-end flagship, Snapdragon 8 Gen3 processor, Leica imaging system, 120W fast charging, 2K AMOLED screen', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(3, 'Huawei Mate 60 Pro', 'Smartphone', 
    'Huawei flagship phone, Kirin 9000S chip, satellite communication, HarmonyOS, Xuanwu architecture drop resistance',
    CAST(public.fc_embeddings('text', 
        'Huawei flagship phone, Kirin 9000S chip, satellite communication, HarmonyOS, Xuanwu architecture drop resistance', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Laptop category
(4, 'MacBook Pro M3 Max', 'Laptop', 
    'Apple professional laptop, M3 Max chip, up to 128GB unified memory, Mini-LED display, 22-hour battery life',
    CAST(public.fc_embeddings('text', 
        'Apple professional laptop, M3 Max chip, up to 128GB unified memory, Mini-LED display, 22-hour battery life', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(5, 'Lenovo ThinkPad X1 Carbon', 'Laptop', 
    'Business ultrabook, 13th Gen Intel Core processor, 14-inch 2.8K OLED screen, only 1.12kg weight',
    CAST(public.fc_embeddings('text', 
        'Business ultrabook, 13th Gen Intel Core processor, 14-inch 2.8K OLED screen, only 1.12kg weight', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(6, 'ASUS ROG Zephyrus 16', 'Laptop', 
    'Gaming laptop, RTX 4070 GPU, Intel i9-13900HX processor, 240Hz gaming display, liquid metal cooling',
    CAST(public.fc_embeddings('text', 
        'Gaming laptop, RTX 4070 GPU, Intel i9-13900HX processor, 240Hz gaming display, liquid metal cooling', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Headphones/Audio category
(7, 'AirPods Pro 2', 'Headphones', 
    'Active noise cancelling earbuds, adaptive audio, spatial audio, H2 chip, up to 30 hours battery life',
    CAST(public.fc_embeddings('text', 
        'Active noise cancelling earbuds, adaptive audio, spatial audio, H2 chip, up to 30 hours battery life', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(8, 'Sony WH-1000XM5', 'Headphones', 
    'Over-ear noise cancelling headphones, industry-leading noise cancellation, 30 hours battery life, LDAC high-quality audio transmission',
    CAST(public.fc_embeddings('text', 
        'Over-ear noise cancelling headphones, industry-leading noise cancellation, 30 hours battery life, LDAC high-quality audio transmission', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(9, 'Bose QC Ultra', 'Headphones', 
    'Flagship noise cancelling headphones, immersive spatial audio, CustomTune technology, 24 hours battery life',
    CAST(public.fc_embeddings('text', 
        'Flagship noise cancelling headphones, immersive spatial audio, CustomTune technology, 24 hours battery life', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Smartwatch category
(10, 'Apple Watch Ultra 2', 'Smartwatch', 
    'Outdoor sports watch, dual-band GPS, 100m water resistance, blood oxygen monitoring, up to 40m diving capability',
    CAST(public.fc_embeddings('text', 
        'Outdoor sports watch, dual-band GPS, 100m water resistance, blood oxygen monitoring, up to 40m diving capability', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(11, 'Huawei Watch GT4', 'Smartwatch', 
    'All-day health monitoring, 14-day ultra-long battery life, 100+ sports modes, scientific sleep monitoring',
    CAST(public.fc_embeddings('text', 
        'All-day health monitoring, 14-day ultra-long battery life, 100+ sports modes, scientific sleep monitoring', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(12, 'Samsung Galaxy Watch6', 'Smartwatch', 
    'Health tracking watch, body composition analysis, sleep guidance, 5ATM+IP68 water resistance',
    CAST(public.fc_embeddings('text', 
        'Health tracking watch, body composition analysis, sleep guidance, 5ATM+IP68 water resistance', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Tablet category
(13, 'iPad Pro 13-inch', 'Tablet', 
    'M4 chip tablet, OLED display, supports Apple Pencil Pro, professional creative tools',
    CAST(public.fc_embeddings('text', 
        'M4 chip tablet, OLED display, supports Apple Pencil Pro, professional creative tools', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(14, 'Xiaomi Pad 6 Pro', 'Tablet', 
    'Snapdragon 8+ processor, 144Hz high refresh rate screen, 67W fast charging, supports Xiaomi Smart Pen',
    CAST(public.fc_embeddings('text', 
        'Snapdragon 8+ processor, 144Hz high refresh rate screen, 67W fast charging, supports Xiaomi Smart Pen', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Smart Speaker category
(15, 'HomePod 2', 'Smart Speaker', 
    'Smart speaker, spatial audio, Siri voice assistant, smart home hub, temperature and humidity sensing',
    CAST(public.fc_embeddings('text', 
        'Smart speaker, spatial audio, Siri voice assistant, smart home hub, temperature and humidity sensing', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(16, 'XiaoAi Speaker Pro', 'Smart Speaker', 
    'XiaoAi voice assistant, DTS audio, smart home control, IR remote control feature',
    CAST(public.fc_embeddings('text', 
        'XiaoAi voice assistant, DTS audio, smart home control, IR remote control feature', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Camera category
(17, 'Sony A7R5', 'Camera', 
    'Full-frame mirrorless camera, 61MP, 8-stop stabilization, 8K video recording, AI autofocus system',
    CAST(public.fc_embeddings('text', 
        'Full-frame mirrorless camera, 61MP, 8-stop stabilization, 8K video recording, AI autofocus system', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(18, 'Canon EOS R5', 'Camera', 
    'Professional full-frame camera, 45MP, in-body stabilization, 8K RAW video, dual card slots',
    CAST(public.fc_embeddings('text', 
        'Professional full-frame camera, 45MP, in-body stabilization, 8K RAW video, dual card slots', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

-- Gaming Device category
(19, 'PlayStation 5 Pro', 'Gaming Console', 
    'Next-gen gaming console, 8K output, ray tracing, ultra-high-speed SSD, DualSense controller',
    CAST(public.fc_embeddings('text', 
        'Next-gen gaming console, 8K output, ray tracing, ultra-high-speed SSD, DualSense controller', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024))),

(20, 'Nintendo Switch OLED', 'Gaming Console', 
    'Portable gaming console, 7-inch OLED screen, enhanced audio, 64GB storage, detachable controllers',
    CAST(public.fc_embeddings('text', 
        'Portable gaming console, 7-inch OLED screen, enhanced audio, 64GB storage, detachable controllers', 
        '${API_KEY}', 'text-embedding-v4', 1024) AS VECTOR(FLOAT, 1024)))
;

4. Build Vector Index

4.1 Monitor Index Build Progress

-- View table index list
SHOW INDEXES FROM vector_demo.product_embeddings;

-- View table details, including indexes
DESC EXTENDED vector_demo.product_embeddings ;
WITH query_vector AS (
    SELECT CAST(public.fc_embeddings(
        'text', 
        'Phone with powerful camera capabilities, excellent camera system, outstanding imaging performance', 
        '${API_KEY}', 
        'text-embedding-v4',
        1024
    ) AS VECTOR(FLOAT, 1024)) as vec
)
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    p.description,
    cosine_distance(p.embedding, q.vec) as similarity_score
FROM vector_demo.product_embeddings p, query_vector q
ORDER BY similarity_score ASC
LIMIT 5;

Results:

WITH query_vector AS (
    SELECT CAST(public.fc_embeddings(
        'text', 
        'Domestic high-end flagship phone', 
        '${API_KEY}', 
        'text-embedding-v4',
        1024
    ) AS VECTOR(FLOAT, 1024)) as vec
)
SELECT 
    product_id,
    product_name,
    description,
    ROUND(cosine_distance(embedding, vec), 4) as distance
FROM vector_demo.product_embeddings, query_vector
WHERE product_name NOT LIKE '%iPhone%'  -- Exclude Apple products
ORDER BY distance ASC
LIMIT 3;

Results:

6. Vector Index Optimization Suggestions

6.1 Distance Function Selection

Distance FunctionUse CaseFeatures
cosine_distanceText similarityUnaffected by vector length, suitable for text embeddings
l2_distanceImage featuresEuclidean distance, sensitive to absolute differences
negative_dot_productRecommendation systemsNegative of dot product, suitable for rating prediction

6.2 Parameter Tuning

-- High-precision configuration (accurate but slower queries)
CREATE INDEX idx_high_precision (embedding) USING VECTOR 
PROPERTIES(
    "m" = "64",
    "ef.construction" = "500",
    "scalar.type" = "f32"
);

-- High-performance configuration (fast queries but slightly lower precision)
CREATE INDEX idx_high_performance (embedding) USING VECTOR 
PROPERTIES(
    "m" = "8",
    "ef.construction" = "100",
    "scalar.type" = "f16",
    "compress.codec" = "lz4"
);

6.3 Storage Optimization

-- Use vector column reuse to reduce storage
CREATE INDEX idx_storage_optimized (embedding) USING VECTOR 
PROPERTIES(
    "reuse.vector.column" = "true",
    "scalar.type" = "i8"  -- Use 8-bit integers for further compression
);

7. Manage Vector Index

7.1 View Index Information

-- View all indexes of the table
SHOW INDEXES FROM vector_demo.product_embeddings;

-- View index details
DESC INDEX idx_product_embedding ON vector_demo.product_embeddings;

7.2 Drop and Rebuild Index

-- Drop index
DROP INDEX idx_product_embedding ON vector_demo.product_embeddings;

-- Recreate index (with new parameters)
CREATE INDEX idx_product_embedding_v2 ON vector_demo.product_embeddings(embedding) 
USING VECTOR PROPERTIES(
    "distance.function" = "negative_dot_product",
    "m" = "24",
    "ef.construction" = "300"
);

-- Build new index
BUILD INDEX idx_product_embedding_v2 ON vector_demo.product_embeddings;

8. Performance Monitoring and Troubleshooting

8.1 Query Performance Analysis

-- Use EXPLAIN to analyze query plan
EXPLAIN 
WITH query_vector AS (
    SELECT public.fc_embeddings('text', 'Test query', '${API_KEY}', 'text-embedding-v4', 1024) as vec
)
SELECT * FROM vector_demo.product_embeddings p, query_vector q
ORDER BY cosine_distance(p.embedding, q.vec) ASC
LIMIT 10;

8.2 Common Issue Handling

  1. Index Build Failure

    • Check if memory is sufficient
    • Verify vector dimensions are consistent
    • Validate vector data integrity
  2. Slow Query Performance

    • Adjust m and ef.construction parameters
    • Consider using a smaller scalar.type
    • Check whether the index is being used correctly
  3. Insufficient Precision

    • Increase m value
    • Increase ef.construction
    • Use f32 instead of compressed format

Important Notes

  1. Vector Type Conversion: The vector result generated by the fc_embeddings function is of String type and must be cast to VECTOR(FLOAT, 1024) type using CAST.

  2. BUILD INDEX: For tables with existing data, you must execute the BUILD INDEX command to build the index; otherwise, vector search performance will be affected.

  3. Distance Function Selection:

    • cosine_distance: Suitable for text embeddings, smaller values indicate higher similarity
    • l2_distance: Euclidean distance, suitable for image features
    • negative_dot_product: Negative of dot product, suitable for recommendation systems
  4. Parameters Supported During Vector Search Queries

    NameDefault ValueNotes
    cz.storage.parquet.vector.index.read.memory.cachefalseWhether to use memory cache
    cz.storage.parquet.vector.index.read.local.cachefalseWhether to use local SSD cache
    cz.storage.parquet.vector.index.read.vectors.ondemandadaptiveWhether to load vector index on demand (slightly slower than using memory cache)
    cz.storage.parquet.vector.index.write.parallel0Whether to enable parallel writes, 0 means off, 8 means 8 threads. Note: performance improvement is not proportional to the number of threads.

    Usage example: Select and execute together when running SQL queries.

    SET cz.sql.index.prewhere.enabled=true; -- Currently this switch needs to be set; future versions will enable it by default
    SELECT id, doc, l2_distance(vec, vector(1,2,3,4)) as dist FROM some_table WHERE match_regexp(doc, '.*hello.*', map('analyzer', 'keyword')) AND l2_distance(vec, vector(1,2,3,4)) < 1000 ORDER BY dist LIMIT 100
  5. Billing

    Storage resources: Vector indexes create vector index files, which are stored together with data files in object storage and billed uniformly.