Lakehouse VECTOR Data Processing User Guide

Native Support and Practice for Vector Data


Introduction

Vector Data Processing Scenarios You May Encounter

Current Common Practice in Hive/Spark:

-- Storing document vectors in Hive
CREATE TABLE document_embeddings (
    doc_id STRING,
    title STRING, 
    content STRING,
    embedding ARRAY<DOUBLE>  -- Store vectors generated by LLM
) PARTITIONED BY (dt STRING);

-- Similarity calculation requires complex UDF
CREATE FUNCTION cosine_similarity_udf(vec1 ARRAY<DOUBLE>, vec2 ARRAY<DOUBLE>) 
RETURNS DOUBLE
LANGUAGE PYTHON AS $
import numpy as np
def cosine_similarity_udf(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
$;

-- Query requires manual dimension check and UDF call
SELECT doc_id, title,
    cosine_similarity_udf(embedding, array(0.1, 0.2, 0.3, ...)) as score
FROM document_embeddings 
WHERE size(embedding) = 1536  -- Manual dimension check
    AND dt >= '2024-01-01'
ORDER BY score DESC 
LIMIT 10;

Technical Challenges:

  • Complex implementation: Requires custom UDFs to implement distance calculation functions
  • No dedicated index: Array types cannot have similarity-specific indexes
  • Type safety issues: Array length is not fixed, prone to dimension mismatch errors
  • Performance limitations: Lack of specialized optimization for large-scale vector computation

Scenario 2: Recommendation System User Feature Matching

Typical Implementation in Spark:

# User vector similarity calculation in Spark
from pyspark.sql.functions import *
from pyspark.ml.linalg import Vectors

# Requires custom function to implement similarity calculation
def cosine_similarity_udf(vec1, vec2):
    from pyspark.ml.linalg import Vectors
    import numpy as np
    v1 = np.array(vec1)
    v2 = np.array(vec2)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Register UDF
cosine_udf = udf(cosine_similarity_udf, DoubleType())

# Query similar users
similar_users = user_features.crossJoin(user_features.alias("other")) \
    .filter(col("user_id") != col("other.user_id")) \
    .withColumn("similarity", cosine_udf(col("features"), col("other.features"))) \
    .filter(col("similarity") > 0.8) \
    .orderBy(desc("similarity"))

Common Technical Pain Points:

  • Complex architecture: Vector computation is separated from data storage, increasing system complexity
  • Development cost: Requires implementing and maintaining custom similarity calculation logic
  • Performance bottleneck: Lack of dedicated acceleration mechanisms for large-scale vector computation
  • Error handling: No compile-time check for vector dimensions

Why Vector Data Type is Needed

Based on the technical challenges in the above scenarios, the design goals of the VECTOR data type are:

Solving Type Safety Issues:

  • Provide fixed-dimension vector type definition
  • Compile-time dimension matching check
  • Avoid runtime dimension mismatch errors

Simplifying Development Process:

  • Built-in common vector distance calculation functions
  • No need to write and maintain custom UDFs
  • Fully compatible with standard SQL syntax

Improving Query Performance:

  • Support for dedicated vector indexes
  • Optimized vector computation
  • Support for efficient similarity search

How to Use This Guide

Reader RoleRecommended FocusExpected Takeaway
Data EngineerType Definition -> Function Usage -> Pitfall GuideMaster correct VECTOR usage methods
System ArchitectApplication Scenarios -> Performance Optimization -> Migration StrategyObtain architecture design and migration planning references
AI Application DeveloperQuick Start -> Application Patterns -> Query LimitationsUnderstand best practices in AI scenarios

Quick Start

Basic Usage Flow

-- 1. Create a table with a vector column
CREATE TABLE document_embeddings (
    id BIGINT,
    content STRING,
    embedding VECTOR(FLOAT, 1536)  -- 1536-dimensional float vector
);

-- 2. Insert vector data
INSERT INTO document_embeddings VALUES 
(1, 'Artificial Intelligence Technology Development', vector(0.1, 0.2, 0.3, ...));  -- Requires providing complete 1536-dimensional data

-- 3. Create a vector index
CREATE VECTOR INDEX embedding_idx ON TABLE document_embeddings(embedding) 
PROPERTIES(
    "distance.function" = "cosine_distance",
    "scalar.type" = "f32"
);

-- 4. Build index (for existing data)
BUILD INDEX embedding_idx ON document_embeddings;

-- 5. Execute similarity query (recommended pattern)
SELECT id, content FROM document_embeddings 
WHERE COSINE_DISTANCE(embedding, vector(0.1, 0.2, 0.3, ...)) < 0.8
ORDER BY id LIMIT 10;

Technical Comparison: ARRAY vs VECTOR

Traditional ARRAY Implementation

-- Using ARRAY to store vectors
CREATE TABLE docs_array (
    content STRING,
    embedding ARRAY<DOUBLE>
);

-- Requires custom function to implement similarity calculation
CREATE FUNCTION cosine_similarity(vec1 ARRAY<DOUBLE>, vec2 ARRAY<DOUBLE>) 
RETURNS DOUBLE
LANGUAGE PYTHON AS $$
def cosine_similarity(vec1, vec2):
    import numpy as np
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
$$;

-- Query requires function call and manual dimension check
SELECT doc_id, cosine_similarity(embedding, array(0.1, 0.2, ...)) as score
FROM docs_array
WHERE array_size(embedding) = 1536  -- Requires manual dimension check
ORDER BY score DESC;

VECTOR Native Implementation

-- Using VECTOR type
CREATE TABLE docs_vector (
    content STRING,
    embedding VECTOR(FLOAT, 1536),  -- Type and dimension explicitly defined
    
    INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
        "distance.function" = "cosine_distance"
    )
);

-- Using built-in functions, dimension check is automatic
SELECT doc_id FROM docs_vector 
WHERE COSINE_DISTANCE(embedding, vector(0.1, 0.2, ...)) < 0.8
ORDER BY doc_id LIMIT 10;

Feature Comparison

FeatureARRAY SolutionVECTOR SolutionImpact on Users
Type SafetyRuntime checkCompile-time checkReduces production environment errors
Dimension ControlManual verificationType definition guaranteedAvoids dimension mismatch issues
Distance CalculationCustom UDFBuilt-in functionsReduces code development and maintenance
Index SupportNo dedicated indexVector indexImproves large-scale query performance
Development ComplexityRequires implementation logicDirectly use built-in functionsLowers technical barrier
Maintenance CostRequires maintaining UDF codeNative database supportReduces operational workload

Migration Comparison from Spark MLlib

Spark MLlib APISingdata VECTORMigration ComplexityDescription
Vectors.dense(array)vector(v1, v2, ...)EasyDirect function replacement
Vectors.norm(v, 2)L2_DISTANCE(v, vector(0,...))MediumRequires providing zero vector
Custom cosine UDFCOSINE_DISTANCE(v1, v2)EasyRemove UDF, use built-in function
Custom normalize UDFL2_NORMALIZE(v)EasyRemove UDF, use built-in function
Vectors.sqdist(v1, v2)POW(L2_DISTANCE(v1, v2), 2)EasyCombine built-in functions

VECTOR Data Type Specification

Type Definition Syntax

VECTOR(scalar_type, dimension)

Supported Scalar Types

Scalar TypeStorage SizeValue RangeApplicable ScenarioIndex Support
FLOAT4 bytes/dim32-bit floating pointGeneral vector storage, e.g. embeddingsf32
INT4 bytes/dim32-bit integerInteger feature vectors, e.g. countsNot supported
TINYINT1 byte/dim-128 to 127Compressed storage, e.g. binary vectorsi8

Vector Creation Methods

-- Direct vector creation
SELECT vector(1.0, 2.0, 3.0);              -- Create 3-dimensional float vector
SELECT vector(1, 2, 3);                     -- Create 3-dimensional integer vector

-- Conversion from other types
SELECT cast(array(1,2,3) as VECTOR(INT, 3));     -- Array to vector
SELECT cast('[1,2,3]' as VECTOR(INT, 3));        -- JSON string to vector

-- Usage in table definition
CREATE TABLE feature_vectors (
    id BIGINT,
    dense_features VECTOR(FLOAT, 128),       -- Dense features
    sparse_features VECTOR(INT, 1000)        -- Sparse features
);

Vector Functions

Distance Calculation Functions

L2_DISTANCE - Euclidean Distance

SELECT L2_DISTANCE(vector(1,2,3), vector(4,5,6)) as euclidean_distance;
-- Result: 5.196152210235596

Formula: sqrt(sum((ai - bi)2)) Applicable Scenarios: Spatial data analysis, clustering algorithms, image feature matching

COSINE_DISTANCE - Cosine Distance

SELECT COSINE_DISTANCE(vector(1,2,3), vector(4,5,6)) as cosine_distance;
-- Result: 0.025368213653564453

Formula: 1 - (a.b)/(|a|*|b|) Applicable Scenarios: Text semantic similarity, recommendation systems, information retrieval

DOT_PRODUCT - Dot Product

SELECT DOT_PRODUCT(vector(1,2,3), vector(4,5,6)) as dot_product;
-- Result: 32.0

Formula: sum(ai * bi) Applicable Scenarios: Correlation calculation, neural network weight computation

Vector Processing Functions

L2_NORMALIZE - Vector Normalization

-- IMPORTANT: Must use float vectors
SELECT L2_NORMALIZE(vector(3.0, 4.0)) as normalized_vector;
-- Result: [0.6, 0.8]

-- Incorrect example: Integer vectors may cause precision issues
SELECT L2_NORMALIZE(vector(3, 4)) as normalized_vector;
-- Possible result: [0, 0]

Function: Normalize vector to unit length (L2 norm = 1) Applicable Scenarios: Preprocessing for cosine similarity calculation

Function Usage Notes

  • All distance functions require input vectors to have the same dimension; otherwise, a compile-time error will be raised
  • Functions support mixed computation of vector type and array type
  • Normalization function returns a vector type of the same dimension
  • L2_NORMALIZE function is sensitive to data type; float vectors are recommended

Type Conversion and Compatibility

Automatic Type Conversion

-- Vectors can be implicitly converted to arrays for array operations
SELECT array_size(vector(1,2,3,4)) as vector_size;
-- Result: 4

-- Arrays can be used in vector functions
SELECT L2_DISTANCE(array(1,2,3), vector(1,2,3)) as distance;
-- Result: 0.0

Explicit Type Conversion

-- Array to vector (dimensions must match)
SELECT cast(array(1.0, 2.0, 3.0) as VECTOR(FLOAT, 3));
-- Result: [1.0, 2.0, 3.0]

-- JSON string to vector
SELECT cast('[1,2,3]' as VECTOR(INT, 3));
-- Result: [1, 2, 3]

-- Returns NULL when dimensions don't match
SELECT cast(array(1,2) as VECTOR(INT, 3));  -- Returns NULL

SQL Operation Support

Vector type supports standard SQL operations:

-- Sorting
SELECT * FROM vectors ORDER BY feature_vector;

-- Grouping
SELECT feature_vector, COUNT(*) FROM vectors GROUP BY feature_vector;

-- Comparison
SELECT * FROM vectors WHERE feature_vector = vector(1,2,3);

-- Deduplication
SELECT DISTINCT feature_vector FROM vectors;

Vector Index

Index Configuration Parameters

Parameter NameAvailable ValuesDefault ValueDescription
distance.functioncosine_distance, l2_distance, jaccard_distance, hamming_distancecosine_distanceDistance function type
scalar.typef32, f16, i8, b1f32Storage type of vector elements in the index
MPositive integer16Maximum number of connections for HNSW algorithm
ef_constructionPositive integer200Candidate set size during index construction

Index Creation and Management

-- Create vector index (recommended syntax)
CREATE VECTOR INDEX embedding_idx ON TABLE table_name(vector_column) 
PROPERTIES(
    "distance.function" = "cosine_distance",  -- Full function name
    "scalar.type" = "f32",                   -- Standard type name
    "M" = 16,
    "ef_construction" = 200
);

-- Build index for existing data
BUILD INDEX embedding_idx ON table_name;

-- View index status
SHOW INDEXES ON table_name;

-- Drop index
DROP INDEX embedding_idx ON table_name;

Index Performance Tuning

Performance characteristics of different parameter configurations:

Configuration ScenarioM Valueef_constructionscalar.typeCharacteristics
Fast Retrieval864f16Lower construction time and memory usage
Balanced16200f32Balanced precision and performance
High Precision32400f32Higher retrieval precision, higher resource consumption

Mainstream AI Model Adaptation

Vector Dimensions of Common Embedding Models

Model ProviderModel NameDimensionRecommended Configuration
OpenAItext-embedding-3-small1536VECTOR(FLOAT, 1536)
OpenAItext-embedding-3-large3072VECTOR(FLOAT, 3072)
GoogleUniversal Sentence Encoder512VECTOR(FLOAT, 512)
BaiduERNIE-Embed768VECTOR(FLOAT, 768)
ZhipuGLM-Embedding1024VECTOR(FLOAT, 1024)

Multimodal Model Support

Model TypeTypical DimensionApplication Scenario
CLIP512/768Image-text matching, multimodal retrieval
BLIP768Image Q&A, visual understanding
Vision Transformer768/1024Image classification, feature extraction

Application Scenario Design

Document Retrieval System

-- Knowledge base table design
CREATE TABLE knowledge_base (
    doc_id BIGINT PRIMARY KEY,
    title STRING,
    content STRING,
    embedding VECTOR(FLOAT, 1536),
    created_at TIMESTAMP,
    
    INDEX doc_embedding_idx (embedding) USING VECTOR PROPERTIES (
        "scalar.type" = "f32",
        "distance.function" = "cosine_distance"
    )
);

-- Semantic retrieval query (recommended pattern)
SELECT doc_id, title FROM knowledge_base 
WHERE COSINE_DISTANCE(embedding, ?) < 0.7
ORDER BY doc_id LIMIT 10;

Recommendation System

-- User profile table
CREATE TABLE user_profiles (
    user_id BIGINT PRIMARY KEY,
    behavior_vector VECTOR(FLOAT, 128),
    preference_vector VECTOR(FLOAT, 64),
    last_updated TIMESTAMP,
    
    INDEX behavior_idx (behavior_vector) USING VECTOR PROPERTIES (
        "distance.function" = "cosine_distance"
    )
);

-- Item feature table
CREATE TABLE item_features (
    item_id BIGINT PRIMARY KEY,
    content_vector VECTOR(FLOAT, 256),
    category_vector VECTOR(INT, 50),
    
    INDEX content_idx (content_vector) USING VECTOR PROPERTIES (
        "distance.function" = "cosine_distance"
    )
);

-- Find similar users (recommended query pattern)
SELECT target.user_id as similar_user FROM user_profiles source
CROSS JOIN user_profiles target
WHERE source.user_id = ?
    AND target.user_id != source.user_id
    AND COSINE_DISTANCE(source.behavior_vector, target.behavior_vector) < 0.6
ORDER BY target.user_id LIMIT 20;

Image Feature Matching

-- Image feature library
CREATE TABLE image_features (
    image_id BIGINT PRIMARY KEY,
    image_path STRING,
    visual_features VECTOR(FLOAT, 2048),
    semantic_features VECTOR(FLOAT, 512),
    upload_time TIMESTAMP,
    
    INDEX visual_idx (visual_features) USING VECTOR PROPERTIES (
        "distance.function" = "l2_distance",
        "scalar.type" = "f16"
    ),
    INDEX semantic_idx (semantic_features) USING VECTOR PROPERTIES (
        "distance.function" = "cosine_distance"
    )
);

-- Similar image retrieval
SELECT image_id, image_path FROM image_features
WHERE L2_DISTANCE(visual_features, ?) < 100.0
ORDER BY image_id LIMIT 50;

Important Pitfall Avoidance Guide

Vector Query Limitations

Through actual testing in the UAT environment, two different error patterns were discovered:

1. Dimension Mismatch Error (Expected Behavior)

-- Incorrect example: Using a 5-dimensional vector to query 128-dimensional data
SELECT id FROM embeddings_table 
WHERE COSINE_DISTANCE(embedding_128d, vector(0.1, 0.2, 0.3, 0.4, 0.5)) < 0.8;

-- Error message:
-- function 'COSINE_DISTANCE' cannot be resolved, 
-- expect 'vector shall have the same dimension'

Solution: Ensure the query vector has exactly the same dimension as the vector in the table

2. Generated Column Internal Error (Current Limitation)

-- Problematic query: Returning distance calculation in SELECT
SELECT id, content,
    COSINE_DISTANCE(embedding, vector(...)) as distance
FROM embeddings_table 
ORDER BY distance LIMIT 10;

-- Possible error:
-- Generated column 4292967295(generated_field_4292967295) is not filled for VECTOR

Currently Available Query Patterns:

-- Recommended: Vector filtering in WHERE clause
SELECT id, content FROM embeddings_table 
WHERE COSINE_DISTANCE(embedding, vector(...)) < 0.8
ORDER BY id LIMIT 10;

-- Available: Simple vector functions
SELECT COSINE_DISTANCE(embedding, embedding) as self_distance
FROM embeddings_table;

-- Avoid: Complex SELECT returning distance calculation with sorting

Function Usage Pitfalls

L2_NORMALIZE Function Type Requirements

-- Correct: Using float vectors
SELECT L2_NORMALIZE(vector(3.0, 4.0)) as normalized;
-- Result: [0.6, 0.8]

-- Incorrect: Using integer vectors
SELECT L2_NORMALIZE(vector(3, 4)) as normalized;
-- Possible result: [0, 0] (precision loss)

Index Parameter Specifications

-- Correct: Using full parameter names
CREATE VECTOR INDEX idx ON TABLE t(col) PROPERTIES(
    "distance.function" = "cosine_distance",  -- Full function name
    "scalar.type" = "f32"                    -- Standard type name
);

-- Incorrect: Using abbreviated names
CREATE VECTOR INDEX idx ON TABLE t(col) PROPERTIES(
    "distance.function" = "cosine",          -- Will cause error
    "scalar.type" = "float"                  -- Will cause error
);

Best Practice Recommendations

Query Pattern Selection

Usage ScenarioRecommended PatternPattern to Avoid
Similarity FilteringWHERE COSINE_DISTANCE(...) < thresholdORDER BY COSINE_DISTANCE(...)
Top-K RetrievalWHERE filtering + simple sortingDistance calculation + sorting
Batch ComputationStep-by-step queriesComplex nested queries

Performance Optimization Strategies

-- Recommended: Filter query utilizing index
SELECT doc_id, title FROM knowledge_base 
WHERE COSINE_DISTANCE(embedding, ?) < 0.7
ORDER BY doc_id LIMIT 10;

-- Optional: Step-by-step Top-K retrieval
-- Step 1: Filter candidates
CREATE VIEW candidates AS 
SELECT doc_id, title FROM knowledge_base 
WHERE COSINE_DISTANCE(embedding, ?) < 0.8;

-- Step 2: Precise sorting (if needed)
SELECT doc_id, title FROM candidates ORDER BY doc_id LIMIT 10;

Migration Strategy

Migration Value Assessment

Before deciding to migrate to the VECTOR type, it is recommended to evaluate your usage scenarios:

High-Value Migration Scenarios

  • Frequent vector similarity calculations: Such as recommendation systems, document retrieval, image matching
  • Large-scale vector data: Vector data queries exceeding millions of records
  • Complex UDF maintenance: Currently maintaining multiple custom vector computation functions
  • High performance requirements: Applications with strict query response time requirements

Optional Migration Scenarios

  • Occasional vector calculations: Vector computation is not a core business process
  • Small-scale data: Vector data volume is small, performance pressure is not significant
  • Stable existing systems: Current system runs stably, change risk outweighs benefit

Migration Recommendations for Different Users

Spark User Migration Path

If you primarily use Spark MLlib for vector computation:

# Original Spark code
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Custom similarity function
def cosine_similarity(v1, v2):
    # Complex implementation logic
    pass

# Register UDF
cosine_udf = udf(cosine_similarity, DoubleType())

After Migrating to Singdata VECTOR:

-- Directly use built-in functions and recommended query patterns
SELECT user_id FROM user_features
WHERE COSINE_DISTANCE(user_vector, target_vector) < 0.8;

Migration Benefits:

  • Remove custom UDF code
  • Leverage vector indexes to accelerate queries
  • Unified vector computation in SQL

Hive User Migration Path

If you use ARRAY to store vectors in Hive:

-- Original Hive table structure
CREATE TABLE user_embeddings (
    user_id BIGINT,
    features ARRAY<DOUBLE>
) PARTITIONED BY (dt STRING);

-- Requires complex UDF
SELECT user_id, my_cosine_udf(features, target_array) as score
FROM user_embeddings 
WHERE size(features) = 128;  -- Manual dimension check

Migrating to VECTOR Type:

-- New table structure
CREATE TABLE user_embeddings_v2 (
    user_id BIGINT,
    features VECTOR(FLOAT, 128)  -- Type safe
) PARTITIONED BY (dt STRING);

-- Use built-in functions and recommended patterns
SELECT user_id FROM user_embeddings_v2
WHERE COSINE_DISTANCE(features, vector(...)) < 0.8;  -- No manual dimension check needed

Evaluating Existing Systems

Data Scale Assessment

-- Statistics on the scale and dimensions of existing vector data
SELECT 
    table_name,
    column_name,
    COUNT(*) as record_count,
    AVG(size(array_column)) as avg_dimension,
    MIN(size(array_column)) as min_dimension,
    MAX(size(array_column)) as max_dimension,
    COUNT(DISTINCT size(array_column)) as dimension_variants
FROM (
    SELECT 'user_features' as table_name, 'embedding' as column_name, embedding as array_column FROM user_features
    UNION ALL
    SELECT 'product_vectors' as table_name, 'features' as column_name, features as array_column FROM product_vectors
    -- Add other tables that need evaluation
) all_vectors
GROUP BY table_name, column_name;

UDF Complexity Assessment

-- Identify currently used vector-related UDFs
SHOW FUNCTIONS LIKE '*similarity*';
SHOW FUNCTIONS LIKE '*distance*';
SHOW FUNCTIONS LIKE '*cosine*';
SHOW FUNCTIONS LIKE '*euclidean*';

Query Frequency Assessment

-- Analyze the frequency of vector computation queries (if query logs are available)
SELECT 
    DATE(query_time) as query_date,
    COUNT(*) as vector_query_count
FROM query_logs 
WHERE query_text LIKE '%cosine%' 
    OR query_text LIKE '%similarity%'
    OR query_text LIKE '%distance%'
GROUP BY DATE(query_time)
ORDER BY query_date DESC
LIMIT 30;

Parallel Construction of New Tables

-- Create new vector table
CREATE TABLE embeddings_new (
    id BIGINT,
    content STRING,
    embedding VECTOR(FLOAT, 768),
    migrated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
    
    INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
        "distance.function" = "cosine_distance"
    )
);

-- Data migration
INSERT INTO embeddings_new (id, content, embedding)
SELECT 
    id,
    content,
    CASE 
        WHEN size(embedding_array) = 768 
        THEN cast(embedding_array as VECTOR(FLOAT, 768))
        ELSE NULL
    END as embedding
FROM embeddings_old
WHERE size(embedding_array) = 768;

-- Build index
BUILD INDEX embedding_idx ON embeddings_new;

Verifying Migration Results

-- Compare query result consistency (using recommended query patterns)
WITH old_results AS (
    SELECT id FROM embeddings_old 
    WHERE your_cosine_udf(embedding_array, ?) > 0.8
    ORDER BY id LIMIT 10
),
new_results AS (
    SELECT id FROM embeddings_new 
    WHERE COSINE_DISTANCE(embedding, ?) < 0.2  -- Note the distance vs. similarity conversion
    ORDER BY id LIMIT 10
)
SELECT 
    COALESCE(o.id, n.id) as id,
    CASE WHEN o.id IS NOT NULL THEN 'OLD' ELSE NULL END as in_old,
    CASE WHEN n.id IS NOT NULL THEN 'NEW' ELSE NULL END as in_new
FROM old_results o
FULL OUTER JOIN new_results n ON o.id = n.id
ORDER BY id;

Performance Optimization

Query Optimization

-- Recommended: Use threshold filtering to improve query efficiency
SELECT id, content FROM embeddings_table 
WHERE COSINE_DISTANCE(embedding, ?) < 0.8  -- Leverage vector index filtering
ORDER BY id LIMIT 10;

-- Optional: Step-by-step processing for complex queries
CREATE VIEW filtered_candidates AS
SELECT id, content FROM embeddings_table 
WHERE COSINE_DISTANCE(embedding, ?) < 0.8;

SELECT id, content FROM filtered_candidates 
ORDER BY id LIMIT 10;

Storage Optimization

-- Use half-precision storage for large-scale data
CREATE TABLE large_embeddings (
    id BIGINT,
    embedding VECTOR(FLOAT, 1536),
    
    INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
        "scalar.type" = "f16",  -- Half-precision storage saves memory
        "distance.function" = "cosine_distance"
    )
);

Index Tuning

Choose appropriate index parameters based on application requirements:

  • Real-time query scenarios: Smaller M value and ef_construction, use f16 type
  • Batch analysis scenarios: Larger M value and ef_construction, use f32 type
  • Storage optimization scenarios: Use f16 or i8 type to reduce memory footprint

Important Notes

Functional Limitations

  • Vector dimension is determined at creation time and does not support dynamic adjustment
  • Vectors cannot be used as partition keys
  • JACCARD_DISTANCE and HAMMING_DISTANCE functions have special type requirements
  • The current version has limitations on complex SELECT queries; it is recommended to use WHERE filtering patterns

Index Management

  • CREATE INDEX only applies to newly added data; existing data requires BUILD INDEX
  • The performance of vector indexes is closely related to memory caching
  • It is recommended to use a dedicated Virtual Cluster for vector queries

Troubleshooting

IssuePossible CauseSolution
Poor query performanceIndex not in effectCheck index status, execute BUILD INDEX
Dimension mismatch errorInconsistent vector dimensionsCheck VECTOR type definition and input data dimensions
Generated column errorComplex SELECT query limitationUse WHERE filtering pattern instead
L2_NORMALIZE returns zero vectorUsing integer vectorsSwitch to float vectors: vector(3.0, 4.0)
Index parameter errorNon-standard parameter namesUse full parameter names: cosine_distance
Out of memoryVector index consumes too much memoryAdjust index parameters or increase cluster resources

Summary

Core Value of VECTOR Data Type

Type Safety: Avoid runtime dimension mismatch errors through fixed dimension definition

Development Simplification: Provide built-in vector distance calculation functions, reducing custom code

Performance Optimization: Support dedicated vector indexes, improving large-scale vector query efficiency

Standard Compatibility: Compatible with SQL syntax, supporting sorting, grouping, and other standard operations

Applicable Scenarios

Recommended use of VECTOR:

  • Frequent vector similarity calculations needed
  • Fixed and clear vector dimensions
  • Desire to leverage vector indexes to improve query performance
  • Newly developed AI application projects

Continue using ARRAY:

  • Primarily performing array operations rather than vector computation
  • Frequently changing vector dimensions
  • High cost of retrofitting existing systems
  • Need for complex vector sorting queries (current limitation)

Implementation Recommendations

Evaluation Phase: Analyze the scale and dimensional characteristics of existing vector data

Pilot Validation: Select small-scale scenarios to validate VECTOR effectiveness

Gradual Migration: Use parallel construction approach for smooth migration of existing systems

Continuous Optimization: Adjust index parameters and query strategies based on actual usage

Query Pattern Recommendations

Prioritize:

  • Vector filtering in WHERE clause: WHERE COSINE_DISTANCE(...) < threshold
  • Simple vector function calculations: COSINE_DISTANCE(v1, v2)

Use with Caution:

  • Complex SELECT returning distance calculation with sorting
  • Nested queries on large vectors

Alternative Approaches:

  • Step-by-step queries: Filter first, then sort
  • Pre-computation: Batch-compute distances and store

Note: This document is based on the Lakehouse product documentation as of June 2025. It is recommended to check the official documentation regularly for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.