-- 1. Create a table with a vector column
CREATE TABLE document_embeddings (
id BIGINT,
content STRING,
embedding VECTOR(FLOAT, 1536) -- 1536-dimensional float vector
);
-- 2. Insert vector data
INSERT INTO document_embeddings VALUES
(1, 'Artificial Intelligence Technology Development', vector(0.1, 0.2, 0.3, ...)); -- Requires providing complete 1536-dimensional data
-- 3. Create a vector index
CREATE VECTOR INDEX embedding_idx ON TABLE document_embeddings(embedding)
PROPERTIES(
"distance.function" = "cosine_distance",
"scalar.type" = "f32"
);
-- 4. Build index (for existing data)
BUILD INDEX embedding_idx ON document_embeddings;
-- 5. Execute similarity query (recommended pattern)
SELECT id, content FROM document_embeddings
WHERE COSINE_DISTANCE(embedding, vector(0.1, 0.2, 0.3, ...)) < 0.8
ORDER BY id LIMIT 10;
Technical Comparison: ARRAY vs VECTOR
Traditional ARRAY Implementation
-- Using ARRAY to store vectors
CREATE TABLE docs_array (
content STRING,
embedding ARRAY<DOUBLE>
);
-- Requires custom function to implement similarity calculation
CREATE FUNCTION cosine_similarity(vec1 ARRAY<DOUBLE>, vec2 ARRAY<DOUBLE>)
RETURNS DOUBLE
LANGUAGE PYTHON AS $$
def cosine_similarity(vec1, vec2):
import numpy as np
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
$$;
-- Query requires function call and manual dimension check
SELECT doc_id, cosine_similarity(embedding, array(0.1, 0.2, ...)) as score
FROM docs_array
WHERE array_size(embedding) = 1536 -- Requires manual dimension check
ORDER BY score DESC;
VECTOR Native Implementation
-- Using VECTOR type
CREATE TABLE docs_vector (
content STRING,
embedding VECTOR(FLOAT, 1536), -- Type and dimension explicitly defined
INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
"distance.function" = "cosine_distance"
)
);
-- Using built-in functions, dimension check is automatic
SELECT doc_id FROM docs_vector
WHERE COSINE_DISTANCE(embedding, vector(0.1, 0.2, ...)) < 0.8
ORDER BY doc_id LIMIT 10;
Feature Comparison
Feature
ARRAY Solution
VECTOR Solution
Impact on Users
Type Safety
Runtime check
Compile-time check
Reduces production environment errors
Dimension Control
Manual verification
Type definition guaranteed
Avoids dimension mismatch issues
Distance Calculation
Custom UDF
Built-in functions
Reduces code development and maintenance
Index Support
No dedicated index
Vector index
Improves large-scale query performance
Development Complexity
Requires implementation logic
Directly use built-in functions
Lowers technical barrier
Maintenance Cost
Requires maintaining UDF code
Native database support
Reduces operational workload
Migration Comparison from Spark MLlib
Spark MLlib API
Singdata VECTOR
Migration Complexity
Description
Vectors.dense(array)
vector(v1, v2, ...)
Easy
Direct function replacement
Vectors.norm(v, 2)
L2_DISTANCE(v, vector(0,...))
Medium
Requires providing zero vector
Custom cosine UDF
COSINE_DISTANCE(v1, v2)
Easy
Remove UDF, use built-in function
Custom normalize UDF
L2_NORMALIZE(v)
Easy
Remove UDF, use built-in function
Vectors.sqdist(v1, v2)
POW(L2_DISTANCE(v1, v2), 2)
Easy
Combine built-in functions
VECTOR Data Type Specification
Type Definition Syntax
VECTOR(scalar_type, dimension)
Supported Scalar Types
Scalar Type
Storage Size
Value Range
Applicable Scenario
Index Support
FLOAT
4 bytes/dim
32-bit floating point
General vector storage, e.g. embeddings
f32
INT
4 bytes/dim
32-bit integer
Integer feature vectors, e.g. counts
Not supported
TINYINT
1 byte/dim
-128 to 127
Compressed storage, e.g. binary vectors
i8
Suggestion: FLOAT type is recommended for production environments, offering the best compatibility and performance
Vector Creation Methods
-- Direct vector creation
SELECT vector(1.0, 2.0, 3.0); -- Create 3-dimensional float vector
SELECT vector(1, 2, 3); -- Create 3-dimensional integer vector
-- Conversion from other types
SELECT cast(array(1,2,3) as VECTOR(INT, 3)); -- Array to vector
SELECT cast('[1,2,3]' as VECTOR(INT, 3)); -- JSON string to vector
-- Usage in table definition
CREATE TABLE feature_vectors (
id BIGINT,
dense_features VECTOR(FLOAT, 128), -- Dense features
sparse_features VECTOR(INT, 1000) -- Sparse features
);
Vector Functions
Distance Calculation Functions
L2_DISTANCE - Euclidean Distance
SELECT L2_DISTANCE(vector(1,2,3), vector(4,5,6)) as euclidean_distance;
-- Result: 5.196152210235596
-- IMPORTANT: Must use float vectors
SELECT L2_NORMALIZE(vector(3.0, 4.0)) as normalized_vector;
-- Result: [0.6, 0.8]
-- Incorrect example: Integer vectors may cause precision issues
SELECT L2_NORMALIZE(vector(3, 4)) as normalized_vector;
-- Possible result: [0, 0]
Function: Normalize vector to unit length (L2 norm = 1)
Applicable Scenarios: Preprocessing for cosine similarity calculation
Function Usage Notes
All distance functions require input vectors to have the same dimension; otherwise, a compile-time error will be raised
Functions support mixed computation of vector type and array type
Normalization function returns a vector type of the same dimension
L2_NORMALIZE function is sensitive to data type; float vectors are recommended
Type Conversion and Compatibility
Automatic Type Conversion
-- Vectors can be implicitly converted to arrays for array operations
SELECT array_size(vector(1,2,3,4)) as vector_size;
-- Result: 4
-- Arrays can be used in vector functions
SELECT L2_DISTANCE(array(1,2,3), vector(1,2,3)) as distance;
-- Result: 0.0
Explicit Type Conversion
-- Array to vector (dimensions must match)
SELECT cast(array(1.0, 2.0, 3.0) as VECTOR(FLOAT, 3));
-- Result: [1.0, 2.0, 3.0]
-- JSON string to vector
SELECT cast('[1,2,3]' as VECTOR(INT, 3));
-- Result: [1, 2, 3]
-- Returns NULL when dimensions don't match
SELECT cast(array(1,2) as VECTOR(INT, 3)); -- Returns NULL
SQL Operation Support
Vector type supports standard SQL operations:
-- Sorting
SELECT * FROM vectors ORDER BY feature_vector;
-- Grouping
SELECT feature_vector, COUNT(*) FROM vectors GROUP BY feature_vector;
-- Comparison
SELECT * FROM vectors WHERE feature_vector = vector(1,2,3);
-- Deduplication
SELECT DISTINCT feature_vector FROM vectors;
IMPORTANT: Parameter names must use the full form, e.g. cosine_distance instead of cosine
Index Creation and Management
-- Create vector index (recommended syntax)
CREATE VECTOR INDEX embedding_idx ON TABLE table_name(vector_column)
PROPERTIES(
"distance.function" = "cosine_distance", -- Full function name
"scalar.type" = "f32", -- Standard type name
"M" = 16,
"ef_construction" = 200
);
-- Build index for existing data
BUILD INDEX embedding_idx ON table_name;
-- View index status
SHOW INDEXES ON table_name;
-- Drop index
DROP INDEX embedding_idx ON table_name;
Index Performance Tuning
Performance characteristics of different parameter configurations:
-- Knowledge base table design
CREATE TABLE knowledge_base (
doc_id BIGINT PRIMARY KEY,
title STRING,
content STRING,
embedding VECTOR(FLOAT, 1536),
created_at TIMESTAMP,
INDEX doc_embedding_idx (embedding) USING VECTOR PROPERTIES (
"scalar.type" = "f32",
"distance.function" = "cosine_distance"
)
);
-- Semantic retrieval query (recommended pattern)
SELECT doc_id, title FROM knowledge_base
WHERE COSINE_DISTANCE(embedding, ?) < 0.7
ORDER BY doc_id LIMIT 10;
Recommendation System
-- User profile table
CREATE TABLE user_profiles (
user_id BIGINT PRIMARY KEY,
behavior_vector VECTOR(FLOAT, 128),
preference_vector VECTOR(FLOAT, 64),
last_updated TIMESTAMP,
INDEX behavior_idx (behavior_vector) USING VECTOR PROPERTIES (
"distance.function" = "cosine_distance"
)
);
-- Item feature table
CREATE TABLE item_features (
item_id BIGINT PRIMARY KEY,
content_vector VECTOR(FLOAT, 256),
category_vector VECTOR(INT, 50),
INDEX content_idx (content_vector) USING VECTOR PROPERTIES (
"distance.function" = "cosine_distance"
)
);
-- Find similar users (recommended query pattern)
SELECT target.user_id as similar_user FROM user_profiles source
CROSS JOIN user_profiles target
WHERE source.user_id = ?
AND target.user_id != source.user_id
AND COSINE_DISTANCE(source.behavior_vector, target.behavior_vector) < 0.6
ORDER BY target.user_id LIMIT 20;
Image Feature Matching
-- Image feature library
CREATE TABLE image_features (
image_id BIGINT PRIMARY KEY,
image_path STRING,
visual_features VECTOR(FLOAT, 2048),
semantic_features VECTOR(FLOAT, 512),
upload_time TIMESTAMP,
INDEX visual_idx (visual_features) USING VECTOR PROPERTIES (
"distance.function" = "l2_distance",
"scalar.type" = "f16"
),
INDEX semantic_idx (semantic_features) USING VECTOR PROPERTIES (
"distance.function" = "cosine_distance"
)
);
-- Similar image retrieval
SELECT image_id, image_path FROM image_features
WHERE L2_DISTANCE(visual_features, ?) < 100.0
ORDER BY image_id LIMIT 50;
Important Pitfall Avoidance Guide
Vector Query Limitations
Through actual testing in the UAT environment, two different error patterns were discovered:
1. Dimension Mismatch Error (Expected Behavior)
-- Incorrect example: Using a 5-dimensional vector to query 128-dimensional data
SELECT id FROM embeddings_table
WHERE COSINE_DISTANCE(embedding_128d, vector(0.1, 0.2, 0.3, 0.4, 0.5)) < 0.8;
-- Error message:
-- function 'COSINE_DISTANCE' cannot be resolved,
-- expect 'vector shall have the same dimension'
Solution: Ensure the query vector has exactly the same dimension as the vector in the table
-- Problematic query: Returning distance calculation in SELECT
SELECT id, content,
COSINE_DISTANCE(embedding, vector(...)) as distance
FROM embeddings_table
ORDER BY distance LIMIT 10;
-- Possible error:
-- Generated column 4292967295(generated_field_4292967295) is not filled for VECTOR
Currently Available Query Patterns:
-- Recommended: Vector filtering in WHERE clause
SELECT id, content FROM embeddings_table
WHERE COSINE_DISTANCE(embedding, vector(...)) < 0.8
ORDER BY id LIMIT 10;
-- Available: Simple vector functions
SELECT COSINE_DISTANCE(embedding, embedding) as self_distance
FROM embeddings_table;
-- Avoid: Complex SELECT returning distance calculation with sorting
Function Usage Pitfalls
L2_NORMALIZE Function Type Requirements
-- Correct: Using float vectors
SELECT L2_NORMALIZE(vector(3.0, 4.0)) as normalized;
-- Result: [0.6, 0.8]
-- Incorrect: Using integer vectors
SELECT L2_NORMALIZE(vector(3, 4)) as normalized;
-- Possible result: [0, 0] (precision loss)
Index Parameter Specifications
-- Correct: Using full parameter names
CREATE VECTOR INDEX idx ON TABLE t(col) PROPERTIES(
"distance.function" = "cosine_distance", -- Full function name
"scalar.type" = "f32" -- Standard type name
);
-- Incorrect: Using abbreviated names
CREATE VECTOR INDEX idx ON TABLE t(col) PROPERTIES(
"distance.function" = "cosine", -- Will cause error
"scalar.type" = "float" -- Will cause error
);
Best Practice Recommendations
Query Pattern Selection
Usage Scenario
Recommended Pattern
Pattern to Avoid
Similarity Filtering
WHERE COSINE_DISTANCE(...) < threshold
ORDER BY COSINE_DISTANCE(...)
Top-K Retrieval
WHERE filtering + simple sorting
Distance calculation + sorting
Batch Computation
Step-by-step queries
Complex nested queries
Performance Optimization Strategies
-- Recommended: Filter query utilizing index
SELECT doc_id, title FROM knowledge_base
WHERE COSINE_DISTANCE(embedding, ?) < 0.7
ORDER BY doc_id LIMIT 10;
-- Optional: Step-by-step Top-K retrieval
-- Step 1: Filter candidates
CREATE VIEW candidates AS
SELECT doc_id, title FROM knowledge_base
WHERE COSINE_DISTANCE(embedding, ?) < 0.8;
-- Step 2: Precise sorting (if needed)
SELECT doc_id, title FROM candidates ORDER BY doc_id LIMIT 10;
Migration Strategy
Migration Value Assessment
Before deciding to migrate to the VECTOR type, it is recommended to evaluate your usage scenarios:
High-Value Migration Scenarios
Frequent vector similarity calculations: Such as recommendation systems, document retrieval, image matching
Large-scale vector data: Vector data queries exceeding millions of records
Complex UDF maintenance: Currently maintaining multiple custom vector computation functions
High performance requirements: Applications with strict query response time requirements
Optional Migration Scenarios
Occasional vector calculations: Vector computation is not a core business process
Small-scale data: Vector data volume is small, performance pressure is not significant
Stable existing systems: Current system runs stably, change risk outweighs benefit
Migration Recommendations for Different Users
Spark User Migration Path
If you primarily use Spark MLlib for vector computation:
# Original Spark code
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
# Custom similarity function
def cosine_similarity(v1, v2):
# Complex implementation logic
pass
# Register UDF
cosine_udf = udf(cosine_similarity, DoubleType())
After Migrating to Singdata VECTOR:
-- Directly use built-in functions and recommended query patterns
SELECT user_id FROM user_features
WHERE COSINE_DISTANCE(user_vector, target_vector) < 0.8;
Migration Benefits:
Remove custom UDF code
Leverage vector indexes to accelerate queries
Unified vector computation in SQL
Hive User Migration Path
If you use ARRAY to store vectors in Hive:
-- Original Hive table structure
CREATE TABLE user_embeddings (
user_id BIGINT,
features ARRAY<DOUBLE>
) PARTITIONED BY (dt STRING);
-- Requires complex UDF
SELECT user_id, my_cosine_udf(features, target_array) as score
FROM user_embeddings
WHERE size(features) = 128; -- Manual dimension check
Migrating to VECTOR Type:
-- New table structure
CREATE TABLE user_embeddings_v2 (
user_id BIGINT,
features VECTOR(FLOAT, 128) -- Type safe
) PARTITIONED BY (dt STRING);
-- Use built-in functions and recommended patterns
SELECT user_id FROM user_embeddings_v2
WHERE COSINE_DISTANCE(features, vector(...)) < 0.8; -- No manual dimension check needed
Evaluating Existing Systems
Data Scale Assessment
-- Statistics on the scale and dimensions of existing vector data
SELECT
table_name,
column_name,
COUNT(*) as record_count,
AVG(size(array_column)) as avg_dimension,
MIN(size(array_column)) as min_dimension,
MAX(size(array_column)) as max_dimension,
COUNT(DISTINCT size(array_column)) as dimension_variants
FROM (
SELECT 'user_features' as table_name, 'embedding' as column_name, embedding as array_column FROM user_features
UNION ALL
SELECT 'product_vectors' as table_name, 'features' as column_name, features as array_column FROM product_vectors
-- Add other tables that need evaluation
) all_vectors
GROUP BY table_name, column_name;
UDF Complexity Assessment
-- Identify currently used vector-related UDFs
SHOW FUNCTIONS LIKE '*similarity*';
SHOW FUNCTIONS LIKE '*distance*';
SHOW FUNCTIONS LIKE '*cosine*';
SHOW FUNCTIONS LIKE '*euclidean*';
Query Frequency Assessment
-- Analyze the frequency of vector computation queries (if query logs are available)
SELECT
DATE(query_time) as query_date,
COUNT(*) as vector_query_count
FROM query_logs
WHERE query_text LIKE '%cosine%'
OR query_text LIKE '%similarity%'
OR query_text LIKE '%distance%'
GROUP BY DATE(query_time)
ORDER BY query_date DESC
LIMIT 30;
Parallel Construction of New Tables
-- Create new vector table
CREATE TABLE embeddings_new (
id BIGINT,
content STRING,
embedding VECTOR(FLOAT, 768),
migrated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
"distance.function" = "cosine_distance"
)
);
-- Data migration
INSERT INTO embeddings_new (id, content, embedding)
SELECT
id,
content,
CASE
WHEN size(embedding_array) = 768
THEN cast(embedding_array as VECTOR(FLOAT, 768))
ELSE NULL
END as embedding
FROM embeddings_old
WHERE size(embedding_array) = 768;
-- Build index
BUILD INDEX embedding_idx ON embeddings_new;
Verifying Migration Results
-- Compare query result consistency (using recommended query patterns)
WITH old_results AS (
SELECT id FROM embeddings_old
WHERE your_cosine_udf(embedding_array, ?) > 0.8
ORDER BY id LIMIT 10
),
new_results AS (
SELECT id FROM embeddings_new
WHERE COSINE_DISTANCE(embedding, ?) < 0.2 -- Note the distance vs. similarity conversion
ORDER BY id LIMIT 10
)
SELECT
COALESCE(o.id, n.id) as id,
CASE WHEN o.id IS NOT NULL THEN 'OLD' ELSE NULL END as in_old,
CASE WHEN n.id IS NOT NULL THEN 'NEW' ELSE NULL END as in_new
FROM old_results o
FULL OUTER JOIN new_results n ON o.id = n.id
ORDER BY id;
Performance Optimization
Query Optimization
-- Recommended: Use threshold filtering to improve query efficiency
SELECT id, content FROM embeddings_table
WHERE COSINE_DISTANCE(embedding, ?) < 0.8 -- Leverage vector index filtering
ORDER BY id LIMIT 10;
-- Optional: Step-by-step processing for complex queries
CREATE VIEW filtered_candidates AS
SELECT id, content FROM embeddings_table
WHERE COSINE_DISTANCE(embedding, ?) < 0.8;
SELECT id, content FROM filtered_candidates
ORDER BY id LIMIT 10;
Storage Optimization
-- Use half-precision storage for large-scale data
CREATE TABLE large_embeddings (
id BIGINT,
embedding VECTOR(FLOAT, 1536),
INDEX embedding_idx (embedding) USING VECTOR PROPERTIES (
"scalar.type" = "f16", -- Half-precision storage saves memory
"distance.function" = "cosine_distance"
)
);
Index Tuning
Choose appropriate index parameters based on application requirements:
Real-time query scenarios: Smaller M value and ef_construction, use f16 type
Batch analysis scenarios: Larger M value and ef_construction, use f32 type
Storage optimization scenarios: Use f16 or i8 type to reduce memory footprint
Important Notes
Functional Limitations
Vector dimension is determined at creation time and does not support dynamic adjustment
Vectors cannot be used as partition keys
JACCARD_DISTANCE and HAMMING_DISTANCE functions have special type requirements
The current version has limitations on complex SELECT queries; it is recommended to use WHERE filtering patterns
Index Management
CREATE INDEX only applies to newly added data; existing data requires BUILD INDEX
The performance of vector indexes is closely related to memory caching
It is recommended to use a dedicated Virtual Cluster for vector queries
Troubleshooting
Issue
Possible Cause
Solution
Poor query performance
Index not in effect
Check index status, execute BUILD INDEX
Dimension mismatch error
Inconsistent vector dimensions
Check VECTOR type definition and input data dimensions
Generated column error
Complex SELECT query limitation
Use WHERE filtering pattern instead
L2_NORMALIZE returns zero vector
Using integer vectors
Switch to float vectors: vector(3.0, 4.0)
Index parameter error
Non-standard parameter names
Use full parameter names: cosine_distance
Out of memory
Vector index consumes too much memory
Adjust index parameters or increase cluster resources
Summary
Core Value of VECTOR Data Type
Type Safety: Avoid runtime dimension mismatch errors through fixed dimension definition
Development Simplification: Provide built-in vector distance calculation functions, reducing custom code
Standard Compatibility: Compatible with SQL syntax, supporting sorting, grouping, and other standard operations
Applicable Scenarios
Recommended use of VECTOR:
Frequent vector similarity calculations needed
Fixed and clear vector dimensions
Desire to leverage vector indexes to improve query performance
Newly developed AI application projects
Continue using ARRAY:
Primarily performing array operations rather than vector computation
Frequently changing vector dimensions
High cost of retrofitting existing systems
Need for complex vector sorting queries (current limitation)
Implementation Recommendations
Evaluation Phase: Analyze the scale and dimensional characteristics of existing vector data
Pilot Validation: Select small-scale scenarios to validate VECTOR effectiveness
Gradual Migration: Use parallel construction approach for smooth migration of existing systems
Continuous Optimization: Adjust index parameters and query strategies based on actual usage
Query Pattern Recommendations
Prioritize:
Vector filtering in WHERE clause: WHERE COSINE_DISTANCE(...) < threshold
Simple vector function calculations: COSINE_DISTANCE(v1, v2)
Use with Caution:
Complex SELECT returning distance calculation with sorting
Nested queries on large vectors
Alternative Approaches:
Step-by-step queries: Filter first, then sort
Pre-computation: Batch-compute distances and store
Note: This document is based on the Lakehouse product documentation as of June 2025. It is recommended to check the official documentation regularly for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.