Overview
Lakehouse supports vector types, vector search functions, and vector indexes for vector retrieval scenarios. A vector is an ordered, fixed-dimension collection of numerical values. Vectors can represent many kinds of data, such as embeddings produced by large language models (LLMs), image or facial embeddings, financial time series, spatial coordinates, velocity, and color. The vector data type makes it easier to insert, load, and query vectors.
Modern deep learning techniques create vectors as structured numerical representations of unstructured data (such as text and images), preserving semantic similarity and dissimilarity in the geometry of the resulting vectors. Lakehouse offers the VECTOR type to store these vectors, and building an index on them can improve vector search performance.
Vector Features Supported by Lakehouse
- Vector Storage: Store vectors using the VECTOR type.
- Vector Indexing: Build vector indexes using the HNSW (Hierarchical Navigable Small World) algorithm to accelerate computation.
- Distance Calculation: Support various functions to calculate vector similarity, including functions like L2_DISTANCE and COSINE_DISTANCE.
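As a rough illustration of what the two distance functions named above measure, here is a minimal Python sketch. It is not Lakehouse code: the function names simply mirror L2_DISTANCE and COSINE_DISTANCE, and it assumes the common convention that cosine distance equals 1 minus cosine similarity, which may differ from the engine's exact definition.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

L2 distance is sensitive to vector magnitude, while cosine distance depends only on direction, so the right choice depends on how the embeddings were produced.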
Usage Considerations
- The current version of the vector type does not support comparison operations, so it cannot be used in ORDER BY or GROUP BY clauses.
- The current client does not implement the vector type, although the SQL engine supports it. As a result, a SELECT whose result includes a vector-type column fails with the error: Unsupported data type: VECTOR_TYPE.
- Vector index performance depends directly on the memory cache and disk cache. Use a dedicated VC (virtual cluster) for vector search; sharing it with other workloads can cause cache contention, and performance may fall short of expectations.
Lakehouse Vector Usage
Creating Vectors
The properties clause accepts index parameters; for details, refer to the Create Vector Index documentation.
Insert Data
- Use SQL to insert
- If you write to Lakehouse through an external system, direct writing of vectors is not currently supported. Write the data as an array, and then use insert overwrite select cast (array_col as vector) for conversion.
- If the data is in object storage, you can import the vector type directly using a volume.
Vector Retrieval
To confirm that a query actually uses the vector index, run it with the EXPLAIN SELECT ... syntax and check whether the TableScan operator contains the term vector_index_search_type.
When the vector index is not used, the query falls back to brute-force search.
Using Vector Indexes Together with Inverted Indexes
A vector index only accelerates vector search. When the query also filters on other fields, it falls back directly to brute-force search. There are generally two approaches to this problem.
- Perform the vector search in a subquery first, and then apply the other field filters in the outer query. This approach is fast, but if the non-vector filter removes a large share of the rows, the final result often contains fewer rows than the user expects, or is even empty.
- Combine the vector index with an inverted index. In this case, the execution process is: first, the inverted index selects the matching rows based on match_regexp(doc, '.*hello.*', map('analyzer', 'keyword')), and then vector search runs on the matching rows.
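The difference between searching first and filtering first can be sketched in a few lines of Python. This is only a conceptual model: the rows are hypothetical, a brute-force sort stands in for the vector index, and a plain substring test stands in for the inverted-index filter.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical table rows: (id, doc, vec).
ROWS = [
    (1, "hello world", [0.9, 0.9]),
    (2, "goodbye", [0.1, 0.1]),
    (3, "other text", [0.2, 0.1]),
    (4, "say hello", [0.8, 0.7]),
    (5, "nothing here", [0.0, 0.2]),
]


def top_k(rows, query, k):
    """Brute-force nearest neighbours; stands in for the vector index."""
    return sorted(rows, key=lambda r: l2_distance(r[2], query))[:k]


def search_then_filter(rows, query, k, predicate):
    """Vector search in the 'subquery', field filter in the 'outer query'."""
    return [r for r in top_k(rows, query, k) if predicate(r)]


def filter_then_search(rows, query, k, predicate):
    """Field filter first, then vector search over the surviving rows."""
    return top_k([r for r in rows if predicate(r)], query, k)


def contains_hello(row):
    """Stand-in for the text filter on the doc column."""
    return "hello" in row[1]


query = [0.0, 0.0]
post = search_then_filter(ROWS, query, 2, contains_hello)
pre = filter_then_search(ROWS, query, 2, contains_hello)
print([r[0] for r in post])  # ids left after searching first, then filtering
print([r[0] for r in pre])   # ids found after filtering first, then searching
```

With this data, the two nearest rows do not contain "hello", so searching first returns nothing, while filtering first still returns two rows.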
Parameters Supported in Vector Search Queries
| Name | Default Value | Notes |
| --- | --- | --- |
| cz.storage.parquet.vector.index.read.memory.cache | false | Whether to use the memory cache |
| cz.storage.parquet.vector.index.read.local.cache | false | Whether to use the local SSD cache |
| cz.storage.parquet.vector.index.read.vectors.ondemand | adaptive | Whether to load the vector index on demand (slower than using the memory cache) |
| cz.storage.parquet.vector.index.write.parallel | 0 | Number of threads for parallel writing; 0 turns parallel writing off, and 8 means 8 writer threads. Note that the performance improvement is not proportional to the number of threads. |
Usage example: when running a SQL query, select the parameter settings and the query and execute them together.
Billing
- Storage Resources: Building a vector index creates vector index files. Both the index files and the data files are stored in object storage and billed together.