Lakehouse Full-Text Search Usage Guide
Foreword
Text Retrieval Scenarios You May Encounter
Scenario 1: Enterprise Knowledge Base Search System
Common approaches in Hive/Spark:
-- Text search in Hive
SELECT * FROM knowledge_base
WHERE content LIKE '%Artificial Intelligence%'
OR content LIKE '%Machine Learning%'
OR content LIKE '%Deep Learning%';
-- Performance issues:
-- 1. Full table scan, cannot use indexes
-- 2. Only exact matching, no semantic search
-- 3. Chinese word segmentation difficulty, poor search results
Technical challenges faced:
- Performance limitations: LIKE queries require full table scans, extremely poor performance with large data volumes
- Functional limitations: No word segmentation search, only exact string matching
- Maintenance complexity: Manual management of search keywords and synonyms required
- Scalability difficulties: Multi-language support requires additional development
Scenario 2: Log Analysis and Troubleshooting
Typical implementation in Elasticsearch:
// Elasticsearch query
{
"query": {
"multi_match": {
"query": "ERROR timeout connection",
"fields": ["message", "stack_trace"]
}
}
}
Common technical pain points:
- Architecture complexity: Requires maintaining a separate search engine system
- Data synchronization: Data needs to be synced from the data lake to the search engine
- Cost issues: Additional storage and compute resource consumption
- Consistency risk: Data sync delay leads to inaccurate search results
Why Native Full-Text Search Is Needed
Based on the technical challenges from the above scenarios, the design goals of Lakehouse native full-text search are:
Unified data processing platform:
- Perform analytical computation and full-text search in the same system
- No need to maintain a separate search engine system
- Strong consistency between data storage and retrieval
High-performance inverted index:
- Fast text retrieval based on inverted indexes
- Support for multiple tokenization strategies and languages
- Performance improvement of tens of times over traditional LIKE queries
Intelligent tokenization and matching:
- Support for Chinese, English, Unicode, and other tokenizers
- Multiple matching modes including phrase matching, prefix matching, and regex matching
- Automatic case conversion and punctuation filtering
How to Use This Guide
| Reader Role | Suggested Reading Focus | Expected Takeaway |
|---|
| Data Engineer | Index creation -> Function usage -> Performance optimization | Master the correct usage of inverted indexes |
| System Architect | Application scenarios -> Architecture comparison -> Migration strategy | Obtain guidance for migrating from traditional search engines |
| Application Developer | Quick start -> Query patterns -> Pitfall avoidance guide | Understand best practices for full-text search |
Quick Start
Basic Usage Flow
-- 1. Create a table with an inverted index
CREATE TABLE documents (
id BIGINT,
title STRING,
content STRING,
created_at TIMESTAMP,
-- Define inverted index at table creation
INDEX title_idx (title) INVERTED PROPERTIES('analyzer'='keyword'),
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart')
);
-- 2. Insert data
INSERT INTO documents VALUES
(1, 'Singdata Technical Guide', 'Singdata Lakehouse supports vector search and full-text search features', CURRENT_TIMESTAMP());
-- 3. Build index for existing data
BUILD INDEX content_idx ON documents;
-- 4. Execute full-text search query
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'full-text search vector search', map('analyzer', 'auto'));
-- 5. Use phrase matching
SELECT id, title FROM documents
WHERE MATCH_PHRASE(content, 'full-text search features', map('analyzer', 'auto'));
Technical Comparison: Traditional Approach vs Native Full-Text Search
Traditional LIKE Query Approach
-- Text search using LIKE
SELECT * FROM documents
WHERE content LIKE '%Artificial Intelligence%'
OR content LIKE '%Machine Learning%'
OR content LIKE '%Deep Learning%';
-- Existing problems:
-- 1. Full table scan, poor performance
-- 2. No word segmentation, only exact matching
-- 3. Case sensitive
-- 4. Cannot handle synonyms
External Search Engine Approach
# Requires maintaining an additional Elasticsearch cluster
version: '3'
services:
elasticsearch:
image: elasticsearch:7.14.0
environment:
- discovery.type=single-node
ports:
- "9200:9200"
# Data sync script needed
from elasticsearch import Elasticsearch
def sync_data_to_es():
# Read data from data lake
# Convert format
# Sync to Elasticsearch
pass
Lakehouse Native Full-Text Search
-- One-stop solution
CREATE TABLE documents (
id BIGINT,
content STRING,
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese')
);
-- Use built-in functions directly
SELECT * FROM documents
WHERE MATCH_ALL(content, 'Artificial Intelligence Machine Learning', map('analyzer', 'auto'));
Solution Comparison Summary
| Feature | LIKE Query | External Search Engine | Lakehouse Full-Text Search | Advantage |
|---|
| Performance | Very poor (full table scan) | Excellent | Excellent (inverted index) | High performance without external systems |
| Architecture complexity | Simple | Complex | Simple | Unified platform, reduced ops cost |
| Data consistency | Strong | Eventual | Strong | Avoids data sync delay |
| Development cost | Low | High | Medium | Reduced integration work |
| Tokenization support | None | Rich | Rich | Native multi-language tokenization |
| Maintenance cost | Low | High | Low | No need to maintain a separate search cluster |
Inverted Index Core Concepts
Inverted Index Principles
Basic concepts:
- Dictionary: Stores the list of unique words that appear across all documents
- Posting List: Records which documents each word appears in and position information
Build process:
- Tokenization: Split document content into words or phrases
- Normalization: Lowercase conversion, stop word removal, and other processing
- Build dictionary: Assign a unique ID to each word
- Build posting list: Record the mapping between words and documents
Index Creation Syntax
Define Index at Table Creation
CREATE TABLE table_name (
column_definitions,
INDEX index_name (column_name) INVERTED
[COMMENT 'description']
PROPERTIES(
'analyzer'='english|chinese|keyword|unicode',
'mode'='smart|max_word' -- Only supported for Chinese tokenizer
)
);
Add Index to Existing Table
CREATE INVERTED INDEX [IF NOT EXISTS] index_name
ON TABLE [schema.]table_name(column_name)
[COMMENT 'description']
PROPERTIES(
'analyzer'='english|chinese|keyword|unicode',
'mode'='smart|max_word'
);
Supported Data Types
| Data Type Category | Specific Types | PROPERTIES Required | Description |
|---|
| Numeric types | TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE | No | Used to accelerate equality and range queries |
| Date types | DATE, TIMESTAMP | No | Used to accelerate time range queries |
| String types | STRING, VARCHAR, CHAR | Yes | Must specify tokenizer |
Tokenizer Details
Tokenizer Type Comparison
keyword Tokenizer
SELECT TOKENIZE('Singdata Lakehouse Technical Guide', map('analyzer', 'keyword'));
-- Result: ["Singdata Lakehouse Technical Guide"]
- Characteristics: No tokenization, preserves complete string
- Applicable scenarios: Exact matching, ID fields, status fields
- Performance: Best, direct string matching
english Tokenizer
SELECT TOKENIZE('Singdata Lakehouse Technical Guide', map('analyzer', 'english'));
-- Result: ["singdata", "lakehouse", "technical", "guide"]
- Characteristics: Identifies ASCII alphanumeric characters, lowercases, filters punctuation
- Applicable scenarios: Pure English content
- Performance: Excellent, optimized for English
chinese Tokenizer
SELECT TOKENIZE('Singdata Lakehouse Technical Guide', map('analyzer', 'chinese'));
-- Result: ["singdata", "lakehouse", "technical", "guide"]
- Characteristics: Supports mixed Chinese and English, intelligent tokenization
- Applicable scenarios: Chinese documents, mixed Chinese-English content
- Tokenization modes:
smart: Intelligent tokenization, coarse granularity, high accuracy
max_word: Maximum tokenization, fine granularity, high recall
unicode Tokenizer
SELECT TOKENIZE('Singdata Lakehouse Technical Guide', map('analyzer', 'unicode'));
-- Result: ["singdata", "lakehouse", "technical", "guide"]
- Characteristics: Supports all Unicode characters
- Applicable scenarios: Multi-language mixed content
- Performance: Relatively slower, but broadest support range
Tokenization Mode Details
Only the Chinese tokenizer supports tokenization modes:
-- smart mode: Intelligent tokenization
SELECT TOKENIZE('Natural Language Processing Technology', map('analyzer', 'chinese', 'mode', 'smart'));
-- Result: ["Natural Language", "Processing", "Technology"]
-- max_word mode: Maximum tokenization
SELECT TOKENIZE('Natural Language Processing Technology', map('analyzer', 'chinese', 'mode', 'max_word'));
-- Result: ["Natural", "Language", "Natural Language", "Processing", "Technology"]
Selection guidance:
- Short text matching: Use
smart mode for higher accuracy
- Full-text search: Use
max_word mode for higher recall
Full-Text Search Functions
Function Overview
| Function Name | Purpose | Return Type | Applicable Scenario |
|---|
| TOKENIZE | Tokenization test | ARRAY | Verify tokenization results |
| MATCH_ALL | Match all keywords | BOOLEAN | Exact search |
| MATCH_ANY | Match any keyword | BOOLEAN | Broad search |
| MATCH_PHRASE | Phrase matching | BOOLEAN | Order-sensitive search |
| MATCH_PHRASE_PREFIX | Phrase prefix matching | BOOLEAN | Autocomplete |
| MATCH_REGEXP | Regex matching | BOOLEAN | Pattern matching |
TOKENIZE - Tokenization Test Function
Purpose: Tokenize text using the specified tokenizer, used to verify tokenization results
Parameters:
input: Text to tokenize
option: Tokenization options, e.g., map('analyzer', 'chinese')
Usage example:
-- Test different tokenizer effects
SELECT
'keyword' as analyzer,
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'keyword')) as tokens
UNION ALL
SELECT
'chinese',
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'chinese'))
UNION ALL
SELECT
'unicode',
TOKENIZE('Machine Learning Algorithm', map('analyzer', 'unicode'));
MATCH_ALL - Match All Function
MATCH_ALL(column, query, option)
Purpose: Requires documents to contain all tokenized results from the query text
Logic: AND relationship, all words must be present
Usage example:
-- Find documents containing both "Machine Learning" and "Algorithm"
SELECT id, title FROM documents
WHERE MATCH_ALL(content, 'Machine Learning Algorithm', map('analyzer', 'auto'));
MATCH_ANY - Match Any Function
MATCH_ANY(column, query, option)
Purpose: Documents need only contain any single token from the query text
Logic: OR relationship, any word present is sufficient
Usage example:
-- Find documents containing "AI" or "Machine Learning" or "Deep Learning"
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'AI Machine Learning Deep Learning', map('analyzer', 'auto'));
MATCH_PHRASE - Phrase Matching Function
MATCH_PHRASE(column, query, option)
Purpose: Requires documents to contain the query phrase, with word order consistent and continuous
Characteristics:
- Order-sensitive
- Requires consecutive occurrence
- Case-insensitive
Usage example:
-- Find documents containing the phrase "Natural Language Processing"
SELECT id, title FROM documents
WHERE MATCH_PHRASE(content, 'Natural Language Processing', map('analyzer', 'auto'));
-- Counterexample: will not match "Natural Language Processing" (not consecutive)
-- Counterexample: will not match "Language Natural Processing" (wrong order)
MATCH_PHRASE_PREFIX - Phrase Prefix Matching
MATCH_PHRASE_PREFIX(column, query, option)
Purpose: First n-1 words matched as phrase, last word matched as prefix
Applicable scenarios: Search suggestions, autocomplete
Usage example:
-- Find phrases starting with "Data An" (e.g., "Data Analysis", "Data Annotation")
SELECT id, title FROM documents
WHERE MATCH_PHRASE_PREFIX(content, 'Data An', map('analyzer', 'auto'));
MATCH_REGEXP - Regex Matching Function
MATCH_REGEXP(column, query, option)
Purpose: Perform regex matching on tokenization results
Usage example:
-- Find documents containing words ending with "ing"
SELECT id, title FROM documents
WHERE MATCH_REGEXP(content, '.*ing', map('analyzer', 'auto'));
-- Find documents containing digits
SELECT id, title FROM documents
WHERE MATCH_REGEXP(content, '.*[0-9].*', map('analyzer', 'auto'));
Important Parameter Notes
analyzer Option
Recommended auto parameter:
map('analyzer', 'auto') -- Automatically matches the column's tokenizer settings
Manually specifying tokenizer:
map('analyzer', 'chinese')
map('analyzer', 'chinese', 'mode', 'smart')
-- IMPORTANT: The tokenizer in the function must match the tokenizer used when creating the index; otherwise, the index cannot be leveraged for acceleration!
Index Management
Index Lifecycle
1. Create Index
-- Method 1: Create at table creation
CREATE TABLE documents (
id BIGINT,
content STRING,
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese')
);
-- Method 2: Create for existing table
CREATE INVERTED INDEX content_idx ON TABLE documents(content)
PROPERTIES('analyzer'='chinese', 'mode'='smart');
2. Build Index (Important!)
-- Build index for existing data (synchronous task)
BUILD INDEX content_idx ON documents;
-- Build index by partition
BUILD INDEX content_idx ON documents
WHERE partition_date >= '2024-01-01';
-- Key reminders:
CREATE INDEX only takes effect for new data
- Existing data requires
BUILD INDEX to leverage the index
- Building an index consumes compute resources; it is recommended to perform during off-peak hours
3. View Index
-- List all indexes on a table
SHOW INDEX FROM table_name;
-- View index details (if supported by the environment)
DESC INDEX index_name;
DESC INDEX EXTENDED index_name; -- Includes size information
4. Drop Index
DROP INDEX index_name ON table_name;
Drop characteristics:
- Metadata deleted immediately
- Index files cleaned up asynchronously
- Does not affect the data itself
Partitioned Table Index Strategy
-- Recommended approach for large table partition building
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '01';
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '02';
-- Build partition by partition to avoid excessive resource consumption
Storage Cost Optimization
-- View index storage occupancy (if supported)
DESC INDEX EXTENDED index_name;
-- Choose appropriate tokenizer based on business needs
-- keyword: Least storage, exact match only
-- chinese: Medium storage, supports intelligent tokenization
-- unicode: Most storage, supports all languages
Application Scenario Design
Enterprise Knowledge Base System
-- Knowledge base table design
CREATE TABLE knowledge_base (
doc_id BIGINT PRIMARY KEY,
title STRING,
content STRING,
category STRING,
tags STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP,
-- Multi-level index design
INDEX title_keyword_idx (title) INVERTED PROPERTIES('analyzer'='keyword'),
INDEX content_chinese_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX tags_unicode_idx (tags) INVERTED PROPERTIES('analyzer'='unicode')
);
-- Multi-dimensional search query
SELECT doc_id, title FROM knowledge_base
WHERE MATCH_ANY(content, 'AI Machine Learning', map('analyzer', 'auto'))
OR MATCH_ANY(tags, 'AI ML', map('analyzer', 'auto'))
ORDER BY updated_at DESC LIMIT 20;
Log Analysis System
-- Application log table
CREATE TABLE application_logs (
log_id BIGINT,
timestamp TIMESTAMP,
level STRING,
message STRING,
stack_trace STRING,
source_ip STRING,
INDEX message_idx (message) INVERTED PROPERTIES('analyzer'='english'),
INDEX stack_trace_idx (stack_trace) INVERTED PROPERTIES('analyzer'='keyword')
) PARTITIONED BY (DATE(timestamp));
-- Troubleshooting query
SELECT log_id, timestamp, message FROM application_logs
WHERE DATE(timestamp) >= CURRENT_DATE() - INTERVAL '7' DAY
AND level = 'ERROR'
AND MATCH_ANY(message, 'timeout connection database', map('analyzer', 'auto'))
ORDER BY timestamp DESC;
E-Commerce Product Search
-- Product information table
CREATE TABLE products (
product_id BIGINT,
name STRING,
description STRING,
brand STRING,
category STRING,
tags STRING,
price DECIMAL(10,2),
INDEX name_chinese_idx (name) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='max_word'),
INDEX desc_chinese_idx (description) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='max_word'),
INDEX brand_keyword_idx (brand) INVERTED PROPERTIES('analyzer'='keyword')
);
-- Product search query
SELECT product_id, name, price FROM products
WHERE MATCH_ANY(name, 'phone smart camera', map('analyzer', 'auto'))
OR MATCH_ANY(description, 'phone smart camera', map('analyzer', 'auto'))
ORDER BY price;
Content Management System
-- Article content table
CREATE TABLE articles (
article_id BIGINT,
title STRING,
content STRING,
author STRING,
publish_date DATE,
status STRING,
INDEX title_chinese_idx (title) INVERTED PROPERTIES('analyzer'='chinese'),
INDEX content_chinese_idx (content) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX author_keyword_idx (author) INVERTED PROPERTIES('analyzer'='keyword')
);
-- Content search
SELECT article_id, title, author FROM articles
WHERE status = 'published'
AND publish_date >= CURRENT_DATE() - INTERVAL '30' DAY
AND MATCH_PHRASE(content, 'technology development trends', map('analyzer', 'auto'))
ORDER BY publish_date DESC;
Important Pitfall Avoidance Guide
Tokenizer Consistency Issue
Common Error: Tokenizer Mismatch
-- ❌ Error example: Index uses chinese tokenizer, query uses english tokenizer
CREATE TABLE docs (
content STRING,
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='chinese')
);
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'english'));
-- Result: Cannot leverage index, poor performance
Correct Approach:
-- ✅ Solution 1: Use auto parameter (recommended)
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'auto'));
-- ✅ Solution 2: Manually match the index tokenizer
SELECT * FROM docs
WHERE MATCH_ANY(content, 'test', map('analyzer', 'chinese'));
Index Build Pitfalls
The Necessity of BUILD INDEX
-- ❌ Common error: Forgetting to build index
CREATE INVERTED INDEX content_idx ON TABLE existing_table(content)
PROPERTIES('analyzer'='chinese');
-- Query directly (only effective for new data, existing data cannot leverage index)
SELECT * FROM existing_table
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
Correct Index Build Process
-- ✅ Complete process
-- 1. Create index
CREATE INVERTED INDEX content_idx ON TABLE existing_table(content)
PROPERTIES('analyzer'='chinese');
-- 2. Build index (required!)
BUILD INDEX content_idx ON existing_table;
-- 3. Verify index
SHOW INDEX FROM existing_table;
-- 4. Execute query
SELECT * FROM existing_table
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
According to the official documentation: In most cases, inverted indexes do not significantly improve performance for queries with sub-second execution times.
-- ❌ Not ideal for inverted index scenarios
-- 1. Small tables (fewer than hundreds of rows)
-- 2. Queries already executing in <1 second
-- 3. Simple equality queries
-- ✅ Suitable for inverted index scenarios
-- 1. Large data tables (10,000+ rows)
-- 2. Complex text searches
-- 3. Multi-keyword matching
Unsupported Query Patterns
-- ❌ Not supported: Type conversion on the column
SELECT * FROM docs
WHERE MATCH_ANY(CAST(id AS STRING), '123', map('analyzer', 'auto'));
-- ✅ Supported: Type conversion on the query value
SELECT * FROM docs
WHERE MATCH_ANY(content, CAST(123 AS STRING), map('analyzer', 'auto'));
-- ❌ Not supported: External tables
-- Inverted indexes do not support external tables
Storage Cost Pitfalls
Index Storage Overhead
-- Inverted indexes create additional index files
-- Storage cost = Original data + Index files
-- Optimization suggestions:
-- 1. Only create indexes for columns that truly need searching
-- 2. Choose appropriate tokenizer based on query patterns
-- 3. Regularly clean up unnecessary indexes
Tokenizer Selection Strategy
| Tokenizer | Index Size | Query Performance | Feature Richness | Recommended Scenario |
|---|
| keyword | Smallest | Fastest | Exact match | IDs, statuses, tags |
| english | Small | Fast | English tokenization | Pure English content |
| chinese | Medium | Medium | Chinese tokenization | Chinese documents |
| unicode | Largest | Relatively slow | Full language support | Multi-language mixed |
Query Optimization Strategies
1. Choose Query Functions Wisely
-- Choose appropriate functions based on business needs
-- Exact search: Use MATCH_ALL
SELECT * FROM docs WHERE MATCH_ALL(content, 'Machine Learning Algorithm', map('analyzer', 'auto'));
-- Broad search: Use MATCH_ANY
SELECT * FROM docs WHERE MATCH_ANY(content, 'AI Machine Learning Deep Learning', map('analyzer', 'auto'));
-- Phrase search: Use MATCH_PHRASE
SELECT * FROM docs WHERE MATCH_PHRASE(content, 'Natural Language Processing', map('analyzer', 'auto'));
2. Combined Query Optimization
-- ✅ Recommended: Use full-text search as the primary filter condition
SELECT * FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
AND category = 'tech' -- Secondary filtering on top of full-text search
AND created_at >= '2024-01-01';
3. Paginated Query Optimization
-- ✅ Efficient paginated query
SELECT id, title FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
ORDER BY id -- Sort by primary key
LIMIT 20 OFFSET 0;
-- -- Avoid deep pagination
-- -- LIMIT 20 OFFSET 10000; -- Performance degrades as offset increases
Index Optimization Strategies
1. Partitioned Table Index Management
-- Build index gradually by partition
BUILD INDEX content_idx ON large_table
WHERE year = '2024' AND month = '06';
-- Monitor build progress
-- Job Profile can be used to view progress
2. Index Maintenance Strategy
-- Regularly check index status
SHOW INDEX FROM table_name;
-- For frequently updated tables, index rebuild may be needed
DROP INDEX old_idx ON table_name;
CREATE INVERTED INDEX new_idx ON TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
BUILD INDEX new_idx ON table_name;
System Resource Optimization
1. Compute Resource Configuration
- Index building: Use larger Virtual Clusters for BUILD INDEX operations
- Query execution: Use smaller Virtual Clusters for daily queries
- Mixed workloads: Separate index building and query workloads
2. Storage Resource Optimization
-- Monitor index storage occupancy (if supported)
DESC INDEX EXTENDED index_name;
-- Clean up unnecessary indexes
DROP INDEX unused_idx ON table_name;
Migration Strategy
Migrating from Traditional LIKE Queries
Migration Assessment
Scenarios suitable for migration:
- Frequent text search queries
- Large data tables (10,000+ rows)
- Complex multi-keyword searches
- Scenarios requiring Chinese tokenization
Scenarios not in a hurry to migrate:
- Small data tables
- Occasional text queries
- Simple queries already fast (<1 second)
Migration Steps
-- 1. Analyze existing query patterns
-- Identify frequent LIKE queries
SELECT query_text, execution_count
FROM query_logs
WHERE query_text LIKE '%LIKE%'
ORDER BY execution_count DESC;
-- 2. Create test table to verify effects
CREATE TABLE docs_test AS SELECT * FROM docs_original LIMIT 1000;
-- 3. Add inverted index
CREATE INVERTED INDEX content_idx ON TABLE docs_test(content)
PROPERTIES('analyzer'='chinese');
BUILD INDEX content_idx ON docs_test;
-- 4. Performance comparison test
-- Original LIKE query
SELECT COUNT(*) FROM docs_test WHERE content LIKE '%keyword%';
-- Full-text search query
SELECT COUNT(*) FROM docs_test
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
-- 5. Gradually migrate to production environment
Migrating from External Search Engines
Elasticsearch Migration Mapping
| Elasticsearch Query | Lakehouse Full-Text Search | Description |
|---|
match_all | MATCH_ALL | Match all keywords |
match | MATCH_ANY | Match any keyword |
match_phrase | MATCH_PHRASE | Phrase matching |
match_phrase_prefix | MATCH_PHRASE_PREFIX | Phrase prefix matching |
regexp | MATCH_REGEXP | Regex matching |
Migration Benefits Assessment
Architecture simplification benefits:
- Eliminate 1 separate search engine system
- Unified data storage and retrieval platform
- Reduced operational complexity
Cost optimization benefits:
- Reduced additional storage and compute resources
- No data sync ETL processes needed
- Lower total cost of ownership
Data consistency benefits:
- Strong consistency guarantees
- No data sync delays
- Real-time search of latest data
Hybrid Solution Design
For complex enterprise environments, a hybrid approach can be adopted:
-- Approach 1: Separate by data type
-- Structured queries + Simple text search: Use Lakehouse full-text search
-- Complex semantic search + Recommendation algorithms: Keep Elasticsearch
-- Approach 2: Separate by real-time requirements
-- Real-time search: Use Lakehouse full-text search
-- Offline analysis: Use Elasticsearch
-- Approach 3: Separate by data volume
-- Large data volumes: Use Lakehouse full-text search
-- Small volume rapid prototyping: Use Elasticsearch
Best Practices Summary
Index Design Best Practices
1. Tokenizer Selection Principles
-- Choose tokenizer based on data characteristics
CREATE TABLE multilingual_docs (
id BIGINT,
title_cn STRING, -- Chinese title
title_en STRING, -- English title
content STRING, -- Mixed content
tags STRING, -- Tags (exact match)
INDEX title_cn_idx (title_cn) INVERTED PROPERTIES('analyzer'='chinese', 'mode'='smart'),
INDEX title_en_idx (title_en) INVERTED PROPERTIES('analyzer'='english'),
INDEX content_idx (content) INVERTED PROPERTIES('analyzer'='unicode'),
INDEX tags_idx (tags) INVERTED PROPERTIES('analyzer'='keyword')
);
2. Index Maintenance Strategy
-- Production environment index maintenance process
-- 1. Create index during off-peak hours
CREATE INVERTED INDEX content_idx ON TABLE large_table(content)
PROPERTIES('analyzer'='chinese');
-- 2. Build index in batches
BUILD INDEX content_idx ON large_table
WHERE partition_date = '2024-06-01';
-- 3. Verify index effectiveness
SELECT COUNT(*) FROM large_table
WHERE MATCH_ANY(content, 'test keyword', map('analyzer', 'auto'));
-- 4. Monitor index status
SHOW INDEX FROM large_table;
Query Pattern Best Practices
1. Query Function Selection Guide
| Business Scenario | Recommended Function | Example Query |
|---|
| Exact search with multiple keywords | MATCH_ALL | Must contain "AI" and "Machine Learning" |
| Broad topic search | MATCH_ANY | Contains "AI" or "ML" or "Deep Learning" |
| Professional term search | MATCH_PHRASE | Exact match for "Natural Language Processing" |
| Search suggestions / Autocomplete | MATCH_PHRASE_PREFIX | Phrases starting with "Data An" |
| Pattern matching | MATCH_REGEXP | Contains digits or specific formats |
-- Efficient query patterns
-- 1. Leverage partition filtering
SELECT * FROM logs
WHERE log_date >= '2024-01-01'
AND MATCH_ANY(message, 'error timeout', map('analyzer', 'auto'));
-- 2. Combine index filtering
SELECT * FROM documents
WHERE MATCH_ALL(content, 'keyword1 keyword2', map('analyzer', 'auto'))
AND status = 'active';
Monitoring and Diagnostics
1. Index Effectiveness Verification
-- Verify if index is being used (via execution plan)
EXPLAIN SELECT * FROM documents
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'));
-- Compare LIKE query performance
-- Method: Record query execution time, compare optimization effects
2. Common Issue Diagnosis
-- Check tokenization results
SELECT TOKENIZE('test text', map('analyzer', 'auto'));
-- Check index status
SHOW INDEX FROM table_name;
-- Verify query syntax
SELECT MATCH_ANY('test text', 'keyword', map('analyzer', 'chinese'));
Known Limitations and Considerations
Functional Limitations
1. Unsupported Scenarios
- External tables: Inverted indexes do not support external tables
- Column type conversion: Forced type conversion on table columns is not supported
2. Query Limitations
-- ❌ Unsupported query patterns
-- 1. Type conversion on table columns
WHERE MATCH_ANY(CAST(column AS STRING), 'value', map('analyzer', 'auto'))
-- 2. Creating inverted index on external table
CREATE INVERTED INDEX idx ON EXTERNAL_TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
1. Index Build Cost
- BUILD INDEX is a synchronous operation that consumes compute resources
- Large tables are recommended to be built partition by partition
- Recommended to perform during off-peak hours
2. Storage Cost
- Inverted indexes create additional index files
- Index size depends on data volume and tokenizer type
- Need to balance query performance and storage cost
Operational Considerations
1. Version Compatibility
- Recommend thorough validation in test environments
- Monitor version updates and feature improvements
2. Backup and Recovery
- Index metadata is backed up along with the table structure
- Index files need to be rebuilt
- BUILD INDEX needs to be executed after recovery
Quick Reference Card
Tokenizer Selection Quick Reference
-- Exact matching (IDs, statuses, etc.)
'analyzer'='keyword'
-- Pure English content
'analyzer'='english'
-- Chinese content (recommended)
'analyzer'='chinese', 'mode'='smart'
-- Multi-language mixed
'analyzer'='unicode'
-- Auto matching (recommended)
'analyzer'='auto'
Common Query Templates
-- 1. Basic text search
SELECT * FROM table_name
WHERE MATCH_ANY(column, 'keyword', map('analyzer', 'auto'));
-- 2. Exact phrase search
SELECT * FROM table_name
WHERE MATCH_PHRASE(column, 'exact phrase', map('analyzer', 'auto'));
-- 3. Multi-condition combined search
SELECT * FROM table_name
WHERE MATCH_ALL(content, 'keyword1 keyword2', map('analyzer', 'auto'))
AND category = 'target_category';
-- 4. Paginated query
SELECT id, title FROM table_name
WHERE MATCH_ANY(content, 'keyword', map('analyzer', 'auto'))
ORDER BY id LIMIT 20 OFFSET 0;
Index Management Commands
-- Create index
CREATE INVERTED INDEX idx_name ON TABLE table_name(column)
PROPERTIES('analyzer'='chinese');
-- Build index
BUILD INDEX idx_name ON table_name;
-- View index
SHOW INDEX FROM table_name;
-- Drop index
DROP INDEX idx_name ON table_name;
Note: This document is based on Lakehouse product documentation as of June 2025. It is recommended to regularly check the official documentation for the latest updates. Before using in a production environment, be sure to verify the correctness and performance impact of all operations in a test environment.