Full-Text Search and Text Analysis Guide


Quick Reference

FunctionPurposeUse CaseIndex Required
tokenizeTokenize textWord frequency statistics, tag extraction, text preprocessingNo
match_anyAny-word match (OR logic)Broad search, keyword filteringYes (inverted index)
match_phraseExact phrase matchFinding fixed phrases, error messagesYes (inverted index)
match_phrase_prefixPrefix phrase matchSearch suggestions, autocompleteYes (inverted index)
match_regexpRegular expression matchComplex pattern extraction, log format parsingNo

Prerequisites

All examples in this guide use the following test data:

-- System log and document table CREATE TABLE search_docs ( id BIGINT, title VARCHAR(500), content VARCHAR(1000), log_level VARCHAR(20), log_message VARCHAR(500) ); INSERT INTO search_docs VALUES (1, 'Singdata Lakehouse Introduction', 'Singdata Lakehouse is a modern data platform that supports real-time analytics and AI applications.', 'INFO', 'System started successfully.'), (2, 'Data Pipeline Development', 'Use Dynamic Table to build real-time data pipelines, enabling data flow from ODS to ADS.', 'ERROR', 'Connection timeout to database.'), (3, 'User Behavior Analysis', 'Use BITMAP functions to achieve efficient user overlap analysis and precision marketing.', 'WARN', 'High memory usage detected.'), (4, 'SQL Function Reference', 'This document covers the various SQL functions supported by Singdata Lakehouse, including aggregate, window, and scalar functions.', 'DEBUG', 'Query execution plan generated.'), (5, 'Vector Search Applications', 'Use vector similarity search to implement semantic matching and recommendation systems.', 'INFO', 'Vector index rebuilt.'), (6, 'Data Security and Compliance', 'Supports row- and column-level access control, dynamic data masking, and audit logging.', 'ERROR', 'Authentication failed for user admin.'), (7, 'Performance Optimization Best Practices', 'Improve performance through small file compaction, result caching, and query rewriting.', 'WARN', 'Disk space running low.'), (8, 'Multi-Table Real-Time Sync', 'Use CDC technology to achieve real-time data synchronization from MySQL to Lakehouse.', 'INFO', 'CDC stream processing started.'), (9, 'External Data Federation Query', 'Query external data sources such as Hive and Iceberg directly via External Catalog.', 'DEBUG', 'External catalog metadata refreshed.'), (10, 'AI Model Integration', 'Call large language models directly in SQL for text generation and sentiment analysis.', 'INFO', 'AI model endpoint connected.');


Scenario 1: Text Tokenization and Word Frequency Statistics (tokenize)

Problem

Tokenize mixed Chinese/English text and count the most frequent terms.

SQL Implementation

-- Tokenize using the Chinese analyzer SELECT id, title, tokenize(content, map('analyzer', 'chinese')) AS tokens FROM search_docs WHERE id IN (1, 2, 3);

Output:

idtitletokens
1Singdata Lakehouse Introductionsingdata:lakehouse:modern:data:platform:real-time:analytics:ai:applications
2Data Pipeline Developmentdynamic:table:real-time:data:pipeline:ods:ads:data:flow
3User Behavior Analysisbitmap:functions:efficient:user:overlap:analysis:precision:marketing

Word Frequency Statistics in Practice

-- Expand tokenization results and count the top 10 most frequent terms SELECT token, COUNT(*) AS freq FROM ( SELECT tokenize(content, map('analyzer', 'chinese')) AS token_array FROM search_docs ), LATERAL VIEW explode(token_array) t AS token WHERE LENGTH(token) > 1 -- filter single characters GROUP BY token ORDER BY freq DESC LIMIT 10;

Output:

tokenfreq
data4
real-time3
analysis2
support2
......

Scenario 2: Regular Expression Matching (match_regexp)

Problem

Extract specific error codes or IP addresses from log messages.

SQL Implementation

-- Match logs containing "timeout" or "failed" SELECT id, log_level, log_message FROM search_docs WHERE match_regexp(log_message, '(?i)(timeout|failed)');

Output:

idlog_levellog_message
2ERRORConnection timeout to database.
6ERRORAuthentication failed for user admin.

Extracting Key Information from Logs

-- Extract the error type (assuming format: "ERROR: [Type] message") SELECT id, log_message, -- Use regexp_extract to capture the content before "failed" regexp_extract(log_message, '([A-Za-z]+) failed', 1) AS error_type FROM search_docs WHERE match_regexp(log_message, 'failed');


match_any: Any-Word Match (OR Logic)

-- Search for documents containing "data", "pipeline", or "real-time" -- Note: the third parameter must specify the analyzer, matching the index configuration SELECT id, title, content FROM search_docs WHERE match_any(content, 'data pipeline real-time', map('analyzer', 'chinese'));

Use case: Broad search — a document matches if it contains any one of the keywords.

match_phrase: Exact Phrase Match

-- Search for documents containing the exact phrase "real-time data" SELECT id, title, content FROM search_docs WHERE match_phrase(content, 'real-time data', map('analyzer', 'chinese'));

Use case: Finding fixed phrases, error messages, or proper nouns.

match_phrase_prefix: Prefix Phrase Match

-- Search for phrases starting with "real-time d" SELECT id, title, content FROM search_docs WHERE match_phrase_prefix(content, 'real-time d', map('analyzer', 'chinese'));

Use case: Search suggestions, autocomplete, fuzzy prefix matching.


Scenario 4: End-to-End Example — Log Analysis and Alerting

Problem

Quickly locate ERROR-level critical failures from a large volume of logs and analyze the distribution of error types.

SQL Implementation

WITH error_logs AS ( SELECT id, log_level, log_message, tokenize(log_message, map('analyzer', 'standard')) AS tokens FROM search_docs WHERE log_level = 'ERROR' ) SELECT id, log_message, -- Classify error type CASE WHEN match_regexp(log_message, '(?i)timeout') THEN 'Connection Timeout' WHEN match_regexp(log_message, '(?i)failed') THEN 'Authentication Failed' ELSE 'Other' END AS error_category, -- Extract key entities filter(tokens, t -> LENGTH(t) > 3) AS important_words FROM error_logs;

Output:

idlog_messageerror_categoryimportant_words
2Connection timeout to database.Connection Timeoutconnection:timeout:database
6Authentication failed for user admin.Authentication Failedauthentication:failed:user:admin

Common Issues

1. match_* functions require an inverted index

-- Wrong: calling match_any without an index SELECT * FROM search_docs WHERE match_any(content, 'keyword', map('analyzer', 'chinese')); -- Error: cannot infer analyzer from arguments -- Correct: declare the index in the CREATE TABLE DDL (recommended) CREATE TABLE search_docs ( id BIGINT, content VARCHAR(1000), INDEX idx_content(content) USING INVERTED PROPERTIES('analyzer' = 'chinese') ); -- Data is indexed automatically on insert; queries work immediately -- Or add an index to an existing table CREATE INVERTED INDEX idx_content ON TABLE search_docs(content) PROPERTIES('analyzer' = 'chinese'); BUILD INDEX idx_content ON search_docs; -- manually build for existing data

2. match_* functions require the third parameter

-- Wrong: missing the analyzer parameter match_any(content, 'keyword') -- Correct: specify the analyzer, matching the index configuration match_any(content, 'keyword', map('analyzer', 'chinese')) match_any(content, 'keyword', map('analyzer', 'auto')) -- auto-match index configuration

3. Choosing the right analyzer for tokenize

-- Chinese text must use the 'chinese' analyzer; otherwise tokenization may be incorrect tokenize('Singdata Lakehouse', map('analyzer', 'chinese')) -- Output: singdata:lakehouse tokenize('Singdata Lakehouse') -- default analyzer may only recognize English -- Output: lakehouse

4. Regular expression syntax

-- match_regexp uses Java/PCRE regex syntax match_regexp(log, '(?i)error') -- (?i) for case-insensitive matching match_regexp(log, '\\d{3}') -- match 3 digits (note the escape)

5. tokenize returns an array type

-- Wrong: comparing an array directly to a string SELECT * FROM search_docs WHERE tokenize(content) = 'data'; -- Correct: use explode to expand, or use array_contains SELECT * FROM search_docs WHERE array_contains(tokenize(content), 'data');


Performance Optimization Tips

ScenarioOptimization Strategy
Tokenizing large textUse SUBSTR to truncate to the first 1000 characters before tokenizing
High-frequency search termsCreate an inverted index to avoid full table scans
Regex matchingUse LIKE instead of simple regex where possible for better performance
Filtering after tokenizationApply WHERE filtering immediately after tokenize to reduce intermediate data

-- Recommended: filter after tokenizing SELECT * FROM ( SELECT id, tokenize(content, map('analyzer', 'chinese')) AS tokens FROM search_docs ) WHERE array_contains(tokens, 'data'); -- Not recommended: full table scan then tokenize SELECT id, tokenize(content) FROM search_docs WHERE content LIKE '%data%';