Lakehouse String Processing Guide

Overview

String processing is one of the most common requirements in log analysis, user behavior analysis, and data cleansing. Singdata Lakehouse provides comprehensive string function support, including substring extraction and concatenation, regex extraction and replacement, fuzzy matching, and string aggregation. This guide is organized by business scenario to help you quickly master efficient string processing methods.


SQL Commands Covered

Command/FunctionPurposeUse Case
SUBSTR() / SUBSTRING()Extract substring by positionExtract domain, truncate fixed-length fields
CONCAT()Concatenate multiple stringsBuild composite keys, concatenate fields
CONCAT_WS()Concatenate with a delimiterConcatenate multiple columns into one delimited string
LOCATE() / INSTR()Find character or substring positionAssist with extraction, locate delimiter positions
REPLACE()Literal string replacementRemove fixed prefixes/suffixes
REGEXP_EXTRACT()Extract matching group via regexExtract fields from URLs/logs
REGEXP_REPLACE()Replace matches via regexCleanse special characters, mask data
SPLIT()Split a string by delimiter into an arrayConvert comma-separated tags to an array
LATERAL VIEW EXPLODE()Expand an array column into multiple rowsTag expansion, flatten multi-value fields
LIKEWildcard pattern matchingKeyword filtering
RLIKERegex pattern matchingComplex pattern filtering
GROUP_CONCAT()Aggregate multiple rows into a string (supports ORDER BY)Merge access IDs in order
WM_CONCAT()Aggregate multiple rows into a stringSimple multi-row merging
UPPER() / LOWER()Case conversionField normalization, unified formatting
TRIM() / LTRIM() / RTRIM()Remove leading/trailing whitespaceCleanse user input
LENGTH()Return character count of a stringLength validation, filter empty values

Prerequisites

The following examples use a simulated access log table doc_str_logs:

-- Create test table
CREATE TABLE IF NOT EXISTS doc_str_logs (
    log_id  INT,
    user_id STRING,
    url     STRING,
    tags    STRING,
    content STRING
);

-- Insert test data
INSERT INTO doc_str_logs VALUES
(1, 'u001', 'https://www.example.com/product/detail?id=123',  'tech,mobile,review',    'Hello World! This is a test.'),
(2, 'u002', 'https://shop.example.com/category/shoes?color=red', 'fashion,shoes,sale', '  spaces around  '),
(3, 'u003', 'https://www.example.com/blog/post?tag=python',   'tech,python,tutorial',  'Python is great!'),
(4, 'u001', 'https://api.example.com/v2/users?page=1',        'api,backend,rest',      'API response: {"status":"ok"}'),
(5, 'u004', 'https://www.example.com/search?q=lakehouse',     'data,lakehouse,cloud',  'Lakehouse Data Platform'),
(6, 'u002', 'https://cdn.example.com/images/logo.png',        'image,cdn,static',      'CDN Resource File'),
(7, 'u005', 'https://www.example.com/product/detail?id=456',  'tech,mobile,new',       'Phone: 138-0013-8000, please contact'),
(8, 'u003', 'https://shop.example.com/checkout?order=789',    'order,payment,checkout','Order #789 confirmed!');

Scenario 1: Extraction and Concatenation

Extracting domain names from URLs, or concatenating multiple fields into composite keys, are the most common string operations in log processing.

Extract Domain with SUBSTR + LOCATE

SUBSTR(str, pos, len) extracts a substring of specified length starting from the specified position. LOCATE(substr, str, start) returns the position of the first occurrence of the substring starting from the start position. Using them together allows precise extraction of the domain part of a URL.

https:// occupies exactly 8 characters, so the domain starts at position 9 and ends at the first slash:

SELECT
    log_id,
    url,
    SUBSTR(url, 9, LOCATE('/', url, 9) - 9) AS domain
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_idurldomain
1https://www.example.com/product/detail?id=123www.example.com
2https://shop.example.com/category/shoes?color=redshop.example.com
3https://www.example.com/blog/post?tag=pythonwww.example.com
4https://api.example.com/v2/users?page=1api.example.com
5https://www.example.com/search?q=lakehousewww.example.com
6https://cdn.example.com/images/logo.pngcdn.example.com
7https://www.example.com/product/detail?id=456www.example.com
8https://shop.example.com/checkout?order=789shop.example.com

Build Composite Keys with CONCAT

CONCAT(str1, str2, ...) concatenates multiple strings into one. Commonly used to create unique identifiers or label row provenance:

SELECT
    log_id,
    CONCAT(user_id, '@', REGEXP_EXTRACT(url, 'https?://([/]+)', 1)) AS user_domain_key
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_iduser_domain_key
1u001@www.example.com
2u002@shop.example.com
3u003@www.example.com
4u001@api.example.com
5u004@www.example.com
6u002@cdn.example.com
7u005@www.example.com
8u003@shop.example.com

Concatenate Multiple Fields with CONCAT_WS

CONCAT_WS(sep, str1, str2, ...) concatenates multiple columns with a uniform delimiter. It is cleaner than multiple CONCAT calls and automatically skips NULL values:

SELECT
    log_id,
    url,
    CONCAT_WS('|',
        REGEXP_EXTRACT(url, 'https?://([/]+)', 1),
        REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1),
        REGEXP_EXTRACT(url, '[?](.*)', 1)
    ) AS url_parts
FROM doc_str_logs
ORDER BY log_id
LIMIT 5;

Execution Results (first 5 rows):

log_idurlurl_parts
1https://www.example.com/product/detail?id=123www.example.com\|/product/detail\|id=123
2https://shop.example.com/category/shoes?color=redshop.example.com|/category/shoes|color=red
3https://www.example.com/blog/post?tag=pythonwww.example.com\|/blog/post\|tag=python
4https://api.example.com/v2/users?page=1api.example.com|/v2/users|page=1
5https://www.example.com/search?q=lakehousewww.example.com\|/search\|q=lakehouse

Scenario 2: Regex Extraction

When URLs or log formats are more complex, REGEXP_EXTRACT is more intuitive and maintainable than the SUBSTR + LOCATE combination.

REGEXP_EXTRACT(str, pattern, group_index) returns the content matched by the group_index-th capture group (starting from 1); a group_index of 0 returns the entire match.

Split a URL into Domain, Path, and Query Parameters

SELECT
    log_id,
    REGEXP_EXTRACT(url, 'https?://([/?]+)', 1)         AS domain,
    REGEXP_EXTRACT(url, 'https?://[/?]+(/[?]*)', 1)   AS path,
    REGEXP_EXTRACT(url, '[?](.*)', 1)                    AS query_string
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_iddomainpathquery_string
1www.example.com/product/detailid=123
2shop.example.com/category/shoescolor=red
3www.example.com/blog/posttag=python
4api.example.com/v2/userspage=1
5www.example.com/searchq=lakehouse
6cdn.example.com/images/logo.png(empty)
7www.example.com/product/detailid=456
8shop.example.com/checkoutorder=789

Extract the First Parameter Key and Value from a URL

SELECT
    log_id,
    REGEXP_EXTRACT(url, '[?&]([=]+)=([&]+)', 1) AS param_key,
    REGEXP_EXTRACT(url, '[?&][=]+=([&]+)',    1) AS param_value
FROM doc_str_logs
WHERE url LIKE '%?%'
ORDER BY log_id;

Execution Results:

log_idparam_keyparam_value
1id123
2colorred
3tagpython
4page1
5qlakehouse
7id456
8order789

Scenario 3: Regex Replace and Cleansing

REGEXP_REPLACE(str, pattern, replacement) replaces all parts of the string matching pattern with replacement, commonly used for removing noise and masking data.

Remove Special Characters

Keep Chinese characters, English letters, digits, and spaces; remove punctuation and other special characters:

SELECT
    log_id,
    content,
    REGEXP_REPLACE(content, '[a-zA-Z0-9\\u4e00-\\u9fa5 ]', '') AS cleaned
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_idcontentcleaned
1Hello World! This is a test.Hello World This is a test
2spaces aroundspaces around
3Python is great!Python is great
4API response: {"status":"ok"}API response statusok
5Lakehouse Data PlatformLakehouse Data Platform
6CDN Resource FileCDN Resource File
7Phone: 138-0013-8000, please contactPhone13800138000please contact
8Order #789 confirmed!Order 789 confirmed

Mask Phone Numbers

Replace phone numbers in NNN-NNNN-NNNN format with a masked placeholder:

SELECT
    log_id,
    content,
    REGEXP_REPLACE(content, '[0-9]{3}-[0-9]{4}-[0-9]{4}', 'PHONE_HIDDEN') AS masked
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_idcontentmasked
1Hello World! This is a test.Hello World! This is a test.
2spaces aroundspaces around
3Python is great!Python is great!
4API response: {"status":"ok"}API response: {"status":"ok"}
5Lakehouse Data PlatformLakehouse Data Platform
6CDN Resource FileCDN Resource File
7Phone: 138-0013-8000, please contactPhone: PHONE_HIDDEN, please contact
8Order #789 confirmed!Order #789 confirmed!

Remove Fixed Prefixes with REPLACE

For simple literal replacements, REPLACE(str, search, replacement) is more efficient than regex:

SELECT
    log_id,
    REPLACE(url, 'https://', '') AS url_no_scheme
FROM doc_str_logs
ORDER BY log_id
LIMIT 4;

Execution Results (first 4 rows):

log_idurl_no_scheme
1www.example.com/product/detail?id=123
2shop.example.com/category/shoes?color=red
3www.example.com/blog/post?tag=python
4api.example.com/v2/users?page=1

Scenario 4: String Split to Rows

When a single column stores multiple values (such as comma-separated tags), you first need to use SPLIT to cut it into an array, then use LATERAL VIEW EXPLODE to expand the array into multiple rows, enabling tag-based grouping and statistics.

Expand a Tag Column

SELECT
    log_id,
    tags,
    tag
FROM doc_str_logs
LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag
ORDER BY log_id, tag;

Execution Results (first 9 rows):

log_idtagstag
1tech,mobile,reviewmobile
1tech,mobile,reviewreview
1tech,mobile,reviewtech
2fashion,shoes,salefashion
2fashion,shoes,salesale
2fashion,shoes,saleshoes
3tech,python,tutorialpython
3tech,python,tutorialtech
3tech,python,tutorialtutorial

Count Tag Frequencies

After expansion, aggregate analysis on tags can be performed:

SELECT
    tag,
    COUNT(*) AS log_count
FROM doc_str_logs
LATERAL VIEW EXPLODE(SPLIT(tags, ',')) t AS tag
GROUP BY tag
ORDER BY log_count DESC, tag;

Execution Results:

taglog_count
tech3
mobile2
api1
backend1
cdn1
checkout1
cloud1
data1
fashion1
image1
lakehouse1
new1
order1
payment1
python1
rest1
review1
sale1
shoes1
static1
tutorial1

If you only need to convert the string to an array and count the number of elements without expanding into rows, you can use SIZE(SPLIT(...)):

SELECT
    log_id,
    tags,
    SPLIT(tags, ',')        AS tag_array,
    SIZE(SPLIT(tags, ','))  AS tag_count
FROM doc_str_logs
ORDER BY log_id
LIMIT 4;

Execution Results (first 4 rows):

log_idtagstag_arraytag_count
1tech,mobile,review["tech","mobile","review"]3
2fashion,shoes,sale["fashion","shoes","sale"]3
3tech,python,tutorial["tech","python","tutorial"]3
4api,backend,rest["api","backend","rest"]3

Scenario 5: Fuzzy Matching and Filtering

LIKE: Wildcard Matching

LIKE uses % (any number of characters) and _ (a single character) as wildcards. Suitable for simple patterns such as path prefixes and keyword containment:

-- Filter records where the URL path contains /product/
SELECT log_id, url
FROM doc_str_logs
WHERE url LIKE '%/product/%'
ORDER BY log_id;

Execution Results:

log_idurl
1https://www.example.com/product/detail?id=123
7https://www.example.com/product/detail?id=456

RLIKE: Regex Matching

RLIKE (equivalent to REGEXP) supports full regex syntax and is suitable for complex patterns:

-- Filter visits from the www or shop subdomain
SELECT log_id, url
FROM doc_str_logs
WHERE url RLIKE 'https://(www|shop)\\.example\\.com'
ORDER BY log_id;

Execution Results:

log_idurl
1https://www.example.com/product/detail?id=123
2https://shop.example.com/category/shoes?color=red
3https://www.example.com/blog/post?tag=python
5https://www.example.com/search?q=lakehouse
7https://www.example.com/product/detail?id=456
8https://shop.example.com/checkout?order=789
-- Filter records where content contains a phone number pattern
SELECT log_id, content
FROM doc_str_logs
WHERE content RLIKE '[0-9]{3}-[0-9]{4}-[0-9]{4}'
ORDER BY log_id;

Execution Results:

log_idcontent
7Phone: 138-0013-8000, please contact

Scenario 6: String Aggregation

Merge multiple rows of strings into a single row, commonly used for generating report summaries or constructing analytical dimensions.

GROUP_CONCAT: Ordered Aggregation (supports ORDER BY)

GROUP_CONCAT(expr ORDER BY col SEPARATOR sep) supports sorting and custom delimiters during merging:

-- Aggregate access record IDs by user, ordered by log_id ascending
SELECT
    user_id,
    GROUP_CONCAT(log_id ORDER BY log_id SEPARATOR ',') AS log_ids,
    COUNT(*) AS visit_count
FROM doc_str_logs
GROUP BY user_id
ORDER BY user_id;

Execution Results:

user_idlog_idsvisit_count
u0011,42
u0022,62
u0033,82
u00451
u00571
-- Aggregate access tags by user, separated by semicolons
SELECT
    user_id,
    GROUP_CONCAT(tags ORDER BY log_id SEPARATOR '; ') AS all_tags
FROM doc_str_logs
GROUP BY user_id
ORDER BY user_id;

Execution Results:

user_idall_tags
u001tech,mobile,review; api,backend,rest
u002fashion,shoes,sale; image,cdn,static
u003tech,python,tutorial; order,payment,checkout
u004data,lakehouse,cloud
u005tech,mobile,new

WM_CONCAT: Concise Syntax

WM_CONCAT(sep, expr) is a more concise string aggregation syntax that does not support ORDER BY, suitable for scenarios where ordering is not required:

SELECT
    user_id,
    WM_CONCAT(';', tags) AS all_tags
FROM doc_str_logs
GROUP BY user_id
ORDER BY user_id;

Execution Results:

user_idall_tags
u001tech,mobile,review;api,backend,rest
u002fashion,shoes,sale;image,cdn,static
u003tech,python,tutorial;order,payment,checkout
u004data,lakehouse,cloud
u005tech,mobile,new

GROUP_CONCAT DISTINCT: Aggregate Unique Values

Deduplicate and merge the unique domains visited by each user:

SELECT
    user_id,
    GROUP_CONCAT(DISTINCT REGEXP_EXTRACT(url, 'https?://([/]+)', 1) SEPARATOR ', ') AS domains
FROM doc_str_logs
GROUP BY user_id
ORDER BY user_id;

Execution Results:

user_iddomains
u001www.example.com, api.example.com
u002shop.example.com, cdn.example.com
u003www.example.com, shop.example.com
u004www.example.com
u005www.example.com

Scenario 7: Case and Whitespace Handling

Data is often ingested with inconsistent casing or leading/trailing whitespace, requiring normalization before analysis.

Case Conversion

UPPER(str) converts a string to uppercase, LOWER(str) to lowercase:

SELECT
    log_id,
    content,
    UPPER(content) AS upper_content
FROM doc_str_logs
WHERE log_id IN (1, 3, 8)
ORDER BY log_id;

Execution Results:

log_idcontentupper_content
1Hello World! This is a test.HELLO WORLD! THIS IS A TEST.
3Python is great!PYTHON IS GREAT!
8Order #789 confirmed!ORDER #789 CONFIRMED!

Remove Leading/Trailing Whitespace

TRIM(str) removes whitespace from both ends, LTRIM removes from the left only, RTRIM from the right only:

SELECT
    log_id,
    content,
    TRIM(content)   AS trimmed,
    LTRIM(content)  AS ltrimmed,
    RTRIM(content)  AS rtrimmed
FROM doc_str_logs
WHERE log_id = 2;

Execution Results:

log_idcontenttrimmedltrimmedrtrimmed
2spaces aroundspaces aroundspaces aroundspaces around

Combined Normalization: LOWER + TRIM + LENGTH

Before grouping or JOINing, handle casing and whitespace simultaneously to avoid matching failures due to format differences:

SELECT
    log_id,
    user_id,
    LOWER(TRIM(user_id))  AS normalized_uid,
    LENGTH(TRIM(content)) AS content_len
FROM doc_str_logs
ORDER BY log_id;

Execution Results:

log_iduser_idnormalized_uidcontent_len
1u001u00128
2u002u00213
3u003u00316
4u001u00129
5u004u00414
6u002u0027
7u005u00521
8u003u00321

Clean Up Test Data

After completing string processing verification, it is recommended to clean up the test table:

DROP TABLE IF EXISTS doc_str_logs;

Notes

  1. Escaping ? in REGEXP_EXTRACT: In regex, ? is a quantifier. To match a literal ?, write it as [?] or \?. Writing '?(.*)' directly will cause a no argument for repetition operator error.
  2. REGEXP_REPLACE Does Not Support Backreferences: Singdata Lakehouse's REGEXP_REPLACE does not support capture group backreferences like $1 or \1; the replacement content can only be a literal string.
  3. DISTINCT and ORDER BY Are Incompatible in GROUP_CONCAT: Using DISTINCT and ORDER BY together in the same GROUP_CONCAT call will produce a Distinct aggregate call and aggregate call with order by can not coexist error.
  4. LIKE vs RLIKE Performance: LIKE with wildcards performs better than RLIKE; prefer LIKE when full regex capabilities are not needed.
  5. LENGTH Counts Characters, Not Bytes: LENGTH('CDN Resource File') returns the character count, not the byte length. For byte length, use OCTET_LENGTH.
  6. SPLIT Delimiter Is Regex: The second argument of SPLIT(str, pattern) is a regex pattern. If the delimiter contains special regex characters such as . or |, they need to be escaped: SPLIT(str, '\\.').