List of Functions Supported by Regular Expressions
Function | Description |
---|---|
RLIKE | Used to check if a string matches a specified regular expression pattern. |
REGEXP_EXTRACT | Extracts matching text from an input string (str) using a specified regular expression (regexp). |
REGEXP_EXTRACT_ALL | Used to extract all substrings from a string that match a regular expression. |
REGEXP_REPLACE | Finds all substrings in string str that match the regular expression regexp and replaces them with the specified string rep. |
Instructions
The regular expression engine used by Lakehouse is re2:
- POSIX Character Classes: Allows the use of character classes like
[[:alnum:]]
. - Backslash Sequences: Includes
\d
(digit),\D
(non-digit),\s
(whitespace),\S
(non-whitespace),\w
(word character),\W
(non-word character),\b
(word boundary), and\B
(non-word boundary). - POSIX Wildcards: Such as
.
(matches any single character) and*
(matches zero or more occurrences of the preceding element). - Unicode Support: All regular expression functions support Unicode characters.
Handling Escape Characters
When using string literals in SQL statements, if they contain backslash sequences or metacharacters (such as \d
, \s
, *
, ?
), these escape characters will be automatically escaped. For example, the output of SELECT '\\' AS col;
will be \
. For values actually stored in the table as \\
, no escaping will occur; only string literals undergo this escaping.
Therefore, when writing regular expressions that need to match these special sequences, additional escape characters need to be added. For example:
When using regular expressions for matching, it is necessary to escape strings that contain backslash sequences or metacharacters:
Usage of String Prefix r
To avoid manually handling escape characters when it is uncertain when they are needed, Lakehouse supports adding the r
prefix to strings, indicating that escape characters in the string will not be escaped and can be directly input into the regular expression for execution. This way, users can write regular expressions as usual without worrying about SQL engine translation issues. For example:
By using the r
prefix, Lakehouse simplifies the use of regular expressions, making text matching and data processing more intuitive and efficient.