JSON Type

JSON (JavaScript Object Notation) is the native semi-structured data type in Singdata Lakehouse. Compared to storing JSON text as STRING, using the native JSON type leverages columnar storage optimization and predicate pushdown, significantly improving query performance and reducing storage overhead.

Why Use the JSON Type?

FeatureSTRING TypeJSON Type
StoragePlain text, no structural compressionColumnar storage: high-frequency fields are automatically extracted into separate columns
Query PerformanceRequires full scan and text parsingColumn pruning: only reads the needed fields without parsing the entire row
Type SafetyRequires manual CAST after extractionBuilt-in functions like json_extract_bigint directly return SQL types

1. Constructing JSON Data

1.1 Parsing from a String (Recommended)

Use the parse_json (or json_parse) function to convert a string into a JSON object. This is the most common way to handle externally imported data.

SELECT parse_json('{"id": 1, "name": "Lakehouse"}') AS data; -- Result: {"id":1,"name":"Lakehouse"}

1.2 JSON Constants

Define JSON data directly in SQL using the JSON '...' syntax.

SELECT JSON '{"key": "value"}' AS const_json;

1.3 Using Constructor Functions

Dynamically build JSON objects or arrays from SQL values.

-- Build an object SELECT json_object('id', 1, 'name', 'Lakehouse'); -- Result: {"id":1,"name":"Lakehouse"} -- Build an array SELECT json_array(1, 2, 3, 'test'); -- Result: [1,2,3,"test"]


2. Querying and Accessing JSON Data

Lakehouse provides two ways to access data inside JSON: shorthand syntax (returns JSON type) and extraction functions (returns SQL types).

2.1 Shorthand Syntax ['key']

Suitable for quick access; the result is still of JSON type. Supports nested access and array indexing.

WITH t AS ( SELECT parse_json('{"user": {"id": 101, "tags": ["vip", "active"]}}') AS data ) SELECT data['user']['id'] AS user_id_json, -- Returns JSON type 101 data['user']['tags'][0] AS first_tag; -- Returns JSON type "vip"

2.2 Type Extraction Functions (Recommended for Calculations)

Use the json_extract_* family of functions to directly return native SQL types (such as BIGINT, STRING), making subsequent calculations and filtering easier.

FunctionReturn TypeExample
json_extract_stringSTRINGjson_extract_string(data, '$.name')
json_extract_bigintBIGINTjson_extract_bigint(data, '$.id')
json_extract_intINTjson_extract_int(data, '$.age')
json_extract_doubleDOUBLEjson_extract_double(data, '$.price')
json_extract_floatFLOATjson_extract_float(data, '$.score')
json_extract_booleanBOOLEANjson_extract_boolean(data, '$.is_active')
json_extract_dateDATEjson_extract_date(data, '$.birth_date')
json_extract_timestampTIMESTAMPjson_extract_timestamp(data, '$.created')

Example:

SELECT json_extract_bigint(data, '$.user.id') AS user_id, json_extract_string(data, '$.user.tags[0]') AS first_tag FROM (SELECT parse_json('{"user": {"id": 101, "tags": ["vip"]}}') AS data) t;


3. Type Conversion (Important Notes)

When handling JSON strings, parse_json and ::json (CAST) behave completely differently — be sure to pay attention:

MethodSemanticsExample Result
parse_json(str)Parses the string content into a JSON structureparse_json('{"a":1}') → JSON object {"a":1}
str::jsonTreats the entire string as a JSON string value'{"a":1}'::json → JSON string "{\"a\":1}"

Practical comparison:

-- Scenario 1: Correct parsing SELECT parse_json('{"a": 1}')['a']; -- Result: 1 (value successfully retrieved) -- Scenario 2: Incorrect conversion (Cast) SELECT ('{"a": 1}'::json)['a']; -- Result: NULL (because this is a string, there is no key 'a')

Conclusion: When your data source is a JSON-formatted string (such as logs or API responses), you must use parse_json. ::json is typically used to convert a plain string (such as "hello") to the JSON string type.


4. Performance Optimization in Practice

4.1 Predicate Pushdown

When filtering with strongly-typed functions like json_extract_bigint, Lakehouse can directly scan the underlying optimized columns, avoiding full-table parsing.

-- Efficient query: leverages columnar index SELECT * FROM logs WHERE json_extract_bigint(data, '$.status') = 200; -- Inefficient query: full table scan with parsing SELECT * FROM logs WHERE data['status']::bigint = 200; -- Note: some versions may not support pushdown

4.2 Handling Invalid JSON

parse_json is fault-tolerant. If the input is not valid JSON, it returns NULL instead of throwing an error, which is very useful when handling dirty data.

SELECT parse_json('invalid json'); -- Result: NULL SELECT parse_json('{"a": 1}'); -- Result: {"a":1}


5. Limitations and Notes

  • No sorting/grouping support: JSON columns themselves do not support ORDER BY, GROUP BY, or JOIN. You must extract specific fields before performing these operations.
  • Size limit: The default maximum JSON string length is 16MB. To adjust this, modify the table properties:

    ALTER TABLE table_name SET PROPERTIES("cz.storage.write.max.json.bytes"="33554432");

  • Key ordering: The order of keys inside a JSON object may be rearranged during storage; output order being inconsistent with input is normal behavior.