Basic Concepts
SQL Nested Data Types allow complex data structures, such as STRUCT and ARRAY, to be included in the columns of a table. These data structures make it more concise and clear to represent one-to-one, many-to-one, and hierarchical relationships.
- STRUCT: Represents a single field containing multiple related columns, which can represent the attributes of an object.
- ARRAY: Represents an array containing multiple structured fields, suitable for many-to-one relationships.
Advantages
- Simplified Data Schema: Using STRUCT and ARRAY types can simplify the schema of data tables, making it more convenient to store and retrieve related column data.
- Improved Query Performance: Nested data types can reduce table join operations because related data is already nested within a single field.
- Enhanced Readability: Nested structures make the data schema more intuitive, making it easier for developers to understand and use.
- Flexible Data Processing: Nested data types support flexible operations and sorting of complex objects, as well as efficient splitting and reorganization of data during processing.
Use Cases
- One-to-One and Hierarchical Relationships: Suitable for scenarios where multiple related fields need to be combined, such as combining user information and address information.
SELECT
c_custkey,
STRUCT(c_name, c_address, c_phone) AS customer_info
FROM
clickzetta_sample_data.tpch_100g.customer;
- Many-to-One Relationship: Suitable for scenarios where multiple related objects need to be stored in a single field, such as the combination of orders and order items.
SELECT
o.o_orderkey,
ARRAY_AGG(STRUCT(o.o_clerk, o.o_totalprice, o.o_orderpriority)) AS items
FROM
clickzetta_sample_data.tpch_100g.orders o
GROUP BY
o.o_orderkey;
- Data Transformation and Cleaning: Nested data types can be used to split data into smaller parts for transformation and cleaning, and then recombine them.
-- Define orders_array using CTE
WITH orders_array AS (
-- Aggregate all order IDs for each customer into an array
SELECT
o_custkey,
ARRAY_AGG(o_orderkey) AS order_keys
FROM
clickzetta_sample_data.tpch_100g.orders
GROUP BY
o_custkey
),
-- Define modified_items using CTE
modified_items AS (
-- Use the TRANSFORM function to add 1 to each order ID, generating a new array
SELECT
o_custkey,
TRANSFORM(order_keys, x -> x + 1) AS modified_order_keys
FROM
orders_array
)
-- Main query section
SELECT
c.c_custkey,
mi.modified_order_keys
FROM
clickzetta_sample_data.tpch_100g.customer c
-- Left join modified_items based on customer ID (c_custkey)
LEFT JOIN
modified_items mi
ON
c.c_custkey = mi.o_custkey;

Using nested data types (such as STRUCT and ARRAY) for data transformation can simplify data models, improve query performance, and enhance the flexibility and readability of data processing, providing an effective solution for managing complex data structures.
Data Model
TPC-H data represents the data warehouse of an automotive parts supplier, recording orders, items that make up the orders (lineitem), suppliers, customers, parts sold (part), regions, countries, and parts suppliers (partsupp).
Singdata Lakehouse has built-in shared TPC-H data, which each user can directly use by adding the data context, for example:
SELECT * FROM
clickzetta_sample_data.tpch_100g.customer
LIMIT 10;
Efficient Use of Nested Data Types
Using STRUCT to Handle One-to-One and Hierarchical Relationships
-- Without using nested data types
SELECT l.*,
c.*,
s.*
FROM
clickzetta_sample_data.tpch_100g.lineitem l
LEFT JOIN
clickzetta_sample_data.tpch_100g.orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;

-- Using nested data types
SELECT
l.*,
struct(
c.c_custkey,
c.c_name,
c.c_address,
c.c_nationkey,
c.c_phone,
c.c_acctbal,
c.c_mktsegment,
c.c_comment
) AS customer,
struct(
s.s_suppkey,
s.s_name,
s.s_address,
s.s_nationkey,
s.s_phone,
s.s_acctbal,
s.s_comment
) AS supplier
FROM
clickzetta_sample_data.tpch_100g.lineitem l
LEFT JOIN
clickzetta_sample_data.tpch_100g.orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.supplier s ON l.l_suppkey = s.s_suppkey
LIMIT 5;


-- Hierarchical Data
SELECT
l.*,
struct(
c.c_custkey,
c.c_name,
c.c_address,
c.c_nationkey,
c.c_phone,
c.c_acctbal,
c.c_mktsegment,
c.c_comment,
struct(
n.n_nationkey,
n.n_name,
n.n_regionkey,
n.n_comment
)
) AS customer,
struct(
s.s_suppkey,
s.s_name,
s.s_address,
s.s_nationkey,
s.s_phone,
s.s_acctbal,
s.s_comment,
struct(
sn.n_nationkey,
sn.n_name,
sn.n_regionkey,
sn.n_comment
)
) AS supplier
FROM
clickzetta_sample_data.tpch_100g.lineitem l
LEFT JOIN
clickzetta_sample_data.tpch_100g.orders o ON l.l_orderkey = o.o_orderkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.nation n ON c.c_nationkey = n.n_nationkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.supplier s ON l.l_suppkey = s.s_suppkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.nation sn ON s.s_nationkey = sn.n_nationkey
LIMIT 5;


-- Use ARRAY[STRUCT] to handle one-to-many relationships
WITH line_items AS (
SELECT
l_orderkey AS orderkey,
array_agg(
struct(
l.l_linenumber,
l.l_partkey,
l.l_suppkey,
l.l_quantity,
l.l_extendedprice,
l.l_discount,
l.l_tax,
l.l_returnflag,
l.l_linestatus,
l.l_shipdate,
l.l_commitdate,
l.l_receiptdate,
l.l_shipinstruct,
l.l_shipmode,
l.l_comment
)
) AS lineitems
FROM
clickzetta_sample_data.tpch_100g.lineitem l
GROUP BY
l_orderkey
)
SELECT
o.*,
size(l.lineitems) AS num_lineitems,
l.lineitems
FROM
clickzetta_sample_data.tpch_100g.orders o
LEFT JOIN
line_items l ON o.o_orderkey = l.orderkey
LIMIT 5;

Using Nested Data Types in Data Processing
DROP TABLE IF EXISTS wide_orders;
CREATE TABLE IF NOT EXISTS wide_orders AS
WITH line_items AS (
SELECT
l_orderkey AS orderkey,
array_agg(
(
l.l_linenumber AS lineitemkey,
l.l_partkey AS partkey,
l.l_suppkey AS suppkey,
l.l_quantity AS quantity,
l.l_extendedprice AS extendedprice,
l.l_discount AS discount,
l.l_tax AS tax,
l.l_returnflag AS returnflag,
l.l_linestatus AS linestatus,
l.l_shipdate AS shipdate,
l.l_commitdate AS commitdate,
l.l_receiptdate AS receiptdate,
l.l_shipinstruct AS shipinstruct,
l.l_shipmode AS shipmode,
l.l_comment AS comment
)
) AS lineitems
FROM
clickzetta_sample_data.tpch_100g.lineitem l
GROUP BY
l_orderkey
)
SELECT
o.*,
l.lineitems,
(
c.c_custkey AS id,
c.c_name AS name,
c.c_address AS address,
c.c_nationkey AS nationkey,
c.c_phone AS phone,
c.c_acctbal AS acctbal,
c.c_mktsegment AS mktsegment,
c.c_comment AS comment,
(
n.n_nationkey AS nationkey,
n.n_name AS name,
n.n_regionkey AS regionkey,
n.n_comment AS comment
) AS nation
) AS customer
FROM
clickzetta_sample_data.tpch_100g.orders o
LEFT JOIN
line_items l ON o.o_orderkey = l.orderkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN
clickzetta_sample_data.tpch_100g.nation n ON c.c_nationkey = n.n_nationkey;
STRUCT makes data architecture and data access simpler
SELECT o_orderkey,
customer.name,
customer.address,
lineitems[0] AS first_lineitem
FROM wide_orders
LIMIT 2;

Expand ARRAY into rows and recombine into ARRAY
-- Convert rows to ARRAY
WITH lineitems AS (
SELECT
o.o_orderkey,
EXPLODE(o.lineitems) AS line_item
FROM
wide_orders o
),
unnested_line_items AS (
SELECT
o_orderkey,
line_item.lineitemkey,
line_item.partkey,
line_item.quantity
FROM
lineitems
)
SELECT
o_orderkey,
array_agg(
struct(
lineitemkey,
partkey,
quantity
)
) AS lineitems
FROM
unnested_line_items
GROUP BY
o_orderkey
LIMIT
5;
-- ARRAY to rows
WITH
lineitems AS (
SELECT
o.o_orderkey,
UNNEST (o.lineitems)
FROM
wide_orders o
)
SELECT
o_orderkey,
lineitemkey,
partkey,
quantity
FROM
lineitems
LIMIT
5;

-- Get lineitem metrics
WITH
lineitems AS (
SELECT
o.o_orderkey,
UNNEST (o.lineitems)
FROM
wide_orders o
)
SELECT
o_orderkey,
COUNT(lineitemkey) AS num_line_items,
SUM(quantity) AS total_line_item_quantity
FROM
lineitems
GROUP BY
1
ORDER BY
1
LIMIT
10;

EXPLAIN
WITH
lineitems AS (
SELECT
o.o_orderkey,
UNNEST (o.lineitems)
FROM
wide_orders o
)
SELECT
o_orderkey,
COUNT(lineitemkey) AS num_line_items,
SUM(quantity) AS total_line_item_quantity
FROM
lineitems
GROUP BY
1
ORDER BY
1
LIMIT
10;
Data
When using nested structures, please pay attention to the query execution performance. You can use EXPLODE to check if the execution plan meets expectations.
EXPLODE
UNNEST
ARRAY_AGG