Array and Map Processing Guide


Quick Reference

FunctionPurposeInputOutput
transformTransform each array elementArray + lambdaNew array
filterFilter array elements by conditionArray + lambdaSub-array
existsCheck if any element satisfies a conditionArray + lambdaBOOLEAN
forallCheck if all elements satisfy a conditionArray + lambdaBOOLEAN
array_aggregateAggregate within an arrayArray + initial value + lambdaAny type
zip_withPair two arrays element-by-elementArray1 + Array2 + lambdaNew array
map_filterFilter Map entries by conditionMap + lambdaNew Map
transform_valuesTransform Map valuesMap + lambdaNew Map
transform_keysTransform Map keysMap + lambdaNew Map
map_zip_withMerge two MapsMap1 + Map2 + lambdaNew Map

Prerequisites

All examples in this guide use the following test data:

-- User event log table: each user has multiple events stored as arrays CREATE TABLE user_events ( user_id BIGINT, event_types ARRAY<VARCHAR>, -- event type array: ['view', 'click', 'purchase'] durations ARRAY<INT>, -- corresponding event duration array: [10, 25, 100] tags ARRAY<VARCHAR> -- user tag array: ['vip', 'new_user'] ); INSERT INTO user_events VALUES (1, ARRAY['view', 'click', 'view', 'purchase', 'view'], ARRAY[10, 25, 15, 100, 8], ARRAY['vip', 'active']), (2, ARRAY['click', 'view', 'click'], ARRAY[5, 30, 12], ARRAY['new_user']), (3, ARRAY['view', 'view', 'view'], ARRAY[20, 25, 15], ARRAY['vip', 'inactive']);


Scenario 1: Filtering Event Arrays from Logs

Problem

Filter events of a specific type from the event type array and retrieve the corresponding durations.

SQL Implementation

SELECT user_id, event_types, durations, -- Filter durations for 'view' events (lambda supports an index parameter, 1-based) filter(durations, (d, i) -> event_types[i] = 'view') AS view_durations, -- Filter durations for 'purchase' events filter(durations, (d, i) -> event_types[i] = 'purchase') AS purchase_durations FROM user_events;

Output:

user_idevent_typesdurationsview_durationspurchase_durations
1view:click:view:purchase:view10:25:15:100:810:15:8100
2click:view:click5:30:1230
3view:view:view20:25:1520:25:15

Key Notes

  • The filter lambda supports two parameters: (element, index), where the index starts at 1
  • You can correlate multiple arrays using the index (e.g., event_types[i] paired with durations[i])

Scenario 2: Conditional Checks (exists / forall)

Problem

Determine whether a user has performed a specific action, or whether all actions match a certain pattern.

SQL Implementation

SELECT user_id, event_types, -- Whether there is a 'purchase' event exists(event_types, t -> t = 'purchase') AS has_purchase, -- Whether all events are 'view' forall(event_types, t -> t = 'view') AS all_views, -- Whether any event has a duration greater than 20 exists(durations, d -> d > 20) AS has_long_session FROM user_events;

Output:

user_idevent_typeshas_purchaseall_viewshas_long_session
1view:click:view:purchase:viewtruefalsetrue
2click:view:clickfalsefalsetrue
3view:view:viewfalsetruetrue

Scenario 3: Transforming Array Elements (transform)

Problem

Apply a uniform transformation to every element in an array, such as unit conversion or formatting.

SQL Implementation

SELECT user_id, durations, -- Convert milliseconds to seconds (assuming raw data is in milliseconds) transform(durations, d -> d * 1000) AS durations_ms, -- Classify durations into levels transform(durations, d -> CASE WHEN d >= 30 THEN 'long' WHEN d >= 10 THEN 'medium' ELSE 'short' END ) AS duration_levels FROM user_events;

Output:

user_iddurationsdurations_msduration_levels
110:25:15:100:810000:25000:15000:100000:8000medium:long:medium:long:short
25:30:125000:30000:12000short:long:medium
320:25:1520000:25000:15000medium:long:medium

Scenario 4: Aggregating Within an Array (array_aggregate)

Problem

Perform aggregation calculations within an array — such as sum, maximum, or count — without exploding into multiple rows.

SQL Implementation

SELECT user_id, durations, -- Total duration array_aggregate(durations, 0, (acc, x) -> acc + x) AS total_duration, -- Maximum duration array_aggregate(durations, 0, (acc, x) -> CASE WHEN x > acc THEN x ELSE acc END) AS max_duration, -- Event count (array length) array_aggregate(durations, 0, (acc, x) -> acc + 1) AS event_count FROM user_events;

Output:

user_iddurationstotal_durationmax_durationevent_count
110:25:15:100:81581005
25:30:1247303
320:25:1560253

Key Notes

  • The three-parameter form array_aggregate(array, initial, (acc, x) -> expr) is the most stable
  • The four-parameter form (with a finish lambda) may have codegen limitations for string operations; numeric aggregation is recommended

Scenario 5: Pairing Two Arrays (zip_with)

Problem

Pair two equal-length arrays by position and perform element-wise calculations.

SQL Implementation

SELECT user_id, event_types, durations, -- Calculate the "value per unit duration" for each event -- (assuming purchase = 100, click = 10, view = 1) zip_with( event_types, durations, (t, d) -> d * CASE t WHEN 'purchase' THEN 100 WHEN 'click' THEN 10 ELSE 1 END ) AS event_values FROM user_events;

Output:

user_idevent_values
110:250:15:10000:8
250:30:120
320:25:15

Scenario 6: Filtering and Transforming Map Data

Problem

Process key-value pair data (such as user profile tags or configuration parameters) by filtering or transforming based on conditions.

SQL Implementation

-- Use MAP_FROM_ARRAYS to create a Map (recommended) WITH user_profiles AS ( SELECT 1 AS user_id, MAP_FROM_ARRAYS(ARRAY['age', 'city', 'tier'], ARRAY[25, 1, 3]) AS profile UNION ALL SELECT 2, MAP_FROM_ARRAYS(ARRAY['age', 'city', 'tier'], ARRAY[30, 2, 1]) UNION ALL SELECT 3, MAP_FROM_ARRAYS(ARRAY['age', 'city', 'tier'], ARRAY[22, 1, 2]) ) SELECT user_id, profile, -- Filter profile entries where tier > 1 map_filter(profile, (k, v) -> k = 'tier' AND v > 1) AS high_tier_filter, -- Double the age value transform_values(profile, (k, v) -> CASE WHEN k = 'age' THEN v * 2 ELSE v END ) AS doubled_age, -- Convert keys to uppercase transform_keys(profile, (k, v) -> UPPER(k)) AS upper_keys FROM user_profiles;

Output:

user_idprofilehigh_tier_filterdoubled_ageupper_keys
1age=25:city=1:tier=3tier=3age=50:city=1:tier=3AGE=25:CITY=1:TIER=3
2age=30:city=2:tier=1age=60:city=2:tier=1AGE=30:CITY=2:TIER=1
3age=22:city=1:tier=2tier=2age=44:city=1:tier=2AGE=22:CITY=1:TIER=2

Scenario 7: Merging Two Maps (map_zip_with)

Problem

Merge two Maps and aggregate values for matching keys (e.g., sum or take the maximum).

SQL Implementation

SELECT map_zip_with( MAP_FROM_ARRAYS(ARRAY['a', 'b'], ARRAY[1, 2]), MAP_FROM_ARRAYS(ARRAY['a', 'c'], ARRAY[10, 30]), (k, v1, v2) -> COALESCE(v1, 0) + COALESCE(v2, 0) ) AS merged;

Output:

merged
a=11:b=2:c=30

Scenario 8: End-to-End Example — User Behavior Tagging

Problem

Based on user event arrays, compute user behavior tags:

  • Whether the user made a purchase
  • Average session duration
  • Whether the user is high-frequency (event count ≥ 4)
  • Primary behavior type (most frequent event)

SQL Implementation

SELECT user_id, event_types, durations, -- Purchase flag CASE WHEN exists(event_types, t -> t = 'purchase') THEN 'buyer' ELSE 'non_buyer' END AS buyer_flag, -- Average duration (total duration / event count) array_aggregate(durations, 0, (acc, x) -> acc + x) / array_aggregate(durations, 0, (acc, x) -> acc + 1) AS avg_duration, -- High-frequency flag CASE WHEN array_aggregate(durations, 0, (acc, x) -> acc + 1) >= 4 THEN 'high_freq' ELSE 'low_freq' END AS freq_flag, -- Prefix each tag in the tag array transform(tags, t -> CONCAT('tag_', t)) AS prefixed_tags FROM user_events;

Output:

user_idbuyer_flagavg_durationfreq_flagprefixed_tags
1buyer31high_freqtag_vip:tag_active
2non_buyer15low_freqtag_new_user
3non_buyer20low_freqtag_vip:tag_inactive

Common Issues

1. Type error when creating a Map with MAP()

-- Wrong: MAP(ARRAY, ARRAY) creates map<array<string>, array<int>> MAP(ARRAY['a', 'b'], ARRAY[1, 2]) -- incorrect type -- Correct: use MAP_FROM_ARRAYS MAP_FROM_ARRAYS(ARRAY['a', 'b'], ARRAY[1, 2]) -- correct type: map<string, int>

2. Lambda index starts at 1

-- The index parameter in filter starts at 1, not 0 filter(durations, (d, i) -> event_types[i] = 'view') -- i=1 refers to the first element

3. Codegen limitation with array_aggregate finish function

-- May error: string operations in the finish function (CONCAT not supported in codegen) array_aggregate(ARRAY['a', 'b'], '', (acc, x) -> acc || x, s -> CONCAT('[', s, ']')) -- Recommended: numeric aggregation is stable; use transform for string operations after aggregation array_aggregate(ARRAY[1, 2, 3], 0, (acc, x) -> acc + x) -- stable

4. Short-circuit logic in exists vs forall

  • exists: returns true as soon as the first element satisfying the condition is found
  • forall: returns false as soon as the first element not satisfying the condition is found
  • Empty array: exists returns false, forall returns true

5. zip_with requires arrays of equal length

-- If arrays have different lengths, extra elements may be dropped or filled with NULL (implementation-dependent) zip_with(ARRAY[1, 2], ARRAY[10, 20, 30], (x, y) -> x + y) -- the third element may be lost


Performance Optimization Tips

ScenarioOptimization Strategy
Filtering large arraysUse filter to shrink the array first, then apply transform or aggregation
Frequent exists checksConsider extracting high-frequency check fields into separate columns to avoid scanning the array each time
Array aggregationarray_aggregate is more efficient than explode + GROUP BY (no shuffle required)
Map operationsUse map_filter to filter first, then process, to reduce unnecessary key-value pair computation

-- Recommended: filter first, then transform (fewer elements to process) transform(filter(durations, d -> d > 10), d -> d * 2) -- Not recommended: transform first, then filter (all elements go through transform) filter(transform(durations, d -> d * 2), d -> d > 20)