Array and Map Processing Guide
Quick Reference
| Function | Purpose | Input | Output |
|---|---|---|---|
transform | Transform each array element | Array + lambda | New array |
filter | Filter array elements by condition | Array + lambda | Sub-array |
exists | Check if any element satisfies a condition | Array + lambda | BOOLEAN |
forall | Check if all elements satisfy a condition | Array + lambda | BOOLEAN |
array_aggregate | Aggregate within an array | Array + initial value + lambda | Any type |
zip_with | Pair two arrays element-by-element | Array1 + Array2 + lambda | New array |
map_filter | Filter Map entries by condition | Map + lambda | New Map |
transform_values | Transform Map values | Map + lambda | New Map |
transform_keys | Transform Map keys | Map + lambda | New Map |
map_zip_with | Merge two Maps | Map1 + Map2 + lambda | New Map |
Prerequisites
All examples in this guide use the following test data:
Scenario 1: Filtering Event Arrays from Logs
Problem
Filter events of a specific type from the event type array and retrieve the corresponding durations.
SQL Implementation
Output:
| user_id | event_types | durations | view_durations | purchase_durations |
|---|---|---|---|---|
| 1 | view:click:view:purchase:view | 10:25:15:100:8 | 10:15:8 | 100 |
| 2 | click:view:click | 5:30:12 | 30 | — |
| 3 | view:view:view | 20:25:15 | 20:25:15 | — |
Key Notes
- The
filterlambda supports two parameters:(element, index), where the index starts at 1 - You can correlate multiple arrays using the index (e.g.,
event_types[i]paired withdurations[i])
Scenario 2: Conditional Checks (exists / forall)
Problem
Determine whether a user has performed a specific action, or whether all actions match a certain pattern.
SQL Implementation
Output:
| user_id | event_types | has_purchase | all_views | has_long_session |
|---|---|---|---|---|
| 1 | view:click:view:purchase:view | true | false | true |
| 2 | click:view:click | false | false | true |
| 3 | view:view:view | false | true | true |
Scenario 3: Transforming Array Elements (transform)
Problem
Apply a uniform transformation to every element in an array, such as unit conversion or formatting.
SQL Implementation
Output:
| user_id | durations | durations_ms | duration_levels |
|---|---|---|---|
| 1 | 10:25:15:100:8 | 10000:25000:15000:100000:8000 | medium:long:medium:long:short |
| 2 | 5:30:12 | 5000:30000:12000 | short:long:medium |
| 3 | 20:25:15 | 20000:25000:15000 | medium:long:medium |
Scenario 4: Aggregating Within an Array (array_aggregate)
Problem
Perform aggregation calculations within an array — such as sum, maximum, or count — without exploding into multiple rows.
SQL Implementation
Output:
| user_id | durations | total_duration | max_duration | event_count |
|---|---|---|---|---|
| 1 | 10:25:15:100:8 | 158 | 100 | 5 |
| 2 | 5:30:12 | 47 | 30 | 3 |
| 3 | 20:25:15 | 60 | 25 | 3 |
Key Notes
- The three-parameter form
array_aggregate(array, initial, (acc, x) -> expr)is the most stable - The four-parameter form (with a finish lambda) may have codegen limitations for string operations; numeric aggregation is recommended
Scenario 5: Pairing Two Arrays (zip_with)
Problem
Pair two equal-length arrays by position and perform element-wise calculations.
SQL Implementation
Output:
| user_id | event_values |
|---|---|
| 1 | 10:250:15:10000:8 |
| 2 | 50:30:120 |
| 3 | 20:25:15 |
Scenario 6: Filtering and Transforming Map Data
Problem
Process key-value pair data (such as user profile tags or configuration parameters) by filtering or transforming based on conditions.
SQL Implementation
Output:
| user_id | profile | high_tier_filter | doubled_age | upper_keys |
|---|---|---|---|---|
| 1 | age=25:city=1:tier=3 | tier=3 | age=50:city=1:tier=3 | AGE=25:CITY=1:TIER=3 |
| 2 | age=30:city=2:tier=1 | — | age=60:city=2:tier=1 | AGE=30:CITY=2:TIER=1 |
| 3 | age=22:city=1:tier=2 | tier=2 | age=44:city=1:tier=2 | AGE=22:CITY=1:TIER=2 |
Scenario 7: Merging Two Maps (map_zip_with)
Problem
Merge two Maps and aggregate values for matching keys (e.g., sum or take the maximum).
SQL Implementation
Output:
| merged |
|---|
| a=11:b=2:c=30 |
Scenario 8: End-to-End Example — User Behavior Tagging
Problem
Based on user event arrays, compute user behavior tags:
- Whether the user made a purchase
- Average session duration
- Whether the user is high-frequency (event count ≥ 4)
- Primary behavior type (most frequent event)
SQL Implementation
Output:
| user_id | buyer_flag | avg_duration | freq_flag | prefixed_tags |
|---|---|---|---|---|
| 1 | buyer | 31 | high_freq | tag_vip:tag_active |
| 2 | non_buyer | 15 | low_freq | tag_new_user |
| 3 | non_buyer | 20 | low_freq | tag_vip:tag_inactive |
Common Issues
1. Type error when creating a Map with MAP()
2. Lambda index starts at 1
3. Codegen limitation with array_aggregate finish function
4. Short-circuit logic in exists vs forall
exists: returns true as soon as the first element satisfying the condition is foundforall: returns false as soon as the first element not satisfying the condition is found- Empty array:
existsreturns false,forallreturns true
5. zip_with requires arrays of equal length
Performance Optimization Tips
| Scenario | Optimization Strategy |
|---|---|
| Filtering large arrays | Use filter to shrink the array first, then apply transform or aggregation |
Frequent exists checks | Consider extracting high-frequency check fields into separate columns to avoid scanning the array each time |
| Array aggregation | array_aggregate is more efficient than explode + GROUP BY (no shuffle required) |
| Map operations | Use map_filter to filter first, then process, to reduce unnecessary key-value pair computation |
