Exploring and Analyzing Data in JSON Files on Data Lake Volumes

1. Introduction to Data Lakes and JSON Data

1.1 What is a Data Lake

The data lake is an important component of the Lakehouse, allowing you to store all structured and unstructured data in its original format without needing to define a schema upfront. This fully embodies the advantages of lakehouse unification, extending Lakehouse data management to the unified management of unstructured data, no longer limited to structured data management within the data warehouse. This flexibility makes it an ideal choice for big data analysis, as you can use different analysis methods and tools to process data as needed. A data lake typically consists of the following parts:

Storage Layer: Such as object storage (S3, OSS, COS, etc.)
Metadata Management: Used for organizing and discovering data
Processing Engine: Used for analyzing and transforming data

1.2 Volumes in the Data Lake

In modern data platforms like the Lakehouse, a Volume is an abstraction that represents a specific location in an external storage system (e.g., a path in object storage). Volumes allow the data platform to seamlessly access data stored in external systems without actually copying or moving that data.

Key features of Volumes:

Direct connection to external storage systems
Allows in-place querying without ETL
Supports multiple file formats (JSON, CSV, Parquet, etc.)
Can integrate with tables, providing SQL access capabilities

1.3 JSON Data Format

JSON (JavaScript Object Notation) is a lightweight data interchange format with the following features:

Semi-Structured: Flexible key-value pair structure
Nesting Capability: Supports complex hierarchical structures and arrays
Broad Support: Almost all programming languages provide JSON parsing support
Human Readable: Easier to understand compared to binary formats

In the big data domain, JSON is widely used for storing event data, API responses, log files, etc. Its flexibility makes it ideal for storing data with varying shapes.

2. Case Study: GitHub Event Data

2.1 Dataset Overview

GitHub event data is a publicly available dataset recording the time series of all public activities on the GitHub platform. In this case study, we analyze GitHub event data stored in the gh_archive Volume.

Data Source Details:

Volume Name: gh_archive
File Path: 2025-05-14-0.json.gz (representing events from the 00:00-01:00 period on May 14, 2025)
File Format: Compressed JSON file (using gzip compression)
Data Volume: A single file is approximately 85.6 MB (compressed), containing nearly 200,000 event records

Data Structure: Each record contains the following main fields:

id: Unique event identifier
type: Event type (e.g., PushEvent, PullRequestEvent, etc.)
actor: Information about the user who performed the action
repo: Information about the related code repository
org: Information about the related organization (if applicable)
payload: Detailed event information (varies by event type)
created_at: Event creation time
public: Whether the event is public

3. Data Lake Exploration and Analysis via SQL

3.1 Directly Querying JSON Files

The Lakehouse data lake platform allows direct SQL queries on files stored on Volumes, without needing to load them into tables first:

-- Analyze event type distribution SELECT type, COUNT(*) as count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY type ORDER BY count DESC

Query Result:

type count ---------------------------- ------ PushEvent 131341 CreateEvent 21700 PullRequestEvent 12537 IssueCommentEvent 7807 WatchEvent 7196 DeleteEvent 4832 PullRequestReviewEvent 3571 IssuesEvent 3569 PullRequestReviewCommentEvent 2362 ForkEvent 1537 ReleaseEvent 1261 ...

Features:

Zero ETL: No need to extract, transform, and load data beforehand
Flexible Access: Direct access to nested fields
Ad-Hoc Queries: Support for immediate data exploration

3.2 Handling Nested and Complex JSON Structures

JSON data often contains nested objects and arrays. SQL provides direct access to these complex structures:

-- Analyze file operation type distribution in PushEvent commits SELECT CASE WHEN LOWER(payload.commits[0].message) LIKE '%add%' THEN 'Add' WHEN LOWER(payload.commits[0].message) LIKE '%fix%' THEN 'Fix' WHEN LOWER(payload.commits[0].message) LIKE '%update%' THEN 'Update' WHEN LOWER(payload.commits[0].message) LIKE '%remove%' OR LOWER(payload.commits[0].message) LIKE '%delete%' THEN 'Remove' WHEN LOWER(payload.commits[0].message) LIKE '%refactor%' THEN 'Refactor' WHEN LOWER(payload.commits[0].message) LIKE '%deploy%' THEN 'Deploy' WHEN LOWER(payload.commits[0].message) LIKE '%test%' THEN 'Test' WHEN LOWER(payload.commits[0].message) LIKE '%doc%' THEN 'Documentation' ELSE 'Other' END as commit_type, COUNT(*) as count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PushEvent' AND payload.commits IS NOT NULL AND SIZE(payload.commits) > 0 GROUP BY commit_type ORDER BY count DESC

Query Result:

commit_type count percentage ------------ ------ ---------- Other 46901 35.75 Update 42211 32.17 Deploy 17513 13.35 Add 11775 8.97 Fix 5027 3.83 Test 3389 2.58 Remove 3174 2.42 Documentation 637 0.49 Refactor 575 0.44

Features:

Path Navigation: Use dot notation to access nested objects
Array Indexing: Use square brackets to access array elements
Array Functions: Use functions like SIZE() to handle arrays
Conditional Analysis: Use CASE statements for classification and pattern recognition

3.3 Using JSON Paths to Access Deeply Nested Data

Path expressions can be used to deeply access multi-level nested structures in JSON documents:

-- Analyze branch information in PR events SELECT payload.pull_request.base.ref as base_branch, payload.pull_request.head.ref as head_branch, COUNT(*) as count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PullRequestEvent' AND payload.action = 'opened' AND payload.pull_request.base.ref IS NOT NULL AND payload.pull_request.head.ref IS NOT NULL GROUP BY base_branch, head_branch ORDER BY count DESC LIMIT 15

Query Result:

base_branch head_branch count ------------ -------------------------------------------------------- ----- main main 508 master master 499 main dependabot/bundler/kamal-2.6.0 93 main develop 72 main dev 63 main github-actions/upgrade-dev-deps-main 35 dev dev 27 main github-actions/upgrade-main 26 main prBranch 25 develop develop 20 main test 20 ...

3.4 Complex Conditional Filtering and Regular Expressions

Use regular expressions and complex condition combinations to filter JSON data:

-- Find PRs created by bot accounts and analyze target repositories SELECT REGEXP_EXTRACT(actor.login, '(.*?)\\[bot\\]') as bot_name, repo.name as target_repo, COUNT(*) as pr_count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PullRequestEvent' AND payload.action = 'opened' AND actor.login LIKE '%[bot]' GROUP BY bot_name, target_repo HAVING pr_count > 5 ORDER BY pr_count DESC LIMIT 15

Query Result:

bot_name target_repo pr_count -------------- ------------------------------------------------ -------- dependabot GolfredoPerezFernandez/nodo-blockchain-blockscout 12 renovate gAmUssA/flink-java-flights 12 github-actions hofferkristof/laravel-lang 12 dependabot j4v3l/tapo-exporter 11 dependabot Sudhanshu-Ambastha/GPT-3-webapp-in-reactjs 10 dependabot jamespurnama1/new-portfolio 10 github-actions zzllbj/lang 9 ...

3.5 Advanced Aggregation and Conditional Statistics

Use conditional aggregation to simultaneously analyze multiple metrics:

-- Analyze repository activity: Calculate the number of different event types per repository SELECT repo.name, COUNT(DISTINCT actor.id) as unique_contributors, SUM(CASE WHEN type = 'PushEvent' THEN 1 ELSE 0 END) as push_count, SUM(CASE WHEN type = 'PullRequestEvent' THEN 1 ELSE 0 END) as pr_count, SUM(CASE WHEN type = 'IssueEvent' THEN 1 ELSE 0 END) as issue_count, SUM(CASE WHEN type = 'WatchEvent' THEN 1 ELSE 0 END) as watch_count, SUM(CASE WHEN type = 'ForkEvent' THEN 1 ELSE 0 END) as fork_count, COUNT(*) as total_events FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY repo.name HAVING total_events > 100 ORDER BY total_events DESC LIMIT 10

Query Result:

repo.name unique_contributors push_count pr_count issue_count watch_count fork_count total_events -------------------------- ------------------- ---------- -------- ----------- ----------- ---------- ------------ samgrover/mb-archive 1 1289 0 0 0 0 1289 chrisxero/bitdepth-microblog 1 1087 0 0 0 0 1087 iniadittt/iniadittt 1 877 0 0 0 0 877 0xios/news-momentum-1 1 730 0 0 0 0 730 JamyJones/Pastebin 1 636 0 0 0 0 636 SoliSpirit/proxy-list 1 586 0 0 0 0 586 ...

3.6 Window Function Analysis

Use window functions for more complex ranking and grouping analysis:

-- Window function analysis: Calculate the most active users within each organization SELECT org_login, username, event_count, rank_in_org FROM ( SELECT org.login as org_login, actor.login as username, COUNT(*) as event_count, RANK() OVER (PARTITION BY org.login ORDER BY COUNT(*) DESC) as rank_in_org FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE org.login IS NOT NULL GROUP BY org_login, username ) t WHERE rank_in_org <= 3 AND org_login IN ( SELECT org.login FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE org.login IS NOT NULL GROUP BY org.login ORDER BY COUNT(*) DESC LIMIT 5 ) ORDER BY org_login, rank_in_org

Query Result:

org_login username event_count rank_in_org ------------------------- ---------------------------------- ----------- ----------- blueprint-house ChineeWetto 559 1 curseforge-mirror github-actions[bot] 525 1 flyteorg github-actions[bot] 444 1 flyteorg flyte-bot 1 2 microsoft wingetbot 67 1 microsoft microsoft-github-policy-service[bot] 57 2 microsoft jstarks 29 3 static-web-apps-testing-org swa-runner-app[bot] 2178 1 static-web-apps-testing-org mkarmark 164 2 static-web-apps-testing-org github-actions[bot] 28 3

3.7 Text Analysis and Time Processing

Use string functions and time functions to analyze time patterns in JSON data:

-- Verify the relationship between event type and hourly activity SELECT SUBSTR(created_at, 12, 2) as hour_of_day, type, COUNT(*) as event_count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY hour_of_day, type ORDER BY hour_of_day, event_count DESC

Query Result:

hour_of_day type event_count ----------- --------------------------- ----------- 00 PushEvent 131341 00 CreateEvent 21700 00 PullRequestEvent 12537 00 IssueCommentEvent 7807 00 WatchEvent 7196 ...

3.8 Analyzing PR Merge Patterns

Analyze Pull Request merge trends:

-- Analyze PR merge trends: View PRs from creation to merging SELECT SUBSTRING(payload.pull_request.merged_at, 1, 10) as merge_date, COUNT(*) as merged_prs FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PullRequestEvent' AND payload.action = 'closed' AND payload.pull_request.merged = true AND payload.pull_request.merged_at IS NOT NULL GROUP BY merge_date ORDER BY merge_date

Query Result:

merge_date merged_prs ---------- ---------- 2025-05-13 3 2025-05-14 4448

3.9 Loading File Data into Tables for Further Analysis

For data that needs repeated analysis, creating tables can improve query performance:

CREATE TABLE github_events AS SELECT id, type, actor.login as actor_name, actor.id as actor_id, repo.name as repo_name, repo.id as repo_id, org.login as org_name, created_at, public, payload FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz')

Advantages:

Improved Performance: Tables can use indexes and caching to speed up queries
Data Transformation: Transformations and column renaming can be applied during loading
Persistent Access: Create persistent views to avoid repeated parsing of raw data

4. Key Technical Features of Data Lake Analysis

4.1 Schema Flexibility

A core advantage of data lake analysis is schema flexibility:

Schema-on-Read: Data structure only needs to be defined at query time
Partial Field Access: Only query needed fields, ignoring others
Evolution Adaptation: As data structures change, queries can easily adapt

For example, in our GitHub event analysis, we can focus only on specific event types or specific fields without processing the entire data structure.

4.2 Computation Pushdown and Filter Optimization

The Lakehouse data lake technology supports computation pushdown, pushing filter and transformation operations to the data source:

-- Example of an efficient filter query SELECT actor.login, COUNT(*) as event_count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PushEvent' GROUP BY actor.login ORDER BY event_count DESC LIMIT 10

4.3 Unified Data Access and Format Selection

Apply SQL analysis to different types of data files in the same Volume:

-- 1. Query JSON files SELECT COUNT(*) FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz'); -- 2. If the Volume has CSV files, similar syntax can be used for querying -- SELECT * FROM VOLUME volume_name -- USING csv -- OPTIONS('header'='true') -- FILES('data.csv');

4.4 Distributed Processing Capabilities

Data lake analysis engines are typically based on distributed computing frameworks, capable of processing large-scale datasets:

Parallel Processing: Automatically decomposes queries into parallel execution tasks
Memory Management: Optimizes memory usage, processing data exceeding single-machine memory capacity
Fault Tolerance: Handles node failures and recovery

In GitHub event analysis, even though individual files are relatively small, the same queries can scale to analyze TB-level historical event data.

5. Data Analysis Process and Methodology

Through the GitHub event data case study, we can summarize a methodology for analyzing JSON data in a data lake:

5.1 Exploratory Data Analysis Process

Initial Preview: Get data samples to understand the overall structure
SELECT * FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') LIMIT 10
Overview Statistics: Understand data distribution and main dimensions
SELECT type, COUNT(*) as count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY type ORDER BY count DESC
Deep Analysis: Refined analysis targeting specific domains
SELECT actor.login as username, COUNT(*) as event_count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY username ORDER BY event_count DESC LIMIT 15
Correlation Analysis: Cross-analysis connecting multiple dimensions
SELECT org.login as organization, type, COUNT(*) as event_count FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE org.login IS NOT NULL GROUP BY organization, type ORDER BY organization, event_count DESC LIMIT 20

5.2 Data Insight Methods

In the GitHub event analysis, we discovered several key insights:

Activity Distribution: Understanding the proportion of different event types, discovering that Push events dominate (66%)
Automation Trends: Through user activity analysis, discovering that automated bot accounts (such as github-actions[bot]) contribute a large amount of activity
Organizational Ecosystem: Identifying the most active organizations, understanding the composition of the GitHub ecosystem
Development Behavior: Through commit message analysis, understanding developers' work patterns and focus areas
Branch Usage Patterns: main and master are still the most common target branch names
Bot Activity: dependabot, renovate, and github-actions are the most active automation bots

These insights come directly from SQL queries, demonstrating the powerful capability of SQL as a data analysis tool.

6. Best Practices and Optimization Tips

6.1 JSON Data Analysis Optimization

Selective Field Reading: Only select the fields needed for analysis
-- Only read necessary fields instead of all fields SELECT type, actor.login, repo.name FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') LIMIT 10
Predicate Pushdown: Apply filter conditions as early as possible in the query
Predicate Pushdown: Filter conditions (WHERE type = 'PushEvent') are pushed down to the storage layer, leveraging columnar statistics (such as max/min/bloom filters) to skip data blocks that clearly don't match.
Column Pruning: Only read relevant columns (actor.login and type), skipping parsing of unrelated fields.
Filter Optimization: After the storage layer loads data in blocks, further filter out non-matching records in memory, ultimately retaining only valid data for computation.
-- Filter data early to reduce processing volume SELECT COUNT(*) FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') WHERE type = 'PushEvent'
Properly Handle NULL Values: Use IS NULL and IS NOT NULL to check for missing fields
-- Identify and handle missing data SELECT type, COUNT(*) as total_count, SUM(CASE WHEN org.login IS NULL THEN 1 ELSE 0 END) as without_org, SUM(CASE WHEN org.login IS NOT NULL THEN 1 ELSE 0 END) as with_org FROM VOLUME gh_archive USING json OPTIONS('compression'='gzip') FILES('2025-05-14-0.json.gz') GROUP BY type ORDER BY total_count DESC

6.2 Data Lake Query Best Practices

Avoid Full Table Scans: Use filter conditions whenever possible
Use Aggregation Wisely: When data volume is large, prioritize distributed aggregation
Selective Projection: Only select necessary columns to reduce I/O and memory usage
Appropriate Data Transformation: For frequently queried data, consider converting to more efficient formats
Leverage Data Partitioning: Logical partitioning based on file names or paths
Pay Attention to Error Handling: Handle JSON parsing errors and type conversion issues

7. Summary

Key features of data lake exploration and analysis through SQL:

No ETL Needed: Directly query raw data, reducing data preparation time
Flexibility: Adapts to semi-structured data without predefined schemas
Ease of Use: Uses familiar SQL syntax, lowering the learning curve
Scalability: Handles data from GB to PB levels
Unified Access: The same interface handles multiple data formats
Exploration-Friendly: Supports iterative data exploration processes
Performance Optimization: Provides various optimization techniques and strategies

SQL analysis on the data lake combines the ease of use of traditional SQL with the flexibility and scalability of big data processing, making it a powerful tool for modern data analysis. Through the GitHub event data analysis case study in this article, we demonstrated how to use SQL to directly query and analyze JSON files stored on data lake Volumes, quickly obtaining valuable data insights.

References

JSON Data Type

JSON Functions External Volume