Exploring and Analyzing Data in JSON Files on Data Lake Volumes
1. Introduction to Data Lakes and JSON Data
1.1 What is a Data Lake
The data lake is an important component of the Lakehouse, allowing you to store all structured and unstructured data in its original format without needing to define a schema upfront. This fully embodies the advantages of lakehouse unification, extending Lakehouse data management to the unified management of unstructured data, no longer limited to structured data management within the data warehouse. This flexibility makes it an ideal choice for big data analysis, as you can use different analysis methods and tools to process data as needed. A data lake typically consists of the following parts:
- Storage Layer: Such as object storage (S3, OSS, COS, etc.)
- Metadata Management: Used for organizing and discovering data
- Processing Engine: Used for analyzing and transforming data
1.2 Volumes in the Data Lake
In modern data platforms like the Lakehouse, a Volume is an abstraction that represents a specific location in an external storage system (e.g., a path in object storage). Volumes allow the data platform to seamlessly access data stored in external systems without actually copying or moving that data.
Key features of Volumes:
- Direct connection to external storage systems
- Allows in-place querying without ETL
- Supports multiple file formats (JSON, CSV, Parquet, etc.)
- Can integrate with tables, providing SQL access capabilities
1.3 JSON Data Format
JSON (JavaScript Object Notation) is a lightweight data interchange format with the following features:
- Semi-Structured: Flexible key-value pair structure
- Nesting Capability: Supports complex hierarchical structures and arrays
- Broad Support: Almost all programming languages provide JSON parsing support
- Human Readable: Easier to understand compared to binary formats
In the big data domain, JSON is widely used for storing event data, API responses, log files, etc. Its flexibility makes it ideal for storing data with varying shapes.
2. Case Study: GitHub Event Data
2.1 Dataset Overview
GitHub event data is a publicly available dataset recording the time series of all public activities on the GitHub platform. In this case study, we analyze GitHub event data stored in the gh_archive Volume.
Data Source Details:
- Volume Name:
gh_archive - File Path:
2025-05-14-0.json.gz(representing events from the 00:00-01:00 period on May 14, 2025) - File Format: Compressed JSON file (using gzip compression)
- Data Volume: A single file is approximately 85.6 MB (compressed), containing nearly 200,000 event records
Data Structure: Each record contains the following main fields:
id: Unique event identifiertype: Event type (e.g., PushEvent, PullRequestEvent, etc.)actor: Information about the user who performed the actionrepo: Information about the related code repositoryorg: Information about the related organization (if applicable)payload: Detailed event information (varies by event type)created_at: Event creation timepublic: Whether the event is public
3. Data Lake Exploration and Analysis via SQL
3.1 Directly Querying JSON Files
The Lakehouse data lake platform allows direct SQL queries on files stored on Volumes, without needing to load them into tables first:
Query Result:
Features:
- Zero ETL: No need to extract, transform, and load data beforehand
- Flexible Access: Direct access to nested fields
- Ad-Hoc Queries: Support for immediate data exploration
3.2 Handling Nested and Complex JSON Structures
JSON data often contains nested objects and arrays. SQL provides direct access to these complex structures:
Query Result:
Features:
- Path Navigation: Use dot notation to access nested objects
- Array Indexing: Use square brackets to access array elements
- Array Functions: Use functions like
SIZE()to handle arrays - Conditional Analysis: Use CASE statements for classification and pattern recognition
3.3 Using JSON Paths to Access Deeply Nested Data
Path expressions can be used to deeply access multi-level nested structures in JSON documents:
Query Result:
3.4 Complex Conditional Filtering and Regular Expressions
Use regular expressions and complex condition combinations to filter JSON data:
Query Result:
3.5 Advanced Aggregation and Conditional Statistics
Use conditional aggregation to simultaneously analyze multiple metrics:
Query Result:
3.6 Window Function Analysis
Use window functions for more complex ranking and grouping analysis:
Query Result:
3.7 Text Analysis and Time Processing
Use string functions and time functions to analyze time patterns in JSON data:
Query Result:
3.8 Analyzing PR Merge Patterns
Analyze Pull Request merge trends:
Query Result:
3.9 Loading File Data into Tables for Further Analysis
For data that needs repeated analysis, creating tables can improve query performance:
Advantages:
- Improved Performance: Tables can use indexes and caching to speed up queries
- Data Transformation: Transformations and column renaming can be applied during loading
- Persistent Access: Create persistent views to avoid repeated parsing of raw data
4. Key Technical Features of Data Lake Analysis
4.1 Schema Flexibility
A core advantage of data lake analysis is schema flexibility:
- Schema-on-Read: Data structure only needs to be defined at query time
- Partial Field Access: Only query needed fields, ignoring others
- Evolution Adaptation: As data structures change, queries can easily adapt
For example, in our GitHub event analysis, we can focus only on specific event types or specific fields without processing the entire data structure.
4.2 Computation Pushdown and Filter Optimization
The Lakehouse data lake technology supports computation pushdown, pushing filter and transformation operations to the data source:
4.3 Unified Data Access and Format Selection
Apply SQL analysis to different types of data files in the same Volume:
4.4 Distributed Processing Capabilities
Data lake analysis engines are typically based on distributed computing frameworks, capable of processing large-scale datasets:
- Parallel Processing: Automatically decomposes queries into parallel execution tasks
- Memory Management: Optimizes memory usage, processing data exceeding single-machine memory capacity
- Fault Tolerance: Handles node failures and recovery
In GitHub event analysis, even though individual files are relatively small, the same queries can scale to analyze TB-level historical event data.
5. Data Analysis Process and Methodology
Through the GitHub event data case study, we can summarize a methodology for analyzing JSON data in a data lake:
5.1 Exploratory Data Analysis Process
-
Initial Preview: Get data samples to understand the overall structure
-
Overview Statistics: Understand data distribution and main dimensions
-
Deep Analysis: Refined analysis targeting specific domains
-
Correlation Analysis: Cross-analysis connecting multiple dimensions
5.2 Data Insight Methods
In the GitHub event analysis, we discovered several key insights:
- Activity Distribution: Understanding the proportion of different event types, discovering that Push events dominate (66%)
- Automation Trends: Through user activity analysis, discovering that automated bot accounts (such as github-actions[bot]) contribute a large amount of activity
- Organizational Ecosystem: Identifying the most active organizations, understanding the composition of the GitHub ecosystem
- Development Behavior: Through commit message analysis, understanding developers' work patterns and focus areas
- Branch Usage Patterns: main and master are still the most common target branch names
- Bot Activity: dependabot, renovate, and github-actions are the most active automation bots
These insights come directly from SQL queries, demonstrating the powerful capability of SQL as a data analysis tool.
6. Best Practices and Optimization Tips
6.1 JSON Data Analysis Optimization
-
Selective Field Reading: Only select the fields needed for analysis
-
Predicate Pushdown: Apply filter conditions as early as possible in the query
-
Predicate Pushdown: Filter conditions (
WHERE type = 'PushEvent') are pushed down to the storage layer, leveraging columnar statistics (such as max/min/bloom filters) to skip data blocks that clearly don't match. -
Column Pruning: Only read relevant columns (
actor.loginandtype), skipping parsing of unrelated fields. -
Filter Optimization: After the storage layer loads data in blocks, further filter out non-matching records in memory, ultimately retaining only valid data for computation.
-
Properly Handle NULL Values: Use
IS NULLandIS NOT NULLto check for missing fields
6.2 Data Lake Query Best Practices
- Avoid Full Table Scans: Use filter conditions whenever possible
- Use Aggregation Wisely: When data volume is large, prioritize distributed aggregation
- Selective Projection: Only select necessary columns to reduce I/O and memory usage
- Appropriate Data Transformation: For frequently queried data, consider converting to more efficient formats
- Leverage Data Partitioning: Logical partitioning based on file names or paths
- Pay Attention to Error Handling: Handle JSON parsing errors and type conversion issues
7. Summary
Key features of data lake exploration and analysis through SQL:
- No ETL Needed: Directly query raw data, reducing data preparation time
- Flexibility: Adapts to semi-structured data without predefined schemas
- Ease of Use: Uses familiar SQL syntax, lowering the learning curve
- Scalability: Handles data from GB to PB levels
- Unified Access: The same interface handles multiple data formats
- Exploration-Friendly: Supports iterative data exploration processes
- Performance Optimization: Provides various optimization techniques and strategies
SQL analysis on the data lake combines the ease of use of traditional SQL with the flexibility and scalability of big data processing, making it a powerful tool for modern data analysis. Through the GitHub event data analysis case study in this article, we demonstrated how to use SQL to directly query and analyze JSON files stored on data lake Volumes, quickly obtaining valuable data insights.
