Exploring and Analyzing Data in Parquet Files on Data Lake Volumes
Introduction
The data lake is an important component of the Lakehouse, allowing you to store all structured and unstructured data in its original format without needing to define a schema upfront. This fully embodies the advantages of lakehouse unification, extending Lakehouse data management to the unified management of unstructured data, no longer limited to structured data management within the data warehouse. This article demonstrates how to use SQL to explore and analyze Parquet file data stored on Volumes in the Lakehouse environment, using New York City taxi data analysis as a case study to show the complete process and methods of data exploration.
Introduction to Parquet Format
Parquet is a columnar storage format designed specifically for big data processing, with the following key features:
- Columnar Storage: Data is stored by column rather than by row, especially suitable for analytical queries
- Efficient Compression: Compared to traditional row-based storage like CSV, it can save 50-75% of storage space
- High Performance: Supports predicate pushdown, allowing the query engine to read only the necessary columns and rows
- Self-Contained Metadata: Files include schema definitions, supporting self-description
- Broad Compatibility: Widely supported by big data tools such as Hadoop, Spark, Presto, etc.
The Parquet format is particularly suitable for storing and querying large-scale structured data, such as log data, transaction data, and sensor data, making it ideal for data lake scenarios.
Understanding Data Lake Volumes
In the Lakehouse platform, Volume is a key concept:
- Volume is a logical storage unit in the data lake, similar to a tablespace in traditional databases
- Volumes can point to internal storage or external storage (such as OSS, S3, HDFS, etc.)
- Through the Volume mechanism, the data lake can directly query data in external storage systems without importing
- Each Volume can contain multiple files and support directory hierarchies
This article uses yellow_trip_record_data as an example, which is a Volume pointing to external object storage containing New York City yellow taxi trip data.
Sample Data Introduction
The New York City Yellow Taxi Trip Records dataset used in this article is a public transportation dataset:
- Data is stored in a Volume named
yellow_trip_record_data - Data is stored in Parquet format, organized by month
- File naming format is
yellow_tripdata_YYYY-MM.parquet(e.g.,yellow_tripdata_2024-01.parquetrepresents data for January 2024) - Each file contains detailed records of all yellow taxi trips for that month
- Data includes rich information such as pickup/dropoff times, locations, fares, passenger counts, etc.
This data can be used to analyze urban traffic patterns, taxi demand, price trends, and many other aspects.
Table of Contents
- Data Lake Volume Basic Operations
- Initial Data Exploration
- Data Preview
- Basic Data Analysis
- Advanced Analysis and Aggregation
- Time Series Analysis
- Best Practices and Optimization Tips
- Conclusion
1. Data Lake Volume Basic Operations
Before starting analysis, you need to understand and master the basic operations of data lake Volumes, which is the first step in interacting with data.
1.1 Listing Available Volumes
This command allows you to view all available data volumes, including Volume name, creation time, type (internal/external), URL address, and other information. In our example, this command will display all available Volumes including yellow_trip_record_data.
Example execution result:
From the result, we can see this is an external Volume pointing to a path in OSS object storage, with recursive file lookup enabled.
1.2 Listing Files in a Volume
This command displays all files in the root directory of the specified Volume, including file relative paths, full URLs, file sizes, and last modified times. SUBDIRECTORY '/' means viewing files in the root directory; if data has a hierarchical structure, you can specify a subdirectory path.
Example execution result:
This result shows that the Volume contains multiple Parquet files named by month, each approximately 50MB in size, covering data from January 2024 to February 2025.
2. Initial Data Exploration
When working with unfamiliar datasets, you first need to understand the overall picture of the data and establish a preliminary understanding of the data content and structure.
2.1 Listing Files
For our example dataset, first list all available data files:
After executing this command, we see that the Volume contains multiple Parquet files with the naming pattern yellow_tripdata_YYYY-MM.parquet.
2.2 Understanding File Format and Naming Conventions
By observing the file names (e.g., yellow_tripdata_2024-01.parquet), we can understand:
- The file format is Parquet (identified by the .parquet suffix)
- The naming convention is "yellow_tripdata_year-month.parquet"
- Data is organized by month, covering January 2024 to February 2025
- Each file is approximately 50-60MB in size
This naming and organization approach is a common partition strategy in data lakes, segmenting data by time dimension, making it convenient to query data for specific time periods and improving query efficiency.
2.3 Understanding Data Source and Background
The New York City Taxi and Limousine Commission (TLC) collects yellow taxi trip record data containing pickup/dropoff dates/times, locations, trip distances, detailed fares, rate types, payment methods, and other information. This type of public transit data is highly valuable for urban planning, traffic management, and business decision-making.
3. Data Preview
Before diving into analysis, previewing the data structure and content is crucial to understand data quality, format, and possible analysis directions.
3.1 Previewing Parquet File Content
The Lakehouse allows us to directly use SQL queries to preview Parquet file content without first creating tables or importing data:
Key parts of this query syntax:
FROM VOLUME yellow_trip_record_data: Specifies which Volume the data comes fromUSING parquet: Declares the file format as ParquetFILES('yellow_tripdata_2024-01.parquet'): Specifies the specific file to queryLIMIT 10: Only returns the first 10 records for quick preview
Example execution result:
3.2 Analyzing Data Structure
Through the preview, we can understand in detail that the NYC taxi data contains the following key fields:
| Field Name | Data Type | Description |
|---|---|---|
| VendorID | Integer | Provider ID (1=Creative Mobile Technologies, 2=VeriFone) |
| tpep_pickup_datetime | Timestamp | Date and time when the passenger was picked up |
| tpep_dropoff_datetime | Timestamp | Date and time when the passenger was dropped off |
| passenger_count | Integer | Number of passengers in the vehicle |
| trip_distance | Float | Trip distance in miles |
| PULocationID | Integer | Pickup location ID (corresponding to NYC zone code) |
| DOLocationID | Integer | Dropoff location ID (corresponding to NYC zone code) |
| payment_type | Integer | Payment method code (1=Credit Card, 2=Cash, 3=Free, 4=Dispute, 5=Unknown) |
| fare_amount | Float | Base fare for the trip |
| total_amount | Float | Total amount including all charges |
Through this preliminary inspection, we now understand the data structure and field meanings, laying the foundation for subsequent analysis.
4. Basic Data Analysis
Getting a basic overview of the data, including count statistics, averages, and outlier detection.
4.1 Basic Statistical Analysis
4.2 Categorical Analysis
Analyzing the distribution of categorical variables (e.g., payment methods):
4.3 Hotspot Area Analysis
Execution Result:
This query reveals the most popular pickup points:
- PULocationID 132 is the hottest pickup point with 145,240 pickups
- This area has a significantly higher average trip distance (15.49 miles) and the highest average fare ($76.58), suggesting it may be a major airport or transit hub
- Other popular pickup points (such as 161, 237, 236, and 162) have noticeably lower average trip distances and fares than 132, likely being popular downtown locations
4.4 Numeric Distribution Analysis
Execution Result:
This analysis provides important insights into the relationship between trip distance and fare:
- 1-2 miles is the most common trip distance range, with 967,440 trips
- Short distance trips (0-3 miles) account for 73.7% of total trips
- There is a clear positive correlation between trip distance and average fare
- Long trips over 10 miles have an average fare of $83.41, which is 5.5 times that of short trips (0-1 mile)
5. Advanced Analysis and Aggregation
Deeper data analysis, including multi-dimensional aggregation, hotspot analysis, etc.
5.1 Time Dimension Aggregation
5.2 Hotspot Area Analysis
5.3 Multi-Dimensional Comparative Analysis
6. Time Series Analysis
Analyzing data change trends across multiple time periods.
6.1 Multi-Month Data Combined Analysis
Since file name variables cannot be directly referenced for multiple files, the following methods can be used to handle multi-month data:
6.2 Quarterly Data Analysis
7. Best Practices and Optimization Tips
7.1 Data Filtering
When processing large data, reasonable filter conditions can greatly improve query efficiency:
Effective filter conditions can:
- Exclude data outliers, improving analysis result accuracy
- Reduce processed data volume, improving query performance
- Focus on the most relevant data subsets to meet specific analysis needs
7.2 Column Pruning and Predicate Pushdown
Fully leverage the columnar storage and predicate pushdown features of the Parquet format:
This optimization strategy can:
- Reduce I/O volume by reading only the column data needed for the query
- Leverage Parquet's built-in filtering capabilities to skip data blocks in files that don't match conditions
- Significantly improve query performance on large datasets
7.3 Staged Analysis
For large datasets, adopt a staged analysis strategy:
- First perform exploratory analysis on a single file
- Verify analysis logic before expanding to all data
- Use CTEs to organize complex queries, enhancing readability and maintainability
This approach makes complex queries easier to understand, debug, and maintain.
7.4 Data Quality Checks
Perform data quality checks before analysis:
Data quality checks can:
- Discover issues and anomalies in the data
- Guide data cleaning and transformation in subsequent analysis
- Improve the reliability and accuracy of analysis results
Conclusion
By directly analyzing Parquet files on data lake Volumes through SQL, we can:
- Quickly explore and understand data structure
- Perform comprehensive statistical analysis
- Discover temporal and spatial patterns in the data
- Compare data change trends across different time periods
This approach avoids the overhead of data copying and transformation, performing analysis directly on the data lake, improving efficiency. For big data environments, this analysis approach is both flexible and efficient, particularly suitable for data scientists and analysts conducting data exploration and preliminary analysis.
Through the NYC taxi data case study in this article, we demonstrated the complete workflow from basic data exploration to complex time series analysis, providing a practical guide for Parquet file analysis in data lake environments. Compared to the traditional import-then-analyze approach, this direct query method greatly improves the flexibility and response speed of data analysis, especially suitable for initial data exploration and hypothesis validation.
In real-world projects, these analyses can provide data support for various scenarios such as taxi companies optimizing vehicle dispatching, urban management departments improving traffic planning, and financial analysts understanding consumption trends. SQL analysis on data lakes is becoming a key bridge connecting raw data to business insights.
