Data Import into Lakehouse: Batch Loading via COPY Command
This guide will help you import large amounts of data from public URL Parquet files (such as New York City Taxi and For-Hire Vehicle Data) into Singdata Lakehouse using scripts and the Singdata Lakehouse SQLLine command-line tool.
-
download_raw_data.sh: Download Public URL Parquet files to local. This article uses NYC Taxi Data as an example, containing over 450 files and 60GB of data (Parquet format).
-
initialize_database.sh: Create Singdata Lakehouse schema and tables via Lakehouse SQLLine commands.
-
Import taxi and FHV data: Convert Parquet files to CSV files using R, then import data from local CSV files into Singdata Lakehouse tables using the Lakehouse SQLLine COPY command, supporting batch import of multiple files.
Operation Guide
Script download address: https://github.com/yunqiqiliang/nyc-taxi-data-clickzetta
1. Install Singdata SQLLine
2. Install R
From CRAN
Note that R used to be optional for this repo, but is required starting with the 2022 file format change. The scripts use R to convert Parquet files to CSV before loading into Postgres. There are other ways to convert from Parquet to CSV that wouldn't require R, but I found that R's arrow package was faster than some of the other CLI tools I tried
3. Download raw data
./download_raw_data.sh
When done, modify download_raw_data.sh and run again to download data from 202212: wget -i setup_files/raw_data_urls_new.txt -P data/ -w 2
4. Initialize database and set up schema
./initialize_database.sh
5. Import taxi and FHV data
./import_yellow_taxi_trip_data.sh
./import_green_taxi_trip_data.sh
./import_fhv_taxi_trip_data.sh
./import_fhvhv_trip_data.sh
Note that the full import process might take several hours or possibly even over a day, depending on computing power.
Schema
tripstable contains all yellow and green taxi trips. Each trip has acab_type_id, which references thecab_typestable and refers to one ofyelloworgreenfhv_tripstable contains all for-hire vehicle trip records, including ride-hailing apps Uber, Lyft, Via, and Junofhv_basesmapsfhv_tripsto base names and "doing business as" labels, which include ride-hailing app namesnyct2010table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areastaxi_zonestable contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDscentral_park_weather_observationshas summary weather data by date
