Quick Analysis of Local CSV Files
Objective:
Through this document, you will be able to quickly analyze local CSV files using Lakehouse
Analyze CSV Format Files:
Data Preparation:
This article uses the Brazilian e-commerce public dataset as an example. In the local path /User/Downloads/brazil-ecommerce on the client, there are the following CSV files:
-------------/User/Downloads/brazil-ecommerce ------------------
olist_customers_dataset.csv.gz
olist_geolocation_dataset.csv.gz
olist_order_items_dataset.csv.gz
olist_order_payments_dataset.csv.gz
olist_order_reviews_dataset.csv.gz
olist_orders_dataset.csv.gz
olist_products_dataset.csv.gz
olist_sellers_dataset.csv.gz
product_category_name_translation.csv.gz
Using the Lakehouse JDBC client SQLLine (or DBeaver / Datagrip, etc.) to upload data to the Lakehouse User volume space (restriction: single file must be smaller than 5G):
(USER VOLUME is the file storage space that is automatically enabled for the current user in the current workspace, no need to create it in advance)
--Execute in SQLline or any Lakehouse JDBC client to upload a single file: olist_customers_dataset.csv.gz
PUT '/User/Downloads/brazil-ecommerce/olist_customers_dataset.csv.gz' TO USER VOLUME SUBDIRECTORY 'bz_olist_data';
In the client tool, or in the Studio SQL task node, execute the following command to display the files that have been uploaded to the USER VOLUME:
show user volume directory like '%olist%';
Quickly Analyze Data with Lakehouse SQL
For example: Count the number of users in each state and sort in descending order by the number of users
SELECT count(1) as customer_number_by_state,
customer_state,
FROM USER VOLUME (
customer_id STRING,
customer_unique_id STRING,
customer_zip_code_prefix INT,
customer_city STRING,
customer_state STRING
) using csv Options
(
'sep' = ',',
'compression'='gzip', -- Currently supports zstd/gzip/zlib compression formats. Remove this parameter for no compression
'header'='true'
)
FILES ('bz_olist_data/olist_customers_dataset.csv.gz')
group by customer_state
order by customer_number_by_state desc;
其中,Options 中的选项:
- sep:列分隔符,默认为 ” , “。最多支持长度为 1 的字符,例如:
'sep'=','
- compression:配置文件压缩格式。支持的压缩格式有::gzip/zstd/zlib。例如:
'compression'='gzip'
- header:是否解析表头,默认为 false。布尔类型,例如:
'header'='true'
除此之外,还支持以下参数:
- timeZone:配置时区,没有默认值。用于指定文件中时间格式的时区。例如:
'timeZone' = 'Asia/Shanghai'
- escape:用于转义已加引号的值中的引号,默认值为”\“,例如:
'escape'='\'
- lineSep:行分隔符,默认值为"\n"。最多支持长度为 2 的字符,例如:
'lineSep'='$'
- quote:设置用于转义引号值的单个字符。默认值为双引号“"“,例如:
'quote'='"'
Generate an internal table in the lakehouse
CREATE TABLE OLIST_CUSTOMER_TBL
AS
SELECT * FROM USER VOLUME (
customer_id STRING,
customer_unique_id STRING,
customer_zip_code_prefix INT,
customer_city STRING,
customer_state STRING
) using csv Options
(
'sep' = ',',
'compression'='gzip',
'header'='true'
)
FILES ('bz_olist_data/olist_customers_dataset.csv.gz');