Quick Analysis of Local CSV Files

Objective:

Through this document, you will be able to quickly analyze local CSV files using Lakehouse

Analyze CSV Format Files:

Data Preparation:

This article uses the Brazilian e-commerce public dataset as an example. In the local path /User/Downloads/brazil-ecommerce on the client, there are the following CSV files:


-------------/User/Downloads/brazil-ecommerce ------------------ 
olist_customers_dataset.csv.gz 
olist_geolocation_dataset.csv.gz
olist_order_items_dataset.csv.gz 
olist_order_payments_dataset.csv.gz 
olist_order_reviews_dataset.csv.gz 
olist_orders_dataset.csv.gz
olist_products_dataset.csv.gz 
olist_sellers_dataset.csv.gz 
product_category_name_translation.csv.gz

Using the Lakehouse JDBC client SQLLine (or DBeaver / Datagrip, etc.) to upload data to the Lakehouse User volume space (restriction: single file must be smaller than 5G):

(USER VOLUME is the file storage space that is automatically enabled for the current user in the current workspace, no need to create it in advance)


--Execute in SQLline or any Lakehouse JDBC client to upload a single file: olist_customers_dataset.csv.gz

PUT '/User/Downloads/brazil-ecommerce/olist_customers_dataset.csv.gz' TO USER VOLUME SUBDIRECTORY 'bz_olist_data';

In the client tool, or in the Studio SQL task node, execute the following command to display the files that have been uploaded to the USER VOLUME:


show user volume directory like '%olist%';

Quickly Analyze Data with Lakehouse SQL

For example: Count the number of users in each state and sort in descending order by the number of users


SELECT count(1) as customer_number_by_state,
       customer_state,
FROM USER VOLUME (
    customer_id STRING,
    customer_unique_id STRING,
    customer_zip_code_prefix INT,
    customer_city STRING,
    customer_state STRING
 ) using csv Options
 (
     'sep' = ',',
     'compression'='gzip',  -- Currently supports zstd/gzip/zlib compression formats. Remove this parameter for no compression
     'header'='true'
 )
 FILES ('bz_olist_data/olist_customers_dataset.csv.gz')
group by customer_state
order by customer_number_by_state desc;

其中，Options 中的选项：

sep：列分隔符，默认为 ” , “。最多支持长度为 1 的字符，例如：'sep'=','
compression：配置文件压缩格式。支持的压缩格式有：：gzip/zstd/zlib。例如：'compression'='gzip'
header：是否解析表头，默认为 false。布尔类型，例如：'header'='true'

除此之外，还支持以下参数：

timeZone：配置时区，没有默认值。用于指定文件中时间格式的时区。例如：'timeZone' = 'Asia/Shanghai'
escape：用于转义已加引号的值中的引号，默认值为”\“，例如：'escape'='\'
lineSep：行分隔符，默认值为"\n"。最多支持长度为 2 的字符，例如：'lineSep'='$'
quote：设置用于转义引号值的单个字符。默认值为双引号“"“，例如：'quote'='"'

Generate an internal table in the lakehouse


CREATE TABLE OLIST_CUSTOMER_TBL
AS
SELECT * FROM USER VOLUME (
    customer_id STRING,
    customer_unique_id STRING,
    customer_zip_code_prefix INT,
    customer_city STRING,
    customer_state STRING
 ) using csv Options
 (
     'sep' = ',',
     'compression'='gzip',
     'header'='true'
 )
 FILES ('bz_olist_data/olist_customers_dataset.csv.gz');