Quick Analysis of Local CSV Files

Objective:

Through this document, you will be able to quickly analyze local CSV files using Lakehouse

Analyze CSV Format Files:

Data Preparation:

This article uses the Brazilian e-commerce public dataset as an example. In the local path /User/Downloads/brazil-ecommerce on the client, there are the following CSV files:

-------------/User/Downloads/brazil-ecommerce ------------------ 
olist_customers_dataset.csv.gz 
olist_geolocation_dataset.csv.gz
olist_order_items_dataset.csv.gz 
olist_order_payments_dataset.csv.gz 
olist_order_reviews_dataset.csv.gz 
olist_orders_dataset.csv.gz
olist_products_dataset.csv.gz 
olist_sellers_dataset.csv.gz 
product_category_name_translation.csv.gz

Using the Lakehouse JDBC client SQLLine (or DBeaver / Datagrip, etc.) to upload data to the Lakehouse User volume space (restriction: single file must be smaller than 5G):

(USER VOLUME is the file storage space that is automatically enabled for the current user in the current workspace, no need to create it in advance)

--Execute in SQLline or any Lakehouse JDBC client to upload a single file: olist_customers_dataset.csv.gz

PUT '/User/Downloads/brazil-ecommerce/olist_customers_dataset.csv.gz' TO USER VOLUME SUBDIRECTORY 'bz_olist_data';

In the client tool, or in the Studio SQL task node, execute the following command to display the files that have been uploaded to the USER VOLUME:

show user volume directory like '%olist%';

Quickly Analyze Data with Lakehouse SQL

For example: Count the number of users in each state and sort in descending order by the number of users

SELECT count(1) as customer_number_by_state,
       customer_state,
FROM USER VOLUME (
    customer_id STRING,
    customer_unique_id STRING,
    customer_zip_code_prefix INT,
    customer_city STRING,
    customer_state STRING
 ) using csv Options
 (
     'sep' = ',',
     'compression'='gzip',  -- Currently supports zstd/gzip/zlib compression formats. Remove this parameter for no compression
     'header'='true'
 )
 FILES ('bz_olist_data/olist_customers_dataset.csv.gz')
group by customer_state
order by customer_number_by_state desc;

其中,Options 中的选项:

  • sep:列分隔符,默认为 ” , “。最多支持长度为 1 的字符,例如:'sep'=','
  • compression:配置文件压缩格式。支持的压缩格式有::gzip/zstd/zlib。例如:'compression'='gzip'
  • header:是否解析表头,默认为 false。布尔类型,例如:'header'='true'

除此之外,还支持以下参数:

  • timeZone:配置时区,没有默认值。用于指定文件中时间格式的时区。例如:'timeZone' = 'Asia/Shanghai'
  • escape:用于转义已加引号的值中的引号,默认值为”\“,例如:'escape'='\'
  • lineSep:行分隔符,默认值为"\n"。最多支持长度为 2 的字符,例如:'lineSep'='$'
  • quote:设置用于转义引号值的单个字符。默认值为双引号“"“,例如:'quote'='"'

Generate an internal table in the lakehouse

CREATE TABLE OLIST_CUSTOMER_TBL
AS
SELECT * FROM USER VOLUME (
    customer_id STRING,
    customer_unique_id STRING,
    customer_zip_code_prefix INT,
    customer_city STRING,
    customer_state STRING
 ) using csv Options
 (
     'sep' = ',',
     'compression'='gzip',
     'header'='true'
 )
 FILES ('bz_olist_data/olist_customers_dataset.csv.gz');