Hive → Lakehouse Migration in Practice: E-commerce User Behavior Data Warehouse

If your data warehouse runs on Hive, migrating to Singdata Lakehouse requires less effort than you might expect. The core Hive SQL syntax — SELECT, JOIN, GROUP BY, window functions, conditional aggregation — runs directly in Lakehouse without modification. Changes are concentrated in 4 areas: storage format declarations, data loading methods, dynamic partition SET statements, and SerDe configuration.

This article validates this with a real project: a complete migration of a Hive 4.0-based e-commerce user behavior data warehouse (ODS → DWD → DWS → ADS four-layer architecture) to Singdata Lakehouse, passing all 10 automated validation checks.

Full code on GitHub: hive2lakehouse-ecommerce-events


Original Project

The data source is the Kaggle E-commerce User Behavior Dataset, containing user behavior logs from a cosmetics e-commerce website in October–November 2019. Fields include event_time, event_type (view/cart/purchase), product_id, category_code, brand, price, user_id, user_session, totaling approximately 6.5 million rows.

The Hive implementation runs in a Docker container (apache/hive:4.0.1) with a four-layer architecture:

LayerTableDescription
ODSods_events_rawRaw events, partitioned by date, ORC format
DWDdwd_events_cleanCleaned events, bucketed into 8 buckets by user_id, ORC format
DWSdws_user_behaviorDaily user behavior summary (view/cart/purchase counts and amounts)
ADSads_funnel_dailyDaily funnel conversion rates (view→cart→purchase)

The migrated code is in the 03_lakehouse/ directory and can be compared file-by-file with 01_hive/.

Conclusion First

Your SQL query logic does not need a single change. The 5 changes are all in CREATE TABLE statements and data loading methods — remove Hive-specific storage configuration, replace LOAD DATA syntax, and the business logic is fully preserved.

ChangeEffortDescription
Remove STORED AS ORC / TBLPROPERTIESVery lowDelete directly, no logic changes
LOAD DATACOPY INTO FROM VOLUMELowDifferent syntax structure, but same logic
Remove dynamic partition SET statementsVery lowDelete 3 SET lines directly
Remove SerDe configurationVery lowDelete ROW FORMAT SERDE block directly

SELECT / JOIN / GROUP BY / window functions / conditional aggregation — the core operations of a data warehouse — have identical syntax and require no changes.


Technology Stack Comparison

Hive 4.0Lakehouse
Storage formatORC (requires explicit STORED AS ORC)Native Parquet (no declaration needed)
CSV parsingROW FORMAT SERDE 'OpenCSVSerde'COPY INTO ... USING CSV OPTIONS (...)
Bucketing accelerationCLUSTERED BY (col) INTO N BUCKETSSame syntax, directly compatible
Dynamic partitioningRequires 3 SET statements to enableEnabled by default, no SET needed
Data loadingLOAD DATA LOCAL INPATH + staging tableCOPY INTO FROM VOLUME
Runtime environmentDocker container (beeline client)cz-cli / Python SDK


Project Background

The data architecture has four layers, each corresponding to a schema:

  • ODS (ecommerce_ods): Raw data, partitioned by date, retaining original fields
  • DWD (ecommerce_dwd): Cleansing layer, event_time converted to TIMESTAMP, category_code split into three levels, dirty data filtered
  • DWS (ecommerce_dws): Daily user behavior summary — view/cart/purchase counts and spending per user per day
  • ADS (ecommerce_ads): Funnel analysis — daily view→cart→purchase conversion rates

Sample data (20 rows) validation results:

MetricValue
ODS row count19 (1 row filtered for column-shift dirty data)
DWD row count19
DWS user count6
View user count6
Cart user count4
Purchase user count3
View→cart conversion rate66.67%
Cart→purchase conversion rate75%

Migration Steps

Step 1: Remove Storage Format Declarations

Hive requires every table to declare a storage format. Lakehouse uses native Parquet — just delete the declarations.

-- Hive CREATE TABLE dwd_events_clean (...) PARTITIONED BY (dt STRING) CLUSTERED BY (user_id) INTO 8 BUCKETS STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY"); -- Lakehouse: just delete the last three lines CREATE TABLE dwd_events_clean (...) PARTITIONED BY (dt STRING);

Step 2: Replace Data Loading Method

Hive uses LOAD DATA to load files into an EXTERNAL TABLE. Lakehouse uses COPY INTO FROM VOLUME.

-- Hive: two steps (staging table + INSERT) LOAD DATA LOCAL INPATH '/tmp/events.csv' OVERWRITE INTO TABLE ods_events_staging; INSERT OVERWRITE TABLE ods_events_raw PARTITION (dt) SELECT ..., SUBSTR(event_time, 1, 10) AS dt FROM ods_events_staging WHERE event_type IN ('view', 'cart', 'purchase'); -- Lakehouse: also two steps, but different syntax -- COPY INTO does not support computed columns (e.g., extracting dt), -- so COPY INTO staging first, then INSERT INTO partitioned table COPY INTO ecommerce_ods.ods_events_staging FROM VOLUME ecommerce_ods.ecommerce_vol USING CSV OPTIONS ('header' = 'true', 'nullValue' = '') FILES ('raw/events_sample.csv') ON_ERROR = CONTINUE; INSERT OVERWRITE TABLE ecommerce_ods.ods_events_raw PARTITION (dt) SELECT ..., SUBSTR(event_time, 1, 10) AS dt FROM ecommerce_ods.ods_events_staging WHERE event_type IN ('view', 'cart', 'purchase') AND user_id > 100000;

Step 3: Remove Dynamic Partition SET Statements

Hive dynamic partitioning is disabled by default and requires 3 SET statements to enable. Lakehouse enables it by default — just delete the SET statements.

-- Hive (required, otherwise errors) SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.enforce.bucketing=true; INSERT OVERWRITE TABLE dwd_events_clean PARTITION (dt) SELECT ..., dt FROM ods_events_raw; -- Lakehouse: write directly, delete all SET statements INSERT OVERWRITE TABLE ecommerce_dwd.dwd_events_clean PARTITION (dt) SELECT ..., dt FROM ecommerce_ods.ods_events_raw;

Step 4: Remove SerDe Configuration

Hive requires SerDe configuration to parse CSV. Lakehouse specifies this in COPY INTO's OPTIONS — no format configuration is needed at table creation time.

-- Hive: SerDe must be configured at table creation CREATE EXTERNAL TABLE ods_events_staging (...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = "\"" ) STORED AS TEXTFILE TBLPROPERTIES ("skip.header.line.count"="1"); -- Lakehouse: table creation only declares columns; format is specified in COPY INTO CREATE TABLE ecommerce_ods.ods_events_staging ( event_time STRING, event_type STRING, ... ); -- Specify format at load time COPY INTO ecommerce_ods.ods_events_staging FROM VOLUME ecommerce_ods.ecommerce_vol USING CSV OPTIONS ('header' = 'true', 'nullValue' = '') FILES ('raw/events_sample.csv');


Fully Compatible Parts

The following Hive SQL runs directly in Lakehouse without any modification:

ETL transformation logic (ODS → DWD):

-- Identical on both sides, reuse directly INSERT OVERWRITE TABLE dwd_events_clean PARTITION (dt) SELECT CAST(REGEXP_REPLACE(event_time, ' UTC$', '') AS TIMESTAMP) AS event_ts, event_type, product_id, SPLIT(category_code, '\\.')[0] AS category_l1, CASE WHEN SIZE(SPLIT(category_code, '\\.')) > 1 THEN SPLIT(category_code, '\\.')[1] END AS category_l2, brand, price, user_id, user_session, dt FROM ods_events_raw WHERE price > 0 OR event_type != 'purchase';

Aggregation logic (DWD → DWS/ADS):

-- Identical on both sides, reuse directly SELECT COUNT(DISTINCT CASE WHEN event_type = 'view' THEN user_id END) AS view_users, COUNT(DISTINCT CASE WHEN event_type = 'cart' THEN user_id END) AS cart_users, COUNT(DISTINCT CASE WHEN event_type = 'purchase' THEN user_id END) AS purchase_users, ROUND( COUNT(DISTINCT CASE WHEN event_type = 'cart' THEN user_id END) * 1.0 / NULLIF(COUNT(DISTINCT CASE WHEN event_type = 'view' THEN user_id END), 0), 4 ) AS view_to_cart_rate, dt FROM dwd_events_clean GROUP BY dt;

Window function analysis:

-- Identical on both sides, reuse directly SELECT user_id, purchase_amt, RANK() OVER (ORDER BY purchase_amt DESC) AS spending_rank FROM dws_user_behavior WHERE dt = '2019-10-01';


Notes

1. OpenCSVSerde Column Shift

Symptom: When CSV has consecutive empty fields (e.g., brand is empty, appearing as ,,), OpenCSVSerde parsing causes subsequent columns to shift left — the value of price gets read as user_id.

Example:

2019-10-01,view,28719074,...,apparel.shoes.keds,,35.79,541312140,...

brand is empty → price=35.79 is read as user_id, producing a dirty data row with user_id=35.

Handling: Filter abnormal user_ids when INSERT INTO ODS:

WHERE CAST(user_id AS BIGINT) > 100000

Lakehouse impact: COPY INTO uses ON_ERROR=CONTINUE to skip malformed rows; the same filter is applied during INSERT.

2. Hive 4.0 Bucketed Table GROUP BY Returns Empty

Symptom: Under Hive 4.0 + Tez engine, GROUP BY queries on CLUSTERED BY bucketed ORC tables return empty results, but COUNT(*) returns the correct row count via metadata stats.

Root cause: CombineHiveInputFormat (default) has a bug in the execution plan for bucketed ORC tables.

Temporary fix:

SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

Lakehouse impact: This issue does not exist. Lakehouse also supports CLUSTERED BY ... INTO N BUCKETS bucketing, but GROUP BY works correctly without this bug.

3. COPY INTO Does Not Support Column Reference Syntax

Symptom: Lakehouse COPY INTO does not support $1, $2 column references:

-- Error: Syntax error at or near '$' COPY INTO t FROM (SELECT $1, $2 FROM VOLUME ...)

Correct approach: Use FROM VOLUME ... USING CSV OPTIONS (...) FILES (...) syntax, mapping columns directly by order. When computed columns are needed (e.g., extracting dt), first COPY INTO an unpartitioned staging table, then INSERT INTO the target table.

4. Dynamic Partition Default Behavior Differences

BehaviorHiveLakehouse
Dynamic partition switchDisabled by default, requires SET hive.exec.dynamic.partition=trueEnabled by default
Strict modeStrict by default, requires SET ... mode=nonstrictNo such restriction
Bucketed writesRequires SET hive.enforce.bucketing=trueSupported by default, no SET needed

End-to-End Validation

03_lakehouse/e2e.py runs 10 automated checks on the migration results:

CheckExpectedResult
ODS row count19
DWD row count19
DWS user count6
Funnel view_users6
Funnel cart_users4
Funnel purchase_users3
View→cart conversion rate0.6667
Cart→purchase conversion rate0.75
Top spending user ID526595547
Top spending amount1422.0

Actual run result: 10/10 passed.


Full Compatibility Reference

CategoryHive SyntaxLakehouseCompatibility
Partitioned tablePARTITIONED BY (dt STRING)Same✅ Fully compatible
Dynamic partition writeINSERT OVERWRITE ... PARTITION (dt)Same✅ Fully compatible
REGEXP_REPLACEREGEXP_REPLACE(col, pattern, replace)Same✅ Fully compatible
SPLITSPLIT(col, '\\.')Same✅ Fully compatible
SIZESIZE(SPLIT(...))Same✅ Fully compatible
SUBSTRSUBSTR(col, 1, 10)Same✅ Fully compatible
CASTCAST(col AS TIMESTAMP)Same✅ Fully compatible
NULLIFNULLIF(expr, 0)Same✅ Fully compatible
Conditional aggregationSUM(CASE WHEN ... END)Same✅ Fully compatible
COUNT DISTINCTCOUNT(DISTINCT CASE WHEN ...)Same✅ Fully compatible
Window functionsRANK() OVER (ORDER BY ...)Same✅ Fully compatible
STORED AS ORCSTORED AS ORCNot needed✅ Delete it
CLUSTERED BYCLUSTERED BY (col) INTO N BUCKETSSame syntax✅ Fully compatible
OpenCSVSerdeROW FORMAT SERDE 'OpenCSVSerde'Not needed✅ Delete, use COPY INTO instead
LOAD DATALOAD DATA LOCAL INPATHNot supported⚠️ Use COPY INTO FROM VOLUME
Dynamic partition SETSET hive.exec.dynamic.partition=trueNot needed✅ Delete it

Migration Conclusion

The effort for migrating a Hive data warehouse to Lakehouse is mainly in DDL and loading statements, not in business logic. In this project, the ETL transformation SQL (REGEXP_REPLACE, SPLIT, conditional aggregation) and analytical queries (window functions, funnel calculations) are all reused directly. Changes are concentrated in 4 known difference points, all of which are mechanical replacements.

SQL-Level Benefits

  • The GROUP BY bug on bucketed ORC tables disappears naturally
  • No need to maintain SerDe configuration and TBLPROPERTIES
  • Dynamic partitioning is enabled by default, no SET statements needed
  • COPY INTO is cleaner than LOAD DATA + staging table

Deployment Model Benefits

Migrating from a self-managed Docker Hive cluster to Singdata Lakehouse SaaS brings benefits beyond SQL simplification:

Hive (self-managed Docker)Lakehouse (SaaS)
Cluster operationsMust maintain Docker containers, JVM parameters, YARN queuesNo operations needed, fully managed
Compute resourcesFixed resources, wasted during idle periodsElastic scaling, billed per query
Storage formatORC files require periodic compactionNative Parquet, automatically managed
Tuning costMust tune Hive parameters, Tez configuration, bucketing strategyFocus on SQL logic, no low-level tuning
Version upgradesManual Hive and Hadoop dependency upgradesPlatform auto-upgrades

After migration, data engineers can shift their focus from "keeping Hive running" to "making data valuable."


References