HR Employee Lifecycle Data Warehouse Best Practices
Integrate HRIS system data, performance evaluations, and attendance records to build an analytics data warehouse covering the full employee lifecycle from onboarding to offboarding. Outputs include attrition risk predictions, compensation equity analysis, and organizational effectiveness diagnostic reports. Using the Kaggle HR Analytics dataset (1,470 employee records) as the foundation, this guide demonstrates the end-to-end build process: ODS → DWD → DWS → ADS four-layer architecture, covering Column Masking (compensation field masking), Dynamic Tables (automatic monthly metric aggregation), and window functions (LEAD/LAG for promotion analysis).
Overview
The core challenges of an HR data warehouse are that the data is highly sensitive (compensation, performance, attrition reasons) while also needing to support different roles' analytical needs (HR BP sees all data, employees see their own, analysts see masked data). Singdata Lakehouse addresses these challenges through the following combination:
Problem
Solution
Compensation and performance fields are highly sensitive; different roles see different data
System automatically identifies upstream changes and refreshes incrementally
REFRESH DYNAMIC TABLE
Manually trigger one refresh
Used during initial build or debugging
AVG() OVER (PARTITION BY ...)
Calculate average compensation for peers in the same department and level
Used for compensation equity analysis
RANK() OVER (ORDER BY ...)
Calculate employee's compensation rank within department
Used for compensation competitiveness report
LAG()
Calculate relative change in promotion intervals
Identifies employees with stalled promotions
Prerequisites
All examples in this guide run under the best_practice_hr_analytics schema.
CREATE SCHEMA IF NOT EXISTS best_practice_hr_analytics;
The dataset is from Kaggle HR Analytics Case Study (Apache 2.0), with 1,470 records and 35 columns covering employee age, department, level, annual salary, performance rating, attrition label, and other core fields. This guide uses the first 40 rows as the demonstration dataset.
ODS Layer: Raw Employee Data
Create Table
CREATE TABLE IF NOT EXISTS best_practice_hr_analytics.doc_ods_employees (
emp_id INT,
age INT,
attrition STRING, -- 'Yes'/'No', whether the employee has left
business_travel STRING,
daily_rate INT,
department STRING,
distance_from_home INT,
education INT, -- 1=Below College ... 5=Doctor
education_field STRING,
employee_number INT,
env_satisfaction INT, -- 1-4
gender STRING,
hourly_rate INT,
job_involvement INT, -- 1-4
job_level INT, -- 1=Entry ... 5=C-Level
job_role STRING,
job_satisfaction INT, -- 1=Low 4=Very High
marital_status STRING,
monthly_income INT, -- Sensitive field, will be bound to Column Masking
monthly_rate INT,
num_companies_worked INT,
overtime STRING, -- 'Yes'/'No'
pct_salary_hike INT, -- Last salary increase percentage (%)
performance_rating INT, -- 3=Excellent 4=Outstanding
relationship_satisfaction INT, -- 1-4
stock_option_level INT,
total_working_years INT,
training_times_last_year INT,
work_life_balance INT, -- 1-4
years_at_company INT,
years_in_current_role INT,
years_since_last_promo INT,
years_with_curr_mgr INT
);
Load Data
This guide loads the first 40 rows from a local CSV file (downloaded from Kaggle) via INSERT. In production, use MySQL batch sync or COPY INTO from Volume to import the full dataset:
-- Step 1: Upload the local CSV file to User Volume via SQL PUT
PUT '/path/to/your/data.csv' TO USER VOLUME FILE 'data.csv';
-- Step 2: COPY INTO the table from User Volume
COPY INTO best_practice_hr_analytics.doc_ods_employees
FROM USER VOLUME
USING csv
OPTIONS('header'='true', 'sep'=',', 'nullValue'='')
FILES ('data.csv');
You can also insert small batches of test data inline (no CSV file needed):
SELECT COUNT(*) AS ods_row_count
FROM best_practice_hr_analytics.doc_ods_employees;
ods_row_count
-------------
40
Column Masking: Compensation Field De-identification
monthly_income is a highly sensitive field. Policy: HR administrators see the original value; other users' queries return -1.
-- Create masking function
CREATE OR REPLACE FUNCTION best_practice_hr_analytics.mask_salary(salary INT)
RETURNS INT
AS CASE
WHEN current_user() IN ('privileged_user') THEN salary -- replace with actual authorized username
ELSE -1
END;
-- Bind to monthly_income column
ALTER TABLE best_practice_hr_analytics.doc_ods_employees
CHANGE COLUMN monthly_income
SET MASK best_practice_hr_analytics.mask_salary;
💡 Tip: Replace 'privileged_user' with the actual username(s) that need access to plaintext data. Column Masking uses the current_user() function to match the current connection's username. All authorized usernames must be explicitly listed in the IN() list.
⚠️ Note: Column Masking is transparently applied to all queries (including Dynamic Tables) — when the DWD layer inherits monthly_income from the ODS layer, non-privileged users receive the masked value of -1.
Verify masking (admin account can see original values):
SELECT emp_id, department, job_role, monthly_income
FROM best_practice_hr_analytics.doc_ods_employees
LIMIT 5;
emp_id | department | job_role | monthly_income
-------+--------------------------+------------------------+---------------
31 | Research & Development | Laboratory Technician | 2496
32 | Research & Development | Healthcare Representative | 6465
33 | Research & Development | Laboratory Technician | 2206
34 | Sales | Sales Representative | 2086
35 | Research & Development | Research Scientist | 2293
DWD Layer: Employee Event Timeline
The DWD layer derives three categories of analytical fields from ODS raw data: tenure bands (tenure_band), promotion stall flags (promotion_stalled_flag), and retention risk scores (retention_risk_score).
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_hr_analytics.doc_dwd_employee_timeline
AS
SELECT
emp_id,
department,
job_role,
job_level,
age,
gender,
marital_status,
education,
education_field,
years_at_company,
years_in_current_role,
years_since_last_promo,
years_with_curr_mgr,
total_working_years,
monthly_income,
pct_salary_hike,
performance_rating,
job_satisfaction,
work_life_balance,
env_satisfaction,
overtime,
business_travel,
attrition,
-- Tenure band for grouped analysis
CASE
WHEN years_at_company <= 1 THEN 'New'
WHEN years_at_company <= 3 THEN 'Junior'
WHEN years_at_company <= 7 THEN 'Mid'
WHEN years_at_company <= 15 THEN 'Senior'
ELSE 'Veteran'
END AS tenure_band,
-- Promotion stall flag: no promotion for 2+ years with performance rating >= 3
CASE
WHEN years_since_last_promo >= 2 AND performance_rating >= 3 THEN 1
ELSE 0
END AS promotion_stalled_flag,
-- Retention risk score (0.0 ~ 1.0, higher is more at risk)
ROUND(
CASE WHEN overtime = 'Yes' THEN 0.25 ELSE 0.0 END
+ CASE WHEN job_satisfaction <= 2 THEN 0.25 ELSE 0.0 END
+ CASE WHEN work_life_balance <= 2 THEN 0.20 ELSE 0.0 END
+ CASE WHEN years_since_last_promo >= 3 THEN 0.15 ELSE 0.0 END
+ CASE WHEN pct_salary_hike <= 12 THEN 0.15 ELSE 0.0 END
, 2) AS retention_risk_score
FROM best_practice_hr_analytics.doc_ods_employees;
⚠️ Note: Do not write REFRESH INTERVAL in the CREATE DYNAMIC TABLE DDL. Refresh scheduling is managed via Lakehouse Studio Tasks (see the "Scheduling Configuration" section), which allows attaching monitoring alerts and data quality rules to the same task.
Configure Studio Refresh Task
💡 Tip: The examples below use cz-cli (the Singdata Lakehouse command-line tool). If cz-cli is not installed, refer to the cz-cli setup guide. Alternatively, you can execute SQL in Lakehouse Studio → Development → SQL Editor and configure scheduled tasks on the Studio → Tasks page.
Task path: best_practices/hr_analytics/refresh_hr_dwd_timeline, triggered daily at 01:00. Additional configurations can be attached to this task: data row count alerts (notification when DWD row count drops sharply), data quality rules (retention_risk_score must not be all zeros), etc.
Manually Trigger First Refresh
REFRESH DYNAMIC TABLE best_practice_hr_analytics.doc_dwd_employee_timeline;
SELECT COUNT(*) AS dwd_count
FROM best_practice_hr_analytics.doc_dwd_employee_timeline;
dwd_count
---------
40
Attrition Risk Analysis by Tenure Band
Tenure bands help HR identify which stage has the highest employee attrition risk:
SELECT
tenure_band,
COUNT(*) AS emp_count,
ROUND(AVG(retention_risk_score), 2) AS avg_risk,
SUM(CASE WHEN attrition = 'Yes' THEN 1 ELSE 0 END) AS attrition_cnt
FROM best_practice_hr_analytics.doc_dwd_employee_timeline
GROUP BY tenure_band
ORDER BY avg_risk DESC;
Result interpretation: Senior (7–15 years) and Mid (3–7 years) employees have the highest average retention risk at 0.39 and 0.36 respectively, suggesting these groups bear greater overtime and promotion stall pressure. Veteran (15+ years) employees have the lowest risk, likely having reached a stable career level.
DWS Layer: Department Headcount Metric Aggregation
The DWS layer aggregates DWD layer data at department granularity, outputting department-level attrition rates, average tenure, compensation benchmarks, and risk indices.
Create Table
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_hr_analytics.doc_dws_dept_headcount_metrics
AS
SELECT
department,
COUNT(*) AS headcount,
SUM(CASE WHEN attrition = 'Yes' THEN 1 ELSE 0 END) AS attrition_count,
ROUND(SUM(CASE WHEN attrition = 'Yes' THEN 1 ELSE 0 END)
* 100.0 / COUNT(*), 2) AS attrition_rate_pct,
ROUND(AVG(years_at_company), 2) AS avg_tenure_years,
ROUND(AVG(monthly_income), 0) AS avg_monthly_income,
ROUND(AVG(performance_rating), 2) AS avg_performance,
ROUND(AVG(job_satisfaction), 2) AS avg_job_satisfaction,
ROUND(AVG(retention_risk_score), 2) AS avg_retention_risk,
SUM(CASE WHEN overtime = 'Yes' THEN 1 ELSE 0 END) AS overtime_headcount,
SUM(promotion_stalled_flag) AS promotion_stalled_count
FROM best_practice_hr_analytics.doc_dwd_employee_timeline
GROUP BY department;
Sales department attrition rate (44.44%) is far higher than R&D (16.13%), but Sales average compensation (6973) is higher than R&D (4650), indicating compensation is not the main driver of Sales attrition. Looking at avg_retention_risk, the R&D team's overall risk average (0.34) is actually higher than Sales (0.29), possibly because overtime and promotion stalls are more pronounced in R&D.
Performance averages are similar across both departments (3.11 vs 3.16), indicating that those leaving are not low performers. HR should prioritize organizational culture and career development pathways.
ADS Layer: Attrition Risk Report and Compensation Analysis
Attrition Risk Report
CREATE DYNAMIC TABLE IF NOT EXISTS best_practice_hr_analytics.doc_ads_attrition_risk_report
AS
SELECT
emp_id,
department,
job_role,
job_level,
tenure_band,
years_at_company,
years_since_last_promo,
overtime,
job_satisfaction,
work_life_balance,
performance_rating,
monthly_income,
pct_salary_hike,
promotion_stalled_flag,
retention_risk_score,
attrition,
-- Risk level
CASE
WHEN retention_risk_score >= 0.6 THEN 'HIGH'
WHEN retention_risk_score >= 0.3 THEN 'MEDIUM'
ELSE 'LOW'
END AS risk_level,
-- Primary attrition signal
CASE
WHEN overtime = 'Yes' AND job_satisfaction <= 2 THEN 'Overwork+LowSatisfaction'
WHEN overtime = 'Yes' THEN 'Overwork'
WHEN job_satisfaction <= 2 THEN 'LowSatisfaction'
WHEN years_since_last_promo >= 3 THEN 'PromotionStalled'
WHEN pct_salary_hike <= 12 THEN 'LowPayRaise'
ELSE 'Normal'
END AS primary_risk_signal
FROM best_practice_hr_analytics.doc_dwd_employee_timeline;
REFRESH DYNAMIC TABLE best_practice_hr_analytics.doc_ads_attrition_risk_report;
SELECT risk_level,
COUNT(*) AS emp_count,
ROUND(AVG(retention_risk_score), 2) AS avg_risk,
SUM(CASE WHEN attrition = 'Yes' THEN 1 ELSE 0 END) AS actual_attrition
FROM best_practice_hr_analytics.doc_ads_attrition_risk_report
GROUP BY risk_level
ORDER BY avg_risk DESC;
SELECT emp_id, department, job_role, years_at_company,
years_since_last_promo, monthly_income,
pct_salary_hike, retention_risk_score, risk_level, primary_risk_signal
FROM best_practice_hr_analytics.doc_ads_attrition_risk_report
WHERE risk_level = 'HIGH'
ORDER BY retention_risk_score DESC;
emp_id | department | job_role | years_at_company | years_since_last_promo | monthly_income | pct_salary_hike | retention_risk_score | risk_level | primary_risk_signal
-------+--------------------------+----------------------+------------------+------------------------+----------------+-----------------+----------------------+------------+--------------------------
17 | Research & Development | Research Scientist | 6 | 0 | 3298 | 12 | 0.85 | HIGH | Overwork+LowSatisfaction
7 | Research & Development | Laboratory Technician| 1 | 0 | 2670 | 20 | 0.70 | HIGH | Overwork+LowSatisfaction
27 | Research & Development | Research Scientist | 10 | 6 | 3919 | 22 | 0.65 | HIGH | Overwork+LowSatisfaction
30 | Sales | Manager | 2 | 2 | 18947 | 12 | 0.60 | HIGH | LowSatisfaction
1 | Sales | Sales Executive | 6 | 0 | 5993 | 11 | 0.60 | HIGH | Overwork
Result interpretation:
emp_id=17 (R&D Research Scientist) has the highest risk (0.85), triggered by both overtime and low job satisfaction, with a salary hike of only 12% — right at the LowPayRaise threshold. Priority interview and compensation adjustment are recommended.
emp_id=27 (R&D Research Scientist, 10-year tenure) has not been promoted in 6 consecutive years (years_since_last_promo=6). Although the salary hike is relatively reasonable (22%), career stagnation is the primary risk signal.
emp_id=30 (Sales Manager, high compensation of 18947) has high compensation but low job satisfaction, indicating non-monetary factors need attention for management retention.
Overtime × Satisfaction Cross Analysis
SELECT
overtime,
job_satisfaction,
COUNT(*) AS emp_count,
ROUND(AVG(retention_risk_score), 2) AS avg_risk,
SUM(CASE WHEN attrition = 'Yes' THEN 1 ELSE 0 END) AS attrition_cnt
FROM best_practice_hr_analytics.doc_dwd_employee_timeline
GROUP BY overtime, job_satisfaction
ORDER BY avg_risk DESC
LIMIT 8;
Result interpretation: The combination of overtime=Yes + job_satisfaction <= 2 has a significantly higher average risk (0.62–0.85) than other combinations and is the highest-priority intervention dimension. Notably, the no-overtime but low-satisfaction group (overtime=No, satisfaction=1) has an average risk of 0.41, close to the overtime + medium-satisfaction group, indicating that the independent effect of job satisfaction cannot be ignored.
Compensation Equity Analysis (Window Functions)
The following queries do not build Dynamic Tables — they run directly on the DWD layer as ad-hoc analytical workloads.
Peer Compensation Equity Analysis
SELECT
emp_id,
department,
job_role,
job_level,
monthly_income,
years_at_company,
pct_salary_hike,
AVG(monthly_income) OVER (PARTITION BY department, job_level) AS dept_level_avg_income,
ROUND(
(monthly_income - AVG(monthly_income) OVER (PARTITION BY department, job_level))
/ AVG(monthly_income) OVER (PARTITION BY department, job_level) * 100
, 1) AS income_vs_peer_pct,
RANK() OVER (PARTITION BY department ORDER BY monthly_income DESC) AS income_rank_in_dept
FROM best_practice_hr_analytics.doc_dwd_employee_timeline
WHERE department = 'Sales'
ORDER BY job_level DESC, monthly_income DESC;
emp_id=34 (Sales Representative, job_level=2) has compensation 58.9% below the peer average with only 1 year of tenure — a low-compensation high-risk employee. Check whether the compensation setting at hiring was appropriate.
emp_id=28 (Sales Executive) has compensation 34.6% above the peer average. Given their 9-year tenure, this is a normal senior premium. However, with a salary hike of only 11%, dissatisfaction may accumulate if this continues.
Promotion Lag Analysis (LAG)
SELECT
emp_id,
department,
job_role,
job_level,
years_at_company,
years_since_last_promo,
performance_rating,
LAG(performance_rating) OVER (
PARTITION BY department ORDER BY years_at_company
) AS prev_emp_perf,
years_since_last_promo - LAG(years_since_last_promo) OVER (
PARTITION BY department ORDER BY years_at_company
) AS promo_lag_delta
FROM best_practice_hr_analytics.doc_dwd_employee_timeline
WHERE department = 'Research & Development'
ORDER BY years_since_last_promo DESC
LIMIT 10;
emp_id | department | job_role | job_level | years_at_company | years_since_last_promo | performance_rating | prev_emp_perf | promo_lag_delta
-------+--------------------------+-----------------------------+-----------+------------------+------------------------+--------------------+---------------+----------------
16 | Research & Development | Manufacturing Director | 3 | 10 | 8 | 3 | 4 | 7
10 | Research & Development | Healthcare Representative | 2 | 7 | 7 | 3 | 3 | 4
27 | Research & Development | Research Scientist | 1 | 10 | 6 | 4 | 3 | -2
29 | Research & Development | Healthcare Representative | 3 | 22 | 5 | 3 | 3 | 1
13 | Research & Development | Research Scientist | 1 | 5 | 4 | 3 | 3 | 4
26 | Research & Development | Manager | 5 | 14 | 4 | 3 | 3 | 2
Result interpretation:
emp_id=16 (Manufacturing Director) has not been promoted for 8 years with a sustained performance rating of 3 (Excellent). promo_lag_delta=7 means their promotion wait time is 7 years longer than the previous employee with similar tenure in the department — a clear promotion anomaly.
emp_id=27 (Research Scientist, 10-year tenure, 6 consecutive years without promotion, performance rating 4 = Outstanding) is the most typical case of a "high-performer blocked from promotion." Already on the HIGH risk list, HR BP intervention for career path evaluation is needed.
Data Warehouse Object Summary
After completing the full build, the objects in the best_practice_hr_analytics schema:
Column Masking is transparently applied to Dynamic Tables: After the DWD layer inherits monthly_income from the ODS layer, non-privileged users will see the masked value (-1) for compensation fields in DWD / DWS / ADS. For compensation equity analysis, you must use an account authorized in the Column Masking policy (i.e., explicitly listed in the masking function's IN() list) to see unmasked original values.
Dynamic Tables do not write REFRESH INTERVAL: All Dynamic Table DDLs omit the REFRESH INTERVAL parameter. Refresh scheduling is unified via Lakehouse Studio Tasks. This allows attaching data quality rules (e.g., check that retention_risk_score is not all zeros) and alert notifications to the same task node.
Retention risk score is a heuristic model: retention_risk_score is based on stacked rules for overtime, satisfaction, promotion stalls, and salary hike. It is suitable for quickly screening high-risk groups but cannot replace professional statistical prediction models. In production, consider running machine learning models in ZettaPark Python Tasks and writing results back to the ADS layer.
Dynamic Table incremental refresh depends on ODS change tracking: The first REFRESH performs a full snapshot; subsequent incremental refreshes only process rows in the ODS layer that were added or changed since the last refresh. If the ODS layer uses INSERT OVERWRITE or a full rewrite, the Dynamic Table will fall back to full refresh, significantly increasing computation overhead.
Window semantics of promotion lag LAG analysis: LAG with PARTITION BY department ORDER BY years_at_company orders by tenure within the same department, comparing promotion stall differences between employees with similar tenure — not a historical time-series comparison of the same employee. For comparing a single employee across different time points, an employee historical snapshot table (with a time dimension) is needed.
⚠️ Pending manual verification: Column Masking currently uses current_user() to match usernames for authorization. All usernames with access to plaintext data must be individually added to the masking function's whitelist. If your Lakehouse version supports role-based dynamic evaluation (e.g., HAS_ROLE('role_name')), you can use roles instead of username lists for more flexible maintenance. Please contact Singdata technical support to confirm whether this function is supported in your current version.