Lakehouse Missing Value Filling Guide
Overview
Data missingness is the most common problem in data cleaning. Singdata Lakehouse provides multiple functions for handling NULL values, including conditional replacement, default value filling, and forward/backward filling. This guide categorizes usage by business scenario to help you quickly master efficient missing value handling methods.
Quick Navigation
- Basic NULL Replacement -- Use COALESCE to replace null values
- Two-Value Conditional Replacement -- Use IFNULL / NULLIF for simple scenarios
- Forward/Backward Filling -- Use LAG / LEAD to fill missing values in time series
- Group Mean Filling -- Use window functions to fill statistical values by group
- NULL Value Judgment -- Correctly handle logical checks on NULL
SQL Commands Covered
| Command/Function | Purpose | Applicable Scenario |
|---|---|---|
COALESCE(col1, col2, ...) | Return the first non-NULL value | Multi-priority default value filling |
IFNULL(col, value) | Return specified value when NULL | Simple default value filling |
NULLIF(col, value) | Return NULL when equal to specified value | Convert specific values to NULL |
LAG(col) / LEAD(col) | Access previous/next row data | Time series forward/backward filling |
AVG() OVER (PARTITION BY) | Group average | Group statistical value filling |
IS NULL / IS NOT NULL | NULL value check | Conditional filtering |
Prerequisites
The following examples use a simulated user profile table user_profiles containing some missing data:
Basic NULL Replacement
Use the COALESCE function to replace NULL values by priority, supporting multiple fallback values.
Result:
| user_id | user_name | city |
|---|---|---|
| 1 | Alice | Shanghai |
| 2 | Bob | Beijing |
| 3 | Carol | Unknown |
| 4 | David | Unknown |
| 5 | Eve | Shanghai |
Multi-Priority Filling
COALESCE can be chained to try values in order of priority:
Result:
| user_id | user_name | display_location |
|---|---|---|
| 1 | Alice | Shanghai |
| 2 | Bob | Beijing |
| 3 | Carol | Carol |
| 4 | David | David |
| 5 | Eve | Shanghai |
Two-Value Conditional Replacement
When only switching between NULL and one default value, IFNULL is more concise.
Result:
| user_id | user_name | age |
|---|---|---|
| 1 | Alice | 28 |
| 2 | Bob | 0 |
| 3 | Carol | 35 |
| 4 | David | 0 |
| 5 | Eve | 22 |
Convert Specific Values to NULL
Use NULLIF to convert meaningless placeholder values (e.g., 0, empty string) to NULL:
Forward/Backward Filling
In time series data, the previous or next row's value is commonly used to fill current missing values.
Result:
| user_id | user_name | last_login | prev_login | filled_login |
|---|---|---|---|---|
| 1 | Alice | 2024-06-01 | NULL | 2024-06-01 |
| 2 | Bob | 2024-05-15 | 2024-06-01 | 2024-05-15 |
| 3 | Carol | 2024-06-02 | 2024-05-15 | 2024-06-02 |
| 4 | David | 2024-04-20 | 2024-06-02 | 2024-04-20 |
| 5 | Eve | NULL | 2024-04-20 | 2024-04-20 |
Group Mean Filling
For numeric missing values, the mean or median of the same group is commonly used for filling.
Result:
| user_id | user_name | age | city | city_avg_age | filled_age |
|---|---|---|---|---|---|
| 1 | Alice | 28 | Shanghai | 25 | 28 |
| 2 | Bob | NULL | Beijing | NULL | NULL |
| 3 | Carol | 35 | NULL | NULL | NULL |
| 4 | David | NULL | NULL | NULL | NULL |
| 5 | Eve | 22 | Shanghai | 25 | 22 |
NULL Value Judgment
Correctly handle logical checks on NULL values to avoid common pitfalls.
Result:
| user_id | user_name | age_group |
|---|---|---|
| 1 | Alice | Adult |
| 2 | Bob | Unknown |
| 3 | Carol | Adult |
| 4 | David | Unknown |
| 5 | Eve | Adult |
Common Pitfall
Clean Up Test Data
After completing missing value handling verification, it is recommended to clean up test tables:
Important Notes
- COALESCE vs IFNULL:
COALESCEis standard SQL and supports multiple parameters;IFNULLsupports only two parameters and is a shorthand form ofCOALESCE. - NULL vs Empty String:
NULLrepresents a missing value, while''(empty string) is a valid value. Distinguish between them during handling. - Aggregate Functions and NULL:
COUNT(col)does not count NULL values;SUM()andAVG()ignore NULL values. - NULL Value Display: In Lakehouse, numeric type NULL displays as
nan, time type NULL displays asNaT, but logical checks are unaffected.
