Lakehouse Missing Value Filling Guide

Overview

Data missingness is the most common problem in data cleaning. Singdata Lakehouse provides multiple functions for handling NULL values, including conditional replacement, default value filling, and forward/backward filling. This guide categorizes usage by business scenario to help you quickly master efficient missing value handling methods.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
COALESCE(col1, col2, ...)Return the first non-NULL valueMulti-priority default value filling
IFNULL(col, value)Return specified value when NULLSimple default value filling
NULLIF(col, value)Return NULL when equal to specified valueConvert specific values to NULL
LAG(col) / LEAD(col)Access previous/next row dataTime series forward/backward filling
AVG() OVER (PARTITION BY)Group averageGroup statistical value filling
IS NULL / IS NOT NULLNULL value checkConditional filtering

Prerequisites

The following examples use a simulated user profile table user_profiles containing some missing data:

-- Create test table
CREATE TABLE IF NOT EXISTS user_profiles (
    user_id INT,
    user_name STRING,
    age INT,
    city STRING,
    last_login DATE
);

-- Insert test data (including NULL values)
INSERT INTO user_profiles VALUES
(1, 'Alice', 28, 'Shanghai', '2024-06-01'),
(2, 'Bob', NULL, 'Beijing', '2024-05-15'),
(3, 'Carol', 35, NULL, '2024-06-02'),
(4, 'David', NULL, NULL, '2024-04-20'),
(5, 'Eve', 22, 'Shanghai', NULL);

Basic NULL Replacement

Use the COALESCE function to replace NULL values by priority, supporting multiple fallback values.

-- Use COALESCE to fill missing city information
SELECT 
    user_id,
    user_name,
    COALESCE(city, 'Unknown') as city
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_namecity
1AliceShanghai
2BobBeijing
3CarolUnknown
4DavidUnknown
5EveShanghai

Multi-Priority Filling

COALESCE can be chained to try values in order of priority:

-- Multi-priority filling: prefer city, then user_name, then 'N/A'
SELECT 
    user_id,
    user_name,
    COALESCE(city, user_name, 'N/A') as display_location
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_namedisplay_location
1AliceShanghai
2BobBeijing
3CarolCarol
4DavidDavid
5EveShanghai

Two-Value Conditional Replacement

When only switching between NULL and one default value, IFNULL is more concise.

-- Use IFNULL to fill missing age
SELECT 
    user_id,
    user_name,
    IFNULL(age, 0) as age
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_nameage
1Alice28
2Bob0
3Carol35
4David0
5Eve22

Convert Specific Values to NULL

Use NULLIF to convert meaningless placeholder values (e.g., 0, empty string) to NULL:

-- Convert records with age 0 to NULL (indicating unknown)
SELECT 
    user_id,
    user_name,
    NULLIF(age, 0) as age
FROM user_profiles
ORDER BY user_id;

Forward/Backward Filling

In time series data, the previous or next row's value is commonly used to fill current missing values.

-- Use the previous login time to fill missing values
SELECT 
    user_id,
    user_name,
    last_login,
    LAG(last_login) OVER (ORDER BY user_id) as prev_login,
    COALESCE(last_login, LAG(last_login) OVER (ORDER BY user_id)) as filled_login
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_namelast_loginprev_loginfilled_login
1Alice2024-06-01NULL2024-06-01
2Bob2024-05-152024-06-012024-05-15
3Carol2024-06-022024-05-152024-06-02
4David2024-04-202024-06-022024-04-20
5EveNULL2024-04-202024-04-20

Group Mean Filling

For numeric missing values, the mean or median of the same group is commonly used for filling.

-- Fill missing age with the average age of users in the same city (grouped by city)
SELECT 
    user_id,
    user_name,
    age,
    city,
    ROUND(AVG(age) OVER (PARTITION BY city), 1) as city_avg_age,
    COALESCE(age, AVG(age) OVER (PARTITION BY city)) as filled_age
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_nameagecitycity_avg_agefilled_age
1Alice28Shanghai2528
2BobNULLBeijingNULLNULL
3Carol35NULLNULLNULL
4DavidNULLNULLNULLNULL
5Eve22Shanghai2522

NULL Value Judgment

Correctly handle logical checks on NULL values to avoid common pitfalls.

-- Correct way to check NULL
SELECT 
    user_id,
    user_name,
    CASE 
        WHEN age IS NULL THEN 'Unknown'
        WHEN age < 18 THEN 'Minor'
        WHEN age < 60 THEN 'Adult'
        ELSE 'Senior'
    END as age_group
FROM user_profiles
ORDER BY user_id;

Result:

user_iduser_nameage_group
1AliceAdult
2BobUnknown
3CarolAdult
4DavidUnknown
5EveAdult

Common Pitfall

-- Wrong: NULL compared with any value returns NULL (not TRUE)
SELECT * FROM user_profiles WHERE age = NULL;  -- Returns 0 rows

-- Correct: Use IS NULL
SELECT * FROM user_profiles WHERE age IS NULL;

Clean Up Test Data

After completing missing value handling verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS user_profiles;

Important Notes

  1. COALESCE vs IFNULL: COALESCE is standard SQL and supports multiple parameters; IFNULL supports only two parameters and is a shorthand form of COALESCE.
  2. NULL vs Empty String: NULL represents a missing value, while '' (empty string) is a valid value. Distinguish between them during handling.
  3. Aggregate Functions and NULL: COUNT(col) does not count NULL values; SUM() and AVG() ignore NULL values.
  4. NULL Value Display: In Lakehouse, numeric type NULL displays as nan, time type NULL displays as NaT, but logical checks are unaffected.