Lakehouse Ranking and Percentile Analysis Guide

Overview

Ranking and percentile analysis are advanced scenarios in data analysis, used to compute the relative position of data within groups, percentile scores, and more. Singdata Lakehouse provides comprehensive window function support, including ranking functions, offset functions, and aggregate window functions. This guide categorizes usage by business scenario to help you quickly master efficient ranking and percentile analysis methods.


SQL Commands Covered

Command/FunctionPurposeApplicable Scenario
ROW_NUMBER()Generate unique row numbersDeduplication, pagination, Top-N
RANK()Ranking (ties skip numbers)Score ranking, competition ranking
DENSE_RANK()Ranking (ties don't skip numbers)Continuous ranking
PERCENT_RANK()Percentile rankingRelative position analysis
LAG() / LEAD()Access previous/next row dataPeriod-over-period, year-over-year
AVG() OVER (... ROWS BETWEEN)Moving averageTrend analysis
PERCENTILE()Quantile calculationMedian, quartiles

Prerequisites

The following examples use a simulated employee performance table performance:

-- Create test table
CREATE TABLE IF NOT EXISTS performance (
    emp_id INT,
    emp_name STRING,
    dept STRING,
    score DOUBLE,
    review_date DATE
);

-- Insert test data
INSERT INTO performance VALUES
(1, 'Alice', 'Engineering', 95, '2024-06-01'),
(2, 'Bob', 'Engineering', 85, '2024-06-01'),
(3, 'Carol', 'Engineering', 95, '2024-06-01'),
(4, 'David', 'Marketing', 88, '2024-06-01'),
(5, 'Eve', 'Marketing', 92, '2024-06-01'),
(6, 'Frank', 'Marketing', 78, '2024-06-01'),
(7, 'Grace', 'HR', 90, '2024-06-01'),
(8, 'Henry', 'HR', 85, '2024-06-01');

Basic Ranking

Use ranking functions to compute employee performance rankings within departments. Lakehouse provides three ranking functions with slightly different behaviors.

-- Comparison of three ranking functions
SELECT 
    emp_name,
    dept,
    score,
    ROW_NUMBER() OVER (PARTITION BY dept ORDER BY score DESC) as row_num,
    RANK() OVER (PARTITION BY dept ORDER BY score DESC) as rank,
    DENSE_RANK() OVER (PARTITION BY dept ORDER BY score DESC) as dense_rank
FROM performance
ORDER BY dept, row_num;

Result:

emp_namedeptscorerow_numrankdense_rank
AliceEngineering95111
CarolEngineering95211
BobEngineering85332
EveMarketing92111
DavidMarketing88222
FrankMarketing78333
GraceHR90111
HenryHR85222

Differences Between Ranking Functions

FunctionTie HandlingNumber Continuity
ROW_NUMBER()Randomly assigns different numbersContinuous
RANK()Same rank for ties, subsequent numbers skipDiscontinuous (1,1,3)
DENSE_RANK()Same rank for ties, subsequent numbers are continuousContinuous (1,1,2)

Percentile Ranking

Use PERCENT_RANK() to compute an employee's relative position within the department (between 0 and 1).

-- Compute percentile ranking
SELECT 
    emp_name,
    dept,
    score,
    ROUND(PERCENT_RANK() OVER (PARTITION BY dept ORDER BY score DESC), 2) as pct_rank
FROM performance
ORDER BY dept, pct_rank DESC;

Result:

emp_namedeptscorepct_rank
AliceEngineering950
CarolEngineering950
BobEngineering850.5
EveMarketing920
DavidMarketing880.5
FrankMarketing781
GraceHR900
HenryHR851

Before/After Row Comparison

Use LAG() and LEAD() to access the previous or next row's data relative to the current row, commonly used for period-over-period change calculations.

-- Compute score differences between employees within departments
SELECT 
    emp_name,
    dept,
    score,
    LAG(score, 1) OVER (PARTITION BY dept ORDER BY score DESC) as prev_score,
    score - LAG(score, 1) OVER (PARTITION BY dept ORDER BY score DESC) as diff
FROM performance
ORDER BY dept, score DESC;

Result:

emp_namedeptscoreprev_scorediff
AliceEngineering95NULLNULL
CarolEngineering95950
BobEngineering8595-10
EveMarketing92NULLNULL
DavidMarketing8892-4
FrankMarketing7888-10
GraceHR90NULLNULL
HenryHR8590-5

Moving Average

Use window frames (ROWS BETWEEN) to compute sliding window statistics, suitable for trend analysis.

-- Compute moving average within department ordered by score (current row and two preceding rows)
SELECT 
    emp_name,
    dept,
    score,
    ROUND(AVG(score) OVER (
        PARTITION BY dept 
        ORDER BY score DESC 
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 1) as moving_avg_3
FROM performance
ORDER BY dept, score DESC;

Result:

emp_namedeptscoremoving_avg_3
AliceEngineering9595
CarolEngineering9595
BobEngineering8591.7
EveMarketing9292
DavidMarketing8890
FrankMarketing7886
GraceHR9090
HenryHR8587.5

Quantile Calculation

Use the PERCENTILE() function to compute data quantiles such as the median (0.5) and quartiles (0.25, 0.75).

-- Compute median and quartiles of scores by department
SELECT 
    dept,
    PERCENTILE(score, 0.5) as median,
    PERCENTILE(score, 0.25) as q1,
    PERCENTILE(score, 0.75) as q3,
    MIN(score) as min_score,
    MAX(score) as max_score
FROM performance
GROUP BY dept
ORDER BY dept;

Result:

deptmedianq1q3min_scoremax_score
Engineering9590958595
HR87.586.2588.758590
Marketing8883907892

Clean Up Test Data

After completing ranking analysis verification, it is recommended to clean up test tables:

-- Drop test table
DROP TABLE IF EXISTS performance;

Important Notes

  1. Window Function Execution Order: Window functions execute after WHERE and GROUP BY; window function results cannot be used directly in WHERE. Use QUALIFY or subqueries to filter.
  2. NULL Value Sorting: ORDER BY places NULL values last (DESC) or first (ASC) by default; use NULLS FIRST/LAST to control.
  3. Performance Optimization: Choose PARTITION BY columns with moderate cardinality to avoid individual partitions becoming too large.
  4. PERCENT_RANK Syntax: PERCENT_RANK() does not accept arguments; write PERCENT_RANK() OVER (...) directly.