Retention and Cohort Analysis Guide


Quick Reference

Analysis TypeBusiness QuestionCore TechniqueComplexity
Basic retentionHow many users return on day N after registration?DATEDIFF + conditional aggregation⭐⭐
Cohort matrixHow do retention trends compare across registration months?DATE_TRUNC + pivot⭐⭐⭐
Consecutive retentionWhat is a user's longest consecutive active streak?ROW_NUMBER + difference grouping⭐⭐⭐
Churn analysisWhich users have been inactive for N days?MAX + DATEDIFF + threshold classification⭐⭐
Retention curveHow does retention decay over time?Weekly/monthly aggregation + line chart data⭐⭐

Sample Data

All examples in this guide are based on the following user behavior data:

-- User registration table CREATE TABLE users ( user_id BIGINT, register_date VARCHAR -- Use DATE type in production ); -- User activity log table CREATE TABLE activity ( user_id BIGINT, activity_date VARCHAR );

The sample data simulates 3 cohorts (5 users registered in January, February, and March respectively), totaling 15 users with activity records spanning 3 months.


1. Basic Retention: Day-N Retention Rate

Retention Definition

  • Day 0 retention: Users active on their registration day (should be 100%).
  • Day N retention: Number of users still active on day N after registration / total users in the cohort.

SQL Implementation

WITH retention_base AS ( SELECT u.user_id, u.register_date, a.activity_date, DATEDIFF(a.activity_date, u.register_date) AS days_since_register FROM users u JOIN activity a ON u.user_id = a.user_id ), cohort_size AS ( SELECT DATE_TRUNC('MONTH', CAST(register_date AS DATE)) AS cohort_month, COUNT(DISTINCT user_id) AS cohort_users FROM users GROUP BY 1 ), retention_by_day AS ( SELECT DATE_TRUNC('MONTH', CAST(rb.register_date AS DATE)) AS cohort_month, rb.days_since_register, COUNT(DISTINCT rb.user_id) AS retained_users FROM retention_base rb GROUP BY 1, 2 ) SELECT r.cohort_month, r.days_since_register, r.retained_users, c.cohort_users, ROUND(r.retained_users * 100.0 / c.cohort_users, 1) AS retention_pct FROM retention_by_day r JOIN cohort_size c ON r.cohort_month = c.cohort_month WHERE r.days_since_register IN (0, 1, 3, 7, 14, 30) ORDER BY r.cohort_month, r.days_since_register;

Sample output:

cohort_monthdays_since_registerretained_userscohort_usersretention_pct
2026-01-01055100.0
2026-01-01155100.0
2026-01-0132540.0
2026-01-0173560.0
2026-02-01055100.0
2026-02-0174580.0

Key Notes

  • DATEDIFF(end, start) returns the number of days between two dates.
  • DATE_TRUNC('MONTH', date) truncates a date to the first day of the month, used for cohort grouping.
  • Choose retention day checkpoints based on your business cycle: gaming typically uses day 1/3/7; SaaS typically uses day 7/14/30.

2. Cohort Retention Matrix (Pivot)

Use Case

The retention matrix is a common BI report format: rows represent cohorts, columns represent retention days, and cells contain retention rates.

SQL Implementation

-- Build a pivot on top of the basic retention query using CASE SELECT r.cohort_month, c.cohort_users, MAX(CASE WHEN r.days_since_register = 0 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_0, MAX(CASE WHEN r.days_since_register = 1 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_1, MAX(CASE WHEN r.days_since_register = 3 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_3, MAX(CASE WHEN r.days_since_register = 7 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_7, MAX(CASE WHEN r.days_since_register = 14 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_14, MAX(CASE WHEN r.days_since_register = 30 THEN ROUND(r.retained_users * 100.0 / c.cohort_users, 1) END) AS day_30 FROM retention_by_day r JOIN cohort_size c ON r.cohort_month = c.cohort_month GROUP BY r.cohort_month, c.cohort_users ORDER BY r.cohort_month;

Sample output:

cohort_monthcohort_usersday_0day_1day_3day_7day_14day_30
2026-01-0151001004060NULLNULL
2026-02-015100100208020NULL
2026-03-015100100206020NULL

Dynamic Columns

To support additional retention checkpoints (e.g., day 60 or day 90), simply append more MAX(CASE WHEN ...) columns to the SELECT clause.


3. Consecutive Retention: Longest Active Streak

Core Logic

Use the window function ROW_NUMBER() to identify consecutive ranges:

  • If dates are consecutive, the value of activity_date - row_number stays constant.
  • Rows sharing the same difference value belong to the same consecutive streak.

SQL Implementation

WITH ranked AS ( SELECT user_id, CAST(activity_date AS DATE) AS activity_date, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date) AS rn FROM activity ), grouped AS ( SELECT user_id, activity_date, rn, -- Core trick: compute days since a fixed baseline, then subtract rn DATEDIFF(activity_date, CAST('2000-01-01' AS DATE)) - rn AS grp FROM ranked ), streaks AS ( SELECT user_id, grp, MIN(activity_date) AS streak_start, MAX(activity_date) AS streak_end, COUNT(*) AS streak_length FROM grouped GROUP BY user_id, grp ), max_streaks AS ( SELECT user_id, MAX(streak_length) AS max_consecutive_days FROM streaks GROUP BY user_id ) SELECT ms.user_id, ms.max_consecutive_days, s.streak_start, s.streak_end FROM max_streaks ms JOIN streaks s ON ms.user_id = s.user_id AND ms.max_consecutive_days = s.streak_length ORDER BY ms.max_consecutive_days DESC, ms.user_id;

Sample output:

user_idmax_consecutive_daysstreak_startstreak_end
122026-01-052026-01-06
222026-01-102026-01-11
............

Extension: Users with a Streak of N or More Days

-- Count users with a streak of at least 3 consecutive days SELECT COUNT(DISTINCT user_id) AS users_with_3day_streak FROM streaks WHERE streak_length >= 3;


4. Churn Analysis: Identifying Inactive Users

User Status Classification

StatusDefinitionRecommended Action
activeActive within the last 7 daysMaintain experience
warningInactive for 7–13 daysSend a reminder
at_riskInactive for 14–29 daysOffer an incentive
churnedInactive for 30+ daysRun a win-back campaign

SQL Implementation

-- Analysis reference date: 2026-03-25 WITH user_last_active AS ( SELECT u.user_id, u.register_date, MAX(a.activity_date) AS last_active_date, DATEDIFF(CAST('2026-03-25' AS DATE), CAST(MAX(a.activity_date) AS DATE)) AS days_since_active FROM users u LEFT JOIN activity a ON u.user_id = a.user_id GROUP BY u.user_id, u.register_date ) SELECT user_id, register_date, last_active_date, days_since_active, CASE WHEN days_since_active >= 30 THEN 'churned' WHEN days_since_active >= 14 THEN 'at_risk' WHEN days_since_active >= 7 THEN 'warning' ELSE 'active' END AS user_status FROM user_last_active ORDER BY days_since_active DESC;

Sample output:

user_idregister_datelast_active_datedays_since_activeuser_status
52026-01-252026-01-2658churned
42026-01-202026-01-2757churned
82026-02-102026-02-2528at_risk
122026-03-052026-03-205active

Churn Rate Summary

SELECT user_status, COUNT(*) AS user_count, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct FROM user_last_active GROUP BY user_status ORDER BY user_count DESC;


5. Retention Curve: Weekly Aggregated Retention Trend

Use Case

Generate line chart data showing how retention decays over time.

SQL Implementation

WITH retention_base AS ( SELECT u.user_id, u.register_date, a.activity_date, DATEDIFF(a.activity_date, u.register_date) AS days_since_register FROM users u JOIN activity a ON u.user_id = a.user_id ), cohort_size AS ( SELECT DATE_TRUNC('MONTH', CAST(register_date AS DATE)) AS cohort_month, COUNT(DISTINCT user_id) AS cohort_users FROM users GROUP BY 1 ), retention_by_week AS ( SELECT DATE_TRUNC('MONTH', CAST(rb.register_date AS DATE)) AS cohort_month, FLOOR(rb.days_since_register / 7.0) AS week_number, COUNT(DISTINCT rb.user_id) AS retained_users FROM retention_base rb GROUP BY 1, 2 ) SELECT r.cohort_month, r.week_number, r.retained_users, c.cohort_users, ROUND(r.retained_users * 100.0 / c.cohort_users, 1) AS retention_pct FROM retention_by_week r JOIN cohort_size c ON r.cohort_month = c.cohort_month ORDER BY r.cohort_month, r.week_number;

Sample output:

cohort_monthweek_numberretained_userscohort_usersretention_pct
2026-01-01055100.0
2026-01-0114580.0
2026-01-0122540.0
2026-01-0132540.0
2026-02-01055100.0
2026-02-0114580.0

Visualization Tips

  • X-axis: week_number
  • Y-axis: retention_pct
  • Legend: cohort_month (different colors for different registration months)

6. Advanced: Cohort × Behavioral Attribute Cross-Analysis

Scenario: Retention Differences by Acquisition Channel

SELECT u.channel, DATE_TRUNC('MONTH', CAST(u.register_date AS DATE)) AS cohort_month, COUNT(DISTINCT u.user_id) AS cohort_users, COUNT(DISTINCT CASE WHEN DATEDIFF(a.activity_date, u.register_date) = 7 THEN u.user_id END) AS day_7_retained, ROUND( COUNT(DISTINCT CASE WHEN DATEDIFF(a.activity_date, u.register_date) = 7 THEN u.user_id END) * 100.0 / COUNT(DISTINCT u.user_id), 1 ) AS day_7_retention_pct FROM users u LEFT JOIN activity a ON u.user_id = a.user_id GROUP BY u.channel, DATE_TRUNC('MONTH', CAST(u.register_date AS DATE)) ORDER BY cohort_month, day_7_retention_pct DESC;

Scenario: Cohort LTV (Lifetime Value)

SELECT DATE_TRUNC('MONTH', CAST(u.register_date AS DATE)) AS cohort_month, COUNT(DISTINCT u.user_id) AS cohort_users, SUM(o.order_amount) AS total_revenue, ROUND(SUM(o.order_amount) / COUNT(DISTINCT u.user_id), 2) AS avg_ltv FROM users u LEFT JOIN orders o ON u.user_id = o.user_id GROUP BY 1 ORDER BY 1;


Common Issues

1. DATEDIFF Argument Order

-- Correct: DATEDIFF(end, start) DATEDIFF('2026-03-25', '2026-01-01') -- returns 83 -- Wrong: reversed arguments produce a negative number DATEDIFF('2026-01-01', '2026-03-25') -- returns -83

2. DATE_ADD Syntax

-- Correct: DATE_ADD(date, INTERVAL n unit) DATE_ADD(CAST('2026-01-10' AS DATE), INTERVAL 3 DAY) -- Wrong: 3-argument form is not supported DATE_ADD('2026-01-10', 3, 'DAY') -- syntax error

3. Baseline Date for Consecutive Streak Calculation

-- Correct: use a fixed baseline date (e.g., '2000-01-01') DATEDIFF(activity_date, CAST('2000-01-01' AS DATE)) - rn AS grp -- Wrong: using a dynamic baseline causes incorrect grouping DATEDIFF(activity_date, MIN(activity_date)) - rn -- baseline differs per group

4. Division by Zero in Retention Rate

-- Safe: use NULLIF to avoid divide-by-zero errors ROUND(retained_users * 100.0 / NULLIF(cohort_users, 0), 1)

5. Cohort Grouping Granularity

-- Monthly grouping (recommended) DATE_TRUNC('MONTH', CAST(register_date AS DATE)) -- Weekly grouping DATE_TRUNC('WEEK', CAST(register_date AS DATE)) -- Daily grouping (not recommended for large datasets) CAST(register_date AS DATE)


Performance Optimization Tips

ScenarioStrategy
Large-scale retention calculationDeduplicate the activity log by user_id first, then JOIN
Cohort matrixMaterialize retention_base as a Dynamic Table
Consecutive activity trackingBuild a Bloom Filter index on (user_id, activity_date) in the activity table
Churn analysisParameterize the reference date to avoid hardcoding

-- Optimization example: deduplicate before calculating WITH distinct_activity AS ( SELECT DISTINCT user_id, CAST(activity_date AS DATE) AS activity_date FROM activity ), retention_base AS ( SELECT u.user_id, u.register_date, a.activity_date, DATEDIFF(a.activity_date, u.register_date) AS days_since_register FROM users u JOIN distinct_activity a ON u.user_id = a.user_id ) -- Remaining logic is unchanged...