viv-analytics/portfolio__ml_causal_inference

Causal Inference for 401(k) Policy Evaluation

End-to-end causal analysis of a US retirement savings programme, covering selection bias diagnosis, five causal estimation methods (PSM, IPW, AIPW, Double ML, Causal Forest), heterogeneous treatment effect estimation, sensitivity analysis, and an IV/2SLS natural-experiment cross-check. Estimated causal effect: ~$9,800, roughly half the $19,500 naive estimate.


Key Results

| Method | Estimand | ATE / LATE | 95% CI | Note |
|---|---|---|---|---|
| Naive OLS | ATE | $19,559 | ($18,200, $20,900) | ❌ selection bias |
| OLS + controls | ATE | $12,000 | ($10,800, $13,200) | ❌ residual confounding |
| PSM / IPW | ATE | $9,200–$9,500 | | ✓ matched comparable groups |
| AIPW (Doubly Robust) | ATE | $9,800 | ($8,600, $11,000) | ✓ robust to one misspecification |
| Double ML | ATE | $9,800 | ($8,500, $11,100) | ✓ ML-powered residualization |
| Causal Forest | ATE | $9,900 | ($8,700, $11,100) | ✓ individual-level effects |
| IV/2SLS | LATE | $7,327 | ($2,773, $11,881) | ✓ natural experiment, compliers only |

ATE = Average Treatment Effect (all employees). LATE = Local Average Treatment Effect (compliers only — employees who participate iff their employer offers the plan).

Method comparison — all causal estimates converge on ~$9,800


Table of Contents

  1. The Business Problem
  2. Dataset
  3. Project Structure
  4. Pipeline
  5. Why Naive Regression Fails
  6. Methods
  7. Key Findings
  8. Business Recommendations
  9. Design Decisions
  10. How to Run
  11. References

The Business Problem

A 401(k) is a US employer-sponsored retirement savings plan. Employees who enrol can contribute a portion of their salary pre-tax; many employers match contributions up to a limit. The question every HR and benefits team wants to answer:

Does enrolling in the 401(k) plan actually cause employees to accumulate more wealth — or do people who were already good savers simply choose to enrol?

This matters because the answer determines whether it's worth spending money on enrolment campaigns, employer matching, and financial education. If the observed savings gap is mostly selection bias, those dollars are wasted.

The challenge is that employees who enrol are not a random sample. They tend to earn more, be more educated, and be more financially motivated. A naive comparison of their savings measures self-selection, not the plan's causal effect. This project separates the two.


Dataset

Source: Survey of Income and Program Participation (SIPP), US government longitudinal survey. Used as a benchmark dataset in Chernozhukov et al. (2018). Loaded automatically from the doubleml package — no download required.

| Property | Value |
|---|---|
| Rows | 9,915 households |
| Features | 9 (demographics, income, household structure) |
| Outcome | Net total financial assets (net_tfa) |
| Treatment | 401(k) participation (p401, binary) |
| Instrument | Employer eligibility (e401, binary) |

Variables

| Variable | Plain English | Role |
|---|---|---|
| net_tfa | Net financial assets: total savings + investments, excluding home equity and pension | Outcome |
| p401 | Does this household participate in a 401(k) plan? (yes/no) | Treatment (endogenous) |
| e401 | Does the employer offer a 401(k) plan? (yes/no) | Instrument (company policy, not employee choice) |
| inc | Annual family income | Confounder: higher earners enrol more and save more |
| age | Age of household head | Confounder: older workers save more and enrol more |
| educ | Years of education | Confounder: more educated workers earn and plan more |
| fsize | Family size | Confounder: larger families face more spending pressure |
| marr | Married household (yes/no) | Confounder |
| twoearn | Two-earner household (yes/no) | Confounder |
| pira | Has an individual IRA (yes/no) | Confounder: proxy for savings propensity |

Project Structure

ml_causal_inference/
├── pyproject.toml
├── uv.lock
├── notebooks/
│   ├── 01_problem_and_data.ipynb       # EDA · Causal DAG · Selection bias
│   ├── 02_naive_regression.ipynb       # OLS · Omitted variable bias · Sensitivity
│   ├── 03_matching_ipw.ipynb           # PSM · IPW · AIPW · Balance diagnostics
│   ├── 04_double_ml.ipynb              # Manual DML · Cross-fitting · econml LinearDML
│   ├── 05_causal_forests.ipynb         # Causal Forest · CATE · Policy tree · ROI
│   ├── 06_sensitivity_analysis.ipynb   # DoWhy refutations · E-values · Stability
│   └── 07_iv_2sls.ipynb                # IV/2SLS · First stage · LATE vs ATE
├── reports/figures/                    # 16 auto-generated PNGs
├── src/causal_401k/
│   ├── __init__.py
│   ├── constants.py    # Centralised thresholds, seeds, reference estimates
│   ├── data.py         # load_401k(), get_feature_groups()
│   ├── plots.py        # 13 reusable plot functions, Mazzanti style
│   └── utils.py        # balance_stats(), compute_evalue(), first_stage_stats(), DML helpers
└── tests/
    ├── test_data.py    # 20 tests
    ├── test_plots.py   # 35 tests — all 13 plot functions
    └── test_utils.py   # 34 tests — all utility functions

Pipeline

Raw data (fetch_401K)
  │
  ▼
01_problem_and_data   EDA, causal DAG, selection bias visualisation
  │
  ▼
02_naive_regression   Baseline OLS, coefficient sensitivity as controls are added
  │
  ▼
03_matching_ipw       PSM (1:1 NN), IPW, AIPW — overlap check, SMD balance test
  │
  ▼
04_double_ml          Manual DML (Lasso + GBM nuisance), econml LinearDML
  │
  ▼
05_causal_forests     CausalForestDML, CATE distribution, policy tree, ROI calc
  │
  ▼
06_sensitivity        DoWhy refutations, coefficient stability, E-values
  │
  ▼
07_iv_2sls            IV/2SLS via linearmodels, first-stage diagnostics, Wald check

Each notebook is self-contained and can be run independently. Notebooks 06–07 reference the ~$9,800 DML estimate from NB 04 as a constant.


Why Naive Regression Fails

Employees who enrol in 401(k) plans are systematically different from those who don't — they earn more, are more educated, and are more financially motivated. They would have saved more regardless of the plan.

Income and education are higher for participants before the plan even starts

Causal structure: income and savings propensity confound both participation and wealth

Adding controls helps but doesn't eliminate the bias. The OLS coefficient shrinks monotonically from $27K to $12K as demographics, income, and household controls are added — but it never reaches the true causal estimate. The gap between full-OLS and DML is the signature of non-linear confounding that linear regression can't absorb.

OLS coefficient shrinks as controls are added but stays above the causal estimate
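
The mechanism can be illustrated on synthetic data. The sketch below is not the notebook code: it assumes a single hypothetical binary confounder (`saver`) and a made-up effect of 10,000, and compares a naive difference in means with a stratified comparison. In this toy setting stratifying removes the bias completely, which is more optimistic than the real data, where linear controls leave residual confounding.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 10_000  # synthetic effect, chosen for illustration only

rows = []
for _ in range(20_000):
    saver = random.random() < 0.5        # hypothetical binary confounder
    p_enrol = 0.7 if saver else 0.2      # savers enrol more often
    treated = random.random() < p_enrol
    wealth = 20_000 * saver + TRUE_EFFECT * treated + random.gauss(0, 5_000)
    rows.append((saver, treated, wealth))

def diff_in_means(sample):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [y for _, t, y in sample if t]
    control = [y for _, t, y in sample if not t]
    return mean(treated) - mean(control)

naive = diff_in_means(rows)

# Compare like with like inside each confounder stratum, then average the
# stratum effects weighted by stratum size.
adjusted = sum(
    diff_in_means([r for r in rows if r[0] == s]) * sum(r[0] == s for r in rows)
    for s in (False, True)
) / len(rows)

print(f"naive:    {naive:,.0f}")
print(f"adjusted: {adjusted:,.0f}")
```

The naive figure lands near twice the synthetic effect because savers are over-represented among the treated, while the stratified figure sits close to the value built into the simulation.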


Methods

Matching and Reweighting (NB 03)

PSM and IPW construct a synthetic control group of non-participants who look statistically similar to participants. Common support is confirmed first — the propensity score distributions overlap well, so matching is comparing like-for-like.

Propensity score overlap — common support is satisfied
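
As a rough illustration of the reweighting idea, the sketch below computes a normalised (Hajek) IPW estimate on synthetic data. Everything in it is an assumption for demonstration: the three income bands, the enrolment probabilities, and the effect of 10,000. The propensity score is simply the share of treated households per band, not the model used in NB 03.

```python
import random
from statistics import mean

random.seed(1)
TRUE_EFFECT = 10_000  # synthetic, for illustration only

data = []
for _ in range(20_000):
    x = random.choice([0, 1, 2])                    # hypothetical income band
    treated = random.random() < (0.2, 0.5, 0.8)[x]  # enrolment rises with income
    y = 8_000 * x + TRUE_EFFECT * treated + random.gauss(0, 5_000)
    data.append((x, int(treated), y))

# Propensity score: share treated within each band (a saturated model).
propensity = {b: mean(t for x, t, _ in data if x == b) for b in (0, 1, 2)}

# Overlap check: every score must lie strictly between 0 and 1.
assert all(0 < e < 1 for e in propensity.values())

def ipw_ate(sample, e):
    """Normalised (Hajek) inverse-probability-weighted ATE."""
    w1 = [(t / e[x], y) for x, t, y in sample]
    w0 = [((1 - t) / (1 - e[x]), y) for x, t, y in sample]
    mu1 = sum(w * y for w, y in w1) / sum(w for w, _ in w1)
    mu0 = sum(w * y for w, y in w0) / sum(w for w, _ in w0)
    return mu1 - mu0

ate = ipw_ate(data, propensity)
print(f"IPW ATE: {ate:,.0f}")
```

Weights of 1/e for the treated and 1/(1 - e) for the untreated make each group resemble the full sample, which is why the overlap check comes first: scores near 0 or 1 would produce extreme weights.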

After matching, all standardized mean differences (SMD) fall below the 0.1 threshold. The ATE estimate stabilises around $9,200–$9,500.

Matching eliminates covariate imbalance — all SMD below 0.1
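
For reference, a common definition of the SMD divides the difference in group means by a pooled standard deviation. The helper below is a minimal sketch of that formula; the function name and the age values are hypothetical, and the repository's own `balance_stats()` may differ in detail.

```python
from math import sqrt
from statistics import mean, variance

def smd(treated, control):
    """Standardised mean difference with a pooled standard deviation."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

# Hypothetical covariate values (for example, age) in each group.
participants = [45, 50, 55, 60, 48]
before = smd(participants, [30, 35, 40, 45, 38])  # unmatched comparison group
after = smd(participants, [46, 49, 56, 59, 47])   # matched comparison group

print(f"SMD before matching: {before:.2f}")
print(f"SMD after matching:  {after:.2f}")
```

Because the SMD is scale-free, the same 0.1 threshold can be applied to income, age, and binary indicators alike.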

Double ML (NB 04)

Double ML (Chernozhukov et al. 2018) removes confounding with machine learning rather than parametric assumptions. It regresses out the effect of covariates from both the outcome and the treatment using cross-fitted nuisance models, then estimates the ATE from the clean residuals. The slope of this residual scatter is the causal effect.

DML residual scatter — slope equals the ATE
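
The partialling-out step can be sketched without any ML library. The example below is an illustration under stated assumptions, not the NB 04 implementation: the data are synthetic, the confounder is a hypothetical five-level income band with a non-linear effect on wealth, and the nuisance learner is a toy band-mean predictor standing in for Lasso or GBM. It does keep the two ingredients described above: nuisance predictions come from folds that exclude the observation, and the estimate is the slope of outcome residuals on treatment residuals.

```python
import random
from collections import defaultdict
from statistics import mean

random.seed(2)
TRUE_EFFECT = 10_000  # synthetic, for illustration only

data = []
for _ in range(10_000):
    x = random.randrange(5)                     # hypothetical income band
    d = int(random.random() < 0.15 + 0.15 * x)  # enrolment rises with x
    y = 3_000 * x**2 + TRUE_EFFECT * d + random.gauss(0, 5_000)
    data.append((x, d, y))

def fit_band_means(train, col):
    """Toy nuisance learner: mean of column `col` within each band of x."""
    groups = defaultdict(list)
    for row in train:
        groups[row[0]].append(row[col])
    return {x: mean(values) for x, values in groups.items()}

K = 5
folds = [data[k::K] for k in range(K)]
res_d, res_y = [], []
for k in range(K):
    train = [row for j, fold in enumerate(folds) if j != k for row in fold]
    m_hat = fit_band_means(train, col=1)  # E[D | X], fitted out-of-fold
    g_hat = fit_band_means(train, col=2)  # E[Y | X], fitted out-of-fold
    for x, d, y in folds[k]:
        res_d.append(d - m_hat[x])
        res_y.append(y - g_hat[x])

# Final stage: slope of outcome residuals on treatment residuals.
theta = sum(rd * ry for rd, ry in zip(res_d, res_y)) / sum(rd * rd for rd in res_d)
print(f"cross-fitted estimate: {theta:,.0f}")
```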

Causal Forests and Heterogeneity (NB 05)

The average effect masks substantial individual variation. Causal Forest estimates a treatment effect for each household — some benefit by $20K+, others by near-zero.

401(k) treatment effects are heterogeneous across the workforce

A policy tree translates this into an actionable targeting rule. Income and age are the dominant drivers of heterogeneity.

Simple policy tree for targeting maximum ROI

Feature importance — income and age drive heterogeneity
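
To make the targeting idea concrete, the sketch below chooses a single income cut-off that maximises total net benefit. It is a deliberately simplified stand-in for a policy tree (one split on one feature), and the function name, per-household cost, and (income, estimated effect) pairs are all hypothetical.

```python
def best_income_threshold(households, cost):
    """Pick the income cut-off that maximises total net benefit.

    households: list of (income, estimated_effect) pairs.
    Rule considered: treat every household with income >= threshold.
    """
    best = (float("-inf"), None)
    for threshold in sorted({inc for inc, _ in households}):
        net = sum(eff - cost for inc, eff in households if inc >= threshold)
        best = max(best, (net, threshold))
    return best[1], best[0]

# Hypothetical (income, estimated effect) pairs, for illustration only.
sample = [
    (20_000, 500),
    (35_000, 2_000),
    (50_000, 9_000),
    (65_000, 14_000),
    (80_000, 21_000),
]
threshold, net_benefit = best_income_threshold(sample, cost=3_000)
print(threshold, net_benefit)
```

A real policy tree searches over several features and depths, but the objective has the same shape: treat a household only where the estimated effect exceeds the cost of reaching it.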


Key Findings

The Estimate Is Robust

Four automated DoWhy stress-tests all pass: a randomly added confounder barely moves the estimate; replacing treatment with random noise collapses it to ~$0; estimates on 80% subsamples are stable; bootstrap variance is low.

All four DoWhy refutation tests pass
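
The logic of the placebo test can be shown without DoWhy. The sketch below uses synthetic randomised data with a known effect of 10,000 (an assumption for illustration), then shuffles the treatment labels so they carry no information. A sound estimator should return roughly zero on the shuffled labels.

```python
import random
from statistics import mean

random.seed(3)

def diff_in_means(treatment, outcome):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [y for t, y in zip(treatment, outcome) if t]
    control = [y for t, y in zip(treatment, outcome) if not t]
    return mean(treated) - mean(control)

# Synthetic randomised data with a known effect of 10,000.
treatment = [random.random() < 0.5 for _ in range(5_000)]
outcome = [10_000 * t + random.gauss(0, 5_000) for t in treatment]

estimate = diff_in_means(treatment, outcome)

# Placebo refuter: shuffle the treatment labels so they carry no signal.
placebo = treatment[:]
random.shuffle(placebo)
placebo_estimate = diff_in_means(placebo, outcome)

print(f"original: {estimate:,.0f}")
print(f"placebo:  {placebo_estimate:,.0f}")
```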

Coefficient stability confirms the monotone shrinkage pattern. The E-value of 1.57 means an unmeasured confounder would need to be associated with both participation and wealth by a risk ratio of at least 1.57 each, beyond the measured covariates, to fully explain away the estimate. That is a high bar after already controlling for income.

Coefficient stability from naive to DML

E-value — confounder needs 1.57× strength on both dimensions to explain away the result
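
For a point estimate expressed as a risk ratio, the E-value has a closed form: RR + sqrt(RR × (RR − 1)). The helper below is a minimal sketch of that formula. The function name is hypothetical, and how the repository's `compute_evalue()` converts a dollar effect on a continuous outcome to the risk-ratio scale is not shown here.

```python
from math import sqrt

def e_value(risk_ratio):
    """E-value for a point estimate on the risk-ratio scale."""
    # Protective effects (RR < 1) are inverted so the formula applies.
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + sqrt(rr * (rr - 1))

print(round(e_value(2.0), 2))
```

Under this formula an E-value of 1.57 corresponds to a risk ratio of roughly 1.15, and a risk ratio of exactly 1 gives an E-value of 1, meaning no confounding is needed to explain a null result.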

The Natural Experiment Agrees

IV/2SLS uses employer-offered eligibility (e401) as a natural experiment — it's company policy, not the employee's choice. The first-stage F-statistic of 12,940 (threshold: 10) confirms eligibility is a strong instrument.

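The 2SLS LATE of $7,327 is lower than the ATE of ~$9,800 because the two estimands average over different populations: the LATE captures only compliers, employees who participate if and only if their employer offers the plan. Since a household can only participate where an employer offers a plan, this design has no always-takers; the households left out of the LATE are never-takers, who decline the plan even when it is offered.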

The Wald estimator (ITT ÷ compliance rate = $5,028 ÷ 0.686 ≈ $7,329) reproduces the 2SLS coefficient up to rounding of the inputs, confirming internal consistency.
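
The Wald check is plain arithmetic, and can be reproduced from the two numbers quoted above. The function name below is hypothetical. Because the quoted ITT and compliance rate are rounded, the ratio lands within a few dollars of the reported $7,327 rather than matching it digit for digit.

```python
def wald_late(itt_effect, compliance_rate):
    """Wald estimator: reduced-form effect divided by first-stage effect."""
    if compliance_rate <= 0:
        raise ValueError("compliance rate must be positive")
    return itt_effect / compliance_rate

late = wald_late(itt_effect=5_028, compliance_rate=0.686)
print(f"LATE: ${late:,.0f}")
```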

First stage: employer eligibility strongly predicts participation

LATE vs ATE: IV confirms positive causal effect under a different identification assumption


Business Recommendations

1. The programme works — but the effect size should reset expectations. The true causal effect (~$9,800) is real and statistically robust. But it is roughly half the $19,500 figure that naive analysis produces. Any ROI model built on the naive number overstates the programme's return by ~2×. Rebaseline financial projections accordingly.

2. Target enrollment campaigns at mid-to-high income, middle-aged employees. The Causal Forest identifies this group as having the highest individual treatment effects — they benefit most from participation. Blanket campaigns waste resources on employees whose counterfactual savings behaviour would be similar with or without a 401(k).

3. Focus on new employer adoption, not just increasing participation rates. The IV/2SLS result shows that expanding employer eligibility (making the plan available where it currently isn't) produces a LATE of ~$7,300 per complier. Policies that extend 401(k) access to smaller firms or lower-wage sectors are likely to have real and measurable wealth effects on the employees who respond.

4. Don't assume the effect holds at the individual level. The CATE distribution shows near-zero effects for a meaningful minority. Expensive personalised outreach to employees with low predicted treatment effects is unlikely to pay off. Use the policy tree splits (age, income) to triage.


Design Decisions

Show the wrong answer first. NB 02 deliberately starts with naive OLS to make the bias concrete before introducing corrections. Seeing $19,559 shrink to $9,800 across methods is more convincing than presenting the correct answer in isolation.

Five methods with increasing rigour, not just one. PSM/IPW, DML, and Causal Forest all assume conditional independence but relax different parametric constraints. Convergence across methods under different assumptions is itself evidence — if all five land on ~$9,800, it's unlikely to be a modelling artefact.

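Cross-fitting in DML. Without cross-fitting (fitting the nuisance models on some folds and predicting on the held-out fold), overfitting in the Lasso or GBM nuisance models leaks into the ATE estimate. Cross-fitting removes that own-observation bias, while the Neyman-orthogonal score makes the estimate insensitive to first-order regularisation error in the nuisance fits. Together these are the key technical differentiators between DML and a simple two-step residual regression.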

Causal Forest over LinearDML for heterogeneity. LinearDML assumes the treatment effect is linear in covariates — it would miss non-linear heterogeneity. CausalForestDML is fully nonparametric. The wide CATE distribution confirms non-linear heterogeneity is present, justifying the extra complexity.

IV/2SLS as independent validation, not the primary estimator. Instrumental variables requires the exclusion restriction (eligibility affects wealth only through participation), which is untestable and somewhat fragile here (larger firms offer both 401(k) plans and higher wages). It is included as a robustness check under a different identification assumption, not as the headline estimate.

E-values over qualitative discussion of unmeasured confounding. Saying "there might be unmeasured savings propensity" is vague. Saying "a confounder would need to be ≥1.57× as prevalent and ≥1.57× as strongly associated with the outcome to explain away the effect" is a specific, falsifiable claim.


How to Run

Prerequisites: Python 3.12+, uv

git clone https://github.com/viv-analytics/portfolio__ml_causal_inference
cd portfolio__ml_causal_inference
uv sync
uv run jupyter lab

Or with pip:

pip install -e .
jupyter lab

Run notebooks in order (each is also standalone). To execute non-interactively:

uv run jupyter nbconvert --to notebook --execute --inplace notebooks/01_problem_and_data.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/02_naive_regression.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/03_matching_ipw.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/04_double_ml.ipynb      # ~3 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/05_causal_forests.ipynb # ~5 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/06_sensitivity_analysis.ipynb # ~2 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/07_iv_2sls.ipynb

All figures are saved automatically to reports/figures/.

| Figure | Contents |
|---|---|
| 01_selection_bias.png | Income and education distributions by participation status |
| 01_causal_dag.png | Causal DAG with observed confounders and unobserved savings propensity |
| 02_coefficient_path.png | OLS coefficient as controls are progressively added |
| 03_propensity_overlap.png | Propensity score distributions (common support check) |
| 03_balance_table.png | SMD before and after matching |
| 04_ate_comparison.png | ATE point estimates with CIs across all methods |
| 04_dml_residuals.png | DML residual scatter (Lasso vs GBM nuisance) |
| 05_cate_distribution.png | Individual treatment effect distribution from Causal Forest |
| 05_policy_tree.png | Decision tree for targeting maximum ROI |
| 05_feature_importance.png | CATE feature importance (income and age dominate) |
| 06_refutation_tests.png | DoWhy automated stress-tests |
| 06_coefficient_stability.png | Monotone shrinkage from naive OLS to DML |
| 06_evalue.png | E-value for unmeasured confounding |
| 07_first_stage.png | IV first stage (compliance rate by eligibility) |
| 07_late_vs_ate.png | LATE vs ATE comparison including IV/2SLS |

References
  • Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1).
  • Wager & Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA, 113(523).
  • Athey & Imbens (2016). Recursive partitioning for heterogeneous causal effects. PNAS, 113(27).
  • Rosenbaum & Rubin (1983). The central role of the propensity score in observational studies. Biometrika, 70(1).
  • Angrist, Imbens & Rubin (1996). Identification of causal effects using instrumental variables. JASA, 91(434).
  • Ding & VanderWeele (2016). Sensitivity analysis without assumptions. Epidemiology, 27(3).
  • Sharma & Kiciman (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216.
  • Poterba, Venti & Wise (1994). 401(k) plans and tax-deferred saving. In Studies in the Economics of Aging, NBER.

License

This project is licensed under a custom Personal Use License.

You are free to:

  • Use the code for personal or educational purposes
  • Publish your own fork or modified version on GitHub with attribution

You are not allowed to:

  • Use this code or its derivatives for commercial purposes
  • Resell or redistribute the code as your own product
  • Remove or change the license or attribution

For any use beyond personal or educational purposes, please contact the author for written permission.
