Causal Inference for 401(k) Policy Evaluation

End-to-end causal analysis of a US retirement savings programme demonstrating: selection bias diagnosis, five causal estimation methods (PSM, IPW, AIPW, Double ML, Causal Forest), heterogeneous treatment effect estimation, sensitivity analysis, and IV/2SLS natural experiment cross-validation. Estimated causal effect: ~$9,800, roughly half the $19,500 naive estimate.

Key Results

Method	Estimand	ATE / LATE	95% CI	Note
Naive OLS	ATE	$19,559	($18,200, $20,900)	❌ selection bias
OLS + controls	ATE	$12,000	($10,800, $13,200)	❌ residual confounding
PSM / IPW	ATE	$9,200–$9,500	—	✓ matched comparable groups
AIPW (Doubly Robust)	ATE	$9,800	($8,600, $11,000)	✓ robust to one misspecification
Double ML	ATE	$9,800	($8,500, $11,100)	✓ ML-powered residualization
Causal Forest	ATE	$9,900	($8,700, $11,100)	✓ individual-level effects
IV/2SLS	LATE	$7,327	($2,773, $11,881)	✓ natural experiment, compliers only

ATE = Average Treatment Effect (all employees). LATE = Local Average Treatment Effect (compliers only — employees who participate iff their employer offers the plan).

The Business Problem

A 401(k) is a US employer-sponsored retirement savings plan. Employees who enrol can contribute a portion of their salary pre-tax; many employers match contributions up to a limit. The question every HR and benefits team wants to answer:

Does enrolling in the 401(k) plan actually cause employees to accumulate more wealth — or do people who were already good savers simply choose to enrol?

This matters because the answer determines whether it's worth spending money on enrollment campaigns, employer matching, and financial education. If the effect is mostly selection bias, those dollars are wasted.

The challenge is that employees who enrol are not a random sample. They tend to earn more, be more educated, and be more financially motivated. A naive comparison of their savings measures self-selection, not the plan's causal effect. This project separates the two.

Dataset

Source: Survey of Income and Program Participation (SIPP), US government longitudinal survey. Used as a benchmark dataset in Chernozhukov et al. (2018). Loaded automatically from the doubleml package — no download required.

Property	Value
Rows	9,915 households
Features	9 (demographics, income, household structure)
Outcome	Net total financial assets (`net_tfa`)
Treatment	401(k) participation (`p401`, binary)
Instrument	Employer eligibility (`e401`, binary)

Variables

Variable	Plain English	Role
`net_tfa`	Net financial assets — total savings + investments, excluding home equity and pension	Outcome
`p401`	Does this household participate in a 401(k) plan? (yes/no)	Treatment (endogenous)
`e401`	Does the employer offer a 401(k) plan? (yes/no)	Instrument — company policy, not employee choice
`inc`	Annual family income	Confounder — higher earners enrol more and save more
`age`	Age of household head	Confounder — older workers save more and enrol more
`educ`	Years of education	Confounder — more educated workers earn and plan more
`fsize`	Family size	Confounder — larger families face more spending pressure
`marr`	Married household (yes/no)	Confounder
`twoearn`	Two-earner household (yes/no)	Confounder
`pira`	Has an individual IRA (yes/no)	Confounder — proxy for savings propensity
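
A minimal loading sketch, assuming the installed `doubleml` release exposes `fetch_401K` in `doubleml.datasets` and accepts `return_type="DataFrame"` (check your version). `load_401k_frame` and `EXPECTED_COLUMNS` are illustrative names, not the project's own `load_401k()` in `data.py`.

```python
# Columns this README relies on: outcome, treatment, instrument, confounders.
EXPECTED_COLUMNS = [
    "net_tfa", "p401", "e401",
    "inc", "age", "educ", "fsize", "marr", "twoearn", "pira",
]


def load_401k_frame():
    """Fetch the 401(k) benchmark data as a pandas DataFrame.

    The import is deferred so this snippet can be imported without doubleml
    installed; calling the function needs the package and fetches the data.
    """
    from doubleml.datasets import fetch_401K  # assumed import path

    df = fetch_401K(return_type="DataFrame")
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found in fetched data: {missing}")
    return df[EXPECTED_COLUMNS]
```

Calling `load_401k_frame().groupby("p401")["net_tfa"].mean()` would then give the raw participant versus non-participant comparison that the later sections correct.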

Project Structure

ml_causal_inference/
├── pyproject.toml
├── uv.lock
├── notebooks/
│   ├── 01_problem_and_data.ipynb       # EDA · Causal DAG · Selection bias
│   ├── 02_naive_regression.ipynb       # OLS · Omitted variable bias · Sensitivity
│   ├── 03_matching_ipw.ipynb           # PSM · IPW · AIPW · Balance diagnostics
│   ├── 04_double_ml.ipynb              # Manual DML · Cross-fitting · econml LinearDML
│   ├── 05_causal_forests.ipynb         # Causal Forest · CATE · Policy tree · ROI
│   ├── 06_sensitivity_analysis.ipynb   # DoWhy refutations · E-values · Stability
│   └── 07_iv_2sls.ipynb                # IV/2SLS · First stage · LATE vs ATE
├── reports/figures/                    # 16 auto-generated PNGs
├── src/causal_401k/
│   ├── __init__.py
│   ├── constants.py    # Centralised thresholds, seeds, reference estimates
│   ├── data.py         # load_401k(), get_feature_groups()
│   ├── plots.py        # 13 reusable plot functions, Mazzanti style
│   └── utils.py        # balance_stats(), compute_evalue(), first_stage_stats(), DML helpers
└── tests/
    ├── test_data.py    # 20 tests
    ├── test_plots.py   # 35 tests — all 13 plot functions
    └── test_utils.py   # 34 tests — all utility functions

Pipeline

Raw data (fetch_401K)
  │
  ▼
01_problem_and_data   EDA, causal DAG, selection bias visualisation
  │
  ▼
02_naive_regression   Baseline OLS, coefficient sensitivity as controls are added
  │
  ▼
03_matching_ipw       PSM (1:1 NN), IPW, AIPW — overlap check, SMD balance test
  │
  ▼
04_double_ml          Manual DML (Lasso + GBM nuisance), econml LinearDML
  │
  ▼
05_causal_forests     CausalForestDML, CATE distribution, policy tree, ROI calc
  │
  ▼
06_sensitivity        DoWhy refutations, coefficient stability, E-values
  │
  ▼
07_iv_2sls            IV/2SLS via linearmodels, first-stage diagnostics, Wald check

Each notebook is self-contained and can be run independently. Notebooks 06–07 reference the ~$9,800 DML estimate from NB 04 as a constant.

Why Naive Regression Fails

Employees who enrol in 401(k) plans are systematically different from those who don't — they earn more, are more educated, and are more financially motivated. They would have saved more regardless of the plan.

Adding controls helps but doesn't eliminate the bias. The OLS coefficient shrinks monotonically from $27K to $12K as demographics, income, and household controls are added, but it never reaches the ~$9,800 that the causal estimators converge on. The gap between full-controls OLS and DML is consistent with non-linear confounding that a linear specification can't absorb.
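
The mechanism can be reproduced on synthetic data. The sketch below is not the project's code and uses made-up numbers: a latent "saver" type raises both the chance of enrolling and wealth itself, so a raw comparison of enrolled and non-enrolled households overstates a known effect of 10 (in $K), while comparing within each saver type recovers it.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 10.0  # in $K, chosen for the simulation

rows = []
for _ in range(20_000):
    saver = random.random() < 0.5                        # confounder: saver type
    treated = random.random() < (0.7 if saver else 0.2)  # savers enrol more often
    wealth = TRUE_EFFECT * treated + 20.0 * saver + random.gauss(0, 5)
    rows.append((saver, treated, wealth))


def diff_in_means(data):
    """Mean outcome of treated rows minus mean outcome of untreated rows."""
    treated = [y for _, t, y in data if t]
    control = [y for _, t, y in data if not t]
    return mean(treated) - mean(control)


naive = diff_in_means(rows)

# Adjust: compare within each confounder stratum, then weight by stratum size.
adjusted = 0.0
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    adjusted += diff_in_means(stratum) * len(stratum) / len(rows)

print(f"naive: {naive:.1f}  adjusted: {adjusted:.1f}  truth: {TRUE_EFFECT}")
```

With these parameters the naive difference comes out at roughly twice the true effect. In the real data the saver type is not observed directly, which is why the project leans on proxies such as `pira` and `inc`.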

Methods

Matching and Reweighting (NB 03)

PSM pairs each participant with a non-participant who has a similar propensity score, while IPW reweights the full sample so that participants and non-participants share the same covariate distribution. Common support is confirmed first: the propensity score distributions overlap well, so both methods compare like with like.

After matching, all standardized mean differences (SMD) fall below the 0.1 threshold. The ATE estimate stabilises around $9,200–$9,500.
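
A minimal sketch of the balance check and the IPW estimate on synthetic data, under the assumption of a single discrete confounder (an income band) so the propensity score can be estimated as a simple enrolment rate per band. The helper names are illustrative, not the project's `balance_stats()`, and the SMD here uses the unweighted pooled standard deviation as its denominator.

```python
import random
from statistics import mean, pvariance

random.seed(1)

# Synthetic households: income band (0..3) drives both enrolment and wealth.
data = []
for _ in range(20_000):
    band = random.randrange(4)
    treated = random.random() < 0.2 + 0.2 * band       # P(T=1 | band): 0.2 .. 0.8
    wealth = 10.0 * treated + 8.0 * band + random.gauss(0, 5)
    data.append((band, treated, wealth))

# Propensity score: enrolment rate within each band.
ps = {b: mean([float(t) for x, t, _ in data if x == b]) for b in range(4)}


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def smd(x_t, x_c, w_t=None, w_c=None):
    """Standardized mean difference between (optionally weighted) groups."""
    w_t = w_t or [1.0] * len(x_t)
    w_c = w_c or [1.0] * len(x_c)
    pooled_sd = ((pvariance(x_t) + pvariance(x_c)) / 2) ** 0.5
    return (weighted_mean(x_t, w_t) - weighted_mean(x_c, w_c)) / pooled_sd


treated = [(x, y) for x, t, y in data if t]
control = [(x, y) for x, t, y in data if not t]
w_t = [1 / ps[x] for x, _ in treated]              # ATE weights: 1 / e(x)
w_c = [1 / (1 - ps[x]) for x, _ in control]        # and 1 / (1 - e(x))

smd_before = smd([x for x, _ in treated], [x for x, _ in control])
smd_after = smd([x for x, _ in treated], [x for x, _ in control], w_t, w_c)

# Normalised (Hajek) IPW estimate of the ATE; the simulated truth is 10.
ate_ipw = (weighted_mean([y for _, y in treated], w_t)
           - weighted_mean([y for _, y in control], w_c))

print(f"SMD before: {smd_before:.2f}  after: {smd_after:.2f}  IPW ATE: {ate_ipw:.1f}")
```

The same logic applies with a fitted propensity model and many covariates: compute the SMD per covariate and check each against the 0.1 threshold.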

Double ML (NB 04)

Double ML (Chernozhukov et al. 2018) removes confounding with machine learning rather than parametric assumptions. It uses cross-fitted nuisance models to predict both the outcome and the treatment from the covariates, subtracts those predictions, and regresses the outcome residuals on the treatment residuals. The slope of that residual-on-residual regression is the ATE estimate.
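
The residualisation can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the data are synthetic with a true effect of 10, there is a single confounder `x`, and a toy bin-means learner stands in for the Lasso and GBM nuisance models.

```python
import random

random.seed(2)
N, K, BINS = 10_000, 5, 20

# Synthetic data: x confounds treatment t and outcome y (non-linearly).
x = [random.random() for _ in range(N)]
t = [1.0 if random.random() < 0.2 + 0.6 * xi else 0.0 for xi in x]
y = [10.0 * ti + 20.0 * xi ** 2 + random.gauss(0, 5) for xi, ti in zip(x, t)]


def fit_bin_means(xs, targets):
    """Toy nuisance learner: mean of the target within equal-width bins of x."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for xi, v in zip(xs, targets):
        b = min(int(xi * BINS), BINS - 1)
        sums[b] += v
        counts[b] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda xi: means[min(int(xi * BINS), BINS - 1)]


res_t, res_y = [0.0] * N, [0.0] * N
for k in range(K):
    test_idx = [i for i in range(N) if i % K == k]
    train_idx = [i for i in range(N) if i % K != k]
    # Cross-fitting: nuisance models are fit on the other folds only.
    m_hat = fit_bin_means([x[i] for i in train_idx], [t[i] for i in train_idx])
    g_hat = fit_bin_means([x[i] for i in train_idx], [y[i] for i in train_idx])
    for i in test_idx:
        res_t[i] = t[i] - m_hat(x[i])   # treatment residual
        res_y[i] = y[i] - g_hat(x[i])   # outcome residual

# Final stage: slope of outcome residuals on treatment residuals.
theta = sum(a * b for a, b in zip(res_t, res_y)) / sum(a * a for a in res_t)
print(f"DML estimate: {theta:.2f} (simulated truth: 10)")
```

Swapping `fit_bin_means` for any learner with a fit-then-predict shape leaves the rest of the procedure unchanged, which is the appeal of the approach.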

Causal Forests and Heterogeneity (NB 05)

The average effect masks substantial individual variation. Causal Forest estimates a treatment effect for each household — some benefit by $20K+, others by near-zero.

A policy tree translates this into an actionable targeting rule. Income and age are the dominant drivers of heterogeneity.

Key Findings

The Estimate Is Robust

Four automated DoWhy stress-tests all pass: a randomly added confounder barely moves the estimate; replacing treatment with random noise collapses it to ~$0; estimates on 80% subsamples are stable; bootstrap variance is low.
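
The logic behind two of these checks, the placebo treatment and the subsample test, can be sketched without DoWhy. The example below uses synthetic data with a known effect of 10 and a simple stratified estimator; it illustrates the idea only and does not call or reproduce DoWhy's API.

```python
import random
from statistics import mean

random.seed(3)

rows = []
for _ in range(20_000):
    band = random.randrange(4)                       # observed confounder
    treated = random.random() < 0.2 + 0.2 * band
    wealth = 10.0 * treated + 8.0 * band + random.gauss(0, 5)
    rows.append((band, treated, wealth))


def stratified_effect(data):
    """Within-band difference in means, weighted by band size."""
    total = 0.0
    for b in range(4):
        y1 = [y for x, t, y in data if x == b and t]
        y0 = [y for x, t, y in data if x == b and not t]
        total += (mean(y1) - mean(y0)) * (len(y1) + len(y0)) / len(data)
    return total


estimate = stratified_effect(rows)

# Placebo: shuffle the treatment column so it carries no causal signal.
labels = [t for _, t, _ in rows]
random.shuffle(labels)
placebo = stratified_effect([(x, t, y) for (x, _, y), t in zip(rows, labels)])

# Subsample: re-estimate on a random 80% of the rows.
subsample = stratified_effect(random.sample(rows, int(0.8 * len(rows))))

print(f"estimate: {estimate:.1f}  placebo: {placebo:.1f}  subsample: {subsample:.1f}")
```

A sound estimator should give a placebo effect near zero and a subsample estimate close to the full-sample one, which is the pattern described above.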

Coefficient stability confirms the monotone shrinkage pattern. The E-value of 1.57 means an unmeasured confounder would need to be associated with both 401(k) participation and wealth by a risk ratio of at least 1.57 each, over and above the measured covariates, to explain away the effect. That is a high bar after already controlling for income.
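
For reference, the standard E-value for a risk ratio RR ≥ 1 is RR + sqrt(RR × (RR − 1)). The sketch below implements that formula and its inverse. It assumes the effect has already been expressed as a risk ratio, which for a continuous outcome such as `net_tfa` needs a conversion step not shown here, and the function names are illustrative rather than the project's `compute_evalue()`.

```python
from math import sqrt


def evalue(rr):
    """E-value for a risk ratio: RR + sqrt(RR * (RR - 1)), using 1/RR if RR < 1."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = max(rr, 1.0 / rr)
    return rr + sqrt(rr * (rr - 1.0))


def implied_rr(e):
    """Invert the formula: the risk ratio (>= 1) whose E-value is e."""
    if e < 1:
        raise ValueError("an E-value is at least 1")
    return e * e / (2.0 * e - 1.0)


print(round(evalue(2.0), 3))       # 2 + sqrt(2)
print(round(implied_rr(1.57), 2))  # risk ratio behind an E-value of 1.57
```

Under this formula, an E-value of 1.57 corresponds to a risk ratio of roughly 1.15, assuming the standard formula is the one used to produce the reported figure.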

The Natural Experiment Agrees

IV/2SLS uses employer-offered eligibility (e401) as a natural experiment — it's company policy, not the employee's choice. The first-stage F-statistic of 12,940 (threshold: 10) confirms eligibility is a strong instrument.
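
Why an instrument helps can be seen in a small simulation. The sketch below is synthetic and makes its own assumptions: eligibility is assigned at random, participation is only possible for eligible households, an unobserved saver type drives both take-up and wealth, and the true effect of participating is 8 (in $K) for everyone.

```python
import random
from statistics import mean

random.seed(4)

rows = []
for _ in range(40_000):
    saver = random.random() < 0.5                    # unobserved confounder
    eligible = random.random() < 0.4                 # instrument: employer offers a plan
    takes_up = random.random() < (0.9 if saver else 0.5)
    participates = eligible and takes_up             # no participation without eligibility
    wealth = 8.0 * participates + 15.0 * saver + random.gauss(0, 5)
    rows.append((eligible, participates, wealth))

# Naive comparison of participants and non-participants (biased by saver type).
naive = (mean(y for _, d, y in rows if d)
         - mean(y for _, d, y in rows if not d))

# Reduced form (ITT): effect of eligibility on wealth.
itt = (mean(y for z, _, y in rows if z)
       - mean(y for z, _, y in rows if not z))

# First stage: effect of eligibility on participation.
first_stage = (mean(float(d) for z, d, _ in rows if z)
               - mean(float(d) for z, d, _ in rows if not z))

wald = itt / first_stage
print(f"naive: {naive:.1f}  Wald/IV: {wald:.1f}  simulated truth: 8")
```

The naive comparison absorbs the saver effect, while the ratio of reduced form to first stage does not, because eligibility is unrelated to saver type by construction. In the real data that independence is an assumption rather than a design feature.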

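The 2SLS LATE of $7,327 is lower than the ATE of ~$9,800 because the two are different estimands: the LATE averages only over compliers, employees who participate iff their employer offers the plan. Households that would not participate even when eligible (never-takers) are excluded, and because participation requires eligibility there are no always-takers in this design. The ATE also falls inside the 2SLS confidence interval ($2,773, $11,881), so the two estimates are statistically compatible.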

The Wald estimator (ITT ÷ compliance rate = $5,028 ÷ 0.686 ≈ $7,329) reproduces the 2SLS coefficient of $7,327 up to rounding of the inputs, confirming internal consistency.
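
The arithmetic can be checked directly from the figures quoted in this README. The variable names below are illustrative.

```python
itt = 5028.0          # reduced form: effect of eligibility (e401) on net_tfa
compliance = 0.686    # first stage: effect of eligibility on participation (p401)
late_2sls = 7327.0    # 2SLS coefficient from the Key Results table

wald = itt / compliance
rel_gap = abs(wald - late_2sls) / late_2sls

# Range of Wald values consistent with a compliance rate rounded to 3 decimals.
low, high = itt / (compliance + 0.0005), itt / (compliance - 0.0005)

print(f"Wald: {wald:,.0f}  relative gap to 2SLS: {rel_gap:.2%}")
print(f"range implied by rounding: {low:,.0f} to {high:,.0f}")
```

The 2SLS coefficient sits inside the range implied by rounding the compliance rate, so the small gap does not indicate a discrepancy between the two calculations.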

Business Recommendations

1. The programme works — but the effect size should reset expectations. The true causal effect (~$9,800) is real and statistically robust. But it is roughly half the $19,500 figure that naive analysis produces. Any ROI model built on the naive number overstates the programme's return by ~2×. Rebaseline financial projections accordingly.

2. Target enrollment campaigns at mid-to-high income, middle-aged employees. The Causal Forest identifies this group as having the highest individual treatment effects — they benefit most from participation. Blanket campaigns waste resources on employees whose counterfactual savings behaviour would be similar with or without a 401(k).

3. Focus on new employer adoption, not just increasing participation rates. The IV/2SLS result shows that expanding employer eligibility (making the plan available where it currently isn't) produces a LATE of ~$7,300 per complier. Policies that extend 401(k) access to smaller firms or lower-wage sectors are likely to have real and measurable wealth effects on the employees who respond.

4. Don't assume the effect holds at the individual level. The CATE distribution shows near-zero effects for a meaningful minority. Expensive personalised outreach to employees with low predicted treatment effects is unlikely to pay off. Use the policy tree splits (age, income) to triage.

Design Decisions

Show the wrong answer first. NB 02 deliberately starts with naive OLS to make the bias concrete before introducing corrections. Seeing $19,559 shrink to $9,800 across methods is more convincing than presenting the correct answer in isolation.

Five methods with increasing rigour, not just one. PSM, IPW, AIPW, DML, and Causal Forest all assume conditional independence but relax different parametric constraints. Convergence across methods under different modelling assumptions is itself evidence: when all five land between roughly $9,200 and $9,900, the result is unlikely to be a modelling artefact.

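Cross-fitting in DML. DML combines two ingredients. The Neyman-orthogonal (residual-on-residual) score makes the ATE estimate insensitive to first-order regularisation error in the Lasso or GBM nuisance fits. Cross-fitting (predicting each fold with models trained on the other folds) removes the overfitting bias that arises when the same rows are used both to fit the nuisance models and to form the residuals. Cross-fitting is the key technical differentiator between DML and a simple two-step residual regression.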

Causal Forest over LinearDML for heterogeneity. LinearDML assumes the treatment effect is linear in covariates — it would miss non-linear heterogeneity. CausalForestDML is fully nonparametric. The wide CATE distribution confirms non-linear heterogeneity is present, justifying the extra complexity.

IV/2SLS as independent validation, not the primary estimator. Instrumental variables requires the exclusion restriction (eligibility affects wealth only through participation), which is untestable and somewhat fragile here (larger firms offer both 401(k) plans and higher wages). It is included as a robustness check under a different identification assumption, not as the headline estimate.

E-values over qualitative discussion of unmeasured confounding. Saying "there might be unmeasured savings propensity" is vague. Saying "a confounder would need to be associated with both participation and the outcome by a risk ratio of at least 1.57 to explain away the effect" is a specific, falsifiable claim.

How to Run

Prerequisites: Python 3.12+, uv

git clone https://github.com/viv-analytics/portfolio__ml_causal_inference
cd portfolio__ml_causal_inference
uv sync
uv run jupyter lab

Or with pip:

pip install -e .
jupyter lab

Run notebooks in order (each is also standalone). To execute non-interactively:

uv run jupyter nbconvert --to notebook --execute --inplace notebooks/01_problem_and_data.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/02_naive_regression.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/03_matching_ipw.ipynb
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/04_double_ml.ipynb      # ~3 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/05_causal_forests.ipynb # ~5 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/06_sensitivity_analysis.ipynb # ~2 min
uv run jupyter nbconvert --to notebook --execute --inplace notebooks/07_iv_2sls.ipynb
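
The same sequence can be scripted. The sketch below is a hypothetical helper, not a file in the repository; it builds the `nbconvert` commands listed above and only executes them when asked.

```python
import subprocess

# Notebook stems in pipeline order, taken from the Project Structure listing.
NOTEBOOKS = [
    "01_problem_and_data",
    "02_naive_regression",
    "03_matching_ipw",
    "04_double_ml",
    "05_causal_forests",
    "06_sensitivity_analysis",
    "07_iv_2sls",
]


def build_command(stem):
    """Return the nbconvert command for one notebook as an argument list."""
    return [
        "uv", "run", "jupyter", "nbconvert",
        "--to", "notebook", "--execute", "--inplace",
        f"notebooks/{stem}.ipynb",
    ]


def run_all(dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for stem in NOTEBOOKS:
        cmd = build_command(stem)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing notebook
```

Call `run_all(dry_run=False)` from the repository root to execute all seven notebooks in order.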

All figures are saved automatically to reports/figures/.

Figure	Contents
`01_selection_bias.png`	Income and education distributions by participation status
`01_causal_dag.png`	Causal DAG with observed confounders and unobserved savings propensity
`02_coefficient_path.png`	OLS coefficient as controls are progressively added
`03_propensity_overlap.png`	Propensity score distributions — common support check
`03_balance_table.png`	SMD before and after matching
`04_ate_comparison.png`	ATE point estimates with CIs across all methods
`04_dml_residuals.png`	DML residual scatter — Lasso vs GBM nuisance
`05_cate_distribution.png`	Individual treatment effect distribution from Causal Forest
`05_policy_tree.png`	Decision tree for targeting maximum ROI
`05_feature_importance.png`	CATE feature importance — income and age dominate
`06_refutation_tests.png`	DoWhy automated stress-tests
`06_coefficient_stability.png`	Monotone shrinkage from naive OLS to DML
`06_evalue.png`	E-value for unmeasured confounding
`07_first_stage.png`	IV first stage — compliance rate by eligibility
`07_late_vs_ate.png`	LATE vs ATE comparison including IV/2SLS

References

Chernozhukov et al. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1).
Wager & Athey (2018). Estimation and inference of heterogeneous treatment effects using random forests. JASA, 113(523).
Athey & Imbens (2016). Recursive partitioning for heterogeneous causal effects. PNAS, 113(27).
Rosenbaum & Rubin (1983). The central role of the propensity score in observational studies. Biometrika, 70(1).
Angrist, Imbens & Rubin (1996). Identification of causal effects using instrumental variables. JASA, 91(434).
Ding & VanderWeele (2016). Sensitivity analysis without assumptions. Epidemiology, 27(3).
Sharma & Kiciman (2020). DoWhy: An end-to-end library for causal inference. arXiv:2011.04216.
Poterba, Venti & Wise (1994). 401(k) plans and tax-deferred saving. In Studies in the Economics of Aging, NBER.

License

This project is licensed under a custom Personal Use License.

You are free to:

Use the code for personal or educational purposes
Publish your own fork or modified version on GitHub with attribution

You are not allowed to:

Use this code or its derivatives for commercial purposes
Resell or redistribute the code as your own product
Remove or change the license or attribution

For any use beyond personal or educational purposes, please contact the author for written permission.
