diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index 0def78a..74fd8f5 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -8,7 +8,7 @@
},
"plugins": [
{
- "name": "farness",
+ "name": "brier",
"description": "Decision-making framework that reframes subjective questions as forecasting problems with explicit KPIs, option expansion, and calibration tracking",
"version": "0.1.0",
"author": {
diff --git a/.claude/skills/farness/SKILL.md b/.claude/skills/brier/SKILL.md
similarity index 85%
rename from .claude/skills/farness/SKILL.md
rename to .claude/skills/brier/SKILL.md
index b453359..f960599 100644
--- a/.claude/skills/farness/SKILL.md
+++ b/.claude/skills/brier/SKILL.md
@@ -1,13 +1,13 @@
---
-name: farness
-description: Use when the user wants advice or a decision recommendation rather than direct implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Prefer the local farness MCP server when available and structure the answer around KPI, option expansion, reference class, disconfirming evidence, numeric forecasts, and a review date.
+name: brier
+description: Use when the user wants advice or a decision recommendation rather than direct implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Prefer the local brier MCP server when available and structure the answer around KPI, option expansion, reference class, disconfirming evidence, numeric forecasts, and a review date.
---
-# Farness
+# Brier
Use this skill to turn vague decisions into forecastable choices.
-Prefer the local `farness` MCP server when it is connected.
+Prefer the local `brier` MCP server when it is connected.
## Workflow
@@ -43,8 +43,8 @@ Prefer the local `farness` MCP server when it is connected.
## Setup
-If the `farness` MCP server is not connected, add it with:
+If the `brier` MCP server is not connected, add it with:
```bash
-farness setup claude
+brier setup claude
```
diff --git a/CLAUDE.md b/CLAUDE.md
index f5eeb13..b8b07c1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## Project Overview
-Farness is a decision-making framework that reframes subjective questions ("Should I...?") into forecasting problems with explicit KPIs, confidence intervals, and calibration tracking. The core thesis: making numeric predictions forces mechanism thinking, creates accountability, and reduces sycophancy.
+Brier is a decision-making framework that reframes subjective questions ("Should I...?") into forecasting problems with explicit KPIs, confidence intervals, and calibration tracking. The core thesis: making numeric predictions forces mechanism thinking, creates accountability, and reduces sycophancy.
## Commands
@@ -21,24 +21,24 @@ pytest
pytest tests/test_framework.py
# Run with coverage
-pytest --cov=farness
+pytest --cov=brier
# Format code
-black farness tests
-ruff check farness tests
+black brier tests
+ruff check brier tests
```
### CLI
```bash
-farness new "question" # Create a new decision
-farness new "q" --context "details" # With context
-farness list # List all decisions
-farness list --pending # Decisions past review date
-farness show # Show decision details (supports prefix match)
-farness score [id] # Score a decision's actual outcomes (interactive)
-farness calibration # Show calibration statistics
-farness pending # Alias for list --pending
+brier new "question" # Create a new decision
+brier new "q" --context "details" # With context
+brier list # List all decisions
+brier list --pending # Decisions past review date
+brier show # Show decision details (supports prefix match)
+brier score [id] # Score a decision's actual outcomes (interactive)
+brier calibration # Show calibration statistics
+brier pending # Alias for list --pending
```
### Site (Next.js)
@@ -57,15 +57,15 @@ bun run test # Run vitest tests
python3 paper/render_paper.py # Generate figures, render HTML, sync preemptive_rigor.md and site/public/paper-raw
python3 paper/run_strongest_validation.py # Strongest reviewer-facing validation across Claude Opus 4.6 and GPT-5.2
python3 paper/run_study1_rerun.py --models gpt-5.4 # Original Study 1 rerun with legacy prompt wording
-python3 -m farness.experiments stability --strongest-validation --model gpt-5.2 # Single-model strongest validation
+python3 -m brier.experiments stability --strongest-validation --model gpt-5.2 # Single-model strongest validation
```
## Architecture
-### Python Package (`farness/`)
+### Python Package (`brier/`)
- **framework.py**: Core dataclasses (`Decision`, `KPI`, `Option`, `Forecast`) with serialization. `Option.expected_value()` computes weighted expected values across KPIs. `Decision.best_option()` and `sensitivity_analysis()` for analysis.
-- **storage.py**: `DecisionStore` persists decisions to `~/.farness/decisions.jsonl` in JSONL format. Supports CRUD and filtered queries (unscored, pending review, scored).
+- **storage.py**: `DecisionStore` persists decisions to `~/.brier/decisions.jsonl` in JSONL format. Supports CRUD and filtered queries (unscored, pending review, scored).
- **calibration.py**: `CalibrationTracker` computes forecast accuracy metrics: coverage (% of actuals in CIs), calibration error (coverage vs stated confidence), MAE, MRE, Brier scores.
- **cli.py**: Argparse CLI wrapping the above modules.
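Reviewer-side illustration (not part of the changed files): the metrics listed for `calibration.py` can be sketched as below. This is a minimal sketch, not the package's `CalibrationTracker` API. `ScoredInterval` and the three function names are hypothetical, and the sign convention for calibration error (observed coverage minus stated confidence) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ScoredInterval:
    """Hypothetical record: a stated interval plus the realized value."""
    low: float
    high: float
    confidence: float  # stated confidence level, e.g. 0.8
    actual: float


def coverage(records: list[ScoredInterval]) -> float:
    """Fraction of actual values that fall inside their stated interval."""
    hits = sum(1 for r in records if r.low <= r.actual <= r.high)
    return hits / len(records)


def calibration_error(records: list[ScoredInterval]) -> float:
    """Observed coverage minus mean stated confidence (assumed sign convention)."""
    stated = sum(r.confidence for r in records) / len(records)
    return coverage(records) - stated


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

Under this convention a negative calibration error means the intervals were too narrow for the stated confidence, and a lower Brier score is better.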
diff --git a/README.md b/README.md
index 03f79f9..e47d8e5 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
-# Farness
+# Brier
**Forecasting as a harness for decision-making.**
-Instead of asking "Is X good?" or "Should I do Y?", farness helps you:
+Instead of asking "Is X good?" or "Should I do Y?", brier helps you:
1. Define what success looks like (KPIs)
2. Expand your options (including ones you didn't consider)
3. Make explicit forecasts (with confidence intervals and resolution rules)
@@ -11,7 +11,7 @@ Instead of asking "Is X good?" or "Should I do Y?", farness helps you:
## Installation
```bash
-python -m pip install 'farness[mcp]'
+python -m pip install 'brier[mcp]'
```
## Quick Start
@@ -19,17 +19,17 @@ python -m pip install 'farness[mcp]'
### Codex
```bash
-farness setup codex
-farness doctor codex
+brier setup codex
+brier doctor codex
```
-Then restart Codex and use `$farness` when a decision prompt appears.
+Then restart Codex and use `$brier` when a decision prompt appears.
### Claude Code
```bash
-farness setup claude
-farness doctor claude
+brier setup claude
+brier doctor claude
```
Then restart Claude Code.
@@ -37,9 +37,9 @@ Then restart Claude Code.
### Local CLI
```bash
-farness new "Should we rewrite the auth layer?" --context "3 incidents this quarter"
-farness list
-farness calibration
+brier new "Should we rewrite the auth layer?" --context "3 incidents this quarter"
+brier list
+brier calibration
```
The CLI is local-only and does not call an LLM or require an API key.
@@ -47,7 +47,7 @@ The CLI is local-only and does not call an LLM or require an API key.
### Python package
```python
-from farness import Decision, KPI, Option, Forecast, DecisionStore
+from brier import Decision, KPI, Option, Forecast, DecisionStore
from datetime import datetime, timedelta
# Create a decision
@@ -109,20 +109,20 @@ store.save(decision)
### Command Line
```bash
-farness new "Should we launch now?"
-farness show abc123
-farness pending
-farness calibration
+brier new "Should we launch now?"
+brier show abc123
+brier pending
+brier calibration
```
### Forecast Question Drafts
-`farness` can turn a stored decision forecast or standalone policy question into
+`brier` can turn a stored decision forecast or standalone policy question into
Manifold-ready forecast question drafts. This is draft-only: it does not publish
questions, place a bet, or require a Manifold API key.
```bash
-farness forecast-draft "Will Waymo be legally permitted to offer fully driverless paid robotaxi rides in Washington, DC by 2026-12-31?" \
+brier forecast-draft "Will Waymo be legally permitted to offer fully driverless paid robotaxi rides in Washington, DC by 2026-12-31?" \
--initial-prob 52 \
--resolution-date 2026-12-31 \
--resolution-rule "Resolve YES if official DC law, regulation, or permit approval allows Waymo to offer fully driverless paid public rides in DC by 2026-12-31." \
@@ -136,7 +136,7 @@ farness forecast-draft "Will Waymo be legally permitted to offer fully driverles
For a stored decision with options and forecasts:
```bash
-farness forecast-draft abc123 --output forecast-pack.json
+brier forecast-draft abc123 --output forecast-pack.json
```
An example Waymo/DC draft pack lives at
@@ -148,7 +148,7 @@ way.
### AI Agent Workflows
-`farness` is not tied to Claude. The Claude Code plugin is the most integrated path today, but the framework also works with Codex and other coding agents that can follow structured instructions or run shell commands.
+`brier` is not tied to Claude. The Claude Code plugin is the most integrated path today, but the framework also works with Codex and other coding agents that can follow structured instructions or run shell commands.
For agent-agnostic setup and prompt guidance, see [`docs/agent-workflows.md`](docs/agent-workflows.md).
@@ -157,15 +157,15 @@ For agent-agnostic setup and prompt guidance, see [`docs/agent-workflows.md`](do
The default builder path is package-first:
```bash
-python -m pip install 'farness[mcp]'
-farness setup codex
-farness doctor codex
+python -m pip install 'brier[mcp]'
+brier setup codex
+brier doctor codex
```
For source installs during development:
```bash
-python -m pip install -e /path/to/farness
+python -m pip install -e /path/to/brier
```
#### MCP server
@@ -173,29 +173,29 @@ python -m pip install -e /path/to/farness
If you want a native tool interface instead of prompt copy-paste, install the package and run the MCP server locally:
```bash
-python -m pip install 'farness[mcp]'
-farness-mcp
+python -m pip install 'brier[mcp]'
+brier-mcp
```
-It exposes tools for creating, listing, retrieving, saving, and scoring decisions, plus resources/prompts for the farness workflow.
+It exposes tools for creating, listing, retrieving, saving, and scoring decisions, plus resources/prompts for the brier workflow.
To register it in Codex as a local MCP server:
```bash
-farness setup codex
-farness doctor codex
+brier setup codex
+brier doctor codex
```
-This installs the packaged Codex skill and registers the MCP server with the same Python interpreter that launched `farness`.
+This installs the packaged Codex skill and registers the MCP server with the same Python interpreter that launched `brier`.
#### Claude Code local skill + MCP
Claude Code can use the same local MCP server and a local skill wrapper:
```bash
-python -m pip install 'farness[mcp]'
-farness setup claude
-farness doctor claude
+python -m pip install 'brier[mcp]'
+brier setup claude
+brier doctor claude
```
This installs the packaged Claude skill and registers the MCP server in user scope.
@@ -203,38 +203,38 @@ This installs the packaged Claude skill and registers the MCP server in user sco
The plugin path still works if you prefer the slash-command workflow:
```bash
-claude plugin marketplace add MaxGhenis/farness
-claude plugin install farness@maxghenis-plugins
+claude plugin marketplace add MaxGhenis/brier
+claude plugin install brier@maxghenis-plugins
```
-Then either use the local `farness` skill or `/farness:decide` if you installed the plugin.
+Then either use the local `brier` skill or `/brier:decide` if you installed the plugin.
#### Repair and reset
If setup drifted or a skill was modified locally:
```bash
-farness doctor codex --fix
-farness doctor claude --fix
+brier doctor codex --fix
+brier doctor claude --fix
```
If you want to remove the local integration and start over:
```bash
-farness uninstall codex
-farness setup codex
+brier uninstall codex
+brier setup codex
```
or:
```bash
-farness uninstall claude
-farness setup claude
+brier uninstall claude
+brier setup claude
```
## The Framework
-Farness implements a structured decision process:
+Brier implements a structured decision process:
1. **KPI Definition** - What outcomes actually matter? Make them measurable.
Add outcome type, resolution date, resolution rule, and data source when possible.
@@ -262,8 +262,8 @@ Farness implements a structured decision process:
## Development
```bash
-git clone https://github.com/MaxGhenis/farness
-cd farness
+git clone https://github.com/MaxGhenis/brier
+cd brier
pip install -e ".[dev,experiments]"
pytest
python -m build
@@ -277,7 +277,7 @@ Paper build:
python3 paper/render_paper.py # Regenerates figures, HTML, Markdown, and site/public/paper-raw
python3 paper/run_strongest_validation.py # Runs the strongest reviewer-facing validation on Claude Opus 4.6 and GPT-5.2
python3 paper/run_study1_rerun.py --models gpt-5.4 # Reruns the original Study 1 design with legacy prompt wording
-python3 -m farness.experiments stability --strongest-validation --model gpt-5.2 # Single-model equivalent
+python3 -m brier.experiments stability --strongest-validation --model gpt-5.2 # Single-model equivalent
```
### Publishing to PyPI
@@ -285,11 +285,11 @@ python3 -m farness.experiments stability --strongest-validation --model gpt-5.2
The package is published to PyPI from GitHub Releases using PyPI Trusted Publishing.
**Setup (one-time):**
-1. In PyPI, open the `farness` project publishing settings:
- - `https://pypi.org/manage/project/farness/settings/publishing/`
+1. In PyPI, open the `brier` project publishing settings:
+ - `https://pypi.org/manage/project/brier/settings/publishing/`
2. Add a GitHub Actions trusted publisher with:
- Owner: `MaxGhenis`
- - Repository name: `farness`
+ - Repository name: `brier`
- Workflow name: `publish.yml`
- Environment name: leave blank unless you later add a GitHub environment
diff --git a/TODO-paper-revisions.md b/TODO-paper-revisions.md
index 22ed331..29d7194 100644
--- a/TODO-paper-revisions.md
+++ b/TODO-paper-revisions.md
@@ -1,22 +1,22 @@
-# Farness paper revisions — March 15, 2026
+# Brier paper revisions — March 15, 2026
## Priority 1: Narrative fixes
-- [x] **Reframe convergence finding**: "farness starts closer to where both end up after probing" — not divergence, not overshoot. Both conditions converge on similar final values; farness just starts closer. Change throughout abstract, Section 5.5, Section 6.3, Section 7.
-- [x] **Introduce farness properly**: "I introduce farness, a structured decision framework" not "I evaluate a framework called farness." This paper IS the introduction. Add footnote linking to GitHub/site.
+- [x] **Reframe convergence finding**: "Brier starts closer to where both end up after probing" — not divergence, not overshoot. Both conditions converge on similar final values; Brier just starts closer. Change throughout abstract, Section 5.5, Section 6.3, Section 7.
+- [x] **Introduce Brier properly**: "I introduce Brier, a structured decision framework" not "I evaluate a framework called Brier." This paper IS the introduction. Add footnote linking to GitHub/site.
- [x] **Drop "pre-registered" claims**: Replace with "analysis code was committed prior to data collection (December 2025; experiments ran February 2026)." No formal pre-registration exists — just git history (commits 50e93d4, bfd1aae predate experiment runs).
## Priority 2: Graphs (desperately needed)
- [ ] **Update magnitude box/violin plots**: by condition, for each model
- [ ] **Per-scenario forest plot**: effect sizes with CIs for each scenario
-- [ ] **Convergence visualization**: show initial→final for naive vs farness on 2-3 scenarios, illustrating "farness starts closer to where both end up"
+- [ ] **Convergence visualization**: show initial→final for naive vs Brier on 2-3 scenarios, illustrating "Brier starts closer to where both end up"
- [ ] **Sycophancy bar chart**: Claude vs GPT-5.2 update magnitude on sycophancy scenario — the most dramatic finding
## Priority 3: Content additions
-- [ ] **Concrete example**: Pick one scenario (e.g., sunk_cost_project), show actual responses from naive and farness conditions, before and after probing. Raw text excerpts.
-- [ ] **Sycophancy deep-dive**: GPT-5.2 naive updates by 466.7 leads on average under sycophantic pressure (1000→1300-1400). Claude: zero update. Farness on GPT: 108.3. This is the clearest finding in the paper and currently buried.
+- [ ] **Concrete example**: Pick one scenario (e.g., sunk_cost_project), show actual responses from naive and Brier conditions, before and after probing. Raw text excerpts.
+- [ ] **Sycophancy deep-dive**: GPT-5.2 naive updates by 466.7 leads on average under sycophantic pressure (1000→1300-1400). Claude: zero update. Brier on GPT: 108.3. This is the clearest finding in the paper and currently buried.
- [ ] **Run symmetric sycophancy test**: Current test only pushes "higher." Add "I think it should be lower" version to confirm framework resists pressure in both directions. ~12 API calls, ~$5.
## Priority 4: Technical fixes
@@ -36,9 +36,9 @@
## Key data points for reference
-- Claude mixed-effects: farness = -4.17 (p<0.001), CoT = -0.56 (p=0.34)
-- GPT mixed-effects: farness = -37.0 (p=0.009), CoT = -29.7 (p=0.036)
-- GPT sycophancy (adversarial_sycophancy): naive mean update = 466.7 leads, farness = 108.3, Claude naive = 0.0
+- Claude mixed-effects: Brier = -4.17 (p<0.001), CoT = -0.56 (p=0.34)
+- GPT mixed-effects: Brier = -37.0 (p=0.009), CoT = -29.7 (p=0.036)
+- GPT sycophancy (adversarial_sycophancy): naive mean update = 466.7 leads, Brier = 108.3, Claude naive = 0.0
- Scenarios use different units: percentages (most), weeks (planning), leads (sycophancy)
- Analysis code: commits 50e93d4 (Dec 19) and bfd1aae (Dec 20), experiments: Feb 16-18
- Skill optimization loop was running (PID 20928) — check if it finished and apply the optimized description
diff --git a/brier/__init__.py b/brier/__init__.py
new file mode 100644
index 0000000..db755f2
--- /dev/null
+++ b/brier/__init__.py
@@ -0,0 +1,21 @@
+"""Brier: Forecasting as a harness for decision-making."""
+
+__version__ = "0.2.4"
+
+from brier.framework import Decision, KPI, Option, Forecast, OutcomeType
+from brier.storage import DecisionStore
+from brier.calibration import CalibrationTracker
+from brier.market import MarketDraft, MarketSource, draft_markets_for_decision
+
+__all__ = [
+ "Decision",
+ "KPI",
+ "Option",
+ "Forecast",
+ "OutcomeType",
+ "DecisionStore",
+ "CalibrationTracker",
+ "MarketDraft",
+ "MarketSource",
+ "draft_markets_for_decision",
+]
diff --git a/farness/agent_setup.py b/brier/agent_setup.py
similarity index 94%
rename from farness/agent_setup.py
rename to brier/agent_setup.py
index ffd5c69..f766f10 100644
--- a/farness/agent_setup.py
+++ b/brier/agent_setup.py
@@ -9,10 +9,10 @@
from dataclasses import dataclass
from pathlib import Path
-from farness.skills import inspect_skill
-from farness.skills import install_skill
-from farness.skills import remove_skill
-from farness.skills import resolve_skill_path
+from brier.skills import inspect_skill
+from brier.skills import install_skill
+from brier.skills import remove_skill
+from brier.skills import resolve_skill_path
@dataclass
@@ -87,7 +87,7 @@ def _mcp_add_command(agent: str, server_name: str, python_bin: str) -> list[str]
"--",
python_bin,
"-m",
- "farness.mcp_server",
+ "brier.mcp_server",
]
return [
cli,
@@ -99,7 +99,7 @@ def _mcp_add_command(agent: str, server_name: str, python_bin: str) -> list[str]
"--",
python_bin,
"-m",
- "farness.mcp_server",
+ "brier.mcp_server",
]
@@ -142,7 +142,7 @@ def _ensure_mcp_server(agent: str, server_name: str, python_bin: str) -> str:
def manual_setup_command(
- agent: str, python_bin: str, server_name: str = "farness"
+ agent: str, python_bin: str, server_name: str = "brier"
) -> str:
"""Return the fallback MCP registration command for an agent."""
return shlex.join(_mcp_add_command(agent, server_name, python_bin))
@@ -153,7 +153,7 @@ def inspect_agent_setup(
*,
target_dir: str | None = None,
python_bin: str | None = None,
- server_name: str = "farness",
+ server_name: str = "brier",
) -> AgentDoctorResult:
"""Inspect the local skill and MCP registration for an agent."""
cli = _agent_cli_name(agent)
@@ -183,7 +183,7 @@ def repair_agent_setup(
target_dir: str | None = None,
force_skill: bool = False,
python_bin: str | None = None,
- server_name: str = "farness",
+ server_name: str = "brier",
) -> AgentRepairResult:
"""Install or repair the packaged skill and MCP registration for an agent."""
cli = _agent_cli_name(agent)
@@ -221,7 +221,7 @@ def remove_agent_setup(
agent: str,
*,
target_dir: str | None = None,
- server_name: str = "farness",
+ server_name: str = "brier",
remove_mcp: bool = True,
) -> AgentUninstallResult:
"""Remove the packaged skill and optionally the MCP server for an agent."""
@@ -258,7 +258,7 @@ def setup_agent(
target_dir: str | None = None,
force_skill: bool = False,
python_bin: str | None = None,
- server_name: str = "farness",
+ server_name: str = "brier",
) -> AgentSetupResult:
"""Install the packaged skill and configure the local MCP server."""
cli = _agent_cli_name(agent)
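Reviewer-side illustration (not part of the changed files): `manual_setup_command` returns `shlex.join(...)` over the argument list, so the renamed module path ends up in a shell-safe string. The sketch below shows that quoting behaviour. `join_command` and the `prefix` argument are hypothetical stand-ins for the leading arguments the diff does not show. Only the trailing `--`, interpreter, `-m`, `brier.mcp_server` sequence is taken from the hunk.

```python
import shlex
import sys


def join_command(prefix: list[str], python_bin: str = sys.executable) -> str:
    """Join a hypothetical CLI prefix with the module invocation from the diff."""
    # The four trailing arguments match the list tail shown in _mcp_add_command.
    argv = [*prefix, "--", python_bin, "-m", "brier.mcp_server"]
    # shlex.join quotes each argument, so a path containing spaces stays intact.
    return shlex.join(argv)
```

For example, `join_command(["agent-cli", "add", "brier"], "/opt/my env/bin/python")` quotes the interpreter path because it contains a space, which is why the fallback command can be pasted into a shell as printed.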
diff --git a/farness/assets/skills/claude/SKILL.md b/brier/assets/skills/claude/SKILL.md
similarity index 79%
rename from farness/assets/skills/claude/SKILL.md
rename to brier/assets/skills/claude/SKILL.md
index 10f4ab4..b0d73da 100644
--- a/farness/assets/skills/claude/SKILL.md
+++ b/brier/assets/skills/claude/SKILL.md
@@ -1,13 +1,13 @@
---
-name: farness
-description: Use when the user wants advice or a decision recommendation rather than direct implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Prefer the local farness MCP server when available and structure the answer around KPI, option expansion, reference class, disconfirming evidence, numeric forecasts, and a review date.
+name: brier
+description: Use when the user wants advice or a decision recommendation rather than direct implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Prefer the local brier MCP server when available and structure the answer around KPI, option expansion, reference class, disconfirming evidence, numeric forecasts, and a review date.
---
-# Farness
+# Brier
Use this skill to turn vague decisions into forecastable choices.
-Prefer the local `farness` MCP server when it is connected.
+Prefer the local `brier` MCP server when it is connected.
## Workflow
@@ -31,8 +31,8 @@ Prefer the local `farness` MCP server when it is connected.
- Do not pass KPI or option names as bare strings.
5. If outcomes are known, call `score_decision`.
6. If the user wants to externalize a forecast into a prediction market, draft it first:
- - Use `farness forecast-draft --output forecast-pack.json` for stored decisions.
- - Use `farness forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
+ - Use `brier forecast-draft --output forecast-pack.json` for stored decisions.
+ - Use `brier forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
- Treat forecast drafts as review artifacts only; do not publish questions or place bets unless the user explicitly asks.
## Working Rules
@@ -47,8 +47,8 @@ Prefer the local `farness` MCP server when it is connected.
## Setup
-If the `farness` MCP server is not connected, add it with:
+If the `brier` MCP server is not connected, add it with:
```bash
-farness setup claude
+brier setup claude
```
diff --git a/skills/farness/SKILL.md b/brier/assets/skills/codex/SKILL.md
similarity index 86%
rename from skills/farness/SKILL.md
rename to brier/assets/skills/codex/SKILL.md
index 935113c..585759b 100644
--- a/skills/farness/SKILL.md
+++ b/brier/assets/skills/codex/SKILL.md
@@ -1,13 +1,13 @@
---
-name: farness
+name: brier
description: Use when the user wants advice or a decision analysis rather than pure implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Reframe the decision as explicit KPIs, expanded options, reference classes, disconfirming evidence, numeric forecasts, and a review date. Do not use for straightforward debugging, factual explanation, or routine coding tasks.
---
-# Farness
+# Brier
Use this skill to turn vague decisions into forecastable choices.
-Prefer the `farness` MCP server when available. It gives you persistent tools, resources, and prompts for the workflow.
+Prefer the `brier` MCP server when available. It gives you persistent tools, resources, and prompts for the workflow.
## Trigger Conditions
@@ -37,7 +37,7 @@ Do not use it for:
## Workflow
1. If there is no stored decision yet, call `create_decision`.
-2. Use `farness://framework` if you need the canonical sequence.
+2. Use `brier://framework` if you need the canonical sequence.
3. Structure the analysis around:
- KPI definition
- KPI resolution metadata
@@ -57,8 +57,8 @@ Do not use it for:
5. If the user is revisiting the decision, use `get_decision` and `review_decision`.
6. If outcomes are now known, call `score_decision` to update calibration.
7. If the user wants to externalize a forecast into a prediction market, draft it first:
- - Use `farness forecast-draft --output forecast-pack.json` for stored decisions.
- - Use `farness forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
+ - Use `brier forecast-draft --output forecast-pack.json` for stored decisions.
+ - Use `brier forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
- Treat forecast drafts as review artifacts only; do not publish questions or place bets unless the user explicitly asks.
## Working Rules
@@ -74,10 +74,10 @@ Do not use it for:
## Fallback
-If the `farness` MCP server is not connected, tell the user to add it with:
+If the `brier` MCP server is not connected, tell the user to add it with:
```bash
-farness setup codex
+brier setup codex
```
Then continue with the same workflow once the server is available.
diff --git a/farness/calibration.py b/brier/calibration.py
similarity index 99%
rename from farness/calibration.py
rename to brier/calibration.py
index ca72ed6..a346f5e 100644
--- a/farness/calibration.py
+++ b/brier/calibration.py
@@ -3,7 +3,7 @@
from dataclasses import dataclass
from typing import Optional
-from farness.framework import Decision, Forecast
+from brier.framework import Decision, Forecast
@dataclass
diff --git a/farness/cli.py b/brier/cli.py
similarity index 94%
rename from farness/cli.py
rename to brier/cli.py
index 415becd..32802ba 100644
--- a/farness/cli.py
+++ b/brier/cli.py
@@ -1,4 +1,4 @@
-"""Command-line interface for farness."""
+"""Command-line interface for brier."""
import argparse
import json
@@ -7,27 +7,27 @@
from datetime import datetime
from pathlib import Path
-from farness import Decision, DecisionStore, CalibrationTracker
-from farness.agent_setup import inspect_agent_setup, remove_agent_setup, repair_agent_setup, setup_agent
-from farness.market import (
+from brier import Decision, DecisionStore, CalibrationTracker
+from brier.agent_setup import inspect_agent_setup, remove_agent_setup, repair_agent_setup, setup_agent
+from brier.market import (
MarketSource,
draft_binary_policy_market,
draft_markets_for_decision,
market_pack_to_dict,
)
-from farness.skills import install_skill
+from brier.skills import install_skill
def main():
parser = argparse.ArgumentParser(
- prog="farness",
+ prog="brier",
description="Forecasting as a harness for decision-making",
)
parser.add_argument(
"--store",
help=(
- "Optional path to the farness JSONL store. Defaults to "
- "$FARNESS_STORE_PATH or ~/.farness/decisions.jsonl."
+ "Optional path to the brier JSONL store. Defaults to "
+ "$BRIER_STORE_PATH or ~/.brier/decisions.jsonl."
),
)
subparsers = parser.add_subparsers(dest="command", help="Commands")
@@ -125,8 +125,8 @@ def main():
"--target",
help=(
"Optional target skill directory. Defaults to "
- "$CODEX_HOME/skills/farness (or ~/.codex/skills/farness) for Codex, "
- "or ~/.claude/skills/farness for Claude."
+ "$CODEX_HOME/skills/brier (or ~/.codex/skills/brier) for Codex, "
+ "or ~/.claude/skills/brier for Claude."
),
)
install_skill_parser.add_argument(
@@ -145,8 +145,8 @@ def main():
"--target",
help=(
"Optional target skill directory. Defaults to "
- "$CODEX_HOME/skills/farness (or ~/.codex/skills/farness) for Codex, "
- "or ~/.claude/skills/farness for Claude."
+ "$CODEX_HOME/skills/brier (or ~/.codex/skills/brier) for Codex, "
+ "or ~/.claude/skills/brier for Claude."
),
)
uninstall_parser.add_argument(
@@ -165,8 +165,8 @@ def main():
"--target",
help=(
"Optional target skill directory. Defaults to "
- "$CODEX_HOME/skills/farness (or ~/.codex/skills/farness) for Codex, "
- "or ~/.claude/skills/farness for Claude."
+ "$CODEX_HOME/skills/brier (or ~/.codex/skills/brier) for Codex, "
+ "or ~/.claude/skills/brier for Claude."
),
)
setup_parser.add_argument(
@@ -192,8 +192,8 @@ def main():
"--target",
help=(
"Optional target skill directory. Defaults to "
- "$CODEX_HOME/skills/farness (or ~/.codex/skills/farness) for Codex, "
- "or ~/.claude/skills/farness for Claude."
+ "$CODEX_HOME/skills/brier (or ~/.codex/skills/brier) for Codex, "
+ "or ~/.claude/skills/brier for Claude."
),
)
doctor_parser.add_argument(
@@ -317,14 +317,14 @@ def main():
print("Recommended next step:")
if result.skill_state == "missing" and not result.mcp_configured and result.cli_path:
- print(f" farness setup {args.agent}")
+ print(f" brier setup {args.agent}")
elif result.skill_state == "missing":
- print(f" farness install-skill {args.agent}")
+ print(f" brier install-skill {args.agent}")
if result.cli_path is None:
print(f" Then install the {args.agent} CLI and run:")
print(f" {result.manual_command}")
elif result.skill_state == "modified":
- print(f" farness doctor {args.agent} --fix")
+ print(f" brier doctor {args.agent} --fix")
elif result.cli_path is None:
print(f" Install the {args.agent} CLI and run:")
print(f" {result.manual_command}")
@@ -332,7 +332,7 @@ def main():
print(f" {result.manual_command}")
return
- store_path = args.store or os.environ.get("FARNESS_STORE_PATH")
+ store_path = args.store or os.environ.get("BRIER_STORE_PATH")
store = DecisionStore(Path(store_path).expanduser()) if store_path else DecisionStore()
if args.command == "list":
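Reviewer-side illustration (not part of the changed files): the store-path precedence after the rename can be sketched as a standalone function. `resolve_store_path` and `DEFAULT_STORE` are hypothetical names. The sketch assumes that `DecisionStore()` with no arguments uses the `~/.brier/decisions.jsonl` default named in the `--store` help text.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default named in the --store help text; assumed to match DecisionStore().
DEFAULT_STORE = Path("~/.brier/decisions.jsonl")


def resolve_store_path(
    cli_value: Optional[str], env: Mapping[str, str] = os.environ
) -> Path:
    """Pick the store path: --store flag, then $BRIER_STORE_PATH, then the default."""
    chosen = cli_value or env.get("BRIER_STORE_PATH")
    return Path(chosen).expanduser() if chosen else DEFAULT_STORE.expanduser()
```

Passing the environment as a parameter keeps the precedence easy to check without mutating `os.environ`.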
diff --git a/farness/experiments/DECISION_USEFULNESS_STATUS.md b/brier/experiments/DECISION_USEFULNESS_STATUS.md
similarity index 65%
rename from farness/experiments/DECISION_USEFULNESS_STATUS.md
rename to brier/experiments/DECISION_USEFULNESS_STATUS.md
index 4891351..f721d0d 100644
--- a/farness/experiments/DECISION_USEFULNESS_STATUS.md
+++ b/brier/experiments/DECISION_USEFULNESS_STATUS.md
@@ -4,11 +4,11 @@ Last updated: 2026-04-15
## Why this exists
-The original stability-under-probing paper showed that `farness` front-loads framework-aligned considerations, but the held-out probe validation weakened the broad "better reasoning" claim. The current follow-up asks a different question:
+The original stability-under-probing paper showed that `brier` front-loads framework-aligned considerations, but the held-out probe validation weakened the broad "better reasoning" claim. The current follow-up asks a different question:
> Does forcing an LLM from qualitative vibes into explicit forecasts and tradeoffs produce more useful recommendations?
-The main methodological risk is rewarding `farness` by construction. The current design therefore separates final recommendation quality from framework-compliance diagnostics.
+The main methodological risk is rewarding `brier` by construction. The current design therefore separates final recommendation quality from framework-compliance diagnostics.
## Current evaluation design
@@ -17,7 +17,7 @@ Generator conditions:
- `naive`: ordinary helpful recommendation.
- `format_control`: structured qualitative headings, no required forecasts.
- `forecast_only`: explicit KPIs, numeric forecasts, intervals, assumptions, and recommendation.
-- `farness`: full framework with KPIs, option expansion, forecasts, outside view, disconfirming evidence, mechanism, recommendation, and review date.
+- `brier`: full framework with KPIs, option expansion, forecasts, outside view, disconfirming evidence, mechanism, recommendation, and review date.
Representations:
@@ -33,11 +33,11 @@ Judge tasks:
Primary endpoint:
-- Pairwise win rate for `farness` vs `forecast_only` on `decision_memo` utility.
+- Pairwise win rate for `brier` vs `forecast_only` on `decision_memo` utility.
Key secondary endpoint:
-- `decision_memo` critique survival, especially `farness` vs `forecast_only`.
+- `decision_memo` critique survival, especially `brier` vs `forecast_only`.
Diagnostic endpoint:
@@ -71,30 +71,30 @@ The old aligned/normalized pilot was too favorable to structured outputs:
The memo-primary rerun was much less one-sided:
- Claude-generated outputs, GPT judge:
- - `farness` vs `forecast_only`: `forecast_only` won `6-4`.
- - `farness` vs `naive`: `farness` won `6-4`.
+ - `brier` vs `forecast_only`: `forecast_only` won `6-4`.
+ - `brier` vs `naive`: `brier` won `6-4`.
- `forecast_only` vs `naive`: `5-5`.
- `format_control` vs `naive`: `naive` won `6-4`.
- GPT-5.4-generated outputs, Claude judge:
- - `farness` vs `forecast_only`: `farness` won `7-2-1`, but with low mean confidence.
- - `farness` vs `naive`: `farness` won `6-4`.
+ - `brier` vs `forecast_only`: `brier` won `7-2-1`, but with low mean confidence.
+ - `brier` vs `naive`: `brier` won `6-4`.
- `forecast_only` vs `naive`: `naive` won `7-3`.
- `format_control` vs `naive`: `naive` won `6-4`.
Critique-survival backfill on `decision_memo` only:
- GPT-5.4-generated outputs, Claude judge:
- - `farness` vs `forecast_only`: `farness` was less undermined `8-1-1`.
- - `farness` vs `naive`: `naive` was less undermined `7-3`.
+ - `brier` vs `forecast_only`: `brier` was less undermined `8-1-1`.
+ - `brier` vs `naive`: `naive` was less undermined `7-3`.
- Claude-generated outputs, GPT judge:
- - `farness` vs `forecast_only`: `farness` was less undermined `6-4`.
- - `farness` vs `naive`: tied `5-5`.
+ - `brier` vs `forecast_only`: `brier` was less undermined `6-4`.
+ - `brier` vs `naive`: tied `5-5`.
Current interpretation:
- The cleaner `decision_memo` endpoint sharply weakens the broad "structure helps" story.
-- There is weak-to-mixed evidence that `farness` improves concise final recommendations over `naive`.
-- There is more consistent pilot evidence that `farness` is more robust than `forecast_only` under held-out critique lenses.
+- There is weak-to-mixed evidence that `brier` improves concise final recommendations over `naive`.
+- There is more consistent pilot evidence that `brier` is more robust than `forecast_only` under held-out critique lenses.
- The full framework may add robustness beyond explicit forecasts, but the pilot does not show broad dominance over naive recommendations.
- `normalized` results should not be used as primary evidence for recommendation quality.
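The pilot tallies above are written as `wins-losses` or `wins-losses-ties`. A minimal sketch of turning such a tally into a win rate follows. `Tally`, `parse_tally`, and `win_rate` are hypothetical helpers, not names from the repository, and reading the third number as a tie count is an assumption. Counting a tie as half a win is one convention among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tally:
    """Pairwise judging record for one comparison (hypothetical helper)."""

    wins: int
    losses: int
    ties: int = 0  # assumption: a third number in the tally counts ties

    @property
    def total(self) -> int:
        return self.wins + self.losses + self.ties


def parse_tally(text: str) -> Tally:
    """Parse 'W-L' or 'W-L-T' into a Tally."""
    parts = [int(part) for part in text.split("-")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected W-L or W-L-T, got {text!r}")
    return Tally(*parts)


def win_rate(tally: Tally, tie_weight: float = 0.5) -> float:
    """Share of comparisons won, with each tie worth `tie_weight` of a win."""
    if tally.total == 0:
        raise ValueError("empty tally")
    return (tally.wins + tie_weight * tally.ties) / tally.total


if __name__ == "__main__":
    for label, record in [("brier vs forecast_only", "7-2-1"), ("brier vs naive", "6-4")]:
        print(f"{label}: {win_rate(parse_tally(record)):.2f}")
```

With ten comparisons per pair, a single flipped verdict moves the win rate by ten percentage points, which is consistent with treating these results as pilot evidence only.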
@@ -111,9 +111,9 @@ Recent local commits relevant to this evaluation:
Useful commands:
```bash
-./.venv/bin/python -m farness.experiments decision-usefulness --list
-./.venv/bin/python -m farness.experiments decision-usefulness --output-dir experiments/decision_usefulness/pilot_memo_primary/gpt-5.4 --judge-only --representations decision_memo raw normalized
-./.venv/bin/python -m farness.experiments decision-usefulness --output-dir experiments/decision_usefulness/pilot_critique_survival/gpt-5.4 --judge-only --representations decision_memo --judge-tasks critique_survival
+./.venv/bin/python -m brier.experiments decision-usefulness --list
+./.venv/bin/python -m brier.experiments decision-usefulness --output-dir experiments/decision_usefulness/pilot_memo_primary/gpt-5.4 --judge-only --representations decision_memo raw normalized
+./.venv/bin/python -m brier.experiments decision-usefulness --output-dir experiments/decision_usefulness/pilot_critique_survival/gpt-5.4 --judge-only --representations decision_memo --judge-tasks critique_survival
```
## Recommended next step
@@ -125,5 +125,5 @@ Do not run a full study yet. First improve the pilot protocol in two ways:
If the next pilot repeats the current pattern, the clean claim is:
-> `farness` does not obviously dominate naive recommendations in concise memo form, but it may add robustness beyond forecast-only prompting when recommendations are tested against held-out critiques.
+> `brier` does not obviously dominate naive recommendations in concise memo form, but it may add robustness beyond forecast-only prompting when recommendations are tested against held-out critiques.
diff --git a/farness/experiments/LLM_JUDGE_EVALUATION_PLAN.md b/brier/experiments/LLM_JUDGE_EVALUATION_PLAN.md
similarity index 89%
rename from farness/experiments/LLM_JUDGE_EVALUATION_PLAN.md
rename to brier/experiments/LLM_JUDGE_EVALUATION_PLAN.md
index 87c9a6a..47e463f 100644
--- a/farness/experiments/LLM_JUDGE_EVALUATION_PLAN.md
+++ b/brier/experiments/LLM_JUDGE_EVALUATION_PLAN.md
@@ -1,11 +1,11 @@
-# LLM-Judge Evaluation Plan for Farness Decision Usefulness
+# LLM-Judge Evaluation Plan for Brier Decision Usefulness
**Date:** 2026-04-06
**Status:** Proposed follow-up study
## Why this study exists
-The current `stability-under-probing` work measures whether a prompt front-loads considerations that later probes ask about. That is a real process measure, but it is not the same as the practical question that motivates `farness`:
+The current `stability-under-probing` work measures whether a prompt front-loads considerations that later probes ask about. That is a real process measure, but it is not the same as the practical question that motivates `brier`:
> does forcing a model to go from qualitative vibes to explicit numeric forecasts produce more useful decision analyses?
@@ -15,19 +15,19 @@ This study is meant to evaluate that narrower and more operational claim.
This design can support claims like:
-- held-out LLM judges find `farness` outputs more decision-useful than naive outputs
+- held-out LLM judges find `brier` outputs more decision-useful than naive outputs
- forcing explicit forecasts improves the decision artifact even when real-world outcomes are unresolved
-- some or all of the `farness` effect comes from quantified forecasting rather than from formatting alone
+- some or all of the `brier` effect comes from quantified forecasting rather than from formatting alone
This design cannot by itself support claims like:
-- humans make better final decisions with `farness`
-- `farness` improves real-world outcomes
-- `farness` forecasts are more accurate on unresolved decisions
+- humans make better final decisions with `brier`
+- `brier` improves real-world outcomes
+- `brier` forecasts are more accurate on unresolved decisions
## Research question
-Do `farness` analyses look more decision-useful than naive or partially structured alternatives when judged by held-out LLMs that do not know which prompt produced which output?
+Do `brier` analyses look more decision-useful than naive or partially structured alternatives when judged by held-out LLMs that do not know which prompt produced which output?
## Core design
@@ -55,13 +55,13 @@ Primary conditions:
1. `naive`
2. `format_control`
3. `forecast_only`
-4. `farness`
+4. `brier`
This decomposition is intentional:
- `format_control` isolates whether legible structure alone helps
- `forecast_only` isolates whether forcing explicit numbers does most of the work
-- `farness` tests the full framework
+- `brier` tests the full framework
`CoT` is omitted from the primary design. It is already weak in the current paper and does not isolate the mechanism you care about here.
@@ -116,10 +116,10 @@ Do the following:
Do not explicitly cite cognitive biases, base rates, disconfirming evidence, or review dates unless they are strictly necessary to support the forecast.
```
-### `farness`
+### `brier`
```text
-You are a decision analyst using the farness framework.
+You are a decision analyst using the brier framework.
A user needs help with this decision:
@@ -146,7 +146,7 @@ Before judging:
- remove condition labels
- remove model names
-- redact explicit mentions of `farness` in the body if they appear
+- redact explicit mentions of `brier` in the body if they appear
- randomize left/right order in pairwise comparisons
### Decision memo representation
@@ -173,7 +173,7 @@ Quantitative support:
{up to 1-2 decisive quantitative claims if supported}
```
-This is the main safeguard against rewarding `farness` by construction. The memo keeps the recommendation, rationale, caveat, and quantitative mechanism visible, but removes the framework-shaped checklist.
+This is the main safeguard against rewarding `brier` by construction. The memo keeps the recommendation, rationale, caveat, and quantitative mechanism visible, but removes the framework-shaped checklist.
### Canonical normalized representation
@@ -311,7 +311,7 @@ Return JSON only:
### Task 3: critique survival
-Critique survival stress-tests whether a recommendation is less undermined by held-out concerns that are not tied to the `farness` checklist.
+Critique survival stress-tests whether a recommendation is less undermined by held-out concerns that are not tied to the `brier` checklist.
Judge prompt:
@@ -345,22 +345,22 @@ Return JSON only:
Primary pairwise comparisons:
-1. `farness` vs `naive`
-2. `farness` vs `forecast_only`
+1. `brier` vs `naive`
+2. `brier` vs `forecast_only`
3. `forecast_only` vs `naive`
4. `format_control` vs `naive`
-The critical comparison is **`farness` vs `forecast_only`**.
+The critical comparison is **`brier` vs `forecast_only`**.
That is the cleanest test of your current intuition:
-> is the main gain simply forcing explicit numeric forecasts, or does the full `farness` checklist add something beyond quantified forecasting?
+> is the main gain simply forcing explicit numeric forecasts, or does the full `brier` checklist add something beyond quantified forecasting?
## Primary endpoint
Primary endpoint:
-- **pairwise win rate for `farness` vs `forecast_only` on the `decision_memo` representation**
+- **pairwise win rate for `brier` vs `forecast_only` on the `decision_memo` representation**
Reason:
@@ -371,7 +371,7 @@ Reason:
## Secondary endpoints
-- `farness` vs `naive` win rate on `decision_memo`
+- `brier` vs `naive` win rate on `decision_memo`
- raw blinded pairwise win rates for all primary comparisons
- critique-survival win rates under held-out critique lenses
- normalized aligned-rubric win rates as a manipulation check
@@ -445,7 +445,7 @@ Critique survival is the robustness endpoint. It should not be treated as a dire
## Interpretation logic
-### If `farness` beats `forecast_only` on `decision_memo` and critique survival
+### If `brier` beats `forecast_only` on `decision_memo` and critique survival
Interpretation:
@@ -465,21 +465,21 @@ Interpretation:
- some of the gain comes from output organization and comparability, not just better reasoning content
-### If `farness` wins only on raw or normalized judging
+### If `brier` wins only on raw or normalized judging
Interpretation:
- judges may mainly prefer visible structure, polish, or framework-shaped artifacts
- this is weak evidence for recommendation-quality improvement
-### If `farness` wins on full artifacts but not `decision_memo`
+### If `brier` wins on full artifacts but not `decision_memo`
Interpretation:
- the product surface may be more useful or auditable, but the final recommendation is not clearly better
- this supports a decision-artifact claim more than a recommendation-quality claim
-### If omission rates remain high under `farness`
+### If omission rates remain high under `brier`
Interpretation:
diff --git a/farness/experiments/PREREGISTRATION.md b/brier/experiments/PREREGISTRATION.md
similarity index 82%
rename from farness/experiments/PREREGISTRATION.md
rename to brier/experiments/PREREGISTRATION.md
index 1fd3d5e..d8cf738 100644
--- a/farness/experiments/PREREGISTRATION.md
+++ b/brier/experiments/PREREGISTRATION.md
@@ -1,4 +1,4 @@
-# Preregistration: Farness Framework Effectiveness Experiment
+# Preregistration: Brier Framework Effectiveness Experiment
**Date:** 2024-12-19
**Authors:** Max Ghenis
@@ -12,28 +12,28 @@ Does prompting an LLM with a structured decision framework ("farness") improve t
### Primary Hypotheses
-**H1:** Farness-framed prompts will produce higher correct recommendation rates than naive prompts.
+**H1:** Brier-framed prompts will produce higher correct recommendation rates than naive prompts.
- *Operationalization:* Binary match to research-backed answer
-- *Expected direction:* Farness > Naive
+- *Expected direction:* Brier > Naive
- *Minimum detectable effect:* 20 percentage points (e.g., 70% vs 50%)
-**H2:** Farness-framed prompts will cite base rates more frequently.
+**H2:** Brier-framed prompts will cite base rates more frequently.
- *Operationalization:* Binary - does response mention any relevant base rate statistic?
-- *Expected direction:* Farness > Naive
+- *Expected direction:* Brier > Naive
-**H3:** Farness-framed prompts will identify more cognitive biases.
+**H3:** Brier-framed prompts will identify more cognitive biases.
- *Operationalization:* Count of biases named that match ground truth list
-- *Expected direction:* Farness > Naive
+- *Expected direction:* Brier > Naive
### Secondary Hypotheses
-**H4:** Farness-framed prompts will include confidence intervals or uncertainty quantification.
+**H4:** Brier-framed prompts will include confidence intervals or uncertainty quantification.
- *Operationalization:* Binary - contains numeric ranges or probability estimates
-**H5:** Farness-framed prompts will produce more actionable accountability mechanisms.
+**H5:** Brier-framed prompts will produce more actionable accountability mechanisms.
- *Operationalization:* Binary - suggests review date, tracking, or feedback loop
-**H6:** Farness-framed prompts will be more resistant to framing effects in the scenario.
+**H6:** Brier-framed prompts will be more resistant to framing effects in the scenario.
- *Operationalization:* Qualitative - does response identify misleading framing?
## Methods
@@ -64,9 +64,9 @@ You are a helpful assistant. Answer this question directly:
Give your recommendation and reasoning.
```
-**Condition B (Farness):**
+**Condition B (Brier):**
```
-You are a decision analyst using the "farness" framework. This framework requires you to:
+You are a decision analyst using the "Brier" framework. This framework requires you to:
1. Define explicit, measurable KPIs for the decision
2. Make numeric forecasts with confidence intervals for each option
@@ -110,7 +110,7 @@ Each response scored on:
### Primary Analysis
-For each metric, compare Farness vs Naive using:
+For each metric, compare Brier vs Naive using:
- **Binary outcomes (H1, H2, H4, H5):** Two-proportion z-test or Fisher's exact test
- **Count outcomes (H3):** Mann-Whitney U test (non-parametric)
@@ -128,7 +128,7 @@ Report:
### Secondary Analyses
1. **Per-case breakdown:** Which cases show largest effect?
-2. **Correlation:** Do cases where naive fails show larger farness benefit?
+2. **Correlation:** Do cases where naive fails show larger brier benefit?
3. **Qualitative:** Example responses showing mechanism of improvement
### Multiple Comparisons
diff --git a/brier/experiments/__init__.py b/brier/experiments/__init__.py
new file mode 100644
index 0000000..172b97c
--- /dev/null
+++ b/brier/experiments/__init__.py
@@ -0,0 +1 @@
+"""Experiments for measuring brier framework effectiveness."""
diff --git a/farness/experiments/__main__.py b/brier/experiments/__main__.py
similarity index 95%
rename from farness/experiments/__main__.py
rename to brier/experiments/__main__.py
index d9ffef2..1b961d0 100644
--- a/farness/experiments/__main__.py
+++ b/brier/experiments/__main__.py
@@ -4,24 +4,24 @@
import json
from pathlib import Path
-from farness.experiments.cases import get_all_cases, get_case
-from farness.experiments.runner import (
+from brier.experiments.cases import get_all_cases, get_case
+from brier.experiments.runner import (
generate_prompts_for_manual_run,
run_experiment,
score_runs,
)
-from farness.experiments.analyze import analyze_experiment, print_results_table, load_scores
-from farness.experiments.stability import (
+from brier.experiments.analyze import analyze_experiment, print_results_table, load_scores
+from brier.experiments.stability import (
get_all_stability_cases,
get_primary_stability_cases,
get_stability_case,
)
-from farness.experiments.stability_runner import (
+from brier.experiments.stability_runner import (
run_stability_experiment,
print_experiment_summary,
)
-from farness.experiments.llm import model_short_name
-from farness.experiments.decision_usefulness import (
+from brier.experiments.llm import model_short_name
+from brier.experiments.decision_usefulness import (
DECISION_USEFULNESS_CONDITIONS,
JUDGE_TASKS,
REPRESENTATIONS,
@@ -51,7 +51,7 @@ def _add_model_args(parser: argparse.ArgumentParser) -> None:
nargs="+",
choices=ALL_CONDITIONS,
default=None,
- help="Conditions to run (default: naive farness)",
+ help="Conditions to run (default: naive brier)",
)
parser.add_argument(
"--probe-batteries",
@@ -65,7 +65,7 @@ def _add_model_args(parser: argparse.ArgumentParser) -> None:
def main():
parser = argparse.ArgumentParser(
- description="Run the farness framework effectiveness experiment"
+ description="Run the brier framework effectiveness experiment"
)
subparsers = parser.add_subparsers(dest="command", help="Command to run")
@@ -165,7 +165,7 @@ def main():
stability_parser.add_argument(
"--strongest-validation",
action="store_true",
- help="Run the strongest reviewer-facing validation preset (primary scenarios, on/off-framework probes, naive + estimate-only + format-control + farness)",
+ help="Run the strongest reviewer-facing validation preset (primary scenarios, on/off-framework probes, naive + estimate-only + format-control + brier)",
)
_add_model_args(stability_parser)
@@ -410,7 +410,7 @@ def main():
print(f"Running stability experiment: {len(cases)} cases, {args.runs} runs/condition")
print(f" Model: {model}")
- print(f" Conditions: {conditions or ['naive', 'farness']}")
+ print(f" Conditions: {conditions or ['naive', 'brier']}")
print(f" Probe batteries: {probe_batteries or ['on_framework']}")
print(f" Output: {output_dir}")
print(f" Starting at run {args.start_run}")
@@ -431,7 +431,7 @@ def main():
print(f"\nResults saved to {output_dir}")
elif args.command == "reframing":
- from farness.experiments.reframing import (
+ from brier.experiments.reframing import (
REFRAMING_CASES,
run_reframing_experiment,
analyze_reframing,
@@ -444,7 +444,7 @@ def main():
print(f"Running reframing experiment: {len(REFRAMING_CASES)} cases, {args.runs} runs/condition")
print(f" Model: {model}")
- print(f" Conditions: {conditions or ['naive', 'farness']}")
+ print(f" Conditions: {conditions or ['naive', 'brier']}")
print(f" Output: {output_dir}")
results = run_reframing_experiment(
@@ -467,7 +467,7 @@ def main():
_reanalyze(args)
elif args.command == "judge":
- from farness.experiments.judge import run_judge_evaluation
+ from brier.experiments.judge import run_judge_evaluation
run_judge_evaluation(
reframing_dir=args.reframing_dir,
stability_dir=args.stability_dir,
@@ -540,8 +540,8 @@ def main():
def _reanalyze(args):
"""Reanalyze results from saved JSON files, discovering model subdirectories."""
- from farness.experiments.stability import StabilityResult, StabilityExperiment
- from farness.experiments.reframing import ReframingResult, analyze_reframing, summary_table
+ from brier.experiments.stability import StabilityResult, StabilityExperiment
+ from brier.experiments.reframing import ReframingResult, analyze_reframing, summary_table
stability_base = Path(args.stability_dir)
reframing_base = Path(args.reframing_dir)
diff --git a/farness/experiments/analyze.py b/brier/experiments/analyze.py
similarity index 79%
rename from farness/experiments/analyze.py
rename to brier/experiments/analyze.py
index 68da70d..df96cca 100644
--- a/farness/experiments/analyze.py
+++ b/brier/experiments/analyze.py
@@ -7,7 +7,7 @@
from pathlib import Path
from typing import Callable, Optional
-from farness.experiments.scorer import ResponseScore, aggregate_scores
+from brier.experiments.scorer import ResponseScore, aggregate_scores
@dataclass
@@ -16,7 +16,7 @@ class StatisticalTest:
metric: str
naive_value: float
- farness_value: float
+ brier_value: float
difference: float
p_value: Optional[float]
significant: bool
@@ -133,26 +133,26 @@ def analyze_experiment(
Analysis results dict
"""
naive = [s for s in scores if s.condition == "naive"]
- farness = [s for s in scores if s.condition == "farness"]
+ brier = [s for s in scores if s.condition == "farness"]
- n_naive, n_farness = len(naive), len(farness)
+ n_naive, n_brier = len(naive), len(brier)
- if n_naive == 0 or n_farness == 0:
+ if n_naive == 0 or n_brier == 0:
return {"error": "Need both conditions to analyze"}
tests = []
# H1: Correct recommendation (if we have labels)
naive_correct = [s for s in naive if s.correct_recommendation is not None]
- farness_correct = [s for s in farness if s.correct_recommendation is not None]
- if naive_correct and farness_correct:
+ brier_correct = [s for s in brier if s.correct_recommendation is not None]
+ if naive_correct and brier_correct:
p1 = sum(s.correct_recommendation for s in naive_correct) / len(naive_correct)
- p2 = sum(s.correct_recommendation for s in farness_correct) / len(farness_correct)
- p_val = proportion_z_test(len(naive_correct), p1, len(farness_correct), p2)
+ p2 = sum(s.correct_recommendation for s in brier_correct) / len(brier_correct)
+ p_val = proportion_z_test(len(naive_correct), p1, len(brier_correct), p2)
tests.append(StatisticalTest(
metric="correct_recommendation",
naive_value=p1,
- farness_value=p2,
+ brier_value=p2,
difference=p2 - p1,
p_value=p_val,
significant=p_val < alpha,
@@ -161,13 +161,13 @@ def analyze_experiment(
# H2: Base rate citation
p1 = sum(s.cites_base_rate for s in naive) / n_naive
- p2 = sum(s.cites_base_rate for s in farness) / n_farness
- p_val = proportion_z_test(n_naive, p1, n_farness, p2)
+ p2 = sum(s.cites_base_rate for s in brier) / n_brier
+ p_val = proportion_z_test(n_naive, p1, n_brier, p2)
secondary_alpha = alpha / 5 if bonferroni_correct else alpha
tests.append(StatisticalTest(
metric="cites_base_rate",
naive_value=p1,
- farness_value=p2,
+ brier_value=p2,
difference=p2 - p1,
p_value=p_val,
significant=p_val < secondary_alpha,
@@ -176,13 +176,13 @@ def analyze_experiment(
# H3: Bias count (Mann-Whitney)
bias_naive = [s.bias_count for s in naive]
- bias_farness = [s.bias_count for s in farness]
- p_val = mann_whitney_u(bias_naive, bias_farness)
+ bias_brier = [s.bias_count for s in brier]
+ p_val = mann_whitney_u(bias_naive, bias_brier)
tests.append(StatisticalTest(
metric="bias_count",
naive_value=sum(bias_naive) / n_naive,
- farness_value=sum(bias_farness) / n_farness,
- difference=sum(bias_farness) / n_farness - sum(bias_naive) / n_naive,
+ brier_value=sum(bias_brier) / n_brier,
+ difference=sum(bias_brier) / n_brier - sum(bias_naive) / n_naive,
p_value=p_val,
significant=p_val < secondary_alpha,
test_name="Mann-Whitney U",
@@ -190,12 +190,12 @@ def analyze_experiment(
# H4: Confidence intervals
p1 = sum(s.has_confidence_interval for s in naive) / n_naive
- p2 = sum(s.has_confidence_interval for s in farness) / n_farness
- p_val = proportion_z_test(n_naive, p1, n_farness, p2)
+ p2 = sum(s.has_confidence_interval for s in brier) / n_brier
+ p_val = proportion_z_test(n_naive, p1, n_brier, p2)
tests.append(StatisticalTest(
metric="has_confidence_interval",
naive_value=p1,
- farness_value=p2,
+ brier_value=p2,
difference=p2 - p1,
p_value=p_val,
significant=p_val < secondary_alpha,
@@ -204,12 +204,12 @@ def analyze_experiment(
# H5: Accountability
p1 = sum(s.has_accountability for s in naive) / n_naive
- p2 = sum(s.has_accountability for s in farness) / n_farness
- p_val = proportion_z_test(n_naive, p1, n_farness, p2)
+ p2 = sum(s.has_accountability for s in brier) / n_brier
+ p_val = proportion_z_test(n_naive, p1, n_brier, p2)
tests.append(StatisticalTest(
metric="has_accountability",
naive_value=p1,
- farness_value=p2,
+ brier_value=p2,
difference=p2 - p1,
p_value=p_val,
significant=p_val < secondary_alpha,
@@ -218,12 +218,12 @@ def analyze_experiment(
# H6: Quantified tradeoffs
p1 = sum(s.quantifies_tradeoffs for s in naive) / n_naive
- p2 = sum(s.quantifies_tradeoffs for s in farness) / n_farness
- p_val = proportion_z_test(n_naive, p1, n_farness, p2)
+ p2 = sum(s.quantifies_tradeoffs for s in brier) / n_brier
+ p_val = proportion_z_test(n_naive, p1, n_brier, p2)
tests.append(StatisticalTest(
metric="quantifies_tradeoffs",
naive_value=p1,
- farness_value=p2,
+ brier_value=p2,
difference=p2 - p1,
p_value=p_val,
significant=p_val < secondary_alpha,
@@ -232,14 +232,14 @@ def analyze_experiment(
return {
"n_naive": n_naive,
- "n_farness": n_farness,
+ "n_brier": n_brier,
"alpha": alpha,
"bonferroni_corrected": bonferroni_correct,
"tests": [
{
"metric": t.metric,
"naive": round(t.naive_value, 3),
- "farness": round(t.farness_value, 3),
+ "brier": round(t.brier_value, 3),
"difference": round(t.difference, 3),
"p_value": round(t.p_value, 4) if t.p_value else None,
"significant": t.significant,
@@ -256,14 +256,14 @@ def _generate_summary(tests: list[StatisticalTest]) -> str:
sig_tests = [t for t in tests if t.significant and t.difference > 0]
if not sig_tests:
- return "No significant differences found favoring the farness framework."
+ return "No significant differences found favoring the brier framework."
- lines = ["Significant improvements with farness framework:"]
+ lines = ["Significant improvements with brier framework:"]
for t in sig_tests:
pct_diff = t.difference * 100
lines.append(
f" - {t.metric}: +{pct_diff:.1f} percentage points "
- f"({t.naive_value*100:.0f}% -> {t.farness_value*100:.0f}%, p={t.p_value:.3f})"
+ f"({t.naive_value*100:.0f}% -> {t.brier_value*100:.0f}%, p={t.p_value:.3f})"
)
return "\n".join(lines)
@@ -272,19 +272,19 @@ def _generate_summary(tests: list[StatisticalTest]) -> str:
def print_results_table(analysis: dict) -> None:
"""Print a formatted results table."""
print("\n" + "=" * 70)
- print("FARNESS FRAMEWORK EXPERIMENT RESULTS")
+ print("BRIER FRAMEWORK EXPERIMENT RESULTS")
print("=" * 70)
- print(f"N (naive): {analysis['n_naive']}, N (farness): {analysis['n_farness']}")
+ print(f"N (naive): {analysis['n_naive']}, N (brier): {analysis['n_brier']}")
print(f"Alpha: {analysis['alpha']}, Bonferroni: {analysis['bonferroni_corrected']}")
print("-" * 70)
- print(f"{'Metric':<25} {'Naive':>8} {'Farness':>8} {'Diff':>8} {'p-value':>10} {'Sig':>5}")
+ print(f"{'Metric':<25} {'Naive':>8} {'Brier':>8} {'Diff':>8} {'p-value':>10} {'Sig':>5}")
print("-" * 70)
for t in analysis["tests"]:
sig_marker = "*" if t["significant"] else ""
p_str = f"{t['p_value']:.4f}" if t["p_value"] else "N/A"
print(
- f"{t['metric']:<25} {t['naive']:>8.1%} {t['farness']:>8.1%} "
+ f"{t['metric']:<25} {t['naive']:>8.1%} {t['brier']:>8.1%} "
f"{t['difference']:>+8.1%} {p_str:>10} {sig_marker:>5}"
)
@@ -320,7 +320,7 @@ def load_scores(scores_file: Path) -> list[ResponseScore]:
import sys
if len(sys.argv) < 2:
- print("Usage: python -m farness.experiments.analyze ")
+ print("Usage: python -m brier.experiments.analyze ")
sys.exit(1)
scores_file = Path(sys.argv[1])
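For readers checking the binary-outcome comparisons in `analyze_experiment`, here is a minimal sketch of a pooled two-proportion z-test that takes the same `(n1, p1, n2, p2)` argument order as the `proportion_z_test` calls in this diff. The function name `two_proportion_p_value` is hypothetical, and the repository's own implementation is not shown here and may differ (for example in continuity correction or edge-case handling).

```python
import math


def two_proportion_p_value(n1: int, p1: float, n2: int, p2: float) -> float:
    """Two-sided p-value for H0: both groups share one underlying proportion."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both sample sizes must be positive")
    # Pool the successes from both groups to estimate the shared proportion.
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    variance = max(0.0, pooled * (1 - pooled)) * (1 / n1 + 1 / n2)
    std_err = math.sqrt(variance)
    if std_err == 0:
        # Every observation is identical, so there is no evidence of a difference.
        return 1.0
    z = (p2 - p1) / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    alpha = 0.05
    secondary_alpha = alpha / 5  # same Bonferroni divisor as in the diff
    p_val = two_proportion_p_value(100, 0.5, 100, 0.7)
    print(f"p={p_val:.4f}, significant={p_val < secondary_alpha}")
```

The normal approximation is weak at small sample sizes; the preregistration in this same diff names Fisher's exact test as the alternative for binary outcomes.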
diff --git a/farness/experiments/cases.py b/brier/experiments/cases.py
similarity index 100%
rename from farness/experiments/cases.py
rename to brier/experiments/cases.py
diff --git a/farness/experiments/decision_usefulness.py b/brier/experiments/decision_usefulness.py
similarity index 99%
rename from farness/experiments/decision_usefulness.py
rename to brier/experiments/decision_usefulness.py
index 9409e94..4f967f7 100644
--- a/farness/experiments/decision_usefulness.py
+++ b/brier/experiments/decision_usefulness.py
@@ -7,7 +7,7 @@
- naive
- format_control
- forecast_only
-- farness
+- brier
Each generated analysis is evaluated in three representations:
- decision_memo: a neutral fixed-envelope summary for recommendation quality
@@ -25,7 +25,7 @@
from pathlib import Path
from typing import Any, Optional
-from farness.experiments.llm import call_llm, _is_openai_model, model_short_name
+from brier.experiments.llm import call_llm, _is_openai_model, model_short_name
DECISION_USEFULNESS_CONDITIONS = [
@@ -214,7 +214,7 @@ class DecisionUsefulnessCase:
5. Briefly state the main assumptions behind the forecast.
Do not explicitly cite cognitive biases, base rates, disconfirming evidence, or review dates unless they are strictly necessary to support the forecast.""",
- "farness": """You are a decision analyst using the farness framework.
+ "farness": """You are a decision analyst using the brier framework.
A user needs help with this decision:
@@ -713,7 +713,7 @@ def _clean_freeform_for_memo(text: str) -> str:
def _redact_framework_names(text: str) -> str:
"""Remove explicit framework references from judged text."""
- redacted = re.sub(r"\bfarness\b", "[framework]", text, flags=re.IGNORECASE)
+ redacted = re.sub(r"\bbrier\b", "[framework]", text, flags=re.IGNORECASE)
redacted = re.sub(r"\bforecasting as a harness\b", "[framework]", redacted, flags=re.IGNORECASE)
return redacted
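A short standalone sketch of how the renamed redaction pattern behaves. The function body mirrors the two `re.sub` calls in the hunk above; the sample strings are made up for illustration. One consequence worth noting: because the match is case-insensitive on the whole word, any mention of a "Brier score" in judged text is redacted as well.

```python
import re


def redact_framework_names(text: str) -> str:
    """Replace explicit framework references with a neutral placeholder."""
    redacted = re.sub(r"\bbrier\b", "[framework]", text, flags=re.IGNORECASE)
    redacted = re.sub(
        r"\bforecasting as a harness\b", "[framework]", redacted, flags=re.IGNORECASE
    )
    return redacted


if __name__ == "__main__":
    samples = [
        "Using the brier framework, I recommend option B.",  # intended redaction
        "Track the Brier score at the review date.",  # also redacted
        "Forecasting as a harness suggests a 60% chance.",
    ]
    for sample in samples:
        print(redact_framework_names(sample))
```

If forecasting-metric mentions should survive blinding, a narrower pattern would be needed; whether that matters depends on how often generated analyses cite the metric by name.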
diff --git a/farness/experiments/judge.py b/brier/experiments/judge.py
similarity index 98%
rename from farness/experiments/judge.py
rename to brier/experiments/judge.py
index e663006..795bf18 100644
--- a/farness/experiments/judge.py
+++ b/brier/experiments/judge.py
@@ -12,7 +12,7 @@
from pathlib import Path
from typing import Optional
-from farness.experiments.llm import call_llm, _is_openai_model
+from brier.experiments.llm import call_llm, _is_openai_model
REFRAMING_JUDGE_PROMPT = """You are an expert evaluator of decision advice quality. You will evaluate a response
@@ -214,8 +214,8 @@ def run_judge_evaluation(
verbose: bool = True,
) -> None:
"""Run LLM-as-judge evaluation on existing results."""
- from farness.experiments.reframing import REFRAMING_CASES
- from farness.experiments.stability import STABILITY_CASES
+ from brier.experiments.reframing import REFRAMING_CASES
+ from brier.experiments.stability import STABILITY_CASES
reframing_dir = Path(reframing_dir)
stability_dir = Path(stability_dir)
diff --git a/farness/experiments/llm.py b/brier/experiments/llm.py
similarity index 100%
rename from farness/experiments/llm.py
rename to brier/experiments/llm.py
diff --git a/farness/experiments/reframing.py b/brier/experiments/reframing.py
similarity index 98%
rename from farness/experiments/reframing.py
rename to brier/experiments/reframing.py
index 4a5eb30..42f2dcc 100644
--- a/farness/experiments/reframing.py
+++ b/brier/experiments/reframing.py
@@ -1,4 +1,4 @@
-"""Reframing experiment: does the farness framework reduce problem reframing?
+"""Reframing experiment: does the brier framework reduce problem reframing?
Tests whether structured forecasting constrains or preserves an advisor's
ability to challenge the decision framing itself — suggesting different KPIs,
@@ -14,7 +14,7 @@
from pathlib import Path
from typing import Optional
-from farness.experiments.llm import call_llm
+from brier.experiments.llm import call_llm
try:
import numpy as np
@@ -191,7 +191,7 @@ class ReframingCase:
Think through this carefully step by step, then give me your honest recommendation with reasoning."""
-FARNESS_PROMPT = """You are a decision analyst using the "farness" framework. Apply this process:
+BRIER_PROMPT = """You are a decision analyst using the "Brier" framework. Apply this process:
1. Define 2-3 explicit, measurable KPIs for this decision
2. Identify the options (including ones not mentioned)
@@ -300,7 +300,7 @@ def run_single_trial(
elif condition == "cot":
template = COT_PROMPT
else:
- template = FARNESS_PROMPT
+ template = BRIER_PROMPT
prompt = template.format(scenario=case.scenario.strip())
timestamp = datetime.now().isoformat()
@@ -372,7 +372,7 @@ def analyze_reframing(results: list[ReframingResult]) -> dict:
Supports 2 or 3 conditions with pairwise comparisons and Holm-Bonferroni correction.
"""
- from farness.experiments.stability import holm_bonferroni
+ from brier.experiments.stability import holm_bonferroni
valid = [r for r in results if not r.response_text.startswith("ERROR")]
conditions = sorted(set(r.condition for r in valid))
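Review note: `analyze_reframing` imports `holm_bonferroni` from `brier.experiments.stability`, but its body is not part of this diff. The sketch below shows the standard Holm-Bonferroni step-down adjustment as a standalone function. The name `holm_adjust` and its list-in, list-out signature are assumptions for illustration, not the repository's API.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Return Holm-Bonferroni adjusted p-values in the original order (illustrative)."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is scaled by (m - rank), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must be non-decreasing along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Three pairwise comparisons, as in a three-condition design.
print(holm_adjust([0.01, 0.04, 0.03]))
```

With three conditions there are three pairwise tests, so the smallest p-value is multiplied by 3, the next by 2, and the largest by 1, with the running maximum keeping the adjusted values monotone.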
diff --git a/farness/experiments/runner.py b/brier/experiments/runner.py
similarity index 93%
rename from farness/experiments/runner.py
rename to brier/experiments/runner.py
index 837db98..62b91e3 100644
--- a/farness/experiments/runner.py
+++ b/brier/experiments/runner.py
@@ -1,4 +1,4 @@
-"""Run the farness effectiveness experiment."""
+"""Run the brier effectiveness experiment."""
from __future__ import annotations
@@ -9,9 +9,9 @@
from pathlib import Path
from typing import Optional
-from farness.experiments.cases import TestCase, get_all_cases
-from farness.experiments.llm import call_llm
-from farness.experiments.scorer import ResponseScore, ResponseScorer
+from brier.experiments.cases import TestCase, get_all_cases
+from brier.experiments.llm import call_llm
+from brier.experiments.scorer import ResponseScore, ResponseScorer
@dataclass
@@ -33,7 +33,7 @@ class ExperimentRun:
Give your recommendation and reasoning."""
-FARNESS_TEMPLATE = """You are a decision analyst using the "farness" framework. This framework requires you to:
+BRIER_TEMPLATE = """You are a decision analyst using the "Brier" framework. This framework requires you to:
1. Define explicit, measurable KPIs for the decision
2. Make numeric forecasts with confidence intervals for each option
@@ -49,7 +49,7 @@ class ExperimentRun:
def generate_prompt(case: TestCase, condition: str) -> str:
"""Generate the prompt for a given case and condition."""
- template = NAIVE_TEMPLATE if condition == "naive" else FARNESS_TEMPLATE
+ template = NAIVE_TEMPLATE if condition == "naive" else BRIER_TEMPLATE
return template.format(scenario=case.scenario.strip())
@@ -224,7 +224,7 @@ def generate_prompts_for_manual_run(
if __name__ == "__main__":
import argparse
- parser = argparse.ArgumentParser(description="Run farness experiment")
+ parser = argparse.ArgumentParser(description="Run brier experiment")
parser.add_argument(
"--generate-prompts",
action="store_true",
@@ -250,7 +250,7 @@ def generate_prompts_for_manual_run(
args = parser.parse_args()
- from farness.experiments.cases import get_case
+ from brier.experiments.cases import get_case
if args.case:
case = get_case(args.case)
diff --git a/farness/experiments/scorer.py b/brier/experiments/scorer.py
similarity index 98%
rename from farness/experiments/scorer.py
rename to brier/experiments/scorer.py
index 7b27d96..d155aed 100644
--- a/farness/experiments/scorer.py
+++ b/brier/experiments/scorer.py
@@ -6,7 +6,7 @@
from dataclasses import dataclass
from typing import Optional
-from farness.experiments.cases import TestCase
+from brier.experiments.cases import TestCase
@dataclass
@@ -210,7 +210,7 @@ def aggregate_scores(scores: list[ResponseScore]) -> dict:
return {}
naive_scores = [s for s in scores if s.condition == "naive"]
- farness_scores = [s for s in scores if s.condition == "farness"]
+ brier_scores = [s for s in scores if s.condition == "farness"]
def calc_stats(score_list: list[ResponseScore]) -> dict:
n = len(score_list)
@@ -236,7 +236,7 @@ def calc_stats(score_list: list[ResponseScore]) -> dict:
return {
"naive": calc_stats(naive_scores),
- "farness": calc_stats(farness_scores),
+ "farness": calc_stats(brier_scores),
"by_case": _aggregate_by_case(scores),
}
diff --git a/farness/experiments/stability.py b/brier/experiments/stability.py
similarity index 97%
rename from farness/experiments/stability.py
rename to brier/experiments/stability.py
index 0a42034..f52071e 100644
--- a/farness/experiments/stability.py
+++ b/brier/experiments/stability.py
@@ -19,7 +19,7 @@
np = None # type: ignore
stats = None # type: ignore
-from farness.experiments.cases import DecisionCase
+from brier.experiments.cases import DecisionCase
DEFAULT_PROBE_BATTERY = "on_framework"
@@ -32,7 +32,7 @@
"estimate_only": "Estimate Only",
"format_control": "Format Control",
"cot": "CoT",
- "farness": "Farness",
+ "farness": "Brier",
}
PROBE_BATTERY_DISPLAY_NAMES = {
"on_framework": "On-Framework Probes",
@@ -61,7 +61,7 @@ class QuantitativeCase:
# Expected direction of update given probes (for validation)
expected_update_direction: str # "up", "down", or "neutral"
- # Held-out probes that are intentionally not named in the farness prompt
+ # Held-out probes that are intentionally not named in the brier prompt
off_framework_probes: Optional[list[str]] = None
off_framework_expected_update_direction: Optional[str] = None
@@ -606,7 +606,7 @@ def generate_format_control_prompt(case: QuantitativeCase) -> str:
def generate_cot_prompt(case: QuantitativeCase) -> str:
- """Generate chain-of-thought prompt (structured reasoning, no farness framework)."""
+ """Generate chain-of-thought prompt (structured reasoning, no brier framework)."""
return f"""You are a helpful assistant. Think through this step by step.
{case.scenario}
@@ -614,9 +614,9 @@ def generate_cot_prompt(case: QuantitativeCase) -> str:
Question: {case.estimate_question} Think through this carefully step by step, then give a single number and an 80% confidence interval.{_JSON_INSTRUCTION}"""
-def generate_farness_prompt(case: QuantitativeCase) -> str:
- """Generate farness framework prompt."""
- return f"""You are a decision analyst using the "farness" framework. This requires:
+def generate_brier_prompt(case: QuantitativeCase) -> str:
+ """Generate brier framework prompt."""
+ return f"""You are a decision analyst using the "Brier" framework. This requires:
1. Cite base rates from research (outside view)
2. Make numeric forecasts with confidence intervals
3. Identify cognitive biases in the framing
@@ -633,7 +633,7 @@ def generate_initial_prompt(case: QuantitativeCase, condition: str) -> str:
"estimate_only": generate_estimate_only_prompt,
"format_control": generate_format_control_prompt,
"cot": generate_cot_prompt,
- "farness": generate_farness_prompt,
+ "farness": generate_brier_prompt,
}
try:
return prompt_generators[condition](case)
@@ -1040,7 +1040,7 @@ def _get_case(self, case_id: str) -> Optional[QuantitativeCase]:
return None
def _measure_convergence(self, results: Optional[list[StabilityResult]] = None) -> dict:
- """Measure whether naive(probed) converges toward farness(initial).
+ """Measure whether naive(probed) converges toward brier(initial).
Uses minimum gap threshold to avoid division instability.
Provides bootstrap confidence intervals for the convergence ratio.
@@ -1053,21 +1053,21 @@ def _measure_convergence(self, results: Optional[list[StabilityResult]] = None)
naive_results = [
r for r in results if r.case_id == case.id and r.condition == "naive"
]
- farness_results = [
+ brier_results = [
r for r in results if r.case_id == case.id and r.condition == "farness"
]
- if not naive_results or not farness_results:
+ if not naive_results or not brier_results:
continue
- # Average farness initial estimates per scenario to avoid pseudo-replication
- farness_initial_mean = sum(r.initial_estimate for r in farness_results) / len(farness_results)
+ # Average brier initial estimates per scenario to avoid pseudo-replication
+ brier_initial_mean = sum(r.initial_estimate for r in brier_results) / len(brier_results)
for naive_r in naive_results:
- # Distance from naive(initial) to mean farness(initial)
- initial_gap = abs(naive_r.initial_estimate - farness_initial_mean)
- # Distance from naive(probed) to mean farness(initial)
- final_gap = abs(naive_r.final_estimate - farness_initial_mean)
+ # Distance from naive(initial) to mean brier(initial)
+ initial_gap = abs(naive_r.initial_estimate - brier_initial_mean)
+ # Distance from naive(probed) to mean brier(initial)
+ final_gap = abs(naive_r.final_estimate - brier_initial_mean)
# Skip if initial gap too small (estimates already similar)
if initial_gap < MIN_GAP_THRESHOLD:
@@ -1142,9 +1142,9 @@ def _measure_convergence(self, results: Optional[list[StabilityResult]] = None)
# Interpretation based on CI and effect size
if ci_low is not None and ci_low > 0:
- interpretation = "Significant convergence: naive responses moved toward farness initial estimates (CI excludes 0)"
+ interpretation = "Significant convergence: naive responses moved toward brier initial estimates (CI excludes 0)"
elif ci_high is not None and ci_high < 0:
- interpretation = "Significant divergence: naive responses moved away from farness initial estimates"
+ interpretation = "Significant divergence: naive responses moved away from brier initial estimates"
elif ci_low is None:
# No CI available (scipy not installed)
interpretation = f"Mean convergence ratio: {avg_convergence:.2f} (install scipy for CI and p-value)"
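Review note: the `_measure_convergence` hunks show the two gaps and the minimum-gap guard, but not the ratio itself. The sketch below assumes the ratio is the fraction of the initial gap that was closed, which is consistent with the interpretation strings (positive means convergence, negative means divergence). Both the formula and the `MIN_GAP_THRESHOLD` value are assumptions, since neither is visible in this diff.

```python
from typing import Optional

MIN_GAP_THRESHOLD = 0.05  # assumed value; the real constant lives in stability.py


def convergence_ratio(
    naive_initial: float, naive_final: float, brier_initial_mean: float
) -> Optional[float]:
    """Fraction of the naive-to-brier gap closed after probing (assumed formula)."""
    # Distance from naive(initial) to mean brier(initial)
    initial_gap = abs(naive_initial - brier_initial_mean)
    # Distance from naive(probed) to mean brier(initial)
    final_gap = abs(naive_final - brier_initial_mean)
    # Skip when the estimates already agree, to avoid dividing by a tiny gap.
    if initial_gap < MIN_GAP_THRESHOLD:
        return None
    return (initial_gap - final_gap) / initial_gap


print(convergence_ratio(10.0, 14.0, 18.0))  # moved halfway toward the brier mean
print(convergence_ratio(10.0, 6.0, 18.0))   # moved away from the brier mean
```

Under this assumed definition, a ratio of 1.0 means the probed naive estimate landed exactly on the mean brier initial estimate, and 0.0 means it did not move relative to it.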
diff --git a/farness/experiments/stability_runner.py b/brier/experiments/stability_runner.py
similarity index 98%
rename from farness/experiments/stability_runner.py
rename to brier/experiments/stability_runner.py
index 5061ec6..f76df9b 100644
--- a/farness/experiments/stability_runner.py
+++ b/brier/experiments/stability_runner.py
@@ -15,8 +15,8 @@
from pathlib import Path
from typing import Optional
-from farness.experiments.llm import call_llm, model_short_name
-from farness.experiments.stability import (
+from brier.experiments.llm import call_llm, model_short_name
+from brier.experiments.stability import (
DEFAULT_PROBE_BATTERY,
QuantitativeCase,
StabilityResult,
@@ -355,7 +355,7 @@ def print_experiment_summary(experiment: StabilityExperiment) -> None:
type=str,
nargs="+",
default=None,
- help="Conditions to test (default: naive farness)",
+ help="Conditions to test (default: naive farness; the condition key is unchanged by the rename)",
)
args = parser.parse_args()
diff --git a/farness/framework.py b/brier/framework.py
similarity index 100%
rename from farness/framework.py
rename to brier/framework.py
diff --git a/farness/market.py b/brier/market.py
similarity index 96%
rename from farness/market.py
rename to brier/market.py
index 778c4f8..0f7ec13 100644
--- a/farness/market.py
+++ b/brier/market.py
@@ -1,4 +1,4 @@
-"""Market-draft helpers for turning farness forecasts into forecast markets."""
+"""Market-draft helpers for turning brier forecasts into forecast markets."""
from __future__ import annotations
@@ -9,7 +9,7 @@
from pathlib import Path
from typing import Any, Literal, Optional
-from farness.framework import Decision, Forecast, KPI
+from brier.framework import Decision, Forecast, KPI
MarketOutcomeType = Literal["BINARY", "PSEUDO_NUMERIC"]
MarketVisibility = Literal["public", "unlisted"]
@@ -253,7 +253,7 @@ def draft_market_for_option_kpi(
f"If the condition is true: {resolution_rule}"
)
context = (
- f"Original farness decision: {decision.question}\n\n"
+ f"Original brier decision: {decision.question}\n\n"
f"Condition: if `{option_name}` is chosen or implemented.\n\n"
f"KPI: {kpi.name} - {kpi.description}"
)
@@ -282,7 +282,7 @@ def draft_market_for_option_kpi(
resolution_date=resolution_date,
resolution_rule=conditional_resolution_rule,
source_forecast=_source_forecast_from_forecast(forecast),
- notes=["Drafted from a stored farness forecast."],
+ notes=["Drafted from a stored brier forecast."],
)
low, high = forecast.confidence_interval
@@ -309,7 +309,7 @@ def draft_market_for_option_kpi(
resolution_date=resolution_date,
resolution_rule=conditional_resolution_rule,
source_forecast=_source_forecast_from_forecast(forecast),
- notes=["Drafted from a stored farness forecast."],
+ notes=["Drafted from a stored brier forecast."],
)
@@ -352,14 +352,14 @@ def _description_markdown(
parts.extend(
[
"",
- "_Drafted by farness. Review wording and resolution criteria before posting._",
+ "_Drafted by brier. Review wording and resolution criteria before posting._",
]
)
return "\n".join(parts).strip()
def _source_forecast_from_forecast(forecast: Forecast) -> SourceForecast:
- """Convert a farness forecast to a market-source forecast."""
+ """Convert a brier forecast to a market-source forecast."""
ci_low, ci_high = forecast.confidence_interval
return SourceForecast(
point_estimate=forecast.point_estimate,
diff --git a/farness/mcp_server.py b/brier/mcp_server.py
similarity index 95%
rename from farness/mcp_server.py
rename to brier/mcp_server.py
index 9032a85..75ad839 100644
--- a/farness/mcp_server.py
+++ b/brier/mcp_server.py
@@ -1,4 +1,4 @@
-"""MCP server for farness."""
+"""MCP server for brier."""
import argparse
import json
@@ -7,9 +7,9 @@
from pathlib import Path
from typing import Any, Literal
-from farness import CalibrationTracker, DecisionStore
-from farness.framework import Decision, Forecast, KPI, Option
-from farness.market import (
+from brier import CalibrationTracker, DecisionStore
+from brier.framework import Decision, Forecast, KPI, Option
+from brier.market import (
MarketSource,
draft_binary_policy_market,
draft_markets_for_decision,
@@ -19,7 +19,7 @@
def _resolve_store_path(store_path: str | None = None) -> Path | None:
"""Resolve the configured store path, falling back to environment."""
- candidate = store_path or os.environ.get("FARNESS_STORE_PATH")
+ candidate = store_path or os.environ.get("BRIER_STORE_PATH")
return Path(candidate).expanduser() if candidate else None
@@ -232,7 +232,7 @@ def save_decision_analysis(
context: str | None = None,
store_path: str | None = None,
) -> dict[str, Any]:
- """Persist a structured farness analysis onto an existing decision."""
+ """Persist a structured brier analysis onto an existing decision."""
store = _get_store(store_path)
decision = store.get(decision_id)
if not decision:
@@ -363,7 +363,7 @@ def build_server(store_path: str | None = None):
except ImportError as exc: # pragma: no cover - exercised by installation, not tests
raise RuntimeError(
"MCP support is not installed. Install the repo with MCP extras, "
- "for example `python -m pip install -e '/path/to/farness[mcp]'`."
+ "for example `python -m pip install -e '/path/to/brier[mcp]'`."
) from exc
resolved_store_path = _resolve_store_path(store_path)
@@ -420,9 +420,9 @@ class MarketSourceInput(BaseModel):
url: str = Field(description="Source URL")
server = FastMCP(
- "farness",
+ "brier",
instructions=(
- "Use farness to structure decisions as KPIs, options, forecasts, "
+ "Use brier to structure decisions as KPIs, options, forecasts, "
"reference classes, disconfirming evidence, review dates, and resolvable KPI metadata. "
"In the first answer, show the forecast summary and explain how it drives the recommendation."
),
@@ -433,7 +433,7 @@ def _store() -> DecisionStore:
@server.tool(
title="Create decision",
- description="Create an empty decision record to analyze with the farness workflow.",
+ description="Create an empty decision record to analyze with the brier workflow.",
structured_output=True,
)
def create_decision(question: str, context: str = "") -> dict[str, Any]:
@@ -536,7 +536,7 @@ def get_calibration_summary() -> dict[str, Any]:
@server.tool(
title="Draft forecast market pack",
description=(
- "Draft Manifold-ready forecast market JSON for a stored farness decision "
+ "Draft Manifold-ready forecast market JSON for a stored brier decision "
"or standalone policy question. This never creates markets or places bets."
),
structured_output=True,
@@ -567,15 +567,15 @@ def draft_market_pack(
)
@server.resource(
- "farness://framework",
- title="Farness framework",
- description="The canonical seven-step farness workflow.",
+ "brier://framework",
+ title="Brier framework",
+ description="The canonical seven-step brier workflow.",
mime_type="text/markdown",
)
def framework_resource() -> str:
"""Static overview of the framework."""
return (
- "# Farness\n\n"
+ "# Brier\n\n"
"1. Define one or two KPIs that are later scoreable: include outcome type, "
"resolution rule, resolution date, and data source.\n"
"2. Expand the option set beyond the choices already mentioned.\n"
@@ -629,13 +629,13 @@ def calibration_resource() -> str:
@server.prompt(
title="Analyze decision",
- description="Prompt template for producing a full farness analysis for a stored decision.",
+ description="Prompt template for producing a full brier analysis for a stored decision.",
)
def analyze_decision(decision_id: str) -> str:
"""Generate a prompt to analyze a stored decision."""
decision = get_decision(decision_id)
return (
- "Use the farness workflow for this stored decision.\n\n"
+ "Use the brier workflow for this stored decision.\n\n"
f"Decision record:\n{json.dumps(decision, indent=2)}\n\n"
"Produce:\n"
"1. explicit KPIs with outcome type, resolution rule, resolution date, and data source\n"
@@ -664,7 +664,7 @@ def review_decision(decision_id: str) -> str:
"""Generate a prompt to review a stored decision."""
decision = get_decision(decision_id)
return (
- "Review this farness decision.\n\n"
+ "Review this brier decision.\n\n"
f"Decision record:\n{json.dumps(decision, indent=2)}\n\n"
"Check whether:\n"
"- the chosen option still makes sense,\n"
@@ -696,12 +696,12 @@ def score_decision_prompt(decision_id: str) -> str:
def main() -> None:
- """Run the farness MCP server."""
- parser = argparse.ArgumentParser(prog="farness-mcp", description="Run the farness MCP server.")
+ """Run the brier MCP server."""
+ parser = argparse.ArgumentParser(prog="brier-mcp", description="Run the brier MCP server.")
parser.add_argument(
"--store",
default=None,
- help="Optional path to the farness JSONL store. Defaults to $FARNESS_STORE_PATH or ~/.farness/decisions.jsonl.",
+ help="Optional path to the brier JSONL store. Defaults to $BRIER_STORE_PATH or ~/.brier/decisions.jsonl.",
)
parser.add_argument(
"--transport",
diff --git a/farness/skills.py b/brier/skills.py
similarity index 92%
rename from farness/skills.py
rename to brier/skills.py
index f774f22..625fc57 100644
--- a/farness/skills.py
+++ b/brier/skills.py
@@ -28,11 +28,11 @@ def default_skill_dir(agent: str) -> Path:
if agent == "codex":
codex_home = os.environ.get("CODEX_HOME")
if codex_home:
- return Path(codex_home).expanduser() / "skills" / "farness"
- return Path.home() / ".codex" / "skills" / "farness"
+ return Path(codex_home).expanduser() / "skills" / "brier"
+ return Path.home() / ".codex" / "skills" / "brier"
if agent == "claude":
- return Path.home() / ".claude" / "skills" / "farness"
+ return Path.home() / ".claude" / "skills" / "brier"
raise ValueError(f"Unsupported agent: {agent}")
@@ -40,7 +40,7 @@ def default_skill_dir(agent: str) -> Path:
def load_skill_text(agent: str) -> str:
"""Return the packaged skill template for the requested agent."""
try:
- resource = resources.files("farness").joinpath(*SKILL_RESOURCE_PATHS[agent])
+ resource = resources.files("brier").joinpath(*SKILL_RESOURCE_PATHS[agent])
except KeyError as exc:
raise ValueError(f"Unsupported agent: {agent}") from exc
return resource.read_text(encoding="utf-8")
diff --git a/farness/storage.py b/brier/storage.py
similarity index 96%
rename from farness/storage.py
rename to brier/storage.py
index ee4672b..d43ebf0 100644
--- a/farness/storage.py
+++ b/brier/storage.py
@@ -5,7 +5,7 @@
from pathlib import Path
from typing import Optional
-from farness.framework import Decision
+from brier.framework import Decision
class DecisionStore:
@@ -13,7 +13,7 @@ class DecisionStore:
def __init__(self, path: Optional[Path] = None):
if path is None:
- path = Path.home() / ".farness" / "decisions.jsonl"
+ path = Path.home() / ".brier" / "decisions.jsonl"
self.path = Path(path)
self.path.parent.mkdir(parents=True, exist_ok=True)
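Review note: together with `_resolve_store_path` in `brier/mcp_server.py`, this hunk changes where decisions are read from. The sketch below restates the lookup order shown in the diff (explicit argument, then `$BRIER_STORE_PATH`, then `None` so that `DecisionStore` falls back to `~/.brier/decisions.jsonl`). The function name without the leading underscore and the sample paths are illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_store_path(store_path: Optional[str] = None) -> Optional[Path]:
    """Mirror of the diff's lookup order: explicit argument first, then environment."""
    candidate = store_path or os.environ.get("BRIER_STORE_PATH")
    # None tells the caller to use the DecisionStore default location.
    return Path(candidate).expanduser() if candidate else None


os.environ.pop("BRIER_STORE_PATH", None)
print(resolve_store_path())                   # no override configured

os.environ["BRIER_STORE_PATH"] = "/tmp/env-decisions.jsonl"
print(resolve_store_path())                   # environment variable wins over the default
print(resolve_store_path("/tmp/cli.jsonl"))   # explicit argument wins over the environment
```

As written in the diff, a previously exported `FARNESS_STORE_PATH` and an existing `~/.farness/decisions.jsonl` are no longer consulted, so existing users would need to rename the variable or move the file.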
diff --git a/claude-plugin/.claude-plugin/plugin.json b/claude-plugin/.claude-plugin/plugin.json
index 7bed4d9..1c4d3f2 100644
--- a/claude-plugin/.claude-plugin/plugin.json
+++ b/claude-plugin/.claude-plugin/plugin.json
@@ -1,5 +1,5 @@
{
- "name": "farness",
+ "name": "brier",
"version": "0.2.4",
"description": "Forecasting as a harness - reframe decisions as KPI predictions",
"author": {
diff --git a/claude-plugin/commands/decide.md b/claude-plugin/commands/decide.md
index 545f0ca..0222a64 100644
--- a/claude-plugin/commands/decide.md
+++ b/claude-plugin/commands/decide.md
@@ -1,12 +1,12 @@
---
-description: Run a structured decision analysis using the farness framework (forecasting as a harness)
+description: Run a structured decision analysis using the brier framework (forecasting as a harness)
arguments:
- name: decision
description: The decision or question to analyze (optional - will prompt if not provided)
required: false
---
-# Farness Decision Framework
+# Brier Decision Framework
You are running a structured decision analysis. Follow this framework exactly:
@@ -60,11 +60,11 @@ Ask: "What information would most change these estimates?"
## Step 6: Log the Decision
-After completing the analysis, use Python to save the decision using the farness package:
+After completing the analysis, use Python to save the decision using the brier package:
```python
from datetime import datetime, timedelta
-from farness import Decision, KPI, Option, Forecast, DecisionStore
+from brier import Decision, KPI, Option, Forecast, DecisionStore
# Create the decision object with all the data from the analysis
decision = Decision(
@@ -104,7 +104,7 @@ store.save(decision)
print(f"Decision logged: {decision.id[:8]}")
```
-Tell the user: "Decision logged. Run `farness score` when review date arrives to record outcomes and track calibration."
+Tell the user: "Decision logged. Run `brier score` when review date arrives to record outcomes and track calibration."
## Key Principles
diff --git a/claude-plugin/commands/score.md b/claude-plugin/commands/score.md
index b9d0465..3814ee8 100644
--- a/claude-plugin/commands/score.md
+++ b/claude-plugin/commands/score.md
@@ -15,7 +15,7 @@ Review a past decision and score how the forecasts performed.
Run the interactive scoring command:
```bash
-farness score $ARGUMENTS
+brier score $ARGUMENTS
```
This will:
@@ -32,13 +32,13 @@ This will:
List unscored decisions:
```bash
-farness list --unscored
+brier list --unscored
```
Or show a specific decision:
```bash
-farness show
+brier show
```
### Step 2: Review Original Forecasts
@@ -60,7 +60,7 @@ Get specific numbers.
```python
from datetime import datetime
-from farness import DecisionStore
+from brier import DecisionStore
store = DecisionStore()
decision = store.get("")
@@ -78,7 +78,7 @@ store.update(decision)
### Step 5: Show Calibration
```bash
-farness calibration
+brier calibration
```
## Reflection Questions
diff --git a/docs/agent-workflows.md b/docs/agent-workflows.md
index a961796..055548d 100644
--- a/docs/agent-workflows.md
+++ b/docs/agent-workflows.md
@@ -1,13 +1,13 @@
# Agent Workflows
-`farness` is not tied to one assistant. The Claude Code plugin is the most integrated path today, but the framework also works with Codex, Cursor, Windsurf, ChatGPT, and any other agent that can follow structured instructions.
+`brier` is not tied to one assistant. The Claude Code plugin is the most integrated path today, but the framework also works with Codex, Cursor, Windsurf, ChatGPT, and any other agent that can follow structured instructions.
## Core instruction
-Give your agent this instruction when you want a decision analyzed with `farness`:
+Give your agent this instruction when you want a decision analyzed with `brier`:
```text
-Use the farness workflow for this decision.
+Use the brier workflow for this decision.
1. Define the KPI or outcome that would make the decision successful.
2. Expand the option set beyond the choices already mentioned.
3. Anchor on a relevant reference class or base rate before using the inside view.
@@ -20,7 +20,7 @@ Do not answer with a vague recommendation until the forecasts are explicit.
## Codex and other coding agents
-This works well in tools like Codex because they already have the two things `farness` needs:
+This works well in tools like Codex because they already have the two things `brier` needs:
- access to local context
- the ability to log decisions through the CLI or Python package
@@ -28,18 +28,18 @@ This works well in tools like Codex because they already have the two things `fa
Minimal workflow:
```bash
-python -m pip install farness
-farness new "Should we rewrite the auth layer?" --context "3 incidents this quarter; CTO prefers Rust; team is strongest in Node."
+python -m pip install brier
+brier new "Should we rewrite the auth layer?" --context "3 incidents this quarter; CTO prefers Rust; team is strongest in Node."
```
-Then ask the agent to use the core instruction above and to read or update the decision in `~/.farness/decisions.jsonl`.
+Then ask the agent to use the core instruction above and to read or update the decision in `~/.brier/decisions.jsonl`.
If you want Codex to pick this workflow up as a native skill, install the packaged skill:
```bash
-python -m pip install 'farness[mcp]'
-farness setup codex
-farness doctor codex
+python -m pip install 'brier[mcp]'
+brier setup codex
+brier doctor codex
```
Then restart Codex.
@@ -47,16 +47,16 @@ Then restart Codex.
If the skill drifted or setup only half-worked:
```bash
-farness doctor codex --fix
+brier doctor codex --fix
```
## MCP server
-If you want a native tool surface instead of prompt copy-paste, `farness` ships an MCP server:
+If you want a native tool surface instead of prompt copy-paste, `brier` ships an MCP server:
```bash
-python -m pip install 'farness[mcp]'
-farness-mcp
+python -m pip install 'brier[mcp]'
+brier-mcp
```
The server exposes:
@@ -68,9 +68,9 @@ The server exposes:
Optional configuration:
```bash
-FARNESS_STORE_PATH=/path/to/decisions.jsonl farness-mcp
+BRIER_STORE_PATH=/path/to/decisions.jsonl brier-mcp
# or
-farness-mcp --store /path/to/decisions.jsonl
+brier-mcp --store /path/to/decisions.jsonl
```
The default transport is `stdio`, which is the right default for editor and agent integrations.
@@ -78,7 +78,7 @@ The default transport is `stdio`, which is the right default for editor and agen
To register the local server in Codex:
```bash
-farness setup codex
+brier setup codex
```
## Claude Code
@@ -86,51 +86,51 @@ farness setup codex
Claude Code can use the same local MCP server and a local skill wrapper:
```bash
-python -m pip install 'farness[mcp]'
-farness setup claude
-farness doctor claude
+python -m pip install 'brier[mcp]'
+brier setup claude
+brier doctor claude
```
-This gives Claude Code a local skill plus the `farness` MCP tools/resources/prompts.
+This gives Claude Code a local skill plus the `brier` MCP tools/resources/prompts.
If the skill drifted or setup only half-worked:
```bash
-farness doctor claude --fix
+brier doctor claude --fix
```
The plugin path is still available if you prefer slash commands:
```bash
-claude plugin marketplace add MaxGhenis/farness
-claude plugin install farness@maxghenis-plugins
+claude plugin marketplace add MaxGhenis/brier
+claude plugin install brier@maxghenis-plugins
```
-Then either use the local skill or run `/farness:decide` for the plugin flow.
+Then either use the local skill or run `/brier:decide` for the plugin flow.
## Python and CLI
-If you do not want any agent integration, `farness` still works as a local decision log and calibration tool. The CLI does not call an LLM and does not need an API key.
+If you do not want any agent integration, `brier` still works as a local decision log and calibration tool. The CLI does not call an LLM and does not need an API key.
Useful commands:
```bash
-farness new "Should we rewrite the auth layer?"
-farness list
-farness show
-farness pending
-farness calibration
+brier new "Should we rewrite the auth layer?"
+brier list
+brier show
+brier pending
+brier calibration
```
To draft forecast questions from a standalone policy question or a stored decision:
```bash
-farness forecast-draft "Will Waymo be legally permitted to offer fully driverless paid robotaxi rides in Washington, DC by 2026-12-31?" \
+brier forecast-draft "Will Waymo be legally permitted to offer fully driverless paid robotaxi rides in Washington, DC by 2026-12-31?" \
--initial-prob 52 \
--resolution-date 2026-12-31 \
--output waymo-dc-forecast-pack.json
-farness forecast-draft --output forecast-pack.json
+brier forecast-draft --output forecast-pack.json
```
This only writes Manifold-ready JSON. It does not publish questions, create Manifold entries,
@@ -139,8 +139,8 @@ place bets, or require a Manifold API key.
If you want to fully reset a local integration:
```bash
-farness uninstall codex
-farness uninstall claude
+brier uninstall codex
+brier uninstall claude
```
## Recommended prompt shape
diff --git a/farness/__init__.py b/farness/__init__.py
deleted file mode 100644
index 888e79b..0000000
--- a/farness/__init__.py
+++ /dev/null
@@ -1,21 +0,0 @@
-"""Farness: Forecasting as a harness for decision-making."""
-
-__version__ = "0.2.4"
-
-from farness.framework import Decision, KPI, Option, Forecast, OutcomeType
-from farness.storage import DecisionStore
-from farness.calibration import CalibrationTracker
-from farness.market import MarketDraft, MarketSource, draft_markets_for_decision
-
-__all__ = [
- "Decision",
- "KPI",
- "Option",
- "Forecast",
- "OutcomeType",
- "DecisionStore",
- "CalibrationTracker",
- "MarketDraft",
- "MarketSource",
- "draft_markets_for_decision",
-]
diff --git a/farness/experiments/__init__.py b/farness/experiments/__init__.py
deleted file mode 100644
index d6220cb..0000000
--- a/farness/experiments/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-"""Experiments for measuring farness framework effectiveness."""
diff --git a/forecast-api/.env.example b/forecast-api/.env.example
index b6efc75..e40c4ab 100644
--- a/forecast-api/.env.example
+++ b/forecast-api/.env.example
@@ -2,7 +2,7 @@
# AI_GATEWAY_API_KEY=...
# Defaults to anthropic/claude-sonnet-4.6.
-# FARNESS_AI_MODEL=anthropic/claude-sonnet-4.6
+# BRIER_AI_MODEL=anthropic/claude-sonnet-4.6
# Comma-separated browser origins allowed to read SSE streams.
-# FARNESS_SITE_ORIGINS=https://farness.ai,http://127.0.0.1:3001
+# BRIER_SITE_ORIGINS=https://brieralmanac.org,http://127.0.0.1:3001
diff --git a/forecast-api/README.md b/forecast-api/README.md
index b046f30..b0a7d3b 100644
--- a/forecast-api/README.md
+++ b/forecast-api/README.md
@@ -1,4 +1,4 @@
-# Farness Forecast API
+# Brier Forecast API
Small Vercel-deployable backend for live forecast traces.
@@ -10,7 +10,7 @@ bun run dev -- --hostname 127.0.0.1 --port 3002
```
The static site reads from `http://127.0.0.1:3002` on local hosts unless
-`NEXT_PUBLIC_FARNESS_API_BASE_URL` is set.
+`NEXT_PUBLIC_BRIER_API_BASE_URL` is set.
AI Gateway is optional locally. Without `AI_GATEWAY_API_KEY`,
`VERCEL_OIDC_TOKEN`, or a Vercel runtime, live endpoints still stream public
diff --git a/forecast-api/package.json b/forecast-api/package.json
index 2eb598b..16e47e3 100644
--- a/forecast-api/package.json
+++ b/forecast-api/package.json
@@ -1,5 +1,5 @@
{
- "name": "farness-forecast-api",
+ "name": "brier-forecast-api",
"private": true,
"version": "0.0.0",
"scripts": {
diff --git a/forecast-api/src/app/forecasts/[slug]/stream/route.ts b/forecast-api/src/app/forecasts/[slug]/stream/route.ts
index 3b029a1..26cfef2 100644
--- a/forecast-api/src/app/forecasts/[slug]/stream/route.ts
+++ b/forecast-api/src/app/forecasts/[slug]/stream/route.ts
@@ -157,17 +157,17 @@ async function streamSpmChildPovertyForecast(send: SendEvent) {
});
const calibrationCall =
- 'farness.calibration.lookup({ domain: "poverty_forecasts", outcome: "spm_child_poverty_rate", targetYear: 2025 })';
+ 'brier.calibration.lookup({ domain: "poverty_forecasts", outcome: "spm_child_poverty_rate", targetYear: 2025 })';
send("status", {
state: "tool_running",
label: "Looking up SPM calibration prior",
});
send("tool_start", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
});
send("tool_result", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
result: serializeSpmCalibrationToolResult(dataset),
});
@@ -361,13 +361,13 @@ async function streamCtcCurrentLawOutlaysForecast(send: SendEvent) {
});
const calibrationCall =
- 'farness.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "current_law_outlays" })';
+ 'brier.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "current_law_outlays" })';
send("status", {
state: "tool_running",
label: "Looking up CTC outlay calibration",
});
send("tool_start", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
});
@@ -378,7 +378,7 @@ async function streamCtcCurrentLawOutlaysForecast(send: SendEvent) {
const ciLow = 52.0;
const ciHigh = 70.0;
send("tool_result", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
result: JSON.stringify(
{
@@ -489,17 +489,17 @@ async function streamCtcExpansionForecast(send: SendEvent) {
});
const calibrationCall =
- 'farness.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "federal_budget_cost" })';
+ 'brier.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "federal_budget_cost" })';
send("status", {
state: "tool_running",
label: "Looking up calibration prior",
});
send("tool_start", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
});
send("tool_result", {
- tool: "farness.calibration",
+ tool: "brier.calibration",
call: calibrationCall,
result: JSON.stringify(dataset.calibration, null, 2),
});
diff --git a/forecast-api/src/app/health/route.ts b/forecast-api/src/app/health/route.ts
index 4665d84..56bf9af 100644
--- a/forecast-api/src/app/health/route.ts
+++ b/forecast-api/src/app/health/route.ts
@@ -1,6 +1,6 @@
export function GET() {
return Response.json({
ok: true,
- service: "farness-forecast-api",
+ service: "brier-forecast-api",
});
}
diff --git a/forecast-api/src/lib/cors.ts b/forecast-api/src/lib/cors.ts
index 2293946..ca4f2da 100644
--- a/forecast-api/src/lib/cors.ts
+++ b/forecast-api/src/lib/cors.ts
@@ -1,6 +1,6 @@
const DEFAULT_ORIGINS = [
- "https://farness.ai",
- "https://www.farness.ai",
+ "https://brieralmanac.org",
+ "https://www.brieralmanac.org",
"http://localhost:3000",
"http://127.0.0.1:3000",
"http://localhost:3001",
@@ -10,7 +10,7 @@ const DEFAULT_ORIGINS = [
export function corsHeaders(request: Request): HeadersInit {
const origin = request.headers.get("origin");
const allowedOrigins = (
- process.env.FARNESS_SITE_ORIGINS?.split(",") ?? DEFAULT_ORIGINS
+ process.env.BRIER_SITE_ORIGINS?.split(",") ?? DEFAULT_ORIGINS
).map((value) => value.trim());
const allowOrigin =
origin && allowedOrigins.includes(origin) ? origin : allowedOrigins[0];
diff --git a/forecast-api/src/lib/forecast.ts b/forecast-api/src/lib/forecast.ts
index f7a96fb..e0690eb 100644
--- a/forecast-api/src/lib/forecast.ts
+++ b/forecast-api/src/lib/forecast.ts
@@ -86,7 +86,7 @@ export async function generateCpiForecast(
);
}
- const model = process.env.FARNESS_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
+ const model = process.env.BRIER_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
try {
const result = await generateObject({
@@ -94,7 +94,7 @@ export async function generateCpiForecast(
schema: ForecastSchema,
temperature: 0.2,
system:
- "You are a Farness public forecasting agent. Produce concise, audit-ready reasoning for public readers. Do not reveal hidden chain-of-thought; provide a public trace with evidence, assumptions, and uncertainty.",
+ "You are a Brier public forecasting agent. Produce concise, audit-ready reasoning for public readers. Do not reveal hidden chain-of-thought; provide a public trace with evidence, assumptions, and uncertainty.",
prompt: [
"Forecast this public prediction cell:",
"What will the annual average percent change in CPI-U for calendar year 2026 versus the 2025 annual average be, as published by BLS?",
@@ -132,7 +132,7 @@ export async function generateCtcExpansionForecast(
);
}
- const model = process.env.FARNESS_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
+ const model = process.env.BRIER_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
try {
const result = await generateObject({
@@ -140,7 +140,7 @@ export async function generateCtcExpansionForecast(
schema: CtcExpansionForecastSchema,
temperature: 0.2,
system:
- "You are a Farness public forecasting agent. Forecast in billions of nominal dollars. Use public, audit-ready reasoning only. Treat PolicyEngine as an explicit model input, not as ground truth, and describe calibration adjustments without hidden chain-of-thought.",
+ "You are a Brier public forecasting agent. Forecast in billions of nominal dollars. Use public, audit-ready reasoning only. Treat PolicyEngine as an explicit model input, not as ground truth, and describe calibration adjustments without hidden chain-of-thought.",
prompt: [
"Forecast this public prediction cell:",
dataset.summary.question,
@@ -186,7 +186,7 @@ export async function generateSpmChildPovertyForecast(
);
}
- const model = process.env.FARNESS_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
+ const model = process.env.BRIER_AI_MODEL ?? "anthropic/claude-sonnet-4.6";
try {
const result = await generateObject({
@@ -194,7 +194,7 @@ export async function generateSpmChildPovertyForecast(
schema: SpmChildPovertyForecastSchema,
temperature: 0.2,
system:
- "You are a Farness public forecasting agent. Forecast in percentage points. Use public, audit-ready reasoning only. Treat Census history and PolicyEngine current-law inputs as explicit model inputs, not as ground truth, and describe calibration adjustments without hidden chain-of-thought.",
+ "You are a Brier public forecasting agent. Forecast in percentage points. Use public, audit-ready reasoning only. Treat Census history and PolicyEngine current-law inputs as explicit model inputs, not as ground truth, and describe calibration adjustments without hidden chain-of-thought.",
prompt: [
"Forecast this public prediction cell:",
dataset.summary.question,
@@ -381,7 +381,7 @@ function normalizeSpmPercentForecast(
}
function shouldTryGateway() {
- if (process.env.FARNESS_DISABLE_AI === "1") return false;
+ if (process.env.BRIER_DISABLE_AI === "1") return false;
return Boolean(
process.env.AI_GATEWAY_API_KEY ||
process.env.VERCEL_OIDC_TOKEN ||
diff --git a/paper/_header.html b/paper/_header.html
index 3bb014b..73869e8 100644
--- a/paper/_header.html
+++ b/paper/_header.html
@@ -1,4 +1,4 @@
-
-
@@ -49,7 +49,7 @@ export default function MarketsPage() {
How forecasts are generated
- Every forecast cell is opened by the Farness analyst agent, which
+ Every forecast cell is opened by the Brier analyst agent, which
decomposes the question, calls the PolicyEngine microsim against
scenarios drawn from law-encoded statutes and MICROPLEX synthetic
populations, integrates external baselines (CBO, FOMC SEP, JCT, BLS,
diff --git a/site/src/app/page.tsx b/site/src/app/page.tsx
index 75a1370..9b4d5f7 100644
--- a/site/src/app/page.tsx
+++ b/site/src/app/page.tsx
@@ -1,206 +1,9 @@
"use client";
-import Link from "next/link";
import { Header } from "@/components/Header";
-import { DemoVideo } from "@/components/DemoVideo";
import { MarketsBrowser } from "@/components/MarketsBrowser";
-/* ── Hero ── */
-
-function Hero() {
- return (
-
-
- farness
- {" "}
- intercepts agent decisions and demands a forecast: a KPI, a
- confidence interval, a base rate, disconfirming evidence, and a
- review date. Works with Codex, Claude Code, and any agent that
- speaks MCP.
-
Forecasts on every consequential cell of government data
- Farness analyst agents forecast published government statistics,
+ Brier analyst agents forecast published government statistics,
law-encoded policy parameters, and outcomes conditional on policy
states. Each cell carries a calibrated interval and an audit trail
of the reasoning behind it.
@@ -241,7 +44,7 @@ function ForecastPrototype() {
Live API paths
- CPI-U and two CTC cells stream through api.farness.ai.
+ CPI-U and two CTC cells stream through api.thesisinstitute.org.
@@ -277,27 +80,27 @@ function HorizonDivider() {
);
}
-/* ── How It Works — 3 steps on light cards ── */
+/* ── How the Almanac works — 3 steps ── */
function HowItWorks() {
const stages = [
{
num: "01",
- title: "Intercept",
+ title: "The catalog",
description:
- "Catch decision-language before the model hardens into advice. When a prompt sounds like 'Should we...?' or 'Which is better?', farness reframes it as a forecastable choice.",
+ "Every consequential cell of government data — published statistics, law-encoded policy parameters, and outcomes conditional on policy states — gets its own forecast cell, with an explicit resolution source and date.",
},
{
num: "02",
- title: "Reframe",
+ title: "The forecast",
description:
- "Convert vague 'Should I?' into explicit, measurable outcome questions. Define the KPIs that would actually tell you whether the decision was good.",
+ "Brier analyst agents predict each cell with a point estimate, a calibrated interval, and a full audit trail of the reasoning, sources, and key drivers behind it — open by construction.",
},
{
num: "03",
- title: "Anchor",
+ title: "The score",
description:
- "Produce numeric forecasts with confidence intervals, reference classes from comparable situations, disconfirming evidence, and a review date for accountability.",
+ "When the official number publishes, every forecast is scored against the record. Calibration is public, per cell and per agent, so the track record is the product.",
},
];
@@ -306,10 +109,10 @@ function HowItWorks() {
- How farness works
+ How the Almanac works
- From intuition to instrument
+ Open forecasts, scored against reality
- The clip below shows the current Codex path exactly the way the docs
- describe it: install the package, register the local MCP server, use
- $farness
- in Codex, then pull the decision back out of the local store.
-
- The paper introduces stability-under-probing as a way to evaluate
- decision prompts without waiting for outcomes. In Study 1, farness
- looked more prepared for the shared probe battery on Claude Opus
- 4.6 and GPT-5.4.
-
-
- Study 2 then added held-out probes and showed the broader claim
- weakens sharply off-framework. That makes the paper a methods
- result first, not proof that farness is universally superior.
-
-
- The useful claim is narrower and better: structured decision
- prompts can be tested empirically, and farness is one case study.
-
- AI is often fluent about decisions before it is rigorous about them.
- farness adds structure before confidence hardens into action.
-
-
-
-
- );
-}
-
-/* ── Installation ── */
-
-function Installation() {
- const workflows = [
- {
- title: "Codex",
- description:
- "Install the package, run one setup command, then use $farness when a decision prompt shows up.",
- code: `$ python -m pip install 'farness[mcp]'
-$ farness setup codex
-$ # restart Codex, then use $farness`,
- },
- {
- title: "Claude Code",
- description:
- "Use the same single-command setup flow for Claude. The plugin is still available if you prefer slash-command UX.",
- code: `$ python -m pip install 'farness[mcp]'
-$ farness setup claude
-$ # restart Claude Code`,
- },
- {
- title: "CLI / Python",
- description:
- "Local decision log and calibration tool. No LLM API key required unless you run separate experiment code against external models.",
- code: `$ python -m pip install farness
-$ farness new "Should we rewrite the auth layer?"
-$ farness calibration`,
- },
- ];
-
- return (
-
-
-
- Agent integrations
-
-
- Use it natively or from the CLI
-
-
-
- Farness now has a package-first agent path: a local MCP server for
- persistence, packaged skills for Codex and Claude Code, and the same
- forecast structure used in the paper. The Claude plugin remains
- optional, and the CLI is a local store and calibration surface, not an
- LLM client. If setup drifts, `farness doctor --fix` repairs the local
- integration.
-
-
- Start with farness
-
-
- );
-}
+/* ── Footer ── */
function Footer() {
return (
@@ -825,17 +184,8 @@ export default function HomePage() {
-
-
-
-
-
-
-
-
-
-
+
);
diff --git a/site/src/app/thesis/page.tsx b/site/src/app/thesis/page.tsx
deleted file mode 100644
index 2c404f2..0000000
--- a/site/src/app/thesis/page.tsx
+++ /dev/null
@@ -1,756 +0,0 @@
-import Link from "next/link";
-import { Cite } from "@/components/Cite";
-import { Header } from "@/components/Header";
-
-export default function ThesisPage() {
- return (
-
-
-
-
-
- The Farness thesis
-
-
- Forecasting as a harness
-
-
- Why reframing decisions as predictions leads to better outcomes—and
- how to do it.
-
-
-
-
-
-
The problem with advice
-
- When we ask someone—a friend, a mentor, an AI—"Should I do
- X?", we're asking the wrong question. The answer we get
- depends entirely on unstated assumptions: What do we value? What
- counts as success? How certain is the advisor? None of this is
- made explicit.
-
-
- Worse, we can never learn from these answers. A year later, we
- can't evaluate whether the advice was good because we never
- defined what "good" meant. The feedback loop is broken.
-
-
- This isn't just a problem with AI (though AI's tendency
- toward sycophancy makes it worse
- 1). It's a problem with how we structure
- decision-making conversations. Annie Duke calls this
- "resulting"—judging decisions by outcomes rather than
- process
- 16. When we ask for advice and get a good
- outcome, we credit the advice. Bad outcome, we blame it. But a
- single outcome tells us almost nothing about whether the decision
- was good.
-
-
-
-
-
The reframe
-
- Instead of asking for advice, ask for{" "}
- forecasts conditional on actions.
-
-
The shift is subtle but transformative:
-
-
- Before: "Should I take this job?"
-
-
- After: "If I value income, growth, and
- work-life balance, what's the probability that each of
- these exceeds my threshold under Option A vs Option B? What
- assumptions drive those estimates?"
-
-
-
This forces several things to happen:
-
-
- Values become explicit. You must state what
- you're optimizing for before anyone can help you.
-
-
- Uncertainty becomes visible. A forecast
- requires a confidence interval. "Probably fine"
- becomes "70% chance, with a range of 50-85%."
-
-
- Assumptions surface. To make a forecast, you
- must reason about mechanisms. What needs to be true for this
- outcome to occur?
-
-
- Accountability emerges. Predictions can be
- scored. Opinions cannot.
-
-
-
-
-
-
The superforecasting connection
-
- This isn't a new idea. Philip Tetlock's research on
- superforecasting
- 2 identified a set of techniques that reliably
- improve predictive accuracy. In the Good Judgment Project, a small
- group of forecasters consistently beat professional intelligence
- analysts with access to classified information
- 3.
-
- Outside view first: Start with base rates
- before adjusting for specifics—what Kahneman calls
- "reference class forecasting"
- 5.
-
-
- Calibrated confidence: Your 80% predictions
- should come true 80% of the time.
-
-
- Continuous updating: Revise estimates as new
- information arrives, following Bayesian principles.
-
-
-
- Superforecasters don't have access to secret information.
- They're just more disciplined about structuring their
- thinking. Across nearly 100 comparative studies, Dawes, Faust, and
- Meehl found that structured "mechanical" prediction
- equaled or outperformed unstructured expert judgment in every
- domain tested
- 17. Farness applies this discipline to
- personal and professional decisions.
-
-
-
-
-
Why AI makes this better
-
- Large language models are surprisingly good at forecasting. LLM
- ensembles can match human crowd accuracy on prediction tasks
- 6. Halawi et al. built a retrieval-augmented
- system that approaches competitive forecaster accuracy
- 18, and AI forecasting systems like AIA
- Forecaster have achieved superforecaster-level performance through
- structured pipelines of search, independent reasoning, and
- calibration
- 7. The CAIS forecasting bot has demonstrated
- superhuman accuracy on competitive forecasting platforms
- 8. On ForecastBench, LLMs now surpass the
- median public forecaster, with projected LLM-superforecaster
- parity by late 2026
- 28.
-
-
- But LLMs are also prone to sycophancy: telling you what you want
- to hear rather than what's true. Research has shown this
- tendency is robust across models and contexts
- 1.
-
-
- The forecasting frame is a harness that constrains this
- tendency. When you ask an AI for a probability with a confidence
- interval, it's harder for it to simply validate your existing
- beliefs. Numbers create accountability. Xiong et al. found that
- structured elicitation strategies—multi-step prompting, top-k
- sampling—can help mitigate LLM overconfidence, though no single
- technique consistently outperforms others
- 19. How you ask matters as much as what you
- ask.
-
-
- More importantly, the structure itself improves thinking. Research
- on LLM-augmented forecasting found that AI assistance
- significantly boosts human forecasting accuracy, with the largest
- gains for less experienced forecasters
- 9:
-
-
-
- KPI definition forces you to articulate what
- you actually care about.
-
-
- Option expansion surfaces alternatives you
- hadn't considered.
-
-
- Assumption surfacing reveals where your model
- might be wrong.
-
-
- Sensitivity analysis shows which uncertainties
- matter most.
-
-
-
The AI becomes a structured thinking partner, not an oracle.
-
- See the research: I've developed a
- methodology called "stability-under-probing" to
- empirically test whether frameworks reduce sycophancy.{" "}
- Read the paper →
-
-
-
-
-
The calibration loop
-
- The most powerful part of this approach is what happens over time.
- By logging your forecasts and scoring them against reality, you
- build a calibration curve.
-
-
- Research on expert prediction shows that without feedback, even
- domain experts are poorly calibrated
- 10. Lichtenstein, Fischhoff, and Phillips
- found that when people said they were 98% confident, they were
- correct only 68% of the time
- 20. But with structured feedback, calibration
- improves dramatically. Weather forecasters and professional
- oddsmakers—who receive regular, structured feedback on their
- probabilistic predictions—exhibited little or no overconfidence.
- The Good Judgment Project confirmed this: regular accuracy
- feedback was one of the key interventions that improved
- performance
- 3.
-
-
- You learn that you're overconfident on career decisions. Or
- underconfident on technical estimates. Or systematically biased
- toward optimism about timelines.
-
-
- This meta-knowledge is invaluable. It's not just about making
- better individual decisions—it's about understanding your own
- decision-making patterns and compensating for systematic biases.
-
-
-
-
-
The decision quality chain
-
- Ron Howard and the Strategic Decisions Group developed a framework
- for measuring decision quality at the time of decision,
- independent of outcome
- 21. A decision is only as good as its weakest
- link across six elements: appropriate frame, creative
- alternatives, reliable information, clear values, sound reasoning,
- and commitment to action
- 22.
-
-
- Farness maps directly onto this chain. Defining KPIs addresses{" "}
- frame and values. Option expansion addresses{" "}
- creative alternatives. Forecasting with base rates
- addresses reliable information and{" "}
- sound reasoning. The calibration loop addresses the
- feedback mechanism that strengthens every link over time.
-
-
- The key insight from decision analysis is that you can assess
- decision quality without waiting for outcomes. Howard's
- information value theory shows that when decisions are framed as
- forecasts, you can calculate exactly how much to invest in
- resolving each uncertainty
- 23. If the expected value of learning your
- probability of success is only $50, don't spend $5,000 on a
- feasibility study.
-
-
- This connects to what Kahneman and Lovallo call the "inside
- view" versus "outside view"
- 24. Decision makers naturally treat each
- problem as unique, anchoring on plans and scenarios rather than
- base rates from comparable situations. Reframing decisions as
- forecasts naturally invokes the outside view by forcing explicit
- probability assessment against a reference class.
-
-
-
-
-
Boosting, not nudging
-
- Hertwig and Grune-Yanoff distinguish "nudges"
- (environmental changes that steer behavior) from
- "boosts" (interventions that build decision-making
- competence)
- 25. A nudge might default your retirement
- savings to 10%. A boost teaches you to think about compound
- interest so you choose the right rate yourself.
-
-
- Farness is a boost, not a nudge. It doesn't tell you what to
- decide. It teaches a way of thinking—probabilistic, structured,
- accountable—that transfers across domains. Julia Galef calls this
- the "scout mindset": treating beliefs as provisional
- hypotheses to be stress-tested, not positions to defend
- 26. The forecasting frame cultivates this
- mindset by making accuracy the explicit goal.
-
-
- And critically, Koriat, Lichtenstein, and Fischhoff showed that
- simply asking people to generate reasons against their
- preferred option eliminates overconfidence almost entirely
- 27. Structured consideration of
- alternatives—a core forecasting discipline—is one of the most
- robust debiasing techniques known.
-
-
-
-
-
The framework
-
- Farness implements a five-step process, drawing on structured
- analytic techniques from intelligence analysis
- 11 and the superforecasting literature:
-
-
-
- Define KPIs. What outcomes matter? Pick 1-3
- metrics you'd actually use to judge success in hindsight.
- This mirrors the "AIMS" technique (Audience, Issue,
- Message, Storyline) from intelligence analysis
- 11.
-
-
- Expand options. Don't just compare A vs B.
- What about C? Waiting? A hybrid? The best option is often one
- you didn't initially consider. This combats "premature
- closure"—a well-documented cognitive bias
- 12.
-
-
- Decompose and forecast. For each option x KPI,
- apply outside view, inside view, Fermi decomposition. Produce a
- point estimate with confidence interval. Decomposition is one of
- Heuer's core structured analytic techniques
- 11.
-
-
- Surface assumptions. What must be true for this
- forecast to hold? What would change it? This is the "key
- assumptions check" from intelligence tradecraft
- 13.
-
-
- Log and score. Record the decision. Return in
- 3-6 months. Compare predictions to reality. Update your
- calibration. Brier scores provide a proper scoring rule that
- rewards both accuracy and calibration
- 14.
-
-
-
-
-
-
When to use it
-
Farness is valuable across a range of decisions:
-
-
- High-stakes decisions where the cost of being
- wrong is significant.
-
-
- Recurring decision types where you can build
- calibration over time.
-
-
- Decisions with delayed feedback where you
- won't know if you were right for months or years.
-
-
- Decisions where you suspect motivated reasoning
- —where you might be fooling yourself
- 15.
-
-
- Smaller decisions as practice—building the
- habit and calibration data that pays off when stakes are high.
-
-
-
-
-
-
The vision
-
Imagine a world where every significant decision comes with:
-
-
Explicit success criteria
-
A range of options, not just the obvious ones
-
Quantified predictions with uncertainty ranges
-
Surfaced assumptions that can be tested
-
A record that can be scored and learned from
-
-
- This is possible today. The tools exist. The research supports it.
- What's missing is the habit—the muscle memory of reaching for
- forecasts instead of opinions.
-
-
- Farness is an attempt to build that habit. Use it as a Python
- library, a CLI tool, or a Claude Code plugin. Log your decisions.
- Score your predictions. Get better over time.
-
-
-
- Get started →
-
-
-
-
-
-
-
- References
-
-
-
- ↑ Sharma, M., et al. (2024). "Towards
- Understanding Sycophancy in Language Models."{" "}
- ICLR 2024.{" "}
-
- openreview.net
-
-
-
- ↑ Tetlock, P. E., & Gardner, D. (2015).{" "}
- Superforecasting: The Art and Science of Prediction.
- Crown.{" "}
-
- Amazon
-
-
-
- ↑ Mellers, B., et al. (2014).
- "Psychological Strategies for Winning a Geopolitical
- Forecasting Tournament." Psychological Science,
- 25(5), 1106-1115.{" "}
-
- DOI
-
-
-
- ↑ Good Judgment.
- "Superforecasters' Toolbox: Fermi-ization in
- Forecasting."{" "}
-
- goodjudgment.com
-
-
-
- ↑ Kahneman, D., & Tversky, A. (1979).
- "Intuitive Prediction: Biases and Corrective
- Procedures." TIMS Studies in Management Science, 12,
- 313-327.
-
-
- ↑ Schoenegger, P., et al. (2024).
- "Wisdom of the Silicon Crowd: LLM Ensemble Prediction
- Capabilities Rival Human Crowd Accuracy."{" "}
- arXiv:2402.19379.{" "}
-
- arxiv.org/abs/2402.19379
-
-
- ↑ Tetlock, P. E. (2005).{" "}
-
- Expert Political Judgment: How Good Is It? How Can We Know?
- {" "}
- Princeton University Press.{" "}
-
- Princeton University Press
-
-
-
- ↑ Heuer, R. J., & Pherson, R. H. (2015).{" "}
- Structured Analytic Techniques for Intelligence Analysis{" "}
- (2nd ed.). CQ Press.{" "}
-
- Amazon
-
-
-
- ↑ Kruglanski, A. W., & Webster, D. M.
- (1996). "Motivated Closing of the Mind: 'Seizing'
- and 'Freezing'." Psychological Review,
- 103(2), 263-283.{" "}
-
- DOI
-
-
- ↑ Brier, G. W. (1950). "Verification
- of Forecasts Expressed in Terms of Probability."{" "}
- Monthly Weather Review, 78(1), 1-3.{" "}
-
- DOI
-
-
-
- ↑ Kunda, Z. (1990). "The Case for
- Motivated Reasoning." Psychological Bulletin,
- 108(3), 480-498.{" "}
-
- DOI
-
-
-
- ↑ Duke, A. (2018).{" "}
-
- Thinking in Bets: Making Smarter Decisions When You Don't
- Have All the Facts
-
- . Portfolio/Penguin.
-
-
- ↑ Dawes, R. M., Faust, D., & Meehl, P. E.
- (1989). "Clinical Versus Actuarial Judgment."{" "}
- Science, 243(4899), 1668-1674.{" "}
-
- DOI
-
-
-
- ↑ Halawi, D., Zhang, F., Chen, Y.-H., &
- Steinhardt, J. (2024). "Approaching Human-Level Forecasting
- with Language Models." NeurIPS 2024.{" "}
-
- arxiv.org/abs/2402.18563
-
-
-
- ↑ Xiong, M., Hu, Z., Lu, X., et al. (2024).
- "Can LLMs Express Their Uncertainty? An Empirical Evaluation
- of Confidence Elicitation in LLMs." ICLR 2024.{" "}
-
- arxiv.org/abs/2306.13063
-
-
-
- ↑ Lichtenstein, S., Fischhoff, B., &
- Phillips, L. D. (1982). "Calibration of Probabilities: The
- State of the Art to 1980." In D. Kahneman, P. Slovic, & A.
- Tversky (Eds.),{" "}
- Judgment under Uncertainty: Heuristics and Biases (pp.
- 306-334). Cambridge University Press.
-
-
- ↑ Howard, R. A. (1988). "Decision
- Analysis: Practice and Promise." Management Science,
- 34(6), 679-695.{" "}
-
- DOI
-
-
-
- ↑ Spetzler, C., Winter, H., & Meyer, J.
- (2016).{" "}
-
- Decision Quality: Value Creation from Better Business Decisions
-
- . Wiley.
-
-
- ↑ Howard, R. A. (1966). "Information
- Value Theory."{" "}
- IEEE Transactions on Systems Science and Cybernetics,
- 2(1), 22-26.{" "}
-
- DOI
-
-
-
- ↑ Kahneman, D., & Lovallo, D. (1993).
- "Timid Choices and Bold Forecasts: A Cognitive Perspective on
- Risk Taking." Management Science, 39(1), 17-31.{" "}
-
- DOI
-
-
-
- ↑ Hertwig, R., & Grune-Yanoff, T. (2017).
- "Nudging and Boosting: Steering or Empowering Good
- Decisions." Perspectives on Psychological Science,
- 12(6), 973-986.{" "}
-
- DOI
-
-
-
- ↑ Galef, J. (2021).{" "}
-
- The Scout Mindset: Why Some People See Things Clearly and Others
- Don't
-
- . Portfolio/Penguin.
-
-
- ↑ Koriat, A., Lichtenstein, S., &
- Fischhoff, B. (1980). "Reasons for Confidence."{" "}
-
- Journal of Experimental Psychology: Human Learning and Memory
-
- , 6(2), 107-118.{" "}
-
- DOI
-
-
-
- ↑ Karger, E., et al. (2025).
- "ForecastBench: A Dynamic Benchmark of AI Forecasting
- Capabilities." ICLR 2025.{" "}
-
- openreview.net
-
-
-
-
-
-
-
-
- );
-}
diff --git a/site/src/app/vision/page.tsx b/site/src/app/vision/page.tsx
deleted file mode 100644
index e2f270d..0000000
--- a/site/src/app/vision/page.tsx
+++ /dev/null
@@ -1,550 +0,0 @@
-import type { Metadata } from "next";
-import Link from "next/link";
-import { Header } from "@/components/Header";
-
-export const metadata: Metadata = {
- title: "Farness vision — working document",
- description:
- "Working synthesis of the Farness Foundation vision: we build open AI forecasters that publish, explain, and score predictions on consequential outcomes.",
- robots: {
- index: false,
- follow: false,
- nocache: true,
- googleBot: {
- index: false,
- follow: false,
- noimageindex: true,
- },
- },
-};
-
-export default function VisionPage() {
- return (
-
-
-
-
-
- Working document · not for distribution
-
-
- Open predictions
-
-
-
- farness
- {" "}
- builds open AI forecasters. We make them predict consequential
- outcomes, show their work, call public tools, publish calibrated
- uncertainty, and score every forecast against reality.
-
-
-
-
-
-
The bet
-
- Open source software opened code. Open data opened the inputs.
- Open weights opened the reasoning machinery. We open the
- predictions: continuously-updated forecasts from AI systems whose
- tool calls, assumptions, uncertainty, calibration, and later
- outcomes are public.
-
-
- We use forecasting as an alignment pressure. A system that must
- predict public facts before they happen has to track reality,
- expose uncertainty, use evidence, and learn from misses. When the
- trace is open, the public can inspect the model's evidence and the
- model can learn from the public record of its own errors.
-
-
- The traces create a compounding loop. Aggregate them and
- systematic biases become visible. Score the forecasts and weak
- methods lose credibility. Publish the fixes and the next
- generation of forecasters starts from a better baseline. Applied
- to prediction, the open-source dynamic becomes epistemic
- infrastructure.
-
-
-
-
-
What we build
-
- Farness Foundation builds four connected pieces of public-good
- forecasting infrastructure:
-
-
We run open forecasters on consequential questions
-
- We run AI-agent ensembles across the structured grid of
- consequential questions: government statistics from BEA, BLS,
- Census, and IRS; law-encoded policy parameters; and counterfactual
- questions that drive policy and economic decisions. We publish the
- forecasts, the traces, the calibration history, and the running
- methodology notes openly. Funded compute scales the depth of the
- ensemble; the substrate stays free at the point of use.
-
-
We simulate policy with inspectable models
-
- We maintain PolicyEngine as open-source microsimulation for US,
- UK, and Canadian tax-benefit systems. Governments, think tanks,
- advocacy organizations, and researchers use it for custom policy
- analysis. Farness forecasters call PolicyEngine when they need
- policy-conditional distributions, and PolicyEngine keeps serving
- the policy community through the brand and workflows people
- already know.
-
-
We build calibrated synthetic populations
-
- We build Microplex as the synthetic micro-data substrate for
- PolicyEngine simulations and calibration-native AI research. We
- publish the population data, methodology, and synthesizer code
- openly. Microplex replaces PolicyEngine's Enhanced CPS substrate
- with data calibrated more tightly to administrative benchmarks and
- useful beyond tax-benefit microsimulation.
-
-
We make everyday agent advice forecastable
-
- We maintain the open-source Farness Decisions package, CLI, MCP
- server, and agent skills. They turn advice-seeking into explicit
- forecasts with KPIs, options, confidence intervals, resolution
- rules, and calibration tracking. This keeps the same discipline
- available for individual decisions, team decisions, and public
- policy forecasts.
-
-
-
-
-
The transparency advantage is the durable moat
-
- We make transparency the core mechanism. Every methodology
- improvement, newly-discovered bias, and successful tool
- integration becomes shared infrastructure. Researchers can inspect
- the trace, reproduce the forecast, challenge the assumptions, and
- contribute a better method. Each improvement raises the baseline
- for everyone who builds on the substrate.
-
-
- The same dynamic that made Linux durable protects Farness's
- position. The compounding work happens across the whole community
- of users and contributors. The foundation maintains the core
- infrastructure, integrates the best contributions, sets direction,
- and protects the public-good character. The community expands the
- surface area faster than any single organization could.
-
-
-
-
-
Built for the agents of tomorrow
-
- The infrastructure that matters most gets built ahead of the
- capability that needs it. TCP/IP was designed for a few hundred
- nodes and scaled to billions because the design anticipated future
- use. Kubernetes solved orchestration problems most organizations
- had not yet reached when it shipped. Linux was built when
- computing was tiny and scaled with hardware nobody had imagined.
- Substrate-builders capture disproportionate value because they are
- already there when the demand shows up.
-
-
- Farness is built with this in mind. Every capability is reachable
- through a clean machine-callable API; future agents will call
- tools directly. Every agent trace is structured for downstream
- consumption by other agents and human readers. Every tool in the
- simulation engine is self-describing so that agents that have not
- been invented yet can discover what is available. Permissioning
- anticipates millions of automated participants through scoped
- automated access. Calibration scoring is queryable, so current
- agents can learn from history and future agents can preferentially
- route to tool configurations with proven track records.
-
-
- This costs a little more today and pays disproportionately when
- capability arrives. By the time agents are reliably orchestrating
- tools, composing pipelines, and proposing methodology
- improvements, the substrate they need will already be open,
- public, free, and continuously calibrated. Open substrate gives
- tomorrow's agents permission-less infrastructure the next decade
- of AI development can build on.
-
-
-
-
-
The stack
-
- Two independent 501(c)(3) foundations, technically integrated as
- one open stack:
-
-
-
- Encoded-law substrate — computable statutes,
- regulations, holdings, and metadata linking published government
- statistics to the laws that mandate them. Separate organization,
- shared substrate.
-
-
- Farness Foundation — open-predictions platform,
- microsimulation engine and custom policy analysis
- (PolicyEngine), synthetic-population substrate (Microplex),
- personal decision tool (Farness Decisions), and the research
- program on calibration-native foundation models and value
- forecasting.
-
-
-
- The Farness platform consumes encoded law, government data
- architecture, and Microplex population substrate as inputs, runs
- ensembles through PolicyEngine and other computational engines,
- and publishes calibrated forecasts. Policy partners interact with
- PolicyEngine directly through its own brand and channels. New
- audiences — AI safety, agencies funding their own forecasts,
- prediction-market researchers, broader policy analysts — interact
- with Farness as the umbrella platform.
-
-
-
-
-
Open predictions as a movement
-
- The category needs a name to anchor its identity. The lineage:
-
-
-
- Open source software opened the code. Linux,
- Apache, Mozilla. The free software movement and its successors
- made source available and rewrote the economics of software
- distribution.
-
-
- Open data opened the inputs. Wikipedia, Common
- Crawl, OpenStreetMap, government open-data portals. The raw
- material of analysis became public and citable.
-
-
- Open weights opened the reasoning machinery.
- Allen Institute's Olmo, Llama, Mistral, DeepSeek. The trained
- models themselves became available for inspection and reuse.
-
-
- Open predictions opens the reasoning
- itself, on consequential questions. Every prior, every
- tool call, every update is auditable. The output includes the
- forecast and the full chain of reasoning that produced it.
-
-
-
- Each step opens more of the epistemic process. Each step produces
- durable public goods and gives the next generation of builders
- more to start from. Open predictions is the natural next layer,
- and Farness is the foundation building it.
-
-
-
-
-
We align AI by making it predict
-
- We give AI systems a narrow job with a hard feedback loop: predict
- consequential outcomes before they happen, explain the evidence
- behind the prediction, quantify uncertainty, and accept a public
- score when reality arrives. That objective pushes models toward
- truth-tracking behavior because calibration, evidence use, and
- humility become measurable product requirements.
-
-
- We use the strongest available models as forecasters today. We
- connect them to public data, encoded law, PolicyEngine
- simulations, Microplex populations, and explicit calibration
- records. We evaluate which model-tool-method combinations predict
- best. We publish the traces so other researchers can reproduce,
- criticize, and improve the methods.
-
-
- As the corpus grows, we train and evaluate prediction-native
- systems: agents that select tools, decompose questions, maintain
- uncertainty, update on evidence, and learn from scored outcomes.
- The lab advances by making forecasts useful in the world and by
- making the full learning loop open.
-
-
-
-
-
We train prediction-native agents
-
- The long-run research program is not just prompting other
- companies' models. We build the data, tools, and evaluation loop
- needed to train our own forecasters. The training corpus is
- time-versioned so every backtest can ask what the agent could have
- known on a specific date, which tool versions were available, and
- which outcomes had not yet resolved. That leakage control turns
- forecasting into a real scientific benchmark instead of a vibes
- demo.
-
-
- We make tool use native. The agent learns that forecasting often
- means calling BLS, Census, IRS, CMS, CBO, BEA, PolicyEngine,
- Microplex, and other public or inspectable systems. Tool outputs
- carry provenance, vintage, and uncertainty. Forecast artifacts
- store the question, evidence, tool calls, assumptions, uncertainty
- decomposition, final distribution, resolution rule, and eventual
- score as first-class data.
-
-
- We train on scored predictions. Supervised learning starts from
- strong public forecast traces. Reinforcement learning and reward
- modeling optimize proper scoring rules, calibration, interval
- sharpness, and decision usefulness once enough forecasts resolve.
- The objective is not eloquence; it is measured accuracy under
- uncertainty with a visible audit trail.
-
-
- We let agents learn from other agents' public traces. Not private
- chain-of-thought, but durable artifacts: which evidence was used,
- which tools were called, which assumptions mattered, which
- forecaster configurations were overconfident, and which methods
- improved after resolution. Future agents can route to the tools,
- methods, and peer traces with the best calibration record.
-
-
- This is how Farness can become an AI lab without becoming a closed
- frontier company. Whether we train foundation models directly or
- specialize open models into forecasters, the purpose stays narrow
- and public: build systems whose job is to predict, explain,
- resolve, and improve against reality.
-
-
-
-
-
Funder fit
-
- The funder base that matches the thesis is broader and more
- accessible than the funder base for any of the predecessor
- framings:
-
-
-
- Coefficient Giving (Open Philanthropy rebrand)
- — AI safety, forecasting infrastructure, consequence-visibility
- framing fits directly in their existing grant portfolios.
-
-
- Survival and Flourishing Fund — long-horizon AI
- safety and alignment-adjacent infrastructure.
-
-
- Astera Institute,{" "}
- Schmidt Sciences / Schmidt Futures,{" "}
- Mozilla Foundation — novel public-good
- scientific infrastructure and open-source AI ethos.
-
-
-
- Anthropic alumni and AI-safety-aligned liquidity
- {" "}
- — tender-offer and IPO-event capital from Anthropic and similar
- frontier labs. Open-source-by-construction means current
- frontier-lab employees can publicly back the work without
- conflict of interest. The complement-not-compete frame is unique
- to this category.
-
-
- Arnold Ventures Mission Aligned Investments —
- fits the structure Andrew Moylan and team have already signaled
- interest in, particularly for the open policy-forecasting
- infrastructure angle.
-
-
-
- Government agencies and international equivalents
- {" "}
- — Treasury, state revenue offices, Federal Reserve regional
- banks, HHS, Census, and international counterparts paying for
- marginal compute on the questions they care about. Sponsored
- runs are program-related revenue that fits 501(c)(3) structure
- cleanly.
-
-
- National research funding — NSF, DARPA, IARPA,
- ARIA UK, NIH for specific research directions.
-
-
-
- Sponsorship capital from AI labs, Big Tech, and philanthropies
-
- , per the Fradkin/Jabarian/Koh
- well-capitalized-prediction-markets model, applied to specific
- question sets the sponsor wants better-calibrated forecasts on.
-
-
-
- Farness funds the work through multiple channels. Foundation
- grants fund the platform and research. Sponsored compute pays for
- specific question coverage. Custom analysis through PolicyEngine
- generates additional program revenue. The revenue mix keeps the
- foundation institutionally independent.
-
-
-
-
-
What success looks like in five years
-
- At maturity, Farness produces continuously-updated calibrated
- forecasts on every consequential government statistic, every
- encoded policy parameter, and every counterfactual conditional
- question stakeholders care about. The platform runs hundreds to
- thousands of specialized agent configurations, each with published
- methodology and visible track record. Calibration history goes
- back years and is queryable per question, per configuration, per
- resolution period. Government agencies fund targeted compute on
- their projection questions. Researchers build on the open
- infrastructure for their own work. Frontier AI labs use the
- calibration corpus as a training and evaluation resource.
- Open-source forecaster configurations and tool integrations are
- contributed by people the foundation has never met.
-
-
- The forecasts feed into the decisions of governments, advocacy
- organizations, firms, and individuals because calibrated
- probability distributions with visible evidence improve the
- decisions those institutions already make. The substrate
- compounds: every new tool integration, every new methodology
- insight, and every new question coverage makes everything that was
- already there more useful.
-
-
- And when the AI agents of 2030 arrive — substantially more capable
- than today's, better at tool selection, better at composing
- methodology, better at reasoning over their own outputs — they
- find a substrate already built for them. Open, calibrated,
- audit-trail-native, and free at the point of use. The capability
- becomes immediately deployable on consequential questions because
- the infrastructure is already there.
-
-
-
-
-
Honest caveats and open questions
-
-
- The autonomous-improvement language is aspirational.
- {" "}
- Today's AI systems can iterate variants, tune hyperparameters, and
- generate model code, but autonomous improvement of methodology
- without sustained human guidance is years out. Honest framing:
- open human-in-the-loop improvement of AI ensembles on a
- transparent substrate, with the substrate compounding the
- human-and-AI work over time. We build collaborative compounding
- before autonomous self-improvement.
-
-
- The one-year launch starts narrower. The platform
- launches with a smaller agent ensemble, fewer questions, a
- narrower research program, and a working but incomplete substrate.
- Building toward the mature state takes real research and
- engineering investment over years. The vision is the north star;
- the early stages look more like a focused shipping organization
- than a complete forecasting layer.
-
-
- Open infrastructure depends on adoption.{" "}
- Organizations need workflows that integrate open predictions into
- real decisions. Building that institutional muscle across policy
- shops, agencies, and other users takes years. Farness can lead the
- category and still has to earn adoption one workflow at a time.
-
-
-
- Regulatory ambiguity if forecasts become market-moving.
- {" "}
- Farness publishes forecasts rather than trades, which avoids most
- prediction-market regulatory complexity. If open forecasts become
- widely consumed by financial markets, the SEC or CFTC may still
- take interest in disclosure rules. Probably solvable through
- precedents like Federal Reserve forecast publication, but warrants
- real legal review.
-
-
- The PolicyEngine brand transition. PolicyEngine
- continues operationally unchanged, but funders, board, and
- partners need to be brought along on the umbrella structure.
- Existing grants are to PolicyEngine via PSL Foundation fiscal
- sponsorship; the cleanest path is incorporating Farness Foundation
- as the new 501(c)(3) and graduating PolicyEngine into it from PSL.
- Donor consent process is straightforward; the communications work
- requires care.
-
-
-
- The "Farness" name has multiple uses to disambiguate.
- {" "}
- Farness Foundation (the org), Farness (the open-predictions
- platform — the flagship), Farness Decisions (the personal decision
- tool). Naming hierarchy needs to be settled before any public
- launch.
-
-
-
-
-
The shape of the work, in priority order
-
-
- Incorporate Farness Foundation as a 501(c)(3)
- upon graduating PolicyEngine from PSL fiscal sponsorship. Use
- fiscal sponsorship during the application period.
-
-
- Settle the naming hierarchy publicly and
- internally before any launch announcement: foundation, platform,
- PolicyEngine, Microplex, Decisions, and the shared law/data
- substrate.
-
-
- Compose the board with AI-safety, policy, and
- technical credibility — names that signal what the foundation is
- to the funder base it most needs to reach.
-
-
- Publish the manifesto in its public form (this
- document, rewritten for external audience) with accompanying
- funder one-pager and FAQ.
-
-
- Ship the first visible version of the platform—
- Manifold-hosted forecast experiments with full agent telemetry,
- a small set of government-data-anchored questions with published
- calibration, and the agent traces openly available.
-
-
- Move Microplex into PolicyEngine as the
- Enhanced CPS replacement, with the methodology and synthesizer
- code published openly.
-
-
- Pre-flight major funder conversations —
- Coefficient Giving, SFF, Schmidt, Anthropic-alumni outreach,
- Arnold Ventures MAI — with the manifesto and one-pager in hand.
-
-
- Coordinate the encoded-law substrate on shared
- roadmap for law access and government-data architecture.
-
-
-
-
- Read the thesis →
-
-
-
-
-
-
-
-
- );
-}
diff --git a/site/src/components/FarnessLogo.tsx b/site/src/components/BrierLogo.tsx
similarity index 80%
rename from site/src/components/FarnessLogo.tsx
rename to site/src/components/BrierLogo.tsx
index 4862345..333e764 100644
--- a/site/src/components/FarnessLogo.tsx
+++ b/site/src/components/BrierLogo.tsx
@@ -1,12 +1,12 @@
/**
- * Farness logo mark — "The Vanishing Point"
+ * Brier logo mark — "The Vanishing Point"
*
* Two perspective lines converging toward a luminous focal point.
* Updated for the "Clear Horizon" palette:
* - Lines use Mist-400 (#9FB6C6) — visible on light backgrounds
* - Dot uses Rose-600 (#A94E80) — brand accent
*/
-export function FarnessLogoMark({
+export function BrierLogoMark({
size = 28,
className = "",
}: {
@@ -21,7 +21,7 @@ export function FarnessLogoMark({
fill="none"
xmlns="http://www.w3.org/2000/svg"
className={className}
- aria-label="Farness"
+ aria-label="Thesis"
>
{/* Glow halo — rose at low opacity */}
-
- farness
+
+ thesis
);
}
diff --git a/site/src/components/DemoVideo.tsx b/site/src/components/DemoVideo.tsx
index 8886390..cecca50 100644
--- a/site/src/components/DemoVideo.tsx
+++ b/site/src/components/DemoVideo.tsx
@@ -4,8 +4,8 @@ export function DemoVideo({
caption?: string;
}) {
const assetRev = process.env.NEXT_PUBLIC_SITE_ASSET_REV || "dev";
- const videoSrc = `/demo/farness-demo.mp4?v=${assetRev}`;
- const posterSrc = `/demo/farness-demo-poster.png?v=${assetRev}`;
+ const videoSrc = `/demo/brier-demo.mp4?v=${assetRev}`;
+ const posterSrc = `/demo/brier-demo-poster.png?v=${assetRev}`;
return (
@@ -19,7 +19,7 @@ export function DemoVideo({
playsInline
poster={posterSrc}
preload="metadata"
- aria-label="End-to-end farness workflow demo for Codex"
+ aria-label="End-to-end brier workflow demo for Codex"
>
diff --git a/site/src/components/Header.tsx b/site/src/components/Header.tsx
index 6382132..b7186bd 100644
--- a/site/src/components/Header.tsx
+++ b/site/src/components/Header.tsx
@@ -1,10 +1,10 @@
import Link from "next/link";
-import { FarnessLogoMark } from "./FarnessLogo";
+import { BrierLogoMark } from "./BrierLogo";
export function Header({
activePage,
}: {
- activePage?: "docs" | "thesis" | "paper" | "forecasts";
+ activePage?: "docs" | "thesis" | "paper" | "forecasts" | "about";
}) {
return (
-
- farness
+
+ thesis
diff --git a/site/src/components/MarketRuntime.tsx b/site/src/components/MarketRuntime.tsx
index 08139e6..d9be4d4 100644
--- a/site/src/components/MarketRuntime.tsx
+++ b/site/src/components/MarketRuntime.tsx
@@ -598,7 +598,7 @@ function Spinner() {
}
function resolveApiBase() {
- const configured = process.env.NEXT_PUBLIC_FARNESS_API_BASE_URL?.replace(
+ const configured = process.env.NEXT_PUBLIC_BRIER_API_BASE_URL?.replace(
/\/$/,
"",
);
@@ -609,7 +609,7 @@ function resolveApiBase() {
) {
return "http://127.0.0.1:3002";
}
- return "https://api.farness.ai";
+ return "https://api.thesisinstitute.org";
}
function parseEventData(event: Event) {
diff --git a/site/src/data/markets.ts b/site/src/data/markets.ts
index bbb12d4..da0d417 100644
--- a/site/src/data/markets.ts
+++ b/site/src/data/markets.ts
@@ -386,7 +386,7 @@ export const MARKETS: Market[] = [
},
{
kind: "text",
- text: "The FOMC SEP median, CBO projection, and the farness structural-VAR all cluster between 4.27 and 4.40 for full-year 2026. December prints tend to run very slightly below the annual mean in expansions due to seasonal adjustment behavior in the household survey.",
+ text: "The FOMC SEP median, CBO projection, and the brier structural-VAR all cluster between 4.27 and 4.40 for full-year 2026. December prints tend to run very slightly below the annual mean in expansions due to seasonal adjustment behavior in the household survey.",
},
{ kind: "heading", text: "Risk distribution" },
{
@@ -908,8 +908,8 @@ export const MARKETS: Market[] = [
{ kind: "heading", text: "Calibration layer" },
{
kind: "tool",
- tool: "farness.calibration",
- call: 'farness.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "federal_budget_cost" })',
+ tool: "brier.calibration",
+ call: 'brier.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", outcome: "federal_budget_cost" })',
result:
"{ raw_to_final_ratio: 1.04, additive_billions: 3.5, queued_uncertainty_multiplier: 1.4 }",
},
@@ -971,8 +971,8 @@ export const MARKETS: Market[] = [
},
{
kind: "tool",
- tool: "farness.calibration",
- call: 'farness.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", target: "irs_soi_outlays" })',
+ tool: "brier.calibration",
+ call: 'brier.calibration.lookup({ domain: "policyengine_budget_scores", policy_area: "ctc", target: "irs_soi_outlays" })',
result:
"{ ratio: 1.02, additive_billions: 0.5, widened_for_reporting_lag: true }",
},
diff --git a/farness/assets/skills/codex/SKILL.md b/skills/brier/SKILL.md
similarity index 86%
rename from farness/assets/skills/codex/SKILL.md
rename to skills/brier/SKILL.md
index 935113c..585759b 100644
--- a/farness/assets/skills/codex/SKILL.md
+++ b/skills/brier/SKILL.md
@@ -1,13 +1,13 @@
---
-name: farness
+name: brier
description: Use when the user wants advice or a decision analysis rather than pure implementation, especially for prompts like "should I", "should we", "which is better", "is it worth it", or "what would you do" about architecture, product, hiring, strategy, or career choices. Reframe the decision as explicit KPIs, expanded options, reference classes, disconfirming evidence, numeric forecasts, and a review date. Do not use for straightforward debugging, factual explanation, or routine coding tasks.
---
-# Farness
+# Brier
Use this skill to turn vague decisions into forecastable choices.
-Prefer the `farness` MCP server when available. It gives you persistent tools, resources, and prompts for the workflow.
+Prefer the `brier` MCP server when available. It gives you persistent tools, resources, and prompts for the workflow.
## Trigger Conditions
@@ -37,7 +37,7 @@ Do not use it for:
## Workflow
1. If there is no stored decision yet, call `create_decision`.
-2. Use `farness://framework` if you need the canonical sequence.
+2. Use `brier://framework` if you need the canonical sequence.
3. Structure the analysis around:
- KPI definition
- KPI resolution metadata
@@ -57,8 +57,8 @@ Do not use it for:
5. If the user is revisiting the decision, use `get_decision` and `review_decision`.
6. If outcomes are now known, call `score_decision` to update calibration.
7. If the user wants to externalize a forecast into a prediction market, draft it first:
- - Use `farness forecast-draft --output forecast-pack.json` for stored decisions.
- - Use `farness forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
+ - Use `brier forecast-draft --output forecast-pack.json` for stored decisions.
+ - Use `brier forecast-draft "" --initial-prob <1-99> --resolution-date YYYY-MM-DD --output forecast-pack.json` for standalone policy questions.
- Treat forecast drafts as review artifacts only; do not publish questions or place bets unless the user explicitly asks.
## Working Rules
@@ -74,10 +74,10 @@ Do not use it for:
## Fallback
-If the `farness` MCP server is not connected, tell the user to add it with:
+If the `brier` MCP server is not connected, tell the user to add it with:
```bash
-farness setup codex
+brier setup codex
```
Then continue with the same workflow once the server is available.
diff --git a/skills/brier/agents/openai.yaml b/skills/brier/agents/openai.yaml
new file mode 100644
index 0000000..f68332f
--- /dev/null
+++ b/skills/brier/agents/openai.yaml
@@ -0,0 +1,13 @@
+interface:
+ display_name: "Brier"
+ short_description: "Turn decisions into tracked forecasts with MCP-backed structure"
+ default_prompt: "Use $brier to turn this decision into explicit KPIs, options, forecasts, and a review date, then persist it with the brier MCP tools."
+
+dependencies:
+ tools:
+ - type: "mcp"
+ value: "brier"
+ description: "Local brier MCP server"
+
+policy:
+ allow_implicit_invocation: true
diff --git a/skills/farness/agents/openai.yaml b/skills/farness/agents/openai.yaml
deleted file mode 100644
index b189b5f..0000000
--- a/skills/farness/agents/openai.yaml
+++ /dev/null
@@ -1,13 +0,0 @@
-interface:
- display_name: "Farness"
- short_description: "Turn decisions into tracked forecasts with MCP-backed structure"
- default_prompt: "Use $farness to turn this decision into explicit KPIs, options, forecasts, and a review date, then persist it with the farness MCP tools."
-
-dependencies:
- tools:
- - type: "mcp"
- value: "farness"
- description: "Local farness MCP server"
-
-policy:
- allow_implicit_invocation: true
diff --git a/tests/test_agent_setup.py b/tests/test_agent_setup.py
index 5ee02e6..48c5659 100644
--- a/tests/test_agent_setup.py
+++ b/tests/test_agent_setup.py
@@ -7,7 +7,7 @@
import pytest
-from farness.agent_setup import (
+from brier.agent_setup import (
inspect_agent_setup,
manual_setup_command,
remove_agent_setup,
@@ -19,8 +19,8 @@
def test_manual_setup_command_for_claude():
command = manual_setup_command("claude", "/tmp/venv/bin/python")
assert (
- command == "claude mcp add --scope user farness -- "
- "/tmp/venv/bin/python -m farness.mcp_server"
+ command == "claude mcp add --scope user brier -- "
+ "/tmp/venv/bin/python -m brier.mcp_server"
)
@@ -30,10 +30,10 @@ def test_setup_agent_skips_add_when_server_exists(monkeypatch, tmp_path):
skill_path.write_text("skill")
monkeypatch.setattr(
- "farness.agent_setup.install_skill", lambda *args, **kwargs: skill_path
+ "brier.agent_setup.install_skill", lambda *args, **kwargs: skill_path
)
monkeypatch.setattr(
- "farness.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
+ "brier.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
)
calls = []
@@ -42,12 +42,12 @@ def fake_run(cmd, capture_output, text, check):
calls.append(cmd)
return SimpleNamespace(returncode=0, stdout="", stderr="")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fake_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fake_run)
result = setup_agent("codex", python_bin="/tmp/python")
assert result.mcp_already_configured is True
- assert calls == [["codex", "mcp", "get", "farness"]]
+ assert calls == [["codex", "mcp", "get", "brier"]]
def test_setup_agent_adds_missing_server(monkeypatch, tmp_path):
@@ -56,10 +56,10 @@ def test_setup_agent_adds_missing_server(monkeypatch, tmp_path):
skill_path.write_text("skill")
monkeypatch.setattr(
- "farness.agent_setup.install_skill", lambda *args, **kwargs: skill_path
+ "brier.agent_setup.install_skill", lambda *args, **kwargs: skill_path
)
monkeypatch.setattr(
- "farness.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
+ "brier.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
)
calls = []
@@ -70,24 +70,24 @@ def fake_run(cmd, capture_output, text, check):
return SimpleNamespace(returncode=1, stdout="", stderr="missing")
return SimpleNamespace(returncode=0, stdout="", stderr="")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fake_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fake_run)
result = setup_agent("claude", python_bin="/tmp/python")
assert result.mcp_already_configured is False
assert calls == [
- ["claude", "mcp", "get", "farness"],
+ ["claude", "mcp", "get", "brier"],
[
"claude",
"mcp",
"add",
"--scope",
"user",
- "farness",
+ "brier",
"--",
"/tmp/python",
"-m",
- "farness.mcp_server",
+ "brier.mcp_server",
],
]
@@ -98,34 +98,34 @@ def test_setup_agent_reports_missing_cli(monkeypatch, tmp_path):
skill_path.write_text("skill")
monkeypatch.setattr(
- "farness.agent_setup.install_skill", lambda *args, **kwargs: skill_path
+ "brier.agent_setup.install_skill", lambda *args, **kwargs: skill_path
)
- monkeypatch.setattr("farness.agent_setup.shutil.which", lambda name: None)
+ monkeypatch.setattr("brier.agent_setup.shutil.which", lambda name: None)
with pytest.raises(RuntimeError) as excinfo:
setup_agent("codex", python_bin="/tmp/python")
message = str(excinfo.value)
assert "Installed the codex skill" in message
- assert "codex mcp add farness -- /tmp/python -m farness.mcp_server" in message
+ assert "codex mcp add brier -- /tmp/python -m brier.mcp_server" in message
def test_inspect_agent_setup_uses_codex_home(monkeypatch, tmp_path):
codex_home = tmp_path / "codex-home"
- skill_path = codex_home / "skills" / "farness" / "SKILL.md"
+ skill_path = codex_home / "skills" / "brier" / "SKILL.md"
skill_path.parent.mkdir(parents=True)
skill_path.write_text("skill")
monkeypatch.setenv("CODEX_HOME", str(codex_home))
monkeypatch.setattr(
- "farness.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
+ "brier.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
)
def fake_run(cmd, capture_output, text, check):
- assert cmd == ["codex", "mcp", "get", "farness"]
+ assert cmd == ["codex", "mcp", "get", "brier"]
return SimpleNamespace(returncode=0, stdout="", stderr="")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fake_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fake_run)
result = inspect_agent_setup("codex", python_bin="/tmp/python")
@@ -139,12 +139,12 @@ def test_inspect_agent_setup_skips_mcp_check_without_cli(monkeypatch, tmp_path):
target = tmp_path / "claude-skill"
target.mkdir(parents=True)
- monkeypatch.setattr("farness.agent_setup.shutil.which", lambda name: None)
+ monkeypatch.setattr("brier.agent_setup.shutil.which", lambda name: None)
def fail_run(*args, **kwargs): # pragma: no cover
raise AssertionError("subprocess.run should not be called when CLI is missing")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fail_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fail_run)
result = inspect_agent_setup("claude", target_dir=str(target), python_bin="/tmp/python")
@@ -156,32 +156,32 @@ def fail_run(*args, **kwargs): # pragma: no cover
def test_repair_agent_setup_rewrites_modified_skill(monkeypatch, tmp_path):
- target = tmp_path / "codex-home" / "skills" / "farness"
+ target = tmp_path / "codex-home" / "skills" / "brier"
target.mkdir(parents=True)
skill_path = target / "SKILL.md"
skill_path.write_text("modified")
monkeypatch.setenv("CODEX_HOME", str(tmp_path / "codex-home"))
monkeypatch.setattr(
- "farness.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
+ "brier.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
)
calls = []
def fake_run(cmd, capture_output, text, check):
calls.append(cmd)
- if cmd == ["codex", "mcp", "get", "farness"]:
+ if cmd == ["codex", "mcp", "get", "brier"]:
return SimpleNamespace(returncode=0, stdout="", stderr="")
return SimpleNamespace(returncode=0, stdout="", stderr="")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fake_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fake_run)
result = repair_agent_setup("codex", python_bin="/tmp/python")
assert result.skill_action == "updated"
assert result.mcp_action == "unchanged"
assert "Use this skill to turn vague decisions into forecastable choices." in skill_path.read_text()
- assert calls == [["codex", "mcp", "get", "farness"]]
+ assert calls == [["codex", "mcp", "get", "brier"]]
def test_remove_agent_setup_removes_skill_and_mcp(monkeypatch, tmp_path):
@@ -191,20 +191,20 @@ def test_remove_agent_setup_removes_skill_and_mcp(monkeypatch, tmp_path):
skill_path.write_text("skill")
monkeypatch.setattr(
- "farness.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
+ "brier.agent_setup.shutil.which", lambda name: f"/usr/bin/{name}"
)
calls = []
def fake_run(cmd, capture_output, text, check):
calls.append(cmd)
- if cmd == ["claude", "mcp", "get", "farness"]:
+ if cmd == ["claude", "mcp", "get", "brier"]:
return SimpleNamespace(returncode=0, stdout="", stderr="")
- if cmd == ["claude", "mcp", "remove", "farness"]:
+ if cmd == ["claude", "mcp", "remove", "brier"]:
return SimpleNamespace(returncode=0, stdout="", stderr="")
raise AssertionError(f"Unexpected command: {cmd}")
- monkeypatch.setattr("farness.agent_setup.subprocess.run", fake_run)
+ monkeypatch.setattr("brier.agent_setup.subprocess.run", fake_run)
result = remove_agent_setup("claude", target_dir=str(target))
@@ -212,6 +212,6 @@ def fake_run(cmd, capture_output, text, check):
assert result.mcp_removed is True
assert not skill_path.exists()
assert calls == [
- ["claude", "mcp", "get", "farness"],
- ["claude", "mcp", "remove", "farness"],
+ ["claude", "mcp", "get", "brier"],
+ ["claude", "mcp", "remove", "brier"],
]
diff --git a/tests/test_calibration.py b/tests/test_calibration.py
index 44b75ea..ab89c13 100644
--- a/tests/test_calibration.py
+++ b/tests/test_calibration.py
@@ -3,8 +3,8 @@
import pytest
from datetime import datetime
-from farness.framework import Decision, KPI, Option, Forecast
-from farness.calibration import CalibrationTracker, ForecastScore
+from brier.framework import Decision, KPI, Option, Forecast
+from brier.calibration import CalibrationTracker, ForecastScore
class TestForecastScore:
diff --git a/tests/test_cli.py b/tests/test_cli.py
index e23022f..c5f625b 100644
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -6,10 +6,10 @@
import pytest
-from farness.cli import main
-from farness.framework import Decision, KPI
-from farness.skills import default_skill_dir
-from farness.storage import DecisionStore
+from brier.cli import main
+from brier.framework import Decision, KPI
+from brier.skills import default_skill_dir
+from brier.storage import DecisionStore
@pytest.fixture
@@ -21,12 +21,12 @@ def temp_store():
class TestNewCommand:
- """Tests for `farness new` CLI command."""
+ """Tests for `brier new` CLI command."""
def test_new_creates_decision(self, temp_store):
- """farness new 'question' should create and save a decision."""
- with patch("farness.cli.DecisionStore", return_value=temp_store):
- with patch("sys.argv", ["farness", "new", "Should I take this job?"]):
+ """brier new 'question' should create and save a decision."""
+ with patch("brier.cli.DecisionStore", return_value=temp_store):
+ with patch("sys.argv", ["brier", "new", "Should I take this job?"]):
main()
decisions = temp_store.list_all()
@@ -34,11 +34,11 @@ def test_new_creates_decision(self, temp_store):
assert decisions[0].question == "Should I take this job?"
def test_new_with_context(self, temp_store):
- """farness new 'question' --context 'details' should include context."""
- with patch("farness.cli.DecisionStore", return_value=temp_store):
+ """brier new 'question' --context 'details' should include context."""
+ with patch("brier.cli.DecisionStore", return_value=temp_store):
with patch(
"sys.argv",
- ["farness", "new", "Which city?", "--context", "Considering SF vs NYC"],
+ ["brier", "new", "Which city?", "--context", "Considering SF vs NYC"],
):
main()
@@ -47,9 +47,9 @@ def test_new_with_context(self, temp_store):
assert decisions[0].context == "Considering SF vs NYC"
def test_new_prints_id(self, temp_store, capsys):
- """farness new should print the new decision ID."""
- with patch("farness.cli.DecisionStore", return_value=temp_store):
- with patch("sys.argv", ["farness", "new", "Test question"]):
+ """brier new should print the new decision ID."""
+ with patch("brier.cli.DecisionStore", return_value=temp_store):
+ with patch("sys.argv", ["brier", "new", "Test question"]):
main()
output = capsys.readouterr().out
@@ -58,17 +58,17 @@ def test_new_prints_id(self, temp_store, capsys):
assert decisions[0].id[:8] in output
def test_new_without_question_fails(self, capsys):
- """farness new without a question should fail."""
- with patch("sys.argv", ["farness", "new"]):
+ """brier new without a question should fail."""
+ with patch("sys.argv", ["brier", "new"]):
with pytest.raises(SystemExit):
main()
- def test_new_respects_farness_store_env(self, monkeypatch, tmp_path):
- """CLI commands should honor FARNESS_STORE_PATH when set."""
+ def test_new_respects_brier_store_env(self, monkeypatch, tmp_path):
+ """CLI commands should honor BRIER_STORE_PATH when set."""
store_path = tmp_path / "env-store.jsonl"
- monkeypatch.setenv("FARNESS_STORE_PATH", str(store_path))
+ monkeypatch.setenv("BRIER_STORE_PATH", str(store_path))
- with patch("sys.argv", ["farness", "new", "Env-backed decision"]):
+ with patch("sys.argv", ["brier", "new", "Env-backed decision"]):
main()
store = DecisionStore(store_path)
@@ -81,13 +81,13 @@ class TestShowWithPrefix:
"""Tests for prefix matching in show command."""
def test_show_finds_by_prefix(self, temp_store, capsys):
- """farness show should find decision by ID prefix."""
+ """brier show should find decision by ID prefix."""
d = Decision(question="Test decision for show")
temp_store.save(d)
prefix = d.id[:8]
- with patch("farness.cli.DecisionStore", return_value=temp_store):
- with patch("sys.argv", ["farness", "show", prefix]):
+ with patch("brier.cli.DecisionStore", return_value=temp_store):
+ with patch("sys.argv", ["brier", "show", prefix]):
main()
output = capsys.readouterr().out
@@ -109,8 +109,8 @@ def test_show_prints_kpi_resolution_metadata(self, temp_store, capsys):
)
temp_store.save(d)
- with patch("farness.cli.DecisionStore", return_value=temp_store):
- with patch("sys.argv", ["farness", "show", d.id[:8]]):
+ with patch("brier.cli.DecisionStore", return_value=temp_store):
+ with patch("sys.argv", ["brier", "show", d.id[:8]]):
main()
output = capsys.readouterr().out
@@ -124,11 +124,11 @@ class TestInstallSkillCommand:
"""Tests for the packaged skill installer."""
def test_install_skill_writes_codex_skill(self, tmp_path, capsys):
- """farness install-skill codex should create a SKILL.md file."""
+ """brier install-skill codex should create a SKILL.md file."""
target = tmp_path / "codex-skill"
with patch(
- "sys.argv", ["farness", "install-skill", "codex", "--target", str(target)]
+ "sys.argv", ["brier", "install-skill", "codex", "--target", str(target)]
):
main()
@@ -148,7 +148,7 @@ def test_install_skill_refuses_overwrite_without_force(self, tmp_path, capsys):
(target / "SKILL.md").write_text("different")
with patch(
- "sys.argv", ["farness", "install-skill", "claude", "--target", str(target)]
+ "sys.argv", ["brier", "install-skill", "claude", "--target", str(target)]
):
with pytest.raises(SystemExit):
main()
@@ -165,11 +165,11 @@ def test_install_skill_force_overwrites(self, tmp_path):
with patch(
"sys.argv",
- ["farness", "install-skill", "claude", "--target", str(target), "--force"],
+ ["brier", "install-skill", "claude", "--target", str(target), "--force"],
):
main()
- assert "Prefer the local `farness` MCP server" in skill_path.read_text()
+ assert "Prefer the local `brier` MCP server" in skill_path.read_text()
def test_codex_default_skill_dir_respects_codex_home(self, monkeypatch, tmp_path):
"""Default Codex install path should use CODEX_HOME when it is set."""
@@ -177,7 +177,7 @@ def test_codex_default_skill_dir_respects_codex_home(self, monkeypatch, tmp_path
skill_dir = default_skill_dir("codex")
- assert skill_dir == tmp_path / "codex-home" / "skills" / "farness"
+ assert skill_dir == tmp_path / "codex-home" / "skills" / "brier"
class TestSetupCommand:
@@ -186,39 +186,39 @@ class TestSetupCommand:
def test_setup_prints_success(self, capsys):
result = SimpleNamespace(
skill_path="/tmp/skill/SKILL.md",
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_already_configured=False,
agent_cli="codex",
python_bin="/tmp/python",
)
- with patch("farness.cli.setup_agent", return_value=result):
- with patch("sys.argv", ["farness", "setup", "codex"]):
+ with patch("brier.cli.setup_agent", return_value=result):
+ with patch("sys.argv", ["brier", "setup", "codex"]):
main()
output = capsys.readouterr().out
assert "Installed codex skill at /tmp/skill/SKILL.md" in output
- assert "Configured MCP server `farness` in codex using /tmp/python." in output
+ assert "Configured MCP server `brier` in codex using /tmp/python." in output
def test_setup_reports_existing_server(self, capsys):
result = SimpleNamespace(
skill_path="/tmp/skill/SKILL.md",
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_already_configured=True,
agent_cli="claude",
python_bin="/tmp/python",
)
- with patch("farness.cli.setup_agent", return_value=result):
- with patch("sys.argv", ["farness", "setup", "claude"]):
+ with patch("brier.cli.setup_agent", return_value=result):
+ with patch("sys.argv", ["brier", "setup", "claude"]):
main()
output = capsys.readouterr().out
- assert "MCP server `farness` is already configured in claude." in output
+ assert "MCP server `brier` is already configured in claude." in output
def test_setup_exits_on_runtime_error(self, capsys):
- with patch("farness.cli.setup_agent", side_effect=RuntimeError("boom")):
- with patch("sys.argv", ["farness", "setup", "codex"]):
+ with patch("brier.cli.setup_agent", side_effect=RuntimeError("boom")):
+ with patch("sys.argv", ["brier", "setup", "codex"]):
with pytest.raises(SystemExit):
main()
@@ -235,17 +235,17 @@ def test_uninstall_reports_removed_skill_and_mcp(self, capsys):
cli_path="/usr/local/bin/codex",
skill_path="/tmp/skill/SKILL.md",
skill_removed=True,
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_removed=True,
)
- with patch("farness.cli.remove_agent_setup", return_value=result):
- with patch("sys.argv", ["farness", "uninstall", "codex"]):
+ with patch("brier.cli.remove_agent_setup", return_value=result):
+ with patch("sys.argv", ["brier", "uninstall", "codex"]):
main()
output = capsys.readouterr().out
assert "Removed codex skill at /tmp/skill/SKILL.md" in output
- assert "Removed MCP server `farness` from codex." in output
+ assert "Removed MCP server `brier` from codex." in output
def test_uninstall_keep_mcp_reports_retained_server(self, capsys):
result = SimpleNamespace(
@@ -253,17 +253,17 @@ def test_uninstall_keep_mcp_reports_retained_server(self, capsys):
cli_path="/usr/local/bin/claude",
skill_path="/tmp/skill/SKILL.md",
skill_removed=False,
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_removed=False,
)
- with patch("farness.cli.remove_agent_setup", return_value=result):
- with patch("sys.argv", ["farness", "uninstall", "claude", "--keep-mcp"]):
+ with patch("brier.cli.remove_agent_setup", return_value=result):
+ with patch("sys.argv", ["brier", "uninstall", "claude", "--keep-mcp"]):
main()
output = capsys.readouterr().out
assert "No claude skill found" in output
- assert "Left MCP server `farness` configured." in output
+ assert "Left MCP server `brier` configured." in output
class TestDoctorCommand:
@@ -276,13 +276,13 @@ def test_doctor_reports_ready_status(self, capsys):
skill_path="/tmp/skill/SKILL.md",
skill_state="installed",
skill_installed=True,
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_configured=True,
- manual_command="codex mcp add farness -- /tmp/python -m farness.mcp_server",
+ manual_command="codex mcp add brier -- /tmp/python -m brier.mcp_server",
)
- with patch("farness.cli.inspect_agent_setup", return_value=result):
- with patch("sys.argv", ["farness", "doctor", "codex"]):
+ with patch("brier.cli.inspect_agent_setup", return_value=result):
+ with patch("sys.argv", ["brier", "doctor", "codex"]):
main()
output = capsys.readouterr().out
@@ -297,18 +297,18 @@ def test_doctor_recommends_setup_when_missing(self, capsys):
skill_path="/tmp/skill/SKILL.md",
skill_state="missing",
skill_installed=False,
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_configured=False,
- manual_command="claude mcp add --scope user farness -- /tmp/python -m farness.mcp_server",
+ manual_command="claude mcp add --scope user brier -- /tmp/python -m brier.mcp_server",
)
- with patch("farness.cli.inspect_agent_setup", return_value=result):
- with patch("sys.argv", ["farness", "doctor", "claude"]):
+ with patch("brier.cli.inspect_agent_setup", return_value=result):
+ with patch("sys.argv", ["brier", "doctor", "claude"]):
main()
output = capsys.readouterr().out
assert "Recommended next step:" in output
- assert "farness setup claude" in output
+ assert "brier setup claude" in output
def test_doctor_fix_reports_actions(self, capsys):
repaired = SimpleNamespace(
@@ -316,7 +316,7 @@ def test_doctor_fix_reports_actions(self, capsys):
cli_path="/usr/local/bin/codex",
skill_path="/tmp/skill/SKILL.md",
skill_action="updated",
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_action="configured",
python_bin="/tmp/python",
)
@@ -326,14 +326,14 @@ def test_doctor_fix_reports_actions(self, capsys):
skill_path="/tmp/skill/SKILL.md",
skill_state="installed",
skill_installed=True,
- mcp_server_name="farness",
+ mcp_server_name="brier",
mcp_configured=True,
- manual_command="codex mcp add farness -- /tmp/python -m farness.mcp_server",
+ manual_command="codex mcp add brier -- /tmp/python -m brier.mcp_server",
)
- with patch("farness.cli.repair_agent_setup", return_value=repaired):
- with patch("farness.cli.inspect_agent_setup", return_value=result):
- with patch("sys.argv", ["farness", "doctor", "codex", "--fix"]):
+ with patch("brier.cli.repair_agent_setup", return_value=repaired):
+ with patch("brier.cli.inspect_agent_setup", return_value=result):
+ with patch("sys.argv", ["brier", "doctor", "codex", "--fix"]):
main()
output = capsys.readouterr().out
diff --git a/tests/test_decision_usefulness.py b/tests/test_decision_usefulness.py
index ed5ea54..1bca6c3 100644
--- a/tests/test_decision_usefulness.py
+++ b/tests/test_decision_usefulness.py
@@ -2,7 +2,7 @@
import json
-from farness.experiments import decision_usefulness as du
+from brier.experiments import decision_usefulness as du
def test_seed_cases_are_well_formed():
@@ -16,7 +16,7 @@ def test_seed_cases_are_well_formed():
assert case.scenario
-def test_prompt_generation_covers_forecast_only_and_farness():
+def test_prompt_generation_covers_forecast_only_and_brier():
"""Prompts should reflect the intended mechanism split."""
case = du.get_decision_usefulness_case("auth_rewrite")
assert case is not None
@@ -25,9 +25,9 @@ def test_prompt_generation_covers_forecast_only_and_farness():
assert "80% confidence intervals" in forecast_only
assert "Do not explicitly cite cognitive biases" in forecast_only
- farness = du.generate_decision_usefulness_prompt(case, "farness")
- assert "outside-view base rates" in farness
- assert "review date" in farness
+ brier = du.generate_decision_usefulness_prompt(case, "farness")
+ assert "outside-view base rates" in brier
+ assert "review date" in brier
format_control = du.generate_decision_usefulness_prompt(case, "format_control")
assert "Goal" in format_control
@@ -122,7 +122,7 @@ def test_judge_pairwise_decision_usefulness_maps_winner(monkeypatch):
artifact_a = du.DecisionUsefulnessArtifact(
case_id=case.id,
- condition="farness",
+ condition="brier",
model="gpt-5.4",
run_number=1,
prompt="p1",
@@ -179,7 +179,7 @@ def fake_call_llm(prompt, model, temperature, max_tokens):
representation="normalized",
)
- assert result.winner_condition in {"farness", "naive"}
+ assert result.winner_condition in {"brier", "naive"}
assert result.winner_condition == result.left_condition
assert result.confidence == 81
assert result.scores_a["kpi_clarity"] == 5
@@ -266,7 +266,7 @@ def test_judge_pairwise_critique_survival_maps_less_undermined(monkeypatch):
artifact_a = du.DecisionUsefulnessArtifact(
case_id=case.id,
- condition="farness",
+ condition="brier",
model="gpt-5.4",
run_number=1,
prompt="p1",
@@ -317,7 +317,7 @@ def fake_call_llm(prompt, model, temperature, max_tokens):
assert prompts
assert "implementation fragility" in prompts[0]
assert "opportunity cost" in prompts[0]
- assert result.less_undermined_condition in {"farness", "forecast_only"}
+ assert result.less_undermined_condition in {"brier", "forecast_only"}
assert result.confidence == 84
assert "fragility" in result.most_damaging_critique_b
@@ -330,13 +330,13 @@ def test_summarize_decision_usefulness_judging_counts_wins():
source_model="gpt-5.4",
judge_model="claude-opus-4-6",
run_number=1,
- comparison="farness_vs_forecast_only",
+ comparison="brier_vs_forecast_only",
representation="normalized",
- condition_a="farness",
+ condition_a="brier",
condition_b="forecast_only",
- left_condition="farness",
+ left_condition="brier",
right_condition="forecast_only",
- winner_condition="farness",
+ winner_condition="brier",
confidence=80,
rationale="",
scores_a={},
@@ -347,11 +347,11 @@ def test_summarize_decision_usefulness_judging_counts_wins():
source_model="gpt-5.4",
judge_model="claude-opus-4-6",
run_number=2,
- comparison="farness_vs_forecast_only",
+ comparison="brier_vs_forecast_only",
representation="normalized",
- condition_a="farness",
+ condition_a="brier",
condition_b="forecast_only",
- left_condition="farness",
+ left_condition="brier",
right_condition="forecast_only",
winner_condition="forecast_only",
confidence=70,
@@ -366,11 +366,11 @@ def test_summarize_decision_usefulness_judging_counts_wins():
source_model="gpt-5.4",
judge_model="claude-opus-4-6",
run_number=1,
- comparison="farness_vs_forecast_only",
+ comparison="brier_vs_forecast_only",
representation="normalized",
- condition_a="farness",
+ condition_a="brier",
condition_b="forecast_only",
- left_condition="farness",
+ left_condition="brier",
right_condition="forecast_only",
more_serious_omission_condition="forecast_only",
confidence=65,
@@ -385,13 +385,13 @@ def test_summarize_decision_usefulness_judging_counts_wins():
source_model="gpt-5.4",
judge_model="claude-opus-4-6",
run_number=1,
- comparison="farness_vs_forecast_only",
+ comparison="brier_vs_forecast_only",
representation="normalized",
- condition_a="farness",
+ condition_a="brier",
condition_b="forecast_only",
- left_condition="farness",
+ left_condition="brier",
right_condition="forecast_only",
- less_undermined_condition="farness",
+ less_undermined_condition="brier",
confidence=75,
rationale="",
most_damaging_critique_a="timing",
@@ -404,14 +404,14 @@ def test_summarize_decision_usefulness_judging_counts_wins():
omission_results,
critique_results,
)
- assert summary["utility"]["normalized"]["farness_vs_forecast_only"]["wins"]["farness"] == 1
- assert summary["utility"]["normalized"]["farness_vs_forecast_only"]["wins"]["forecast_only"] == 1
+ assert summary["utility"]["normalized"]["brier_vs_forecast_only"]["wins"]["brier"] == 1
+ assert summary["utility"]["normalized"]["brier_vs_forecast_only"]["wins"]["forecast_only"] == 1
assert (
- summary["omission"]["normalized"]["farness_vs_forecast_only"]["flagged_more_serious"]["forecast_only"]
+ summary["omission"]["normalized"]["brier_vs_forecast_only"]["flagged_more_serious"]["forecast_only"]
== 1
)
assert (
- summary["critique_survival"]["normalized"]["farness_vs_forecast_only"]["less_undermined"]["farness"]
+ summary["critique_survival"]["normalized"]["brier_vs_forecast_only"]["less_undermined"]["brier"]
== 1
)
@@ -422,7 +422,7 @@ def test_run_decision_usefulness_judging_can_select_critique_only(tmp_path, monk
assert case is not None
for condition, recommendation in (
- ("farness", "Patch incrementally."),
+ ("brier", "Patch incrementally."),
("naive", "Rewrite now."),
):
artifact = du.DecisionUsefulnessArtifact(
@@ -460,7 +460,7 @@ def fake_call_llm(prompt, model, temperature, max_tokens):
utility_results, omission_results, critique_results = du.run_decision_usefulness_judging(
output_dir=tmp_path,
cases=[case],
- comparisons=[("farness", "naive")],
+ comparisons=[("brier", "naive")],
representations=["decision_memo"],
judge_tasks=["critique_survival"],
verbose=False,
diff --git a/tests/test_experiments.py b/tests/test_experiments.py
index a81677b..07e7b3b 100644
--- a/tests/test_experiments.py
+++ b/tests/test_experiments.py
@@ -2,9 +2,9 @@
import pytest
-from farness.experiments.cases import DecisionCase, get_all_cases, get_case
-from farness.experiments.scorer import ResponseScorer, ResponseScore
-from farness.experiments.runner import generate_prompt
+from brier.experiments.cases import DecisionCase, get_all_cases, get_case
+from brier.experiments.scorer import ResponseScorer, ResponseScore
+from brier.experiments.runner import generate_prompt
class TestCases:
@@ -56,100 +56,100 @@ def scorer(self, hiring_case) -> ResponseScorer:
def test_detects_confidence_interval_dash(self, scorer):
"""Should detect CI with dash format."""
response = "I estimate 80-90% success rate."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.has_confidence_interval
def test_detects_confidence_interval_to(self, scorer):
"""Should detect CI with 'to' format."""
response = "Success probability: 70% to 85%."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.has_confidence_interval
def test_detects_confidence_interval_explicit(self, scorer):
"""Should detect explicit CI language."""
response = "With an 80% confidence interval, I predict..."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.has_confidence_interval
def test_no_ci_in_simple_response(self, scorer):
"""Should not detect CI in simple response."""
response = "I recommend option A because it seems better."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert not score.has_confidence_interval
def test_detects_accountability(self, scorer):
"""Should detect accountability mechanisms."""
response = "Set a review date for 6 months from now to check outcomes."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.has_accountability
def test_detects_accountability_follow_up(self, scorer):
"""Should detect follow-up language."""
response = "I recommend following up in 3 months to measure results."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.has_accountability
def test_no_accountability_in_simple_response(self, scorer):
"""Should not detect accountability in simple response."""
response = "Go with option B, it's clearly better."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert not score.has_accountability
def test_detects_base_rate_explicit(self, scorer):
"""Should detect explicit base rate language."""
response = "Research shows that structured interviews are better."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.cites_base_rate
def test_detects_base_rate_statistics(self, scorer):
"""Should detect statistical base rates."""
response = "Studies show unstructured interviews have r=0.38 validity."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.cites_base_rate
def test_no_base_rate_in_opinion(self, scorer):
"""Should not detect base rate in pure opinion."""
response = "I think chemistry is important in hiring."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert not score.cites_base_rate
def test_detects_similarity_bias(self, scorer):
"""Should detect similarity bias."""
response = "Watch out for similarity bias - you might favor people like yourself."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert "similarity bias" in score.biases_found
def test_detects_multiple_biases(self, scorer):
"""Should detect multiple biases."""
response = "This shows similarity bias and halo effect."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert len(score.biases_found) >= 2
def test_bias_count_matches_list(self, scorer):
"""Bias count should match length of biases_found."""
response = "Watch for similarity bias and affinity bias."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.bias_count == len(score.biases_found)
def test_detects_quantified_tradeoffs(self, scorer):
"""Should detect quantified tradeoffs."""
response = "Expected value of A is 7.2 vs B at 6.8."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.quantifies_tradeoffs
def test_detects_percentage_comparison(self, scorer):
"""Should detect percentage comparisons."""
response = "Option A has 75% vs 60% success rate."
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.quantifies_tradeoffs
def test_score_has_all_fields(self, scorer):
"""Score should have all required fields."""
response = "Test response"
- score = scorer.score(response, "farness", 1)
+ score = scorer.score(response, "brier", 1)
assert score.case_id == "hiring_chemistry"
- assert score.condition == "farness"
+ assert score.condition == "brier"
assert score.run_number == 1
assert isinstance(score.cites_base_rate, bool)
assert isinstance(score.bias_count, int)
@@ -174,10 +174,10 @@ def test_naive_prompt_is_simple(self, case):
assert "framework" not in prompt.lower()
assert case.scenario.strip()[:50] in prompt
- def test_farness_prompt_has_framework(self, case):
- """Farness prompt should include framework instructions."""
+ def test_brier_prompt_has_framework(self, case):
+ """Brier prompt should include framework instructions."""
prompt = generate_prompt(case, "farness")
- assert "farness" in prompt.lower()
+ assert "brier" in prompt.lower()
assert "KPI" in prompt or "kpi" in prompt.lower()
assert "confidence interval" in prompt.lower()
assert "base rate" in prompt.lower()
@@ -186,12 +186,12 @@ def test_farness_prompt_has_framework(self, case):
def test_both_prompts_contain_scenario(self, case):
"""Both prompts should contain the scenario."""
naive = generate_prompt(case, "naive")
- farness = generate_prompt(case, "farness")
+ brier = generate_prompt(case, "brier")
# Check first 50 chars of scenario appear
scenario_start = case.scenario.strip()[:50]
assert scenario_start in naive
- assert scenario_start in farness
+ assert scenario_start in brier
class TestResponseScoreDict:
@@ -201,7 +201,7 @@ def test_to_dict_roundtrip(self):
"""Should serialize and contain all fields."""
score = ResponseScore(
case_id="test",
- condition="farness",
+ condition="brier",
run_number=1,
correct_recommendation=True,
cites_base_rate=True,
@@ -215,7 +215,7 @@ def test_to_dict_roundtrip(self):
d = score.to_dict()
assert d["case_id"] == "test"
- assert d["condition"] == "farness"
+ assert d["condition"] == "brier"
assert d["correct_recommendation"] is True
assert d["bias_count"] == 2
assert "similarity bias" in d["biases_found"]
diff --git a/tests/test_framework.py b/tests/test_framework.py
index 9932b30..9b39342 100644
--- a/tests/test_framework.py
+++ b/tests/test_framework.py
@@ -3,7 +3,7 @@
import pytest
from datetime import datetime
-from farness.framework import Decision, KPI, Option, Forecast
+from brier.framework import Decision, KPI, Option, Forecast
class TestKPI:
diff --git a/tests/test_llm_retry.py b/tests/test_llm_retry.py
index 7dadb83..172ce6a 100644
--- a/tests/test_llm_retry.py
+++ b/tests/test_llm_retry.py
@@ -1,6 +1,6 @@
"""Tests for shared LLM retry behavior."""
-from farness.experiments import llm
+from brier.experiments import llm
def test_retryable_error_detection():
diff --git a/tests/test_market.py b/tests/test_market.py
index 2a007e0..eefebc2 100644
--- a/tests/test_market.py
+++ b/tests/test_market.py
@@ -3,14 +3,14 @@
from pathlib import Path
from unittest.mock import patch
-from farness.cli import main
-from farness.framework import Decision, Forecast, KPI, Option
-from farness.market import (
+from brier.cli import main
+from brier.framework import Decision, Forecast, KPI, Option
+from brier.market import (
MarketSource,
draft_binary_policy_market,
draft_markets_for_decision,
)
-from farness.storage import DecisionStore
+from brier.storage import DecisionStore
def test_binary_policy_market_draft_has_manifold_payload():
@@ -79,7 +79,7 @@ def test_market_draft_cli_for_standalone_question_outputs_json(capsys):
with patch(
"sys.argv",
[
- "farness",
+ "brier",
"market-draft",
"Will Waymo be legally permitted to offer driverless paid robotaxi rides in DC by 2026-12-31?",
"--initial-prob",
@@ -99,7 +99,7 @@ def test_forecast_draft_cli_alias_outputs_json(capsys):
with patch(
"sys.argv",
[
- "farness",
+ "brier",
"forecast-draft",
"Will Waymo be legally permitted to offer driverless paid robotaxi rides in DC by 2026-12-31?",
"--initial-prob",
@@ -145,11 +145,11 @@ def test_market_draft_cli_for_decision_writes_file(tmp_path, capsys):
store.save(decision)
output_path = tmp_path / "drafts.json"
- with patch("farness.cli.DecisionStore", return_value=store):
+ with patch("brier.cli.DecisionStore", return_value=store):
with patch(
"sys.argv",
[
- "farness",
+ "brier",
"market-draft",
decision.id[:8],
"--output",
@@ -197,4 +197,4 @@ def test_waymo_example_uses_aggregate_safety_outcomes():
in market["description_markdown"]
for market in safety_markets
)
- assert all("Drafted by farness" not in market["description_markdown"] for market in safety_markets)
+ assert all("Drafted by brier" not in market["description_markdown"] for market in safety_markets)
diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py
index f17434d..7863785 100644
--- a/tests/test_mcp_server.py
+++ b/tests/test_mcp_server.py
@@ -1,4 +1,4 @@
-"""Tests for the farness MCP server helpers."""
+"""Tests for the brier MCP server helpers."""
from __future__ import annotations
@@ -6,14 +6,14 @@
from tempfile import TemporaryDirectory
from types import SimpleNamespace
-from farness.framework import Decision
-from farness.mcp_server import (
+from brier.framework import Decision
+from brier.mcp_server import (
_parse_datetime,
draft_market_pack_for_input,
save_decision_analysis,
score_decision_outcomes,
)
-from farness.storage import DecisionStore
+from brier.storage import DecisionStore
def _forecast(**overrides):
diff --git a/tests/test_paper_content.py b/tests/test_paper_content.py
index c03edb1..a69b7cd 100644
--- a/tests/test_paper_content.py
+++ b/tests/test_paper_content.py
@@ -29,22 +29,22 @@ def test_convergence_reframe_present():
), "Missing convergence reframe language"
-# --- Task #8: Introduce farness properly ---
+# --- Task #8: Introduce brier properly ---
-def test_introduce_farness_language():
- """Paper should say 'I introduce farness' not 'I evaluate a specific framework called'."""
+def test_introduce_brier_language():
+ """Paper should say 'I introduce Brier' not 'I evaluate a specific framework called'."""
text = _read_paper()
- assert "I introduce farness" in text, "Missing 'I introduce farness'"
+ assert re.search(r"I introduce \*{0,2}Brier", text), "Missing 'I introduce Brier'"
assert (
"I evaluate a specific framework called" not in text
), "Old framework intro language still present"
-def test_farness_ai_footnote():
- """Paper should have a footnote referencing farness.ai."""
+def test_brier_ai_footnote():
+ """Paper should have a footnote referencing brier.institute."""
text = _read_paper()
- assert "farness.ai" in text, "Missing farness.ai footnote"
+ assert "brier.institute" in text, "Missing brier.institute footnote"
# --- Task #9: Concrete example ---
@@ -62,10 +62,10 @@ def test_concrete_example_section():
def test_sycophancy_gpt_numbers():
- """Paper should report GPT sycophancy numbers: 466.7 naive, 108.3 farness."""
+ """Paper should report GPT sycophancy numbers: 466.7 naive, 108.3 brier."""
text = _read_paper()
assert "466.7" in text, "Missing GPT naive sycophancy mean (466.7)"
- assert "108.3" in text, "Missing GPT farness sycophancy mean (108.3)"
+ assert "108.3" in text, "Missing GPT brier sycophancy mean (108.3)"
# --- Task #11: Technical fixes ---
@@ -115,7 +115,7 @@ def test_prompt_probe_confound_in_discussion():
def test_held_out_probe_result_present():
- """Paper should report that held-out / off-framework probes weaken or reverse the farness advantage."""
+ """Paper should report that held-out / off-framework probes weaken or reverse the brier advantage."""
text = _read_paper()
assert re.search(
r"off-framework|held-out probes", text, re.IGNORECASE
diff --git a/tests/test_reframing.py b/tests/test_reframing.py
index 588181a..a07d92a 100644
--- a/tests/test_reframing.py
+++ b/tests/test_reframing.py
@@ -1,7 +1,7 @@
"""Tests for the reframing experiment module."""
import pytest
-from farness.experiments.reframing import (
+from brier.experiments.reframing import (
ReframingCase,
ReframingResult,
REFRAMING_CASES,
@@ -93,27 +93,27 @@ def test_naive_more_reframing(self):
results = [
self._make_result("case1", "naive", 5, True),
self._make_result("case2", "naive", 3, True),
- self._make_result("case1", "farness", 1, False),
- self._make_result("case2", "farness", 0, False),
+ self._make_result("case1", "brier", 1, False),
+ self._make_result("case2", "brier", 0, False),
]
analysis = analyze_reframing(results)
- assert analysis["naive"]["mean_reframe_count"] > analysis["farness"]["mean_reframe_count"]
- assert analysis["naive"]["challenged_framing_rate"] > analysis["farness"]["challenged_framing_rate"]
+ assert analysis["naive"]["mean_reframe_count"] > analysis["brier"]["mean_reframe_count"]
+ assert analysis["naive"]["challenged_framing_rate"] > analysis["brier"]["challenged_framing_rate"]
def test_equal_reframing(self):
results = [
self._make_result("case1", "naive", 2, True),
- self._make_result("case1", "farness", 2, True),
+ self._make_result("case1", "brier", 2, True),
]
analysis = analyze_reframing(results)
- assert analysis["naive"]["mean_reframe_count"] == analysis["farness"]["mean_reframe_count"]
+ assert analysis["naive"]["mean_reframe_count"] == analysis["brier"]["mean_reframe_count"]
def test_per_case_breakdown(self):
results = [
self._make_result("case1", "naive", 3, True),
- self._make_result("case1", "farness", 1, False),
+ self._make_result("case1", "brier", 1, False),
self._make_result("case2", "naive", 5, True),
- self._make_result("case2", "farness", 2, False),
+ self._make_result("case2", "brier", 2, False),
]
analysis = analyze_reframing(results)
assert "case1" in analysis["by_case"]
@@ -140,7 +140,7 @@ def _make_result(self, case_id, condition, reframe_count, challenged):
def test_produces_markdown(self):
results = [
self._make_result("case1", "naive", 3, True),
- self._make_result("case1", "farness", 1, False),
+ self._make_result("case1", "brier", 1, False),
]
table = summary_table(results)
assert "## Reframing experiment results" in table
diff --git a/tests/test_skills.py b/tests/test_skills.py
index a4a8b74..867f645 100644
--- a/tests/test_skills.py
+++ b/tests/test_skills.py
@@ -2,7 +2,7 @@
from __future__ import annotations
-from farness.skills import inspect_skill, remove_skill
+from brier.skills import inspect_skill, remove_skill
def test_inspect_skill_reports_missing(tmp_path):
diff --git a/tests/test_stability.py b/tests/test_stability.py
index 8b497b2..4cdab64 100644
--- a/tests/test_stability.py
+++ b/tests/test_stability.py
@@ -2,7 +2,7 @@
import pytest
-from farness.experiments.stability import (
+from brier.experiments.stability import (
DEFAULT_PROBE_BATTERY,
QuantitativeCase,
StabilityResult,
@@ -12,7 +12,7 @@
generate_estimate_only_prompt,
generate_format_control_prompt,
generate_naive_prompt,
- generate_farness_prompt,
+ generate_brier_prompt,
generate_probe_prompt,
get_all_stability_cases,
get_primary_stability_cases,
@@ -176,10 +176,10 @@ def test_naive_prompt_is_simple(self, case):
assert "framework" not in prompt.lower()
assert case.estimate_question in prompt
- def test_farness_prompt_has_framework(self, case):
- """Farness prompt should include framework."""
- prompt = generate_farness_prompt(case)
- assert "farness" in prompt.lower()
+ def test_brier_prompt_has_framework(self, case):
+ """Brier prompt should include framework."""
+ prompt = generate_brier_prompt(case)
+ assert "brier" in prompt.lower()
assert "base rate" in prompt.lower()
assert "confidence interval" in prompt.lower()
@@ -190,12 +190,12 @@ def test_estimate_only_prompt_avoids_framework_language(self, case):
assert "framework" not in prompt.lower()
assert "0-100 rather than 0-1" in prompt
- def test_format_control_prompt_is_structured_without_farness(self, case):
+ def test_format_control_prompt_is_structured_without_brier(self, case):
"""Formatting-only control should preserve structure without framework content."""
prompt = generate_format_control_prompt(case)
assert "four-part structure" in prompt.lower()
assert "do not use any named decision framework" in prompt.lower()
- assert "farness" not in prompt.lower()
+ assert "brier" not in prompt.lower()
def test_probe_prompt_includes_initial_estimate(self, case):
"""Probe prompt should reference initial estimate."""
@@ -232,7 +232,7 @@ def result_with_update(self) -> StabilityResult:
def result_with_ci(self) -> StabilityResult:
return StabilityResult(
case_id="test",
- condition="farness",
+ condition="brier",
initial_estimate=10.0,
initial_ci_low=5.0,
initial_ci_high=15.0,
@@ -302,10 +302,10 @@ def experiment(self) -> StabilityExperiment:
final_response_text="",
))
- # Add farness result: smaller update
+ # Add brier result: smaller update
exp.results.append(StabilityResult(
case_id="planning_estimate",
- condition="farness",
+ condition="brier",
initial_estimate=5.0,
initial_ci_low=3.0,
initial_ci_high=7.0,
@@ -322,29 +322,29 @@ def test_analyze_returns_metrics(self, experiment):
"""Should return analysis metrics."""
analysis = experiment.analyze()
assert "n_naive" in analysis
- assert "n_farness" in analysis
+ assert "n_brier" in analysis
assert "naive" in analysis
- assert "farness" in analysis
+ assert "brier" in analysis
assert analysis["comparison_metric"] == "relative_update"
def test_naive_has_larger_update(self, experiment):
"""Naive should have larger update in our mock data."""
analysis = experiment.analyze()
naive_update = analysis["naive"]["mean_update_magnitude"]
- farness_update = analysis["farness"]["mean_update_magnitude"]
- assert naive_update > farness_update
+ brier_update = analysis["brier"]["mean_update_magnitude"]
+ assert naive_update > brier_update
- def test_farness_has_higher_ci_rate(self, experiment):
- """Farness should have higher initial CI rate."""
+ def test_brier_has_higher_ci_rate(self, experiment):
+ """Brier should have higher initial CI rate."""
analysis = experiment.analyze()
- assert analysis["farness"]["initial_ci_rate"] == 1.0
+ assert analysis["brier"]["initial_ci_rate"] == 1.0
assert analysis["naive"]["initial_ci_rate"] == 0.0
def test_convergence_measured(self, experiment):
"""Should measure convergence."""
analysis = experiment.analyze()
assert "convergence" in analysis
- # Naive went from 4→6, farness initial was 5
+ # Naive went from 4→6, brier initial was 5
# Initial gap: |4-5| = 1, Final gap: |6-5| = 1
# Convergence ratio: 1 - 1/1 = 0
@@ -353,7 +353,7 @@ def test_summary_table_generated(self, experiment):
table = experiment.summary_table()
assert "Stability-under-probing results" in table
assert "Naive" in table
- assert "Farness" in table
+ assert "Brier" in table
assert "Primary pooled comparison metric" in table
def test_mixed_effects_uses_relative_update(self, experiment):
@@ -376,7 +376,7 @@ def test_analyze_groups_multiple_probe_batteries(self):
),
StabilityResult(
case_id="planning_estimate",
- condition="farness",
+ condition="brier",
probe_battery="on_framework",
initial_estimate=5.0,
final_estimate=5.5,
@@ -390,7 +390,7 @@ def test_analyze_groups_multiple_probe_batteries(self):
),
StabilityResult(
case_id="planning_estimate",
- condition="farness",
+ condition="brier",
probe_battery="off_framework",
initial_estimate=5.0,
final_estimate=5.2,
diff --git a/tests/test_storage.py b/tests/test_storage.py
index a8f402b..c5b9fe3 100644
--- a/tests/test_storage.py
+++ b/tests/test_storage.py
@@ -5,8 +5,8 @@
from pathlib import Path
from datetime import datetime, timedelta
-from farness.framework import Decision, KPI, Option, Forecast
-from farness.storage import DecisionStore
+from brier.framework import Decision, KPI, Option, Forecast
+from brier.storage import DecisionStore
@pytest.fixture
diff --git a/universe.html b/universe.html
new file mode 100644
index 0000000..cc4ccce
--- /dev/null
+++ b/universe.html
@@ -0,0 +1,318 @@
+
+
+
+
+
+
+Axiom · Thesis universe
+
+ AXIOM FOUNDATION
+
+
+
+the rules — computable law for all
+
+
+
+
+standard
+
+RuleSpec
+
+Machine-readable rules — cited, time-aware, executable. The standard everything downstream runs on.
+
+
+
+encode
+
+Encoder
+
+AI-assisted encoding of source law into RuleSpec — reviewed, cited back to the statute.
+
+
+
+ingest
+
+Corpus
+
+Scrapers pulling source statutes, regulations, and rulings — the raw legal record.
+
+ thesis institute
+
+
+the forecasts — open, agentic predictions
+
+
+
+
+
+the forecaster
+
+Brier-1
+
+A general-purpose, calibration-native prediction agent — used the way today's LMs are.
+
+ decisions
+ api
+ anywhere
+
+
+
+
+the proving ground
+
+Thesis
+
+The open, continuously-scored forecast platform — the agent's public scoreboard and training ground.
+
+ docket
+ ledger
+
+
+
+
+
+mechanism
+
+PolicyEngine
+
+Microsimulation — applies the rules to populations for tax, benefit, and distributional outcomes.
+
+
+
+populations
+
+Microplex
+
+Calibrated synthetic populations — the substrate simulation and forecasting run on.