Rolling vector search of UK parliamentary debates for declarations of interest, published via GitHub Pages.
Parliamentary rules require MPs to declare relevant financial interests when speaking in debates. This project uses vector similarity search to find speeches that mention the Register of Members' Financial Interests, then uses an LLM agent to evaluate whether the declaration meets the clarity requirements.
The pipeline has three stages:
- Vector search: mini-transcript-search downloads XML transcripts from TheyWorkForYou, computes sentence embeddings, and finds speeches similar to a set of reference phrases about declaring interests.
- Agent evaluation: Speeches flagged by the vector search are passed to an OpenAI GPT-4o agent (agent_refine.py) that determines whether each speech actually contains a declaration, and whether the declaration clearly states the nature of the interest (as opposed to a vague reference to the register).
- Publishing: Results are rendered as a Jekyll site and deployed to GitHub Pages, with a daily-updated list of possible mentions from the last 30 days.
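The vector search stage can be pictured with a minimal sketch. It assumes each speech and each reference phrase has already been turned into an embedding vector, and flags a speech when its best cosine similarity to any reference phrase reaches a threshold. The names `cosine_similarity` and `flag_speeches`, the toy vectors, and the threshold value are illustrative assumptions, not the interface of mini-transcript-search or search.py.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def flag_speeches(speech_vectors, reference_vectors, threshold):
    """Return (speech_id, best_score) for speeches at or above the threshold."""
    flagged = []
    for speech_id, vector in speech_vectors.items():
        # Score each speech by its closest reference phrase
        best = max(cosine_similarity(vector, ref) for ref in reference_vectors)
        if best >= threshold:
            flagged.append((speech_id, best))
    # Highest-scoring candidates first
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional vectors (hypothetical); real sentence embeddings are much larger.
reference_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
speech_vectors = {
    "speech-a": [0.9, 0.1, 0.0],
    "speech-b": [0.0, 0.0, 1.0],
}
flagged = flag_speeches(speech_vectors, reference_vectors, threshold=0.8)
```

Here only speech-a would be flagged and passed on for agent evaluation, since speech-b is orthogonal to both reference vectors.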
src/regmem_vector_search/
├── search.py # Vector search logic, result caching, text processing
├── agent_refine.py # LLM agent for evaluating declaration clarity
├── config.py # Settings (API keys via .env)
└── __main__.py # CLI entry point
notebooks/
├── infer_last_month.ipynb # Daily report: last 30 days of declarations
├── infer_last_year.ipynb # One-off: bulk search over a year
└── split_last_year.ipynb # Splits yearly results into per-MP pages
docs/ # Jekyll site published to GitHub Pages
Three layers of caching avoid redundant API calls and computation:
- Transcript embeddings: Per-day parquet files stored in data/parlparse_xmls/. Once a day's transcripts are embedded, they are reused on subsequent runs.
- Search results: The full SearchResult from each query is cached in data/regmem_vector_search/search_results.sqlite using PydanticDBM, keyed by date range and threshold. Same-day re-runs skip loading parquets and computing cosine similarity.
- Agent declarations: Per-speech LLM evaluations are cached in data/regmem_vector_search/interest_declarations.sqlite. Already-evaluated speeches are not re-sent to the API.
In GitHub Actions, data/parlparse_xmls/ and data/regmem_vector_search/ are preserved across runs using actions/cache.
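The per-speech caching idea can be sketched as follows. This is an illustration using the standard-library sqlite3 module rather than PydanticDBM, whose API is not shown here; `DeclarationCache`, `evaluate_with_cache`, and `fake_evaluate` are hypothetical names, and the in-memory database stands in for the real .sqlite file.

```python
import json
import sqlite3


class DeclarationCache:
    """Tiny key-value cache on SQLite; values are stored as JSON text."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.conn.commit()


def evaluate_with_cache(cache, speech_id, text, evaluate):
    """Return the cached evaluation, calling `evaluate` only on a cache miss."""
    cached = cache.get(speech_id)
    if cached is not None:
        return cached
    result = evaluate(text)
    cache.set(speech_id, result)
    return result


calls = []


def fake_evaluate(text):
    # Stand-in for the LLM call (hypothetical); records each invocation.
    calls.append(text)
    return {"has_declaration": "register" in text.lower()}


cache = DeclarationCache()
speech = "I refer to my entry in the register."
first = evaluate_with_cache(cache, "speech-1", speech, fake_evaluate)
second = evaluate_with_cache(cache, "speech-1", speech, fake_evaluate)
```

The second call is served from the cache, so the stand-in evaluator runs only once. The same pattern would apply to search results if the key were built from the date range and threshold instead of a speech ID.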
This project uses a devcontainer. To run locally:
- Clone with submodules: git clone --recurse-submodules
- Create a .env file with:
  OPENAI_APIKEY=sk-...
  HF_TOKEN=hf_...
- Open in VS Code with the Dev Containers extension, or build with docker compose build.
# Run the daily search notebook
notebook render search --publish
# Run tests
script/test
# Lint
script/lint
The build_and_publish workflow runs daily at 08:00 UTC. It:
- Restores cached data from prior runs
- Runs infer_last_month.ipynb inside the devcontainer
- Commits any updated output to the repo
- Builds the Jekyll site and deploys to GitHub Pages
- Sends a Slack notification on success or failure
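Because the workflow runs daily, the 30-day window it reports on moves forward one day per run. A minimal sketch of that rolling window, assuming it includes the run date itself (the notebook's exact boundaries are not stated here; `rolling_window` and `days_in_window` are hypothetical helpers):

```python
from datetime import date, timedelta


def rolling_window(today: date, days: int = 30) -> tuple[date, date]:
    """Return (start, end) covering the `days` days up to and including `today`."""
    return today - timedelta(days=days - 1), today


def days_in_window(start: date, end: date) -> list[date]:
    """List every calendar day from start to end, inclusive."""
    return [start + timedelta(days=offset) for offset in range((end - start).days + 1)]


start, end = rolling_window(date(2025, 3, 1))
window = days_in_window(start, end)
```

Each day in the window maps to at most one per-day embeddings file, so on a typical run only the newest day needs to be downloaded and embedded.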