Systemfehler is a modular, extensible, preservation-oriented data platform for social services in Germany. It collects, normalizes, and preserves information about benefits, aid programs, tools, organizations, and related support structures.
The goal is to make information about social rights and support more transparent, accessible, and robust against removal or silent change.
For a high-level German overview for non-technical readers, see docs/ueberblick.md.
- Real data is published for all five domains (benefits, aid, tools, organizations, contacts).
- Current validated dataset size: 1006 entries (benefits 5, aid 140, tools 127, organizations 379, contacts 355).
- Validation status (latest full pass): 0 schema/structural errors, 994 lint warnings (node scripts/validate_entries.js --fail-on-errors=false; warnings come mainly from missing Easy German translations on newer promoted entries).
- Public frontend and API are live on Cloudflare Pages at https://systemfehler.pages.dev/.
- Production AI retrieval validation (suggested life-event suite): 60/60 passed (2026-05-03).
- Modular domain structure: Each domain (e.g. benefits, aid, tools, organizations, contacts) has its own crawler and data files, following a common pattern.
- Schema-driven data: A core schema defines stable, cross-domain fields. Extension schemas capture domain-specific fields. All entries are validated against these schemas.
- Temporal modeling: Entries contain temporal fields (e.g. validity intervals, deadlines, status). Historical versions are archived, so changes over time remain observable.
- Multilingual support: Text fields support multiple languages (initially German, English, and Easy German). Translations are preserved even if removed from the original sources.
- Human-in-the-loop moderation: Crawler output is never published directly. All changes go into a moderation queue, with diffs and provenance information. Moderators approve or reject changes, and an audit log records decisions.
- Quality and AI searchability scores: Entries receive Information Quality and AI Searchability scores to help detect incomplete or outdated data and support downstream ranking and analysis.
- LLM-ready structure: Data is stored in a structured, explicit format that supports retrieval-augmented generation, question answering, and future AI-based advisory tools.
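The moderation flow from the list above (crawler output goes into a queue with a diff and provenance, a moderator decides, an audit log records the decision) can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the queue item layout, and the field names (`id`, `status`, `deadline`, `source_url`) are all assumptions made for the example.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def diff_entries(published: dict, candidate: dict) -> dict:
    """Field-level diff: {field: {"old": ..., "new": ...}} for every changed field.

    A missing field and a field set to None are treated as the same thing.
    """
    changes = {}
    for key in sorted(set(published) | set(candidate)):
        old, new = published.get(key), candidate.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes


def propose_change(queue: list, published: dict, candidate: dict,
                   source_url: str) -> Optional[dict]:
    """Queue a pending review item for crawler output; `published` is never modified."""
    changes = diff_entries(published, candidate)
    if not changes:
        return None  # nothing to review
    item = {
        "entry_id": candidate["id"],  # hypothetical identifier field
        "status": "pending",
        "diff": changes,
        "provenance": {
            "source_url": source_url,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        },
    }
    queue.append(item)
    return item


def decide(item: dict, published: dict, approve: bool,
           moderator: str, audit_log: list) -> dict:
    """Return the entry after the decision and append one JSON line to the audit log."""
    result = dict(published)  # work on a copy; the caller decides what to persist
    if approve:
        for field, change in item["diff"].items():
            if change["new"] is None:
                result.pop(field, None)
            else:
                result[field] = change["new"]
    item["status"] = "approved" if approve else "rejected"
    audit_log.append(json.dumps({
        "entry_id": item["entry_id"],
        "decision": item["status"],
        "moderator": moderator,
    }))
    return result
```

Keeping `decide` free of side effects on the published entry means a rejected item leaves the data untouched, which matches the rule that crawler output is never published directly.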
A typical layout looks like this:
data/
  _schemas/
  _taxonomy/
  _sources/
  _quality/
  benefits/
  aid/
  tools/
  organizations/
  contacts/
services/
  benefits/
  aid/
  tools/
  organizations/
  contacts/
  _link_expander/
  _shared/
moderation/
  review_queue.json
  audit_log.jsonl
dashboard/
scripts/
  validate_entries.js
  generate_diff.js
  calculate_quality_scores.js
  export_temporal_view.js
  report_language_coverage.js
docs/
  architecture.md
  current-state.md
  status.md
  ueberblick.md
  vision.md
For details, see docs/architecture.md.
For the consolidated, current repo state across docs, code, and live GitHub
issues, see docs/current-state.md.
- Node.js 18+ and npm
- Python 3.11+
- PostgreSQL 16+
- Docker (optional, for running PostgreSQL)
- Git and GitHub account
- Recommended: VS Code and GitHub CLI (gh)
- Clone the repository:
    git clone git@github.com:steffolino/systemfehler.git
    cd systemfehler
- Install Node.js dependencies:
    npm install
- Install Python dependencies:
    pip install -r crawlers/requirements.txt
- Set up environment variables:
    cp .env.example .env
    # Edit .env and add your configuration
- Start PostgreSQL database:
    # Using Docker Compose
    docker-compose up -d postgres
    # Or manually install and configure PostgreSQL 16
- Run database migrations:
    npm run db:migrate
- Install frontend dependencies:
    cd frontend
    npm install
    cd ..

Run the benefits crawler:
    npm run crawl:benefits
Replace local PostgreSQL data from current snapshots:
    npm run db:seed
Start the API server:
    npm run api
Start the frontend (in a new terminal):
    npm run dev
Or start the full production-like local stack (recommended):
    npm run dev:all
Open http://127.0.0.1:8788 in your browser.
Legacy local stack (Express + Python AI sidecar + Ollama):
    npm run dev:all:legacy

# Crawlers
npm run crawl:benefits # Crawl Arbeitsagentur benefits data
npm run crawl:all # Run all available crawlers
# Database
npm run db:migrate # Run database migrations
npm run db:seed # Replace PostgreSQL data from all current snapshots
npm run db:seed:benefits # Import only benefits snapshot into PostgreSQL
# API and Frontend
npm run api # Start Express API server (port 3001)
npm run dev # Start Vite frontend dev server (port 5173)
npm run dev:all # Start full local Cloudflare Pages stack (recommended, port 8788)
npm run dev:all:fast # Start local Pages stack without D1 reset/reseed
npm run dev:all:legacy # Legacy stack: Express + Vite + Python AI sidecar + Ollama + Docker
npm run prepare:dist-pages # Build and assemble dist-pages artifact
npm run dev:pages:d1:reset # Reset local D1 schema + seed snapshot entries (chunked)
npm run dev:pages:stop # Stop stale local wrangler/workerd processes on port 8788
# Validation and Quality
npm run validate # Validate entries against schemas
npm run validate:report # Validate but do not fail on schema errors
npm run validate:ci # CI mode JSON report + non-zero exit on errors
node scripts/enrich_real_entries.js # Enrich summary/content text for current real entries
npm run score # Calculate quality scores (legacy)
# Reports (legacy)
npm run report:temporal # Generate temporal view report
npm run report:languages # Report language coverage
# LLM Features (legacy)
npm run llm:setup # Set up LLM client
npm run llm:embeddings # Generate embeddings
npm run llm:qa # Interactive Q&A
npm run llm:translate # Generate Easy German translations
npm run llm:costs # Report LLM costs

The crawler CLI provides direct access to Python crawler functionality:
# Run crawler
python crawlers/cli.py crawl benefits --source arbeitsagentur
# Validate data
python crawlers/cli.py validate --domain benefits
# Import one domain to database
python crawlers/cli.py import --domain benefits --to-db

Use the Node validator for schema + taxonomy checks and lint warnings:
# Human-readable local report (fails on schema errors)
npm run validate
# Human-readable local report (never fails build)
npm run validate:report
# CI mode: JSON output + non-zero on validation errors
npm run validate:ci
# Optional flags
node scripts/validate_entries.js --domain=benefits --max-samples=10 --fail-on-errors=false

The report includes entry counts, error/warning totals, and sample failures.
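To make the shape of such a report concrete, here is a small Python sketch of the same idea: hard errors for structural problems, lint warnings for missing translations, a capped list of sample failures, and an exit code that depends on a fail-on-errors switch. It does not reproduce the logic of scripts/validate_entries.js. The required fields, the `de_easy` language key, and the report keys are assumptions chosen for illustration.

```python
REQUIRED_FIELDS = ("id", "title")  # hypothetical core fields
LINT_LANGUAGES = ("de_easy",)      # hypothetical key for Easy German text


def check_entry(entry: dict):
    """Return (errors, warnings) for a single entry."""
    errors, warnings = [], []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            errors.append(f"missing required field: {field}")
    title = entry.get("title")
    if isinstance(title, dict):
        # A missing translation is a lint warning, not a schema error.
        for lang in LINT_LANGUAGES:
            if not title.get(lang):
                warnings.append(f"title has no {lang} translation")
    elif title is not None:
        errors.append("title must map language codes to text")
    return errors, warnings


def build_report(entries: list, max_samples: int = 10,
                 fail_on_errors: bool = True) -> dict:
    """Aggregate entry counts, error/warning totals, and a capped failure sample."""
    error_total = warning_total = 0
    samples = []
    for entry in entries:
        errors, warnings = check_entry(entry)
        error_total += len(errors)
        warning_total += len(warnings)
        if errors and len(samples) < max_samples:
            samples.append({"id": entry.get("id"), "errors": errors})
    return {
        "entries": len(entries),
        "errors": error_total,
        "warnings": warning_total,
        "sample_failures": samples,
        "exit_code": 1 if (fail_on_errors and error_total) else 0,
    }
```

Separating errors from warnings is what allows a full pass to report zero schema errors while still listing lint warnings for entries that lack a translation.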
The repository includes a Cloudflare Pages deployment workflow for production frontend + API.
- Workflow: .github/workflows/deploy-pages.yml
- Static output: frontend/dist
- Required GitHub secrets: CF_PAGES_API_TOKEN, CF_ACCOUNT_ID, PAGES_INGEST_URL, INGEST_TOKEN
Cloudflare Pages Functions source is maintained in cloudflare-pages/functions
and is deployed together with the static build.
Note: production hosting currently uses Cloudflare Pages as the primary live target.
See cloudflare-pages/README.md for setup details.
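When testing a deployment script locally, a quick preflight check for the four secret names listed above can catch a missing value early. The sketch below assumes the secrets are exposed to the process as environment variables of the same name; whether the workflow maps them that way is an assumption, and the helper `missing_secrets` is hypothetical.

```python
import os

# Secret names as listed in this README.
REQUIRED_SECRETS = (
    "CF_PAGES_API_TOKEN",
    "CF_ACCOUNT_ID",
    "PAGES_INGEST_URL",
    "INGEST_TOKEN",
)


def missing_secrets(env=None) -> list:
    """Return the required names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]
```

Passing a plain dict as `env` keeps the check easy to test without touching the real environment.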
For a detailed description of the architecture, see:
- docs/architecture.md: architectural overview and data flow.
- docs/ueberblick.md: German high-level overview for non-technical readers.
- docs/vision.md: strategic and stakeholder-focused overview.
- docs/current-state.md: current implemented state and active documentation map.
- docs/status.md: implementation status and known operational limits.
- Guided AI search now defaults to standard answer mode in the public search flow.
- Simple-language (Einfach) answer generation was reworked for coherent narrative output.
- Added editorial life-event semantic governance:
  - D1-backed review-case and override persistence
  - API endpoint /api/data/life-event-review
  - admin dashboard route /admin/life-events
- Retrieval diagnostics now include editorial review and override metadata.
- Production deployment was validated with a full suggested-query run (60/60) and temporary Turnstile E2E bypass cleanup.
- Check open issues in GitHub and pick an Epic or sub-issue that matches your interests.
- Create a feature branch and implement changes in a small, focused scope.
- Run validation and any relevant scripts before committing.
- Open a Pull Request and describe:
  - What changed.
  - Which issue(s) it closes.
  - Any schema updates or data migrations.
Guidelines:
- Do not bypass moderation: crawlers should never write directly into final entries.
- Keep schemas backward compatible where possible and update schema versioning and changelogs when changes are made.
- Update documentation under docs/ if changes affect other contributors.
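As an illustration of the backward-compatibility guideline above, the following sketch compares two JSON-Schema-like dictionaries. It treats a new version as compatible if it removes no properties, changes no property types, and adds no required fields. That definition is an assumption for this example rather than a documented project policy, and `compatibility_problems` is a hypothetical helper.

```python
def compatibility_problems(old_schema: dict, new_schema: dict) -> list:
    """List changes that would break entries valid under the old schema."""
    problems = []
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"property removed: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            problems.append(f"type changed: {name}")
    # Fields that older entries were allowed to omit must stay optional.
    newly_required = set(new_schema.get("required", [])) - set(old_schema.get("required", []))
    for name in sorted(newly_required):
        problems.append(f"newly required: {name}")
    return problems
```

An empty result means only additive, optional changes were made; anything else is a signal to bump the schema version and note the change in the changelog.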
This project is licensed under the MIT License.
See the LICENSE file for the full text.
Everyone is welcome to use, fork, adapt, and contribute.
Systemfehler is in active implementation with a live static deployment and validated real-data snapshots. Ongoing work focuses on broadening source coverage, deepening extraction quality, and maintaining strict schema/taxonomy compliance.