Systemfehler

Systemfehler is a modular, extensible, preservation-oriented data platform for social services in Germany. It collects, normalizes, and preserves information about benefits, aid programs, tools, organizations, and related support structures.

The goal is to make information about social rights and support more transparent, accessible, and robust against removal or silent change.

For a high-level German overview for non-technical readers, see docs/ueberblick.md.

Current Snapshot (2026-05-03)

Real data is published for all five domains (benefits, aid, tools, organizations, contacts).
Current validated dataset size: 1006 entries (benefits 5, aid 140, tools 127, organizations 379, contacts 355).
Validation status (latest full pass, run with node scripts/validate_entries.js --fail-on-errors=false): 0 schema/structural errors, 994 lint warnings. The warnings stem mainly from missing Easy German translations on newer promoted entries.
Public frontend and API are live on Cloudflare Pages at https://systemfehler.pages.dev/.
Production AI retrieval validation (suggested life-event suite): 60/60 passed (2026-05-03).

Features

Modular domain structure

Each domain (e.g. benefits, aid, tools, organizations, contacts) has its own crawler and data files, following a common pattern.

Schema-driven data

A core schema defines stable, cross-domain fields. Extension schemas capture domain-specific fields. All entries are validated against these schemas.

Temporal modeling

Entries contain temporal fields (e.g. validity intervals, deadlines, status). Historical versions are archived, so changes over time remain observable.

Multilingual support

Text fields support multiple languages (initially German, English, and Easy German). Translations are preserved even if removed from the original sources.

Human-in-the-loop moderation

Crawler output is never published directly. All changes go into a moderation queue, with diffs and provenance information. Moderators approve or reject changes, and an audit log records decisions.

Quality and AI searchability scores

Entries receive Information Quality and AI Searchability scores to help detect incomplete or outdated data and support downstream ranking and analysis.

LLM-ready structure

Data is stored in a structured, explicit format that supports retrieval-augmented generation, question answering, and future AI-based advisory tools.
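
To make the moderation flow described under "Human-in-the-loop moderation" more concrete, here is a minimal Python sketch of how a crawler candidate could be diffed against a published entry and wrapped in a pending review item instead of being published. The field names (id, title, url), the queue item layout, and the helpers diff_entries and build_queue_item are hypothetical illustrations; the repository lists scripts/generate_diff.js and moderation/review_queue.json, whose actual behaviour and format may differ.

```python
from datetime import datetime, timezone
from typing import Optional


def diff_entries(published: dict, candidate: dict) -> dict:
    """Return a field-level diff between a published entry and a crawler candidate."""
    changes = {}
    for key in sorted(set(published) | set(candidate)):
        old, new = published.get(key), candidate.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes


def build_queue_item(published: dict, candidate: dict, source_url: str) -> Optional[dict]:
    """Wrap a non-empty diff in a pending review item; the published entry is not modified."""
    changes = diff_entries(published, candidate)
    if not changes:
        return None  # nothing changed, so there is nothing to review
    return {
        "entry_id": published.get("id") or candidate.get("id"),
        "status": "pending",  # a moderator would later approve or reject
        "diff": changes,
        "provenance": {  # hypothetical provenance layout
            "source_url": source_url,
            "crawled_at": datetime.now(timezone.utc).isoformat(),
        },
    }


# Hypothetical entries: only the url differs between the two versions.
published = {"id": "b-001", "title": "Bürgergeld", "url": "https://example.org/a"}
candidate = {"id": "b-001", "title": "Bürgergeld", "url": "https://example.org/b"}
item = build_queue_item(published, candidate, "https://example.org/b")
print(item["diff"])
```

The point of the design is that the crawler only ever produces review items; writing the approved change into the final entry and recording the decision in the audit log would be separate, moderator-triggered steps.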

Repository Structure

A typical layout looks like this:

data/
  _schemas/
  _taxonomy/
  _sources/
  _quality/
  benefits/
  aid/
  tools/
  organizations/
  contacts/

services/
  benefits/
  aid/
  tools/
  organizations/
  contacts/
  _link_expander/
  _shared/

moderation/
  review_queue.json
  audit_log.jsonl
  dashboard/

scripts/
  validate_entries.js
  generate_diff.js
  calculate_quality_scores.js
  export_temporal_view.js
  report_language_coverage.js

docs/
  architecture.md
  current-state.md
  status.md
  ueberblick.md
  vision.md

For details, see docs/architecture.md.

For the consolidated, current repo state across docs, code, and live GitHub issues, see docs/current-state.md.

Getting Started

Prerequisites

Node.js 18+ and npm
Python 3.11+
PostgreSQL 16+
Docker (optional, for running PostgreSQL)
Git and a GitHub account
Recommended: VS Code and GitHub CLI (gh)

Installation

Clone the repository:

git clone git@github.com:steffolino/systemfehler.git
cd systemfehler

Install Node.js dependencies:

npm install

Install Python dependencies:

pip install -r crawlers/requirements.txt

Set up environment variables:

cp .env.example .env
# Edit .env and add your configuration

Start the PostgreSQL database:

# Using Docker Compose
docker-compose up -d postgres

# Or manually install and configure PostgreSQL 16

Run database migrations:

npm run db:migrate

Install frontend dependencies:

cd frontend
npm install
cd ..

Quick Start

Run the benefits crawler:

npm run crawl:benefits

Replace local PostgreSQL data from current snapshots:

npm run db:seed

Start the API server:

npm run api

Start the frontend (in a new terminal):

npm run dev

Or start the full production-like local stack (recommended):

npm run dev:all

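With the full stack running, open http://127.0.0.1:8788 in your browser. If you started only the Vite dev server with npm run dev, it listens on port 5173 instead.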

Legacy local stack (Express + Python AI sidecar + Ollama):

npm run dev:all:legacy

Available Commands

# Crawlers
npm run crawl:benefits          # Crawl Arbeitsagentur benefits data
npm run crawl:all               # Run all available crawlers

# Database
npm run db:migrate              # Run database migrations
npm run db:seed                 # Replace PostgreSQL data from all current snapshots
npm run db:seed:benefits        # Import only benefits snapshot into PostgreSQL

# API and Frontend
npm run api                     # Start Express API server (port 3001)
npm run dev                     # Start Vite frontend dev server (port 5173)
npm run dev:all                 # Start full local Cloudflare Pages stack (recommended, port 8788)
npm run dev:all:fast            # Start local Pages stack without D1 reset/reseed
npm run dev:all:legacy          # Legacy stack: Express + Vite + Python AI sidecar + Ollama + Docker
npm run prepare:dist-pages      # Build and assemble dist-pages artifact
npm run dev:pages:d1:reset      # Reset local D1 schema + seed snapshot entries (chunked)
npm run dev:pages:stop          # Stop stale local wrangler/workerd processes on port 8788

# Validation and Quality
npm run validate                # Validate entries against schemas
npm run validate:report         # Validate but do not fail on schema errors
npm run validate:ci             # CI mode JSON report + non-zero exit on errors
node scripts/enrich_real_entries.js  # Enrich summary/content text for current real entries
npm run score                   # Calculate quality scores (legacy)

# Reports (legacy)
npm run report:temporal         # Generate temporal view report
npm run report:languages        # Report language coverage

# LLM Features (legacy)
npm run llm:setup               # Set up LLM client
npm run llm:embeddings          # Generate embeddings
npm run llm:qa                  # Interactive Q&A
npm run llm:translate           # Generate Easy German translations
npm run llm:costs               # Report LLM costs

Python CLI

The crawler CLI provides direct access to Python crawler functionality:

# Run crawler
python crawlers/cli.py crawl benefits --source arbeitsagentur

# Validate data
python crawlers/cli.py validate --domain benefits

# Import one domain to database
python crawlers/cli.py import --domain benefits --to-db
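
For readers unfamiliar with this command shape, the following is a minimal sketch of how a CLI with the three subcommands above could be laid out using Python's standard argparse module. It is not the actual crawlers/cli.py; the build_parser helper, the help texts, and the choice of which options are required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the crawl / validate / import invocations shown above."""
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="run a crawler for one domain")
    crawl.add_argument("domain")      # positional, e.g. "benefits"
    crawl.add_argument("--source")    # e.g. "arbeitsagentur"

    validate = sub.add_parser("validate", help="validate data for one domain")
    validate.add_argument("--domain", required=True)

    imp = sub.add_parser("import", help="import one domain")
    imp.add_argument("--domain", required=True)
    imp.add_argument("--to-db", action="store_true")  # exposed as args.to_db
    return parser


# Parse the same arguments as the first example invocation above.
args = build_parser().parse_args(["crawl", "benefits", "--source", "arbeitsagentur"])
print(args.command, args.domain, args.source)
```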

Validation Pipeline (DATA-05)

Use the Node validator for schema + taxonomy checks and lint warnings:

# Human-readable local report (fails on schema errors)
npm run validate

# Human-readable local report (never fails build)
npm run validate:report

# CI mode: JSON output + non-zero on validation errors
npm run validate:ci

# Optional flags
node scripts/validate_entries.js --domain=benefits --max-samples=10 --fail-on-errors=false

The report includes entry counts, error/warning totals, and sample failures.
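
The difference between schema errors (which can fail a build) and lint warnings (which never do) can be illustrated with a small Python sketch. This is not the Node validator in scripts/validate_entries.js; the required fields, the language keys, and the entry layout below are assumptions chosen only to show the mechanism.

```python
from typing import List, Tuple

REQUIRED_FIELDS = ("id", "title", "url")      # hypothetical core fields
EXPECTED_LANGUAGES = ("de", "en", "de-easy")  # hypothetical language keys


def check_entry(entry: dict) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for one entry."""
    errors, warnings = [], []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            errors.append(f"missing required field: {field}")
    title = entry.get("title")
    if isinstance(title, dict):
        for lang in EXPECTED_LANGUAGES:
            if not title.get(lang):
                # A missing translation is a lint warning, not a schema error.
                warnings.append(f"title has no '{lang}' translation")
    return errors, warnings


def validate(entries: List[dict], fail_on_errors: bool = True) -> int:
    """Print totals and return an exit code, loosely mirroring --fail-on-errors."""
    total_errors = total_warnings = 0
    for entry in entries:
        errors, warnings = check_entry(entry)
        total_errors += len(errors)
        total_warnings += len(warnings)
    print(f"entries={len(entries)} errors={total_errors} warnings={total_warnings}")
    return 1 if (total_errors and fail_on_errors) else 0


entries = [
    {"id": "a-1", "url": "https://example.org", "title": {"de": "Hilfe", "en": "Aid"}},
    {"id": "a-2", "title": {"de": "Beratung"}},  # no url, no en / de-easy title
]
exit_code = validate(entries, fail_on_errors=False)
# prints: entries=2 errors=1 warnings=3
```

With fail_on_errors=False the sketch returns 0 even though one entry is invalid, which corresponds to the report-only mode; with the default it returns 1.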

Cloudflare Deployment

The repository includes a Cloudflare Pages deployment workflow for the production frontend and API.

Workflow: .github/workflows/deploy-pages.yml
Static output: frontend/dist
Required GitHub secrets: CF_PAGES_API_TOKEN, CF_ACCOUNT_ID, PAGES_INGEST_URL, INGEST_TOKEN
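
Before triggering a deployment from a local shell or a custom CI job, it can help to confirm that the four values listed above are present. The following Python sketch checks for them as environment variables; the missing_secrets helper is hypothetical, and it assumes the secrets are exposed under the same names as in GitHub, which the workflow file may or may not do.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_SECRETS = (
    "CF_PAGES_API_TOKEN",
    "CF_ACCOUNT_ID",
    "PAGES_INGEST_URL",
    "INGEST_TOKEN",
)


def missing_secrets(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required secrets that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_secrets({"CF_ACCOUNT_ID": "abc123"})
print("missing:", ", ".join(missing) or "none")
# prints: missing: CF_PAGES_API_TOKEN, PAGES_INGEST_URL, INGEST_TOKEN
```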

Cloudflare Pages Functions source is maintained in cloudflare-pages/functions and is deployed together with the static build.

Note: production hosting currently uses Cloudflare Pages as the primary live target.

See cloudflare-pages/README.md for setup details.

Architecture and Design

For a detailed description of the architecture, see:

docs/architecture.md – architectural overview and data flow.
docs/ueberblick.md – German high-level overview for non-technical readers.
docs/vision.md – strategic and stakeholder-focused overview.
docs/current-state.md – current implemented state and active documentation map.
docs/status.md – implementation status and known operational limits.

Recent Changes (2026-05-03)

Guided AI search now defaults to standard answer mode in the public search flow.
Simple-language (Einfach) answer generation was reworked for coherent narrative output.
Added editorial life-event semantic governance:
- D1-backed review-case and override persistence
- API endpoint /api/data/life-event-review
- admin dashboard route /admin/life-events
Retrieval diagnostics now include editorial review and override metadata.
Production deployment was validated with a full suggested-query run (60/60), and the temporary Turnstile E2E bypass was cleaned up.

Contributing

Check open issues in GitHub and pick an Epic or sub-issue that matches your interests.
Create a feature branch and implement changes in a small, focused scope.
Run validation and any relevant scripts before committing.
Open a Pull Request and describe:
- What changed.
- Which issue(s) it closes.
- Any schema updates or data migrations.

Guidelines:

Do not bypass moderation: crawlers should never write directly into final entries.
Keep schemas backward compatible where possible and update schema versioning and changelogs when changes are made.
Update documentation under docs/ if changes affect other contributors.

License

This project is licensed under the MIT License.

See the LICENSE file for the full text.

Everyone is welcome to use, fork, adapt, and contribute.

Status

Systemfehler is under active development, with a live deployment on Cloudflare Pages and validated real-data snapshots. Ongoing work focuses on broadening source coverage, deepening extraction quality, and maintaining strict schema/taxonomy compliance.
