steffolino/systemfehler

Systemfehler

Systemfehler is a modular, extensible, preservation-oriented data platform for social services in Germany. It collects, normalizes, and preserves information about benefits, aid programs, tools, organizations, and related support structures.

The goal is to make information about social rights and support more transparent, accessible, and robust against removal or silent change.

For a high-level German overview for non-technical readers, see docs/ueberblick.md.

Current Snapshot (2026-05-03)

  • Real data is published for all five domains (benefits, aid, tools, organizations, contacts).
  • Current validated dataset size: 1006 entries (benefits 5, aid 140, tools 127, organizations 379, contacts 355).
  • Validation status (latest full pass, run with node scripts/validate_entries.js --fail-on-errors=false): 0 schema/structural errors and 994 lint warnings. The warnings come mainly from missing Easy German translations on newer promoted entries.
  • Public frontend and API are live on Cloudflare Pages at https://systemfehler.pages.dev/.
  • Production AI retrieval validation (suggested life-event suite): 60/60 passed (2026-05-03).
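
As an illustration of where lint warnings about missing Easy German translations can come from, here is a minimal sketch of a language-coverage check. The entry layout (a `title` object keyed by language) and the language keys `de`, `en`, and `easy_de` are assumptions made for this example; they are not taken from the project's schemas.

```python
from typing import Any

# Hypothetical language keys; the real schema may name them differently.
LANGUAGES = ("de", "en", "easy_de")


def language_coverage(entries: list[dict[str, Any]], field: str = "title") -> dict[str, int]:
    """Count how many entries carry non-empty text for each language."""
    coverage = {lang: 0 for lang in LANGUAGES}
    for entry in entries:
        texts = entry.get(field) or {}
        for lang in LANGUAGES:
            if (texts.get(lang) or "").strip():
                coverage[lang] += 1
    return coverage


def missing_translation_warnings(
    entries: list[dict[str, Any]], lang: str = "easy_de", field: str = "title"
) -> list[str]:
    """Produce one lint-style warning per entry that lacks the given language."""
    return [
        f"{entry['id']}: missing {lang} {field}"
        for entry in entries
        if not ((entry.get(field) or {}).get(lang) or "").strip()
    ]


# Made-up sample entries, for illustration only.
entries = [
    {"id": "t-1", "title": {"de": "A", "en": "A", "easy_de": "A"}},
    {"id": "t-2", "title": {"de": "B"}},
]
print(language_coverage(entries))
print(missing_translation_warnings(entries))
```

A missing translation is reported as a warning rather than an error, which matches the snapshot above: the dataset can be structurally valid while still having translation gaps.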

Features

  • Modular domain structure

    Each domain (e.g. benefits, aid, tools, organizations, contacts) has its own crawler and data files, following a common pattern.

  • Schema-driven data

    A core schema defines stable, cross-domain fields. Extension schemas capture domain-specific fields. All entries are validated against these schemas.

  • Temporal modeling

    Entries contain temporal fields (e.g. validity intervals, deadlines, status). Historical versions are archived, so changes over time remain observable.

  • Multilingual support

    Text fields support multiple languages (initially German, English, and Easy German). Translations are preserved even if removed from the original sources.

  • Human-in-the-loop moderation

    Crawler output is never published directly. All changes go into a moderation queue, with diffs and provenance information. Moderators approve or reject changes, and an audit log records decisions.

  • Quality and AI searchability scores

    Entries receive Information Quality and AI Searchability scores to help detect incomplete or outdated data and support downstream ranking and analysis.

  • LLM-ready structure

    Data is stored in a structured, explicit format that supports retrieval-augmented generation, question answering, and future AI-based advisory tools.
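
The schema-driven approach described above (one core schema plus per-domain extension schemas) can be sketched as follows. This is a simplified illustration and not the project's validator: the field names, the type-based schema format, and the sample entry are assumptions, and the real schemas under data/_schemas/ are not reproduced here.

```python
from typing import Any

# Hypothetical schemas: field name -> expected Python type.
CORE_SCHEMA: dict[str, type] = {"id": str, "title": dict, "status": str}
EXTENSION_SCHEMAS: dict[str, dict[str, type]] = {
    "benefits": {"eligibility": str},
    "contacts": {"phone": str},
}


def validate_entry(entry: dict[str, Any], domain: str) -> list[str]:
    """Return a list of error messages; an empty list means the entry is valid."""
    # Core fields apply to every domain; extension fields only to their own domain.
    required = {**CORE_SCHEMA, **EXTENSION_SCHEMAS.get(domain, {})}
    errors = []
    for field, expected in required.items():
        if field not in entry:
            errors.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            errors.append(f"wrong type for {field}: expected {expected.__name__}")
    return errors


# Made-up sample entry, for illustration only.
entry = {"id": "b-001", "title": {"de": "Beispiel"}, "status": "active"}
print(validate_entry(entry, "benefits"))  # ['missing field: eligibility']
```

Merging the core and extension definitions before checking keeps the cross-domain fields stable while letting each domain add its own requirements.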


Repository Structure

A typical layout looks like this:

data/
  _schemas/
  _taxonomy/
  _sources/
  _quality/
  benefits/
  aid/
  tools/
  organizations/
  contacts/

services/
  benefits/
  aid/
  tools/
  organizations/
  contacts/
  _link_expander/
  _shared/

moderation/
  review_queue.json
  audit_log.jsonl
  dashboard/

scripts/
  validate_entries.js
  generate_diff.js
  calculate_quality_scores.js
  export_temporal_view.js
  report_language_coverage.js

docs/
  architecture.md
  current-state.md
  status.md
  ueberblick.md
  vision.md

For details, see docs/architecture.md.
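
To make the role of moderation/review_queue.json and moderation/audit_log.jsonl more concrete, here is a minimal sketch of a field-level diff and a JSON Lines audit record. The record fields (`entry_id`, `decision`, `moderator`, `decided_at`) and the sample data are assumptions made for this example; the actual file formats may differ.

```python
import json
from datetime import datetime, timezone
from typing import Any


def diff_entries(old: dict[str, Any], new: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return changed fields as {field: {"old": ..., "new": ...}}."""
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            changes[key] = {"old": old.get(key), "new": new.get(key)}
    return changes


def audit_line(entry_id: str, decision: str, moderator: str) -> str:
    """Serialize one moderation decision as a single JSON Lines record."""
    record = {
        "entry_id": entry_id,
        "decision": decision,  # e.g. "approved" or "rejected"
        "moderator": moderator,
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, ensure_ascii=False)


# Made-up published entry and crawler result, for illustration only.
published = {"id": "aid-42", "status": "active", "deadline": "2026-06-30"}
crawled = {"id": "aid-42", "status": "active", "deadline": "2026-09-30"}

# The crawler result is queued with its diff instead of overwriting the entry.
queue_item = {"entry_id": "aid-42", "diff": diff_entries(published, crawled)}
print(queue_item["diff"])
print(audit_line("aid-42", "approved", "moderator-1"))
```

An append-only JSON Lines log keeps one decision per line, so earlier decisions are never rewritten when new ones are recorded.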

For the consolidated, current repo state across docs, code, and live GitHub issues, see docs/current-state.md.


Getting Started

Prerequisites

  • Node.js 18+ and npm
  • Python 3.11+
  • PostgreSQL 16+
  • Docker (optional, for running PostgreSQL)
  • Git and a GitHub account
  • Recommended: VS Code and GitHub CLI (gh)

Installation

  1. Clone the repository:
     git clone git@github.com:steffolino/systemfehler.git
     cd systemfehler
  2. Install Node.js dependencies:
     npm install
  3. Install Python dependencies:
     pip install -r crawlers/requirements.txt
  4. Set up environment variables:
     cp .env.example .env
     # Edit .env and add your configuration
  5. Start the PostgreSQL database:
     # Using Docker Compose
     docker-compose up -d postgres

     # Or manually install and configure PostgreSQL 16
  6. Run database migrations:
     npm run db:migrate
  7. Install frontend dependencies:
     cd frontend
     npm install
     cd ..

Quick Start

Run the benefits crawler:

npm run crawl:benefits

Replace the local PostgreSQL data with the current snapshots:

npm run db:seed

Start the API server:

npm run api

Start the frontend (in a new terminal):

npm run dev

Or start the full production-like local stack (recommended):

npm run dev:all

Open http://127.0.0.1:8788 in your browser.

Legacy local stack (Express + Python AI sidecar + Ollama):

npm run dev:all:legacy

Available Commands

# Crawlers
npm run crawl:benefits          # Crawl Arbeitsagentur benefits data
npm run crawl:all               # Run all available crawlers

# Database
npm run db:migrate              # Run database migrations
npm run db:seed                 # Replace PostgreSQL data from all current snapshots
npm run db:seed:benefits        # Import only benefits snapshot into PostgreSQL

# API and Frontend
npm run api                     # Start Express API server (port 3001)
npm run dev                     # Start Vite frontend dev server (port 5173)
npm run dev:all                 # Start full local Cloudflare Pages stack (recommended, port 8788)
npm run dev:all:fast            # Start local Pages stack without D1 reset/reseed
npm run dev:all:legacy          # Legacy stack: Express + Vite + Python AI sidecar + Ollama + Docker
npm run prepare:dist-pages      # Build and assemble dist-pages artifact
npm run dev:pages:d1:reset      # Reset local D1 schema + seed snapshot entries (chunked)
npm run dev:pages:stop          # Stop stale local wrangler/workerd processes on port 8788

# Validation and Quality
npm run validate                # Validate entries against schemas
npm run validate:report         # Validate but do not fail on schema errors
npm run validate:ci             # CI mode JSON report + non-zero exit on errors
node scripts/enrich_real_entries.js  # Enrich summary/content text for current real entries
npm run score                   # Calculate quality scores (legacy)

# Reports (legacy)
npm run report:temporal         # Generate temporal view report
npm run report:languages        # Report language coverage

# LLM Features (legacy)
npm run llm:setup               # Set up LLM client
npm run llm:embeddings          # Generate embeddings
npm run llm:qa                  # Interactive Q&A
npm run llm:translate           # Generate Easy German translations
npm run llm:costs               # Report LLM costs

Python CLI

The crawler CLI provides direct access to Python crawler functionality:

# Run crawler
python crawlers/cli.py crawl benefits --source arbeitsagentur

# Validate data
python crawlers/cli.py validate --domain benefits

# Import one domain to database
python crawlers/cli.py import --domain benefits --to-db

Validation Pipeline (DATA-05)

Use the Node validator for schema + taxonomy checks and lint warnings:

# Human-readable local report (fails on schema errors)
npm run validate

# Human-readable local report (never fails build)
npm run validate:report

# CI mode: JSON output + non-zero on validation errors
npm run validate:ci

# Optional flags
node scripts/validate_entries.js --domain=benefits --max-samples=10 --fail-on-errors=false

The report includes entry counts, error/warning totals, and sample failures.
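
The shape of such a report can be sketched as follows. This is an illustration under assumptions, not the output format of scripts/validate_entries.js: the per-entry result layout, the report keys, and the helper names `build_report` and `exit_code` are hypothetical. The `max_samples` and `fail_on_errors` parameters mirror the `--max-samples` and `--fail-on-errors` flags shown above.

```python
from collections import Counter
from typing import Any


def build_report(results: list[dict[str, Any]], max_samples: int = 10) -> dict[str, Any]:
    """Aggregate per-entry validation results into one summary report."""
    failures = [r for r in results if r["errors"]]
    return {
        "entries": dict(Counter(r["domain"] for r in results)),
        "errors": sum(len(r["errors"]) for r in results),
        "warnings": sum(len(r["warnings"]) for r in results),
        # Only the first max_samples failing entries are listed in detail.
        "sample_failures": [
            {"id": r["id"], "errors": r["errors"]} for r in failures[:max_samples]
        ],
    }


def exit_code(report: dict[str, Any], fail_on_errors: bool = True) -> int:
    """Warnings never fail the run; errors do unless failing is disabled."""
    return 1 if fail_on_errors and report["errors"] > 0 else 0


# Made-up validation results, for illustration only.
results = [
    {"id": "b-1", "domain": "benefits", "errors": [], "warnings": ["missing translation"]},
    {"id": "a-1", "domain": "aid", "errors": ["missing field: status"], "warnings": []},
]
report = build_report(results, max_samples=1)
print(report)
print(exit_code(report), exit_code(report, fail_on_errors=False))
```

Separating the report from the exit-code decision is one way to let the same report serve both the local run that never fails and the CI run that does.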


Cloudflare Deployment

The repository includes a Cloudflare Pages deployment workflow for production frontend + API.

  • Workflow: .github/workflows/deploy-pages.yml
  • Static output: frontend/dist
  • Required GitHub secrets: CF_PAGES_API_TOKEN, CF_ACCOUNT_ID, PAGES_INGEST_URL, INGEST_TOKEN

Cloudflare Pages Functions source is maintained in cloudflare-pages/functions and is deployed together with the static build.

Note: production hosting currently uses Cloudflare Pages as the primary live target.

See cloudflare-pages/README.md for setup details.


Architecture and Design

For a detailed description of the architecture, see:

  • docs/architecture.md – architectural overview and data flow.
  • docs/ueberblick.md – German high-level overview for non-technical readers.
  • docs/vision.md – strategic and stakeholder-focused overview.
  • docs/current-state.md – current implemented state and active documentation map.
  • docs/status.md – implementation status and known operational limits.

Recent Changes (2026-05-03)

  • Guided AI search now defaults to standard answer mode in the public search flow.
  • Simple-language (Einfach) answer generation was reworked for coherent narrative output.
  • Added editorial life-event semantic governance:
    • D1-backed review-case and override persistence
    • API endpoint /api/data/life-event-review
    • admin dashboard route /admin/life-events
  • Retrieval diagnostics now include editorial review and override metadata.
  • Production deployment was validated with a full suggested-query run (60/60 passed), and the temporary Turnstile E2E bypass was cleaned up.

Contributing

  1. Check open issues in GitHub and pick an Epic or sub-issue that matches your interests.

  2. Create a feature branch and implement changes in a small, focused scope.

  3. Run validation and any relevant scripts before committing.

  4. Open a Pull Request and describe:

    • What changed.
    • Which issue(s) it closes.
    • Any schema updates or data migrations.

Guidelines:

  • Do not bypass moderation: crawlers should never write directly into final entries.
  • Keep schemas backward compatible where possible and update schema versioning and changelogs when changes are made.
  • Update documentation under docs/ if changes affect other contributors.
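
The backward-compatibility guideline can be checked mechanically. The sketch below assumes the schemas are JSON Schema-style objects with `required` and `properties` keys, which this README does not confirm; the helper name `breaking_changes` and the sample schemas are hypothetical.

```python
from typing import Any


def breaking_changes(old: dict[str, Any], new: dict[str, Any]) -> list[str]:
    """List changes that could invalidate entries written against the old schema."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    # A newly required field makes existing entries without it invalid.
    for name in sorted(set(new.get("required", [])) - set(old.get("required", []))):
        problems.append(f"new required field: {name}")
    # A removed field drops data that existing entries may still carry.
    for name in sorted(set(old_props) - set(new_props)):
        problems.append(f"removed field: {name}")
    # A changed type can reject values that used to be accepted.
    for name in sorted(set(old_props) & set(new_props)):
        if old_props[name].get("type") != new_props[name].get("type"):
            problems.append(f"type changed: {name}")
    return problems


# Made-up schema versions, for illustration only.
old_schema = {
    "required": ["id"],
    "properties": {"id": {"type": "string"}, "title": {"type": "object"}},
}
new_schema = {
    "required": ["id", "status"],
    "properties": {"id": {"type": "string"}, "status": {"type": "string"}},
}
print(breaking_changes(old_schema, new_schema))
# ['new required field: status', 'removed field: title']
```

Adding an optional property produces no findings in this sketch, which is why optional additions are usually the safest way to extend a schema.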

License

This project is licensed under the MIT License.

See the LICENSE file for the full text.

Everyone is welcome to use, fork, adapt, and contribute.


Status

Systemfehler is in active implementation with a live Cloudflare Pages deployment and validated real-data snapshots. Ongoing work focuses on broadening source coverage, deepening extraction quality, and maintaining strict schema/taxonomy compliance.
