liuchzzyy/rss-cli-agent

rss-cli-agent

rss-cli-agent is a small daily pipeline for collecting paper candidates from journal RSS feeds and Google Scholar alert emails, screening titles with an AI model, and exporting the selected entries as JSON.

The project is intentionally narrow. It keeps only the fields needed by the downstream literature workflow: title, DOI, timestamp, URL, and pipeline state.
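As an illustration of how narrow that record is, here is a minimal sketch of one entry as a Python dataclass. The class name, field names, and the ISO 8601 timestamp format are assumptions made for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """One paper candidate. Names here are illustrative, not the real schema."""

    title: str
    url: str
    timestamp: str  # assumed to be an ISO 8601 string
    doi: Optional[str] = None  # may be absent for some entries
    state: str = "pending_filter"  # new entries wait for title filtering


entry = Entry(
    title="An example paper title",
    url="https://example.org/paper",
    timestamp="2024-01-01T00:00:00Z",
)
```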

Current Flow

  1. sync_scholar.py reads Google Scholar alert emails from Gmail and writes candidate entries into the temporary source cache.
  2. refresh_feeds.py refreshes configured journal RSS feeds into storage/db/source_cache.sqlite.
  3. sync_entries.py compares the temporary source cache with storage/db/rss_entries.sqlite and inserts or updates changed entries.
  4. filter_titles.py reads entries with state = "pending_filter", writes AI decisions to storage/db/title_filtered.sqlite, and updates the persistent RSS entry state.
  5. export_daily.py exports selected entries to storage/exports.

There is no Crossref stage, metadata-cleaning stage, or missing_doi state. The DOI is best-effort: entries without a DOI still go through title filtering and export.
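The compare-and-write behavior of step 3 can be sketched as follows. This is not the code in sync_entries.py: the table name `entries`, its columns, and the use of the URL as the key are all hypothetical, chosen only to show the insert-or-update logic.

```python
import sqlite3

# Hypothetical schema; the real tables and columns are not documented here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entries (
    url TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    doi TEXT,
    timestamp TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'pending_filter'
)
"""


def sync_entries(cache: sqlite3.Connection, store: sqlite3.Connection) -> tuple[int, int]:
    """Copy new or changed rows from the cache into the persistent store.

    Returns (inserted, updated) counts. Unchanged rows are skipped.
    """
    inserted = updated = 0
    rows = cache.execute("SELECT url, title, doi, timestamp FROM entries").fetchall()
    for url, title, doi, timestamp in rows:
        existing = store.execute(
            "SELECT title, doi, timestamp FROM entries WHERE url = ?", (url,)
        ).fetchone()
        if existing is None:
            # New entry: the state column falls back to its 'pending_filter' default.
            store.execute(
                "INSERT INTO entries (url, title, doi, timestamp) VALUES (?, ?, ?, ?)",
                (url, title, doi, timestamp),
            )
            inserted += 1
        elif existing != (title, doi, timestamp):
            # Changed entry: this sketch leaves the existing state untouched.
            store.execute(
                "UPDATE entries SET title = ?, doi = ?, timestamp = ? WHERE url = ?",
                (title, doi, timestamp, url),
            )
            updated += 1
    store.commit()
    return inserted, updated
```

Whether an update should also reset the entry's state is a design decision this sketch does not try to answer.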

Pipeline State

  • pending_filter: entry is waiting for AI title filtering. If the AI call fails, the entry stays here and is retried on the next run.
  • selected: entry passed title filtering and is ready for export.
  • filtered_out: entry was rejected by title filtering.
  • exported: selected entry has been included in a daily export.
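The four states and the transitions their descriptions imply can be written down as a small state machine. The state values come from the list above; the `State` enum, the `ALLOWED` table, and the `advance` helper are hypothetical names for this sketch.

```python
from enum import Enum


class State(str, Enum):
    PENDING_FILTER = "pending_filter"
    SELECTED = "selected"
    FILTERED_OUT = "filtered_out"
    EXPORTED = "exported"


# Transitions implied by the state descriptions; anything else is rejected.
ALLOWED = {
    # A failed AI call keeps the entry in pending_filter for the next run.
    State.PENDING_FILTER: {State.PENDING_FILTER, State.SELECTED, State.FILTERED_OUT},
    State.SELECTED: {State.EXPORTED},
    State.FILTERED_OUT: set(),
    State.EXPORTED: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```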

Runtime Files

Tracked daily state:

  • storage/db/rss_entries.sqlite
  • storage/db/title_filtered.sqlite
  • storage/exports/*.selected.json
  • storage/exports/*.manifest.json

Temporary or local-only files:

  • storage/db/source_cache.sqlite
  • config/settings.toml
  • config/cache/
  • log/

source_cache.sqlite is a per-run comparison cache. It should not be kept as a long-term database.

Configuration

Local runs use config/settings.toml. Do not use .env for this project.

The expected settings sections are:

  • [paths]
  • [deepseek]
  • [ai_title_filter]
  • [google_scholar_alerts]
  • [google_oauth]
  • [rss_feeds]

For GitHub Actions, configure these repository secrets:

  • DEEPSEEK_API_KEY
  • GWS_CLIENT_SECRET_JSON_B64
  • GWS_TOKEN_JSON_B64

The two GWS secrets are base64-encoded JSON files for client_secret.json and token.json.
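Restoring one of these secrets to a JSON file amounts to a base64 decode plus a sanity check. The sketch below assumes the secret is exposed as an environment variable of the same name; the `write_secret_json` helper and the target path passed to it are hypothetical, since where the pipeline expects these files is not stated here.

```python
import base64
import json
import os
from pathlib import Path


def write_secret_json(env_name: str, target: Path) -> dict:
    """Decode a base64 environment variable and write it out as a JSON file."""
    raw = base64.b64decode(os.environ[env_name])
    parsed = json.loads(raw)  # fail early if the secret is not valid JSON
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(raw)
    return parsed
```

Going the other way, `base64.b64encode(Path("token.json").read_bytes()).decode("ascii")` produces a value suitable for pasting into the repository secret.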

Run

Use the wrapper for normal daily runs:

pwsh -NoProfile -ExecutionPolicy Bypass -File tools\run-daily-rss-gmail.ps1

For CI-style local validation in the current terminal:

pwsh -NoProfile -ExecutionPolicy Bypass -File tools\run-daily-rss-gmail.ps1 -RunInCurrentWindow

When running a pipeline module directly, pass the settings file explicitly:

--settings-toml config/settings.toml

Development Checks

uv run ruff check .
uv run ty check .
uv run python -m pytest

Release

Current release line: v1.2.0.

License: MIT.
