feat(docs-server): dynamic sitemap discovery for automatic content adaptation #190

Merged
cbjuan merged 12 commits into main from feat/179-dynamic-sitemap-discovery on Apr 20, 2026
cbjuan (Member) commented Apr 16, 2026

Summary

Closes #179

  • Dynamic sitemap discovery: fetches and parses sitemap-0.xml to automatically discover all documentation pages (modules, addons, guides, tutorials, API packages), with TTL caching and graceful fallback to hardcoded constants when the sitemap is unreachable
  • Two new MCP resources: qiskit-docs://tutorials (43 tutorials) and qiskit-docs://api-packages (6 packages including qiskit-ibm-runtime, qiskit-ibm-transpiler, REST APIs)
  • Updated fallback constants: 28 SDK modules (was 17), ~160 guides (was ~40), 43 tutorials and 6 API packages (both new)
  • 22 new tests covering sitemap XML parsing, HTTP fetching, caching, and fallback behavior
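As a rough illustration of the discovery step summarized above, the sketch below parses a sitemap document and sorts page URLs into buckets. It is not the code from this PR: the URL prefixes, the bucket names, and the helper names `classify_page` and `parse_sitemap_xml` are assumptions made for the example. It prefers defusedxml when installed and otherwise falls back to the standard library parser so the snippet stays runnable.

```python
from urllib.parse import urlparse

try:
    # Hardened parser with the same fromstring() entry point.
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Standard library fallback so the sketch runs without extra packages.
    from xml.etree.ElementTree import fromstring

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical path prefixes; the real site layout may differ.
PREFIX_TO_BUCKET = {
    "/docs/guides/": "guides",
    "/docs/tutorials/": "tutorials",
}


def classify_page(url):
    """Return (bucket, slug) for a recognized URL, or None otherwise."""
    path = urlparse(url).path
    for prefix, bucket in PREFIX_TO_BUCKET.items():
        if path.startswith(prefix):
            slug = path[len(prefix):].strip("/")
            if slug:
                return bucket, slug
    return None


def parse_sitemap_xml(xml_text):
    """Collect the slugs found in every <loc> element, grouped by bucket."""
    buckets = {bucket: set() for bucket in PREFIX_TO_BUCKET.values()}
    root = fromstring(xml_text)
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        classified = classify_page((loc.text or "").strip())
        if classified:
            bucket, slug = classified
            buckets[bucket].add(slug)
    return {key: sorted(values) for key, values in buckets.items()}


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/guides/transpile</loc></url>
  <url><loc>https://example.com/docs/guides/install</loc></url>
  <url><loc>https://example.com/docs/tutorials/hello</loc></url>
  <url><loc>https://example.com/blog/unrelated</loc></url>
</urlset>"""

print(parse_sitemap_xml(SAMPLE))
```

Using sets per bucket removes duplicate URLs, and sorting the final lists keeps the output stable between runs.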

…t adaptation

Closes #179. Resource lists were hardcoded and significantly out of date.
This adds sitemap-based dynamic discovery so the server automatically picks
up new pages when IBM updates the documentation site, with graceful fallback
to comprehensive hardcoded constants when the sitemap is unreachable.

- Add _parse_sitemap_xml / _classify_page for sitemap parsing
- Add _fetch_sitemap_pages with TTL caching and fallback
- Convert get_list_of_modules/addons/guides to async with sitemap-first
- Add get_list_of_tutorials and get_list_of_api_packages
- Add qiskit-docs://tutorials and qiskit-docs://api-packages resources
- Update constants: 28 modules, ~160 guides, 43 tutorials, 6 API packages
- Add 22 new tests covering sitemap parsing, fetching, and fallback paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
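The sitemap-first lookup with TTL caching and fallback described in this commit could look roughly like the sketch below. The class `SitemapCache`, the helper `list_guides`, and the `FALLBACK_PAGES` values are hypothetical names chosen for the example, and the network fetch is injected as a callable so the snippet runs without HTTP access.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Optional

Pages = Dict[str, List[str]]

# Hypothetical fallback data; the PR ships much longer hardcoded lists.
FALLBACK_PAGES: Pages = {"guides": ["install"], "tutorials": []}


class SitemapCache:
    """Caches fetched pages for ttl_seconds; returns None when a fetch fails."""

    def __init__(self, fetch: Callable[[], Awaitable[Pages]],
                 ttl_seconds: float, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._pages: Optional[Pages] = None
        self._expires_at = 0.0

    async def get(self) -> Optional[Pages]:
        now = self._clock()
        if self._pages is not None and now < self._expires_at:
            return self._pages  # still fresh, skip the network
        try:
            self._pages = await self._fetch()
            self._expires_at = now + self._ttl
        except Exception:
            return None  # unreachable or malformed sitemap
        return self._pages


async def list_guides(cache: SitemapCache) -> List[str]:
    """Sitemap first; hardcoded fallback when discovery fails."""
    pages = await cache.get()
    if pages is None:
        return FALLBACK_PAGES["guides"]
    return pages["guides"]


async def _fake_fetch() -> Pages:
    # Stand-in for the real HTTP request and XML parsing.
    return {"guides": ["install", "transpile"], "tutorials": ["hello"]}


print(asyncio.run(list_guides(SitemapCache(_fake_fetch, ttl_seconds=60))))
```

Returning `None` from the cache, instead of raising, keeps the fallback decision in the caller, which mirrors the `dict[str, list[str]] | None` return type visible in the reviewed diff.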
cbjuan requested a review from vabarbosa as a code owner April 16, 2026 10:45
cbjuan added 2 commits April 16, 2026 12:46
The cache TTL environment variable was configurable but not listed
in the README's environment variables table.
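For readers unfamiliar with this kind of setting, a minimal sketch of reading a cache TTL from the environment follows. The variable name `EXAMPLE_SITEMAP_CACHE_TTL` and the default of 3600 seconds are placeholders invented for the example; the actual name and default are the ones listed in the README.

```python
import os

DEFAULT_TTL_SECONDS = 3600.0  # assumed default, for illustration only


def read_cache_ttl(env=None, name="EXAMPLE_SITEMAP_CACHE_TTL"):
    """Return the TTL in seconds, falling back to the default on bad input."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return DEFAULT_TTL_SECONDS
    try:
        ttl = float(raw)
    except ValueError:
        return DEFAULT_TTL_SECONDS  # not a number
    # Negative values (and NaN) are rejected in favor of the default.
    return ttl if ttl >= 0 else DEFAULT_TTL_SECONDS


print(read_cache_ttl({"EXAMPLE_SITEMAP_CACHE_TTL": "120"}))
```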
vabarbosa (Collaborator) left a comment

thank you!
just a couple comments for consideration later (not necessarily right now):

  • how should the hardcoded values (e.g., guides, packages, etc.) be kept up to date?
  • it may be time to think about splitting data_fetcher.py into multiple files, as it is getting long

# Conflicts:
#	qiskit-docs-mcp-server/README.md
#	qiskit-docs-mcp-server/src/qiskit_docs_mcp_server/data_fetcher.py
#	qiskit-docs-mcp-server/tests/test_server.py
cbjuan (Member, Author) commented Apr 17, 2026

The sitemap updater should enable automatic loading of categories and sections. The current hardcoded values act as a fallback in case the sitemap fetch fails.

I agree about splitting data_fetcher.py; I'm working on a refactor of it.

cbjuan and others added 7 commits April 17, 2026 21:31
… add fallback update script

Address PR reviewer feedback by extracting data_fetcher.py (823 lines) into
four focused modules: http.py (HTTP infrastructure and caching), sitemap.py
(sitemap discovery), html_processing.py (HTML-to-markdown conversion), and a
slimmed data_fetcher.py (business logic). Also adds scripts/update_fallback_constants.py
to regenerate hardcoded fallback values from the live sitemap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
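One way a script could regenerate fallback constants from discovered pages is sketched below. This is not the content of scripts/update_fallback_constants.py: the function `render_constants` and the `FALLBACK_*` naming scheme are assumptions for illustration, and the input dictionary stands in for whatever the live sitemap returns.

```python
import pprint


def render_constants(pages):
    """Render Python source that defines one sorted list per bucket."""
    lines = ['"""Fallback constants (generated; do not edit by hand)."""', ""]
    for bucket in sorted(pages):
        # "api-packages" becomes FALLBACK_API_PACKAGES, and so on.
        name = "FALLBACK_" + bucket.upper().replace("-", "_")
        values = sorted(set(pages[bucket]))
        lines.append(f"{name} = {pprint.pformat(values, width=88)}")
        lines.append("")
    return "\n".join(lines)


discovered = {
    "guides": ["transpile", "install", "install"],
    "api-packages": ["qiskit-ibm-runtime", "qiskit"],
}
print(render_constants(discovered))
```

Sorting and deduplicating before rendering keeps the generated file stable, so a rerun only produces a diff when the site content actually changes.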
…resources, and project structure

Add documentation for dynamic sitemap discovery feature, the fallback
constants update script, resource templates, two new resources (tutorials,
api-packages), and the new modular project structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…afe XML parsing

Bandit flagged xml.etree.ElementTree as vulnerable to XML attacks (B405, B314).
Switch to defusedxml which provides the same API with protection against
XML bombs and external entity expansion.
Use direct import of fromstring instead of aliasing the module as ET,
which triggered N817 (CamelCase imported as acronym) and I001 (unsorted
imports).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vabarbosa (Collaborator) left a comment

thank you!!
just one minor comment (more of a preference than an issue so could be ignored)

return {key: sorted(values) for key, values in buckets.items()}


async def _fetch_sitemap_pages() -> dict[str, list[str]] | None:
Collaborator

wondering if it would make sense to just fetch the sitemap on startup/lifespan and store the result? all subsequent calls would use the stored value

Member Author

I like the change proposal. Implemented in 481bcc1

…zily

Move sitemap fetching from per-resource-call lazy cache lookups to a
single eager fetch during server lifespan startup, per reviewer feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
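The eager fetch at startup described in this commit can be pictured with the following sketch. It uses a plain `contextlib.asynccontextmanager` and a dictionary as shared state rather than any particular server framework; `lifespan`, `fetch_sitemap_pages`, `list_guides`, and `FALLBACK_PAGES` are illustrative names and values, not the identifiers from commit 481bcc1.

```python
import asyncio
from contextlib import asynccontextmanager

# Hypothetical fallback data used when the startup fetch fails.
FALLBACK_PAGES = {"guides": ["install"]}


async def fetch_sitemap_pages():
    """Stand-in for the real network fetch and XML parsing."""
    await asyncio.sleep(0)
    return {"guides": ["install", "transpile"]}


@asynccontextmanager
async def lifespan(state, fetch=fetch_sitemap_pages):
    """Fetch once at startup and store the result for later handlers."""
    try:
        state["pages"] = await fetch()
    except Exception:
        state["pages"] = FALLBACK_PAGES
    try:
        yield state
    finally:
        state.clear()  # drop the stored result on shutdown


def list_guides(state):
    """Resource handlers read the stored value; no fetch per call."""
    return state["pages"]["guides"]


async def main():
    state = {}
    async with lifespan(state):
        return list_guides(state)


print(asyncio.run(main()))
```

The trade-off is that content published after startup is only picked up on the next restart, in exchange for handlers that never wait on the network.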
cbjuan merged commit 5ad45ff into main Apr 20, 2026
30 checks passed
cbjuan deleted the feat/179-dynamic-sitemap-discovery branch April 20, 2026 13:49

Linked issue: Docs MCP server: hardcoded resource lists are stale and missing major content sections