feat(docs-server): dynamic sitemap discovery for automatic content adaptation#190
Merged
…t adaptation

Closes #179. Resource lists were hardcoded and significantly out of date. This adds sitemap-based dynamic discovery so the server automatically picks up new pages when IBM updates the documentation site, with graceful fallback to comprehensive hardcoded constants when the sitemap is unreachable.

- Add _parse_sitemap_xml / _classify_page for sitemap parsing
- Add _fetch_sitemap_pages with TTL caching and fallback
- Convert get_list_of_modules/addons/guides to async with sitemap-first
- Add get_list_of_tutorials and get_list_of_api_packages
- Add qiskit-docs://tutorials and qiskit-docs://api-packages resources
- Update constants: 28 modules, ~160 guides, 43 tutorials, 6 API packages
- Add 22 new tests covering sitemap parsing, fetching, and fallback paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
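For readers who have not opened the diff, the parsing and classification step described in this commit could look roughly like the sketch below. It is not the PR's code: the URL prefixes in `CATEGORY_PREFIXES`, the category names, and the un-prefixed function names are assumptions made for illustration, and it uses the standard-library XML parser only to stay dependency-free.

```python
from __future__ import annotations

from urllib.parse import urlparse
from xml.etree.ElementTree import fromstring

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical URL prefixes; the real server's classification rules may differ.
CATEGORY_PREFIXES = {
    "guides": "/docs/guides/",
    "tutorials": "/docs/tutorials/",
    "api_packages": "/docs/api/",
}


def classify_page(url: str) -> tuple[str, str] | None:
    """Return (category, slug) for a docs URL, or None if it matches no prefix."""
    path = urlparse(url).path
    for category, prefix in CATEGORY_PREFIXES.items():
        if path.startswith(prefix):
            # Keep only the first path segment after the prefix.
            slug = path[len(prefix):].strip("/").split("/")[0]
            if slug:
                return category, slug
    return None


def parse_sitemap_xml(xml_text: str) -> dict[str, list[str]]:
    """Group the <loc> entries of a sitemap into sorted, de-duplicated buckets."""
    buckets: dict[str, set[str]] = {}
    root = fromstring(xml_text)
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if not loc.text:
            continue
        classified = classify_page(loc.text.strip())
        if classified is None:
            continue
        category, slug = classified
        buckets.setdefault(category, set()).add(slug)
    return {key: sorted(values) for key, values in buckets.items()}
```

Collecting slugs in sets before sorting means that many API pages under the same package collapse into a single entry for that package.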
The cache TTL environment variable was configurable but not listed in the README's environment variables table.
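As a rough illustration of how an environment-configurable TTL and the fallback path can fit together, here is a minimal synchronous sketch. The variable name `SITEMAP_CACHE_TTL`, the one-hour default, and the `CachedPages` class are all hypothetical; the README table is the authority on the real variable name, and the server's own fetcher is async.

```python
from __future__ import annotations

import os
import time
from typing import Callable

DEFAULT_TTL_SECONDS = 3600.0  # assumed default, not taken from the PR

Pages = dict[str, list[str]]


def read_ttl(env_var: str = "SITEMAP_CACHE_TTL") -> float:
    """Read the TTL from the environment; use the default if unset or invalid."""
    raw = os.environ.get(env_var)
    if raw is None:
        return DEFAULT_TTL_SECONDS
    try:
        ttl = float(raw)
    except ValueError:
        return DEFAULT_TTL_SECONDS
    return ttl if ttl >= 0 else DEFAULT_TTL_SECONDS


class CachedPages:
    """Cache a loader's result for `ttl` seconds; serve `fallback` on failure."""

    def __init__(
        self,
        loader: Callable[[], Pages],
        fallback: Pages,
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._fallback = fallback
        self._ttl = ttl
        self._clock = clock
        self._value: Pages | None = None
        self._loaded_at = 0.0

    def get(self) -> Pages:
        now = self._clock()
        if self._value is not None and now - self._loaded_at < self._ttl:
            return self._value  # still fresh
        try:
            self._value = self._loader()
            self._loaded_at = now
            return self._value
        except Exception:
            # Loader failed (for example, sitemap unreachable): use the fallback
            # and retry the loader on the next call.
            self._value = None
            return self._fallback
```

The injectable `clock` exists only so the expiry logic can be exercised in tests without sleeping.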
vabarbosa
approved these changes
Apr 17, 2026
Collaborator
vabarbosa
left a comment
thank you!
just a couple comments for consideration later (not necessarily right now):
- how should the hard coded values (e.g., guides, packages, etc.) remain up-to-date?
- it may be time to think about splitting data_fetcher.py into multiple files as it is getting long
# Conflicts:
#	qiskit-docs-mcp-server/README.md
#	qiskit-docs-mcp-server/src/qiskit_docs_mcp_server/data_fetcher.py
#	qiskit-docs-mcp-server/tests/test_server.py
Member
Author
The sitemap updater should enable automatic loading of categories and sections. The current hardcoded values should act as a backfill in case it fails. I do agree on splitting the file, and I'm working on a refactor of it.
… add fallback update script Address PR reviewer feedback by extracting data_fetcher.py (823 lines) into four focused modules: http.py (HTTP infrastructure and caching), sitemap.py (sitemap discovery), html_processing.py (HTML-to-markdown conversion), and a slimmed data_fetcher.py (business logic). Also adds scripts/update_fallback_constants.py to regenerate hardcoded fallback values from the live sitemap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
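One way a script like scripts/update_fallback_constants.py might turn discovered pages into source code is sketched below. This is an assumption about the approach, not the script's actual contents: `render_constants`, the `FALLBACK_` prefix, and the output layout are hypothetical names chosen for the example.

```python
from __future__ import annotations

import pprint


def render_constants(pages: dict[str, list[str]]) -> str:
    """Render discovered pages as Python source defining one list per category."""
    lines = ["# Auto-generated fallback constants; regenerate instead of editing.", ""]
    for category in sorted(pages):
        # Build a valid identifier such as FALLBACK_API_PACKAGES.
        name = "FALLBACK_" + category.upper().replace("-", "_")
        literal = pprint.pformat(sorted(pages[category]), width=88)
        lines.append(f"{name} = {literal}")
        lines.append("")
    return "\n".join(lines)
```

Sorting both the categories and their entries keeps the generated file stable between runs, so a regenerated file only shows a diff when the sitemap actually changed.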
…resources, and project structure Add documentation for dynamic sitemap discovery feature, the fallback constants update script, resource templates, two new resources (tutorials, api-packages), and the new modular project structure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…afe XML parsing Bandit flagged xml.etree.ElementTree as vulnerable to XML attacks (B405, B314). Switch to defusedxml which provides the same API with protection against XML bombs and external entity expansion.
Use direct import of fromstring instead of aliasing the module as ET, which triggered N817 (CamelCase imported as acronym) and I001 (unsorted imports). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
vabarbosa
approved these changes
Apr 20, 2026
Collaborator
vabarbosa
left a comment
thank you!!
just one minor comment (more of a preference than an issue so could be ignored)
    return {key: sorted(values) for key, values in buckets.items()}


async def _fetch_sitemap_pages() -> dict[str, list[str]] | None:
Collaborator
wondering if it would make sense to just fetch the sitemap on startup/lifespan and store the result? all subsequent calls would use the stored value
Member
Author
I like the change proposal. Implemented in 481bcc1
…zily Move sitemap fetching from per-resource-call lazy cache lookups to a single eager fetch during server lifespan startup, per reviewer feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
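The eager-fetch pattern this commit describes can be sketched with a plain async context manager. How the real server hooks this into its MCP framework's lifespan is not shown in this conversation, so `sitemap_lifespan`, the `fetch` parameter, and `FALLBACK_PAGES` are stand-ins for illustration only.

```python
from __future__ import annotations

import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Awaitable, Callable

Pages = dict[str, list[str]]

# Placeholder fallback; the real constants are far larger.
FALLBACK_PAGES: Pages = {"guides": ["install"]}


@asynccontextmanager
async def sitemap_lifespan(
    fetch: Callable[[], Awaitable[Pages | None]],
) -> AsyncIterator[Pages]:
    """Fetch the sitemap once at startup and expose the result for the app's lifetime."""
    try:
        pages = await fetch()
    except Exception:
        pages = None  # treat any fetch error like an unreachable sitemap
    # Every later resource call reads this stored value instead of refetching.
    yield pages or FALLBACK_PAGES


async def main() -> None:
    async def fake_fetch() -> Pages | None:
        return {"guides": ["install", "transpile"]}

    async with sitemap_lifespan(fake_fetch) as pages:
        print(sorted(pages))


if __name__ == "__main__":
    asyncio.run(main())
```

Compared with per-call lazy lookups, a single startup fetch removes cache-expiry checks from the request path, at the cost of not seeing new pages until the server restarts.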
Summary
Closes #179
- sitemap-0.xml to automatically discover all documentation pages (modules, addons, guides, tutorials, API packages), with TTL caching and graceful fallback to hardcoded constants when the sitemap is unreachable
- qiskit-docs://tutorials (43 tutorials) and qiskit-docs://api-packages (6 packages including qiskit-ibm-runtime, qiskit-ibm-transpiler, REST APIs)