Add codespell support with configuration and fixes #589

Open
yarikoptic wants to merge 6 commits into pangaea-data-publisher:master from yarikoptic:enh-codespell


@yarikoptic

Add codespell configuration and fix existing typos.

More about codespell: https://github.com/codespell-project/codespell

I have personally introduced it to over a hundred projects already, mostly
with positive feedback (see the improveit-dashboard note).

The CI workflow has read-only permissions, so it should also be safe.

Changes

Configuration & Infrastructure

  • Added [tool.codespell] configuration to pyproject.toml.
  • Created GitHub Actions workflow (.github/workflows/codespell.yml)
    to check spelling on push and PRs to master.
  • Added a codespell hook to .pre-commit-config.yaml
    (with tomli fallback for Python <3.11).
  • Configured to skip external vocabulary data files and other
    third-party / auto-generated content:
    • fuji_server/data/*.yaml, fuji_server/data/*.json,
      fuji_server/data/linked_vocabs/ — external identifier and
      ontology data (identifiers.org, bioregistry, bioportal, etc.).
    • tests/*/cassettes/ — VCR recordings.
    • simpleclient/ — auto-generated PHP client.
    • *.ipynb — base64-encoded images cause false positives.
  • Added URL ignore-regex (https?://\S+) so DOIs and links
    are never "fixed" (e.g. doi.org/.../j.patter.2021.100370).
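As a rough illustration of why the URL ignore-regex protects DOIs, the sketch below masks every match of the quoted pattern before any word-level check would run. This is a simplified model, not codespell's actual implementation; the helper name `mask_ignored` and the sample DOI are hypothetical.

```python
import re

# Pattern quoted in the description above for ignore-regex.
URL_IGNORE = re.compile(r"https?://\S+")


def mask_ignored(line: str) -> str:
    """Blank out every URL so a word-level checker never sees its contents."""
    return URL_IGNORE.sub(" ", line)


# Hypothetical sample line; the DOI below is made up for illustration.
line = "See https://doi.org/10.0000/j.patter.example for details"
masked = mask_ignored(line)
remaining_words = masked.split()
```

Note that the pattern requires an explicit `http://` or `https://` scheme, so a bare `doi.org/...` reference without a scheme would not be masked by it.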

Domain-Specific Whitelist (ignore-words-list)

Typo Fixes

Ambiguous typos fixed manually (7 fixes with context review,
fb5135f):

  • Selectin → Selecting (metrics_configuration.md heading)
  • shat → that (docs/source/conf.py comment)
  • stati → statuses (metadata_harvester.py comment)
  • doesnt → doesn't (metadata_harvester.py comment)
  • "fo rthe" → "for the" (misplaced space in body.py, openapi.yaml)
  • taked → tagged (tests/helper/test_preprocessor.py; not in
    codespell's dictionary, the intended word is "tagged")
  • identifer → identifiers (fuji_server/data/README.md
    refers to the identifiers.org service)

Non-ambiguous typos fixed automatically via
datalad run codespell -w (single-suggestion fixes,
5f43837). Common fixes include:
explicitely → explicitly, ressources → resources,
Sucessfully/sucessfully/Succesfully → Successfully/successfully,
inaccesible → inaccessible, peristent → persistent,
accesibility → accessibility, folows → follows,
seperate → separate, souce → source, paramter → parameter,
Returm → Return, metadat/matadata → metadata,
variuos → various, namepace → namespace, exent → extent,
opional → optional, reposiroty → repository, occured → occurred,
insufficent → insufficient, publically → publicly,
stanard → standard, identied → identified.
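The split between automatic and manual fixes can be sketched as follows: a typo with exactly one suggestion is rewritten in place, while one with several candidates is left for human review. The mini-dictionary and the `autofix` helper are hypothetical and do not reflect codespell's real dictionary data or internals.

```python
import re

# Hypothetical mini-dictionary for illustration; not codespell's real data.
SUGGESTIONS = {
    "seperate": ["separate"],
    "occured": ["occurred"],
    "stati": ["status", "statuses"],  # more than one candidate: ambiguous
}


def autofix(text: str):
    """Apply single-suggestion fixes; collect ambiguous words for manual review."""
    needs_review = []

    def replace(match):
        word = match.group(0)
        options = SUGGESTIONS.get(word)
        if options is None:
            return word  # not a known typo
        if len(options) == 1:
            return options[0]  # unambiguous: safe to rewrite
        needs_review.append(word)  # ambiguous: leave untouched
        return word

    fixed = re.sub(r"[A-Za-z']+", replace, text)
    return fixed, needs_review


fixed, needs_review = autofix("seperate stati occured")
```

Here `fixed` keeps `stati` unchanged and `needs_review` lists it, mirroring why the ambiguous cases above were handled in a separate, manually reviewed commit.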

Potential functional fix

fuji_server/models/core_metadata_output.py:79 contained an
allowed_values validator list that used "insufficent metadata" (typo),
while the producing evaluator
(fair_evaluator_minimal_metadata.py:219) already uses
"insufficient metadata" (correct). This means the validator would
have raised ValueError on the normal code path. The fix propagates the
correct spelling to the enum here and in openapi.yaml.
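A minimal sketch of the mismatch, assuming the setter follows the usual generated-model pattern of checking a value against `allowed_values`. The function `validate_status` and the constant names are hypothetical stand-ins, not the actual code in core_metadata_output.py.

```python
# The two lists mirror the before/after lines of the diff in this PR.
ALLOWED_BEFORE = ["insufficent metadata", "partial metadata", "all metadata"]
ALLOWED_AFTER = ["insufficient metadata", "partial metadata", "all metadata"]


def validate_status(value, allowed_values):
    """Hypothetical stand-in for the model's setter check."""
    if value not in allowed_values:
        raise ValueError(
            f"Invalid value for core_metadata_status ({value!r}), "
            f"must be one of {allowed_values}"
        )
    return value


# The evaluator emits the correctly spelled status string.
produced = "insufficient metadata"

try:
    validate_status(produced, ALLOWED_BEFORE)
    rejected_before_fix = False
except ValueError:
    rejected_before_fix = True

accepted_after_fix = validate_status(produced, ALLOWED_AFTER) == produced
```

Under these assumptions, the correctly spelled value is rejected by the old list and accepted by the new one, which is why the spelling fix may also be a behavioral fix.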

Historical Context

Master has ~21 prior commits mentioning typo / spelling fixes,
confirming the value of automated spell-checking going forward.

Testing

  • Codespell passes with zero errors after all fixes.

🤖 Generated with Claude Code and a love for typo-free code

yarikoptic and others added 6 commits April 20, 2026 12:54
- Skip external vocabulary data files (fuji_server/data/*.{yaml,json},
  linked_vocabs/) — these are third-party identifier and ontology data.
- Skip test VCR cassettes and the PHP simpleclient (external content).
- Skip Jupyter notebooks (base64 images cause false positives).
- Add ignore-regex for URLs (DOIs/links can contain "typos" that must
  not be fixed).
- Add ignore-words-list entries for domain terms:
  - connexion — Python library name
  - lod — Linked Open Data
  - ore — Object Reuse and Exchange (OAI-ORE)
- Fix stale comment in pre-commit config (codespell config lives in
  pyproject.toml, not .pre-commit-config.yaml) and add tomli fallback
  dependency for Python <3.11.

Co-Authored-By: Claude Code 2.1.114 / Claude Opus 4.7 <noreply@anthropic.com>
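The tomli fallback mentioned in this commit exists because `tomllib` only joined the standard library in Python 3.11. The sketch below shows the common import pattern and parses an illustrative `[tool.codespell]` table; the option values come from the commit message, but the exact layout of the real pyproject.toml is an assumption.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API for older Pythons

# Illustrative stand-in for the [tool.codespell] table; the real file's
# exact contents and layout are an assumption.
SAMPLE_PYPROJECT = r"""
[tool.codespell]
skip = "fuji_server/data/*.yaml,fuji_server/data/*.json,fuji_server/data/linked_vocabs,tests/*/cassettes,simpleclient,*.ipynb"
ignore-regex = 'https?://\S+'
ignore-words-list = "connexion,lod,ore"
"""

config = tomllib.loads(SAMPLE_PYPROJECT)["tool"]["codespell"]
ignored_words = config["ignore-words-list"].split(",")
skipped_paths = config["skip"].split(",")
```

Keeping the settings in pyproject.toml means the pre-commit hook, the CI workflow, and a local `codespell` run can all read the same configuration.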
Fixed typos where codespell offers multiple suggestions, selecting the
correct fix based on surrounding context:

- Selectin -> Selecting (metrics_configuration.md:38 heading)
- shat -> that (docs/source/conf.py:324 comment)
- stati -> statuses (metadata_harvester.py:111 comment)
- doesnt -> doesn't (metadata_harvester.py:1268 comment)
- "fo rthe" -> "for the" (body.py, openapi.yaml — misplaced space)
- taked -> tagged (test_preprocessor.py:17 docstring — not in
  codespell's dictionary; actual word is "tagged", not "took/taken")
- identifer -> identifiers (fuji_server/data/README.md:9 — refers
  to the identifiers.org service)

Co-Authored-By: Claude Code 2.1.114 / Claude Opus 4.7 <noreply@anthropic.com>
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "uvx codespell -w",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
:type core_metadata_status: str
"""
- allowed_values = ["insufficent metadata", "partial metadata", "all metadata"]
+ allowed_values = ["insufficient metadata", "partial metadata", "all metadata"]
Author


potential functional bugfix
