
[FEATURE] Umlauts (e.g., ä → ae), ß (→ ss), and the lang-s (ſ → s) #1971

Open
BFallert wants to merge 2 commits into kitodo:main from UB-Mannheim:feature/UmlautLongS


@BFallert
Collaborator

Umlauts, the ß, and the long s (ſ) are converted consistently

ä => ae
...
ß => ss
ſ => s
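
For illustration only, the mapping above can be sketched in Python as a plain character translation. This is not the PR's actual Solr configuration; the name `fold` is hypothetical, and every entry other than ä, ß, and ſ (the ones elided by "..." above) is an assumption about what the full list might contain.

```python
import unicodedata

# Hypothetical mapping table. Only ä, ß and ſ are taken from the PR description;
# the remaining entries are assumptions standing in for the elided "..." lines.
CHAR_MAP = {
    "\u00e4": "ae",  # ä (from the PR description)
    "\u00f6": "oe",  # ö (assumed)
    "\u00fc": "ue",  # ü (assumed)
    "\u00c4": "Ae",  # Ä (assumed)
    "\u00d6": "Oe",  # Ö (assumed)
    "\u00dc": "Ue",  # Ü (assumed)
    "\u00df": "ss",  # ß (from the PR description)
    "\u017f": "s",   # ſ, long s (from the PR description)
}
_TABLE = str.maketrans(CHAR_MAP)


def fold(text: str) -> str:
    """Apply the character mapping to a string.

    The input is first brought into NFC so that a decomposed umlaut
    (base letter + combining diaeresis) is treated like the precomposed one.
    """
    return unicodedata.normalize("NFC", text).translate(_TABLE)
```

Applying the same function to both the indexed text and the query is what would let a search for "Wasser" match "Waſſer".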

@BFallert
Collaborator Author

After a reindex, a full-text search for "s" will also match the long s (ſ).

@michaelkubina
Collaborator

Dear Bernd,
we also experimented with the MappingCharacterFilter, as well as with ICU-Transformation, the ASCII-FoldingFilter, and maybe some others that I don't remember anymore. Ultimately we decided against it and went with normalizing the OCR itself before ingesting/indexing. I gave a lightning talk on this at the Kitodo Praxistreffen 2024 in Marburg (https://www.kitodo.org/fileadmin/groups/kitodo/Dokumente/Praxistreffen_2024_Betrachtung_von_Grenzfaellen_in_der_OCR_und_Suche.pdf).

The main reason was that it broke word highlighting in the page view (OCR overlay), and that our PDFs used the same original ALTO for the full-text enhancement, so searching within the PDFs failed whereas it worked in Kitodo.Presentation. The latter could have been solved by an additional normalization step, though.

The text snippets (and the word highlighting within the snippets) in the result list were okay, and search was absolutely improved by introducing the MappingCharacterFilter while preserving historical writing! But I believe the Solr OCR highlighting plugin does not normalize, and the full-text tokens are stored as they were in the ALTO. That left the JavaScript for OpenLayers unable to match a token that should be highlighted to the string in the word_highlighting parameter (there seems to be an issue with case sensitivity as well).

There is another aspect worth noting: your approach is likely motivated by the output generated by german_print.mlmodel. There are also historic umlauts, those with the superscript "e" (Aͤ/aͤ, Oͤ/oͤ, Uͤ/uͤ), that german_print writes out. Sadly, they cause the same trouble of not being highlighted: search is fixed with the MappingCharacterFilter, but highlighting in the page view isn't.
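
To make the historic umlauts concrete: they are a base vowel followed by U+0364 (COMBINING LATIN SMALL LETTER E), and Unicode has no precomposed form for them, so NFC normalization alone leaves them untouched. The sketch below is a hypothetical Python helper (`fold_historic_umlauts` is not part of the PR or of Kitodo.Presentation) that rewrites them to the two-letter form, under the assumption that they should follow the same ä → ae convention.

```python
import re

# U+0364 COMBINING LATIN SMALL LETTER E, the superscript "e" of historic umlauts.
COMBINING_E = "\u0364"

# A base vowel immediately followed by the combining e.
_HISTORIC_UMLAUT = re.compile("([AaOoUu])" + COMBINING_E)


def fold_historic_umlauts(text: str) -> str:
    """Rewrite vowel + combining e as vowel + plain 'e' (assumed convention)."""
    return _HISTORIC_UMLAUT.sub(r"\1e", text)
```

A string without U+0364 passes through unchanged, so the helper could be chained with any other mapping step.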

So, for this to be robust, the highlighting mechanism in the page view would likely need some sort of normalization as well. I believe @stweil also mentioned that german_print.mlmodel produced some other superscripted/subscripted characters that I did not mention here but that might cause trouble...
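
A minimal sketch of what such normalization could look like, written in Python purely for illustration (the real page view is JavaScript). The names `normalize` and `words_to_highlight` are hypothetical, the mapping entries are assumptions, and the only idea shown is that both the ALTO word and the highlight term are normalized the same way before being compared.

```python
import unicodedata

# Assumed folding rules, applied after lowercasing. U+0364 is the combining
# superscript "e" of historic umlauts; mapping it to "e" turns "a" + U+0364 into "ae".
_FOLD = str.maketrans({
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
    "\u00df": "ss",  # ß
    "\u017f": "s",   # ſ (long s)
    "\u0364": "e",   # combining small e
})


def normalize(token: str) -> str:
    """Lowercase, compose to NFC, then fold special characters."""
    return unicodedata.normalize("NFC", token).lower().translate(_FOLD)


def words_to_highlight(alto_words: list[str], highlight_terms: list[str]) -> list[int]:
    """Return the indices of ALTO words whose normalized form matches a term."""
    wanted = {normalize(term) for term in highlight_terms}
    return [i for i, word in enumerate(alto_words) if normalize(word) in wanted]
```

With this approach the stored ALTO strings stay untouched, so historical spelling is preserved in the display while matching becomes insensitive to case and to the folded characters.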

Um, no criticism intended here; I just wanted to point out that this PR has some further implications.

@sebastian-meyer changed the title from "Umlauts (e.g., ä → ae), ß (→ ss), and the lang-s (ſ → s)" to "[FEATURE] Umlauts (e.g., ä → ae), ß (→ ss), and the lang-s (ſ → s)" on May 12, 2026
@sebastian-meyer added the "↷ feature" label (A new feature or enhancement.) on May 12, 2026
@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (5b21abf) to head (5485ac0).
⚠️ Report is 2 commits behind head on main.


@sebastian-meyer
Member

Thank you for this contribution!

As @michaelkubina already said, this has to be extended to cover highlighting as well before it can be merged.


Labels

↷ feature A new feature or enhancement.


3 participants