[FEATURE] Umlauts (e.g., ä → ae), ß (→ ss), and the lang-s (ſ → s) #1971
BFallert wants to merge 2 commits into
|
After a reindex, a full-text search for "s" will also find the long s ("Lang-S", ſ).
|
Dear Bernd,

The main reason was that it broke word highlighting in the page view (OCR overlay), and that our PDFs used the same original ALTO for the full-text enhancement, so searching within the PDFs failed where it worked in Kitodo.Presentation. The latter could have been solved by an additional normalization step, though.

The text snippets (and the word highlighting within the snippets) in the result list were okay, and search was absolutely improved by introducing the MappingCharacterFilter while preserving historical writing! But I believe the Solr OCR highlighting plugin does not normalize, and the full-text tokens are stored as they were in the ALTO. As a result, the JavaScript for OpenLayers could not match a token that should be highlighted to the string in the word_highlighting parameter (there seems to be an issue with case sensitivity as well).

There is another aspect worth noting: your approach is likely motivated by the output generated by german_print.mlmodel. There are also historic umlauts, those with the superscript "e" (Aͤ/aͤ, Oͤ/oͤ, Uͤ/uͤ), that are written out by german_print.mlmodel. Sadly, they cause the same trouble of not being highlighted: search is fixed with the MappingCharacterFilter, but highlighting in the page view isn't. So, for this to be robust, the highlighting mechanism in the page view would likely need some sort of normalization as well.

I believe @stweil also mentioned that german_print.mlmodel created some other superscripted/subscripted characters that I did not mention here but that might cause trouble. Um, no criticism here, I just wanted to point out that there are some other implications to this PR.
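To make the highlighting mismatch described above concrete, here is a minimal sketch of normalizing both sides before comparing them. It is an illustration only: the actual page-view code is JavaScript, and the names `MAPPING`, `normalize`, and `should_highlight` are hypothetical. The uppercase umlaut entries and the handling of the combining superscript "e" (U+0364) are assumptions, not something this PR specifies.

```python
import unicodedata

# Assumed mapping; only ä, ß and ſ are spelled out in the PR description.
MAPPING = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
    "ſ": "s",
    "\u0364": "e",  # combining small letter e, as in the historic umlauts
}


def normalize(token: str) -> str:
    """Return a comparison key for an OCR token or a highlight term."""
    # Compose "a" + combining diaeresis into "ä" so one rule covers both forms.
    composed = unicodedata.normalize("NFC", token)
    mapped = "".join(MAPPING.get(ch, ch) for ch in composed)
    # Lowercase last, to sidestep the case-sensitivity issue mentioned above.
    return mapped.lower()


def should_highlight(ocr_token: str, highlight_terms: list[str]) -> bool:
    """Compare normalized forms instead of the raw ALTO string."""
    wanted = {normalize(term) for term in highlight_terms}
    return normalize(ocr_token) in wanted
```

With this kind of comparison, a token stored as "Waſſer" in the ALTO would match a highlight term "wasser", while the original spelling stays untouched in the displayed text.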
Codecov Report: ✅ All modified and coverable lines are covered by tests.
|
|
Thank you for this contribution! As @michaelkubina already said, this has to be extended to also cover highlighting before it can be merged.
Umlauts, the ß, and the long s (ſ) are converted consistently
ä => ae
...
ß => ss
ſ => s
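As a rough illustration of what these rules do to indexed text and queries, the sketch below parses lines in the `source => target` form shown above and applies them character by character. The helper names `parse_rules` and `apply_rules` are hypothetical, and only the three rules that are actually listed are used; the elided ones are deliberately not filled in.

```python
def parse_rules(lines):
    """Parse 'source => target' lines into a dict, skipping anything else."""
    rules = {}
    for line in lines:
        if "=>" not in line:
            continue  # e.g. the elided "..." line
        source, target = (part.strip() for part in line.split("=>", 1))
        rules[source] = target
    return rules


def apply_rules(text, rules):
    """Replace every character that has a rule; keep all others unchanged."""
    return "".join(rules.get(ch, ch) for ch in text)


# Only the rules spelled out in the description above.
RULES = parse_rules(["ä => ae", "...", "ß => ss", "ſ => s"])

indexed = apply_rules("Waſſer", RULES)  # what the index would see
query = apply_rules("s", RULES)         # the query passes through the same rules
found = query in indexed                # a search for "s" now hits the long s
```

Because the same rules are applied at index and query time, the stored document can keep its historical spelling while both sides are compared in the mapped form.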