fix_invalid_readingorder.py: prune ReadingOrder of invalid references #36
kba wants to merge 1 commit into
Shouldn't the combination of
Regarding partial ReadingOrder, are you sure we want to disallow that (by way of PAGE validation in core)? After all, there may be a lot of practical use cases for that.
No idea how I missed that one and I was looking for it because I knew that we talked about this in the past 😊 Good that I did not spend too much time on it.
Definitely. I only chose a Python script because I had already spent time on the validator and could reuse the results of the report.
No, not disallow it, but notify users about it. It doesn't even have to be a warning, and it could be smarter (e.g. only checking TextRegions, or excluding things like SeparatorRegion). But we needed to detect them, and the validator in core seemed like the best place for it.
We have some GT data with inconsistent ReadingOrder, with regions referenced not being in the document (A) and existing regions not being in the ReadingOrder (B).
Uses OCR-D/core#1360 for detecting these.
page-ensure-readingorder.xsl but that does not support partial ROs, so that aspect is still unsolved (and tricky to solve anyway).
ReadingOrderInvalidError detected and removes the offending RegionRef(Indexed) from the document with etree.
There is obviously a more planned way to do this, but I needed a solution quickly and this seemed the best place to keep that. I would really love to have these scripts and snippets including tooling in core itself, so you can do ocrd validate page first and then ocrd fix ... to fix issues that can be fixed automatically.
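For illustration, the pruning of case (A) could look roughly like the sketch below. It is not the script from this PR: instead of consuming the validator report from OCR-D/core#1360, it assumes a reference is invalid whenever its regionRef matches no id anywhere in the document. It uses the standard library's xml.etree.ElementTree, and the function name prune_invalid_region_refs and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so the check works for any PAGE version."""
    return tag.rsplit("}", 1)[-1]


def prune_invalid_region_refs(root):
    """Remove RegionRef/RegionRefIndexed elements whose regionRef has no target.

    Assumption: 'invalid' means the referenced id occurs nowhere in the
    document. Returns the list of dangling ids that were removed.
    """
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = []
    for el in list(root.iter()):
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        if local_name(el.tag) in ("RegionRef", "RegionRefIndexed"):
            ref = el.get("regionRef")
            if ref not in known_ids:
                parent_of[el].remove(el)
                removed.append(ref)
    return removed


# Hypothetical minimal PAGE document with one dangling reference.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="x.png" imageWidth="10" imageHeight="10">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r_missing"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
removed = prune_invalid_region_refs(root)
print(removed)
```

Note that this sketch does not renumber the index attributes of the remaining RegionRefIndexed elements, and it does nothing about case (B), regions missing from the ReadingOrder.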