fix_invalid_readingorder.py: prune ReadingOrder of invalid references #36

Closed
kba wants to merge 1 commit into bertsky:master from kba:fix-reading-order

Conversation

@kba
Contributor

kba commented May 12, 2026

We have some GT data with inconsistent ReadingOrder: (A) referenced regions that do not exist in the document, and (B) existing regions that are missing from the ReadingOrder.

This PR uses OCR-D/core#1360 to detect these.

  • For B, there is page-ensure-readingorder.xsl but that does not support partial ROs, so that aspect is still unsolved (and tricky to solve anyway).
  • For A, this PR adds a script that runs PAGE validation, iterates over the detected ReadingOrderInvalidError entries, and removes the offending RegionRef(Indexed) elements from the document with etree.
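
The pruning step for case A could look roughly like the following. This is a minimal sketch, not the script from this PR: it does not call the core validator at all, but finds dangling references by comparing `regionRef` values against region ids. It also assumes the PAGE 2019-07-15 namespace, uses the standard library's `xml.etree.ElementTree` rather than lxml, and the function name `prune_dead_region_refs` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; adjust for other PAGE versions.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
REF_TAGS = {f"{{{PAGE_NS}}}RegionRef", f"{{{PAGE_NS}}}RegionRefIndexed"}


def prune_dead_region_refs(root):
    """Remove RegionRef(Indexed) elements that point to no existing region.

    Hypothetical helper: returns the list of dangling ids that were removed.
    """
    # All PAGE region element names end in "Region" (TextRegion, ImageRegion, ...).
    region_ids = {
        el.get("id") for el in root.iter()
        if el.tag.endswith("Region") and el.get("id")
    }
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    removed = []
    # Materialize the iterator so that removing elements is safe.
    for ref in list(root.iter()):
        if ref.tag in REF_TAGS and ref.get("regionRef") not in region_ids:
            parents[ref].remove(ref)
            removed.append(ref.get("regionRef"))
    return removed


SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r_missing"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
  </Page>
</PcGts>"""

root = ET.fromstring(SAMPLE)
print(prune_dead_region_refs(root))  # → ['r_missing']
```

The sketch leaves the `index` values of the remaining RegionRefIndexed elements untouched and does not clean up groups that become empty after pruning.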

There is obviously a more systematic way to do this, but I needed a solution quickly and this seemed the best place to keep it. I would really love to have these scripts and snippets, including tooling, in core itself, so that you can run ocrd validate page first and then ocrd fix ... to fix the issues that can be fixed automatically.

@bertsky
Owner

bertsky commented May 12, 2026

Shouldn't the combination of page-remove-dead-regionrefs.xsl, page-remove-empty-readingorder.xsl and page-ensure-readingorder.xsl cover everything? Or would you require an additional page-remove-partial-readingorder.xsl? (I prefer keeping those scripts XSL-only where I can, because these XSLs are more versatile, i.e. can be re-used.)

Regarding partial ReadingOrder, are you sure we want to disallow that (by way of PAGE validation in core)? After all, there may be a lot of practical use-cases for that.

@kba
Contributor Author

kba commented May 13, 2026

> page-remove-dead-regionrefs.xsl

No idea how I missed that one, even though I was looking for it because I knew we had talked about this in the past 😊 Good that I did not spend too much time on it.

> (I prefer keeping those scripts XSL-only where I can, because these XSLs are more versatile, i.e. can be re-used.)

Definitely. I only chose a Python script because I had already spent time on the validator and could reuse the results of its report.

> Regarding partial ReadingOrder, are you sure we want to disallow that

No, not disallow it, but notify users about it. It does not even have to be a warning, and it could be smarter (e.g. only checking TextRegions, or excluding things like SeparatorRegion). But we needed to detect them, and the validator in core seemed like the best place for it.
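
To illustrate what a "smarter" check for partial ReadingOrder could mean, here is a rough sketch that reports only regions of selected types that are missing from the ReadingOrder. It is not the validator code in core; the function name `regions_missing_from_reading_order`, the `region_types` parameter, and the PAGE 2019-07-15 namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2019-07-15 namespace; adjust for other PAGE versions.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = f"{{{PAGE_NS}}}"


def regions_missing_from_reading_order(root, region_types=("TextRegion",)):
    """Return ids of regions of the given types not referenced in ReadingOrder.

    Hypothetical helper: region_types controls which region elements are
    checked, so e.g. SeparatorRegion is ignored unless explicitly listed.
    """
    referenced = set()
    for reading_order in root.iter(NS + "ReadingOrder"):
        # Collect regionRef from RegionRef(Indexed) and from groups alike.
        for el in reading_order.iter():
            if el.get("regionRef"):
                referenced.add(el.get("regionRef"))
    wanted = {NS + name for name in region_types}
    return [
        el.get("id") for el in root.iter()
        if el.tag in wanted and el.get("id") not in referenced
    ]


SAMPLE_PARTIAL = f"""<PcGts xmlns="{PAGE_NS}">
  <Page>
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1"/>
    <TextRegion id="r2"/>
    <SeparatorRegion id="s1"/>
  </Page>
</PcGts>"""

page = ET.fromstring(SAMPLE_PARTIAL)
print(regions_missing_from_reading_order(page))  # → ['r2']
```

With the default `region_types`, the unreferenced SeparatorRegion is not reported; passing `("TextRegion", "SeparatorRegion")` would include it.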

kba closed this May 13, 2026
2 participants