Fix language codes not recognized correctly #453

Open
Aunirbhan wants to merge 1 commit into openzim:main from Aunirbhan:fix/337-language-resolution

@Aunirbhan
Problem

get_zim_language_metadata() misses ISO 639-3 codes like gla, nap, and oji because it only looks up 2-letter keys in ISO_MATRIX. These codes resolve to None, which produces an empty language list and crashes ZIM creation.

Solution

  • Track unresolved codes and report them in a warning.
  • Raise a clear error pointing to --zim-languages when nothing resolves.
  • Add an ISO_MATRIX_REV fallback so codes already known as values (like gd→gla, oj→oji) resolve correctly.
  • @benoit74 I also added nap to ZIM_LANGUAGES_MAP since you mentioned it on the issue thread.
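
For readers skimming the thread, the reverse-lookup fallback could look roughly like the sketch below. This is not the code from the PR: the table contents are toy stand-ins, the assumption that ISO_MATRIX maps 2-letter to 3-letter codes is inferred from the gd→gla example, and `resolve_code` is a hypothetical helper name.

```python
from typing import Optional

# Toy stand-in for the real table (assumption: ISO 639-1 -> ISO 639-3).
ISO_MATRIX = {"gd": "gla", "oj": "oji", "fr": "fra"}

# Reverse view: each 3-letter value becomes a key pointing at its 2-letter code.
ISO_MATRIX_REV = {three: two for two, three in ISO_MATRIX.items()}


def resolve_code(lang: str) -> Optional[str]:
    """Return the 3-letter code for lang, or None (hypothetical helper)."""
    if lang in ISO_MATRIX:  # 2-letter key, e.g. "gd"
        return ISO_MATRIX[lang]
    if lang in ISO_MATRIX_REV:  # already a known 3-letter value, e.g. "gla"
        return lang
    return None  # unknown to both tables, e.g. "myn"
```

With these toy tables, both `resolve_code("gd")` and `resolve_code("gla")` yield "gla", while a code present in neither table still comes back as None and has to be handled by the caller.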

Before / After

Before, get_zim_language_metadata() on upstream:

get_zim_language_metadata(["gla"], books)  →  []  # silent, no warning
get_zim_language_metadata(["nap"], books)  →  []
get_zim_language_metadata(["oji"], books)  →  []
get_zim_language_metadata(["myn"], books)  →  []

After:

get_zim_language_metadata(["gla"], books)  →  ["gla"]
get_zim_language_metadata(["nap"], books)  →  ["nap"]
get_zim_language_metadata(["oji"], books)  →  ["oji"]

get_zim_language_metadata(["myn"], books)  →  []
# WARNING: "Could not resolve ZIM language metadata for: myn"

get_zim_language_metadata(["myn", "nai"], books)  →  []
# WARNING: "Could not resolve ZIM language metadata for: myn, nai"
# ValueError: "Cannot resolve language metadata for: myn, nai.
#              Use --zim-languages to override."

Passes hatch run test:run

Closes #337

Collaborator

@benoit74 benoit74 left a comment

Thank you. I feel like we are on the right path.

What needs to be enhanced:

  • I prefer the former logic where we build the language_counts in a single line
  • we can probably write a similar single line to detect any unresolved code, something like unresolved = [lang for lang in languages if not ZIM_LANGUAGES_MAP.get(lang, [ISO_MATRIX.get(lang, None)])] (to be updated with ISO_MATRIX_REV as well)
  • we should fail the scraper immediately when any language is unresolved instead of issuing a warning; this needs to be fixed rather than silently creating ZIMs with incorrect metadata
  • we can probably extract the "complex" logic to get zim_langs from a single language into a dedicated function
  • we need to add tests for this
  • we need a CHANGELOG entry
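
A rough sketch of how the dedicated function, the single-line unresolved check, and the fail-fast behaviour might fit together. Everything here is an assumption for illustration: the table contents are toy data, ZIM_LANGUAGES_MAP is assumed to map a code to a list of ZIM languages, and the names `zim_langs_for` and `check_languages` are hypothetical. One detail to watch: a default of `[ISO_MATRIX.get(lang, None)]` evaluates to `[None]` for an unknown code, and a one-element list is truthy, so the sketch returns an empty list for unresolved codes instead.

```python
# Toy stand-ins for the real tables (contents are assumptions).
ISO_MATRIX = {"gd": "gla", "oj": "oji"}
ISO_MATRIX_REV = {three: two for two, three in ISO_MATRIX.items()}
ZIM_LANGUAGES_MAP = {"nap": ["nap"]}


def zim_langs_for(lang):
    """Return ZIM language codes for one language, or [] if unresolved."""
    if lang in ZIM_LANGUAGES_MAP:  # explicit override wins
        return ZIM_LANGUAGES_MAP[lang]
    if lang in ISO_MATRIX:  # 2-letter key
        return [ISO_MATRIX[lang]]
    if lang in ISO_MATRIX_REV:  # already a known 3-letter value
        return [lang]
    return []


def check_languages(languages):
    """Raise as soon as any requested language cannot be resolved."""
    unresolved = [lang for lang in languages if not zim_langs_for(lang)]
    if unresolved:
        raise ValueError(
            f"Cannot resolve language metadata for: {', '.join(unresolved)}. "
            "Use --zim-languages to override."
        )
```

Keeping the per-language lookup in its own function makes it straightforward to unit-test each branch separately from the error handling.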

Development

Successfully merging this pull request may close these issues.

Some language are not recognized correctly?

2 participants