[WIP] output metadata in EPUB #1948

Open
xworld21 wants to merge 2 commits into brucemiller:master from xworld21:epub-metadata


Conversation

@xworld21
Contributor

I wanted to understand CrossRef, Scan, etc, so I tried to fix #1947. It should be a reasonable start, except for that ugly CrossRef::getTextContent.

@xworld21
Contributor Author

xworld21 commented Aug 25, 2022

References for improvements:

@xworld21
Contributor Author

Note: many readers (e.g. Apple Books) are not able to deal with multiple dc:creator tags and they end up showing the first one only (possibly the first one with aut role, but I haven't tested this theory). So instead of following the spec, the authors should be compressed in a single tag, for instance separated by &.

As far as I can tell, that's the only critical piece of metadata that needs special handling. The rest (accepted date, doi, pii, keywords, etc) maps quite naturally to the spec.
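To make the two options concrete, here is a minimal Python sketch (not the PR's Perl code) of the spec-style output, with one dc:creator per author and a role refinement, next to the single-tag workaround. The helper names, the id scheme (creator01, ...) and the sample author names are placeholders, not taken from LaTeXML.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"


def creator_metadata(authors):
    """Spec-style: one dc:creator per author, each refined with the 'aut' role."""
    metadata = ET.Element(f"{{{OPF}}}metadata")
    for index, name in enumerate(authors, start=1):
        creator_id = f"creator{index:02d}"  # hypothetical id scheme
        creator = ET.SubElement(metadata, f"{{{DC}}}creator", {"id": creator_id})
        creator.text = name
        role = ET.SubElement(
            metadata,
            f"{{{OPF}}}meta",
            {"refines": f"#{creator_id}", "property": "role", "scheme": "marc:relators"},
        )
        role.text = "aut"
    return metadata


def joined_creator(authors, separator=" & "):
    """Workaround: all authors compressed into a single dc:creator."""
    creator = ET.Element(f"{{{DC}}}creator")
    creator.text = separator.join(authors)
    return creator


# Placeholder author names, for illustration only.
sample_authors = ["First Author", "Second Author"]
spec_style = creator_metadata(sample_authors)
single_tag = joined_creator(sample_authors)
```

The first function keeps the fields separable for consumers that follow the spec; the second trades that away for readers that only display one creator.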

xworld21 changed the title from "Epub metadata" to "[WIP] output metadata in EPUB" on Aug 29, 2022

@xworld21
Contributor Author

I have mapped all front matter bits to the closest metadata counterparts for EPUB, and dropped the few that don't match anything in the EPUB spec (some dates).

Assuming the mapping is reasonable, the remaining issues are:

  • CrossRef::getTextContent – this should be a generic util
  • the attribute @name (of <ltx:classification>) may need punctuation, but it's hard to anticipate what it should be. This is the same problem that the XSLT stylesheets already have, so I won't try to come up with a solution here.

PS: I think the code is quite verbose to generate the metadata, and XSLT feels like a better fit in this case. Except one would have to translate getTextContent to XSLT.

@brucemiller
Owner

Actually, that was my first impression: that this was legitimately in the realm of XSLT (which is usually what you like :> ). Most of getTextContent should be easily reproducible in XSLT (should probably be in the Common module, if there isn't already something like it). The main thing lacking, I would think, is unicodemath, which could be pasted onto ltx:Math by CrossRef, perhaps?
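As a rough illustration of that idea, the following Python sketch flattens a node to plain text and substitutes a precomputed string for each ltx:Math. It is not a port of CrossRef::getTextContent, whose exact behavior is not shown in this thread. The attribute name unicodemath is hypothetical (it is only the proposal above), and the fallback to the tex attribute is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET

LTX = "http://dlmf.nist.gov/LaTeXML"
MATH_TAG = f"{{{LTX}}}Math"


def _raw_text(node, math_attr):
    """Concatenate text in document order, replacing ltx:Math by an attribute value."""
    if node.tag == MATH_TAG:
        # 'unicodemath' is a hypothetical attribute; fall back to @tex if it is absent.
        return node.get(math_attr) or node.get("tex") or ""
    parts = [node.text or ""]
    for child in node:
        parts.append(_raw_text(child, math_attr))
        parts.append(child.tail or "")
    return "".join(parts)


def get_text_content(node, math_attr="unicodemath"):
    """Plain-text content of a node with whitespace runs collapsed."""
    return " ".join(_raw_text(node, math_attr).split())


# Made-up sample title, for illustration only.
title = ET.fromstring(
    f'<title xmlns="{LTX}">Energy  '
    '<Math unicodemath="E=mc²" tex="E=mc^2"><XMath/></Math>'
    " revisited</title>"
)
flattened = get_text_content(title)
```

Whitespace is normalized only once, at the top level, so that words on either side of an inline element are not glued together.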

@xworld21
Contributor Author

this was legitimately in the realm of XSLT

Cool! Ok, I can do that. I am just struggling with the best location. Could we e.g. have a special mode in LaTeXML-epub3.xsl (say mode="manifest-content-opf")? And maybe do the processing in XSLT.pm?

The main thing lacking, I would think, is unicodemath, which could be pasted onto ltx:Math by CrossRef, perhaps?

It can be exposed to XSLT via register_function as something like f:unicodemath($node), if not reimplemented in XSLT.

@xworld21
Contributor Author

If #1951 is good (or can be made good for merge), then

<exsl:document href="OPS/content.opf">
  <package unique-identifier="pub-id" version="3.0">
    <xsl:apply-templates mode="manifest" />
  </package>
</exsl:document>

in LaTeXML-epub3.xsl can create the package document. Then Epub.pm can fill out content.opf with manifest and spine. @brucemiller do you like this approach?
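The second half of that split, filling in the manifest and spine once the stylesheet has written the metadata, could look roughly like this. It is a Python sketch of the idea only, not Epub.pm's actual code; the function name, the item tuple layout and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"


def fill_package(package, items):
    """Append <manifest> and <spine> to an existing OPF <package> element.

    items: iterable of (id, href, media_type, in_spine) tuples.
    """
    manifest = ET.SubElement(package, f"{{{OPF}}}manifest")
    spine = ET.SubElement(package, f"{{{OPF}}}spine")
    for item_id, href, media_type, in_spine in items:
        ET.SubElement(
            manifest,
            f"{{{OPF}}}item",
            {"id": item_id, "href": href, "media-type": media_type},
        )
        if in_spine:
            # Only reading-order documents are referenced from the spine.
            ET.SubElement(spine, f"{{{OPF}}}itemref", {"idref": item_id})
    return package


# Stand-in for the package document the stylesheet would have produced.
package = ET.fromstring(
    f'<package xmlns="{OPF}" unique-identifier="pub-id" version="3.0">'
    "<metadata/></package>"
)
fill_package(
    package,
    [
        ("main", "main.xhtml", "application/xhtml+xml", True),  # hypothetical file
        ("css", "LaTeXML.css", "text/css", False),  # hypothetical file
    ],
)
```

Appending after the existing metadata element keeps the children in the order the package document expects: metadata, manifest, spine.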

@dginev
Collaborator

dginev commented Sep 2, 2022

Note: many readers (e.g. Apple Books) are not able to deal with multiple dc:creator tags and they end up showing the first one only (possibly the first one with aut role, but I haven't tested this theory). So instead of following the spec, the authors should be compressed in a single tag, for instance separated by &.

That is terrible news, @xworld21. The one advantage of the Dublin Core metadata was supposed to be its simple canonical approach to separating the fields. What you are instead describing is similar to the chaos of using a single \author macro to envelop all authors of a document (with an open-ended variety of separators such as \and, \And, \AND, ,, \qquad...).

The epub spec is thankfully quite clear that each dc:creator element holds a single author, as you say.

Could you share a link with more information on the situation with the Apple Books reader (or others) that do not implement this correctly? I could at least try to send them a comment encouraging them to follow the spec closer.

@xworld21
Contributor Author

xworld21 commented Sep 2, 2022

Could you share a link with more information on the situation with the Apple Books reader (or others) that do not implement this correctly? I could at least try to send them a comment encouraging them to follow the spec closer.

It seems like the issue goes back to at least 2013:

Just for fun, I just uploaded a mock EPUB with two primary authors to the Apple Store (using Pages), and the resulting EPUB has only one <dc:creator> element, containing the first author with role aut.

So indeed we should keep separate authors, and let the readers deal with them... unfortunately the spec only says that reading systems "should" display all authors so results won't be consistent.


Development

Successfully merging this pull request may close these issues.

author missing from EPUB
