11 changes: 7 additions & 4 deletions Classes/Common/FullTextReader.php
Expand Up @@ -33,7 +33,7 @@ class FullTextReader

/**
* Constructor
*
*
* @param array $formats
*/
public function __construct(array $formats)
Expand All @@ -44,7 +44,7 @@ public function __construct(array $formats)

/**
* This extracts the OCR full text for a physical structure node / IIIF Manifest / Canvas from an
- * XML full text representation (currently only ALTO). For IIIF manifests, ALTO documents have
+ * XML full text representation. For IIIF manifests, ALTO documents have
* to be given in the Canvas' / Manifest's "seeAlso" property.
*
* @param string $id The "@ID" attribute of the physical structure node (METS) or the "@id" property
Expand Down Expand Up @@ -83,7 +83,7 @@ public function getFromXml(string $id, array $fileLocations, $physicalStructureN
if (!empty($fileContent) && !empty($this->formats[$textFormat])) {
$textMiniOcr = '';
if (!empty($this->formats[$textFormat]['class'])) {
- $textMiniOcr = $this->getRawTextFromClass($fileContent, $textFormat);
+ $textMiniOcr = $this->getRawTextFromClass($id, $fileContent, $textFormat);
}
$fullText = $textMiniOcr;
} else {
Expand All @@ -98,12 +98,14 @@ public function getFromXml(string $id, array $fileLocations, $physicalStructureN
*
* @access private
*
* @param string $id The "@ID" attribute of the physical structure node (METS) or the "@id" property
* of the Manifest / Range (IIIF)
* @param string $fileContent The content of the XML file
* @param string $textFormat
*
* @return string
*/
- private function getRawTextFromClass(string $fileContent, string $textFormat): string
+ private function getRawTextFromClass(string $id, string $fileContent, string $textFormat): string
{
$textMiniOcr = '';
$class = $this->formats[$textFormat]['class'];
Expand All @@ -113,6 +115,7 @@ private function getRawTextFromClass(string $fileContent, string $textFormat): s
if ($obj instanceof FulltextInterface) {
// Load XML from file.
$ocrTextXml = Helper::getXmlFileAsString($fileContent);
$obj->setPageId($id);
$textMiniOcr = $obj->getTextAsMiniOcr($ocrTextXml);
} else {
$this->logger->warning('Invalid class/method "' . $class . '->getRawText()" for text format "' . $textFormat . '"');
Expand Down
13 changes: 11 additions & 2 deletions Classes/Common/FulltextInterface.php
Expand Up @@ -22,8 +22,17 @@
*
* @abstract
*/
- interface FulltextInterface
- {
+ interface FulltextInterface{

Check notice on line 25 in Classes/Common/FulltextInterface.php (Codacy Static Code Analysis): Opening brace of a interface must be on the line after the definition. Comment thread marked as resolved by sebastian-meyer (outdated).

/**
* Set the page identifier.
*
* @access public
*
* @param string $pageId The page identifier of mets:div in the physical struct map of the METS.
*/
public function setPageId(string $pageId): void;

/**
* This extracts raw fulltext data from XML
*
Expand Down
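The new `setPageId()` method implies a call order: the reader sets the page identifier first and only then asks for the text. The following is a minimal, self-contained sketch of that contract, not the extension's code. The interface is a reduced stand-in with only the two methods called in `FullTextReader`, and `SinglePageFormat`, `extractPage()` and the ID `PHYS_0001` are hypothetical names chosen for illustration.

```php
<?php

// Reduced stand-in for the interface in the diff above (assumption: only the
// two methods used by this sketch are declared here).
interface FulltextInterface
{
    public function setPageId(string $pageId): void;

    public function getTextAsMiniOcr(SimpleXMLElement $xml): string;
}

// Hypothetical format in which one file holds exactly one page, so the page
// ID is not needed and the setter does nothing (as in the Alto class).
final class SinglePageFormat implements FulltextInterface
{
    public function setPageId(string $pageId): void
    {
        // Nothing to do here.
    }

    public function getTextAsMiniOcr(SimpleXMLElement $xml): string
    {
        // Use the text content of the root element, escaped for XML output.
        return '<ocr><b>' . htmlspecialchars(trim((string) $xml), ENT_XML1) . '</b></ocr>';
    }
}

// Caller side: set the page ID before extracting the text.
function extractPage(FulltextInterface $format, string $pageId, string $xmlString): string
{
    $format->setPageId($pageId);
    return $format->getTextAsMiniOcr(new SimpleXMLElement($xmlString));
}

echo extractPage(new SinglePageFormat(), 'PHYS_0001', '<page>Lorem ipsum</page>'), PHP_EOL;
```

A format that stores several pages in one file, such as TEI, would keep the identifier in a property and use it to select the page inside `getTextAsMiniOcr()`.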
9 changes: 8 additions & 1 deletion Classes/Format/Alto.php
Expand Up @@ -12,6 +12,8 @@

namespace Kitodo\Dlf\Format;

use Kitodo\Dlf\Common\FulltextInterface;

/**
* Fulltext ALTO format class for the 'dlf' extension
*
Expand All @@ -22,7 +24,7 @@
*
* @access public
*/
- class Alto implements \Kitodo\Dlf\Common\FulltextInterface
+ class Alto implements FulltextInterface
{
/**
* This extracts the fulltext data from ALTO XML
Expand Down Expand Up @@ -159,4 +161,9 @@ private function registerAltoNamespace(\SimpleXMLElement &$xml)
$xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v4#');
}
}

public function setPageId(string $pageId): void
{
// Nothing to do here.
}
}
132 changes: 132 additions & 0 deletions Classes/Format/Tei.php
@@ -0,0 +1,132 @@
<?php

/**
* (c) Kitodo. Key to digital objects e.V. <contact@kitodo.org>
*
* This file is part of the Kitodo and TYPO3 projects.
*
* @license GNU General Public License version 3 or later.
* For the full copyright and license information, please read the
* LICENSE.txt file that was distributed with this source code.
*/

namespace Kitodo\Dlf\Format;

use Kitodo\Dlf\Common\FulltextInterface;
use Psr\Log\LoggerAwareInterface;
use Psr\Log\LoggerAwareTrait;
Comment thread
sebastian-meyer marked this conversation as resolved.

/**
* Fulltext TEI format class for the 'dlf' extension
*
* ** This currently supports TEI documents in the "http://www.tei-c.org/ns/1.0" namespace **
*
* @package TYPO3
* @subpackage dlf
*
* @access public
*/
class Tei implements FulltextInterface, LoggerAwareInterface
{
use LoggerAwareTrait;

private string $pageId;

public function setPageId(string $pageId): void
{
$this->pageId = $pageId;
}

/**
* This extracts the fulltext data from TEI XML
*
* @access public
*
* @param \SimpleXMLElement $xml The XML to extract the raw text from
*
* @return string The raw unformatted fulltext
*/
public function getRawText(\SimpleXMLElement $xml): string
{
if(empty($this->pageId)) {

Check notices on line 51 in Classes/Format/Tei.php (Codacy Static Code Analysis): Expected "if (...) {\n"; found "if(...) {\n". Expected 1 space after IF keyword; 0 found. Comment thread marked as resolved by sebastian-meyer (outdated).
$this->logger->warning('Text could not be retrieved from TEI because the page ID is empty.');
return '';
}

// Register the TEI namespace if the document declares it.
$this->registerTeiNamespace($xml);

// Get the <text> element of the TEI document as an XML string.
$textNodes = $xml->xpath('./TEI:text');
if (empty($textNodes)) {
return '';
}
$contentXml = $textNodes[0]->asXML();

// Remove tags but keep their content
$contentXml = preg_replace('/<\/?(?:body|front|div|head|titlePage)[^>]*>/u', '', $contentXml);

// Remove <lb/> line breaks and collapse runs of whitespace
$contentXml = preg_replace('/<lb(?:\s[^>]*)?\/>/u', '', $contentXml);
$contentXml = preg_replace('/\s+/', ' ', $contentXml);

// Extract content between each <pb /> and the next <pb /> or end of string
$pattern = '/<pb[^>]*facs="([^"]+)"[^>]*\/>([\s\S]*?)(?=<pb[^>]*\/>|$)/u';
$facs = [];

// Use preg_match_all to get all matches at once
if (preg_match_all($pattern, $contentXml, $matches, PREG_SET_ORDER)) {
foreach ($matches as $match) {
$facsMatch = trim($match[1]);
$facsId = str_starts_with($facsMatch, "#") ? substr($facsMatch, 1) : $facsMatch;
$facs[$facsId] = trim(strip_tags($match[2])); // Everything until next <pb /> or end of string
}
}

if(!array_key_exists($this->pageId, $facs)) {

Check notices on line 82 in Classes/Format/Tei.php (Codacy Static Code Analysis): Expected "if (...) {\n"; found "if(...) {\n". Expected 1 space after IF keyword; 0 found. Comment thread marked as resolved by sebastian-meyer (outdated).
$this->logger->debug('The page break attribute "facs" with the page identifier postfix "' . $this->pageId . '" could not be found in the TEI document');
return '';
}

return $facs[$this->pageId];
}

/**
* This extracts the fulltext data from TEI XML and returns it in MiniOCR format
*
* @access public
*
* @param \SimpleXMLElement $xml The XML to extract the raw text from
*
* @return string The unformatted fulltext in MiniOCR format
*/
public function getTextAsMiniOcr(\SimpleXMLElement $xml): string
{
$rawText = $this->getRawText($xml);

if (empty($rawText)) {
return '';
}

$miniOcr = new \SimpleXMLElement("<ocr></ocr>");

Check notice on line 107 in Classes/Format/Tei.php (Codacy Static Code Analysis): Missing class import via use statement (line '106', column '24'). Comment thread marked as resolved by sebastian-meyer (outdated).
$miniOcr->addChild('b', $rawText);
$miniOcrXml = $miniOcr->asXml();
if (\is_string($miniOcrXml)) {
return $miniOcrXml;
}
return '';
}

/**
* This registers the necessary TEI namespace for the current TEI-XML
*
* @access private
*
* @param \SimpleXMLElement $xml The XML to register the namespace for
*/
private function registerTeiNamespace(\SimpleXMLElement $xml)
{
$namespace = $xml->getDocNamespaces();

if (in_array('http://www.tei-c.org/ns/1.0', $namespace, true)) {
$xml->registerXPathNamespace('TEI', 'http://www.tei-c.org/ns/1.0');
}
}

}

Check notice on line 132 in Classes/Format/Tei.php (Codacy Static Code Analysis): The closing brace for the class must go on the next line after the body. Comment thread marked as resolved by sebastian-meyer (outdated).
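`getTextAsMiniOcr()` above wraps the page text in a single `<b>` child of an `<ocr>` root element. The following is a minimal standalone sketch of that wrapping step. The function name `wrapAsMiniOcr()` is hypothetical, and the sketch checks for an empty string with `=== ''` where the class uses `empty()`.

```php
<?php

// Hypothetical helper mirroring the wrapping step of Tei::getTextAsMiniOcr().
function wrapAsMiniOcr(string $rawText): string
{
    // No text for the page means no MiniOCR document.
    if ($rawText === '') {
        return '';
    }

    // Build <ocr><b>...</b></ocr> with SimpleXML.
    $miniOcr = new SimpleXMLElement('<ocr></ocr>');
    $miniOcr->addChild('b', $rawText);

    // asXML() returns false on failure, so fall back to an empty string.
    $xml = $miniOcr->asXML();
    return is_string($xml) ? $xml : '';
}

echo wrapAsMiniOcr('Lorem ipsum dolor sit amet.');
```

The serialized result starts with an XML declaration, so callers that need only the `<ocr>` element would have to strip it themselves.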
74 changes: 73 additions & 1 deletion Documentation/User/Index.rst
Expand Up @@ -16,7 +16,6 @@ User Manual
:local:
:depth: 2


.. _indexing_documents:

Indexing Documents
Expand Down Expand Up @@ -545,3 +544,76 @@ With the command `kitodo:optimize` it is possible to hard commit documents to an
Show each processed documents uid and location with timestamp and
amount of processed/all documents.
:Example:


.. _indexing_fulltexts:

Indexing full texts
===================

Full texts must be provided in the ``FULLTEXT`` file group within the METS. Kitodo.Presentation supports the ALTO and TEI formats for indexing full texts.

**ALTO**

Each ALTO file contains the full text of a single page of the document.

.. code-block:: xml

   <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="..." MIMETYPE="text/xml">
         <mets:FLocat LOCTYPE="URL" xlink:href="https://www.example.com/example-alto-page-1.xml"/>
      </mets:file>
      <mets:file ID="..." MIMETYPE="text/xml">
         <mets:FLocat LOCTYPE="URL" xlink:href="https://www.example.com/example-alto-page-2.xml"/>
      </mets:file>
      ...
   </mets:fileGrp>

**TEI**

A single TEI file contains the full text of the entire document.

.. code-block:: xml

   <mets:fileGrp USE="FULLTEXT">
      <mets:file ID="..." MIMETYPE="application/tei+xml">
         <mets:FLocat LOCTYPE="URL" xlink:href="https://www.example.com/example-tei.xml"/>
      </mets:file>
   </mets:fileGrp>

.. note::

   The ``id`` of each ``graphic`` element inside the ``facsimile`` tag (and thus the ``facs`` reference of the corresponding ``pb`` page break tag, without its leading ``#``) must match the ``ID`` attribute of the ``mets:div`` with type ``page`` in the physical structMap of the METS. Otherwise, the pages cannot be mapped and will not be indexed.


To index full texts, the formats must be defined in the Data Formats or in the table ``tx_dlf_formats`` with the following settings.

.. t3-field-list-table::
   :header-rows: 1

   - :Type:
         Format Name (e.g. in METS)
     :Root:
         Root Element
     :Namespace:
         Namespace URI
     :Class:
         Class Name

   - :Type:
         ALTO
     :Root:
         alto
     :Namespace:
         http://www.loc.gov/standards/alto/ns-v2#
     :Class:
         ``Kitodo\Dlf\Format\Alto``

   - :Type:
         TEI
     :Root:
         TEI
     :Namespace:
         http://www.tei-c.org/ns/1.0
     :Class:
         ``Kitodo\Dlf\Format\Tei``

After configuration, all full texts will be indexed when executing the commands of :ref:`indexing_documents`.
32 changes: 32 additions & 0 deletions Tests/Fixtures/Format/tei.xml
@@ -0,0 +1,32 @@
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
</teiHeader>
<facsimile>
<graphic mimeType="image/jpeg" url="https://www.example.com/00000001.tif.original.jpg" id="f0001"/>
<graphic mimeType="image/jpeg" url="https://www.example.com/00000002.tif.original.jpg" id="f0002"/>
</facsimile>
<text>
<front>
<titlePage id="uuid-82add175-7012-4a6d-bc13-a1a666acb769">
<pb facs="#f0001" n=" - " corresp="https://www.example.com/0001"/>
<p>
<lb/>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.

</p>
</titlePage>
</front>
<body>
<div id="uuid-cf72f6ba-61a0-41b3-ba9b-a6331b7a504b" n="1" rend="Content">
<pb facs="#f0002" n=" - " corresp="https://www.example.com/0002"/>
</div>
<div id="uuid-45e92103-ecd2-46ab-aabd-ddc589a548d2" n="1" rend="Aenean commodo ligula eget dolor">
<head>
<lb/>
Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim.
</head>
</div>
</body>
<back/>
</text>
</TEI>
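
The page-splitting steps in `Tei::getRawText()` can be traced against a shortened version of this fixture. The following is a minimal standalone sketch, not the extension's code: the function name `splitTeiTextByPage()` and the inlined sample are assumptions made for illustration. It applies the same regular expressions as the class, but takes the `<text>` markup as a plain string instead of a `SimpleXMLElement`, and it needs PHP 8.0 or later for `str_starts_with()`.

```php
<?php

/**
 * Split TEI <text> markup into page texts keyed by the <pb/> "facs" value.
 * Hypothetical helper for illustration only.
 *
 * @return array<string, string>
 */
function splitTeiTextByPage(string $contentXml): array
{
    // Remove structural tags but keep their content.
    $contentXml = preg_replace('/<\/?(?:body|front|div|head|titlePage)[^>]*>/u', '', $contentXml);

    // Drop <lb/> line breaks and collapse runs of whitespace.
    $contentXml = preg_replace('/<lb(?:\s[^>]*)?\/>/u', '', $contentXml);
    $contentXml = preg_replace('/\s+/', ' ', $contentXml);

    // Capture everything between one <pb/> and the next <pb/> or the end.
    $pattern = '/<pb[^>]*facs="([^"]+)"[^>]*\/>([\s\S]*?)(?=<pb[^>]*\/>|$)/u';
    $pages = [];
    if (preg_match_all($pattern, $contentXml, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $facs = trim($match[1]);
            // "#f0001" becomes "f0001" so it can be compared with the page ID.
            $id = str_starts_with($facs, '#') ? substr($facs, 1) : $facs;
            $pages[$id] = trim(strip_tags($match[2]));
        }
    }

    return $pages;
}

$sample = <<<'XML'
<text>
  <front>
    <titlePage>
      <pb facs="#f0001"/>
      <p><lb/>Lorem ipsum dolor sit amet.</p>
    </titlePage>
  </front>
  <body>
    <div><pb facs="#f0002"/></div>
    <div><head><lb/>Aenean commodo ligula eget dolor.</head></div>
  </body>
</text>
XML;

$pages = splitTeiTextByPage($sample);
echo $pages['f0001'], PHP_EOL; // Lorem ipsum dolor sit amet.
echo $pages['f0002'], PHP_EOL; // Aenean commodo ligula eget dolor.
```

Because the structural tags are removed before matching, the heading that follows the otherwise empty `div` around the second `pb` is assigned to page `f0002`.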