Optimize processing of sitemap with Tika #5206

Open
stweil wants to merge 1 commit into vufind-org:dev from stweil:optimize_tika

stweil (Contributor) commented Apr 12, 2026

  • Don't use a temporary file for Tika.
  • Replace two calls of the Tika app with a single call.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil (Contributor, Author) commented Apr 12, 2026

This is a request for comments. Reducing the number of calls to the Tika app cuts the crawling time in my test case from 7:25 minutes to 4:59 minutes.

*
* @return array
*/
protected static function getTikaFields($htmlFile)
getTikaFields is now used only in the tests and could otherwise be removed, but removing it might break local code.

break;
case 'Tika':
- $fields = static::getTikaFields($htmlFile);
+ $fields = static::getTikaData($url);
Here getTikaFields is no longer called. Do we need it? I think its functionality could be better implemented in getTikaData, which has all the relevant information.

stweil (Contributor, Author) commented Apr 12, 2026

Although this proof of concept already reduces the processing time considerably, a much larger gain can be achieved by using the Tika server instead of the Tika app. With the Tika server, the processing time for my test case drops to 0:23 minutes even without this PR. With the PR, the number of Tika server calls would also be reduced by 50%.
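
To make the server idea concrete, here is a minimal sketch of sending already-fetched HTML to a locally running Tika server. It assumes the server's default address http://localhost:9998 and its /rmeta/text endpoint, which returns metadata and plain text together as JSON. The function names are hypothetical and not part of VuFind, and plain PHP streams are used only to keep the sketch self-contained.

```php
<?php

/**
 * Build stream-context options for a PUT request to a Tika server.
 * Hypothetical helper for illustration; not part of VuFind.
 *
 * @param string $html Raw HTML of the crawled page
 *
 * @return array
 */
function buildTikaServerContext(string $html): array
{
    return [
        'http' => [
            'method'  => 'PUT',
            // Tell Tika what we send and that we want JSON back.
            'header'  => "Content-Type: text/html\r\nAccept: application/json\r\n",
            'content' => $html,
            'timeout' => 30,
        ],
    ];
}

/**
 * Send HTML to the Tika server and return the raw JSON response.
 * The base URL default is an assumption (Tika server's default port).
 *
 * @param string $html    Raw HTML of the crawled page
 * @param string $baseUrl Base URL of the Tika server
 *
 * @return string
 */
function fetchTikaServerJson(
    string $html,
    string $baseUrl = 'http://localhost:9998'
): string {
    $context = stream_context_create(buildTikaServerContext($html));
    $url = rtrim($baseUrl, '/') . '/rmeta/text';
    $response = @file_get_contents($url, false, $context);
    if ($response === false) {
        throw new RuntimeException('Tika server request failed: ' . $url);
    }
    return $response;
}
```

Because the server stays resident, each page costs one HTTP round trip instead of a JVM start, which would be consistent with the timing difference reported above.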

stweil (Contributor, Author) commented Apr 12, 2026

Some thoughts and open questions regarding web crawling with VuFind and Tika:

  • Should we combine the two Tika calls for getting text and metadata into a single Tika call (as in this PR)?
  • Should we support the Tika server? It could be enabled by setting General parser = Tika without setting Tika path. Would it be sufficient to use the default local URL of the Tika server? Or do we want a new setting Tika url?
  • Tika supports the "boilerpipe" algorithm, which eliminates content that is typically not wanted in web search (header, footer, navigation, ...). Should we use it, either always or optionally, maybe by default?
  • Should some code changes be provided for the dev branch? Or only for dev-12.0?
  • Is Tika sufficient? Can Aperture support be deprecated for dev and removed for dev-12.0?

demiankatz (Member) left a comment
@stweil, I have several local custom subclasses of VuFindSitemap, all of which rely on custom extensions of getHtmlFields to parse specific patterns out of the HTML independent of Tika (mostly parsing custom local meta tags). This refactoring completely eliminates the call to getHtmlFields and will break all of my customizations. I'm not sure whether the Tika JSON output offers access to non-standard meta tags; if so, maybe that's a better solution... but if not, then we need to restore the ability to parse the raw HTML. In any case, we either need to figure out a way to restore the lost functionality, or we need to treat this as a breaking change and document a process for migrating custom code. Please let me know if you need me to look into this more deeply on my end -- it's likely going to take me at least a week or two to catch up on other things before I can, but I'm willing once I have the bandwidth!

stweil (Contributor, Author) commented Apr 12, 2026

@demiankatz, do you have an example URL with such meta tags? I could test it with my code. Or you can run the Tika app manually yourself:

java -jar YOUR_TIKA_PATH/tika-app.jar --jsonRecursive --text-main -eUTF8 --pretty-print YOUR_URL
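
For reference, a rough sketch of how the single call and its output could be handled in PHP. It assumes that the --jsonRecursive output is a JSON array whose first entry describes the page itself, with the extracted text under the X-TIKA:content key and every other key treated as metadata. Whether custom meta tags show up there under their own names is exactly what needs checking against a real page. Both function names are hypothetical and not the code in this PR.

```php
<?php

/**
 * Build one Tika app command line that returns text and metadata together.
 * Hypothetical helper; the flags are the ones from the command above.
 *
 * @param string $tikaJar Path to tika-app.jar
 * @param string $url     URL of the page to process
 *
 * @return string
 */
function buildTikaCommand(string $tikaJar, string $url): string
{
    return 'java -jar ' . escapeshellarg($tikaJar)
        . ' --jsonRecursive --text-main -eUTF8 '
        . escapeshellarg($url);
}

/**
 * Split Tika's recursive JSON output into body text and metadata.
 *
 * @param string $json Output of the Tika call
 *
 * @return array Array with keys 'text' (string) and 'metadata' (array)
 */
function splitTikaJson(string $json): array
{
    $documents = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    // The first entry is assumed to be the page; later entries, if any,
    // would be embedded documents and are ignored in this sketch.
    $main = is_array($documents) ? ($documents[0] ?? []) : [];
    if (!is_array($main)) {
        $main = [];
    }
    $text = $main['X-TIKA:content'] ?? '';
    unset($main['X-TIKA:content']);
    return [
        'text' => is_string($text) ? trim($text) : '',
        'metadata' => $main,
    ];
}
```

With something like this, a local subclass could read a custom tag from the metadata array instead of parsing the raw HTML, provided Tika actually passes that tag through.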

demiankatz (Member) commented

@stweil, to answer some of your other questions:

  • I think it makes sense to support Tika server, and if we do, we should allow URL configuration for flexibility. Would it make sense to simply allow the Tika path to be a URL, and if so, to use the server approach? It should be pretty easy to differentiate between a command path and a URL.
  • I'm open to adding a config setting for the boilerpipe algorithm; I wouldn't change default behavior without offering an ability to switch back in case of side effects, but it never hurts to add flexibility if it can be done without too much extra complexity.
  • I'm open to deprecating Aperture support if that makes this easier to manage. There have been no releases in 16 years, so I doubt anyone is realistically still using it today. Maybe that would make sense as a separate PR against dev.
  • Regarding an example of custom meta tags, see this course guide, which contains subject, cg_course, and several other custom meta tags that we use in indexing. I tried the command-line tool as you suggested and it looks very promising.
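
The path-versus-URL check from the first point could be as small as the following sketch. The function name is hypothetical, and the rule (an http or https scheme means server mode, anything else is treated as a path to the Tika app) is only one possible convention.

```php
<?php

/**
 * Decide whether a configured Tika path is a Tika server URL.
 * Hypothetical helper for illustration; not part of VuFind.
 *
 * @param string $tikaPath Configured Tika path or URL
 *
 * @return bool True for http(s) URLs, false for anything else
 */
function isTikaServerUrl(string $tikaPath): bool
{
    $scheme = parse_url($tikaPath, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}
```

A file system path has no http or https scheme, so existing configurations would keep using the Tika app unchanged.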

Thought: maybe what we need to do is add a config setting to use Tika JSON mode instead of Tika XML mode... then we can keep the legacy functionality in XML mode for back-compatibility, but deprecate it to encourage people to switch to the more efficient JSON approach. This would allow us to add the new functionality in release 11.1 and give people a transitional release to improve customizations before things break in 12.0. What do you think?
