Optimize processing of sitemap with Tika #5206
stweil commented Apr 12, 2026
- Don't use temporary file for Tika.
- Replace two calls of Tika app by a single call.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
This is a request for comments. Reducing the number of Tika app calls cuts the crawl time in my test case from 7:25 minutes to 4:59 minutes.
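For illustration only, and not the code in this PR: a minimal sketch of the single-call idea. One invocation of the Tika app with `--jsonRecursive --text` returns metadata and body text together as JSON, and the app can read the URL directly, so no temporary file is needed. The helper names, the jar path, and the choice of `dc:title` as the title key are assumptions.

```php
<?php
// Hypothetical helper names; only the CLI flags and JSON keys come from Tika.

/**
 * Build the command line for a single Tika app call on a URL.
 */
function buildTikaCommand(string $tikaJar, string $url): string
{
    return 'java -jar ' . escapeshellarg($tikaJar)
        . ' --jsonRecursive --text ' . escapeshellarg($url);
}

/**
 * Extract title and full text from --jsonRecursive output.
 * Returns an empty array when the output is not usable JSON.
 */
function parseTikaJson(string $json): array
{
    $docs = json_decode($json, true);
    if (!is_array($docs) || !isset($docs[0]) || !is_array($docs[0])) {
        return [];
    }
    // The first entry describes the container document itself.
    $main = $docs[0];
    $title = $main['dc:title'] ?? '';
    $content = $main['X-TIKA:content'] ?? '';
    return [
        // Multi-valued metadata arrives as a list; take the first value.
        'title' => is_array($title) ? ($title[0] ?? '') : $title,
        'fulltext' => is_string($content) ? trim($content) : '',
    ];
}

// Usage (needs Java and a Tika jar, so left commented out):
// $cmd = buildTikaCommand('/path/to/tika-app.jar', $url);
// $fields = parseTikaJson((string)shell_exec($cmd));
```

Separating command construction from parsing keeps the parsing step testable without a Tika installation.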
     *
     * @return array
     */
    protected static function getTikaFields($htmlFile)
getTikaFields is now used only in the tests and could otherwise be removed, but removing it might break local code.
         break;
     case 'Tika':
-        $fields = static::getTikaFields($htmlFile);
+        $fields = static::getTikaData($url);
Here getTikaFields is no longer called. Do we need it? I think its functionality could be better implemented in getTikaData, which has all the relevant information.
Although this proof of concept already reduces the processing time considerably, a much larger gain can be achieved by using the Tika server instead of the Tika app. With the Tika server, the processing time for my test case drops to 0:23 minutes even without this PR. With the PR, the number of Tika server calls would also be reduced by 50 %.
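As a rough sketch of what talking to the Tika server could look like (again, not the code in this PR): the server exposes a REST interface, by default on port 9998, and a single `PUT` to its `/rmeta/text` endpoint returns metadata and plain text in one JSON response. The base URL, the helper names, and the timeout are assumptions.

```php
<?php
// Hypothetical helpers; base URL and timeout are assumptions for a local server.

/**
 * Describe one PUT request to the recursive-metadata endpoint.
 */
function buildTikaServerRequest(string $baseUrl, string $html): array
{
    return [
        'url' => rtrim($baseUrl, '/') . '/rmeta/text',
        'options' => [
            'http' => [
                'method' => 'PUT',
                'header' => "Content-Type: text/html\r\nAccept: application/json\r\n",
                'content' => $html,
                'timeout' => 30,
            ],
        ],
    ];
}

/**
 * Send the request and return the decoded JSON (empty array on failure).
 */
function fetchTikaServerData(string $baseUrl, string $html): array
{
    $request = buildTikaServerRequest($baseUrl, $html);
    $context = stream_context_create($request['options']);
    $body = @file_get_contents($request['url'], false, $context);
    $decoded = $body === false ? null : json_decode($body, true);
    return is_array($decoded) ? $decoded : [];
}

// Usage (requires a running Tika server):
// $docs = fetchTikaServerData('http://localhost:9998', $html);
```

Because the server process stays running, no Java VM is started per page, which is presumably where most of the time reported above is saved.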
Some thoughts and open questions regarding web crawling with VuFind and Tika:
demiankatz left a comment
@stweil, I have several local custom subclasses of VuFindSitemap, all of which rely on custom extensions of getHtmlFields to parse specific patterns out of the HTML independent of Tika (mostly parsing custom local meta tags). This refactoring completely eliminates the call to getHtmlFields and will break all of my customizations. I'm not sure whether the Tika JSON output offers access to non-standard meta tags; if so, maybe that's a better solution... but if not, then we need to restore the ability to parse the raw HTML. In any case, we either need to figure out a way to restore the lost functionality, or we need to treat this as a breaking change and document a process for migrating custom code. Please let me know if you need me to look into this more deeply on my end -- it's likely going to take me at least a week or two to catch up on other things before I can, but I'm willing once I have the bandwidth!
@demiankatz, do you have an example URL with such meta tags? I could test it with my code. Or you could run the Tika app manually yourself:
@stweil, to answer some of your other questions:
Thought: maybe what we need to do is add a config setting to use Tika JSON mode instead of Tika XML mode... then we can keep the legacy functionality in XML mode for back-compatibility, but deprecate it to encourage people to switch to the more efficient JSON approach. This would allow us to add the new functionality in release 11.1 and give people a transitional release to improve customizations before things break in 12.0. What do you think?
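One way such a setting could behave, as a sketch only: the setting name `tika_output_mode` and the function below are hypothetical and do not exist in VuFind. XML remains the default for back-compatibility but raises a deprecation notice, while JSON is opt-in.

```php
<?php
// Hypothetical setting name and function; neither exists in VuFind.

/**
 * Decide which Tika mode to use, keeping legacy XML as the default.
 */
function resolveTikaMode(array $config): string
{
    $mode = strtolower(trim((string)($config['tika_output_mode'] ?? 'xml')));
    if ($mode === 'json') {
        return 'json';
    }
    if ($mode !== 'xml') {
        throw new InvalidArgumentException("Unsupported Tika mode: $mode");
    }
    // The legacy path still works, but nudges users toward JSON mode.
    trigger_error(
        'Tika XML mode is deprecated; consider switching to JSON mode.',
        E_USER_DEPRECATED
    );
    return 'xml';
}
```

With a default like this, existing installations and custom subclasses would keep working unchanged during the transitional release and only see the notice in their logs.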