diff --git a/modules/restapi/src/main/resources/docspell-openapi.yml b/modules/restapi/src/main/resources/docspell-openapi.yml
index dda2bce4f0..b149fae707 100644
--- a/modules/restapi/src/main/resources/docspell-openapi.yml
+++ b/modules/restapi/src/main/resources/docspell-openapi.yml
@@ -7266,8 +7266,10 @@ components:
           format: glob
         language:
           description: |
-            The language used for text extraction and analysis when
-            processing mails.
+            The language for text extraction and analysis when
+            processing mails. Use **ISO 639-3** (3-letter codes) such
+            as `eng`, `deu`, `fra`; ISO 639-1 (2-letter) like `en`,
+            `de` and language names like `english` are also accepted.
           type: string
           format: language
         postHandleAll:
@@ -8242,8 +8244,18 @@ components:
           type: string
           format: language
           description: |
-            The `language` of the document may be specified, otherwise
-            the one from settings is used.
+            The language of the document for processing (OCR, text
+            extraction, analysis). If not specified, the collective's
+            default language is used.
+
+            Use **ISO 639-3** (3-letter codes) such as `eng`, `deu`,
+            `fra`, `spa`. ISO 639-1 (2-letter) codes like `en`, `de`
+            and language names like `english`, `german` are also
+            accepted. Supported languages include German, English,
+            French, Italian, Spanish, Portuguese, Czech, Danish,
+            Finnish, Norwegian, Swedish, Russian, Romanian, Dutch,
+            Latvian, Japanese, Hebrew, Hungarian, Lithuanian, Polish,
+            Estonian, Ukrainian, Khmer, Slovak.
         attachmentsOnly:
           type: boolean
           default: false
@@ -8292,6 +8304,11 @@ components:
         language:
           type: string
           format: language
+          description: |
+            Default document language for the collective. Use **ISO
+            639-3** (3-letter codes) like `eng`, `deu`, `fra`; ISO
+            639-1 (2-letter) like `en`, `de` and language names are
+            also accepted.
         integrationEnabled:
           type: boolean
           description: |
@@ -8401,6 +8418,10 @@ components:
         language:
           type: string
           format: language
+          description: |
+            Document language for uploads via this source. Use **ISO
+            639-3** (3-letter codes) like `eng`, `deu`, `fra`; ISO
+            639-1 (2-letter) and language names are also accepted.
         created:
           description: DateTime
           type: integer
diff --git a/website/site/content/docs/api/upload.md b/website/site/content/docs/api/upload.md
index b02bcca74a..dbebe318e5 100644
--- a/website/site/content/docs/api/upload.md
+++ b/website/site/content/docs/api/upload.md
@@ -32,7 +32,7 @@ For example, here is a curl command uploading two files with meta
 data. Since `multiple` is `false`, both files are added to one item:
 
 ``` bash
-curl -XPOST -F meta='{"multiple":false, "direction": "outgoing", "tags": {"items":["Order"]}}' \
+curl -XPOST -F meta='{"multiple":false, "direction": "outgoing", "language": "eng", "tags": {"items":["Order"]}}' \
      -F file=@letter-en.pdf \
      -F file=@letter-de.pdf \
      http://192.168.1.95:7880/api/v1/open/upload/item/3H7hvJcDJuk-NrAW4zxsdfj-K6TMPyb6BGP-xKptVxUdqWa
@@ -89,9 +89,11 @@ specified via a JSON structure in a part with name `meta`:
   files or `*.html|*.pdf` for selecting html and pdf files. This only
   applies to archive files, like zip or e-mails (where this is applied
   to the attachments).
-- The `language` is used for processing the document(s) contained in
-  the request. If not specified the collective's default language is
-  used.
+- The `language` specifies the document language for processing (OCR,
+  text extraction, analysis). If not specified, the collective's
+  default language is used. Use **ISO 639-3** (3-letter codes) such as
+  `eng`, `deu`, `fra`, `spa`. ISO 639-1 (2-letter) codes like `en`,
+  `de` and language names like `english`, `german` are also accepted.
 - The `attachmentsOnly` property only applies to e-mail files (usually
   `*.eml`). If this is `true`, then the e-mail body is discarded and
   only the attachments are imported. An e-mail without any attachments
diff --git a/website/site/content/docs/webapp/metadata.md b/website/site/content/docs/webapp/metadata.md
index 2fd2929ebf..a36a5c8230 100644
--- a/website/site/content/docs/webapp/metadata.md
+++ b/website/site/content/docs/webapp/metadata.md
@@ -131,11 +131,12 @@ page](@/docs/webapp/customfields.md) for more information.
 
 An important setting is the language of your documents. This helps
 OCR and text analysis. You can select between various languages. The
-language can also specified with each [upload
-request](@/docs/api/upload.md).
+language can also be specified with each [upload
+request](@/docs/api/upload.md) using ISO 639-3 codes (e.g. `eng`,
+`deu`) or ISO 639-1 (e.g. `en`, `de`).
 
 Go to the *Collective Settings* page and click *Document
-Language*. This will set the lanugage for all your documents.
+Language*. This will set the language for all your documents.
 
 The language has effects in several areas: text extraction, fulltext
 search and text analysis. When extracting text from images, tesseract
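
Reviewer note: the `meta` part documented in this change can also be built programmatically rather than typed inline in a curl command. The following is a minimal sketch, not code from this repository. The helper name `build_upload_meta` is hypothetical; the field names (`multiple`, `direction`, `language`, `tags.items`) are taken from the curl example in `upload.md`. When `language` is omitted, the key is left out so that, per the updated docs, the collective's default language applies.

```python
import json


def build_upload_meta(language=None, multiple=False, direction="outgoing", tags=()):
    """Return the JSON string for the `meta` multipart part (hypothetical helper).

    `language` is passed through unchanged; the docs in this diff recommend
    ISO 639-3 codes such as "eng" or "deu".
    """
    meta = {"multiple": multiple, "direction": direction}
    if language is not None:
        # Only send the key when a language was chosen explicitly.
        meta["language"] = language
    if tags:
        meta["tags"] = {"items": list(tags)}
    return json.dumps(meta)


if __name__ == "__main__":
    # Mirrors the meta value used in the curl example.
    print(build_upload_meta(language="eng", tags=["Order"]))
```

The resulting string would be sent as the `meta` form field next to the `file` parts, as in the curl command above.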