A flexible tool for transcribing images of handwritten manuscripts into searchable PDFs using a combination of Kraken (for layout analysis), LLMs (for handwriting recognition), and ReportLab (for PDF generation).
This tool is designed to be accessible for Digital Humanities researchers while remaining hackable for developers.
- Hybrid Pipeline: Uses Kraken's segmentation features to find lines of text, then sends a numbered annotation map to a Generative AI model for high-accuracy transcription. Note that Kraken only works well with full-page images, not smaller fragments of paper. Support for other segmentation providers may be added in the future.
- Batch Processing: Handle single images or glob patterns (e.g., filename??.jpg or *.jpg).
- Combined Output: Optionally combine multiple input images into a single PDF output file.
- Searchable PDF Output: Generates PDFs in which the image is visible and the text is searchable and selectable (invisible text layer).
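To illustrate how the wildcard patterns mentioned under Batch Processing behave, here is a small sketch using Python's standard fnmatch module. The file names and the match helper are hypothetical and exist only for this example; how subscript itself expands patterns is not shown here.

```python
from fnmatch import fnmatch

# Hypothetical file names, used only to illustrate wildcard matching.
names = ["filename01.jpg", "filename1.jpg", "filename02.jpg", "notes.txt", "cover.jpg"]

def match(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return sorted(n for n in candidates if fnmatch(n, pattern))

# "?" matches exactly one character, so two are needed after "filename".
two_char = match("filename??.jpg", names)

# "*" matches any run of characters, including none.
all_jpg = match("*.jpg", names)

print(two_char)
print(all_jpg)
```

When passing a pattern on the command line, quoting it (for example "input/*.jpg") keeps the shell from expanding it before the tool sees it.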
- Python 3.10 or 3.11 (Recommended for Kraken compatibility).
- Note: Newer versions of Python may have compatibility issues with some dependencies.
- API Key: You will need an API key for any model provider you intend to use (e.g., Google Gemini or OpenAI).
- Clone or Download this repository.
- Create a Virtual Environment (Recommended):
# macOS/Linux
python3.10 -m venv venv
source venv/bin/activate
- Install Dependencies:
pip install -r requirements.txt
- Configure Environment:
Create a .env file in the project root using the commands below (or rename the provided example.env to .env). In either case, you will also need to add your API key(s) to the file.
# Create .env file
echo "GEMINI_API_KEY=your_api_key_here" > .env
echo "OPENAI_API_KEY=your_api_key_here" >> .env
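Before a first run, it can help to confirm that the placeholder values were actually replaced. The following is a minimal sketch and not part of subscript itself: parse_env and unset_keys are hypothetical helper names, and the parser only handles simple KEY=VALUE lines.

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env

def unset_keys(env, names, placeholder="your_api_key_here"):
    """Return the names that are missing, empty, or still set to the placeholder."""
    return [n for n in names if env.get(n, "") in ("", placeholder)]

# Example content: one real-looking key, one untouched placeholder.
sample = (
    "# Create .env file\n"
    "GEMINI_API_KEY=abc123\n"
    "OPENAI_API_KEY=your_api_key_here\n"
)
env = parse_env(sample)
print(unset_keys(env, ["GEMINI_API_KEY", "OPENAI_API_KEY"]))
```

To check a real file, read it first, for example with pathlib.Path(".env").read_text(), and pass the result to parse_env.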
Installing the package allows you to run subscript from any directory.
# Clone the repository
git clone https://github.com/eluhrs/subscript.git
cd subscript
# Install
pip install .
# Or install directly from GitHub
pip install git+https://github.com/eluhrs/subscript.git
You can run the script directly from the repository root without installing.
./subscript.py ...
If installed, use the subscript command. If running locally, use ./subscript.py.
Note: The tool operates relative to your current working directory. Input files, output directories, and .env files should be in the folder where you run the command.
# Basic usage
subscript [SEGMENTATION-MODEL] [TRANSCRIPTION-MODEL] [INPUT-FILE-OR-GLOB]
- SEGMENTATION-MODEL: (Optional) The nickname of the segmentation model to use (defined in config.yml), e.g., historical-manuscript. If omitted, the default segmentation model is used.
- TRANSCRIPTION-MODEL: (Optional) The nickname of the transcription model to use (defined in config.yml), e.g., gemini-flash. If omitted, the default transcription model is used.
- INPUT-FILE-OR-GLOB: Path to an image or a wildcard pattern for multiple images.
1. Transcribe a single image (using defaults):
./subscript.py input/sample.jpg
2. Transcribe using a specific transcription model:
./subscript.py gemini-flash input/sample.jpg
3. Transcribe using a specific segmentation model:
./subscript.py historical-manuscript input/sample.jpg
4. Transcribe using specific models for both:
./subscript.py historical-manuscript gemini-flash input/sample.jpg
5. Combine multiple images into one book:
./subscript.py "input/*.jpg" --combine my_filename.pdf
Output: output/my_filename.pdf and output/my_filename.txt (all pages)
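The examples above show that either model nickname can be left out. One way such optional positional arguments could be told apart is to check each leading argument against the nicknames defined in the configuration. This is a sketch under assumptions, not subscript's actual parsing code: the CONFIG dictionary is a hypothetical stand-in for a loaded config.yml, and resolve is an illustrative name.

```python
# Hypothetical stand-in for the nicknames defined in config.yml.
CONFIG = {
    "segmentation": {
        "default_segmentation": "historical-manuscript",
        "models": {"historical-manuscript": {}},
    },
    "transcription": {
        "default_model": "gemini-pro-3",
        "models": {"gemini-pro-3": {}, "gemini-flash": {}},
    },
}

def resolve(positionals, config=CONFIG):
    """Split positional arguments into (segmentation, transcription, inputs)."""
    seg = config["segmentation"]["default_segmentation"]
    model = config["transcription"]["default_model"]
    rest = list(positionals)
    # A leading argument counts as a nickname only if the config defines it.
    if rest and rest[0] in config["segmentation"]["models"]:
        seg = rest.pop(0)
    if rest and rest[0] in config["transcription"]["models"]:
        model = rest.pop(0)
    if not rest:
        raise ValueError("no input file or glob pattern given")
    return seg, model, rest

print(resolve(["input/sample.jpg"]))
print(resolve(["gemini-flash", "input/sample.jpg"]))
print(resolve(["historical-manuscript", "gemini-flash", "input/sample.jpg"]))
```

A limitation of this approach is that an input file whose name is identical to a nickname would be read as a model selection.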
- --help: Show this help message and exit.
- --config [path]: Path to an alternate config file (default: ./config.yml).
- Transcription Overrides:
  - --prompt "Your custom prompt": Override the system prompt for the transcription model.
  - --temp 0.5: Override the temperature (creativity) of the model.
- Preprocessing Overrides:
  - --resize [large|medium|small|false]: Resize the input image before processing to save tokens/cost.
  - --contrast [float]: Adjust contrast (1.0 = original, <1.0 lower, >1.0 higher).
  - --binarize: Apply Otsu binarization to the image.
  - --invert: Invert image colors (useful for negative scans).
- PDF Generation:
  - --nopdf: Skip PDF generation (only TXT/XML).
  - --combine output.pdf: Combine multiple input images into a single PDF.
  - --onlypdf: Skip segmentation/transcription and rebuild PDF from existing XML.
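For readers curious about what the preprocessing options listed above do to pixel values, here is a dependency-free sketch that operates on a flat list of 8-bit grayscale values. It illustrates the general techniques (contrast scaling around the mean, inversion, Otsu thresholding) and is an assumption about their behavior, not the code subscript runs; all function names are illustrative.

```python
def adjust_contrast(pixels, factor):
    """Scale each pixel's distance from the mean gray level by `factor`."""
    mean = sum(pixels) / len(pixels)
    return [max(0, min(255, round(mean + factor * (p - mean)))) for p in pixels]

def invert(pixels):
    """Flip dark and light, as for a negative scan."""
    return [255 - p for p in pixels]

def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w_bg = 0      # count of pixels at or below the candidate threshold
    sum_bg = 0    # sum of those pixels' intensities
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels):
    """Map pixels above the Otsu threshold to white and the rest to black."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]

# A toy "page": half dark ink values, half light paper values.
page = [10] * 50 + [200] * 50
print(otsu_threshold(page))
print(set(binarize(page)))
```

With a contrast factor of 1.0 the values are unchanged, which matches the description of the --contrast option above.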
The config.yml file defines the available models, segmentation providers, and their default settings.
# --- Segmentation Analysis ---
segmentation:
  default_segmentation: "historical-manuscript"
  # Define available segmentation models (additional models to be added in the future)
  models:
    historical-manuscript:
      provider: "kraken"
      model: "default"
# --- Transcription ---
transcription:
  default_model: "gemini-pro-3"
  # Define available models here
  models:
    gemini-pro-3:
      provider: "gemini"
      model: "gemini-3-pro-preview"
      prompt: "You are a literal transcription engine..."
      cost_config:
        input_token_cost: 2.0
        output_token_cost: 12.0
      API_passthrough: # provide model-specific settings below
        temperature: 0.0
        max_output_tokens: 8192
GNU General Public License v3.0
Copyright (c) 2025
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Built using:
- Kraken for segmentation.
- ReportLab for PDF generation.
- Google Gemini for transcription.
Inspiration: