Subscript HTR pipeline: image segmentation, transcription, and searchable PDF conversion

A flexible tool for transcribing images of handwritten manuscripts into searchable PDFs using a combination of Kraken (for layout analysis), LLMs (for handwriting recognition), and ReportLab (for PDF generation).

This tool is designed to be accessible for Digital Humanities researchers while remaining hackable for developers.

Features

Hybrid Pipeline: Uses Kraken's segmentation to find lines of text, then sends a numbered annotation map to a generative AI model for high-accuracy transcription. Note that Kraken works well only with full-page images, not smaller fragments of paper. Support for other segmentation providers may be added in the future.
Batch Processing: Handle single images or glob patterns (e.g., filename??.jpg or *.jpg).
Combined Output: Optionally combine multiple input images into a single PDF output file.
Searchable PDF Output: Generates PDFs in which the image is visible and the text is searchable and selectable (via an invisible text layer).
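
To illustrate how the two glob patterns mentioned under Batch Processing differ, here is a minimal sketch using Python's standard `fnmatch` module. It is not taken from subscript's source; the helper name `select_images` and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

def select_images(names, pattern):
    """Return the names matching a shell-style pattern, sorted for a stable page order."""
    return sorted(name for name in names if fnmatchcase(name, pattern))

# Hypothetical directory listing, used only for illustration.
names = ["page01.jpg", "page02.jpg", "page100.jpg", "notes.txt"]

# "?" matches exactly one character, so "page??.jpg" skips page100.jpg.
two_digit = select_images(names, "page??.jpg")

# "*" matches any run of characters, so "*.jpg" picks up every JPEG.
all_jpg = select_images(names, "*.jpg")

print(two_digit)
print(all_jpg)
```

When passing a pattern on the command line, quoting it (as in the `--combine` example under Examples) keeps the shell from expanding it before the tool sees it.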

Installation

Prerequisites

Python 3.10 or 3.11 (Recommended for Kraken compatibility).
- Note: Newer versions of Python may have compatibility issues with some dependencies.
API Key: You will need an API key for each model provider you intend to use (e.g., Google Gemini, OpenAI, etc.).

Setup

Clone or Download this repository.

Create a Virtual Environment (Recommended):

# macOS/Linux
python3.10 -m venv venv
source venv/bin/activate

Install Dependencies:
```
pip install -r requirements.txt
```
Configure Environment: Create a .env file in the project root using the steps below (or rename the provided example.env to .env). In either case, you will also need to add your API key(s) to the file.
```
# Create .env file
echo "GEMINI_API_KEY=your_api_key_here" > .env
echo "OPENAI_API_KEY=your_api_key_here" >> .env
```
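
The source does not say how subscript reads the .env file, so the following is only a sketch of the file format: plain `KEY=VALUE` lines, with `#` comments ignored. The function name `parse_env` is hypothetical. It can double as a quick check that no placeholder keys were left in place.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Same content the echo commands above would write.
sample = (
    "# Create .env file\n"
    "GEMINI_API_KEY=your_api_key_here\n"
    "OPENAI_API_KEY=your_api_key_here\n"
)

keys = parse_env(sample)

# Flag entries that still hold the placeholder value.
missing = [name for name, value in keys.items() if value == "your_api_key_here"]
print(missing)
```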

Option 1: Install as a Package (Recommended)

This allows you to run subscript from any directory.

# Clone the repository
git clone https://github.com/eluhrs/subscript.git
cd subscript

# Install
pip install .

# Or install directly from GitHub
pip install git+https://github.com/eluhrs/subscript.git

Option 2: Run Locally (Development)

You can run the script directly from the repository root without installing.

./subscript.py ...

Usage

Command Line Interface

If installed, use the subscript command. If running locally, use ./subscript.py.

Note: The tool operates relative to your current working directory. Input files, output directories, and .env files should be in the folder where you run the command.

# Basic usage
subscript [SEGMENTATION-MODEL] [TRANSCRIPTION-MODEL] [INPUT-FILE-OR-GLOB]

SEGMENTATION-MODEL: (Optional) The nickname of the segmentation model to use (defined in config.yml), e.g., historical-manuscript. If omitted, the segmentation model defined as default in config.yml is used.
TRANSCRIPTION-MODEL: (Optional) The nickname of the transcription model to use (defined in config.yml), e.g., gemini-flash. If omitted, the transcription model defined as default in config.yml is used.
INPUT-FILE-OR-GLOB: Path to an image, or a wildcard pattern matching multiple images.
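
Because both model nicknames are optional, the tool has to tell a nickname apart from an input path. How subscript does this is not documented here; one plausible approach, sketched below, is to check each positional argument against the nicknames defined in config.yml and treat anything else as input. The function `resolve_args` and the nickname sets are hypothetical.

```python
def resolve_args(positionals, segmentation_models, transcription_models):
    """Split positional arguments into (segmentation, transcription, inputs)."""
    segmentation = None
    transcription = None
    inputs = []
    for arg in positionals:
        if arg in segmentation_models and segmentation is None:
            segmentation = arg
        elif arg in transcription_models and transcription is None:
            transcription = arg
        else:
            # Anything that is not a known nickname is treated as a file or glob.
            inputs.append(arg)
    return segmentation, transcription, inputs

# Nicknames as they might appear in config.yml (illustrative only).
SEGMENTATION = {"historical-manuscript"}
TRANSCRIPTION = {"gemini-flash", "gemini-pro-3"}

defaults_only = resolve_args(["input/sample.jpg"], SEGMENTATION, TRANSCRIPTION)
both_models = resolve_args(
    ["historical-manuscript", "gemini-flash", "input/sample.jpg"],
    SEGMENTATION,
    TRANSCRIPTION,
)
print(defaults_only)
print(both_models)
```

A value of `None` for either model would then mean "fall back to the default from config.yml".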

Examples

1. Transcribe a single image (using defaults):

./subscript.py input/sample.jpg

2. Transcribe using a specific transcription model:

./subscript.py gemini-flash input/sample.jpg

3. Transcribe using a specific segmentation model:

./subscript.py historical-manuscript input/sample.jpg

4. Transcribe using specific models for both:

./subscript.py historical-manuscript gemini-flash input/sample.jpg

5. Combine multiple images into one book:

./subscript.py "input/*.jpg" --combine my_filename.pdf

Output: output/my_filename.pdf and output/my_filename.txt (all pages)

Options

--help: Show this help message and exit.
--config [path]: Path to an alternate config file (default: ./config.yml).
Transcription Overrides:
- --prompt "Your custom prompt": Override the system prompt for the transcription model.
- --temp 0.5: Override the temperature (creativity) of the model.
Preprocessing Overrides:
- --resize [large|medium|small|false]: Resize the input image before processing to save tokens/cost.
- --contrast [float]: Adjust contrast (1.0 = original, <1.0 lower, >1.0 higher).
- --binarize: Apply Otsu binarization to the image.
- --invert: Invert image colors (useful for negative scans).
PDF Generation:
- --nopdf: Skip PDF generation (produce only TXT/XML output).
- --combine output.pdf: Combine multiple input images into a single PDF.
- --onlypdf: Skip segmentation/transcription and rebuild PDF from existing XML.
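
For readers unfamiliar with the `--binarize` and `--invert` options, the sketch below shows what Otsu binarization and inversion do to 8-bit grayscale pixel values. It is written in pure Python for illustration only; the source does not state which image library subscript uses, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t

def binarize(pixels):
    """Map pixels above the Otsu threshold to white (255), the rest to black (0)."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]

def invert(pixels):
    """Flip dark and light values, as for a negative scan."""
    return [255 - p for p in pixels]

# Toy "image": dark ink (10) on a light page (200).
page = [10] * 50 + [200] * 50
binary = binarize(page)
negative = invert(page)
```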

Configuration (config.yml)

The config.yml file defines the available models, segmentation providers, and their default settings.

# --- Segmentation Analysis ---
segmentation:
  default_segmentation: "historical-manuscript"
  
  # Define available segmentation models (additional models to be added in the future)
  models:
    historical-manuscript:
      provider: "kraken"
      model: "default"

# --- Transcription ---
transcription:
  default_model: "gemini-pro-3"

  # Define available models here
  models:
    gemini-pro-3:
      provider: "gemini"
      model: "gemini-3-pro-preview"
      prompt: "You are a literal transcription engine..."
      cost_config:
        input_token_cost: 2.0
        output_token_cost: 12.0
      API_passthrough: # provide model-specific settings below
        temperature: 0.0
        max_output_tokens: 8192
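
As a rough illustration of how the fields above fit together, the sketch below mirrors the transcription section as a Python dict, resolves a nickname (or the default), and estimates a cost from `cost_config`. Two things here are assumptions rather than documented behavior: the helper names `pick_model` and `estimate_cost`, and the interpretation of the cost values as a price per one million tokens.

```python
# Python mirror of the transcription section of config.yml shown above.
config = {
    "transcription": {
        "default_model": "gemini-pro-3",
        "models": {
            "gemini-pro-3": {
                "provider": "gemini",
                "model": "gemini-3-pro-preview",
                "cost_config": {
                    "input_token_cost": 2.0,
                    "output_token_cost": 12.0,
                },
            }
        },
    }
}

def pick_model(config, nickname=None):
    """Look up a model by nickname, falling back to default_model."""
    section = config["transcription"]
    name = nickname or section["default_model"]
    return section["models"][name]

def estimate_cost(model, input_tokens, output_tokens, per=1_000_000):
    """Estimate cost, ASSUMING cost_config values are prices per `per` tokens."""
    cost = model["cost_config"]
    total = (
        input_tokens * cost["input_token_cost"]
        + output_tokens * cost["output_token_cost"]
    )
    return total / per

model = pick_model(config)
cost = estimate_cost(model, input_tokens=5000, output_tokens=1000)
print(model["model"], round(cost, 4))
```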

License

GNU General Public License v3.0

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Built using:

Kraken for segmentation.
ReportLab for PDF generation.
Google Gemini for transcription.

Inspiration:

htr.
Coded with Google AntiGravity.
