PDF Text Sanitizer

A specialized tool for selectively rebuilding a PDF's searchable text layer.

This application ingests a PDF, flattens its visible text into vector outlines using Ghostscript, and then re-inserts only the text you select as an invisible, searchable layer, the same approach OCR tools use. This is useful for cleaning up poorly OCR'ed documents, removing sensitive information from the searchable text layer while keeping the layout pixel-perfect, or enabling searchability on selected parts of complex documents. Note that outlined text remains visible on the page; only its searchability and selectability are removed.

Key Features

Pixel-Perfect Outlines: Converts all visible text into vector shapes, ensuring the document looks exactly as intended without relying on installed fonts.
Selective Searchability: Choose exactly which text remains searchable.
Interactive UI: A keyboard-first web interface for reviewing, editing, and classifying text spans.
Regex-Powered Classification: Batch-select text for keeping or deletion using regular expressions.
Non-Destructive Workflow: The original PDF is never modified; a new, sanitized version is generated.
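
To illustrate the regex-powered classification feature, here is a minimal sketch of how a pattern could batch-label spans. The `Span` class, the `classify_spans` function, and the `"unclassified"` default label are hypothetical names chosen for this example; the README does not describe the tool's internal data model.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span: page number, text, and current label."""
    page: int
    text: str
    label: str = "unclassified"


def classify_spans(spans, pattern, label):
    """Set `label` on every span whose text matches `pattern`.

    Returns the number of spans that were relabeled.
    """
    if label not in ("keep", "delete"):
        raise ValueError("label must be 'keep' or 'delete'")
    regex = re.compile(pattern)
    matched = 0
    for span in spans:
        if regex.search(span.text):
            span.label = label
            matched += 1
    return matched


spans = [
    Span(1, "Invoice 2024-001"),
    Span(1, "SSN: 123-45-6789"),
    Span(2, "Total due: $1,200.00"),
]

# Drop anything shaped like a US Social Security number from the text layer.
deleted = classify_spans(spans, r"\b\d{3}-\d{2}-\d{4}\b", "delete")
# Keep invoice headers and totals searchable.
kept = classify_spans(spans, r"^(Invoice|Total)", "keep")
```

Applying the "delete" pattern first and the "keep" pattern second means a later rule can override an earlier one; whether the real UI resolves overlapping rules this way is not stated here.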

How it Works

Ingest: Upload a PDF.
Extract: The tool extracts every text span and its precise coordinates.
Outline: Visible text is flattened to vector outlines via Ghostscript.
Classify: Use regex or the manual UI to mark spans as "keep" or "delete".
Rebuild: Only the "keep" spans are written back into the PDF as invisible text (render mode 3) layered perfectly over the outlines.
Result: A pixel-identical PDF where only your chosen text is searchable and selectable.
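
The Outline and Rebuild steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's actual code: the README does not say which Ghostscript flags are used, so the example relies on `-dNoOutputFonts`, the `pdfwrite` switch Ghostscript documents for converting text to vector outlines. The function names, the font resource name `F1`, and the default font size are hypothetical. The second function emits raw PDF content-stream operators, where `3 Tr` selects text render mode 3 (invisible).

```python
def build_outline_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that rewrites `src` with text as outlines.

    Assumption: -dNoOutputFonts is the mechanism; the tool may use other flags.
    """
    return [
        gs_binary,
        "-o", dst,              # output file; -o also implies batch mode
        "-sDEVICE=pdfwrite",    # write a PDF
        "-dNoOutputFonts",      # emit text as vector shapes, not fonts
        src,
    ]


def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font_name="F1", font_size=12.0):
    """Return content-stream operators that place `text` invisibly at (x, y).

    Assumes `font_name` is a font resource on the page with a single-byte
    encoding; other encodings need a different string representation.
    """
    return (
        "BT\n"                                   # begin text object
        f"/{font_name} {font_size:g} Tf\n"       # select font and size
        "3 Tr\n"                                 # render mode 3: invisible
        f"{x:g} {y:g} Td\n"                      # move to the span origin
        f"({escape_pdf_string(text)}) Tj\n"      # show the text
        "ET\n"                                   # end text object
    )


command = build_outline_command("input.pdf", "outlined.pdf")
ops = invisible_text_ops("Total (net)", 72, 700)
```

The command list could be passed to `subprocess.run(command, check=True)` once Ghostscript is installed. Because the invisible text is neither filled nor stroked, it does not change the rendered page, which is what keeps the output visually identical to the outlined version.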

Setup & Installation

1. Prerequisites (Ghostscript)

The gs command must be available in your system PATH.

macOS (Homebrew): brew install ghostscript
Linux (apt): sudo apt install ghostscript
Windows (winget): winget install ArtifexSoftware.Ghostscript
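
After installing, you can confirm that Ghostscript is reachable from Python with a short check like the one below. The function name `find_ghostscript` is hypothetical. The extra candidate names `gswin64c` and `gswin32c` are the console executables that Windows builds of Ghostscript commonly ship; checking for them is an assumption of this sketch, since this tool is only documented to require `gs`.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the full path of the first Ghostscript executable found on PATH.

    Returns None when none of the candidate names can be located.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


gs_path = find_ghostscript()
if gs_path is None:
    print("Ghostscript not found on PATH; install it before running the app.")
else:
    print(f"Ghostscript found at {gs_path}")
```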

2. Installation

Clone the repository and choose your preferred package manager:

git clone https://github.com/your-username/pdf-text-sanitizer.git
cd pdf-text-sanitizer

Option A: Using pip

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Option B: Using uv (Recommended)

uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Running the Application

Easiest (recommended)

python start.py

Then open http://127.0.0.1:8000.

Equivalent standard entrypoint

python -m app

Optional run flags

Both commands support:

python start.py --host 0.0.0.0 --port 8000 --reload
python -m app --host 0.0.0.0 --port 8000 --reload

To disable reload:

python start.py --no-reload
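
For readers curious how a paired `--reload` / `--no-reload` flag can be offered, here is a minimal sketch using the standard library. It is not taken from `start.py`; the `build_parser` name is hypothetical, the host and port defaults mirror the URL given above, and the assumption that reload is on by default is inferred only from the existence of `--no-reload`. `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Build a parser exposing --host, --port, and --reload/--no-reload."""
    parser = argparse.ArgumentParser(
        description="Run the PDF Text Sanitizer server (illustrative sketch)."
    )
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8000)
    # BooleanOptionalAction generates both --reload and --no-reload.
    parser.add_argument(
        "--reload",
        action=argparse.BooleanOptionalAction,
        default=True,  # assumed default, see the note above
    )
    return parser


args = build_parser().parse_args(["--host", "0.0.0.0", "--no-reload"])
```

Here `args.reload` is `False` because `--no-reload` was passed, while `args.port` falls back to 8000.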

Legacy uvicorn command (still works)

python -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload

After the server is running, drag and drop a PDF into the browser to begin processing.

UI Keyboard Shortcuts

Key	Action
`←` / `→`	Previous / next page
`g`	Jump to page
`Ctrl+A`	Select all spans on current page
`Delete`	Mark selected spans as Delete
`Enter`	Mark selected spans as Keep
`Ctrl+S`	Save and download the sanitized PDF
`Ctrl+Z`	Undo last action
`Double-click`	Inline edit span text (automatically marks as Keep)

License

MIT
