A specialized tool for selectively rebuilding a PDF's searchable text layer.
This application allows you to ingest a PDF, flatten its visible text into vector outlines (using Ghostscript), and then selectively re-insert specific text as an invisible, searchable layer (OCR-style). This is ideal for cleaning up poorly OCR'ed documents, removing sensitive information while keeping the layout pixel-perfect, or selectively enabling searchability on complex documents.
- Pixel-Perfect Outlines: Converts all visible text into vector shapes, ensuring the document looks exactly as intended without relying on installed fonts.
- Selective Searchability: Choose exactly which text remains searchable.
- Interactive UI: A keyboard-first web interface for reviewing, editing, and classifying text spans.
- Regex-Powered Classification: Batch-select text for keeping or deletion using regular expressions.
- Non-Destructive Workflow: The original PDF is never modified; a new, sanitized version is generated.
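As a rough illustration of regex-powered classification, the sketch below marks spans as "keep" or "delete" from two patterns. The `Span` class, its field names, the `classify` helper, and the rule that a delete match takes priority are all assumptions made for this example; they are not taken from the project's code.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical text span; the field names are illustrative only."""
    page: int
    text: str
    status: str = "unclassified"  # "keep", "delete", or "unclassified"


def classify(spans, keep_pattern=None, delete_pattern=None):
    """Mark spans in place. Assumption: a delete match wins over a keep match."""
    keep_re = re.compile(keep_pattern) if keep_pattern else None
    delete_re = re.compile(delete_pattern) if delete_pattern else None
    for span in spans:
        if delete_re and delete_re.search(span.text):
            span.status = "delete"
        elif keep_re and keep_re.search(span.text):
            span.status = "keep"
    return spans


spans = [
    Span(1, "Invoice 2024-001"),
    Span(1, "SSN 123-45-6789"),
    Span(2, "Page 2 of 5"),
]
classify(
    spans,
    keep_pattern=r"^Invoice",
    delete_pattern=r"\b\d{3}-\d{2}-\d{4}\b",
)
statuses = [s.status for s in spans]
print(statuses)
```

Spans that match neither pattern stay unclassified, which leaves them for manual review in the UI.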
- Ingest: Upload a PDF.
- Extract: The tool extracts every text span and its precise coordinates.
- Outline: Visible text is flattened to vector outlines via Ghostscript.
- Classify: Use regex or the manual UI to mark spans as "keep" or "delete".
- Rebuild: Only the "keep" spans are written back into the PDF as invisible text (render mode 3) layered perfectly over the outlines.
- Result: A pixel-identical PDF where only your chosen text is searchable and selectable.
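The Rebuild step relies on PDF text render mode 3, which draws glyphs with neither fill nor stroke, so the text is selectable but not visible. The sketch below builds the raw content-stream operators for one span. The helper names are hypothetical, the font resource `F1` is assumed to already exist in the page's resources, and only simple Latin text is handled; how the tool itself writes spans is not shown in this README.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font_size, font_name="F1"):
    """Return content-stream operators that place `text` invisibly.

    `3 Tr` selects text render mode 3 (no fill, no stroke).
    `font_name` is a hypothetical resource name assumed to be defined
    in the page's /Font resource dictionary.
    """
    return (
        "BT\n"                                   # begin text object
        f"/{font_name} {font_size:g} Tf\n"       # select font and size
        "3 Tr\n"                                 # render mode 3: invisible
        f"{x:g} {y:g} Td\n"                      # move to the span origin
        f"({escape_pdf_string(text)}) Tj\n"      # show the text
        "ET"                                     # end text object
    )


ops = invisible_text_ops("Total (USD)", 72, 700.5, 11)
print(ops)
```

Coordinates here are in PDF user space (origin at the bottom-left of the page), so coordinates extracted with a top-left origin would need converting first.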
The `gs` command must be available in your system PATH.
- macOS (Homebrew): `brew install ghostscript`
- Linux (apt): `sudo apt install ghostscript`
- Windows (winget): `winget install ArtifexSoftware.Ghostscript`
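To confirm that Ghostscript is reachable before starting, something like the following can help. It looks up the executable on PATH and prints one plausible outlining command. This is a minimal sketch: the helper names are hypothetical, the Windows executable names are included as a guess at what might be installed there, and `-dNoOutputFonts` is the Ghostscript `pdfwrite` option for converting text to outlines; the exact flags this tool passes are not documented here.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the path of the first Ghostscript executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def outline_command(gs_path, src, dst):
    """Build an argument list that rewrites `src` to `dst` with text as outlines.

    Assumption: the tool may use different or additional flags.
    """
    return [gs_path, "-o", dst, "-sDEVICE=pdfwrite", "-dNoOutputFonts", src]


gs = find_ghostscript()
if gs is None:
    print("Ghostscript not found on PATH")
else:
    print("Would run:", " ".join(outline_command(gs, "input.pdf", "outlined.pdf")))
```

The argument list can be passed to `subprocess.run(..., check=True)` once the input and output paths are real files.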
Clone the repository and choose your preferred package manager:

    git clone https://github.com/your-username/pdf-text-sanitizer.git
    cd pdf-text-sanitizer

Option A: Using pip

    python3 -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
    pip install -r requirements.txt

Option B: Using uv (Recommended)

    uv venv
    source .venv/bin/activate # On Windows: .venv\Scripts\activate
    uv pip install -r requirements.txt

Start the server:

    python start.py

Then open http://127.0.0.1:8000.

Alternatively, start it as a module:

    python -m app

Both commands support:

    python start.py --host 0.0.0.0 --port 8000 --reload
    python -m app --host 0.0.0.0 --port 8000 --reload

To disable reload:

    python start.py --no-reload

You can also run uvicorn directly:

    python -m uvicorn app.main:app --host 127.0.0.1 --port 8000 --reload

After the server is running, drag and drop a PDF into the browser to begin processing.
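For readers curious how launch flags like `--host`, `--port`, `--reload`, and `--no-reload` can be handled, here is a minimal `argparse` sketch. It is not the project's `start.py`; the defaults (host `127.0.0.1`, port `8000`, reload enabled) are assumptions inferred from the commands in this README, and `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Illustrative parser for the launch flags; defaults are assumed."""
    parser = argparse.ArgumentParser(description="Launch the web UI (sketch)")
    parser.add_argument("--host", default="127.0.0.1", help="interface to bind")
    parser.add_argument("--port", type=int, default=8000, help="TCP port")
    # BooleanOptionalAction generates both --reload and --no-reload.
    parser.add_argument(
        "--reload",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="restart the server when source files change",
    )
    return parser


args = build_parser().parse_args(["--host", "0.0.0.0", "--no-reload"])
print(args.host, args.port, args.reload)
```

Binding to `0.0.0.0` exposes the server to other machines on the network, so `127.0.0.1` is the safer choice for local use.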
| Key | Action |
|---|---|
| `←` / `→` | Previous / next page |
| `g` | Jump to page |
| `Ctrl+A` | Select all spans on current page |
| `Delete` | Mark selected spans as Delete |
| `Enter` | Mark selected spans as Keep |
| `Ctrl+S` | Save and download the sanitized PDF |
| `Ctrl+Z` | Undo last action |
| Double-click | Inline edit span text (automatically marks as Keep) |
MIT