pdf-text-extraction

Here are 32 public repositories matching this topic...

houking-can / PDFSDK

Python interface based on the Foxit Quick PDF Library

pdf-merge pdf-split pdf-document-processor pdf-sdk pdf-text-extraction

Updated Apr 4, 2020
Python

mamiriqbal1 / rag_book_qa_prompt

A simple demonstration of how you can implement retrieval augmented generation (RAG) for a book.

question-answering rag pdf-text-extraction large-language-models llm chatgpt-web retrieval-augmented-generation

Updated Nov 29, 2023
Jupyter Notebook

rithulkamesh / docproc

Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Apr 8, 2026
Python

hyeonsangjeon / PDF2LLM-Tuning-Studio

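A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models with the LoRA technique.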

Updated Jan 22, 2026
Jupyter Notebook

vijayengineer / PDFTextSpeechConverter

Converts scanned and ordinary documents into MP3 speech using Amazon Polly

pdf text images speech aws-polly audiobook synthesis scanned-documents pdf-text-extraction

Updated Dec 30, 2020
Python

PrathameshDhande22 / PdfTxtBot

A Telegram bot that extracts text from PDFs and also extracts images of PDF pages. Made with Python.

python telegram telegram-bot python3 python-telegram-bot image-extractor python-telegram pdf-text pdf-text-extraction pdf-image

Updated Feb 27, 2023
Python

Zeeshanahmad4 / NLP-Pdf-Minning-Extracting-text-from-pdf

NLP PDF mining: extracting text from PDFs

python pdf pdf-converter text-extraction pdfkit pdf-files extract-text pdftotext pdf-format pdf-document-processor pdftoimage pdftools pdftohtml pdf-text-extraction pdfcon

Updated Apr 2, 2020
Python

bladeacer / pdf-fmt

A PDF extractor, processor, and formatter. Supports regex-based exclusions and other niceties.

python pdf text-formatting pdf-table-extraction pdf-text-extraction pdf-image-extractor

Updated May 5, 2026
Python

kushalpatel0265 / Resume-Parser

A resume parser that extracts key details from PDF files using Groq's LLM

python nlp api google-colab pdf-text-extraction streamlit-webapp llm

Updated Apr 14, 2025
Jupyter Notebook

eli64s / pdflex

CLI for merging PDF contexts.

pdf-converter pdf-document pdf-generator pdf-manipulation pdf-extractor pdf-library pdf-parser pdf-data-extraction pdf-processor pdf-tools pdf-document-processor python-pdf pdf-search pdf-text-extraction pdf-python pdf-automation python-pdf-tools pdf-document-parser pdf-regex

Updated Mar 20, 2025
Python

ZobayerAkib / AI-Invoice-Analyzer

An AI-powered invoice and receipt analyzer that extracts structured invoice data from images (JPG/PNG) and PDF documents using a Large Language Model (LLM).

pdf image fastapi pdf-text-extraction openai-api pymupdf-fitz llm invoice-analysis

Updated Mar 3, 2026
Python

VirajMadhu / pdf_key_matcher

Highlights the key matches between a given PDF and the description text

python open-source pdf cv python-script python3 text-extraction terminal-based ats text-compression pdf-text-extraction virajmadhu

Updated Dec 4, 2024
Python

Fanaperana / spdf

Fast, spatial PDF parsing in Rust — column-aware text extraction, optional OCR, and format conversion. Competitive with LiteParse on real-world documents.

rust cli pdf ocr tesseract text-extraction command-line-tool pdf-parser pdfium pdf-extraction pdf-text-extraction document-parsing spatial-layout layout-preservation

Updated Apr 22, 2026
Rust

holasoymas / text-finder

Console app that finds text in a PDF and reports the page number

csharp console-app pdf-text-extraction pdf-text-processing

Updated Mar 20, 2025
C#

alorbach / pypdf-toolbox-gui

A local, Python-based GUI toolbox for common PDF operations such as merge, split, scan, OCR, and document preprocessing. Fully offline, extensible, and open source.

python pdf cross-platform pdf-converter pdf-manipulation pdf-merge pdf-utilities pdf-tools pdf-splitter pdf-processing pdf-text-extraction

Updated Apr 2, 2026
Python

rmottanet / unchainedtext

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

extractor text-extraction data-extraction text-processing pdf-text-extraction text-extraction-tool

Updated Apr 21, 2026
Python

nsourlos / OCR_and_RAG

Tests of OCR and RAG with LLMs

information-retrieval ocr gemini openai mistral document-processing cohere rag pdf-text-extraction colpali qwen2-vl

Updated Jun 23, 2025
Jupyter Notebook

hafsa-imtiaz / legal-nlp-pipeline-from-scratch

An end-to-end NLP pipeline for legal documents, including OCR-based text extraction, neural language modeling from scratch (NumPy), sentence and document embeddings, extractive and abstractive summarization, grammar refinement, and semantic case similarity retrieval using cosine similarity.

natural-language-processing ocr numpy word-embeddings semantic-similarity cosine-similarity extractive-summarization sentence-embeddings abstractive-summarization document-embeddings grammar-correction legal-nlp pdf-text-extraction nlp-from-scratch

Updated Feb 7, 2026
Jupyter Notebook

anagha-aravind13 / intelligent-document-classification-system

AI-powered document classification system built using FastAPI, scikit-learn, NLP, and Streamlit. The application extracts text from uploaded PDF documents, preprocesses the content using NLP techniques, converts the text into TF-IDF vectors, and predicts the document type (such as invoice, order, or report) using an ML classification model.

machine-learning natural-language-processing logistic-regression document-classification rest-apis text-preprocessing fastapi pdf-text-extraction tf-idf-vectorization file-upload-handling model-training-inference

Updated May 26, 2026
Python

avenuequeenflow / pdf-toolkit-utility

Merge, split, compress, and edit PDFs with this lightweight toolkit

Updated May 31, 2026
C++
