pdf-text-extraction

Here are 32 public repositories matching this topic...

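A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and fine-tunes LLMs efficiently. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains a lightweight model with the LoRA technique.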

  • Updated Jan 22, 2026
  • Jupyter Notebook

An end-to-end NLP pipeline for legal documents, including OCR-based text extraction, neural language modeling from scratch (NumPy), sentence and document embeddings, extractive and abstractive summarization, grammar refinement, and semantic case similarity retrieval using cosine similarity.

  • Updated Feb 7, 2026
  • Jupyter Notebook

AI-powered document classification system built with FastAPI, scikit-learn, NLP, and Streamlit. The application extracts text from uploaded PDF documents, preprocesses the content using NLP techniques, converts the text into TF-IDF vectors, and predicts the document type (such as invoice, order, or report) using an ML classification model.

  • Updated May 26, 2026
  • Python
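
As a rough illustration of the TF-IDF step named in the description above, the sketch below computes textbook TF-IDF weights (term frequency times `log(N / df)`) in pure Python. It is not the repository's code: the project is described as using scikit-learn, whose `TfidfVectorizer` applies a smoothed, normalized variant by default, and the whitespace tokenizer, function names, and sample texts here are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical extracted texts, for illustration only.
sample_texts = [
    "invoice total amount due",
    "purchase order quantity",
    "quarterly report summary total",
]
sample_vectors = tfidf_vectors(sample_texts)
```

In the first vector, "invoice" receives a higher weight than "total", since "total" also appears in another document. Vectors like these would then be passed to a classifier.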
