Production-ready RAG (Retrieval-Augmented Generation) API built with FastAPI, ChromaDB, and Ollama. Supports querying documents with semantic search and optional LLM-powered answers.
- Query: POST documents and query them via semantic search
- Embed: Ingest documents at runtime via `/embed`, or use the `embed.py` script for batch ingestion
- Health: Monitor service status and collection count via `/health`
- Mock mode: Run without Ollama using `USE_MOCK_LLM=1` for CI and testing
- Python 3.11+
- (Optional) Ollama with `tinyllama` for production LLM answers
```bash
pip install -r requirements.txt
```

Place `.txt` files in the `docs/` directory and run:

```bash
python embed.py
```

Documents are split into 500-character chunks with a 50-character overlap. A summary of the stored chunks is printed.
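The chunking described above can be illustrated with a small sliding-window splitter. This is a minimal sketch, not the code in `embed.py`: the function name `chunk_text` and the exact handling of the final chunk are assumptions, and only the 500/50 sizes come from this README.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper: embed.py may split differently (for example on
    sentence boundaries). Only the default sizes are taken from the README.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text, so the
        # last chunk is never just a repeat of the previous overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With these defaults, each chunk starts 450 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.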
```bash
uvicorn app:app --host 0.0.0.0 --port 8000
```

Without Ollama (mock mode, returns the retrieved context as the answer):

```bash
USE_MOCK_LLM=1 uvicorn app:app --host 0.0.0.0 --port 8000
```

Query

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Kubernetes?"}'
```

Embed at runtime

```bash
curl -X POST http://localhost:8000/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "Your document content here.", "doc_id": "my_doc"}'
```

Health check

```bash
curl http://localhost:8000/health
```

```bash
docker build -t rag-app .
docker run -p 8000:8000 rag-app
```

The image runs as a non-root user and includes a `HEALTHCHECK` pointing to `/health`.
```bash
# Build and load image (minikube)
eval $(minikube docker-env)
docker build -t rag-app .
# Deploy
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
# Access (NodePort)
kubectl get svc rag-app-service
```

For cloud clusters (GKE, EKS, AKS), switch `service.yaml` to `type: LoadBalancer` to get an external IP.
```bash
pip install -r requirements-dev.txt
USE_MOCK_LLM=1 pytest -v
```

Tests use mock LLM mode and an isolated Chroma database.
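One way such isolation could be arranged is to point `DB_PATH` at a throwaway directory and enable `USE_MOCK_LLM` for the duration of a test. The following is a sketch of that idea, not the project's actual test setup: the context manager `isolated_rag_env` is hypothetical, and it assumes the app reads these variables after they have been set.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_rag_env():
    """Temporarily enable mock mode and use a fresh temp dir as DB_PATH.

    Hypothetical helper; the repository's real fixtures may differ.
    """
    keys = ("USE_MOCK_LLM", "DB_PATH")
    saved = {key: os.environ.get(key) for key in keys}
    with tempfile.TemporaryDirectory() as tmp_dir:
        os.environ["USE_MOCK_LLM"] = "1"
        os.environ["DB_PATH"] = tmp_dir
        try:
            yield tmp_dir
        finally:
            # Restore the previous environment, removing keys that were unset.
            for key, value in saved.items():
                if value is None:
                    os.environ.pop(key, None)
                else:
                    os.environ[key] = value
```

A test would import or construct the app inside the `with` block, so that any Chroma data it writes lands in the temporary directory and is deleted afterwards.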
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `tinyllama` | Ollama model for query answers |
| `N_RESULTS` | `1` | Number of chunks to retrieve |
| `DB_PATH` | `./db` | ChromaDB persistence path |
| `DOCS_DIR` | `./docs` | Directory for `embed.py` input files |
| `LOG_LEVEL` | `INFO` | Logging level |
| `USE_MOCK_LLM` | `0` | `1` = return context only (no Ollama) |
| `MAX_QUERY_LENGTH` | `2000` | Max query length for validation |