A production-ready API that predicts which headlines people will click using LLM-powered synthetic personas
- Purpose & Scope
- Quick Start
- Architecture Overview
- Core Components
- API Endpoints
- Deployment
- Test Results
- Limitations & Trade-offs
This project started with a practical question: can an LLM predict which headline will win an A/B test? After analyzing 47 real A/B tests from Upworthy's dataset, the answer is yes, with 85.11% accuracy on that sample.
The system uses LLMs to simulate multiple people making independent decisions about which headline they would click. By running the same prompt multiple times with controlled randomness (temperature 0.7) and aggregating votes, it creates synthetic personas that collectively predict human behavior.
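The voting idea can be sketched without calling a real model. In this minimal sketch, `simulate_votes` and its `preferences` weights are hypothetical stand-ins for the LLM and are not part of the project; each draw from the seeded random generator plays the role of one run with controlled randomness.

```python
import random
from collections import Counter


def simulate_votes(preferences: list[float], num_runs: int, seed: int = 0) -> list[int]:
    """Draw one 1-based headline choice per run from a weighted distribution.

    `preferences` is a hypothetical stand-in for the model's sampling
    behaviour; the real system calls an LLM instead.
    """
    rng = random.Random(seed)
    options = list(range(1, len(preferences) + 1))
    return rng.choices(options, weights=preferences, k=num_runs)


def majority_winner(votes: list[int]) -> int:
    """Return the option that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


votes = simulate_votes(preferences=[0.1, 0.2, 0.7], num_runs=3)
print(votes, "->", majority_winner(votes))
```

Fixing the seed keeps the sketch reproducible; the real pipeline has no seed and relies on the model's sampling temperature instead.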
My goal with this project was to practise my software-engineering skills by focusing on architecture and best practices. The LLM-evaluation part is therefore not the focus this time and is only included in reduced form. Expect a slight overkill of applied principles for such a small project, but also a robust framework for future projects.
Our fictional client publishes online press articles regularly but struggles with inconsistent click-through rates (CTR: clicks on a headline relative to its impressions). A/B tests are run for important articles, but for the bulk of the articles they are not worthwhile: tests with real users are expensive and out of proportion to the expected return. LLMs, on the other hand, have shown good capabilities for mimicking human behaviour. The client therefore wants a system that allows pre-testing of headline variations before committing real traffic. It is deliberately neither a headline-generation tool nor a replacement for real A/B testing, but a validation tool.
Key principles that shaped the architecture:
- Stateless Design: Each request is independent, so no conversation history is needed. This makes the system easy to scale, easy to test, and more robust. Logs are still saved. It goes hand in hand with the "Fail Fast" principle:
- Fail Fast: Strict error handling; any failure aborts the request. This is acceptable because the tool is neither customer-facing nor required to be highly available. It is backed by extensive error catching and meaningful error messages that guide users.
- Trust the Chain: Validation happens at entry points, downstream components trust upstream guarantees. This ensures quicker development, good performance and clear responsibilities.
- Production Patterns: Proper logging, error handling, request tracking via ULIDs.
- Simplicity Over Elegance: Manual parsing beats over-engineered solutions for simple cases.
- Python 3.13+
- OpenAI API Key
- Docker (for containerized deployment)
- Azure CLI (for cloud deployment)
# Clone repository
git clone https://github.com/ksokoll/AI_headline_evaluator.git
cd AI_headline_evaluator
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env and add OPENAI_API_KEY=sk-...
# Start API server
uvicorn main:app --reload
# Open browser
# - API Docs: http://localhost:8000/docs
# - Health Check: http://localhost:8000/health
curl -X POST http://localhost:8000/process \
-H "Content-Type: application/json" \
-d '{
"variants": [
{"headline": "How to Write Better Headlines"},
{"headline": "The Secret to Viral Headlines Nobody Tells You"},
{"headline": "The simple secret to write good viral headlines is ..."}
]
}'
Response:
{
"test_id": "01KF97YQWFPS0YW95SQBAK3AQQ",
"predicted_winner": 3,
"vote_distribution": {"3": 3}
}
API Layer (main.py)
- CORS Middleware
- Request Validation (Pydantic)
- Dependency Injection
↓
Application Layer (pipeline.py & Components)
- User Request
- 1. Validator → ID Generation, Deduplication, Variant Count Check
- 2. Processor → LLM API Calls (num_runs iterations)
- 3. ResultFormatter → Vote Aggregation, Winner Determination
- Response (JSON)
↓
Infrastructure Layer
- OpenAI API (gpt-4o-mini); any other LLM can be substituted
- ULID Generation
- Config (Pydantic Settings)
# 1. Request arrives
POST /process
Body: {
"variants": [
{"headline": "Option A"},
{"headline": "Option B"},
{"headline": "Option C"}
]
}
# 2. API Layer
→ Validation: Pydantic validates schema
→ Sanitization: Empty headlines rejected via @field_validator
# 3. Pipeline Orchestration
→ Validator:
- Generate ULID if test_id not provided
- Remove duplicate headlines
- Check minimum variant count (3)
Result: ClickabilityTest with unique test_id
→ Processor:
- Loop num_runs times (default: 3) (note: More than 3 iterations did not prove any significant accuracy gain)
- Call OpenAI API for each run
- Parse response: int(response.content.strip())
- Validate: 1 <= choice <= len(variants)
Result: ClickabilityTestResult with winner_list
→ ResultFormatter:
- Aggregate votes into distribution
- Determine winner via max(votes)
Result: {vote_distribution: {1: 2, 2: 1}, winner: 1}
# 4. Response
Centralized settings management using Pydantic.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    openai_api_key: str | None
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 500
    num_runs: int = 3
    min_variants_per_test: int = 3

    class Config:
        env_file = ".env"
        # "False" so that it works both locally (.env) and on Azure (system variables)
        env_file_required = False
Type-safe data structures enforcing validation at runtime.
from pydantic import BaseModel, field_validator

class Variant(BaseModel):
    headline: str
    variant_id: int | None = None

    @field_validator('headline')
    @classmethod
    def validate_headline(cls, v):
        # Reject empty or whitespace-only headlines at request parsing time
        if not v or not v.strip():
            raise ValueError('Headline cannot be empty')
        return v.strip()

class ClickabilityTest(BaseModel):
    test_id: str | None = None
    variants: list[Variant]

class ClickabilityTestResult(BaseModel):
    test_id: str
    winner_list: list[int]

class PipelineResult(BaseModel):
    test_id: str
    predicted_winner: int
    vote_distribution: dict[int, int]
Why did I use the @field_validator instead of manual checks? I was following the Fail Fast principle:
- Validation happens at request parsing, so empty headlines are rejected before any business logic runs
- Automatic HTTP 422 responses with field-level errors
- No defensive programming is needed downstream
- The code stays clean, because the check lives in the "models" section
Purpose: Ensure data quality and assign unique IDs.
Responsibilities:
- Generate ULID if test_id not provided
- Remove duplicate headlines (preserving order)
- Validate minimum variant count
- Return validated ClickabilityTest or raise RuntimeError
from ulid import ULID

class Validator:
    def __init__(self):
        self.process_step = "1_validation"

    def validate(self, item: ClickabilityTest) -> ClickabilityTest:
        # Generate ID first (enables logging from start)
        if not item.test_id:
            try:
                item.test_id = str(ULID())
            except Exception as e:
                logger.error("ULID generation failed", exc_info=True)
                raise RuntimeError("Failed to generate test_id") from e

        # Deduplication (keeps the first occurrence, preserves order)
        unique_variants = []
        seen = set()
        for variant in item.variants:
            if variant.headline not in seen:
                unique_variants.append(variant)
                seen.add(variant.headline)

        # Minimum check
        if len(unique_variants) < settings.min_variants_per_test:
            logger.warning("Insufficient variants after dedup")
            raise RuntimeError("Insufficient unique variants")

        return ClickabilityTest(test_id=item.test_id, variants=unique_variants)
Error Handling:
- ULID generation: raises RuntimeError (catastrophic failure)
- Insufficient variants: raises RuntimeError (business rule violation)
- No None returns: exceptions propagate to FastAPI for HTTP conversion
Logging Strategy:
- Log at decision points (start, duplicate removal, validation failure, success)
- Include process_step for hierarchical filtering in Azure Insights
- Structured extra fields (test_id, variant counts) for queryability
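The logging strategy above can be illustrated with the standard library alone. This is a minimal sketch under assumptions: `JsonFormatter`, the logger name `headline_demo`, and the `variant_count` key are hypothetical and not taken from the project; only the use of `extra` fields such as `process_step` and `test_id` comes from the text.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including selected `extra` fields."""

    EXTRA_KEYS = ("process_step", "test_id", "variant_count")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Values passed via `extra=` become attributes on the log record.
        for key in self.EXTRA_KEYS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout or a cloud log sink
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("headline_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(
    "Validation successful",
    extra={"process_step": "1_validation", "test_id": "demo-id", "variant_count": 3},
)
print(stream.getvalue().strip())
```

Because every entry carries `process_step`, a log query can filter on one pipeline stage, and `test_id` ties all entries of a request together.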
Purpose: Execute LLM predictions via OpenAI API.
Responsibilities:
- Construct user prompt with numbered headlines
- Loop num_runs times calling OpenAI
- Parse responses (string → int)
- Validate choices are in range
- Return ClickabilityTestResult with vote list
from openai import OpenAI

class Processor:
    def __init__(self):
        self.client = OpenAI(api_key=settings.openai_api_key)
        self.system_prompt = self._load_system_prompt()
        self.process_step = "2_processor"

    def process(self, item: ClickabilityTest) -> ClickabilityTestResult:
        # Build a numbered list of headlines for the user prompt
        user_prompt = "Here are the headline options:\n"
        for idx, variant in enumerate(item.variants, start=1):
            user_prompt += f"{idx}. {variant.headline}\n"
        user_prompt += "Please select the headline number you would click on:"

        winners = []
        for i in range(settings.num_runs):
            try:
                response = self.client.chat.completions.create(
                    model=settings.model_name,
                    messages=[
                        {"role": "system", "content": self.system_prompt},
                        {"role": "user", "content": user_prompt}
                    ],
                    temperature=settings.temperature,
                    max_tokens=settings.max_tokens
                )
            except Exception as e:
                logger.error("LLM API call failed", exc_info=True)
                raise RuntimeError("Error in LLM API response") from e

            try:
                # Expect a bare number such as "2"; anything else aborts the request
                raw = response.choices[0].message.content.strip()
                choice = int(raw)
                if not (1 <= choice <= len(item.variants)):
                    raise ValueError(f"Choice {choice} out of range")
                winners.append(choice)
                logger.debug("LLM run successful", extra={"run": i, "choice": choice})
            except (ValueError, IndexError, AttributeError) as e:
                logger.error("Failed to parse LLM response")
                raise RuntimeError("Error in parsing LLM response") from e

        return ClickabilityTestResult(test_id=item.test_id, winner_list=winners)
System Prompt Design:
You are participating in a headline evaluation study. You will be presented with
multiple headline variations for the same article content.
Your task: Evaluate each headline as if you were browsing social media or a news
website. Choose the ONE headline that would most likely make you click to read
the full article.
Guidelines:
- Consider which headline captures your attention most effectively
- Think about which headline makes you most curious about the article content
- Evaluate based on your immediate, instinctive reaction
- You must select exactly one headline
Response format: Respond with only the number of the headline you would click
(e.g., "1", "2", "3", etc.). Do not provide explanations or justifications.
Why This Prompt Works:
- Simulates user context (browsing social media/news)
- Emphasizes instinctive reaction over analysis
- Enforces single choice (matches A/B test reality)
- Simple response format (reduces parsing errors)
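To show why the bare-number response format keeps parsing simple, here is a minimal standalone sketch of the parse-and-validate step. The helper name `parse_choice` is hypothetical; the project performs the same checks inline in the Processor.

```python
from typing import Optional


def parse_choice(raw: Optional[str], num_variants: int) -> int:
    """Convert a raw model reply into a validated 1-based headline index."""
    if raw is None:
        raise ValueError("Empty model reply")
    choice = int(raw.strip())  # raises ValueError for replies like "Headline 2"
    if not 1 <= choice <= num_variants:
        raise ValueError(f"Choice {choice} out of range 1..{num_variants}")
    return choice


for reply in ["2", " 3\n", "Headline 2", "7"]:
    try:
        print(repr(reply), "->", parse_choice(reply, num_variants=3))
    except ValueError as exc:
        print(repr(reply), "-> rejected:", exc)
```

Any reply that is not a bare number in range is rejected rather than repaired, which matches the fail-fast approach described earlier.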
Purpose: Aggregate votes and determine winner.
Responsibilities:
- Count votes from winner_list
- Create vote_distribution dict
- Determine winner via max(votes)
- Return ResultFormatterOutput
class ResultFormatter:
    def __init__(self):
        self.process_step = "3_result_formatter"

    def format_results(self, item: ClickabilityTestResult) -> ResultFormatterOutput:
        vote_distribution: dict[int, int] = {}
        for winner in item.winner_list:
            vote_distribution[winner] = vote_distribution.get(winner, 0) + 1

        overall_winner = max(vote_distribution, key=vote_distribution.get)

        logger.info("Results formatted", extra={
            "test_id": item.test_id,
            "winner": overall_winner
        })

        return ResultFormatterOutput(
            vote_distribution=vote_distribution,
            winner=overall_winner
        )
Error Handling:
- None needed - Processor guarantees winner_list is valid
- If empty list passed (shouldn't happen): max() raises ValueError
- Trust the chain: upstream components validate, downstream trusts
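The edge cases mentioned above (empty input, ties) can be checked in isolation. This is a minimal sketch: `aggregate` is a hypothetical standalone function that mirrors the counting and `max()` logic of the formatter, not the project's class.

```python
from collections import Counter


def aggregate(winner_list: list[int]) -> tuple[int, dict[int, int]]:
    """Count votes and pick the option with the most votes."""
    distribution = dict(Counter(winner_list))
    # max() raises ValueError on an empty dict and returns the first
    # maximal key it encounters, so ties go to the option seen first.
    winner = max(distribution, key=distribution.get)
    return winner, distribution


print(aggregate([3, 3, 3]))  # unanimous vote
print(aggregate([2, 1, 2]))  # clear majority
print(aggregate([2, 1, 3]))  # three-way tie
```

With three runs and three or more variants, a three-way tie is possible; in this sketch it is resolved in favour of the first vote cast, which may be worth knowing when reading a split vote distribution.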
Purpose: Orchestrate components without business logic.
class Pipeline:
    def __init__(self):
        self.processor = Processor()
        self.result_formatter = ResultFormatter()
        self.validator = Validator()
        self.process_step = "0_pipeline"

    def process(self, item: ClickabilityTest) -> PipelineResult:
        # Validation
        validated_item = self.validator.validate(item)

        # Processing
        process_result = self.processor.process(validated_item)

        # Formatting
        formatted_result = self.result_formatter.format_results(process_result)

        logger.info("Pipeline completed", extra={
            "test_id": validated_item.test_id,
            "predicted_winner": formatted_result.winner
        })

        return PipelineResult(
            test_id=validated_item.test_id,
            vote_distribution=formatted_result.vote_distribution,
            predicted_winner=formatted_result.winner
        )
Design Pattern: Dumb Orchestrator
- No try/except (modules handle their errors)
- No business logic (delegates to components)
- Logs workflow milestones (start/end)
- Exceptions propagate to FastAPI
At this point, explicit error handling is deliberately omitted. The individual modules throw meaningful RuntimeErrors as soon as a functional or technical error occurs. The pipeline itself cannot provide any additional context, as it does not know which module or specific step has failed. FastAPI automatically converts these exceptions into HTTP 500 responses. This keeps the separation clean: the business logic is responsible for detecting and signaling errors, while the orchestration remains deliberately lean.
POST /process: Predict which headline will win based on LLM votes.
Request:
{
"test_id": "optional-string",
"variants": [
{"headline": "First headline option"},
{"headline": "Second headline option"},
{"headline": "Third headline option"}
]
}
Response (200):
{
"test_id": "01KF97YQWFPS0YW95SQBAK3AQQ",
"predicted_winner": 2,
"vote_distribution": {"2": 3}
}
Error Responses:
422 Unprocessable Entity (Pydantic validation):
{
"detail": [
{
"loc": ["body", "variants", 0, "headline"],
"msg": "Value error, Headline cannot be empty",
"type": "value_error"
}
]
}
500 Internal Server Error (Runtime failures):
{
"detail": "Insufficient unique variants after deduplication"
}
GET /: Health check endpoint.
Response:
{
"message": "AI Pipeline API",
"status": "running",
"version": "1.0.0"
}
GET /health: Detailed health check (same as /).
GET /docs: Interactive API documentation (Swagger UI).
# Build image
docker build -t headline-predictor .
# Run container
docker run -d -p 8000:8000 --env-file .env --name headline-api headline-predictor
# Check logs
docker logs headline-api
# Test
curl http://localhost:8000/
Dockerfile Highlights:
FROM python:3.13-slim
WORKDIR /app
# System dependencies
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*
# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Application code
COPY . .
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Why This Configuration?
- python:3.13-slim: Matches dev environment, smaller than the full Python image
- Layer caching: Dependencies before code (faster rebuilds)
- Health check: Azure uses this for container lifecycle management
- Stateless: No volumes needed (no data persistence)
Complete deployment from local machine to public Azure endpoint.
# Install Azure CLI
# Windows: https://aka.ms/installazurecliwindows
# Mac: brew install azure-cli
# Linux: curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
# Login
az login
az group create \
--name headline-predictor-rg \
--location westeurope
az acr create \
--resource-group headline-predictor-rg \
--name headlinepredictor \
--sku Basic \
--location westeurope
# Enable admin credentials
az acr update --name headlinepredictor --admin-enabled true
# Get credentials
az acr credential show --name headlinepredictor
Note: Registry name must be globally unique, lowercase, no hyphens.
# Login to ACR
az acr login --name headlinepredictor
# Tag image for ACR
docker tag headline-predictor headlinepredictor.azurecr.io/headline-predictor:latest
# Push
docker push headlinepredictor.azurecr.io/headline-predictor:latest
az appservice plan create \
--name headline-predictor-plan \
--resource-group headline-predictor-rg \
--is-linux \
--sku B1 \
--location westeurope
SKU Options:
- F1 (Free): Limited Docker support, container sleeps after 20 min
- B1 (Basic): ~€13/month, Always-On, faster startup, recommended for demos
- S1 (Standard): ~€60/month, auto-scaling, custom domains
az webapp create \
--resource-group headline-predictor-rg \
--plan headline-predictor-plan \
--name headline-predictor-ksokoll \
--deployment-container-image-name headlinepredictor.azurecr.io/headline-predictor:latest
App name must be globally unique; it becomes {name}.azurewebsites.net
az webapp config appsettings set \
--resource-group headline-predictor-rg \
--name headline-predictor-ksokoll \
--settings OPENAI_API_KEY="sk-proj-..."
Or via Azure Portal:
- Navigate to App Service
- Settings → Configuration → Application Settings
- New application setting: OPENAI_API_KEY=sk-proj-...
- Save
Portal → Deployment Center → Settings:
- Registry: headlinepredictor.azurecr.io
- Image: headline-predictor
- Tag: latest
- Port: 8000
# Check deployment logs
az webapp log tail \
--resource-group headline-predictor-rg \
--name headline-predictor-ksokoll
# Test endpoint
curl https://headline-predictor-ksokoll.azurewebsites.net/
# Test prediction
curl -X POST https://headline-predictor-ksokoll.azurewebsites.net/process \
-H "Content-Type: application/json" \
-d '{
"variants": [
{"headline": "Test Headline 1"},
{"headline": "Test Headline 2"},
{"headline": "Test Headline 3"}
]
}'
Upworthy Research Archive: 6,191 A/B tests with real CTR data.
Data Cleaning:
- Filtered single-variant tests (not real A/B tests): 6 removed
- Removed duplicate headline tests (data corruption): Reduced to 47 clean tests
- Run LLM prediction (3 votes per test)
- Compare predicted winner to actual winner (highest CTR)
- Calculate accuracy: correct predictions / total tests
Overall Accuracy: 85.11% (40/47 correct)
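The accuracy figure follows directly from the counts. The sketch below recomputes it; the `accuracy` helper and the placeholder label lists are illustrative only and merely reproduce the reported counts (40 of 47 correct), not the real per-test results.

```python
def accuracy(predicted: list[int], actual: list[int]) -> float:
    """Share of tests where the predicted winner equals the actual winner."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("Need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)


# Placeholder labels: 47 tests, of which the first 40 are predicted correctly.
actual = [1] * 47
predicted = [1] * 40 + [2] * 7
print(f"{accuracy(predicted, actual):.2%}")  # 40 / 47 -> 85.11%
```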
CTR Gap Analysis:
- Correct predictions average CTR gap: 0.3463 percentage points
- Incorrect predictions average CTR gap: 0.1149 percentage points
- 3.0x difference in CTR gaps between correct and incorrect predictions
Key Finding: All 7 errors occurred at CTR gaps < 0.4%
- 5 errors at gaps < 0.1% (statistical near-ties)
- These are tests where humans would also struggle
Vote Distribution Insights:
- Clear winners (CTR gap > 0.2%): Unanimous LLM votes (3:0)
- Close tests (CTR gap < 0.1%): Split votes (2:1)
- This suggests the vote split reflects not just which headline wins, but how decisively
Test Input: 4 LinkedIn post titles about this project
- Maximum clickbait: "The Results Will Shock You (And Every Marketer I Know)"
- Data-driven: "85% Accuracy: My LLM Predicted Viral Headlines..."
- Question hook: "Can AI Predict What Headlines You'll Click?..."
- Technical: "LLM-Powered A/B Testing Predictor: Technical Insights..."
Result:
- Predicted winner: Headline #1 (maximum clickbait)
- Vote distribution: {1: 3} (100% consensus)
- Consistency: All 3 runs chose same headline
Interpretation: In this test, the LLM identified the engagement trigger consistently, even though the result is uncomfortable (obvious clickbait chosen over a professional tone).
Technical Outlook & Future Enhancements
1. Deterministic Error Handling: The system currently follows a strict fail-fast approach, ensuring high data integrity by aborting a request as soon as any run fails. A natural next step would be to support partial successes, for example by accepting results when a configurable minimum threshold (e.g. 2 out of 3 runs) succeeds. This would improve availability while still allowing teams to explicitly balance data quality against robustness.
2. Straightforward Failure Semantics: Transient API failures are surfaced immediately, keeping the execution model transparent and easy to reason about. In production-grade environments, this design can be extended with retry mechanisms such as exponential backoff to increase resilience against temporary outages, trading a small amount of complexity for higher reliability.
3. Stateless-by-Design Architecture: Right now, each request is processed independently, without relying on prior context or conversation history. This makes the system highly scalable, predictable, and easy to test. An optional evolution would be to introduce stateful extensions, enabling conversation memory or user feedback loops to support continuous learning and adaptive behavior where required.
4. Focused Single-Model Setup: The pipeline intentionally relies on a single model (gpt-4o-mini), keeping costs low and behavior consistent. This setup can later be expanded into a multi-model evaluation framework, allowing comparisons across models (e.g. Claude, Gemini, Llama) and helping validate whether observed clickability judgments are model-specific or more generally applicable.
5. Transparent Vote Aggregation: Predictions are combined using a simple majority vote, making the decision logic easy to interpret and debug. Future iterations could enrich this mechanism with confidence weighting or uncertainty estimates, enabling more nuanced interpretations while preserving the current approach as a clear and reliable baseline.
6. Custom Exceptions: Instead of raising RuntimeError everywhere, the modules could raise custom exceptions (ULIDGenerationError etc.).
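Several of these enhancements (partial successes, retries with exponential backoff, a simple vote-share confidence, custom exceptions) could fit together roughly as follows. This is a sketch under assumptions, not the project's implementation: every class and function name below is hypothetical, and a real version would catch only transient API errors rather than every `Exception`.

```python
import time
from collections import Counter
from typing import Callable


class HeadlineEvaluatorError(Exception):
    """Hypothetical base class for all pipeline errors."""


class LLMCallError(HeadlineEvaluatorError):
    """Raised when a single run still fails after all retries."""


class QuorumNotReachedError(HeadlineEvaluatorError):
    """Raised when too few runs succeed to trust the vote."""


def call_with_backoff(call: Callable[[], int], retries: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry `call`, doubling the wait after each failed attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception as exc:  # sketch only: narrow this in real code
            if attempt == retries - 1:
                raise LLMCallError("LLM call failed after retries") from exc
            sleep(base_delay * 2 ** attempt)
    raise LLMCallError("retries must be at least 1")


def vote_with_quorum(runs: list[Callable[[], int]], min_successes: int = 2,
                     **backoff) -> tuple[int, float]:
    """Collect votes, tolerate failed runs, return (winner, vote share)."""
    votes = []
    for run in runs:
        try:
            votes.append(call_with_backoff(run, **backoff))
        except LLMCallError:
            continue  # partial success: skip this run instead of aborting
    if len(votes) < min_successes:
        raise QuorumNotReachedError(f"{len(votes)} of {len(runs)} runs succeeded")
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


def always_fails() -> int:
    raise TimeoutError("simulated outage")


winner, share = vote_with_quorum(
    [lambda: 2, always_fails, lambda: 2],
    min_successes=2,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(winner, share)
```

The vote share is only a crude confidence signal with three runs, and tolerating failed runs trades some data quality for availability, which is the balance described in point 1.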
Built using:
- FastAPI - Web framework
- Pydantic - Data validation
- OpenAI - Language models (gpt-4o-mini)
- ULID - Sortable unique identifiers
- Azure - Cloud hosting platform
Dataset: Upworthy Research Archive (A/B test results)
Kevin Sokoll
- GitHub: @ksokoll
- LinkedIn: https://www.linkedin.com/in/kevin-sokoll-51a492179/
Built to demonstrate production API development patterns, not to replace real A/B testing. Have fun :)