AI Headline Predictor

A production-ready API that predicts which headlines people will click using LLM-powered synthetic personas



Purpose & Scope

This project started with a practical question: can an LLM predict which headline will win an A/B test? After analyzing 47 real A/B tests from Upworthy's dataset, the answer is yes, with 85.11% accuracy.

The system uses LLMs to simulate multiple people making independent decisions about which headline they would click. By running the same prompt multiple times with controlled randomness (temperature 0.7) and aggregating votes, it creates synthetic personas that collectively predict human behavior.
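
The voting idea can be sketched without any API access. The snippet below is a minimal illustration, not the project's code: `simulate_votes` is a hypothetical stand-in for the LLM that draws each vote from fixed weights, and `majority_winner` aggregates the votes the same way a simple majority would.

```python
import random
from collections import Counter


def simulate_votes(num_variants: int, num_runs: int,
                   weights: list[float], seed: int) -> list[int]:
    """Hypothetical stand-in for the LLM: one 1-based choice per run."""
    rng = random.Random(seed)
    choices = list(range(1, num_variants + 1))
    return [rng.choices(choices, weights=weights, k=1)[0] for _ in range(num_runs)]


def majority_winner(votes: list[int]) -> tuple[int, dict[int, int]]:
    """Count the votes and return (winner, vote distribution)."""
    distribution = dict(Counter(votes))
    winner = max(distribution, key=distribution.get)
    return winner, distribution


# Three independent "personas" vote on three headlines.
votes = simulate_votes(num_variants=3, num_runs=3, weights=[0.1, 0.2, 0.7], seed=42)
winner, distribution = majority_winner(votes)
print(votes, winner, distribution)
```

In the real system the randomness comes from the model's sampling temperature rather than from explicit weights.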

My goal with this project was to train my software engineering skills by focusing on architecture and best practices. The LLM evaluation part is therefore not the focus this time and is included only in reduced form. Expect slightly more design principles than such a small project strictly needs, but also a robust framework for future projects.

Business Use Case

Our fictional client regularly publishes online press articles but struggles with inconsistent click-through rates (CTR: clicks on a headline relative to its impressions). A/B tests are run for important articles, but for the bulk of articles they are not worthwhile: tests with real users are expensive and out of proportion to the expected return. LLMs, on the other hand, have proven good at mimicking human behaviour. The client is asking for a system that allows pre-testing of headline variations before committing real traffic. It is deliberately neither a headline-generation tool nor a replacement for real A/B testing, but a validation tool.

Design Philosophy

Key principles that shaped the architecture:

  1. Stateless Design: Each request is independent, so no conversation history is needed. This makes the service easy to scale, easy to test, and more robust. Requests are still logged. Complements the Fail Fast principle below.
  2. Fail Fast: Strict error handling; any failure aborts the request. This is acceptable because the tool is neither customer-facing nor required to be highly available. It is backed by extensive error catching and meaningful error messages that guide users.
  3. Trust the Chain: Validation happens at entry points, downstream components trust upstream guarantees. This ensures quicker development, good performance and clear responsibilities.
  4. Production Patterns: Proper logging, error handling, request tracking via ULIDs.
  5. Simplicity Over Elegance: Manual parsing beats over-engineered solutions for simple cases.

Quick Start

Prerequisites

  • Python 3.13+
  • OpenAI API Key
  • Docker (for containerized deployment)
  • Azure CLI (for cloud deployment)

Installation

# Clone repository
git clone https://github.com/ksokoll/AI_headline_evaluator.git
cd AI_headline_evaluator

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env and add OPENAI_API_KEY=sk-...

First Run

# Start API server
uvicorn main:app --reload

# Open browser
# - API Docs: http://localhost:8000/docs
# - Health Check: http://localhost:8000/health

Your First Prediction

curl -X POST http://localhost:8000/process \
  -H "Content-Type: application/json" \
  -d '{
  "variants": [
    {"headline": "How to Write Better Headlines"},
    {"headline": "The Secret to Viral Headlines Nobody Tells You"},
    {"headline": "The simple secret to write good viral headlines is ..."}
  ]
}'

Response:

{
  "test_id": "01KF97YQWFPS0YW95SQBAK3AQQ",
  "predicted_winner": 3,
  "vote_distribution": {"3": 3}
}
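
The same request can be sent from Python. This is a small sketch using only the standard library; the helper name `build_process_request` is made up for illustration, and the base URL assumes the local server from "First Run".

```python
import json
import urllib.request


def build_process_request(headlines: list[str],
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST /process request with one variant per headline."""
    payload = {"variants": [{"headline": h} for h in headlines]}
    return urllib.request.Request(
        url=f"{base_url}/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_request([
    "How to Write Better Headlines",
    "The Secret to Viral Headlines Nobody Tells You",
    "The simple secret to write good viral headlines is ...",
])

# With the API server running, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```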

Architecture Overview

API Layer (main.py)

  • CORS Middleware
  • Request Validation (Pydantic)
  • Dependency Injection

Application Layer (pipeline.py & Components)

  • User Request
    • 1. Validator → ID Generation, Deduplication, Variant Count Check
    • 2. Processor → LLM API Calls (num_runs iterations)
    • 3. ResultFormatter → Vote Aggregation, Winner Determination
    • Response (JSON)

Infrastructure Layer

  • OpenAI API (gpt-4o-mini); any other LLM can be substituted
  • ULID Generation
  • Config (Pydantic Settings)

The Data Flow

# 1. Request arrives
POST /process
Body: {
  "variants": [
    {"headline": "Option A"},
    {"headline": "Option B"},
    {"headline": "Option C"}
  ]
}

# 2. API Layer
Validation: Pydantic validates schema
Sanitization: Empty headlines rejected via @field_validator

# 3. Pipeline Orchestration
Validator:
   - Generate ULID if test_id not provided
   - Remove duplicate headlines
   - Check minimum variant count (3)
   Result: ClickabilityTest with unique test_id
Processor:
   - Loop num_runs times (default: 3; more than 3 iterations did not yield a significant accuracy gain)
   - Call OpenAI API for each run
   - Parse response: int(response.content.strip())
   - Validate: 1 <= choice <= len(variants)
   Result: ClickabilityTestResult with winner_list
ResultFormatter:
   - Aggregate votes into distribution
   - Determine winner via max(votes)
   Result: {vote_distribution: {1: 2, 2: 1}, winner: 1}

# 4. Response
JSON with test_id, predicted_winner, vote_distribution

Core Components

1. Config (config.py)

Centralized settings management using Pydantic.

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings skips a missing .env file, so the same class works
    # locally (.env) and on Azure (system environment variables).
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str | None
    model_name: str = "gpt-4o-mini"
    temperature: float = 0.7
    max_tokens: int = 500
    num_runs: int = 3
    min_variants_per_test: int = 3

2. Data Models (models.py)

Type-safe data structures enforcing validation at runtime.

from pydantic import BaseModel, field_validator

class Variant(BaseModel):
    headline: str
    variant_id: int | None = None
    
    @field_validator('headline')
    @classmethod
    def validate_headline(cls, v: str) -> str:
        if not v or not v.strip():
            raise ValueError('Headline cannot be empty')
        return v.strip()

class ClickabilityTest(BaseModel):
    test_id: str | None = None
    variants: list[Variant]

class ClickabilityTestResult(BaseModel):
    test_id: str
    winner_list: list[int]

class PipelineResult(BaseModel):
    test_id: str
    predicted_winner: int
    vote_distribution: dict[int, int]

Why did I use @field_validator instead of manual checks? I was following the Fail Fast principle:

  • Validation happens at request parsing, so empty headlines are rejected before any business logic runs
  • Automatic HTTP 422 responses with field-level errors
  • No defensive programming needed downstream
  • The code stays clean, as the check lives in the "models" section


3. Validator (validator.py)

Purpose: Ensure data quality and assign unique IDs.

Responsibilities:

  1. Generate ULID if test_id not provided
  2. Remove duplicate headlines (preserving order)
  3. Validate minimum variant count
  4. Return validated ClickabilityTest or raise RuntimeError

class Validator:
    def __init__(self):
        self.process_step = "1_validation"
    
    def validate(self, item: ClickabilityTest) -> ClickabilityTest:
        # Generate ID first (enables logging from start)
        if not item.test_id:
            try:
                item.test_id = str(ULID())
            except Exception as e:
                logger.error("ULID generation failed", exc_info=True)
                raise RuntimeError("Failed to generate test_id") from e
        
        # Deduplication
        unique_variants = []
        seen = set()
        for variant in item.variants:
            if variant.headline not in seen:
                unique_variants.append(variant)
                seen.add(variant.headline)
        
        # Minimum check
        if len(unique_variants) < settings.min_variants_per_test:
            logger.warning("Insufficient variants after dedup")
            raise RuntimeError("Insufficient unique variants")
        
        return ClickabilityTest(test_id=item.test_id, variants=unique_variants)

Error Handling:

  • ULID generation: raises RuntimeError (catastrophic failure)
  • Insufficient variants: raises RuntimeError (business rule violation)
  • No None returns - exceptions propagate to FastAPI for HTTP conversion
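
The deduplication and minimum check can also be expressed on plain strings. This is a compact sketch rather than the Validator's actual code: `dict.fromkeys` keeps the first occurrence of each headline in order, and the function names and the default `minimum=3` (mirroring `min_variants_per_test`) are assumptions for illustration.

```python
def dedupe_preserving_order(headlines: list[str]) -> list[str]:
    """Drop repeated headlines while keeping first-seen order."""
    return list(dict.fromkeys(headlines))


def require_minimum(headlines: list[str], minimum: int = 3) -> list[str]:
    """Return the unique headlines or abort, following the fail-fast rule."""
    unique = dedupe_preserving_order(headlines)
    if len(unique) < minimum:
        raise RuntimeError("Insufficient unique variants")
    return unique


print(require_minimum(["A", "B", "A", "C"]))
```

Note that the comparison is exact: headlines differing only in letter case count as distinct, just as in the loop above.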

Logging Strategy:

  • Log at decision points (start, duplicate removal, validation failure, success)
  • Include process_step for hierarchical filtering in Azure Insights
  • Structured extra fields (test_id, variant counts) for queryability
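
To make the `extra` fields queryable they have to reach the log output. One possible way, shown here as an assumption rather than the project's actual logging setup, is a JSON formatter that copies every non-standard record attribute into the output; `JsonFormatter` and the logger name are illustrative.

```python
import json
import logging

# Attribute names every LogRecord carries; anything else arrived via `extra`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, including `extra` fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, sort_keys=True, default=str)


logger = logging.getLogger("validator_sketch")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Validation passed", extra={
    "process_step": "1_validation",
    "test_id": "01KF97YQWFPS0YW95SQBAK3AQQ",
    "variant_count": 3,
})
```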

4. Processor (processor.py)

Purpose: Execute LLM predictions via OpenAI API.

Responsibilities:

  1. Construct user prompt with numbered headlines
  2. Loop num_runs times calling OpenAI
  3. Parse responses (string → int)
  4. Validate choices are in range
  5. Return ClickabilityTestResult with vote list

class Processor:
    def __init__(self):
        self.client = OpenAI(api_key=settings.openai_api_key)
        self.system_prompt = self._load_system_prompt()
        self.process_step = "2_processor"
    
    def process(self, item: ClickabilityTest) -> ClickabilityTestResult:
        user_prompt = "Here are the headline options:\n"
        for idx, variant in enumerate(item.variants, start=1):
            user_prompt += f"{idx}. {variant.headline}\n"
        user_prompt += "Please select the headline number you would click on:"
        
        winners = []
        for i in range(settings.num_runs):
            try:
                response = self.client.chat.completions.create(
                    model=settings.model_name,
                    messages=[
                        {"role": "system", "content": self.system_prompt},
                        {"role": "user", "content": user_prompt}
                    ],
                    temperature=settings.temperature,
                    max_tokens=settings.max_tokens
                )
            except Exception as e:
                logger.error("LLM API call failed", exc_info=True)
                raise RuntimeError("Error in LLM API response") from e
            
            try:
                raw = response.choices[0].message.content.strip()
                choice = int(raw)
                if not (1 <= choice <= len(item.variants)):
                    raise ValueError(f"Choice {choice} out of range")
                winners.append(choice)
                logger.debug("LLM run successful", extra={"run": i, "choice": choice})
            except (ValueError, IndexError, AttributeError) as e:
                logger.error("Failed to parse LLM response")
                raise RuntimeError("Error in parsing LLM response") from e
        
        return ClickabilityTestResult(test_id=item.test_id, winner_list=winners)

System Prompt Design:

You are participating in a headline evaluation study. You will be presented with 
multiple headline variations for the same article content.

Your task: Evaluate each headline as if you were browsing social media or a news 
website. Choose the ONE headline that would most likely make you click to read 
the full article.

Guidelines:
- Consider which headline captures your attention most effectively
- Think about which headline makes you most curious about the article content
- Evaluate based on your immediate, instinctive reaction
- You must select exactly one headline

Response format: Respond with only the number of the headline you would click 
(e.g., "1", "2", "3", etc.). Do not provide explanations or justifications.

Why This Prompt Works:

  • Simulates user context (browsing social media/news)
  • Emphasizes instinctive reaction over analysis
  • Enforces single choice (matches A/B test reality)
  • Simple response format (reduces parsing errors)
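
Prompt construction and response parsing are the two pure steps of the Processor, so they can be tried without an API key. The helpers below are a sketch that mirrors the logic shown in `Processor.process`; the function names `build_user_prompt` and `parse_choice` are hypothetical.

```python
def build_user_prompt(headlines: list[str]) -> str:
    """Number the headlines from 1 and append the selection question."""
    lines = ["Here are the headline options:"]
    lines += [f"{idx}. {headline}" for idx, headline in enumerate(headlines, start=1)]
    lines.append("Please select the headline number you would click on:")
    return "\n".join(lines)


def parse_choice(raw: str, num_variants: int) -> int:
    """Strictly parse the model reply; anything but a bare number fails."""
    choice = int(raw.strip())  # raises ValueError for replies like "Headline 2"
    if not 1 <= choice <= num_variants:
        raise ValueError(f"Choice {choice} out of range")
    return choice


print(build_user_prompt(["Option A", "Option B", "Option C"]))
print(parse_choice(" 2\n", num_variants=3))
```

The strict `int()` parse is what makes the "respond with only the number" instruction matter: a chatty reply fails the run instead of being guessed at.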

5. ResultFormatter (result_formatter.py)

Purpose: Aggregate votes and determine winner.

Responsibilities:

  1. Count votes from winner_list
  2. Create vote_distribution dict
  3. Determine winner via max(votes)
  4. Return ResultFormatterOutput

class ResultFormatter:
    def __init__(self):
        self.process_step = "3_result_formatter"
    
    def format_results(self, item: ClickabilityTestResult) -> ResultFormatterOutput:
        vote_distribution: dict[int, int] = {}
        for winner in item.winner_list:
            vote_distribution[winner] = vote_distribution.get(winner, 0) + 1
        
        overall_winner = max(vote_distribution, key=vote_distribution.get)
        
        logger.info("Results formatted", extra={
            "test_id": item.test_id,
            "winner": overall_winner
        })
        
        return ResultFormatterOutput(
            vote_distribution=vote_distribution,
            winner=overall_winner
        )

Error Handling:

  • None needed - Processor guarantees winner_list is valid
  • If empty list passed (shouldn't happen): max() raises ValueError
  • Trust the chain: upstream components validate, downstream trusts
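
One detail of `max(vote_distribution, key=vote_distribution.get)` is worth knowing: when several headlines share the top count, it returns the one that received its first vote earliest, because dicts keep insertion order. The sketch below reproduces that behaviour and additionally reports whether a tie occurred; the tie flag and the function name are illustrative additions, not part of the ResultFormatter.

```python
from collections import Counter


def winner_with_tie_flag(winner_list: list[int]) -> tuple[int, bool]:
    """Return (winner, tied), where tied means the top count is shared."""
    counts = Counter(winner_list)
    top = max(counts.values())  # raises ValueError on an empty list, like max() above
    leaders = [variant for variant, count in counts.items() if count == top]
    return leaders[0], len(leaders) > 1


print(winner_with_tie_flag([1, 2, 2]))
print(winner_with_tie_flag([2, 1, 3]))
```

With three runs and three or more variants, a 1:1:1 split is possible, so a caller may want to treat a tied result as inconclusive.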

6. Pipeline (pipeline.py)

Purpose: Orchestrate components without business logic.

class Pipeline:
    def __init__(self):
        self.processor = Processor()
        self.result_formatter = ResultFormatter()
        self.validator = Validator()
        self.process_step = "0_pipeline"
    
    def process(self, item: ClickabilityTest) -> PipelineResult:
        # Validation
        validated_item = self.validator.validate(item)
        
        # Processing
        process_result = self.processor.process(validated_item)
        
        # Formatting
        formatted_result = self.result_formatter.format_results(process_result)
        
        logger.info("Pipeline completed", extra={
            "test_id": validated_item.test_id,
            "predicted_winner": formatted_result.winner
        })
        
        return PipelineResult(
            test_id=validated_item.test_id,
            vote_distribution=formatted_result.vote_distribution,
            predicted_winner=formatted_result.winner
        )

Design Pattern: Dumb Orchestrator

  • No try/except (modules handle their errors)
  • No business logic (delegates to components)
  • Logs workflow milestones (start/end)
  • Exceptions propagate to FastAPI

At this point, explicit error handling is deliberately omitted. The individual modules throw meaningful RuntimeErrors as soon as a functional or technical error occurs. The pipeline itself cannot provide any additional context, as it does not know which module or specific step has failed. FastAPI automatically converts these exceptions into HTTP 500 responses. This keeps the separation clean: the business logic is responsible for detecting and signaling errors, while the orchestration remains deliberately lean.


API Endpoints

POST /process

Predict which headline will win based on LLM votes.

Request:

{
  "test_id": "optional-string",
  "variants": [
    {"headline": "First headline option"},
    {"headline": "Second headline option"},
    {"headline": "Third headline option"}
  ]
}

Response (200):

{
  "test_id": "01KF97YQWFPS0YW95SQBAK3AQQ",
  "predicted_winner": 2,
  "vote_distribution": {"2": 3}
}

Error Responses:

422 Unprocessable Entity (Pydantic validation):

{
  "detail": [
    {
      "loc": ["body", "variants", 0, "headline"],
      "msg": "Value error, Headline cannot be empty",
      "type": "value_error"
    }
  ]
}

500 Internal Server Error (Runtime failures):

{
  "detail": "Insufficient unique variants after deduplication"
}

GET /

Health check endpoint.

Response:

{
  "message": "AI Pipeline API",
  "status": "running",
  "version": "1.0.0"
}

GET /health

Detailed health check (same as /).

GET /docs

Interactive API documentation (Swagger UI).


Deployment

Local Docker

# Build image
docker build -t headline-predictor .

# Run container
docker run -d -p 8000:8000 --env-file .env --name headline-api headline-predictor

# Check logs
docker logs headline-api

# Test
curl http://localhost:8000/

Dockerfile Highlights:

FROM python:3.13-slim

WORKDIR /app

# System dependencies
RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code
COPY . .

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Why This Configuration?

  • python:3.13-slim: Matches the dev environment and is smaller than the full python image
  • Layer caching: Dependencies are installed before the code is copied (faster rebuilds)
  • Health check: Azure uses this for container lifecycle management
  • Stateless: No volumes needed (no data persistence)

Azure Deployment

Complete deployment from local machine to public Azure endpoint.

Prerequisites

# Install Azure CLI
# Windows: https://aka.ms/installazurecliwindows
# Mac: brew install azure-cli
# Linux: curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash

# Login
az login

Step 1: Create Resource Group

az group create \
  --name headline-predictor-rg \
  --location westeurope

Step 2: Create Azure Container Registry

az acr create \
  --resource-group headline-predictor-rg \
  --name headlinepredictor \
  --sku Basic \
  --location westeurope

# Enable admin credentials
az acr update --name headlinepredictor --admin-enabled true

# Get credentials
az acr credential show --name headlinepredictor

Note: Registry name must be globally unique, lowercase, no hyphens.

Step 3: Push Image to ACR

# Login to ACR
az acr login --name headlinepredictor

# Tag image for ACR
docker tag headline-predictor headlinepredictor.azurecr.io/headline-predictor:latest

# Push
docker push headlinepredictor.azurecr.io/headline-predictor:latest

Step 4: Create App Service Plan

az appservice plan create \
  --name headline-predictor-plan \
  --resource-group headline-predictor-rg \
  --is-linux \
  --sku B1 \
  --location westeurope

SKU Options:

  • F1 (Free): Limited Docker support, container sleeps after 20 min
  • B1 (Basic): ~€13/month, Always-On, faster startup, recommended for demos
  • S1 (Standard): ~€60/month, auto-scaling, custom domains

Step 5: Create Web App

az webapp create \
  --resource-group headline-predictor-rg \
  --plan headline-predictor-plan \
  --name headline-predictor-ksokoll \
  --deployment-container-image-name headlinepredictor.azurecr.io/headline-predictor:latest

App name must be globally unique - becomes {name}.azurewebsites.net

Step 6: Configure Environment Variables

az webapp config appsettings set \
  --resource-group headline-predictor-rg \
  --name headline-predictor-ksokoll \
  --settings OPENAI_API_KEY="sk-proj-..."

Or via Azure Portal:

  1. Navigate to App Service
  2. Settings → Configuration → Application Settings
  3. New application setting: OPENAI_API_KEY = sk-proj-...
  4. Save

Step 7: Configure Container Settings

Portal → Deployment Center → Settings:

  • Registry: headlinepredictor.azurecr.io
  • Image: headline-predictor
  • Tag: latest
  • Port: 8000

Step 8: Verify Deployment

# Check deployment logs
az webapp log tail \
  --resource-group headline-predictor-rg \
  --name headline-predictor-ksokoll

# Test endpoint
curl https://headline-predictor-ksokoll.azurewebsites.net/

# Test prediction
curl -X POST https://headline-predictor-ksokoll.azurewebsites.net/process \
  -H "Content-Type: application/json" \
  -d '{
  "variants": [
    {"headline": "Test Headline 1"},
    {"headline": "Test Headline 2"},
    {"headline": "Test Headline 3"}
  ]
}'

Test Results

Dataset

Upworthy Research Archive: 6,191 A/B tests with real CTR data.

Data Cleaning:

  • Filtered single-variant tests (not real A/B tests): 6 removed
  • Removed duplicate headline tests (data corruption): Reduced to 47 clean tests

Evaluation Methodology

  1. Run LLM prediction (3 votes per test)
  2. Compare predicted winner to actual winner (highest CTR)
  3. Calculate accuracy: correct predictions / total tests
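
The accuracy metric is a plain ratio. A minimal sketch follows; the function name and the toy prediction lists are made up for illustration, and only the 40/47 figure comes from the results below.

```python
def accuracy(predicted: list[int], actual: list[int]) -> float:
    """Share of tests where the predicted winner equals the actual winner."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("Need two non-empty lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)


# Toy example: 3 of 4 predictions match.
print(accuracy([1, 2, 3, 1], [1, 2, 3, 2]))

# The reported result: 40 correct out of 47 tests.
print(f"{40 / 47:.2%}")  # prints 85.11%
```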

Results

Overall Accuracy: 85.11% (40/47 correct)

CTR Gap Analysis:

  • Correct predictions average CTR gap: 0.3463 percentage points
  • Incorrect predictions average CTR gap: 0.1149 percentage points
  • 3.0x difference in CTR gaps between correct and incorrect predictions
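
The 3.0x figure is the ratio of the two averages. The sketch below recomputes it and shows one plausible definition of a per-test CTR gap (best CTR minus runner-up); that definition and the sample CTR values are assumptions, since the README does not spell out how the gap was calculated.

```python
def ctr_gap(ctrs: list[float]) -> float:
    """Assumed definition: highest CTR minus second-highest CTR in one test."""
    if len(ctrs) < 2:
        raise ValueError("A gap needs at least two variants")
    best, runner_up = sorted(ctrs, reverse=True)[:2]
    return best - runner_up


# Illustrative CTRs for one test with three variants (not from the dataset).
print(ctr_gap([0.012, 0.015, 0.011]))

# Ratio of the reported average gaps for correct vs. incorrect predictions.
ratio = 0.3463 / 0.1149
print(f"{ratio:.1f}x")  # prints 3.0x
```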

Key Finding: All 7 errors occurred at CTR gaps < 0.4%

  • 5 errors at gaps < 0.1% (statistical near-ties)
  • These are tests where humans would also struggle

Vote Distribution Insights:

  • Clear winners (CTR gap > 0.2%): Unanimous LLM votes (3:0)
  • Close tests (CTR gap < 0.1%): Split votes (2:1)
  • LLM internalizes not just which headline wins, but how decisively

Production Test: LinkedIn Headlines

Test Input: 4 LinkedIn post titles about this project

  1. Maximum clickbait: "The Results Will Shock You (And Every Marketer I Know)"
  2. Data-driven: "85% Accuracy: My LLM Predicted Viral Headlines..."
  3. Question hook: "Can AI Predict What Headlines You'll Click?..."
  4. Technical: "LLM-Powered A/B Testing Predictor: Technical Insights..."

Result:

  • Predicted winner: Headline #1 (maximum clickbait)
  • Vote distribution: {1: 3} (100% consensus)
  • Consistency: All 3 runs chose same headline

Interpretation: In this test the LLM consistently picked the strongest engagement trigger, even though the result is uncomfortable (choosing obvious clickbait over a professional tone).


Outlook

Technical Outlook & Future Enhancements

1. Deterministic Error Handling The system currently follows a strict fail-fast approach, ensuring high data integrity by aborting a request as soon as any run fails. A natural next step would be to support partial successes, for example by accepting results when a configurable minimum threshold (e.g. 2 out of 3 runs) succeeds. This would improve availability while still allowing teams to explicitly balance data quality against robustness.
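
A partial-success policy could look roughly like the sketch below. It is not implemented in the project; `collect_votes`, `run_once`, and `min_successes` are hypothetical names, and `run_once` stands in for a single LLM call that either returns a choice or raises RuntimeError.

```python
from typing import Callable


def collect_votes(run_once: Callable[[], int],
                  num_runs: int = 3,
                  min_successes: int = 2) -> list[int]:
    """Run all attempts, tolerate failed runs, abort below the threshold."""
    votes: list[int] = []
    for _ in range(num_runs):
        try:
            votes.append(run_once())
        except RuntimeError:
            continue  # a failed run is skipped instead of aborting the request
    if len(votes) < min_successes:
        raise RuntimeError(f"Only {len(votes)} of {num_runs} runs succeeded")
    return votes


# Demo with a scripted sequence: success, failure, success.
_outcomes = iter([2, RuntimeError("LLM call failed"), 2])


def _scripted_run() -> int:
    outcome = next(_outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


print(collect_votes(_scripted_run))
```

Setting `min_successes` equal to `num_runs` restores the current fail-fast behaviour.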

2. Straightforward Failure Semantics Transient API failures are surfaced immediately, keeping the execution model transparent and easy to reason about. In production-grade environments, this design can be extended with retry mechanisms such as exponential backoff to increase resilience against temporary outages, trading a small amount of complexity for higher reliability.
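
For reference, exponential backoff can be sketched in a few lines. This is an illustration of the general technique, not project code: `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so the demo does not actually wait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry func, doubling the wait after each failure; re-raise on the last."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Demo: fail twice, then succeed; delays are recorded instead of slept.
_state = {"calls": 0}


def _flaky_call() -> str:
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise ConnectionError("temporary outage")
    return "ok"


recorded_delays: list[float] = []
print(call_with_backoff(_flaky_call, sleep=recorded_delays.append), recorded_delays)
```

In practice one would retry only on errors known to be transient (timeouts, rate limits) and let everything else fail immediately.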

3. Stateless-by-Design Architecture Each request is currently processed independently, without relying on prior context or conversation history. This makes the system highly scalable, predictable, and easy to test. An optional evolution would be to introduce stateful extensions, enabling conversation memory or user feedback loops to support continuous learning and adaptive behavior where required.

4. Focused Single-Model Setup The pipeline intentionally relies on a single model (gpt-4o-mini), keeping costs low and behavior consistent. This setup can later be expanded into a multi-model evaluation framework, allowing comparisons across models (e.g. Claude, Gemini, Llama) and helping validate whether observed clickability judgments are model-specific or more generally applicable.

5. Transparent Vote Aggregation Predictions are combined using a simple majority vote, making the decision logic easy to interpret and debug. Future iterations could enrich this mechanism with confidence weighting or uncertainty estimates, enabling more nuanced interpretations while preserving the current approach as a clear and reliable baseline.
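
The simplest confidence signal is the winner's share of the votes. The sketch below derives it from a `vote_distribution` dict as returned by the API; treating the vote share as "confidence" is an assumption for illustration, not a calibrated probability, and the function name is made up.

```python
def winner_confidence(vote_distribution: dict[int, int]) -> tuple[int, float]:
    """Return the winner and the fraction of runs that voted for it."""
    total = sum(vote_distribution.values())
    if total == 0:
        raise ValueError("No votes to aggregate")
    winner = max(vote_distribution, key=vote_distribution.get)
    return winner, vote_distribution[winner] / total


print(winner_confidence({1: 2, 2: 1}))
print(winner_confidence({3: 3}))
```

With only three runs the share can take just a few values, so a finer signal would need more runs.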

6. Custom Exceptions Instead of raising a generic RuntimeError, the components could raise custom exceptions (ULIDGenerationError etc.).
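
Such a hierarchy might look like the sketch below. Only `ULIDGenerationError` is named above; the base class and the other two exception names are hypothetical. Deriving from RuntimeError is a deliberate choice in this sketch so that any existing `except RuntimeError` handling keeps working.

```python
class HeadlinePredictorError(RuntimeError):
    """Hypothetical common base class for all pipeline errors."""


class ULIDGenerationError(HeadlinePredictorError):
    """Raised when no test_id could be generated."""


class InsufficientVariantsError(HeadlinePredictorError):
    """Raised when too few unique headlines remain after deduplication."""


class LLMResponseParseError(HeadlinePredictorError):
    """Raised when the model reply is not a valid headline number."""


try:
    raise InsufficientVariantsError("Insufficient unique variants")
except RuntimeError as error:  # still caught by generic handlers
    print(type(error).__name__, error)
```

Specific exception types would also let the API layer map business-rule violations to a 4xx status instead of a blanket 500.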


Acknowledgments

Built using:

  • FastAPI - Web framework
  • Pydantic - Data validation
  • OpenAI - Language models (gpt-4o-mini)
  • ULID - Sortable unique identifiers
  • Azure - Cloud hosting platform

Dataset: Upworthy Research Archive (A/B test results)


Contact

Kevin Sokoll


Built to demonstrate production API development patterns, not to replace real A/B testing. Have fun :)

