feat: add 48 benchmark blueprints from automated pipeline by evanhadfield · Pull Request #20 · weval-org/configs

evanhadfield · 2026-02-12T02:26:32Z

Summary

Adds 48 benchmark evaluation Blueprints auto-generated from published AI research papers
Generated using the new benchmark-pipeline tool (arXiv discovery → PDF download → Gemini analysis → Blueprint YAML generation)
All Blueprints pass the official validator
Placed in blueprints/benchmarks/ subdirectory
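
The discovery stage itself is not part of this PR. As a rough sketch of what "arXiv discovery" could look like, the snippet below builds a search URL for the public arXiv API and pulls paper IDs and titles out of the Atom feed it returns. The function names, the search phrase, and the hand-written sample feed are illustrative assumptions, not the benchmark-pipeline tool's actual code.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the feed


def build_query_url(phrase: str, max_results: int = 50) -> str:
    """Build an arXiv API search URL (hypothetical helper)."""
    params = {
        "search_query": f'all:"{phrase}"',
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> list[dict]:
    """Extract the arXiv ID and title from each <entry> of an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="")
        title = entry.findtext(f"{ATOM}title", default="")
        papers.append({
            # The <id> element is an abs URL; the ID is its last path segment
            "arxiv_id": abs_url.rsplit("/", 1)[-1],
            # Titles may be wrapped across lines; collapse the whitespace
            "title": " ".join(title.split()),
        })
    return papers


# Hand-written sample feed for illustration, not a captured API response
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1905.07830v1</id>
    <title>HellaSwag: Can a Machine Really Finish
      Your Sentence?</title>
  </entry>
</feed>"""

print(build_query_url("LLM benchmark", max_results=5))
print(parse_feed(SAMPLE_FEED))
```

Keeping URL construction and feed parsing as separate pure functions means both can be checked without network access; the actual HTTP request would sit between them.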

Benchmarks included

Reasoning & Knowledge:
HellaSwag, HellaSwag-Pro, BBH (BIG-Bench Hard), MMLU-Pro, MMLU-ProX, Turkish MMLU, ConceptMath, OlymMath, MR-GSM8K, STEM-PoM

Safety & Red-teaming:
JailbreakBench, SG-Bench, MART, Code-Switching Red-Teaming, Latent Jailbreak, Red-Teaming GPT-4V, MTSA, SAGE-RT

Bias & Fairness:
AI Gender Bias, HateCheckHIn, HASOC 2020/2021, Generative AI Hate Speech Detection

Code & Tool Use:
PPTC (PowerPoint Task Completion), PPTC-R

Meta-evaluation:
MetaBench, Open LLM Leaderboard, LLM-as-a-Judge, LLM-SRBench, Robustness & Reliability

Other:
BAMBOO, FFT, V-STaR, Video-MMLU, EU20, DSL, AFT Reasoning, JETTS, and more

Source papers

53 papers discovered via the arXiv API, 49 successfully analyzed via Gemini, and 48 Blueprints generated (4 papers had insufficient methodology for extraction).
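
The counts above form a three-stage funnel. A quick way to check the arithmetic is to compute how many papers drop out at each stage. The stage labels here are shorthand chosen for this sketch.

```python
def stage_dropoff(stages: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return how many items were lost on entry to each stage after the first."""
    return [
        (name, prev_count - count)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    ]


# Counts taken from the PR description
funnel = [("discovered", 53), ("analyzed", 49), ("generated", 48)]

for name, lost in stage_dropoff(funnel):
    print(f"{name}: {lost} dropped")
# analyzed: 4 dropped
# generated: 1 dropped
```

The four papers dropped at the analysis stage are consistent with the parenthetical above. The description gives no reason for the one further paper dropped between analysis and Blueprint generation.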

Test plan

All 48 Blueprints pass validate_blueprints.py
Spot-check sample Blueprints against source papers for accuracy
Run sample evaluations on weval.org
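
For the spot-check step, one option is to draw the sample reproducibly so that a second reviewer can re-derive the same file list. This is a minimal sketch under stated assumptions: the helper name is hypothetical, and the `.yml` extension is inferred from the `hellaswag.yml` file name mentioned in this PR.

```python
import random
from pathlib import Path


def pick_spot_check_sample(directory: Path, k: int, seed: int = 0) -> list[Path]:
    """Pick k Blueprint files for manual review, reproducibly for a given seed."""
    # Sort first so the result does not depend on filesystem listing order
    files = sorted(directory.glob("*.yml"))
    if k > len(files):
        raise ValueError(f"asked for {k} files, found only {len(files)}")
    # A dedicated Random instance keeps the global RNG state untouched
    return sorted(random.Random(seed).sample(files, k))
```

A call such as `pick_spot_check_sample(Path("blueprints/benchmarks"), k=5)` would return five paths to compare against their source papers; recording the seed alongside the review notes makes the selection auditable.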

Blueprints auto-generated from published AI benchmark papers using the benchmark-pipeline tool (arXiv discovery → Gemini analysis → Blueprint YAML). Benchmarks include HellaSwag, BBH, MMLU-Pro, JailbreakBench, SG-Bench, OlymMath, MetaBench, and 41 others covering reasoning, safety, multilingual evaluation, and code generation.

evanhadfield added 3 commits February 11, 2026 18:26

fix: replace HellaSwag critique with original benchmark, add critique…

f8cc120

… separately hellaswag.yml now contains the original HellaSwag paper (1905.07830, 10 prompts) instead of the validity critique paper (2504.07825, 4 prompts). The critique is now hellaswag-validity-critique.yml.

add MoreBench moral reasoning blueprint

512a562

evanhadfield merged commit 512a562 into main Feb 12, 2026
1 check passed

evanhadfield merged 3 commits into main from benchmark-pipeline-batch-1
