feat(blueprints): create blueprints/users/Varunrnair/sakhi-expert-maternal-health-benchmark.yml #28

Open
Varunrnair wants to merge 1 commit into weval-org:main from Varunrnair:proposal/sakhi-expert-maternal-health-benchmark-1780927875934

Conversation

@Varunrnair

Contributor

Blueprint Contribution

Blueprint Details

  • Blueprint ID: sakhi-expert-maternal-health-benchmark
  • Category/Focus: public-health, maternal-health, healthcare-safety
  • Models to test: CORE

What This Blueprint Tests

  • Evaluates model understanding of maternal health topics using 149 doctor-validated questions, with clinician-reviewed reference answers across English, Hindi, and Marathi (447 evaluation cases total).
  • Tests whether responses align with expert-curated, evidence-based guidance on pregnancy, antenatal care, maternal complications, and related clinical scenarios, representing the expert track of the Sakhi benchmark.
  • Assesses the model's ability to communicate maternal health information accurately, responsibly, and precisely when evaluated against theme-specific clinical rubrics and supporting source citations.
  • Measures consistency on high-stakes maternal health questions where incorrect, incomplete, or ambiguous guidance could contribute to real-world harm, including in rural and semi-urban healthcare contexts.
  • Evaluates multilingual parity by measuring whether response quality remains consistent across English, Hindi, and Marathi rather than concentrating performance in a single language.

Checklist

  • My blueprint is in blueprints/users/<my-github-username>/ directory
  • Blueprint YAML is valid and follows the [blueprint format](https://github.com/weval-org/configs/blob/main/README.md)
  • Each prompt has a meaningful, descriptive id (e.g., france-capital-test, not p1 or auto-generated)
  • Blueprint has clear success criteria (should assertions with specific criteria)
  • I've used $not_* functions instead of should_not blocks where applicable
  • I've tested the blueprint locally if possible (pnpm cli run <path-to-blueprint>)
  • I agree to dedicate my contribution to the public domain under CC0 1.0 Universal

Notes

This blueprint expands CivicEval's public-health coverage by evaluating maternal health questions curated and validated by clinical experts. The benchmark focuses on evidence-based maternal health guidance across English, Hindi, and Marathi, enabling assessment of both clinical accuracy and multilingual consistency on high-impact public-health topics.


Automated Evaluation: This PR will trigger an automated evaluation with cost-controlled limits (max 10 prompts, CORE models only). Full evaluation runs automatically after merge.

  • Validation: GitHub Actions will check YAML syntax and structure
  • 🤖 Evaluation: Webhook will run limited evaluation and post results
  • 📊 Results: View status and full analysis via links in comments

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@weval-bot

weval-bot Bot commented Jun 8, 2026

Blueprint validation failed

  • blueprints/users/Varunrnair/sakhi-expert-maternal-health-benchmark.yml: Failed to fetch blueprint content

@nojibe

nojibe commented Jun 11, 2026

Hi @Varunrnair — thanks for this contribution. Also, great meeting you today!

The automated evaluation failed with `Failed to fetch blueprint content`, and we've tracked down why: the blueprint file is ~1.14 MB, and the GitHub Contents API doesn't return inline content for files over 1 MB. Our fetcher therefore receives empty content and the eval can't run.
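
To illustrate the failure mode described above, here is a minimal Python sketch of how a fetcher might detect that a Contents API response carries no inline content. It is not the actual weval fetcher: the function name `decode_contents_payload` is hypothetical, and the sketch assumes the response is the usual JSON object with `content` and `encoding` fields, where an oversized file comes back with an empty `content` string.

```python
import base64
from typing import Optional


def decode_contents_payload(payload: dict) -> Optional[str]:
    """Decode a Contents API JSON payload into text.

    Returns None when the payload has no inline base64 content,
    which is what a fetcher would see for an oversized file
    (assumption: `content` is empty and `encoding` is not "base64").
    """
    content = payload.get("content") or ""
    if payload.get("encoding") != "base64" or not content:
        return None
    # The API wraps base64 output across lines; b64decode ignores the newlines.
    return base64.b64decode(content).decode("utf-8")
```

A caller that gets `None` back could then report the file size from the payload's `size` field rather than a generic fetch error, or fall back to a raw download.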

Suggested fix: split the blueprint by language, so each file stays well under 1 MB:

  • blueprints/users/Varunrnair/sakhi-expert-maternal-health-en.yml
  • blueprints/users/Varunrnair/sakhi-expert-maternal-health-hi.yml
  • blueprints/users/Varunrnair/sakhi-expert-maternal-health-mr.yml

Each file keeps its own config header (you can adjust title/description/tags per language, e.g. add a language tag). This has a side benefit too: per-language results make the multilingual-parity comparison you describe much easier to read on the dashboard.
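
Before pushing the split files, a quick local size check can confirm that each one is under the limit. The sketch below is only an illustration: the helper name `oversized_files` is hypothetical, and the 1,000,000-byte threshold is a deliberately conservative reading of "1 MB", not a value taken from the weval tooling.

```python
from pathlib import Path

# Assumption: treat 1 MB as 1,000,000 bytes to leave some headroom.
LIMIT_BYTES = 1_000_000


def oversized_files(paths, limit=LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files at or above the limit."""
    results = []
    for p in paths:
        size = Path(p).stat().st_size
        if size >= limit:
            results.append((str(p), size))
    return results


if __name__ == "__main__":
    base = Path("blueprints/users/Varunrnair")
    candidates = [
        base / f"sakhi-expert-maternal-health-{lang}.yml"
        for lang in ("en", "hi", "mr")
    ]
    existing = [p for p in candidates if p.exists()]
    for path, size in oversized_files(existing):
        print(f"{path}: {size} bytes, still too large")
```

An empty result means every file that was found is below the threshold; files that do not exist yet are skipped rather than reported.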

We're also planning a fix on the app side so large blueprints are handled more gracefully in the future, but the split above will unblock this PR right away.

Thanks again! 🙏

2 participants