From f0ad3b7218abb88f84b882da7fcd016e76775994 Mon Sep 17 00:00:00 2001 From: Ramakrishna Prabhu Date: Thu, 4 Jun 2026 13:31:52 -0500 Subject: [PATCH 1/2] ci: nudge multi-objective skill to trigger eval Co-Authored-By: Claude Sonnet 4.6 --- skills/cuopt-multi-objective-exploration/SKILL.md | 1 + 1 file changed, 1 insertion(+) diff --git a/skills/cuopt-multi-objective-exploration/SKILL.md b/skills/cuopt-multi-objective-exploration/SKILL.md index cd22ab33d..cfb8da0b6 100644 --- a/skills/cuopt-multi-objective-exploration/SKILL.md +++ b/skills/cuopt-multi-objective-exploration/SKILL.md @@ -17,6 +17,7 @@ metadata: # Multi-Objective Exploration + cuOpt optimizes **one** objective per solve. Many real problems have several objectives that pull against each other — cost vs. service level, return vs. risk, makespan vs. overtime, distance vs. vehicle count. A single solve answers "what's optimal *for one particular weighting*," but it hides the tradeoff the user actually needs to see. This skill turns a sequence of single-objective cuOpt solves into a **Pareto frontier** — the set of solutions where you can't improve one objective without giving up another — and gives the discipline to read it. It adds no solver features; it orchestrates the LP / MILP / QP solves already covered by the formulation and API skills. From d6eccab97f1b383ea1bf76c9ad3724c9fe632589 Mon Sep 17 00:00:00 2001 From: nvskills-svc-account Date: Thu, 4 Jun 2026 18:50:45 +0000 Subject: [PATCH 2/2] Attach NVSkills validation signatures Signed-off-by: nvskills-svc-account --- .../BENCHMARK.md | 88 +++++++++++++++++++ .../skill-card.md | 81 +++++++++++++++++ .../skill.oms.sig | 1 + 3 files changed, 170 insertions(+) create mode 100644 skills/cuopt-multi-objective-exploration/BENCHMARK.md create mode 100644 skills/cuopt-multi-objective-exploration/skill-card.md create mode 100644 skills/cuopt-multi-objective-exploration/skill.oms.sig diff --git a/skills/cuopt-multi-objective-exploration/BENCHMARK.md b/skills/cuopt-multi-objective-exploration/BENCHMARK.md new file mode 100644 index 000000000..088ba3a04 --- /dev/null +++ b/skills/cuopt-multi-objective-exploration/BENCHMARK.md @@ -0,0 +1,88 @@ +# Evaluation Report + +Evaluation of the `cuopt-multi-objective-exploration` skill before publication through NVSkills-Eval. + +This benchmark summarizes 3-Tier Evaluation from NVSkills-Eval results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use. + +## Evaluation Summary + +- Skill: `cuopt-multi-objective-exploration` +- Evaluation date: 2026-06-04 +- NVSkills-Eval profile: `external` +- Environment: `astra-sandbox` +- Dataset: 3 evaluation tasks +- Attempts per task: 2 +- Pass threshold: 50% +- Overall verdict: PASS + +## Agents Used + +- `claude-code` +- `codex` + +## Metrics Used + +Reported benchmark dimensions: + +- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access. +- Correctness: checks whether the agent follows the expected workflow and produces the correct final output. +- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant. +- Effectiveness: checks whether the agent performs measurably better with the skill than without it. +- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work. + +Underlying evaluation signals used in this run: + +- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access. +- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow. +- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage. +- `accuracy` (Accuracy): grades final-answer correctness against the reference answer. +- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully. +- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations. +- `token_efficiency` (Token Efficiency): compares token usage with and without the skill. + +## Test Tasks + +The benchmark dataset contained 3 evaluation tasks: + +- Positive tasks: 2 tasks where the skill was expected to activate. +- Negative tasks: 1 tasks where no skill was expected. +- Unlabeled tasks: 0 tasks where positive/negative intent could not be inferred. + +Task composition is derived from the evaluation dataset when possible. Entries with `expected_skill` set are treated as positive skill-activation cases, while entries with `expected_skill: null` are treated as negative activation cases. + +## Results + +| Dimension | Num | `claude-code` | `codex` | +|---|---:|---:|---:| +| Security | 6 | 100% (+0%) | 100% (+0%) | +| Correctness | 6 | 82% (+27%) | 78% (+15%) | +| Discoverability | 6 | 67% (+33%) | 64% (+23%) | +| Effectiveness | 6 | 87% (+16%) | 80% (+8%) | +| Efficiency | 6 | 71% (+22%) | 63% (+13%) | + +Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available. + +## Tier 1: Static Validation Summary + +Tier 1 validation passed with observations. NVSkills-Eval ran 9 checks and found 9 total findings. + +Top findings: + +- MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Instructions' (`skills/cuopt-multi-objective-exploration/SKILL.md`) +- MEDIUM SCHEMA/body_recommended_section: Missing recommended section: '## Examples' (`skills/cuopt-multi-objective-exploration/SKILL.md`) +- LOW QUALITY/quality_discoverability: Description doesn't mention WHEN to use this skill (`skills/cuopt-multi-objective-exploration/SKILL.md`) +- LOW QUALITY/quality_discoverability: No '## Purpose' section (`skills/cuopt-multi-objective-exploration/SKILL.md`) +- LOW QUALITY/quality_reliability: No mention of error handling or validation (`skills/cuopt-multi-objective-exploration/SKILL.md`) + +## Tier 2: Deduplication Summary + +Tier 2 validation passed. NVSkills-Eval ran 2 checks and found 0 total findings. + +Notable observations: + +- Context Deduplication: Collected 1 file(s) +- Inter-Skill Deduplication: Parsed skill 'cuopt-multi-objective-exploration': 145 char description + +## Publication Recommendation + +The skill is suitable to proceed toward NVSkills-Eval publication based on this benchmark. Skill owners should keep this file with the skill and refresh it when the evaluation dataset, skill behavior, or target agents materially change. diff --git a/skills/cuopt-multi-objective-exploration/skill-card.md b/skills/cuopt-multi-objective-exploration/skill-card.md new file mode 100644 index 000000000..4ce49241d --- /dev/null +++ b/skills/cuopt-multi-objective-exploration/skill-card.md @@ -0,0 +1,81 @@ +## Description:
+Trace and interpret the Pareto frontier across competing objectives using repeated single-objective cuOpt solves (weighted-sum and ε-constraint).
+ +This skill is ready for commercial/non-commercial use.
+ +## Owner +NVIDIA
+ +### License/Terms of Use:
+Apache 2.0
+## Use Case:
+Developers and optimization engineers exploring multi-objective tradeoffs across LP, MILP, QP, and routing problems to present decision-makers with the full Pareto frontier rather than a single collapsed optimum.
+ +### Deployment Geography for Use:
+Global
+ +## Known Risks and Mitigations:
+Risk: Review before execution as proposals could introduce incorrect or misleading guidance into skills.
+Mitigation: Review and scan skill before deployment.
+ +## Reference(s):
+- [cuOpt User Guide](https://docs.nvidia.com/cuopt/user-guide/latest/introduction.html)
+- [cuopt-examples](https://github.com/NVIDIA/cuopt-examples)
+ + +## Skill Output:
+**Output Type(s):** [Analysis, Code]
+**Output Format:** [Markdown with inline code blocks]
+**Output Parameters:** [1D]
+**Other Properties Related to Output:** [None]
+ +## Evaluation Agents Used:
+- Claude Code (`claude-code`)
+- Codex (`codex`)
+ + + +## Evaluation Tasks:
+Evaluated against 3 internal evaluation tasks (2 positive skill-activation, 1 negative) with 2 attempts per task.
+ +## Evaluation Metrics Used:
+Reported benchmark dimensions:
+- Security: Checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: Checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: Checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: Checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: Checks whether the agent uses fewer tokens and avoids redundant work.
+ +Underlying evaluation signals used in this run:
+- `security`: Checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution`: Verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency`: Checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy`: Grades final-answer correctness against the reference answer.
+- `goal_accuracy`: Checks whether the overall user task completed successfully.
+- `behavior_check`: Verifies expected behavior steps, including safety expectations.
+- `token_efficiency`: Compares token usage with and without the skill.
+ + + +## Evaluation Results:
+| Dimension | Num | `claude-code` | `codex` | +|---|---:|---:|---:| +| Security | 6 | 100% (+0%) | 100% (+0%) | +| Correctness | 6 | 82% (+27%) | 78% (+15%) | +| Discoverability | 6 | 67% (+33%) | 64% (+23%) | +| Effectiveness | 6 | 87% (+16%) | 80% (+8%) | +| Efficiency | 6 | 71% (+22%) | 63% (+13%) | + +## Testing Completed:
+**[x] Agent Red-Teaming**
+**[ ] Network Security**
+**[ ] Product Security**
+ +## Skill Version(s):
+26.08.00 (source: frontmatter)
+ +## Ethical Considerations:
+NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal team to ensure this skill meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+ +(For Release on NVIDIA Platforms Only)
+Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).
diff --git a/skills/cuopt-multi-objective-exploration/skill.oms.sig b/skills/cuopt-multi-objective-exploration/skill.oms.sig new file mode 100644 index 000000000..19d830553 --- /dev/null +++ b/skills/cuopt-multi-objective-exploration/skill.oms.sig @@ -0,0 +1 @@ +{"mediaType":"application/vnd.dev.sigstore.bundle.v0.3+json","verificationMaterial":{"x509CertificateChain":{"certificates":[{"rawBytes":"MIICgzCCAgmgAwIBAgIUKIyS7SxNteQIiWzK1dWj85E6520wCgYIKoZIzj0EAwMwVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwHhcNMjYwNDAxMDAwMDAwWhcNMjgwNDIyMTUzMzA5WjBUMQswCQYDVQQGEwJVUzEbMBkGA1UECgwSTlZJRElBIENvcnBvcmF0aW9uMSgwJgYDVQQDDB9OVklESUEgQWdlbnQgU2tpbGxzIFNpZ25pbmcgMDAxMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEYoRM9bQl/dGlwSRNi6bTpIJUXH8Nv9GciP6LSflJYYMLCc296kpyuTSsk5ddbAWiDcFX3C/ydX3jwc+qCLYP6uHy9XphyLjOQ27Yb2J6rBLVtRBS1mgGco/Gr7fL6ODco4GaMIGXMB0GA1UdDgQWBBRQ/5ZW3nJ6lmo9SVk7I15o7UGmpTAfBgNVHSMEGDAWgBRPGpILxMBBleJSsBGjrMKsby1CgjAMBgNVHRMBAf8EAjAAMA4GA1UdDwEB/wQEAwIHgDA3BggrBgEFBQcBAQQrMCkwJwYIKwYBBQUHMAGGG2h0dHA6Ly9vY3NwLm5kaXMubnZpZGlhLmNvbTAKBggqhkjOPQQDAwNoADBlAjAUygu/GiOCIXrgGr4SmLgeEVDcEitfFUv7ALbvLVGVyMysB3mxmO/uInZfXzWcJZsCMQDxuoxj4ZmO30jhkPIcCxGFCOvnUsnfU3TfGcouYm4M6iRpbKvtVnHPiy4bi6pcKf0="},{"rawBytes":"MIICiDCCAg6gAwIBAgIUZsIuSv9NkpJCNqtYEfCouVv5BzowCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowVTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjEpMCcGA1UEAwwgTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBJQ0EgMDEwdjAQBgcqhkjOPQIBBgUrgQQAIgNiAASI72cR3ctKGg4VWnB3bNja6g1Z2PnOmFEopkPof+QeIcPk9rT+g9MjJnq51EQXL93a7C2GJ9J985G4o2V85VD7wJ1RaXhluHW2rf3y8bQGeAYaKMr5s/hUgn+M3/9WlWejgaAwgZ0wHQYDVR0OBBYEFE8akgvEwEGV4lKwEaOswqxvLUKCMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMBIGA1UdEwEB/wQIMAYBAf8CAQAwDgYDVR0PAQH/BAQDAgEGMDcGCCsGAQUFBwEBBCswKTAnBggrBgEFBQcwAYYbaHR0cDovL29jc3AubmRpcy5udmlkaWEuY29tMAoGCCqGSM49BAMDA2gAMGUCMQCeIMMfAbyzPDacw2MxG+Yt1cikrJX/DVxiGfXuHmkkXn6VgSzE79+lkqDErpVO2gYCMCNEColOyvUvkzZGUEI1hQ3PfMgi3FIo9tHoBKMw4/wGBLFpu/0ubtmbBXM6/UMOEw=="},{"rawBytes":"MIICRTCCAcygAwIBAgIUeJdY3rV86EdvFmG7L8LJBsyQFYkwCgYIKoZIzj0EAwMwUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTAgFw0yNjA0MDEwMDAwMDBaGA85OTk5MTIzMTIzNTk1OVowUTELMAkGA1UEBhMCVVMxGzAZBgNVBAoMEk5WSURJQSBDb3Jwb3JhdGlvbjElMCMGA1UEAwwcTlZJRElBIEFnZW50IENhcGFiaWxpdGllcyBDQTB2MBAGByqGSM49AgEGBSuBBAAiA2IABAYpiXCDjJ9NT2eSDhyHJVSw1Tbze18cGG2F/578oWvHxg23eQAhNRYdq88i1iOshZSO6C29doKui5Xpmo/7Ctw9Sx4PP2RzOmIuOLCuTdNtKcTRwi4GEsd5BAFvWj42M6NjMGEwHQYDVR0OBBYEFItnoAjjfuCEUvzyvWyI2vOGvwPjMB8GA1UdIwQYMBaAFItnoAjjfuCEUvzyvWyI2vOGvwPjMA8GA1UdEwEB/wQFMAMBAf8wDgYDVR0PAQH/BAQDAgEGMAoGCCqGSM49BAMDA2cAMGQCMCwtAjWLaNwgGWNCgdyNoTyvNhqWRECRJV2r3+7w8g0PL6NHLOsbkgE09BH95h8XlgIwTaQmbbUh2ChAJ5TA1wRiVDnCcvbzHlZl2jM2FcwQQZlk19LOAbyGMRixbu2Ww/rj"}]},"tlogEntries":[]},"dsseEnvelope":{"payload":"ewogICJfdHlwZSI6ICJodHRwczovL2luLXRvdG8uaW8vU3RhdGVtZW50L3YxIiwKICAic3ViamVjdCI6IFsKICAgIHsKICAgICAgIm5hbWUiOiAiY3VvcHQtbXVsdGktb2JqZWN0aXZlLWV4cGxvcmF0aW9uIiwKICAgICAgImRpZ2VzdCI6IHsKICAgICAgICAic2hhMjU2IjogIjlhYWE2M2VmODg5ZGJlZDNiMjBmY2IzNWNhOTRmNmRmMWM4OTY1MDY1MGU0Y2UzNTI2MjVhZTE0NWNjMzdiYzQiCiAgICAgIH0KICAgIH0KICBdLAogICJwcmVkaWNhdGVUeXBlIjogImh0dHBzOi8vbW9kZWxfc2lnbmluZy9zaWduYXR1cmUvdjEuMCIsCiAgInByZWRpY2F0ZSI6IHsKICAgICJyZXNvdXJjZXMiOiBbCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJCRU5DSE1BUksubWQiLAogICAgICAgICJkaWdlc3QiOiAiZDYzNDAwMDQ5NWY2ODIwMjRhMTk4YzE2YzQ2YjI0NjQ2NDBmYzdmMDQ2YTZjNWI2Y2U2YzA2YmJjMWFiY2ViMyIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJTS0lMTC5tZCIsCiAgICAgICAgImRpZ2VzdCI6ICIyZTU1ZTQ4OTJkNWFiNGM3MmUxMzExM2U0N2JhYjMxNzllNTY1MmQ2YjkxMDkxYjUzY2Q1YzRiNWYyZTQ0NjFhIiwKICAgICAgICAiYWxnb3JpdGhtIjogInNoYTI1NiIKICAgICAgfSwKICAgICAgewogICAgICAgICJuYW1lIjogImV2YWxzL2V2YWxzLmpzb24iLAogICAgICAgICJkaWdlc3QiOiAiYmQyNDZhMGIwZjI3ZDc2ZWM1NWY0ZTQ0ZmI0YTY5OWExYjA5MGEzMWViNTM3ZWQxZDFlNjcxNzVhZjcwY2NkNCIsCiAgICAgICAgImFsZ29yaXRobSI6ICJzaGEyNTYiCiAgICAgIH0sCiAgICAgIHsKICAgICAgICAibmFtZSI6ICJza2lsbC1jYXJkLm1kIiwKICAgICAgICAiZGlnZXN0IjogImIyZWM0OTliY2I3MTZkNmY2MTZmYjllZTdjYjRjNDIzOTk5OTRmNGQxYjRjZjc5ZDU2NWFhMzQyMmY2MDAwYjUiLAogICAgICAgICJhbGdvcml0aG0iOiAic2hhMjU2IgogICAgICB9CiAgICBdLAogICAgInNlcmlhbGl6YXRpb24iOiB7CiAgICAgICJoYXNoX3R5cGUiOiAic2hhMjU2IiwKICAgICAgIm1ldGhvZCI6ICJmaWxlcyIsCiAgICAgICJpZ25vcmVfcGF0aHMiOiBbCiAgICAgICAgIi5naXQiLAogICAgICAgICIuZ2l0YXR0cmlidXRlcyIsCiAgICAgICAgIi5naXRpZ25vcmUiLAogICAgICAgICIuZ2l0aHViIgogICAgICBdLAogICAgICAiYWxsb3dfc3ltbGlua3MiOiBmYWxzZQogICAgfQogIH0KfQ==","payloadType":"application/vnd.in-toto+json","signatures":[{"sig":"MGYCMQDC6YZ+JneK1xyVbw54nL0CZSx3RVTz41x3nuBTevnIbJRA6KL3BBgi9w0U04hXgyMCMQCCLUgA2q2sIYRXxKqsYGgLh1ZpDxZvIrw42soe+x/ALAia5+KlXRVi0yHqXEx+tQw=","keyid":""}]}} \ No newline at end of file