fix(eval): trim legacy trajectory span history #1652
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ebb876ccf
    agent_run_history
    )
    agent_run_history = str(trajectory_trace.spans)
    agent_run_history = trace_to_str(agent_run_history)
Preserve non-tool history for direct-answer runs
For runs whose trace contains only LLM/CHAIN spans (for example an agent answers directly without invoking any tool), this now replaces {{AgentRunHistory}} with trace_to_str(...), but that helper only emits spans that have a tool.name attribute and otherwise returns an empty string. The legacy trajectory evaluator prompt does not include agent_output, so those direct-answer executions lose all observable run history compared with the previous TrajectoryEvaluationTrace conversion, making the evaluator unable to judge whether the no-tool path matched the expected behavior.
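The concern above can be reproduced with a minimal sketch. Everything here is hypothetical: `FakeSpan` stands in for a real `ReadableSpan`, `tool_only_history` only mimics the behavior the review attributes to `trace_to_str` (it is not the library's implementation), and `history_with_fallback` is one possible mitigation, not the fix adopted in this PR.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeSpan:
    """Hypothetical stand-in for an OpenTelemetry ReadableSpan."""

    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


def tool_only_history(spans: list[FakeSpan]) -> str:
    """Mimic the reviewed behavior: emit only spans that carry tool.name."""
    lines = [
        f"tool: {span.attributes['tool.name']}"
        for span in spans
        if "tool.name" in span.attributes
    ]
    return "\n".join(lines)


def history_with_fallback(spans: list[FakeSpan]) -> str:
    """Assumed mitigation: fall back to span names when no tool span exists."""
    compact = tool_only_history(spans)
    if compact:
        return compact
    return "\n".join(f"span: {span.name}" for span in spans)


# A direct-answer run: only LLM/CHAIN spans, no tool invocation.
direct_answer = [FakeSpan("ChatModel"), FakeSpan("AgentChain")]
# A run that invokes one tool.
tool_run = [FakeSpan("ChatModel"), FakeSpan("search", {"tool.name": "web_search"})]

print(repr(tool_only_history(direct_answer)))
print(history_with_fallback(direct_answer))
print(history_with_fallback(tool_run))
```

With the tool-only filter, the direct-answer run yields an empty history, so the evaluator prompt would have nothing to judge. The fallback keeps the compact form for tool runs and still leaves a short trace for no-tool runs.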
Superseded by same-repo replacement #1653 so CI has access to the repo token/secrets.
Replacement for #1630 because the original branch cannot be updated cleanly from this account: direct push is blocked by repository rules, gh pr update-branch --rebase returns a GitHub GraphQL error, merge commits are disabled, and the branch now conflicts with main's uipath version bump.

Changes:
- use trace_to_str for legacy trajectory ReadableSpan history to avoid carrying long LLM/system prompt payloads into the evaluator prompt
- add a regression test for compact tool history
- pass explicit LegacyTrajectoryEvaluator type metadata in the test so mypy passes
- bump packages/uipath to 2.10.66 because 2.10.65 is now on main and already published

Local validation:
- GITHUB_EVENT_NAME=pull_request BASE_SHA=origin/main HEAD_SHA=HEAD python .github/scripts/check_version_uniqueness.py
- cd packages/uipath && uv run mypy --config-file pyproject.toml .
- cd packages/uipath && uv run ruff check .
- cd packages/uipath && uv run ruff format --check .
- cd packages/uipath && uv run pytest tests/evaluators/test_legacy_trajectory_evaluator.py
- cd packages/uipath && uv run pytest
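As a rough illustration of what a regression test for compact tool history might check, the sketch below uses plain dictionaries as hypothetical span records and an illustrative `compact_tool_history` helper. Neither is taken from the repository, and the attribute keys (`tool.name`, `input.value`) and tool name are assumptions chosen for the example.

```python
# Hypothetical span records; real ReadableSpan objects carry more fields.
SYSTEM_PROMPT = "You are a helpful agent. " * 200

spans = [
    {"name": "llm_call", "attributes": {"input.value": SYSTEM_PROMPT}},
    {
        "name": "lookup",
        "attributes": {"tool.name": "get_invoice", "input.value": '{"id": 7}'},
    },
]


def compact_tool_history(records):
    """Illustrative only: keep each tool's name and input, drop other spans."""
    lines = []
    for record in records:
        attrs = record["attributes"]
        if "tool.name" in attrs:
            lines.append(f"{attrs['tool.name']}({attrs.get('input.value', '')})")
    return "\n".join(lines)


def test_compact_history_excludes_llm_payloads():
    history = compact_tool_history(spans)
    # The tool call must survive trimming.
    assert "get_invoice" in history
    # The long system prompt must not leak into the evaluator prompt.
    assert "You are a helpful agent" not in history
    # The compact form should be smaller than a raw dump of all spans.
    assert len(history) < len(str(spans))
```

The three assertions cover the two goals stated in the change list: tool activity stays visible, and long LLM payloads are kept out of the history string.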