LLM usage records #4
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Ajay Chinthalapalli Jayakumar <a***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
818d0ba to 0c6bd87
5a0da28 to 92738e1
92738e1 to 905ea08
demianbrecht left a comment
See inline comments. The main concern is that LangChain already surfaces token usage on every AIMessage via usage_metadata — this PR re-implements that extraction manually and adds significant ceremony to task code. A LangGraph callback approach would achieve the same result transparently.
    """Extract provider-reported token counts from common LangChain responses."""

    usage_metadata = getattr(response, "usage_metadata", None)
    if isinstance(usage_metadata, dict):
        prompt = _coerce_int(usage_metadata.get("input_tokens") or usage_metadata.get("prompt_tokens"))
        completion = _coerce_int(
            usage_metadata.get("output_tokens")
            or usage_metadata.get("completion_tokens")
            or usage_metadata.get("generated_tokens")
        )
        total = _coerce_int(usage_metadata.get("total_tokens"))
        if prompt is not None or completion is not None or total is not None:
            return prompt, completion, total

    response_metadata = getattr(response, "response_metadata", None)
    if isinstance(response_metadata, dict):
        token_usage = response_metadata.get("token_usage") or response_metadata.get("usage")
        if isinstance(token_usage, dict):
            prompt = _coerce_int(token_usage.get("prompt_tokens") or token_usage.get("input_tokens"))
            completion = _coerce_int(token_usage.get("completion_tokens") or token_usage.get("output_tokens"))
            total = _coerce_int(token_usage.get("total_tokens"))
            if prompt is not None or completion is not None or total is not None:
                return prompt, completion, total

    return None, None, None
LangChain already surfaces token usage on every AIMessage via usage_metadata (input_tokens, output_tokens, total_tokens). This function is manually re-extracting data that the framework already provides — it doesn't add new information.
A LangGraph callback (on_llm_end) could capture this automatically without any custom extraction logic.
    "claude-sonnet-4-20250514": ModelPricing(3.0, 15.0),
    "claude-sonnet-4-5-20250929": ModelPricing(3.0, 15.0),
    "claude-sonnet-4-6": ModelPricing(3.0, 15.0),
    "claude-opus-4-20250514": ModelPricing(15.0, 75.0),
    "claude-opus-4-6-v1": ModelPricing(15.0, 75.0),
    "claude-haiku-4-5-20251001": ModelPricing(1.0, 5.0),
    "gpt-4o": ModelPricing(2.5, 10.0),
    "gpt-4o-mini": ModelPricing(0.15, 0.60),
    "gemini-2.0-flash": ModelPricing(0.10, 0.40),
    "gemini-2.5-flash": ModelPricing(0.30, 2.50),
    "gemini-2.5-pro": ModelPricing(1.25, 10.0),
}
This pricing table will go stale immediately. Prices change frequently, model IDs are versioned, and this doesn't account for caching discounts, batch API pricing, prompt caching writebacks, etc.
If cost tracking is needed, it belongs in config or an external source — not a compiled-in dict that requires code changes to update.
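As a rough illustration of the config-driven alternative, the sketch below loads pricing from a JSON document instead of a compiled-in dict. The JSON layout, `load_pricing`, and the inline `EXAMPLE_CONFIG` string are assumptions made for this example, not part of the PR; only the `ModelPricing` field names (`input_per_million`, `output_per_million`) come from the diff. It still ignores caching and batch discounts.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelPricing:
    input_per_million: float
    output_per_million: float


def load_pricing(raw_json: str) -> dict[str, ModelPricing]:
    """Parse a {model: {input_per_million, output_per_million}} JSON document.

    The layout is hypothetical; a real deployment would choose its own schema.
    """
    data = json.loads(raw_json)
    return {
        model: ModelPricing(
            float(entry["input_per_million"]),
            float(entry["output_per_million"]),
        )
        for model, entry in data.items()
    }


def estimate_cost_usd(
    pricing: dict[str, ModelPricing],
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """Return an estimated USD cost, or None when the model has no price entry."""
    entry = pricing.get(model)
    if entry is None:
        return None
    cost = (prompt_tokens / 1_000_000 * entry.input_per_million) + (
        completion_tokens / 1_000_000 * entry.output_per_million
    )
    return round(cost, 6)


# In practice this string would be read from a config file or an external source.
EXAMPLE_CONFIG = '{"gpt-4o-mini": {"input_per_million": 0.15, "output_per_million": 0.60}}'
pricing = load_pricing(EXAMPLE_CONFIG)
```

With this shape, a price change is a config edit rather than a code change, and unknown models simply yield `None`.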
    """Estimate USD cost for a model if pricing is known."""

    pricing = MODEL_PRICING.get(model)
    if pricing is None:
        return None
    cost = (prompt_tokens / 1_000_000 * pricing.input_per_million) + (
        completion_tokens / 1_000_000 * pricing.output_per_million
    )
    return round(cost, 6)
This len(text) / 4 heuristic is very rough, and the results get stored in the same LLMUsageRecord alongside provider-reported actuals. Downstream consumers of these records have no reliable way to distinguish precision levels.
The estimated_tokens_saved metric (raw prompt tokens from this estimate minus actual prompt tokens from the provider) is comparing a pre-call guess against a post-call actual — not a meaningful comparison.
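One possible way to make the precision level visible to downstream consumers is to tag every count with its provenance. The sketch below is only an illustration: `TokenCount`, `estimate_tokens_rough`, and `tokens_saved` are hypothetical names, not code from the PR. The only detail taken from the review is the `len(text) / 4` heuristic.

```python
from dataclasses import dataclass
from typing import Literal, Optional

TokenSource = Literal["provider", "estimate"]


@dataclass(frozen=True)
class TokenCount:
    value: int
    source: TokenSource  # where the number came from


def estimate_tokens_rough(text: str) -> TokenCount:
    """Apply the rough len(text) / 4 heuristic and label the result as an estimate."""
    return TokenCount(value=len(text) // 4, source="estimate")


def provider_tokens(value: int) -> TokenCount:
    """Wrap a provider-reported count."""
    return TokenCount(value=value, source="provider")


def tokens_saved(raw: TokenCount, actual: TokenCount) -> Optional[int]:
    """Refuse to subtract counts that have different provenance."""
    if raw.source != actual.source:
        return None
    return raw.value - actual.value
```

Under this assumption, subtracting a pre-call estimate from a post-call provider count returns `None` instead of a number that looks more precise than it is.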
    usage = llm_usage_from_response(
        response,
        task_id=ctx.task_id,
        model=model,
        node_name="summarize",
        fallback_prompt_text=f"{_SYSTEM_PROMPT}\n\n{prompt}",
        fallback_completion_text=str(response.content),
        estimated_raw_prompt_tokens=state["estimated_raw_prompt_tokens"],
        metadata={
            "deterministic_nodes": 3,
            "llm_nodes": 1,
            "rows_processed": state["rows_processed"],
            "formatted_prompt_tokens_estimate": estimate_text_tokens(prompt),
        },
    )
    ctx.record_llm_usage(
        model=usage.model,
        node_name=usage.node_name,
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total_tokens,
        estimated_cost_usd=usage.estimated_cost_usd,
        estimated_raw_prompt_tokens=usage.estimated_raw_prompt_tokens,
        estimated_tokens_saved=usage.estimated_tokens_saved,
        metadata=usage.metadata,
    )
    ctx.progress(
        "LLM usage recorded",
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total_tokens,
        estimated_cost_usd=usage.estimated_cost_usd,
        estimated_tokens_saved=usage.estimated_tokens_saved,
    )
This is the core problem with the approach — the summarize node went from ~5 lines to 30+ lines of usage-tracking ceremony. Every LLM node in every task would need this same boilerplate.
If usage tracking is a framework concern, it should be transparent. A LangGraph callback on on_llm_end could emit the llm.usage event automatically with zero changes to task code.
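A minimal sketch of what such a callback might look like is shown here. To stay runnable without LangChain installed, it uses a plain class and `SimpleNamespace` stand-ins. In a real setup the handler would presumably subclass `langchain_core.callbacks.BaseCallbackHandler` and be passed through the run config's `callbacks` list. The `emit` callable and the `UsageCallback` name are assumptions for illustration only.

```python
from types import SimpleNamespace
from typing import Any, Callable


class UsageCallback:
    """Stand-in for a BaseCallbackHandler subclass that forwards token usage."""

    def __init__(self, emit: Callable[[str, dict], None]) -> None:
        self._emit = emit  # hypothetical event sink, e.g. something on AgentContext

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # An LLMResult holds a list of candidate lists; chat generations carry
        # the AIMessage, whose usage_metadata has the provider-reported counts.
        for candidates in getattr(response, "generations", []):
            for generation in candidates:
                message = getattr(generation, "message", None)
                usage = getattr(message, "usage_metadata", None)
                if isinstance(usage, dict):
                    self._emit("llm.usage", dict(usage))


# Simulated run: fake objects shaped like an LLMResult with one chat generation.
events: list = []
handler = UsageCallback(lambda name, payload: events.append((name, payload)))
fake_message = SimpleNamespace(
    usage_metadata={"input_tokens": 120, "output_tokens": 30, "total_tokens": 150}
)
fake_result = SimpleNamespace(generations=[[SimpleNamespace(message=fake_message)]])
handler.on_llm_end(fake_result)
```

The task node itself stays unchanged in this arrangement. The handler is registered once where the graph is invoked, and every LLM call reports usage through it.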
What's in the PR?
A switchplane.usage module will normalize token counts and cost estimates, AgentContext will emit llm.usage, and the devops graph will record the one LLM node plus estimated deterministic savings. I’m going to make those focused edits and add tests for the new usage helper and event emission.