fix(json): extract structured payloads from noisy model output#576
Niko96-dotcom wants to merge 1 commit into plastic-labs:main
Conversation
Walkthrough
A JSON extraction pipeline was added to preprocess noisy input.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/json_parser.py`:
- Around line 14-19: The current _strip_reasoning_wrappers uses _THINK_TAG_RE to
remove all `<think>...</think>` occurrences across the whole text which can
corrupt JSON payloads (e.g., inside string values); update
_strip_reasoning_wrappers to only remove leading/trailing reasoning wrappers
(anchor the pattern to the start and/or end) or, better, move wrapper removal to
after JSON candidate extraction so functions that extract JSON (the JSON
candidate extraction routine referenced in the file) operate on the original
text; specifically, modify _THINK_TAG_RE and _strip_reasoning_wrappers (and the
JSON extraction flow that calls it) so tags inside extracted JSON strings are
preserved while only outer/leading wrappers are stripped.
- Around line 22-32: _fenced_candidates currently loses the priority information
and only returns bodies, allowing _best_valid_json_candidate to pick a
higher-scored generic fence over a later tagged JSON fence; modify
_fenced_candidates to preserve and return the priority with each body (e.g.,
list of (priority, body)) and update _best_valid_json_candidate to prefer lower
priority (tagged JSON) first when selecting among valid JSON candidates—for
example, by sorting/choosing based on (priority, -score) or by grouping by
priority and only comparing scores within the same priority.
- Around line 35-76: The scanner _collect_balanced_json_candidates can get stuck
when a noisy unmatched "{" or "[" appears before a valid JSON because it never
resets start on a mismatched closure; update the loop so that when a closing
char is seen that doesn't match stack[-1] (or stack is empty) you discard the
current candidate by setting start = None and stack = [], then re-evaluate the
same character as a potential opener (if char in "{[" then set start = i and
push the matching closer). Also ensure escaped chars and in_string handling
remain unchanged and let unterminated candidates fall away at end-of-text.
- Around line 86-90: The current scoring gives schema-shaped payloads a higher
primary score (has_schema_signal -> 2), letting long schema markers beat real
payloads; change the primary score logic so schema-shaped candidates are
penalized before length tiebreaking (e.g., set primary = 0 when
has_schema_signal is true and 2 otherwise). Update the return in the block that
computes keys/schema_keys/has_schema_signal (variables payload, keys,
schema_keys, has_schema_signal, candidate) to return the inverted primary score
so schema-like objects lose priority before comparing len(keys) and
len(candidate).
In `@tests/utils/test_json_parser.py`:
- Around line 24-45: Add regression tests covering ambiguous candidate cases for
extract_json_payload: ensure fenced ```json blocks override earlier non-json
fences, literal `<think>…</think>` content is preserved inside valid/noisy JSON,
unclosed prefix objects are ignored and later valid payloads are chosen, and
that an "Actual" payload with keys like "explicit" wins over a schema-shaped
"Schema" candidate; update tests in tests/utils/test_json_parser.py by adding
cases that exercise these behaviors and reference extract_json_payload (and
indirectly src/utils/json_parser.py) so the parser’s precedence and preservation
rules are validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1345eb18-6ba8-46c5-99b2-59873b1eea6e
📒 Files selected for processing (2)
- src/utils/json_parser.py
- tests/utils/test_json_parser.py
```python
_THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.IGNORECASE | re.DOTALL)


def _strip_reasoning_wrappers(text: str) -> str:
    """Remove common reasoning/prose wrappers before JSON extraction."""
    return _THINK_TAG_RE.sub("", text.strip()).strip()
```
Avoid stripping `<think>` tags inside extracted JSON strings.
Because stripping runs before candidate extraction, noisy output like `Here: {"content":"literal <think>x</think>"}` will silently mutate the payload content. Limit stripping to leading reasoning wrappers, or extract JSON candidates before applying wrapper removal.
🐛 Proposed fix
```diff
-_THINK_TAG_RE = re.compile(r"<think>.*?</think>", re.IGNORECASE | re.DOTALL)
+_LEADING_THINK_TAG_RE = re.compile(
+    r"^\s*(?:<think>.*?</think>\s*)+",
+    re.IGNORECASE | re.DOTALL,
+)

 def _strip_reasoning_wrappers(text: str) -> str:
     """Remove common reasoning/prose wrappers before JSON extraction."""
-    return _THINK_TAG_RE.sub("", text.strip()).strip()
+    return _LEADING_THINK_TAG_RE.sub("", text.strip()).strip()
```

Also applies to: 121-127
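To make the difference concrete, here is a tiny standalone sketch (identifier names are mine, mirroring but not quoting the PR): the leading-anchored pattern strips only the reasoning preamble, while the current global substitution also deletes `<think>` text that lives inside a JSON string value.

```python
import re

# Hypothetical names; the anchored pattern follows the proposed fix above.
LEADING_THINK_RE = re.compile(
    r"^\s*(?:<think>.*?</think>\s*)+", re.IGNORECASE | re.DOTALL
)
GLOBAL_THINK_RE = re.compile(r"<think>.*?</think>", re.IGNORECASE | re.DOTALL)


def strip_leading_reasoning(text: str) -> str:
    """Remove reasoning wrappers only at the very start of the text."""
    return LEADING_THINK_RE.sub("", text.strip()).strip()


noisy = '<think>plan</think>{"content":"literal <think>x</think>"}'
anchored = strip_leading_reasoning(noisy)   # payload's in-string tag survives
corrupted = GLOBAL_THINK_RE.sub("", noisy)  # in-string tag silently deleted
```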
```python
def _fenced_candidates(text: str) -> list[str]:
    candidates: list[str] = []
    for match in _JSON_FENCE_RE.finditer(text):
        body = match.group("body").strip()
        if not body:
            continue
        lang = (match.group("lang") or "").strip().lower()
        priority = 0 if lang == "json" else 1
        candidates.append((priority, body))
    candidates.sort(key=lambda item: item[0])
    return [body for _, body in candidates]
```
Enforce JSON-fence priority before candidate scoring.
`_fenced_candidates()` only sorts tagged `json` fences first, but `_best_valid_json_candidate()` can still select a stronger-scored generic fence over a later tagged JSON fence. That violates the "prefer tagged JSON fences" behavior when both fenced blocks contain valid JSON.
🐛 Proposed fix
```diff
-def _fenced_candidates(text: str) -> list[str]:
+def _fenced_candidates(text: str, *, json_only: bool = False) -> list[str]:
     candidates: list[str] = []
     for match in _JSON_FENCE_RE.finditer(text):
         body = match.group("body").strip()
         if not body:
             continue
         lang = (match.group("lang") or "").strip().lower()
-        priority = 0 if lang == "json" else 1
-        candidates.append((priority, body))
-    candidates.sort(key=lambda item: item[0])
-    return [body for _, body in candidates]
+        if json_only and lang != "json":
+            continue
+        if not json_only and lang == "json":
+            continue
+        candidates.append(body)
+    return candidates
```

```diff
-    fenced_valid = _best_valid_json_candidate(_fenced_candidates(cleaned))
+    fenced_valid = _best_valid_json_candidate(
+        _fenced_candidates(cleaned, json_only=True)
+    )
+    if fenced_valid:
+        return fenced_valid
+
+    fenced_valid = _best_valid_json_candidate(_fenced_candidates(cleaned))
     if fenced_valid:
         return fenced_valid
```

Also applies to: 129-131
```python
def _collect_balanced_json_candidates(text: str) -> list[str]:
    """Return all balanced JSON object/array substrings, in encounter order."""
    candidates: list[str] = []
    start = None
    stack: list[str] = []
    in_string = False
    escape = False

    for i, char in enumerate(text):
        if start is None:
            if char not in "[{":
                continue
            start = i
            stack.append("}" if char == "{" else "]")
            continue

        if escape:
            escape = False
            continue

        if char == "\\":
            escape = True
            continue

        if char == '"':
            in_string = not in_string
            continue

        if in_string:
            continue

        if char == "{":
            stack.append("}")
        elif char == "[":
            stack.append("]")
        elif stack and char == stack[-1]:
            stack.pop()
            if not stack and start is not None:
                candidates.append(text[start : i + 1].strip())
                start = None

    return candidates
```
Resync after malformed leading candidates.
If noisy prose contains an unmatched `{` or `[` before the real payload, this scanner never resets `start`, so later valid JSON is missed entirely. Discard the current candidate on mismatched closure or scan from each opener independently.
🐛 Example regression to add
```python
def test_extract_json_payload_ignores_unclosed_prefix_object():
    raw = 'debug {not closed\nActual: {"explicit":[]}'
    assert json.loads(extract_json_payload(raw)) == {"explicit": []}
```
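As a cross-check of the "scan from each opener independently" alternative, the standard library's `json.JSONDecoder.raw_decode` can drive the scan: it parses one value starting at a given index and returns where it ends, so a noisy unmatched `{` simply fails to decode and the walk continues. The function name below is hypothetical, not the PR's.

```python
import json


def collect_json_candidates(text: str) -> list[str]:
    """Collect decodable JSON object/array substrings, trying every opener."""
    decoder = json.JSONDecoder()
    candidates: list[str] = []
    i = 0
    while i < len(text):
        if text[i] in "{[":
            try:
                _, end = decoder.raw_decode(text, i)
            except json.JSONDecodeError:
                # Unmatched or malformed opener: retry from the next character.
                i += 1
                continue
            candidates.append(text[i:end].strip())
            i = end  # skip past the decoded value (no nested re-scan)
        else:
            i += 1
    return candidates
```

Unlike the reviewed scanner, a decode failure here only costs a retry at the next character, so an unclosed prefix object can no longer mask a valid payload that follows it.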
```python
if isinstance(payload, dict):
    keys = set(payload)
    schema_keys = {"explicit", "deductive", "inductive", "observations", "facts"}
    has_schema_signal = int(bool(keys & schema_keys))
    return (2 if has_schema_signal else 0, len(keys), len(candidate))
```
Penalize schema-shaped candidates before ranking by size.
A schema such as `{"explicit":{"type":"array","items":{"type":"object"}}}` scores as a payload because it has `explicit`, and its longer length can beat the actual `{"explicit":[]}` payload. Add a schema-marker penalty before using length as a tiebreaker.
🐛 Proposed direction
```diff
 if isinstance(payload, dict):
     keys = set(payload)
-    schema_keys = {"explicit", "deductive", "inductive", "observations", "facts"}
-    has_schema_signal = int(bool(keys & schema_keys))
-    return (2 if has_schema_signal else 0, len(keys), len(candidate))
+    payload_keys = {"explicit", "deductive", "inductive", "observations", "facts"}
+    schema_marker_keys = {
+        "$schema",
+        "type",
+        "properties",
+        "required",
+        "items",
+        "additionalProperties",
+    }
+    has_payload_signal = int(bool(keys & payload_keys))
+    looks_like_schema = int(
+        bool(keys & schema_marker_keys)
+        or any(isinstance(value, dict) and "type" in value for value in payload.values())
+    )
+    return (2 if has_payload_signal else 0, -looks_like_schema, len(keys), len(candidate))
```
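Tuple scores compare lexicographically in Python, so a `-looks_like_schema` term demotes schema-shaped dicts before key count or length is ever consulted. Here is a self-contained sketch along those lines (the constants and the non-dict fallback are my assumptions, not the PR's code):

```python
import json

# Assumed key sets, mirroring the proposed direction above.
PAYLOAD_KEYS = {"explicit", "deductive", "inductive", "observations", "facts"}
SCHEMA_MARKER_KEYS = {
    "$schema", "type", "properties", "required", "items", "additionalProperties",
}


def score(candidate: str) -> tuple[int, int, int, int]:
    """Rank a valid-JSON candidate; higher tuples win under max()."""
    payload = json.loads(candidate)
    if not isinstance(payload, dict):
        # Assumed fallback for arrays/scalars: above schema-shaped dicts,
        # below dicts that carry payload keys.
        return (1, 0, 0, len(candidate))
    keys = set(payload)
    has_payload_signal = int(bool(keys & PAYLOAD_KEYS))
    looks_like_schema = int(
        bool(keys & SCHEMA_MARKER_KEYS)
        or any(isinstance(v, dict) and "type" in v for v in payload.values())
    )
    return (2 if has_payload_signal else 0, -looks_like_schema, len(keys), len(candidate))


schema = '{"explicit":{"type":"array","items":{"type":"object"}}}'
actual = '{"explicit":[]}'
```

The longer schema-shaped candidate now loses on the second tuple element, before length can break the tie.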
```python
def test_extract_json_payload_prefers_json_fence_over_earlier_non_json_fence():
    raw = '```text\nnot json\n```\n```json\n{"explicit":[{"content":"I live in Berlin"}]}\n```'
    extracted = extract_json_payload(raw)
    assert json.loads(extracted) == {"explicit": [{"content": "I live in Berlin"}]}


def test_extract_json_payload_prefers_payload_over_earlier_schema_object():
    raw = 'Schema: {"type":"object"}\nActual: {"explicit":[{"content":"I use Cubase"}]}'
    extracted = extract_json_payload(raw)
    assert json.loads(extracted) == {"explicit": [{"content": "I use Cubase"}]}


def test_extract_json_payload_preserves_literal_think_text_inside_valid_json():
    raw = '{"content":"literal <think> tag","explicit":[]}'
    extracted = extract_json_payload(raw)
    assert json.loads(extracted) == {"content": "literal <think> tag", "explicit": []}


def test_extract_json_payload_prefers_later_array_payload_over_earlier_schema_dict():
    raw = 'Schema: {"type":"object"}\nActual: [{"content":"I use Cubase"}]'
    extracted = extract_json_payload(raw)
    assert json.loads(extracted) == [{"content": "I use Cubase"}]
```
Add regressions for the remaining ambiguous-candidate cases.
Current tests cover the intended path, but they would not catch the fence-priority, literal `<think>...</think>` corruption, unclosed-prefix, or schema-shaped `explicit` candidate failures flagged in src/utils/json_parser.py.
🧪 Suggested tests
```diff
+def test_extract_json_payload_prefers_json_fence_over_valid_generic_fence():
+    raw = (
+        '```\n{"explicit":[{"content":"generic"}]}\n```\n'
+        '```json\n{"type":"object"}\n```'
+    )
+    assert json.loads(extract_json_payload(raw)) == {"type": "object"}
+
+
+def test_extract_json_payload_preserves_literal_think_text_inside_noisy_json():
+    raw = 'Here: {"content":"literal <think>tag</think>","explicit":[]}'
+    extracted = extract_json_payload(raw)
+    assert json.loads(extracted) == {
+        "content": "literal <think>tag</think>",
+        "explicit": [],
+    }
+
+
+def test_extract_json_payload_ignores_unclosed_prefix_object():
+    raw = 'debug {not closed\nActual: {"explicit":[]}'
+    assert json.loads(extract_json_payload(raw)) == {"explicit": []}
+
+
+def test_extract_json_payload_prefers_payload_over_schema_with_payload_keys():
+    raw = (
+        'Schema: {"explicit":{"type":"array","items":{"type":"object"}}}\n'
+        'Actual: {"explicit":[]}'
+    )
+    assert json.loads(extract_json_payload(raw)) == {"explicit": []}
```
Summary
- Preserve literal `<think>` text inside valid JSON

Test Plan
```shell
set -a && source /Users/nikolaymohr/honcho/.env && set +a && uv run pytest tests/utils/test_json_parser.py -q
```

Why
OpenAI-compatible providers keep wrapping structured responses in junk. The old path was too trusting and regularly grabbed the wrong payload.
Summary by CodeRabbit
New Features
Tests