
Commit f52bb65 (parent 926234d)

Update Gemma 4 blocker status: #21365 resolved

PR ggml-org/llama.cpp#21418 (merged 2026-04-04) introduces a dedicated Gemma 4 PEG parser and adds `<|tool_response>` as an EOG token, fixing the infinite repetition in llama-server. Fix confirmed by multiple users. Included in build b8721 (2026-04-09).

- LAB_NOTEBOOK.md Entry 007: BLOCKED → UNBLOCKED, updated blocking issue section with fix details and next steps
- EXPERIMENT_PLAN_gemma4.md: P1 marked RESOLVED, bug table updated with statuses, execution sequence updated (P2 is now CURRENT)

https://claude.ai/code/session_011ZNmsVomjKANcnjWJibxnH
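The build-number gate implied by the commit message can be sketched in Python. This is a minimal sketch, not project tooling: b8721 as the first fixed build comes from the commit message above, while the `version: NNNN (hash)` line format is an assumption about the local llama.cpp binary's `--version` output.

```python
import re

# First llama.cpp release build assumed to contain the PR #21418 Gemma 4 fix
# (per the commit message); adjust if the fix lands elsewhere.
MIN_FIXED_BUILD = 8721

def includes_gemma4_fix(version_output: str) -> bool:
    """Return True if a llama.cpp version string reports build >= b8721.

    Assumes the build number is the first standalone run of 3+ digits,
    e.g. "version: 8721 (f52bb65)"; short hex runs in commit hashes are
    skipped by the length requirement and word boundaries.
    """
    m = re.search(r"\b(\d{3,})\b", version_output)
    return bool(m) and int(m.group(1)) >= MIN_FIXED_BUILD
```

A deployment script could feed it the captured output of `llama-server --version` before enabling the Gemma 4 systemd unit.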

File tree: 2 files changed, +18 -18 lines


EXPERIMENT_PLAN_gemma4.md (14 additions, 14 deletions)

````diff
@@ -2,7 +2,7 @@
 
 **Date:** 2026-04-03
 **Author:** Troy Davis
-**Status:** BLOCKED — see Prerequisites
+**Status:** UNBLOCKED (2026-04-09) — P1 blocker resolved, proceed to P2
 **Hardware:** NVIDIA Jetson Orin Nano Super 8GB (Orin SoC, Ampere GA10B sm_87, 7.4 GB LPDDR5 unified)
 **Current champion:** Qwen3.5-4B Q4_K_M — 2.6 GB model, 14.0 tok/s, 32K context, stable
 **Reference:** LAB_NOTEBOOK.md entries 001-006, JETSON_CONFIG.md
@@ -71,22 +71,22 @@ Two NVIDIA forum threads document Gemma experiences on Orin Nano Super:
 
 ## Prerequisites (Must Clear Before Experiments Begin)
 
-### P1: llama-server Infinite Repetition Bug — BLOCKER
+### P1: llama-server Infinite Repetition Bug — RESOLVED ✓
 
-**GitHub issue:** [llama.cpp #21365](https://github.com/ggerganov/llama.cpp/issues/21365)
+**GitHub issue:** [llama.cpp #21365](https://github.com/ggml-org/llama.cpp/issues/21365) (open but resolved in practice)
 
-Gemma 4 models produce infinite repetition when served via `llama-server` but work correctly in `llama-cli`. Our Jetson runs llama-server via systemd. This is a **showstopper** for server deployment.
+**RESOLVED 2026-04-09:** PR [#21418](https://github.com/ggml-org/llama.cpp/pull/21418) (merged 2026-04-04) introduces a dedicated Gemma 4 PEG parser and adds `<|tool_response>` as an EOG token, eliminating the infinite repetition in llama-server. Multiple users confirmed the fix. Included in build **b8721** (released 2026-04-09).
 
-**Additional known bugs (as of 2026-04-03):**
+**Known bugs status (as of 2026-04-09):**
 
-| Issue | Problem | Impact on Us |
-|-------|---------|-------------|
-| [#21365](https://github.com/ggerganov/llama.cpp/issues/21365) | Infinite repetition in llama-server | **Blocker** — our deployment is llama-server |
-| [#21329](https://github.com/ggerganov/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Medium — we run single-slot |
-| [#21375](https://github.com/ggerganov/llama.cpp/issues/21375) | Infinite loop in tool-call parser | Medium — affects function calling tests |
-| [#21321](https://github.com/ggerganov/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Low — may be fixed by tokenizer update |
+| Issue | Problem | Status |
+|-------|---------|--------|
+| [#21365](https://github.com/ggml-org/llama.cpp/issues/21365) | Infinite repetition in llama-server | **FIXED** by PR #21418 (b8721+) |
+| [#21329](https://github.com/ggml-org/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Unknown — we run single-slot, low risk |
+| [#21375](https://github.com/ggml-org/llama.cpp/issues/21375) | Infinite loop in tool-call parser | **FIXED** by PR #21418 per user reports |
+| [#21321](https://github.com/ggml-org/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Likely fixed by tokenizer PR #21343 |
 
-**Action:** Monitor #21365 daily. Do not download models or rebuild llama.cpp until this is confirmed fixed.
+**Action:** Proceed to P2. Target build b8721 or later.
 
 ### P2: Rebuild llama.cpp
 
@@ -358,10 +358,10 @@ After all experiments, score each viable configuration:
 ## Execution Sequence
 
 ```
-P1: Monitor llama.cpp #21365 ← CURRENT (check daily)
+P1: Monitor llama.cpp #21365 ← DONE (fixed in b8721, PR #21418)
 
   ▼ (bug fixed)
-P2: Rebuild llama.cpp from master
+P2: Rebuild llama.cpp from master ← CURRENT (target b8721+)
 P3: Regression test Qwen3.5-4B
 P4: Download Gemma 4 E2B Q4_K_M
 
````
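The failure mode behind P1 (a generation that ends in the same phrase repeated forever) is easy to check for mechanically when re-running the old repro after the rebuild. Below is a minimal detector sketch; the whitespace-token heuristic and the thresholds (`max_ngram=8`, `min_repeats=4`) are illustrative assumptions, not values from the plan or the issue.

```python
def has_trailing_repetition(text: str, max_ngram: int = 8, min_repeats: int = 4) -> bool:
    """Heuristic check for the #21365 symptom: the output's tail is the
    same whitespace-delimited n-gram repeated min_repeats or more times.

    Thresholds are assumptions chosen for illustration.
    """
    tokens = text.split()
    for n in range(1, max_ngram + 1):
        window = tokens[-n * min_repeats:]
        if len(window) < n * min_repeats:
            break  # output too short to hold min_repeats copies of a larger n-gram
        # True when the window is exact copies of its first n tokens.
        if all(window[i] == window[i % n] for i in range(len(window))):
            return True
    return False
```

A post-rebuild smoke test could run the original repro prompt against llama-server and fail if this returns True on the completion text.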
LAB_NOTEBOOK.md (4 additions, 4 deletions)

```diff
@@ -553,9 +553,9 @@ Two forum threads document Gemma experiences on Orin Nano Super:
 
 ### Blocking Issue
 
-**llama-server infinite repetition bug ([llama.cpp #21365](https://github.com/ggerganov/llama.cpp/issues/21365)):** Gemma 4 produces infinite repetition in `llama-server` but works correctly in `llama-cli`. Our deployment is llama-server via systemd. This is a showstopper — must be resolved before any experiments can proceed.
+**llama-server infinite repetition bug ([llama.cpp #21365](https://github.com/ggml-org/llama.cpp/issues/21365)):** ~~Gemma 4 produces infinite repetition in `llama-server` but works correctly in `llama-cli`.~~ **RESOLVED (2026-04-09):** PR [#21418](https://github.com/ggml-org/llama.cpp/pull/21418) (merged 2026-04-04) introduces a dedicated Gemma 4 PEG parser, adds `<|tool_response>` as an EOG token, and removes Gemma 4 from the generic autoparser. Multiple users confirmed the fix resolves the infinite repetition in llama-server. The fix is included in build **b8721** (released 2026-04-09). Issue #21365 remains formally open, but the bug is resolved in practice.
 
-Additional bugs: `--parallel` crash ([#21329](https://github.com/ggerganov/llama.cpp/issues/21329)), tool-call parser loop ([#21375](https://github.com/ggerganov/llama.cpp/issues/21375)), `<unused24>` token generation ([#21321](https://github.com/ggerganov/llama.cpp/issues/21321)).
+Additional bug status (2026-04-09): `--parallel` crash ([#21329](https://github.com/ggml-org/llama.cpp/issues/21329)) — status unknown; tool-call parser loop ([#21375](https://github.com/ggml-org/llama.cpp/issues/21375)) — confirmed fixed by PR #21418 per user reports; `<unused24>` token generation ([#21321](https://github.com/ggml-org/llama.cpp/issues/21321)) — likely fixed by tokenizer PR #21343.
 
 ### Experiment Plan
 
@@ -571,10 +571,10 @@ Prerequisites before any testing: rebuild llama.cpp (need b8641+ for Gemma 4 arc
 
 ### Status
 
-**BLOCKED** on llama.cpp #21365. Monitoring daily.
+**UNBLOCKED** (2026-04-09) — PR [#21418](https://github.com/ggml-org/llama.cpp/pull/21418) merged 2026-04-04, fixing the llama-server infinite repetition bug. Build b8721 (2026-04-09) is the latest release and includes the fix. Next steps: P2 (rebuild llama.cpp to b8721+), then the P3 regression test, then P4 model downloads.
 
 ### Decision
 
-No changes to current configuration. Qwen3.5-4B Q4_K_M remains the active default.
+No changes to the current configuration yet. Qwen3.5-4B Q4_K_M remains the active default pending the rebuild and Gemma 4 experiments.
 
 ---
```
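The P3 regression step referenced in the Status section can be sketched as a throughput check against the recorded champion baseline (14.0 tok/s for Qwen3.5-4B Q4_K_M). The 5% tolerance is an assumption for illustration; the notebook does not state a threshold.

```python
# Champion baseline from LAB_NOTEBOOK.md (Qwen3.5-4B Q4_K_M on the Orin Nano Super).
BASELINE_TPS = 14.0

def regression_ok(measured_tps: float, tolerance: float = 0.05) -> bool:
    """True if post-rebuild generation speed stays within `tolerance`
    (hypothetical 5% default) of the recorded baseline."""
    return measured_tps >= BASELINE_TPS * (1.0 - tolerance)
```

If the rebuilt b8721+ binary passes this gate on the existing benchmark prompt, P4 (Gemma 4 model downloads) can begin.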
