**Date:** 2026-04-03
**Author:** Troy Davis
**Status:** UNBLOCKED (2026-04-09) — P1 blocker resolved, proceed to P2
**Hardware:** NVIDIA Jetson Orin Nano Super 8GB (Orin SoC, Ampere GA10B sm_87, 7.4 GB LPDDR5 unified)
**Current champion:** Qwen3.5-4B Q4_K_M — 2.6 GB model, 14.0 tok/s, 32K context, stable
**Reference:** LAB_NOTEBOOK.md entries 001-006, JETSON_CONFIG.md

## Prerequisites (Must Clear Before Experiments Begin)

### P1: llama-server Infinite Repetition Bug — RESOLVED ✓

**GitHub issue:** [llama.cpp #21365](https://github.com/ggml-org/llama.cpp/issues/21365) (open but resolved in practice)

**RESOLVED 2026-04-09:** PR [#21418](https://github.com/ggml-org/llama.cpp/pull/21418) (merged 2026-04-04) introduces a dedicated Gemma 4 PEG parser and adds `<|tool_response>` as an EOG token, eliminating the infinite repetition in llama-server. Multiple users confirmed the fix. Included in build **b8721** (released 2026-04-09).

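Before trusting the fix in the systemd deployment, completions from our own llama-server build can be screened for the degenerate output that #21365 produced. A minimal sketch: flag a completion whose tail is one token repeated over and over. The 8-token threshold and whitespace tokenization are assumptions for illustration, not from the issue.

```python
# Hedged sketch: detect the degenerate repetition reported in llama.cpp #21365.
# Threshold and tokenization are assumptions; tune against real completions.
def looks_like_infinite_repetition(text: str, max_repeats: int = 8) -> bool:
    """Return True if the completion ends in one token repeated max_repeats times."""
    tokens = text.split()
    if len(tokens) < max_repeats:
        return False
    tail = tokens[-max_repeats:]
    return len(set(tail)) == 1  # a single distinct token across the whole tail

print(looks_like_infinite_repetition("fine output " + "loop " * 20))  # True
print(looks_like_infinite_repetition("a normal, varied completion"))  # False
```

Running this over a batch of completions from the b8721 build (and comparing against the pre-fix build) gives a cheap regression signal without manual reading.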
|
80 | | -**Additional known bugs (as of 2026-04-03):** |
| 80 | +**Known bugs status (as of 2026-04-09):** |
81 | 81 |
|
82 | | -| Issue | Problem | Impact on Us | |
83 | | -|-------|---------|-------------| |
84 | | -| [#21365](https://github.com/ggerganov/llama.cpp/issues/21365) | Infinite repetition in llama-server | **Blocker** — our deployment is llama-server | |
85 | | -| [#21329](https://github.com/ggerganov/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Medium — we run single-slot | |
86 | | -| [#21375](https://github.com/ggerganov/llama.cpp/issues/21375) | Infinite loop in tool-call parser | Medium — affects function calling tests | |
87 | | -| [#21321](https://github.com/ggerganov/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Low — may be fixed by tokenizer update | |
| 82 | +| Issue | Problem | Status | |
| 83 | +|-------|---------|--------| |
| 84 | +| [#21365](https://github.com/ggml-org/llama.cpp/issues/21365) | Infinite repetition in llama-server | **FIXED** by PR #21418 (b8721+) | |
| 85 | +| [#21329](https://github.com/ggml-org/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Unknown — we run single-slot, low risk | |
| 86 | +| [#21375](https://github.com/ggml-org/llama.cpp/issues/21375) | Infinite loop in tool-call parser | **FIXED** by PR #21418 per user reports | |
| 87 | +| [#21321](https://github.com/ggml-org/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Likely fixed by tokenizer PR #21343 | |
88 | 88 |
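The #21321 row is only "likely fixed", so it deserves a cheap check on our own build rather than taking the upstream reports on faith. A minimal sketch for scanning completions; the `<unusedNN>` pattern generalizes the one token name reported in the issue and is an assumption.

```python
# Hedged sketch: scan completions for stray <unusedNN> tokens (llama.cpp #21321).
# The pattern generalizes "<unused24>" from the issue; adjust if other token
# names leak in practice.
import re

def stray_unused_tokens(text: str) -> list[str]:
    """Return every <unusedNN>-style token found in a completion, in order."""
    return re.findall(r"<unused\d+>", text)

print(stray_unused_tokens("hello <unused24> world"))  # ['<unused24>']
print(stray_unused_tokens("clean output"))            # []
```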
**Action:** Proceed to P2. Target build b8721 or later.

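The experiment scripts can enforce the build target mechanically, refusing to run against a pre-fix binary. A sketch, assuming `llama-server --version` output contains either a `bNNNN` tag or a `version: NNNN` line; the exact output format is an assumption, so adjust the regex to what the binary actually prints.

```python
# Hedged sketch: gate experiment scripts on the installed llama.cpp build number.
# b8721 is the first build containing PR #21418; the version-string format
# matched here is an assumption.
import re

MIN_BUILD = 8721

def build_is_new_enough(version_output: str, minimum: int = MIN_BUILD) -> bool:
    """True if the reported build number is at or past the fix build."""
    match = re.search(r"\b(?:b|version:\s*)(\d+)\b", version_output)
    return match is not None and int(match.group(1)) >= minimum

print(build_is_new_enough("version: 8721 (1a2b3c4)"))  # True
print(build_is_new_enough("b8699"))                    # False
```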
### P2: Rebuild llama.cpp

## Execution Sequence

```
P1: Monitor llama.cpp #21365 ← DONE (fixed in b8721, PR #21418)
│
▼ (bug fixed)
P2: Rebuild llama.cpp from master ← CURRENT (target b8721+)
P3: Regression test Qwen3.5-4B
P4: Download Gemma 4 E2B Q4_K_M
│
```