**Date:** 2026-04-03
**Author:** Troy Davis
**Status:** UNBLOCKED (2026-04-09) — P1 blocker resolved, proceed to P2
**Hardware:** NVIDIA Jetson Orin Nano Super 8GB (Orin SoC, Ampere GA10B sm_87, 7.4 GB LPDDR5 unified)
**Current champion:** Qwen3.5-4B Q4_K_M — 2.6 GB model, 14.0 tok/s, 32K context, stable
**Reference:** LAB_NOTEBOOK.md entries 001-006, JETSON_CONFIG.md

## Prerequisites (Must Clear Before Experiments Begin)

### P1: llama-server Infinite Repetition Bug — RESOLVED ✓

**GitHub issue:** [llama.cpp #21365](https://github.com/ggml-org/llama.cpp/issues/21365) (open but resolved in practice)

**RESOLVED 2026-04-09:** PR [#21418](https://github.com/ggml-org/llama.cpp/pull/21418) (merged 2026-04-04) introduces a dedicated Gemma 4 PEG parser and adds `<|tool_response>` as an EOG token, eliminating the infinite repetition in llama-server. Multiple users confirmed the fix. Included in build **b8721** (released 2026-04-09).

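Before trusting the fix in the systemd deployment, completions from our own llama-server build can be screened for the degenerate output that #21365 produced. A minimal sketch: flag a completion whose tail is one token repeated over and over. The 8-token threshold and whitespace tokenization are assumptions for illustration, not from the issue.

```python
# Hedged sketch: detect the degenerate repetition reported in llama.cpp #21365.
# Threshold and tokenization are assumptions; tune against real completions.
def looks_like_infinite_repetition(text: str, max_repeats: int = 8) -> bool:
    """Return True if the completion ends in one token repeated max_repeats times."""
    tokens = text.split()
    if len(tokens) < max_repeats:
        return False
    tail = tokens[-max_repeats:]
    return len(set(tail)) == 1  # a single distinct token across the whole tail

print(looks_like_infinite_repetition("fine output " + "loop " * 20))  # True
print(looks_like_infinite_repetition("a normal, varied completion"))  # False
```

Running this over a batch of completions from the b8721 build (and comparing against the pre-fix build) gives a cheap regression signal without manual reading.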
|
80 | | -**Additional known bugs (as of 2026-04-03):** |
| 80 | +**Known bugs status (as of 2026-04-09):** |
81 | 81 |
|
82 | | -| Issue | Problem | Impact on Us | |
83 | | -|-------|---------|-------------| |
84 | | -| [#21365](https://github.com/ggerganov/llama.cpp/issues/21365) | Infinite repetition in llama-server | **Blocker** — our deployment is llama-server | |
85 | | -| [#21329](https://github.com/ggerganov/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Medium — we run single-slot | |
86 | | -| [#21375](https://github.com/ggerganov/llama.cpp/issues/21375) | Infinite loop in tool-call parser | Medium — affects function calling tests | |
87 | | -| [#21321](https://github.com/ggerganov/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Low — may be fixed by tokenizer update | |
| 82 | +| Issue | Problem | Status | |
| 83 | +|-------|---------|--------| |
| 84 | +| [#21365](https://github.com/ggml-org/llama.cpp/issues/21365) | Infinite repetition in llama-server | **FIXED** by PR #21418 (b8721+) | |
| 85 | +| [#21329](https://github.com/ggml-org/llama.cpp/issues/21329) | `--parallel` crashes with Gemma 4 | Unknown — we run single-slot, low risk | |
| 86 | +| [#21375](https://github.com/ggml-org/llama.cpp/issues/21375) | Infinite loop in tool-call parser | **FIXED** by PR #21418 per user reports | |
| 87 | +| [#21321](https://github.com/ggml-org/llama.cpp/issues/21321) | Generates `<unused24>` tokens | Likely fixed by tokenizer PR #21343 | |
88 | 88 |
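The #21321 row is only "likely fixed", so it deserves a cheap check on our own build rather than taking the upstream reports on faith. A minimal sketch for scanning completions; the `<unusedNN>` pattern generalizes the one token name reported in the issue and is an assumption.

```python
# Hedged sketch: scan completions for stray <unusedNN> tokens (llama.cpp #21321).
# The pattern generalizes "<unused24>" from the issue; adjust if other token
# names leak in practice.
import re

def stray_unused_tokens(text: str) -> list[str]:
    """Return every <unusedNN>-style token found in a completion, in order."""
    return re.findall(r"<unused\d+>", text)

print(stray_unused_tokens("hello <unused24> world"))  # ['<unused24>']
print(stray_unused_tokens("clean output"))            # []
```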
**Action:** Proceed to P2. Target build b8721 or later.

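The experiment scripts can enforce the build target mechanically, refusing to run against a pre-fix binary. A sketch, assuming `llama-server --version` output contains either a `bNNNN` tag or a `version: NNNN` line; the exact output format is an assumption, so adjust the regex to what the binary actually prints.

```python
# Hedged sketch: gate experiment scripts on the installed llama.cpp build number.
# b8721 is the first build containing PR #21418; the version-string format
# matched here is an assumption.
import re

MIN_BUILD = 8721

def build_is_new_enough(version_output: str, minimum: int = MIN_BUILD) -> bool:
    """True if the reported build number is at or past the fix build."""
    match = re.search(r"\b(?:b|version:\s*)(\d+)\b", version_output)
    return match is not None and int(match.group(1)) >= minimum

print(build_is_new_enough("version: 8721 (1a2b3c4)"))  # True
print(build_is_new_enough("b8699"))                    # False
```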
### P2: Rebuild llama.cpp

## Execution Sequence

```
P1: Monitor llama.cpp #21365 ← DONE (fixed in b8721, PR #21418)
│
▼ (bug fixed)
P2: Rebuild llama.cpp from master ← CURRENT (target b8721+)
P3: Regression test Qwen3.5-4B
P4: Download Gemma 4 E2B Q4_K_M
│
```