Skip to content

Fix singleton session demand identity#2053

Merged
julianknutsen merged 2 commits into
mainfrom
fix/1439-duplicate-nonpool-session
May 16, 2026
Merged

Fix singleton session demand identity#2053
julianknutsen merged 2 commits into
mainfrom
fix/1439-duplicate-nonpool-session

Conversation

@julianknutsen
Copy link
Copy Markdown
Collaborator

Fixes #1439.

Summary

  • Keep max_active_sessions = 1 demand on the configured canonical agent identity instead of synthesizing a -1 pool instance.
  • Preserve concrete slot identities and pool_slot metadata for agents that actually support instance expansion.
  • Add regression coverage for the cashmaster/refinery-style singleton demand path.

Tests

  • go test ./cmd/gc -run 'TestBuildDesiredState_(MaxOneAgentDemandUsesCanonicalIdentity|NewPoolSessionBeadCreatedWithConcreteIdentity)$' -count=1
  • go test ./cmd/gc -run 'Test(BuildDesiredState_(MaxOneAgentDemandUsesCanonicalIdentity|NewPoolSessionBeadCreatedWithConcreteIdentity|NewPoolSessionBeadDefersAliasWhenConcreteAliasTaken|DoesNotCreateDuplicatePoolBeadForDiscoveredSession|PoolBeadIdentityAgreesAcrossRealizeAndCanonicalHelper)|ComputePoolDesiredStates_MaxOneTemplatesStillParticipateInDemand|CanonicalSessionIdentity)$' -count=1
  • make test
  • Pre-commit hook: lint-changed, generated docs/spec checks, go vet ./..., make test

@julianknutsen julianknutsen added status/needs-review-auto PR review requested with auto approval status/reviewing and removed status/needs-review-auto PR review requested with auto approval labels May 13, 2026
@github-actions github-actions Bot added the status/needs-triage Inbox — we haven't looked at it yet label May 13, 2026
@julianknutsen julianknutsen added status/needs-review-auto PR review requested with auto approval status/reviewing and removed status/needs-review-auto PR review requested with auto approval status/needs-triage Inbox — we haven't looked at it yet status/reviewing labels May 13, 2026
@blacksmith-sh

This comment has been minimized.

Copy link
Copy Markdown
Collaborator

@quad341 quad341 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @julianknutsen — really sorry to bounce this back after taking a swing at the merge on our side. The structural fix (poolDesiredRequestIdentity / claimDesiredPoolSlot) is genuinely good, the unit-test surfaces you picked are exactly the right pins, and the integration test reconciled cleanly. But the rebase surfaced something that's a call only you can make, so I'm flagging it rather than guessing.

What changed underneath you: main commit ff7fff79 (the Kimi merge) reworked config.Agent.SupportsInstanceExpansion() in internal/config/config.go:2360-2381 to give max_active_sessions=1 two distinct flavors:

  • MinActiveSessions set OR ScaleCheck set → "pool flavor" → still expands to name-1
  • Neither set → "named-session flavor" → canonical name

Your new test TestBuildDesiredState_MaxOneAgentDemandUsesCanonicalIdentity uses fixture {MaxActiveSessions: 1, ScaleCheck: "printf 1"} and asserts canonical identity. Under the new SupportsInstanceExpansion(), that fixture is now explicitly pool-flavor, so your own poolDesiredRequestIdentity correctly routes it to the slot path and emits cashmaster/refinery-1 — failing the assertion. The two intents are direct opposites and there's no fixture I can substitute on your behalf that would still capture what you meant: any demand source (Min or ScaleCheck) trips pool-flavor on main, and without a demand source there's no session to inspect.

What we'd like from you — pick one:

  1. Narrow the scope to true singletons (the path that's still buggy on main): keep the helper extraction, change the test fixture to max=1 without ScaleCheck and without MinActiveSessions, and drive the assertion through a different demand source (e.g. an assigned work bead or a [[named_session]] entry). Your claimDesiredPoolSlot short-circuit to 0 is real value here — main currently still stamps a phantom pool_slot="1" on true singletons via claimPoolSlotWithConfig, and your factoring removes that cleanly.

  2. Make the case for canonical identity under max=1 + ScaleCheck — i.e. argue that the Kimi merge's pool-flavor carve-out is wrong and SupportsInstanceExpansion() should be changed too. That's a real conversation worth having if you think the new semantics are off, but it expands the scope of this PR substantially and would want its own discussion thread first.

Whichever direction you go, the helper extraction itself is keepable as-is and the integration coverage already passes — so option (1) is a fairly small follow-up. Truly appreciate the precision of the original fix and the regression-pin shape you chose; I want to ship the value here, just don't want to guess at the design call for you.

@julianknutsen julianknutsen force-pushed the fix/1439-duplicate-nonpool-session branch from 6c095ff to fe042ee Compare May 16, 2026 12:56
Prevent max_active_sessions=1 agents from materializing generic demand as synthetic -1 pool instances. True expanding agents still use slot identities and pool_slot metadata.
Keep canonical singleton demand and sync metadata slot-free, adopt stale suffixed singleton sessions under the canonical agent identity, and route lifecycle/status enumeration through the canonical runtime identity for max-one scale-check pools.

Add focused regressions for pool sessions, death handlers, idle and max-age trackers, rig status, city status, adoption, sync, and deferred alias recovery.
@julianknutsen julianknutsen force-pushed the fix/1439-duplicate-nonpool-session branch from fe042ee to aad66d5 Compare May 16, 2026 13:34
Copy link
Copy Markdown
Collaborator

@quad341 quad341 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @julianknutsen — quick update. The CI red on TestGCLiveContract_BeadsAndEvents (rest-full-2-of-16) turned out to be a pre-existing dolt transaction flake on POST /session/{id}/suspend — not anything caused by this PR. Same flake reproduced on main earlier that morning, and the parallel rerun on your exact SHA came back clean. Your pinned tests all pass locally. The PR code is in good shape.

We're going to re-run the flaky shard and merge when it's green. Nothing for you to do on this one. The two-tick production-order reclaim and the snapshot-label safety case in particular were a really nice touch — those will keep us honest going forward.

Thanks for the rigor, and for taking the harder option-2 path. This is a real correctness win for singleton demand on refinery/cashmaster.

@julianknutsen
Copy link
Copy Markdown
Collaborator Author

Maintainer Adoption Review

Thanks for the contribution, @julianknutsen! This PR fixes singleton session demand identity for max-one non-namepool pool agents, which matters because the controller, dependency-floor reuse, lifecycle cleanup, and status surfaces need to converge on the same canonical session name instead of creating or chasing phantom -1 pool instances.

This PR was reviewed and adopted with maintainer fixes pushed directly to the PR branch.

Original PR Review

Decision: approve.

Specific gaps fixed:

  • The review found that max_active_sessions=1 pool configurations could still materialize or look up <agent>-1 demand in dependency-floor and lifecycle/status paths; the final patch routes those singleton cases through the canonical session identity and keeps true expanding pools on numbered slots.
  • Stale -N session beads and live stale suffix sessions now have recovery coverage: stale beads are normalized or safely deferred, manual sessions with assigned work stay discoverable, and lifecycle cleanup can still find legacy runtime names during cutover.
  • Dependency-only singleton reuse now uses deterministic candidate ordering through the reusable candidate path, avoiding implementation-order selection under historical drift.
  • The finalization step rebased the reviewed patch onto current main, preserved the stable patch-id, pushed the refreshed head to the PR branch, and reran the visible GitHub checks successfully.

Review findings addressed:

  • Session identity (blocker, high confidence / multi-reviewer): the review found phantom {name}-1 demand for max-one singleton agents; the final patch uses canonical singleton demand and adds focused regressions across desired state, dependency-floor creation, lookup, status, and integration coverage.
  • Stale-state recovery (major, high confidence / multi-reviewer): the review found stale singleton beads and alias-conflict paths could drop assigned work or remain opaque; the final patch keeps manual and stale identities resumable, records conflict state, and covers multi-tick recovery behavior.
  • Merge readiness (major, high confidence / verified): the reviewed branch was previously stale against main; finalization refreshed the base, pushed head aad66d567, and confirmed CI / required plus the supporting GitHub checks are passing on that exact head.

Remaining non-gating notes:

  • A few auxiliary identity-resolution helpers still have older expansion semantics for disabled-agent or stale-suffix edge cases, but the merge-critical routing surfaces are fixed.
  • Sync-time alias reclaim can briefly expose mixed legacy metadata while converging; the reviewed behavior is acceptable for this PR and can be hardened separately if it becomes operationally noisy.
  • No remaining gating issues were found in the latest review pass.

Maintainer Changes

Maintainer-side adoption preserved contributor authorship and pushed the reviewed fix set directly to the PR branch:

  • 82e73401c fix: keep singleton session demand canonical
  • aad66d567 fix: address singleton review findings

Short diff stat: 33 files changed, 3424 insertions(+), 256 deletions(-).

Final Review Status

Ready for the merge queue: final head aad66d567 has passing required GitHub checks, and the merge-ready metadata is in place. I am marking this PR status/merge-ready so the merge queue can pick it up.

CI: https://github.com/gastownhall/gascity/actions/runs/25963287187

Review Iterations

23 review passes were performed. The review/fix loop resolved the singleton identity contract, dependency-floor creation and reuse, stale suffix recovery, lifecycle/status routing, deterministic singleton candidate selection, and final base-refresh/CI readiness before approval.


Adopted via /adopt-pr workflow. Original contributor commits preserved.

@julianknutsen julianknutsen merged commit 2e9a54e into main May 16, 2026
99 checks passed
@julianknutsen julianknutsen deleted the fix/1439-duplicate-nonpool-session branch May 16, 2026 13:46
@julianknutsen julianknutsen removed the status/merge-queued Queued for deterministic PR-review merge label May 16, 2026
Copy link
Copy Markdown
Collaborator

@quad341 quad341 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julianknutsen — thank you for taking the harder path on this one. Generalizing the fix into UsesCanonicalSingletonPoolIdentity and SupportsExpandedSessionIdentities as a single predicate that api/agentutil/dispatcher/lifecycle trackers all key off turned what could have been a narrow cashmaster/refinery patch into a real correctness story across the codebase.

The deferred-alias recovery is what makes this safe in production — the two-tick ordering where the reconciler closes the unselected canonical duplicate before sync reclaims the alias, with TestProductionOrderDeferredSingletonAliasReclaimsOnSecondTick pinning the exact sequence, gives us confidence the migration won't strand stale singleton beads. The snapshot-non-mutation pin and the add-only append guard are both subtle invariants future readers will appreciate having spelled out.

CI is green on every completed shard with a couple of integration shards still in flight and nothing failed. Merging this as soon as the rollup settles — nothing for you to do. Beautiful work.


Reviewers: Qwen — failed · Claude (claude-opus-4-7[1m]) — ok · Codex (gpt-5.5) — ok
Synthesis: claude-opus-4-7[1m]

Review coverage:

  • completed: claude codex
  • unavailable: qwen

@quad341 quad341 added the status/merge-queued Queued for deterministic PR-review merge label May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status/merge-queued Queued for deterministic PR-review merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Duplicate refinery session spawned with -1 suffix despite max_active_sessions=1

2 participants