fix: bid is not closed after the manifest timeout #371
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
Walkthrough

Watchdog now accepts a broadcast timeout and runs the close-bid broadcast with a fresh timeout-bound context. Shutdown handling was changed to consume intermediate shutdown requests while waiting for broadcast completion. Provider and manifest configs expose a 12s BroadcastTimeout, and CLI flag wiring validates and uses it.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Watchdog as Watchdog
    participant Parent as Lifecycle/Parent
    participant Broadcaster as Session/RPC
    Watchdog->>Parent: monitor lease / receive timeout
    Watchdog->>Broadcaster: send MsgCloseBid (ctx = context.WithTimeout(context.Background(), broadcastTimeout))
    Note right of Watchdog: if ShutdownRequest arrives while waiting<br/>consume subsequent ShutdownRequest events to unblock sender
    Broadcaster-->>Watchdog: broadcast result (success / timeout / error)
    Watchdog->>Parent: emit ShutdownInitiated after broadcast completes
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
dae726f to 15977fa (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
manifest/watchdog_test.go (1)
Lines 35-50: ⚠️ Potential issue | 🟡 Minor

Replace the sleep with an explicit "broadcast started" handshake.

Line 137 can call `wd.stop()` before the mock has actually entered `BroadcastMsgs`, so this test may pass without covering the in-flight-broadcast path at all. Signal entry from the mock and wait on that instead of sleeping.

🧪 Suggested change:

```diff
 type watchdogTestScaffold struct {
 	client           *clientmocks.Client
 	parentCh         chan struct{}
 	doneCh           chan dtypes.DeploymentID
 	broadcasts       chan []sdk.Msg
+	broadcastStarted chan struct{}
 	leaseID          mtypes.LeaseID
 	provider         ptypes.Provider
 }
@@
 	scaffold.doneCh = make(chan dtypes.DeploymentID, 1)
 	scaffold.provider = testutil.Provider(t)
 	scaffold.leaseID = testutil.LeaseID(t)
 	scaffold.leaseID.Provider = scaffold.provider.Owner
 	scaffold.broadcasts = make(chan []sdk.Msg, 1)
+	scaffold.broadcastStarted = make(chan struct{}, 1)
 	txClientMock := &clientmocks.TxClient{}
 	txClientMock.On("BroadcastMsgs", mock.Anything, mock.Anything, mock.Anything).Run(func(args mock.Arguments) {
+		select {
+		case scaffold.broadcastStarted <- struct{}{}:
+		default:
+		}
 		if blockUntilRelease != nil {
 			<-blockUntilRelease
 		}
 		scaffold.broadcasts <- args.Get(1).([]sdk.Msg)
 	}).Return(&sdk.Result{}, nil)
@@
-	<-time.After(200 * time.Millisecond)
+	select {
+	case <-scaffold.broadcastStarted:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for BroadcastMsgs to start")
+	}
 	wd.stop()
```

Also applies to: lines 133-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@manifest/watchdog_test.go` around lines 35 - 50, The test uses a sleep to wait for BroadcastMsgs to start; instead add an explicit handshake: add a broadcastStarted chan struct{} (e.g., on watchdogTestScaffold) and in the txClientMock.Run for BroadcastMsgs signal entry by doing broadcastStarted <- struct{}{} immediately before (or instead of) blocking on blockUntilRelease, then in the test replace the sleep with a receive from broadcastStarted to ensure BroadcastMsgs is in-flight before calling wd.stop(); update makeWatchdogTestScaffoldWithBlocking to create and return/attach broadcastStarted and alter the mock and callers accordingly (affects the BroadcastMsgs mock Run and the test code that currently sleeps around lines 133-149).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@manifest/watchdog.go`:
- Around line 84-93: The select currently treats any wd.lc.ShutdownRequest() as
a signal to abort waiting and call wd.lc.ShutdownInitiated(err), which causes
wd.ctx to be cancelled even for a manifest-triggered stop() — racey when runch
is still closing; change the logic so that when you receive err from
wd.lc.ShutdownRequest() you inspect the error and only short-circuit if it
represents a parent/service shutdown (e.g., not the local manifest stop error);
otherwise ignore that ShutdownRequest and continue waiting on runch. Concretely:
in the select that reads runch and wd.lc.ShutdownRequest(), on receiving err
from ShutdownRequest() call a small predicate (e.g., isParentShutdown(err) or
compare against the manifest-stop sentinel) and only break out to call
wd.lc.ShutdownInitiated(err) when it’s a parent shutdown; otherwise loop back
and keep waiting for result from runch so the close-bid can finish (refer to
runch, wd.lc.ShutdownRequest(), wd.lc.ShutdownInitiated(err), BroadcastMsgs and
stop()).
Run ID: e8fb9458-c1c3-4e3e-9b3e-c2629bbf521c
📒 Files selected for processing (3)
gateway/rest/router.go, manifest/watchdog.go, manifest/watchdog_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@config.go`:
- Line 24: NewDefaultConfig() currently leaves the new BroadcastTimeout field at
zero which causes context.WithTimeout to expire immediately; set
BroadcastTimeout to a sensible default (e.g. 30 * time.Second to match the CLI
flag) inside NewDefaultConfig so programmatic users get a non-zero timeout, and
ensure any related constructors (e.g. places calling NewDefaultConfig or merging
config values) respect/override this default as needed.
Run ID: 1dee5f41-2253-463c-a4a3-13a3180081ba
📒 Files selected for processing (7)
cmd/provider-services/cmd/run.go, config.go, manifest/config.go, manifest/service.go, manifest/watchdog.go, manifest/watchdog_test.go, service.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/provider-services/cmd/run.go`:
- Line 477: The code reads broadcastTimeout from viper (FlagTxBroadcastTimeout)
and then assigns it to config.BroadcastTimeout which is later passed into
context.WithTimeout (manifest/watchdog.go), so validate the value after reading
it: ensure broadcastTimeout > 0 (or >= a sensible minimum) and if not, log/warn
and leave config.BroadcastTimeout as the existing default or set it to a safe
fallback; update the assignment site that writes to config.BroadcastTimeout to
enforce this guard so an expired/zero/negative duration cannot be propagated
into context.WithTimeout.
Run ID: c337a88d-dc6a-45ba-950f-460218383b75
📒 Files selected for processing (5)
cmd/provider-services/cmd/flags.go, cmd/provider-services/cmd/run.go, config.go, manifest/service.go, manifest/watchdog.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/provider-services/cmd/flags.go
- config.go
```diff
 // Create watchdog if it does not exist AND a manifest has not been received yet
 if watchdog := s.watchdogs[ev.LeaseID.DeploymentID()]; watchdog == nil {
-	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout)
+	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout, s.config.BroadcastTimeout)
```
If the provider service restarts, does it start counting from 0, or does it take into account when the lease was created?
No, the lease creation time is not taken into account.
I could improve the solution by taking Lease.CreatedAt and computing the remaining time by converting currentBlock to a time (the same way we did in isStaleBid).
Not doing this since we discussed the refactor of the serial broadcaster.
Resolves:
akash-network/support#435
Root of the issue:
Changes
- I moved `wd.lc.ShutdownInitiated(err)` after `case result := <-runch:`, so it is called right after the runner finishes its job.
- Added a broadcast timeout that cancels the execution after `FlagTxBroadcastTimeout` and returns.
- I added a select case for `case err = <-wd.lc.ShutdownRequest():` in case shutdown is initiated externally, to prevent a deadlock if stop() is called during the broadcast.

Is that a regression that we got recently?

The bug is long-standing. The v1 upgrade likely made it more visible by changing client behavior and timing, not by introducing new logic. So this issue may have appeared after the pkg.akt.dev client changes.
Testing
Created a lease and waited 5min.
Provider logs:
Also tested a cancellation timeout of 10ms:
Which is expected.