fix: bid is not closed after the manifest timeout#371

Open
vertex451 wants to merge 19 commits into main from artem/fix-bid-close-if-no-manfiest

Conversation

@vertex451 (Contributor) commented Mar 10, 2026

Resolves:

akash-network/support#435

Root of the issue:

  1. The bid-close broadcast runs with wd.ctx, but ShutdownInitiated is called before it finishes.
  2. That closes stoppingch, so WatchChannel returns and cancels wd.ctx while the broadcast is still in progress.
  3. The broadcast then fails with "context canceled" and the bid is never closed on-chain.

Changes

  1. Commit to the broadcast once it has started: even if a stop() signal arrives while we wait for the broadcast result, we do not cancel the broadcast.

I moved wd.lc.ShutdownInitiated(err) after case result := <-runch:, so it is called right after the runner finishes its job.

  2. Added a broadcast timeout that cancels the execution after FlagTxBroadcastTimeout and returns.

  3. Added a select case for case err = <-wd.lc.ShutdownRequest(): so that a shutdown initiated externally does not deadlock if stop() is called during the broadcast.

Is this a recent regression?

The bug is long-standing. The v1 upgrade likely made it more visible by changing client behavior and timing, not by introducing new logic, so the issue may have surfaced after the pkg.akt.dev client changes.

Testing

Created a lease and waited 5min.
Provider logs:

5:32PM INF watchdog closing bid leaseID=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest
5:32PM INF watchdog done lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7 module=provider-manifest
5:32PM INF lease closed lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest
5:32PM INF unreserving unmanaged order cmp=service lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-cluster
5:32PM INF lease removed deployment=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7 lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=manifest-manager
5:32PM DBG unreserving capacity cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF attempting to removing reservation cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF removing reservation cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF unreserve capacity complete cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1

Also tested with a cancellation timeout of 10ms:

5:24PM ERR failed closing bid err="context deadline exceeded" leaseID=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/6/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest

Which is expected.


coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4cec2a6-3050-4cf1-b335-7725f0bb5450

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01fa and 0cc9eb7.

📒 Files selected for processing (2)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go

Walkthrough

Watchdog now accepts a broadcast timeout and runs the close-bid broadcast with a fresh timeout-bound context. Shutdown handling was changed to consume intermediate shutdown requests while waiting for broadcast completion. Provider and manifest configs expose a 12s BroadcastTimeout and CLI flag wiring validates/uses it.

Changes

Cohort / File(s) — Summary

  • Watchdog core & tests (manifest/watchdog.go, manifest/watchdog_test.go): Watchdog now takes broadcastTimeout time.Duration, removes the persistent internal broadcast context, runs MsgCloseBid with context.WithTimeout(..., broadcastTimeout), adds a stop() that can exit without closing, and changes the shutdown flow to consume subsequent ShutdownRequest events while awaiting broadcast completion. Tests add blocking broadcast scaffolds and new tests for the broadcast-timeout and stop-while-waiting scenarios.
  • Provider config surface (config.go, manifest/config.go, manifest/service.go, service.go): Added BroadcastTimeout time.Duration to the public configs; the default is initialized to 12 * time.Second in NewDefaultConfig() and threaded into the manifest ServiceConfig and watchdog construction.
  • Provider CLI wiring & flags (cmd/provider-services/cmd/run.go, cmd/provider-services/cmd/flags.go): The CLI reads tx-broadcast-timeout into broadcastTimeout, validates/normalizes non-positive values to 12s, and assigns it to config.BroadcastTimeout; the default flag value changed to 12s (help text simplified).

Sequence Diagram(s)

sequenceDiagram
    participant Watchdog as Watchdog
    participant Parent as Lifecycle/Parent
    participant Broadcaster as Session/RPC

    Watchdog->>Parent: monitor lease / receive timeout
    Watchdog->>Broadcaster: send MsgCloseBid (ctx = context.WithTimeout(context.Background(), broadcastTimeout))
    Note right of Watchdog: if ShutdownRequest arrives while waiting\nconsume subsequent ShutdownRequest events to unblock sender
    Broadcaster-->>Watchdog: broadcast result (success / timeout / error)
    Watchdog->>Parent: emit ShutdownInitiated after broadcast completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped to the timeout and tapped my paw,
A bounded broadcast sealed the law,
Twelve seconds to murmur the close-bid plea,
I waited and nibbled — patient as can be,
The warren quieted, tidy and free. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 15.38%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title accurately summarizes the main change: fixing the issue where bids are not closed after the manifest timeout due to a context-cancellation race condition.
  • Description check ✅ Passed — The description is comprehensive and directly related to the changeset, clearly explaining the root cause, the fix approach, and testing validation.



Comment thread gateway/rest/router.go Outdated
Comment thread manifest/service.go Outdated
@vertex451 vertex451 marked this pull request as ready for review March 10, 2026 20:54
@vertex451 vertex451 requested a review from a team as a code owner March 10, 2026 20:54
@vertex451 vertex451 force-pushed the artem/fix-bid-close-if-no-manfiest branch from dae726f to 15977fa Compare March 10, 2026 21:07
@vertex451 vertex451 self-assigned this Mar 10, 2026
cloud-j-luna previously approved these changes Mar 11, 2026

@cloud-j-luna (Member) left a comment:

LGTM

Comment thread manifest/service.go Outdated
Comment thread manifest/watchdog.go
@troian (Member) left a comment:

address request changes

@coderabbitai (bot) left a comment:

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
manifest/watchdog_test.go (1)

35-50: ⚠️ Potential issue | 🟡 Minor

Replace the sleep with an explicit “broadcast started” handshake.

Line 137 can call wd.stop() before the mock has actually entered BroadcastMsgs, so this test may pass without covering the in-flight-broadcast path at all. Signal entry from the mock and wait on that instead of sleeping.

🧪 Suggested change
 type watchdogTestScaffold struct {
 	client     *clientmocks.Client
 	parentCh   chan struct{}
 	doneCh     chan dtypes.DeploymentID
 	broadcasts chan []sdk.Msg
+	broadcastStarted chan struct{}
 	leaseID    mtypes.LeaseID
 	provider   ptypes.Provider
 }
@@
 	scaffold.doneCh = make(chan dtypes.DeploymentID, 1)
 	scaffold.provider = testutil.Provider(t)
 	scaffold.leaseID = testutil.LeaseID(t)
 	scaffold.leaseID.Provider = scaffold.provider.Owner
 	scaffold.broadcasts = make(chan []sdk.Msg, 1)
+	scaffold.broadcastStarted = make(chan struct{}, 1)

 	txClientMock := &clientmocks.TxClient{}
 	txClientMock.On("BroadcastMsgs", mock.Anything, mock.Anything, mock.Anything).Run(func(args mock.Arguments) {
+		select {
+		case scaffold.broadcastStarted <- struct{}{}:
+		default:
+		}
 		if blockUntilRelease != nil {
 			<-blockUntilRelease
 		}
 		scaffold.broadcasts <- args.Get(1).([]sdk.Msg)
 	}).Return(&sdk.Result{}, nil)
@@
-	<-time.After(200 * time.Millisecond)
+	select {
+	case <-scaffold.broadcastStarted:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for BroadcastMsgs to start")
+	}
 	wd.stop()

Also applies to: 133-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifest/watchdog_test.go` around lines 35 - 50, The test uses a sleep to
wait for BroadcastMsgs to start; instead add an explicit handshake: add a
broadcastStarted chan struct{} (e.g., on watchdogTestScaffold) and in the
txClientMock.Run for BroadcastMsgs signal entry by doing broadcastStarted <-
struct{}{} immediately before (or instead of) blocking on blockUntilRelease,
then in the test replace the sleep with a receive from broadcastStarted to
ensure BroadcastMsgs is in-flight before calling wd.stop(); update
makeWatchdogTestScaffoldWithBlocking to create and return/attach
broadcastStarted and alter the mock and callers accordingly (affects the
BroadcastMsgs mock Run and the test code that currently sleeps around lines
133-149).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifest/watchdog.go`:
- Around line 84-93: The select currently treats any wd.lc.ShutdownRequest() as
a signal to abort waiting and call wd.lc.ShutdownInitiated(err), which causes
wd.ctx to be cancelled even for a manifest-triggered stop() — racey when runch
is still closing; change the logic so that when you receive err from
wd.lc.ShutdownRequest() you inspect the error and only short-circuit if it
represents a parent/service shutdown (e.g., not the local manifest stop error);
otherwise ignore that ShutdownRequest and continue waiting on runch. Concretely:
in the select that reads runch and wd.lc.ShutdownRequest(), on receiving err
from ShutdownRequest() call a small predicate (e.g., isParentShutdown(err) or
compare against the manifest-stop sentinel) and only break out to call
wd.lc.ShutdownInitiated(err) when it’s a parent shutdown; otherwise loop back
and keep waiting for result from runch so the close-bid can finish (refer to
runch, wd.lc.ShutdownRequest(), wd.lc.ShutdownInitiated(err), BroadcastMsgs and
stop()).

---

Outside diff comments:
In `@manifest/watchdog_test.go`:
- Around line 35-50: The test uses a sleep to wait for BroadcastMsgs to start;
instead add an explicit handshake: add a broadcastStarted chan struct{} (e.g.,
on watchdogTestScaffold) and in the txClientMock.Run for BroadcastMsgs signal
entry by doing broadcastStarted <- struct{}{} immediately before (or instead of)
blocking on blockUntilRelease, then in the test replace the sleep with a receive
from broadcastStarted to ensure BroadcastMsgs is in-flight before calling
wd.stop(); update makeWatchdogTestScaffoldWithBlocking to create and
return/attach broadcastStarted and alter the mock and callers accordingly
(affects the BroadcastMsgs mock Run and the test code that currently sleeps
around lines 133-149).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8fb9458-c1c3-4e3e-9b3e-c2629bbf521c

📥 Commits

Reviewing files that changed from the base of the PR and between eb8eecf and 09dbce0.

📒 Files selected for processing (3)
  • gateway/rest/router.go
  • manifest/watchdog.go
  • manifest/watchdog_test.go

Comment thread manifest/watchdog.go
@coderabbitai (bot) left a comment:

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config.go`:
- Line 24: NewDefaultConfig() currently leaves the new BroadcastTimeout field at
zero which causes context.WithTimeout to expire immediately; set
BroadcastTimeout to a sensible default (e.g. 30 * time.Second to match the CLI
flag) inside NewDefaultConfig so programmatic users get a non-zero timeout, and
ensure any related constructors (e.g. places calling NewDefaultConfig or merging
config values) respect/override this default as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1dee5f41-2253-463c-a4a3-13a3180081ba

📥 Commits

Reviewing files that changed from the base of the PR and between b1a78b6 and 3514dd1.

📒 Files selected for processing (7)
  • cmd/provider-services/cmd/run.go
  • config.go
  • manifest/config.go
  • manifest/service.go
  • manifest/watchdog.go
  • manifest/watchdog_test.go
  • service.go

Comment thread config.go
@coderabbitai (bot) left a comment:

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/provider-services/cmd/run.go`:
- Line 477: The code reads broadcastTimeout from viper (FlagTxBroadcastTimeout)
and then assigns it to config.BroadcastTimeout which is later passed into
context.WithTimeout (manifest/watchdog.go), so validate the value after reading
it: ensure broadcastTimeout > 0 (or >= a sensible minimum) and if not, log/warn
and leave config.BroadcastTimeout as the existing default or set it to a safe
fallback; update the assignment site that writes to config.BroadcastTimeout to
enforce this guard so an expired/zero/negative duration cannot be propagated
into context.WithTimeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c337a88d-dc6a-45ba-950f-460218383b75

📥 Commits

Reviewing files that changed from the base of the PR and between d4c2e18 and baee021.

📒 Files selected for processing (5)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go
  • config.go
  • manifest/service.go
  • manifest/watchdog.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/provider-services/cmd/flags.go
  • config.go

Comment thread cmd/provider-services/cmd/run.go
@Zblocker64 Zblocker64 added this to the provider 0.12.0 milestone Mar 25, 2026
@vertex451 vertex451 changed the title from "fix: coordinated ShutdownInitiated after bid close tx broadcast is finished" to "fix: bid is not closed after the manifest timeout" on Mar 31, 2026
Comment thread manifest/service.go
// Create watchdog if it does not exist AND a manifest has not been received yet
if watchdog := s.watchdogs[ev.LeaseID.DeploymentID()]; watchdog == nil {
-	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout)
+	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout, s.config.BroadcastTimeout)
Member


if provider service restarts, does it start counting from 0, or takes into account when lease was created?

Contributor Author

@vertex451 commented Apr 10, 2026


No, the lease creation time is not taken into account.
I could improve the solution by taking Lease.CreatedAt and computing the remaining time by converting currentBlock to a timestamp (the same way we did in isStaleBid).

Not doing this now, since we discussed the refactor of the serial broadcaster.

@Zblocker64 Zblocker64 removed this from the provider 0.12.0 milestone Apr 15, 2026