fix: bid is not closed after the manifest timeout#371

Open
vertex451 wants to merge 19 commits into main from artem/fix-bid-close-if-no-manfiest

Conversation

@vertex451 (Contributor) commented Mar 10, 2026

Resolves:

akash-network/support#435

Root of the issue:

  1. The bid-close broadcast runs with wd.ctx, but ShutdownInitiated is called before it finishes.
  2. That closes stoppingch, so WatchChannel returns and cancels wd.ctx while the broadcast is still in progress.
  3. The broadcast then fails with "context canceled" and the bid is never closed on-chain.

Changes

  1. Commit to the broadcast once it has started: even if a stop() signal arrives while we wait for the broadcast result, we do not cancel the broadcast.

I moved wd.lc.ShutdownInitiated(err) after case result := <-runch:, so it is called right after the runner finishes its job.

  2. Added a broadcast timeout that cancels the execution after FlagTxBroadcastTimeout and returns.

  3. Added a select case for case err = <-wd.lc.ShutdownRequest(): so that a shutdown initiated externally does not deadlock if stop() is called during the broadcast.

Is this a recent regression?

The bug is long-standing. The v1 upgrade likely made it more visible by changing client behavior and timing, not by introducing new logic, so the issue may have surfaced after the pkg.akt.dev client changes.

Testing

Created a lease and waited 5min.
Provider logs:

5:32PM INF watchdog closing bid leaseID=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest
5:32PM INF watchdog done lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7 module=provider-manifest
5:32PM INF lease closed lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest
5:32PM INF unreserving unmanaged order cmp=service lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-cluster
5:32PM INF lease removed deployment=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7 lease=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=manifest-manager
5:32PM DBG unreserving capacity cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF attempting to removing reservation cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF removing reservation cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1
5:32PM INF unreserve capacity complete cmp=inventory-service module=provider-cluster order=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/7/1/1

Also tested with a cancellation timeout of 10ms:

5:24PM ERR failed closing bid err="context deadline exceeded" leaseID=akash1nfyvsjhl7wcyduq7hkx87jja7mu5qyfmcrf0sv/6/1/1/akash1ff4s67p9xjykezkn7gnmdzpz8vdw3kn0ewx7ss module=provider-manifest

Which is expected.


coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4cec2a6-3050-4cf1-b335-7725f0bb5450

📥 Commits

Reviewing files that changed from the base of the PR and between f4a01fa and 0cc9eb7.

📒 Files selected for processing (2)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go

Walkthrough

Watchdog now accepts a broadcast timeout and runs the close-bid broadcast with a fresh timeout-bound context. Shutdown handling was changed to consume intermediate shutdown requests while waiting for broadcast completion. Provider and manifest configs expose a 12s BroadcastTimeout and CLI flag wiring validates/uses it.

Changes

Cohort / File(s) — Summary

  • Watchdog core & tests (manifest/watchdog.go, manifest/watchdog_test.go): Watchdog now takes broadcastTimeout time.Duration, removes the persistent internal broadcast context, runs MsgCloseBid with context.WithTimeout(..., broadcastTimeout), adds a stop() that can exit without closing, and changes the shutdown flow to consume subsequent ShutdownRequest events while awaiting broadcast completion. Tests add blocking broadcast scaffolds and new tests for the broadcast-timeout and stop-while-waiting scenarios.
  • Provider config surface (config.go, manifest/config.go, manifest/service.go, service.go): Added BroadcastTimeout time.Duration to the public configs; the default is initialized to 12 * time.Second in NewDefaultConfig() and threaded into the manifest ServiceConfig and watchdog construction.
  • Provider CLI wiring & flags (cmd/provider-services/cmd/run.go, cmd/provider-services/cmd/flags.go): The CLI reads tx-broadcast-timeout into broadcastTimeout, validates/normalizes non-positive values to 12s, and assigns it to config.BroadcastTimeout; the default flag value changed to 12s (help text simplified).

Sequence Diagram(s)

sequenceDiagram
    participant Watchdog as Watchdog
    participant Parent as Lifecycle/Parent
    participant Broadcaster as Session/RPC

    Watchdog->>Parent: monitor lease / receive timeout
    Watchdog->>Broadcaster: send MsgCloseBid (ctx = context.WithTimeout(context.Background(), broadcastTimeout))
    Note right of Watchdog: if ShutdownRequest arrives while waiting\nconsume subsequent ShutdownRequest events to unblock sender
    Broadcaster-->>Watchdog: broadcast result (success / timeout / error)
    Watchdog->>Parent: emit ShutdownInitiated after broadcast completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped to the timeout and tapped my paw,
A bounded broadcast sealed the law,
Twelve seconds to murmur the close-bid plea,
I waited and nibbled — patient as can be,
The warren quieted, tidy and free. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 15.38%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title accurately summarizes the main change: fixing the issue where bids are not closed after the manifest timeout due to a context-cancellation race condition.
  • Description check ✅ Passed — The description is comprehensive and directly related to the changeset, clearly explaining the root cause, the fix approach, and testing validation.



Comment thread gateway/rest/router.go Outdated
Comment thread manifest/service.go Outdated
@vertex451 vertex451 marked this pull request as ready for review March 10, 2026 20:54
@vertex451 vertex451 requested a review from a team as a code owner March 10, 2026 20:54
@vertex451 vertex451 force-pushed the artem/fix-bid-close-if-no-manfiest branch from dae726f to 15977fa Compare March 10, 2026 21:07
@vertex451 vertex451 self-assigned this Mar 10, 2026
cloud-j-luna previously approved these changes Mar 11, 2026

@cloud-j-luna (Member) left a comment:

LGTM

Comment thread manifest/service.go Outdated
Comment thread manifest/watchdog.go
@troian (Member) left a comment:

address request changes

@coderabbitai (bot) left a comment:

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
manifest/watchdog_test.go (1)

35-50: ⚠️ Potential issue | 🟡 Minor

Replace the sleep with an explicit “broadcast started” handshake.

Line 137 can call wd.stop() before the mock has actually entered BroadcastMsgs, so this test may pass without covering the in-flight-broadcast path at all. Signal entry from the mock and wait on that instead of sleeping.

🧪 Suggested change
 type watchdogTestScaffold struct {
 	client     *clientmocks.Client
 	parentCh   chan struct{}
 	doneCh     chan dtypes.DeploymentID
 	broadcasts chan []sdk.Msg
+	broadcastStarted chan struct{}
 	leaseID    mtypes.LeaseID
 	provider   ptypes.Provider
 }
@@
 	scaffold.doneCh = make(chan dtypes.DeploymentID, 1)
 	scaffold.provider = testutil.Provider(t)
 	scaffold.leaseID = testutil.LeaseID(t)
 	scaffold.leaseID.Provider = scaffold.provider.Owner
 	scaffold.broadcasts = make(chan []sdk.Msg, 1)
+	scaffold.broadcastStarted = make(chan struct{}, 1)

 	txClientMock := &clientmocks.TxClient{}
 	txClientMock.On("BroadcastMsgs", mock.Anything, mock.Anything, mock.Anything).Run(func(args mock.Arguments) {
+		select {
+		case scaffold.broadcastStarted <- struct{}{}:
+		default:
+		}
 		if blockUntilRelease != nil {
 			<-blockUntilRelease
 		}
 		scaffold.broadcasts <- args.Get(1).([]sdk.Msg)
 	}).Return(&sdk.Result{}, nil)
@@
-	<-time.After(200 * time.Millisecond)
+	select {
+	case <-scaffold.broadcastStarted:
+	case <-time.After(5 * time.Second):
+		t.Fatal("timed out waiting for BroadcastMsgs to start")
+	}
 	wd.stop()

Also applies to: 133-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@manifest/watchdog_test.go` around lines 35 - 50, The test uses a sleep to
wait for BroadcastMsgs to start; instead add an explicit handshake: add a
broadcastStarted chan struct{} (e.g., on watchdogTestScaffold) and in the
txClientMock.Run for BroadcastMsgs signal entry by doing broadcastStarted <-
struct{}{} immediately before (or instead of) blocking on blockUntilRelease,
then in the test replace the sleep with a receive from broadcastStarted to
ensure BroadcastMsgs is in-flight before calling wd.stop(); update
makeWatchdogTestScaffoldWithBlocking to create and return/attach
broadcastStarted and alter the mock and callers accordingly (affects the
BroadcastMsgs mock Run and the test code that currently sleeps around lines
133-149).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@manifest/watchdog.go`:
- Around line 84-93: The select currently treats any wd.lc.ShutdownRequest() as
a signal to abort waiting and call wd.lc.ShutdownInitiated(err), which causes
wd.ctx to be cancelled even for a manifest-triggered stop() — racey when runch
is still closing; change the logic so that when you receive err from
wd.lc.ShutdownRequest() you inspect the error and only short-circuit if it
represents a parent/service shutdown (e.g., not the local manifest stop error);
otherwise ignore that ShutdownRequest and continue waiting on runch. Concretely:
in the select that reads runch and wd.lc.ShutdownRequest(), on receiving err
from ShutdownRequest() call a small predicate (e.g., isParentShutdown(err) or
compare against the manifest-stop sentinel) and only break out to call
wd.lc.ShutdownInitiated(err) when it’s a parent shutdown; otherwise loop back
and keep waiting for result from runch so the close-bid can finish (refer to
runch, wd.lc.ShutdownRequest(), wd.lc.ShutdownInitiated(err), BroadcastMsgs and
stop()).

---

Outside diff comments:
In `@manifest/watchdog_test.go`:
- Around line 35-50: The test uses a sleep to wait for BroadcastMsgs to start;
instead add an explicit handshake: add a broadcastStarted chan struct{} (e.g.,
on watchdogTestScaffold) and in the txClientMock.Run for BroadcastMsgs signal
entry by doing broadcastStarted <- struct{}{} immediately before (or instead of)
blocking on blockUntilRelease, then in the test replace the sleep with a receive
from broadcastStarted to ensure BroadcastMsgs is in-flight before calling
wd.stop(); update makeWatchdogTestScaffoldWithBlocking to create and
return/attach broadcastStarted and alter the mock and callers accordingly
(affects the BroadcastMsgs mock Run and the test code that currently sleeps
around lines 133-149).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e8fb9458-c1c3-4e3e-9b3e-c2629bbf521c

📥 Commits

Reviewing files that changed from the base of the PR and between eb8eecf and 09dbce0.

📒 Files selected for processing (3)
  • gateway/rest/router.go
  • manifest/watchdog.go
  • manifest/watchdog_test.go

Comment thread manifest/watchdog.go
@coderabbitai (bot) left a comment:

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@config.go`:
- Line 24: NewDefaultConfig() currently leaves the new BroadcastTimeout field at
zero which causes context.WithTimeout to expire immediately; set
BroadcastTimeout to a sensible default (e.g. 30 * time.Second to match the CLI
flag) inside NewDefaultConfig so programmatic users get a non-zero timeout, and
ensure any related constructors (e.g. places calling NewDefaultConfig or merging
config values) respect/override this default as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1dee5f41-2253-463c-a4a3-13a3180081ba

📥 Commits

Reviewing files that changed from the base of the PR and between b1a78b6 and 3514dd1.

📒 Files selected for processing (7)
  • cmd/provider-services/cmd/run.go
  • config.go
  • manifest/config.go
  • manifest/service.go
  • manifest/watchdog.go
  • manifest/watchdog_test.go
  • service.go

Comment thread config.go
@coderabbitai (bot) left a comment:

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/provider-services/cmd/run.go`:
- Line 477: The code reads broadcastTimeout from viper (FlagTxBroadcastTimeout)
and then assigns it to config.BroadcastTimeout which is later passed into
context.WithTimeout (manifest/watchdog.go), so validate the value after reading
it: ensure broadcastTimeout > 0 (or >= a sensible minimum) and if not, log/warn
and leave config.BroadcastTimeout as the existing default or set it to a safe
fallback; update the assignment site that writes to config.BroadcastTimeout to
enforce this guard so an expired/zero/negative duration cannot be propagated
into context.WithTimeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c337a88d-dc6a-45ba-950f-460218383b75

📥 Commits

Reviewing files that changed from the base of the PR and between d4c2e18 and baee021.

📒 Files selected for processing (5)
  • cmd/provider-services/cmd/flags.go
  • cmd/provider-services/cmd/run.go
  • config.go
  • manifest/service.go
  • manifest/watchdog.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/provider-services/cmd/flags.go
  • config.go

Comment thread cmd/provider-services/cmd/run.go
@Zblocker64 Zblocker64 added this to the provider 0.12.0 milestone Mar 25, 2026
@vertex451 vertex451 changed the title from "fix: coordinated ShutdownInitiated after bid close tx broadcast is finished" to "fix: bid is not closed after the manifest timeout" on Mar 31, 2026
Comment thread manifest/service.go
// Create watchdog if it does not exist AND a manifest has not been received yet
if watchdog := s.watchdogs[ev.LeaseID.DeploymentID()]; watchdog == nil {
-	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout)
+	watchdog = newWatchdog(s.session, s.lc.ShuttingDown(), s.watchdogch, ev.LeaseID, s.config.ManifestTimeout, s.config.BroadcastTimeout)
Member


if provider service restarts, does it start counting from 0, or takes into account when lease was created?

Contributor Author

@vertex451 commented Apr 10, 2026


No, the lease creation time is not taken into account.
I could improve the solution by taking Lease.CreatedAt and computing the remaining time by converting currentBlock to a timestamp (the same way we did in isStaleBid).

Not doing this now, since we discussed the refactor of the serial broadcaster.

@Zblocker64 Zblocker64 removed this from the provider 0.12.0 milestone Apr 15, 2026