[inventory] Add stale saga information #10027

Open
karencfv wants to merge 19 commits into oxidecomputer:main from karencfv:stale-saga-info-in-inventory


karencfv commented Mar 11, 2026

As per #9876, this PR adds information about stale sagas to inventory. A saga is considered stale if it is older than 15 minutes and is in a running or unwinding state.

The 15 minute cutoff is arbitrary. The update status endpoint will only consider a saga older than 1 hour a health concern, but it would be good to store data on sagas that have been in one of these two states for a shorter time, for debugging purposes. I can change this threshold if people don't agree, though.
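
To make the criteria concrete, here is a minimal sketch of the check described above. It is not the code in this PR: `Saga`, `SagaState`, `is_in_flight_longer_than`, and `stale_sagas` are hypothetical names used for illustration only, and the real implementation presumably filters in the database query rather than in memory.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical saga states. Only `Running` and `Unwinding` count toward staleness.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SagaState {
    Running,
    Unwinding,
    Done,
}

/// Hypothetical, simplified saga record (not the actual inventory type).
#[derive(Debug, Clone)]
struct Saga {
    name: String,
    state: SagaState,
    time_created: SystemTime,
}

/// Inventory cutoff from the PR description: 15 minutes.
const STALE_CUTOFF: Duration = Duration::from_secs(15 * 60);

/// Age at which the update status endpoint treats a saga as a health concern: 1 hour.
const HEALTH_CONCERN_CUTOFF: Duration = Duration::from_secs(60 * 60);

/// Returns true if the saga is running or unwinding and was created more than
/// `cutoff` before `now`.
fn is_in_flight_longer_than(saga: &Saga, now: SystemTime, cutoff: Duration) -> bool {
    let in_flight = matches!(saga.state, SagaState::Running | SagaState::Unwinding);
    // `duration_since` errors if `time_created` is later than `now` (clock skew);
    // treat that case as an age of zero, i.e. not stale.
    let age = now
        .duration_since(saga.time_created)
        .unwrap_or(Duration::ZERO);
    in_flight && age > cutoff
}

/// Collects the sagas that inventory would report as stale under the 15 minute cutoff.
fn stale_sagas(sagas: &[Saga], now: SystemTime) -> Vec<&Saga> {
    sagas
        .iter()
        .filter(|saga| is_in_flight_longer_than(saga, now, STALE_CUTOFF))
        .collect()
}
```

Under these assumptions, a running saga created 20 minutes ago is returned by `stale_sagas`, but `is_in_flight_longer_than(saga, now, HEALTH_CONCERN_CUTOFF)` is still false for it. That gap is the window in which inventory keeps extra debugging data before the saga becomes a health concern.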

A stale saga is present:

$ ./target/debug/omdb db inventory collections show latest
<...>
STALE SAGAS
    Found 1 running or unwinding for longer than 15 minutes
        NAME     ID                                       STATE       CREATOR                                  CURRENT_SEC                              TIME_CREATED                   
        demo     457c4832-1a87-4bd3-a6d1-0e98105b1a5c     running     a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f     a4ef738a-1fb0-47b1-9da2-4919c7ec7c7f     2026-04-13 07:39:27.357491 UTC 

Everything is happy:

$ ./target/debug/omdb db inventory collections show latest
<...>
STALE SAGAS
    No sagas have been running or unwinding for longer than 15 minutes

Closes: #9412

@karencfv karencfv marked this pull request as ready for review April 14, 2026 07:29

karencfv commented Apr 14, 2026

From the build-and-test (helios) job: https://buildomat.eng.oxide.computer/wg/0/details/01KP5DY5F65D03T0G7EHK3D0VX/3HTS7wQUErwzoUUVG6juJt1SdvKCVqdRoL8vOvWb0bSdknVd/01KP5DYFQWBTSMENHD0BFYC31X#S6677

6677	2026-04-14T07:40:41.181Z	error: creating test list failed
6678	2026-04-14T07:40:41.181Z	
6679	2026-04-14T07:40:41.181Z	Caused by:
6680	2026-04-14T07:40:41.181Z	  for `nexus-fm`, command `/work/oxidecomputer/omicron/target/debug/deps/nexus_fm-618b2e4b6bedc71d --list --format terse` aborted with signal 9 (SIGKILL)
6681	2026-04-14T07:40:41.181Z	  --- stdout:
6682	2026-04-14T07:40:41.181Z	
6683	2026-04-14T07:40:41.181Z	  --- stderr:
6684	2026-04-14T07:40:41.181Z	  ld.so.1: nexus_fm-618b2e4b6bedc71d: fatal: libpq.so.5: open failed: No such file or directory
6685	2026-04-14T07:40:41.181Z	
6686	2026-04-14T07:40:41.181Z	  ---

I am legitimately stumped by this. I didn't even touch that crate. The build-and-test (ubuntu-22.04) tests run fine, and the problem isn't showing on main either!!

Development

Successfully merging this pull request may close these issues.

Add health check information to inventory