
DAOS-17306 doc: self-healing properties, interactive rebuild#18023

Open
kccain wants to merge 6 commits into master from kccain/daos_17306_doc

Conversation

@kccain
Contributor

@kccain kccain commented Apr 15, 2026

For the DAOS version 2.8 release, add two major sections to the DAOS Administrator's Guide:

  • self-healing properties / policy controls (DAOS-17306)
  • explicit / interactive rebuild control (DAOS-17281)

Doc-only: true

Signed-off-by: Kenneth Cain kenneth.cain@hpe.com
Signed-off-by: Tom Nabarro thomas.nabarro@hpe.com

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@kccain kccain force-pushed the kccain/daos_17306_doc branch from b137cd4 to 185c293 on April 15, 2026 19:07
@github-actions

Ticket title is 'Enable/disable auto recovery'
Status is 'Resolved'
Labels: '2.8pp'
https://daosio.atlassian.net/browse/DAOS-17306

kccain and others added 2 commits April 16, 2026 13:44
For the DAOS version 2.8 release, add two major sections to the
DAOS Administrator's Guide:
- self-healing properties / policy controls (DAOS-17306)
- explicit / interactive rebuild control (DAOS-17281)

Doc-only: true

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
output (introduced in PR #17371 / DAOS 2.6).

The new section explains:
- Field values (normal vs degraded)
- When to check this field
- Example usage with exclude-only self-heal policies
- How to verify exclusion completed when auto-rebuild is disabled

Updated pool query examples throughout to show the Data redundancy
field for consistency with DAOS 2.6+ output.

Particularly useful for scenarios where system.self_heal is set to
exclude,pool_exclude or pool self_heal has exclude bit set without
rebuild, to confirm exclusion has occurred.

Related-to: #17371
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the kccain/daos_17306_doc branch from 4cc9216 to caf846e on April 16, 2026 12:46
@tanabarr
Contributor

had to force-push to clean up the commit list, hope that's okay

Comment thread docs/admin/self_healing.md Outdated
Comment thread docs/admin/self_healing.md Outdated
```
- Rebuild busy, 42 objs, 21 recs
- Data redundancy: degraded
```

Contributor Author

Something I thought of, but no change requested:
there are other cases like drain that would show Rebuild busy but Data redundancy: normal.
And I guess extend might show something similar. But it's probably not worth enumerating too many cases.

Comment thread docs/admin/self_healing.md Outdated
Comment thread docs/admin/self_healing.md Outdated
tanabarr
tanabarr previously approved these changes Apr 18, 2026
Comment thread docs/admin/rebuild_controls.md
Comment thread docs/admin/rebuild_controls.md
Comment thread docs/admin/rebuild_controls.md
Comment thread docs/admin/rebuild_controls.md Outdated
Comment thread docs/admin/rebuild_controls.md Outdated
Comment thread docs/admin/rebuild_controls.md
Comment thread docs/admin/self_healing.md Outdated
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18023/5/testReport/

Co-authored-by: Ken Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the kccain/daos_17306_doc branch from b61f74c to 8b9ad4b on April 23, 2026 11:07
Doc-only: true

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the kccain/daos_17306_doc branch from 8b9ad4b to 12c2f3a on April 23, 2026 13:05
@kccain kccain marked this pull request as ready for review April 23, 2026 15:40
@kccain kccain requested a review from a team as a code owner April 23, 2026 15:40
tanabarr
tanabarr previously approved these changes Apr 24, 2026
Contributor

@mchaarawi mchaarawi left a comment

reviewed the rebuild_controls file only.
will get to self_healing next.

Comment on lines +40 to +43
* A system in which many tens of pools are all rebuilding simultaneously (requiring
substantial CPU and network resources in the DAOS system), and the
administrator decides in such cases to stagger overall recovery by
manually commanding pools to rebuild in smaller batches.
Contributor

just a thought for future improvement: this is a nice use case for having dmg pool rebuild stop/start accepting a list of pools.

Comment on lines +34 to +39
* When a pool has insufficient free space to accommodate relocation of affected
data when engine(s) are excluded/drained. For example, if a rebuild fails with
`status=-1007` (`-DER_NOSPACE`), the failure will likely repeat in its automatic
retries. Stopping such a rebuild allows an administrator to perform alternate
actions (e.g., directly reintegrating the lost engine(s), and/or expanding the
pool to more engines).
Contributor

what if you get DER_NOSPACE on reintegrate rebuild?

Contributor Author

I'm not sure what to expect. I imagine if it is a direct reintegration of the excluded targets shortly before, then getting -DER_NOSPACE may be unlikely (since the data was placed there originally)?

Contributor

it definitely is likely and I ran into that issue several times :-)
probably a bug, but sometimes we get around that issue by restarting the system or by reintegrating only 1 pool at a time.
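The one-pool-at-a-time workaround mentioned above, sketched as commands (pool labels `pool1`/`pool2` and rank `3` are hypothetical, and exact flag names may vary by DAOS version):

```shell
# Reintegrate a single pool, check occasionally that its rebuild has
# completed, then move on to the next pool.
dmg pool reintegrate pool1 --ranks=3
dmg pool query pool1 --health-only   # repeat occasionally until rebuild is idle
dmg pool reintegrate pool2 --ranks=3
```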

Contributor

if reint got DER_NOSPACE it should fail and revert the pool map change of the reint, and will not automatically retry; the admin can redo the reintegrate later.

"it definitely is likely and ran into that issue several times"
probably it was another error that caused the retry, and it finally hit ENOSPACE? from code reading it should not automatically retry the ENOSPACE case anyway.

Comment thread docs/admin/rebuild_controls.md Outdated
(e.g., system stop, system/pool exclude, reintegrate, drain, pool extend) can
trigger new rebuild(s). Also, if a rebuild is stopped, whatever progress it had
made in reconstructing the data in the pool is not retained — a subsequent
"rebuild start" command will start the rebuild from the beginning (i.e., this is
Contributor

what if a pool has failed rebuild with something random, let's say DER_STALE, or some network issue.
can someone just call rebuild start and it should restart, or would one need to call rebuild stop then rebuild start? or is that sort of use case not handled by this?

Contributor Author

rebuild start invokes the same logic as a pool leader switch or system restart — it launches rebuilds for any targets found in relevant states (DOWN, DRAINING, or UP). But I don't think it's needed here.

For transient errors like -DER_STALE or network errors, the retry is automatic. No pool map revert occurs, so the retry finds targets in the same state as the original operation (possibly with additional targets in new states from pool map changes that occurred during the failed rebuild). No rebuild start needed.

For other (general) errors, the pool map is reverted after failure (excluded targets stay DOWN; reintegrating UP → DOWN/DOWNOUT; draining DRAIN → UPIN; extending UP → NEW). A retry rebuild is also launched automatically, but for reintegrate/drain/extend the targets are no longer in a rebuildable state, so the retry quickly self-cancels. To redo those operations, the original dmg command (reintegrate, drain, extend) would need to be reissued.
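The cases described above, sketched as commands (pool label `p1` and rank `3` are hypothetical; flag names follow common `dmg` conventions and may differ by version):

```shell
# Transient failure (e.g., -DER_STALE, network error): the retry is
# automatic, so no administrator action is needed.

# General failure of a reintegrate/drain/extend: the pool map is reverted,
# so reissue the original dmg command, for example:
dmg pool drain p1 --ranks=3

# Explicit relaunch (same logic as a pool leader switch / system restart);
# starts rebuilds for any targets found in DOWN, DRAINING, or UP states:
dmg pool rebuild start p1
```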

Contributor

i would not agree that for all network errors the retry is automatic. we have seen some failures of rebuild with DER_HG or DER_NOMEM (which come from the network).

Contributor Author

This is my best understanding of the code. It's based on looking at the behavior of retry_rebuild_task() and the context in which it is invoked. I'd have to see particular failure instances in which a rebuild retry was not actually launched as a result of one of these failures, to be able to better see why it is not true.

Comment thread docs/admin/rebuild_controls.md Outdated
Comment on lines +65 to +69
The `rebuild stop` command is not typically allowed to terminate a rebuild in
the `op:Reclaim` and `op:Fail_reclaim` phases — instead the command must be
issued during the `op:Rebuild` execution. An exception is available with the
`--force` option to `rebuild stop`, intended for rebuilds that repeatedly fail
and may even be looping on `Fail_reclaim` operations.
Contributor

is there a reason to do it this way? why not just make it always stop (the force being the default)?
just from experience, when you want to stop a rebuild, there is usually something wrong going on, either in the rebuild phase or in the reclaim_fail phase.

Contributor Author

@kccain kccain Apr 27, 2026

Sorry, I think I got the documentation wrong here and in the last section describing command errors. I'll fix it up. When rebuild stop occurs during Fail_reclaim, it succeeds silently (no error returned to the administrator), though it does allow the Fail_reclaim to continue to its own (hopefully successful) completion. After which, any normal retry mechanism will be suppressed (it remembers that the command was issued during Fail_reclaim).

The more pathological case of Fail_reclaim itself failing and retrying itself is what the --force option is for. This one will only allow the Fail_reclaim to be forcibly terminated if it has failed at least once (i.e., an admin can't stop the first Fail_reclaim before we even know whether that one will fail, and get into the pathological case).

I was being very conservative in the implementation to make sure to run any reclaim behavior during stop, to avoid causing potential further problems due to not freeing up space after a rebuild. One idea for the future may be to always force stop Fail_reclaim, with the idea that it "should be ok" to combine its cleanup work with a reclaim following the next rebuild. Obviously, this would require some more careful code analysis and testing to be sure.
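The two stop behaviors described in this comment, as commands (pool label `p1` is hypothetical; the command and `--force` flag are as documented in this PR):

```shell
# Issued during a (first, possibly successful) Fail_reclaim: accepted
# silently, Fail_reclaim runs to completion, and the normal retry of
# op:Rebuild is then suppressed.
dmg pool rebuild stop p1

# Only for a Fail_reclaim that has already failed at least once and is
# looping: forcibly terminate it.
dmg pool rebuild stop p1 --force
```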

Comment thread docs/admin/rebuild_controls.md Outdated
Comment on lines +71 to +75
Because of these details, careful timing of `rebuild stop` commands is needed,
which can be facilitated by querying pool rebuild state with `dmg pool query`.
See the section
[Rebuild Stop Command Errors](#rebuild-stop-command-errors) for examples of
errors returned by "rebuild stop" in different timing circumstances.
Contributor

There is a race condition here, and you cannot expect an admin to "time" commands to be issued.

Contributor Author

In the typical case (Rebuild fails, Fail_reclaim runs successfully, Rebuild is retried) - if the stop command is received during Fail_reclaim it will be handled gracefully without undue burden on the administrator to time the command. I'll update the text to fix my inaccurate description.

Comment thread docs/admin/rebuild_controls.md Outdated
```
Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
Pool health info:
- Disabled ranks: 3
- Rebuild busy, 1 objs, 0 recs
```
Contributor

just for the sake of this example, i would put something in recs and not 0. usually if it is 0 and objs is not 0, it means it is in reclaim or reclaim_fail.

Contributor Author

done in latest push

Comment thread docs/admin/rebuild_controls.md Outdated
```
$ dmg pool rebuild stop p1
```

Run `dmg pool query` in a loop (with short delays between commands).
Contributor

this is really not good practice. other than pool query sometimes hanging or taking a long time when rebuild is running, this is also a pretty ugly way to administer a system.
also, one can cause a problem by running multiple queries in large loops.

anyway, i just think that we should not be asking anyone to do any pool operations in a loop.

Contributor Author

will fix to emphasize pool query can be used to inspect rebuild state, with advice not to run it too frequently, and to consider using the --health-only option.
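For example, a single lighter-weight status check rather than a loop (pool label `p1` is hypothetical; `--health-only` is the option referenced in this comment):

```shell
# --health-only limits the query to health/rebuild state, avoiding the
# heavier full pool query while a rebuild is running.
dmg pool query p1 --health-only
```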

Contributor

i do not think you need to get into that. just avoid really saying that one needs to query in a loop.

```
- Disabled ranks: 3
- Rebuild stopped (state=idle, status=-2027)
- Data redundancy: degraded
```
Contributor

an alternate outcome is that fail_reclaim fails or is actually stuck. what is the course of action in this case?
stop the rebuild with --force and start it again?
will reclaim_fail restart first in this case before rebuild?
maybe add that example here?

Contributor Author

The action is to run rebuild stop with --force and start it again. When restarted, it will start from the beginning with Rebuild (not Fail_reclaim).

I can try to put a small example in for this, though it may be thin on command outputs
and may just have to describe the conditions seen in pool query output.
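The recovery sequence described above, sketched (pool label `p1` is hypothetical; commands are those documented in this PR):

```shell
# Fail_reclaim has failed at least once and appears stuck or looping:
dmg pool rebuild stop p1 --force
# The restarted rebuild begins from op:Rebuild (not Fail_reclaim):
dmg pool rebuild start p1
```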

```
Rank State
---- -----
[0-2,4-7] Joined
3 Stopped
```
Contributor

Stopped will transition to Excluded and this is when rebuild kicks in.

Contributor Author

added in latest push

Comment thread docs/admin/self_healing.md Outdated
Comment on lines +12 to +13
By default, the system works as previous versions (per-pool `self_heal`
property) if one doesn't modify the "new" system `self_heal` property. An
Contributor

[Nit] This assumes the current release to 2.8, which is not the case for master. Might reword 'the "new" system self_heal property' to something like 'the system self_heal property introduced in 2.8', and "previous versions" to something like "pre-2.8 versions".

Contributor

addressed


Pool-level properties apply to pool-specific behavior and each pool has its own
set of pool-properties, which can be read and written by the system
administrator using `dmg pool get-prop` / `dmg pool set-prop`.
Contributor

[Nit] Might want to remind users about the need to escape ; (as in exclude\;rebuild).
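For example (pool label `p1` and the property value are illustrative; the backslash keeps the shell from treating `;` as a command separator):

```shell
dmg pool set-prop p1 self_heal:exclude\;rebuild
dmg pool get-prop p1 self_heal
```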

Contributor

addressed

Comment on lines +42 to +46
* **`exclude`** — Whether engines get excluded from the system membership based
on (SWIM) activity detection.
* **`pool_exclude`** — Whether, when engine/target states change (e.g., as a
result of engine exclusion from the system), the pool maps of all affected
pools will be automatically updated.
Contributor

[Nit] For both "exclude"s, I feel that an important piece of info for users is this: when disabled, future operations (pool create, object update, rebuild tasks, etc.) may time out and retry until either the exclusion happens or the unavailable targets/engines become available again.

Contributor

re. "exclusion happens" isn't this dependent on the exclusion flag in question being enabled? "or the unavailable targets/engines become available again" doesn't immediately make sense to me in the context of the "exclude" flags, isn't that just a general statement?

Comment thread docs/admin/self_healing.md Outdated
rebuild does not necessarily trigger automatically and can be delayed based on
user requirements. Delayed rebuild is mostly out of scope for this section.

On starting a DAOS system and pool creation, default `self_heal` flags will be
Contributor

Hmm, you mean "on creating a DAOS system"? If one modifies system self_heal to, say, exclude, then restarting the system will not "reset" this property to exclude;pool_exclude;pool_rebuild.
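A sketch of modifying the system-level property, under the assumption that it is managed with `dmg system set-prop`/`get-prop` (the exact command form is an assumption; only the `self_heal` property name and flag values come from this PR):

```shell
# Narrow the system policy to exclusion only (no automatic pool rebuild);
# per the note above, this setting persists across a system restart.
dmg system set-prop self_heal:exclude\;pool_exclude
dmg system get-prop self_heal
```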

Contributor

addressed

Comment thread docs/admin/self_healing.md Outdated

## Pool Query Data Redundancy Status

**Available in:** DAOS 2.6+
Contributor

[Nit] 2.6 doesn't have it, IIRC. So perhaps clearer to say something like "2.8 and later" or ">= 2.8".

Contributor

addressed

Comment thread docs/admin/self_healing.md Outdated

The `dmg pool query` command displays the pool's data redundancy status as part
of the health information output. This field provides a clear indication of
whether the pool has sufficient target availability to maintain data redundancy.
Contributor

[Question] Is it really accurate? My understanding is that there may be sufficient targets available, but the data redundancy can still be degraded, for example, because the rebuild task is still in progress, or is temporarily disabled.

Contributor

addressed

Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from liw April 27, 2026 16:30
Doc-only: true

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
@kccain kccain requested a review from mchaarawi April 27, 2026 18:38
The `rebuild stop` command may return errors when it is issued at a time that it
may not be able to handle the request. The following subsections show examples.
The `rebuild stop` command behavior depends on the current rebuild phase.
The following subsections describe the responses in each case.
Contributor

sp: describe->describes

* When a pool has insufficient free space to accommodate relocation of affected
data when engine(s) are excluded/drained. For example, if a rebuild fails with
`status=-1007` (`-DER_NOSPACE`), the failure will likely repeat in its automatic
retries. Stopping such a rebuild allows an administrator to perform alternate
Contributor

this mismatches the real processing: it will not automatically retry on an ENOSPACE error.
you may change the described case to a temporary network error or some other retryable error.


1. Run `op:Rebuild`
2. Run `op:Fail_reclaim` to clean up
* If `Fail_reclaim` *itself* failed, retry `Fail_reclaim`
* If `Fail_reclaim` succeeded, retry the original `op:Rebuild`
Contributor

  • if the rebuild failed with a retryable error.

