DAOS-17306 doc: self-healing properties, interactive rebuild #18023
Conversation
Force-pushed from b137cd4 to 185c293
Ticket title is 'Enable/disable auto recovery'
For the DAOS version 2.8 release, add two major sections to the DAOS Administrator's Guide:
- self-healing properties / policy controls (DAOS-17306)
- explicit / interactive rebuild control (DAOS-17281)

Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
output (introduced in PR #17371 / DAOS 2.6). The new section explains:
- Field values (normal vs degraded)
- When to check this field
- Example usage with exclude-only self-heal policies
- How to verify exclusion completed when auto-rebuild is disabled

Updated pool query examples throughout to show the Data redundancy field for consistency with DAOS 2.6+ output. Particularly useful for scenarios where system.self_heal is set to exclude,pool_exclude, or the pool self_heal property has the exclude bit set without rebuild, to confirm exclusion has occurred.

Related-to: #17371
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
4cc9216 to
caf846e
Compare
Had to force push to clean up the commit list; hope that's okay.
> - Rebuild busy, 42 objs, 21 recs
> - Data redundancy: degraded
Something I thought of, but no change requested: there are other cases, like drain, that would show Rebuild busy but Data redundancy: normal. And I guess extend might show something similar. But it's probably not worth enumerating too many cases.
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18023/5/testReport/
Force-pushed from 1fdc4e6 to b61f74c
Co-authored-by: Ken Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Force-pushed from b61f74c to 8b9ad4b
Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Force-pushed from 8b9ad4b to 12c2f3a
mchaarawi left a comment:

Reviewed the rebuild_controls file only; will get to self_healing next.
> * A system in which many tens of pools are all rebuilding simultaneously
>   (requiring substantial CPU and network resources in the DAOS system), where
>   the administrator decides to stagger overall recovery by manually
>   commanding pools to rebuild in smaller batches.
Just a thought for future improvement: this is a nice use case for having `dmg pool rebuild stop/start` accept a list of pools.
> * When a pool has insufficient free space to accommodate relocation of affected
>   data when engine(s) are excluded/drained. For example, if a rebuild fails with
>   `status=-1007` (`-DER_NOSPACE`) (which will likely repeat in its automatic
>   retries). Stopping such a rebuild allows an administrator to perform alternate
>   actions (e.g., directly reintegrate the lost engine(s), and/or expand the pool
>   to more engines).
What if you get DER_NOSPACE on a reintegrate rebuild?
I'm not sure what to expect. I imagine if it is a direct reintegration of targets excluded shortly before, then getting -DER_NOSPACE may be unlikely (since the data was placed there originally)?
It definitely is likely; I ran into that issue several times :-)
Probably a bug, but sometimes we get around the issue by restarting the system or reintegrating only one pool at a time.
If reintegration gets DER_NOSPACE it should fail, revert the pool map change of the reintegration, and not automatically retry; the admin can redo the reintegrate later.

Re "it definitely is likely and ran into that issue several times": probably some other error caused the retry, and it finally hit ENOSPACE? From code reading, it should not automatically retry the ENOSPACE case anyway.
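The handling described in this thread for a failed reintegrate rebuild can be sketched as a small decision function. This is illustrative only, not DAOS source code: only `-DER_NOSPACE` (`-1007`) is confirmed by the PR text, and treating `DER_STALE` as transient follows the discussion rather than a code audit.

```python
# Transient errors, per the discussion: rebuild retries automatically,
# with no pool map revert.
TRANSIENT = {"DER_STALE"}

def reintegrate_next_action(err: str) -> str:
    """Follow-up after a failed reintegrate rebuild, per this thread.

    Non-transient errors (including DER_NOSPACE) revert the pool map
    change and are not retried; the admin redoes the operation.
    """
    if err in TRANSIENT:
        return "automatic retry"
    return "revert pool map; admin reissues dmg pool reintegrate"
```

Usage: `reintegrate_next_action("DER_NOSPACE")` reflects the "no automatic retry for ENOSPACE" behavior described above.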
> (e.g., system stop, system/pool exclude, reintegrate, drain, pool extend) can
> trigger new rebuild(s). Also, if a rebuild is stopped, whatever progress it had
> made in reconstructing the data in the pool is not retained — a subsequent
> "rebuild start" command will start the rebuild from the beginning (i.e., this is
What if a pool has a failed rebuild with something random, say DER_STALE, or some network issue? Can someone just call rebuild start and it should restart, or would one need to call rebuild stop then rebuild start? Or is that sort of use case not handled by this?
rebuild start invokes the same logic as a pool leader switch or system restart — it launches rebuilds for any targets found in relevant states (DOWN, DRAINING, or UP). But I don't think it's needed here.
For transient errors like -DER_STALE or network errors, the retry is automatic. No pool map revert occurs, so the retry finds targets in the same state as the original operation (possibly with additional targets in new states from pool map changes that occurred during the failed rebuild). No rebuild start needed.
For other (general) errors, the pool map is reverted after failure (excluded targets stay DOWN; reintegrating UP → DOWN/DOWNOUT; draining DRAIN → UPIN; extending UP → NEW). A retry rebuild is also launched automatically, but for reintegrate/drain/extend the targets are no longer in a rebuildable state, so the retry quickly self-cancels. To redo those operations, the original dmg command (reintegrate, drain, extend) would need to be reissued.
I would not agree that for all network errors the retry is automatic; we have seen some rebuild failures with DER_HG or DER_NOMEM (which comes from the network).
This is my best understanding of the code, based on the behavior of retry_rebuild_task() and the context in which it is invoked. I'd have to see particular failure instances in which a rebuild retry was not actually launched as a result of one of these failures to better see why it is not true.
> The `rebuild stop` commands are not typically allowed to terminate a rebuild in
> the `op:Reclaim` and `op:Fail_reclaim` phases — instead the command must be
> issued during the `op:Rebuild` execution. An exception is available with the
> `--force` option to `rebuild stop`, intended to be applied for rebuilds that
> repeatedly fail and possibly may even be looping `Fail_reclaim` operations.
Is there a reason to do it this way? Why not just make it always stop (force being the default)? Just from experience, when you want to stop a rebuild, there is usually something wrong going on, either in the rebuild phase or in the Fail_reclaim phase.
Sorry, I think I got the documentation wrong here and in the last section describing command errors; I'll fix it up. When rebuild stop occurs during Fail_reclaim, it succeeds silently (no error returned to the administrator), though it does allow the Fail_reclaim to continue to its own (hopefully successful) completion. After that, any normal retry mechanism is suppressed (it remembers that the command was issued during Fail_reclaim).

The more pathological case of Fail_reclaim itself failing and retrying itself is what the --force option is for. It will only allow the Fail_reclaim to be forcibly terminated if it has failed at least once (i.e., an admin can't stop the first Fail_reclaim before we even know whether it will fail and get into the pathological case).

I was being very conservative in the implementation, making sure to run any reclaim behavior during stop, to avoid causing further problems by not freeing up space after a rebuild. One idea for the future may be to always force-stop Fail_reclaim, on the theory that it "should be ok" to combine its cleanup work with the reclaim following the next rebuild. Obviously, this would require more careful code analysis and testing to be sure.
> Because of these details, carefully timing the execution of `rebuild stop`
> commands is needed, which can be facilitated with pool rebuild state querying
> with `dmg pool query`. See the section
> [Rebuild Stop Command Errors](#rebuild-stop-command-errors) for examples of
> errors returned by "rebuild stop" in different timing circumstances.
There is a race condition here, and you cannot expect an admin to "time" commands to be issued.
In the typical case (Rebuild fails, Fail_reclaim runs successfully, Rebuild is retried), if the stop command is received during Fail_reclaim it will be handled gracefully, without undue burden on the administrator to time the command. I'll update the text to fix my inaccurate description.
> Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
> Pool health info:
> - Disabled ranks: 3
> - Rebuild busy, 1 objs, 0 recs
Just for the sake of this example, I would put something in recs and not 0. Usually if recs is 0 and objs is not 0, it means it is in reclaim or reclaim_fail.
> $ dmg pool rebuild stop p1

> Run `dmg pool query` in a loop (with short delays between commands).
This is really not good practice. Other than pool query sometimes hanging or taking a long time while rebuild is running, this is also a pretty ugly way to administer a system. Also, one can cause a problem by running multiple queries in large loops. Anyway, I just think that we should not be asking anyone to do any pool operations in a loop.
Will fix to emphasize that pool query can be used to inspect rebuild state, with advice not to run it too frequently and to consider using the --health-only option.
I do not think you need to get into that; just avoid saying that one needs to query in a loop.
> - Disabled ranks: 3
> - Rebuild stopped (state=idle, status=-2027)
> - Data redundancy: degraded
An alternate outcome is that Fail_reclaim fails or is actually stuck; what is the course of action in this case? Stop the rebuild with --force and start it again? Will Fail_reclaim restart first in this case, before Rebuild? Maybe add that example here?
The action is to run rebuild stop with --force and start it again. When restarted, it will start from the beginning with Rebuild (not Fail_reclaim). I can try to put a small example in for this, though it may be thin on command outputs and may just have to describe the conditions seen in pool query output.
> Rank      State
> ----      -----
> [0-2,4-7] Joined
> 3         Stopped
Stopped will transition to Excluded, and that is when rebuild kicks in.
Added in latest push.
> By default, the system works as previous versions (per-pool `self_heal`
> property) if one doesn't modify the "new" system `self_heal` property. An
[Nit] This assumes the current release is 2.8, which is not the case for master. Might reword 'the "new" system self_heal property' to something like 'the system self_heal property introduced in 2.8', and "previous versions" to something like "pre-2.8 versions".
> Pool-level properties apply to pool-specific behavior and each pool has its own
> set of pool properties, which can be read and written by the system
> administrator using `dmg pool get-prop` / `dmg pool set-prop`.
[Nit] Might want to remind users about the need to escape `;` (as in `exclude\;rebuild`).
> * **`exclude`** — Whether engines get excluded from the system membership based
>   on (SWIM) activity detection.
> * **`pool_exclude`** — Whether, when engine/target states change (e.g., as a
>   result of engine exclusion from the system), all affected pools' pool maps
>   will be automatically updated (or not).
[Nit] For both "exclude" flags, I feel an important piece of information for users is this: when disabled, future operations (pool create, object update, rebuild tasks, etc.) may time out and retry until either the exclusion happens or the unavailable targets/engines become available again.
Re "exclusion happens": isn't this dependent on the exclusion flag in question being enabled? And "the unavailable targets/engines become available again" doesn't immediately make sense to me in the context of the "exclude" flags; isn't that just a general statement?
> rebuild does not necessarily trigger automatically and can be delayed based on
> user requirements. Delayed rebuild is mostly out of scope for this section.
> On starting a DAOS system and pool creation, default `self_heal` flags will be
Hmm, you mean "on creating a DAOS system"? If one modifies system self_heal to, say, exclude, then restarting the system will not "reset" this property to exclude;pool_exclude;pool_rebuild.
> ## Pool Query Data Redundancy Status
>
> **Available in:** DAOS 2.6+
[Nit] 2.6 doesn't have it, IIRC. So perhaps it's clearer to say something like "2.8 and later" or ">= 2.8".
> The `dmg pool query` command displays the pool's data redundancy status as part
> of the health information output. This field provides a clear indication of
> whether the pool has sufficient target availability to maintain data redundancy.
[Question] Is this really accurate? My understanding is that there may be sufficient targets available while data redundancy is still degraded, for example because the rebuild task is still in progress, or is temporarily disabled.
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
> The `rebuild stop` command may return errors when it is issued at a time that it
> may not be able to handle the request. The following subsections show examples.
> The `rebuild stop` command behavior depends on the current rebuild phase.
> The following subsections describe the responses in each case.
> * When a pool has insufficient free space to accommodate relocation of affected
>   data upon engine(s) excluded/drained. For example, if a rebuild fails with
>   `status=-1007` (`-DER_NOSPACE`) (that will likely repeat in its automatic
>   retries). Stopping such a rebuild allows an administrator to perform alternate
This mismatches the real processing: it will not automatically retry for the ENOSPACE error. You may change the described case to a temporary network error or some other retryable error.
> 1. Run `op:Rebuild`
> 2. Run `op:Fail_reclaim` to clean up
>    * If `Fail_reclaim` *itself* failed, retry `Fail_reclaim`
>    * If `Fail_reclaim` succeeded, retry the original `op:Rebuild`
Suggested addition to the last step: "... if the rebuild failed with a retryable error."
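The quoted failure-handling steps, with the reviewer's qualification folded in, can be sketched as a tiny loop. Illustrative only, not DAOS source:

```python
def handle_failed_rebuild(rebuild_error_retryable, run_fail_reclaim):
    """Sketch of the flow quoted above.

    run_fail_reclaim() returns True on success; per step 2, it is retried
    until it succeeds. op:Rebuild is then retried only if the original
    failure was retryable (the reviewer's suggested qualification).
    """
    while not run_fail_reclaim():   # retry Fail_reclaim if it itself failed
        pass
    if rebuild_error_retryable:
        return "retry op:Rebuild"
    return "done (no retry)"
```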