DAOS-17306 doc: self-healing properties, interactive rebuild #18023
Conversation
Force-pushed from b137cd4 to 185c293
Ticket title is 'Enable/disable auto recovery'
For the DAOS version 2.8 release, add two major sections to the DAOS Administrator's Guide:
- self-healing properties / policy controls (DAOS-17306)
- explicit / interactive rebuild control (DAOS-17281)

Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
output (introduced in PR #17371 / DAOS 2.6). The new section explains:
- Field values (normal vs degraded)
- When to check this field
- Example usage with exclude-only self-heal policies
- How to verify exclusion completed when auto-rebuild is disabled

Updated pool query examples throughout to show the Data redundancy field for consistency with DAOS 2.6+ output. Particularly useful for scenarios where system.self_heal is set to exclude,pool_exclude, or the pool self_heal property has the exclude bit set without rebuild, to confirm exclusion has occurred.

Related-to: #17371
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
4cc9216 to
caf846e
Compare
Had to force push to clean up the commit list; hope that's okay.
> - Rebuild busy, 42 objs, 21 recs
> - Data redundancy: degraded
Something I thought of, but no change requested: there are other cases, like drain, that would show Rebuild busy but Data redundancy: normal. And I guess extend might show something similar. But it's probably not worth enumerating too many cases.
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18023/5/testReport/
Force-pushed from 1fdc4e6 to b61f74c
Co-authored-by: Ken Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Force-pushed from b61f74c to 8b9ad4b
Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Force-pushed from 8b9ad4b to 12c2f3a
mchaarawi left a comment:

Reviewed the rebuild_controls file only; will get to self_healing next.
> * A system in which many tens of pools are all rebuilding simultaneously
>   (requiring substantial CPU and network resources in the DAOS system), where
>   the administrator decides to stagger overall recovery by manually
>   commanding pools to rebuild in smaller batches.
Just a thought for future improvement: this is a nice use case for having `dmg pool rebuild stop/start` accept a list of pools.
> * When a pool has insufficient free space to accommodate relocation of affected
>   data when engine(s) are excluded/drained. For example, if a rebuild fails with
>   `status=-1007` (`-DER_NOSPACE`) (which will likely repeat in its automatic
>   retries). Stopping such a rebuild allows an administrator to perform alternate
>   actions (e.g., directly reintegrate the lost engine(s), and/or expand the pool
>   to more engines).
What if you get DER_NOSPACE on a reintegrate rebuild?
I'm not sure what to expect. I imagine if it is a direct reintegration of targets excluded shortly before, then getting -DER_NOSPACE may be unlikely (since the data was placed there originally)?
It definitely is likely; I ran into that issue several times :-)
Probably a bug, but sometimes we get around the issue by restarting the system or reintegrating only one pool at a time.
If reintegration gets DER_NOSPACE it should fail, revert the pool map change of the reintegration, and not automatically retry; the admin can redo the reintegrate later.

Re "it definitely is likely and ran into that issue several times": probably some other error caused the retry, and it finally hit ENOSPACE? From code reading, it should not automatically retry the ENOSPACE case anyway.
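The handling described in this thread for a failed reintegrate rebuild can be sketched as a small decision function. This is illustrative only, not DAOS source code: only `-DER_NOSPACE` (`-1007`) is confirmed by the PR text, and treating `DER_STALE` as transient follows the discussion rather than a code audit.

```python
# Transient errors, per the discussion: rebuild retries automatically,
# with no pool map revert.
TRANSIENT = {"DER_STALE"}

def reintegrate_next_action(err: str) -> str:
    """Follow-up after a failed reintegrate rebuild, per this thread.

    Non-transient errors (including DER_NOSPACE) revert the pool map
    change and are not retried; the admin redoes the operation.
    """
    if err in TRANSIENT:
        return "automatic retry"
    return "revert pool map; admin reissues dmg pool reintegrate"
```

Usage: `reintegrate_next_action("DER_NOSPACE")` reflects the "no automatic retry for ENOSPACE" behavior described above.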
> (e.g., system stop, system/pool exclude, reintegrate, drain, pool extend) can
> trigger new rebuild(s). Also, if a rebuild is stopped, whatever progress it had
> made in reconstructing the data in the pool is not retained — a subsequent
> "rebuild start" command will start the rebuild from the beginning (i.e., this is
What if a pool has a failed rebuild with something random, say DER_STALE, or some network issue? Can someone just call rebuild start and it should restart, or would one need to call rebuild stop then rebuild start? Or is that sort of use case not handled by this?
rebuild start invokes the same logic as a pool leader switch or system restart — it launches rebuilds for any targets found in relevant states (DOWN, DRAINING, or UP). But I don't think it's needed here.
For transient errors like -DER_STALE or network errors, the retry is automatic. No pool map revert occurs, so the retry finds targets in the same state as the original operation (possibly with additional targets in new states from pool map changes that occurred during the failed rebuild). No rebuild start needed.
For other (general) errors, the pool map is reverted after failure (excluded targets stay DOWN; reintegrating UP → DOWN/DOWNOUT; draining DRAIN → UPIN; extending UP → NEW). A retry rebuild is also launched automatically, but for reintegrate/drain/extend the targets are no longer in a rebuildable state, so the retry quickly self-cancels. To redo those operations, the original dmg command (reintegrate, drain, extend) would need to be reissued.
I would not agree that for all network errors the retry is automatic; we have seen some rebuild failures with DER_HG or DER_NOMEM (which comes from the network).
This is my best understanding of the code, based on the behavior of retry_rebuild_task() and the context in which it is invoked. I'd have to see particular failure instances in which a rebuild retry was not actually launched as a result of one of these failures to better see why it is not true.
> The `rebuild stop` commands are not typically allowed to terminate a rebuild in
> the `op:Reclaim` and `op:Fail_reclaim` phases — instead the command must be
> issued during the `op:Rebuild` execution. An exception is available with the
> `--force` option to `rebuild stop`, intended to be applied for rebuilds that
> repeatedly fail and possibly may even be looping `Fail_reclaim` operations.
Is there a reason to do it this way? Why not just make it always stop (force being the default)? Just from experience, when you want to stop a rebuild, there is usually something wrong going on, either in the rebuild phase or in the Fail_reclaim phase.
Sorry, I think I got the documentation wrong here and in the last section describing command errors; I'll fix it up. When rebuild stop occurs during Fail_reclaim, it succeeds silently (no error returned to the administrator), though it does allow the Fail_reclaim to continue to its own (hopefully successful) completion. After that, any normal retry mechanism is suppressed (it remembers that the command was issued during Fail_reclaim).

The more pathological case of Fail_reclaim itself failing and retrying itself is what the --force option is for. It will only allow the Fail_reclaim to be forcibly terminated if it has failed at least once (i.e., an admin can't stop the first Fail_reclaim before we even know whether it will fail and get into the pathological case).

I was being very conservative in the implementation, making sure to run any reclaim behavior during stop, to avoid causing further problems by not freeing up space after a rebuild. One idea for the future may be to always force-stop Fail_reclaim, on the theory that it "should be ok" to combine its cleanup work with the reclaim following the next rebuild. Obviously, this would require more careful code analysis and testing to be sure.
> Because of these details, carefully timing the execution of `rebuild stop`
> commands is needed, which can be facilitated with pool rebuild state querying
> with `dmg pool query`. See the section
> [Rebuild Stop Command Errors](#rebuild-stop-command-errors) for examples of
> errors returned by "rebuild stop" in different timing circumstances.
There is a race condition here, and you cannot expect an admin to "time" commands to be issued.
In the typical case (Rebuild fails, Fail_reclaim runs successfully, Rebuild is retried), if the stop command is received during Fail_reclaim it will be handled gracefully, without undue burden on the administrator to time the command. I'll update the text to fix my inaccurate description.
> Pool cdf27ec1-ed97-4aa6-a766-39a2ed2136a1, ntarget=64, disabled=8, leader=6, version=77, state=TargetsExcluded
> Pool health info:
> - Disabled ranks: 3
> - Rebuild busy, 1 objs, 0 recs
Just for the sake of this example, I would put something in recs and not 0. Usually if recs is 0 and objs is not 0, it means it is in reclaim or reclaim_fail.
> $ dmg pool rebuild stop p1

> Run `dmg pool query` in a loop (with short delays between commands).
This is really not good practice. Other than pool query sometimes hanging or taking a long time while rebuild is running, this is also a pretty ugly way to administer a system. Also, one can cause a problem by running multiple queries in large loops. Anyway, I just think that we should not be asking anyone to do any pool operations in a loop.
Will fix to emphasize that pool query can be used to inspect rebuild state, with advice not to run it too frequently and to consider using the --health-only option.
I do not think you need to get into that; just avoid saying that one needs to query in a loop.
> - Disabled ranks: 3
> - Rebuild stopped (state=idle, status=-2027)
> - Data redundancy: degraded
An alternate outcome is that Fail_reclaim fails or is actually stuck; what is the course of action in this case? Stop the rebuild with --force and start it again? Will Fail_reclaim restart first in this case, before Rebuild? Maybe add that example here?
The action is to run rebuild stop with --force and start it again. When restarted, it will start from the beginning with Rebuild (not Fail_reclaim). I can try to put a small example in for this, though it may be thin on command outputs and may just have to describe the conditions seen in pool query output.
> Rank      State
> ----      -----
> [0-2,4-7] Joined
> 3         Stopped
Stopped will transition to Excluded, and that is when rebuild kicks in.
Added in latest push.
> By default, the system works as previous versions (per-pool `self_heal`
> property) if one doesn't modify the "new" system `self_heal` property. An
[Nit] This assumes the current release is 2.8, which is not the case for master. Might reword 'the "new" system self_heal property' to something like 'the system self_heal property introduced in 2.8', and "previous versions" to something like "pre-2.8 versions".
> Pool-level properties apply to pool-specific behavior and each pool has its own
> set of pool properties, which can be read and written by the system
> administrator using `dmg pool get-prop` / `dmg pool set-prop`.
[Nit] Might want to remind users about the need to escape `;` (as in `exclude\;rebuild`).
> * **`exclude`** — Whether engines get excluded from the system membership based
>   on (SWIM) activity detection.
> * **`pool_exclude`** — Whether, when engine/target states change (e.g., as a
>   result of engine exclusion from the system), all affected pools' pool maps
>   will be automatically updated (or not).
[Nit] For both "exclude" flags, I feel an important piece of information for users is this: when disabled, future operations (pool create, object update, rebuild tasks, etc.) may time out and retry until either the exclusion happens or the unavailable targets/engines become available again.
Re "exclusion happens": isn't this dependent on the exclusion flag in question being enabled? And "the unavailable targets/engines become available again" doesn't immediately make sense to me in the context of the "exclude" flags; isn't that just a general statement?
> rebuild does not necessarily trigger automatically and can be delayed based on
> user requirements. Delayed rebuild is mostly out of scope for this section.
> On starting a DAOS system and pool creation, default `self_heal` flags will be
Hmm, you mean "on creating a DAOS system"? If one modifies system self_heal to, say, exclude, then restarting the system will not "reset" this property to exclude;pool_exclude;pool_rebuild.
> ## Pool Query Data Redundancy Status
>
> **Available in:** DAOS 2.6+
[Nit] 2.6 doesn't have it, IIRC. So perhaps it's clearer to say something like "2.8 and later" or ">= 2.8".
> The `dmg pool query` command displays the pool's data redundancy status as part
> of the health information output. This field provides a clear indication of
> whether the pool has sufficient target availability to maintain data redundancy.
[Question] Is this really accurate? My understanding is that there may be sufficient targets available while data redundancy is still degraded, for example because the rebuild task is still in progress, or is temporarily disabled.
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Doc-only: true
Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
> The `rebuild stop` command may return errors when it is issued at a time that it
> may not be able to handle the request. The following subsections show examples.
> The `rebuild stop` command behavior depends on the current rebuild phase.
> The following subsections describe the responses in each case.
> * When a pool has insufficient free space to accommodate relocation of affected
>   data upon engine(s) excluded/drained. For example, if a rebuild fails with
>   `status=-1007` (`-DER_NOSPACE`) (that will likely repeat in its automatic
>   retries). Stopping such a rebuild allows an administrator to perform alternate
This mismatches the real processing: it will not automatically retry for the ENOSPACE error. You may change the described case to a temporary network error or some other retryable error.
> 1. Run `op:Rebuild`
> 2. Run `op:Fail_reclaim` to clean up
>    * If `Fail_reclaim` *itself* failed, retry `Fail_reclaim`
>    * If `Fail_reclaim` succeeded, retry the original `op:Rebuild`
Suggested addition to the last step: "... if the rebuild failed with a retryable error."
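The quoted failure-handling steps, with the reviewer's qualification folded in, can be sketched as a tiny loop. Illustrative only, not DAOS source:

```python
def handle_failed_rebuild(rebuild_error_retryable, run_fail_reclaim):
    """Sketch of the flow quoted above.

    run_fail_reclaim() returns True on success; per step 2, it is retried
    until it succeeds. op:Rebuild is then retried only if the original
    failure was retryable (the reviewer's suggested qualification).
    """
    while not run_fail_reclaim():   # retry Fail_reclaim if it itself failed
        pass
    if rebuild_error_retryable:
        return "retry op:Rebuild"
    return "done (no retry)"
```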