DAOS-17306 doc: self-healing properties, interactive rebuild#18023

Draft
kccain wants to merge 2 commits into master from kccain/daos_17306_doc

Conversation

@kccain
Contributor

@kccain kccain commented Apr 15, 2026

For the DAOS version 2.8 release, add two major sections to the DAOS Administrator's Guide:

  • self-healing properties / policy controls (DAOS-17306)
  • explicit / interactive rebuild control (DAOS-17281)

Doc-only: true

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@kccain kccain force-pushed the kccain/daos_17306_doc branch from b137cd4 to 185c293 on April 15, 2026 19:07
@github-actions

Ticket title is 'Enable/disable auto recovery'
Status is 'Resolved'
Labels: '2.8pp'
https://daosio.atlassian.net/browse/DAOS-17306

kccain and others added 2 commits April 16, 2026 13:44
For the DAOS version 2.8 release, add two major sections to the
DAOS Administrator's Guide:
- self-healing properties / policy controls (DAOS-17306)
- explicit / interactive rebuild control (DAOS-17281)

Doc-only: true

Signed-off-by: Kenneth Cain <kenneth.cain@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
output (introduced in PR #17371 / DAOS 2.6).

The new section explains:
- Field values (normal vs degraded)
- When to check this field
- Example usage with exclude-only self-heal policies
- How to verify exclusion completed when auto-rebuild is disabled

Updated pool query examples throughout to show the Data redundancy
field for consistency with DAOS 2.6+ output.

Particularly useful for scenarios where system.self_heal is set to
exclude,pool_exclude or pool self_heal has exclude bit set without
rebuild, to confirm exclusion has occurred.

Related-to: #17371
Doc-only: true
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the kccain/daos_17306_doc branch from 4cc9216 to caf846e on April 16, 2026 12:46
@tanabarr
Contributor

Had to force-push to clean up the commit list; hope that's okay.


| Value | Meaning |
|-------|---------|
| `normal` | All targets are UP and accessible; data redundancy is intact |
Contributor Author

Suggested change
- | `normal` | All targets are UP and accessible; data redundancy is intact |
+ | `normal` | No targets are DOWN; data redundancy is intact |

Reasoning: targets that are being reintegrated are in state UP (but not yet UP_IN). Data redundancy would likely be intact in this case, provided a previous exclude rebuild finished successfully.
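
The rule in the suggested wording can be sketched in a few lines of Python. This is only an illustration of the proposed table text: the helper name `data_redundancy` and the plain-string state names are assumptions made for this example, not DAOS code, and the real field is computed by DAOS itself.

```python
def data_redundancy(target_states):
    """Return 'degraded' if any target is DOWN, otherwise 'normal'.

    Hypothetical helper that mirrors the suggested table wording only.
    """
    return "degraded" if any(s == "DOWN" for s in target_states) else "normal"


# A reintegrating target is UP (not yet UP_IN) and does not count as DOWN.
print(data_redundancy(["UP_IN", "UP", "UP_IN"]))  # normal
print(data_redundancy(["UP_IN", "DOWN"]))         # degraded
```

Under this reading, a pool with a target in UP during reintegration still reports `normal`, which the original "All targets are UP and accessible" wording did not make clear.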

- Data redundancy: normal
```

or when targets are excluded:
Contributor Author

Suggested change
- or when targets are excluded:
+ or when targets are excluded and a corresponding rebuild has not yet completed:

- Rebuild busy, 42 objs, 21 recs
- Data redundancy: degraded
```

Contributor Author

Something I thought of, but no change requested: there are other cases, such as drain, that would show `Rebuild busy` but `Data redundancy: normal`.
I guess extend might show something similar, but it's probably not worth enumerating too many cases.


The combination of:
- `Disabled ranks: 3` — Shows which rank was excluded
- `Rebuild idle` — No automatic rebuild triggered (expected with exclude-only policy)
Contributor Author

Suggested change
- - `Rebuild idle` — No automatic rebuild triggered (expected with exclude-only policy)
+ - `Rebuild idle` or `Rebuild done` — idle if the pool has not been rebuilt since system start, done if it has been rebuilt for a prior change. No new automatic rebuild triggered for this exclusion (expected with exclude-only policy)

- `Data redundancy: degraded` — Confirms data redundancy is impaired and manual
intervention is needed

To restore redundancy, manually trigger rebuild:
Contributor Author

Suggested change
- To restore redundancy, manually trigger rebuild:
+ To restore redundancy for the excluded targets, manually trigger rebuild. Refer to [Detailed Pool Operations: Interactive Rebuild Controls](rebuild_controls.md) for more information about rebuild start commands.
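
As a rough illustration of the check discussed in this thread, the following Python sketch scans pool query text for the three fields quoted in the excerpts (`Disabled ranks`, `Rebuild ...`, `Data redundancy`) and flags the case where redundancy is degraded while no rebuild is running. The function names and the sample text are assumptions for this example; real pool query output contains more lines and its exact layout may differ.

```python
def parse_pool_query(text):
    """Extract the three fields of interest from pool query text.

    Hypothetical parser: assumes lines shaped like the excerpts in this PR.
    """
    fields = {"disabled_ranks": None, "rebuild": None, "data_redundancy": None}
    for raw in text.splitlines():
        line = raw.strip().lstrip("-").strip()
        if line.startswith("Disabled ranks:"):
            fields["disabled_ranks"] = line.split(":", 1)[1].strip()
        elif line.startswith("Data redundancy:"):
            fields["data_redundancy"] = line.split(":", 1)[1].strip()
        elif line.startswith("Rebuild "):
            # "Rebuild busy, 42 objs, 21 recs" -> "busy"
            fields["rebuild"] = line.split()[1].rstrip(",")
    return fields


def needs_manual_rebuild(fields):
    """True when redundancy is degraded and no rebuild is in progress."""
    return (fields["data_redundancy"] == "degraded"
            and fields["rebuild"] in ("idle", "done"))


sample = """\
- Disabled ranks: 3
- Rebuild idle
- Data redundancy: degraded
"""
fields = parse_pool_query(sample)
print(fields)
print(needs_manual_rebuild(fields))  # True
```

Accepting both `idle` and `done` follows the review suggestion above, since a pool that was rebuilt for an earlier change reports `Rebuild done` rather than `Rebuild idle`.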
