HDDS-14990. Show failed volumes in ozone admin datanode list output and SCM metrics #10058

Open
xichen01 wants to merge 4 commits into apache:master from xichen01:HDDS-14990

@xichen01 commented Apr 9, 2026

What changes were proposed in this pull request?

ozone admin datanode list can already display "Total volume count" and "Healthy volume count", and Datanode JMX can display NumFailedVolumes.

This change adds:

  • Command output showing which disks have failed.
  • SCM Prometheus metrics for failed disks.
  • ozone admin datanode list --failed-volumes, to display only Datanodes with failed disks.

Example

$ ozone admin datanode list
Datanode: d81a7da8-86d4-42ca-83e9-7015401ed11c (/default/1.2.3.01/host0/0 pipelines)
Operational State: IN_SERVICE
Health State: STALE
Total volume count: 4
Healthy volume count: 3
Failed volumes:    <--- Newly added
  /data/disk2

//...
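
For readers following along, here is a minimal sketch of the filter-then-print behaviour described above. It is not the code in this patch: FailedVolumeListSketch, NodeInfo, filter and render are hypothetical names, and the node data is hard-coded rather than fetched from SCM.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FailedVolumeListSketch {

  /** Hypothetical stand-in for the per-datanode data the CLI prints. */
  public static final class NodeInfo {
    final String id;
    final int totalVolumes;
    final int healthyVolumes;
    final List<String> failedVolumes;

    public NodeInfo(String id, int totalVolumes, int healthyVolumes,
        List<String> failedVolumes) {
      this.id = id;
      this.totalVolumes = totalVolumes;
      this.healthyVolumes = healthyVolumes;
      this.failedVolumes = failedVolumes;
    }
  }

  /** Keeps only nodes with at least one failed volume when the filter is on. */
  public static List<NodeInfo> filter(List<NodeInfo> nodes,
      boolean onlyWithFailedVolumes) {
    if (!onlyWithFailedVolumes) {
      return nodes;
    }
    return nodes.stream()
        .filter(n -> !n.failedVolumes.isEmpty())
        .collect(Collectors.toList());
  }

  /** Renders one node in roughly the layout of the example output. */
  public static String render(NodeInfo node) {
    StringBuilder sb = new StringBuilder()
        .append("Datanode: ").append(node.id).append('\n')
        .append("Total volume count: ").append(node.totalVolumes).append('\n')
        .append("Healthy volume count: ").append(node.healthyVolumes).append('\n');
    // The "Failed volumes" section is printed only when there is something to list.
    if (!node.failedVolumes.isEmpty()) {
      sb.append("Failed volumes:\n");
      for (String path : node.failedVolumes) {
        sb.append("  ").append(path).append('\n');
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    List<NodeInfo> nodes = List.of(
        new NodeInfo("dn-1", 4, 3, List.of("/data/disk2")),
        new NodeInfo("dn-2", 4, 4, List.of()));
    // With the filter enabled, only dn-1 is printed.
    filter(nodes, true).forEach(n -> System.out.print(render(n)));
  }
}
```

Whether a node with no failed volumes should print an empty "Failed volumes:" header or omit the section is a design choice; the sketch omits it.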

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14990

How was this patch tested?

new test

@adoroszlai left a comment

Thanks @xichen01 for the patch.

Comment on lines +92 to +102
  Integer healthyVolumeCount = node.hasHealthyVolumeCount() ? node.getHealthyVolumeCount() : null;
  BasicDatanodeInfo singleNodeInfo = new BasicDatanodeInfo.Builder(
      DatanodeDetails.getFromProtoBuf(node.getNodeID()), node.getNodeOperationalStates(0),
-     node.getNodeStates(0)).withVolumeCounts(totalVolumeCount, healthyVolumeCount).build();
+     node.getNodeStates(0)).withVolumeCounts(totalVolumeCount, healthyVolumeCount)
+         .withFailedVolumes(getFailedVolumes(node)).build();

Contributor

I'd like to suggest a small refactoring to reduce duplication of code related to BasicDatanodeInfo creation.

HDDS-14990.patch

Contributor Author

Done.

}
} else {
pipelineListInfo.append("No pipelines in cluster.");
pipelineListInfo.append(System.getProperty("line.separator"));

Contributor

Why is this needed? pipelineListInfo is output with println below.

Contributor Author

This keeps it consistent with the other branches, where pipelineListInfo always ends with a line break.

Contributor

Thanks for the info. Then please change to:

pipelineListInfo.append("No pipelines in cluster.")
    .append(System.getProperty("line.separator"));

due to the recently added PMD rule (HDDS-14934).

(BTW, I think using \n would be fine for now. It is already used in a few places instead of System.getProperty("line.separator"), both in this file and in 122 others.)

Comment on lines +398 to +401
assertTrue(output.contains("Failed volume"));
assertTrue(output.contains("/data/disk2"));
assertTrue(output.contains("/data/disk1"));
assertTrue(output.contains("/data/disk5"));

Contributor

nit: please use assertThat(output).contains(...) for a better failure message (see HDDS-9951)
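
To illustrate why the richer assertion helps, here is a plain-Java sketch with no test library. AssertionMessageSketch, assertTrueOnly and assertContains are hypothetical stand-ins; the messages only approximate what a boolean assertion and a content-aware assertion report, and are not the exact text produced by JUnit or AssertJ.

```java
public class AssertionMessageSketch {

  /** Stand-in for a boolean assertion: the failure says nothing about the output. */
  public static void assertTrueOnly(boolean condition) {
    if (!condition) {
      throw new AssertionError("expected: <true> but was: <false>");
    }
  }

  /** Stand-in for a content-aware assertion: the failure includes the actual text. */
  public static void assertContains(String actual, String expected) {
    if (!actual.contains(expected)) {
      throw new AssertionError(
          "Expecting:\n  \"" + actual + "\"\nto contain:\n  \"" + expected + "\"");
    }
  }

  public static void main(String[] args) {
    String output = "Failed volumes:\n  /data/disk1\n";
    try {
      assertTrueOnly(output.contains("/data/disk2"));
    } catch (AssertionError e) {
      System.out.println("boolean check: " + e.getMessage());
    }
    try {
      assertContains(output, "/data/disk2");
    } catch (AssertionError e) {
      // This message shows the real CLI output, so the mismatch is visible at a glance.
      System.out.println("content check: " + e.getMessage());
    }
  }
}
```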

Contributor Author

Updated.

Comment on lines +69 to +70
@CommandLine.Option(names = {"--failed-volumes"},
description = "Only show datanodes that have at least one failed volume.",

Contributor

Would --with-failed-volume be better?

Contributor Author

How about --show-failed-volumes? --with-failed-volume seems to mean "display along with bad disk information".

Contributor

--show-failed-volumes implies failed volumes are normally hidden, and included in the output only if this flag is used.

I think the command

ozone admin datanode list --with-failed-volume

reads better.

If we want to be extra clear, it can be --nodes-with-failed-volume.

Contributor Author

Updated to --nodes-with-failed-volume.

@adoroszlai changed the title from "HDDS-14990. Dispaly the failed Disk detail info in the command and Metrics" to "HDDS-14990. Let ozone admin datanode list show failed volumes and add metrics" on Apr 9, 2026
@adoroszlai changed the title from "HDDS-14990. Let ozone admin datanode list show failed volumes and add metrics" to "HDDS-14990. Show failed volumes in ozone admin datanode list output and SCM metrics" on Apr 9, 2026

@adoroszlai left a comment

Thanks @xichen01 for updating the patch. Let's wait to see if others have comments.

@adoroszlai

TestReconAndAdminContainerCLI is flaky (HDDS-11128), no need to re-run.

@adoroszlai requested a review from sarvekshayr on April 9, 2026 18:29

@sreejasahithi left a comment

Thanks @xichen01 for working on this.

Comment on lines +68 to +72
@CommandLine.Option(names = {"--nodes-with-failed-volumes"},
description = "Only show datanodes that have at least one failed volume.",
defaultValue = "false")
private boolean nodeWithFailedVolumes;

@sreejasahithi Apr 10, 2026

The --nodes-with-failed-volumes filter is silently ignored when --node-id is used.
We should make these options mutually exclusive to avoid confusion when a user provides both in the command but receives the result for the node whose ID was specified, regardless of whether it has failed volumes.
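
One way to enforce this is an explicit check before the command runs, sketched below in plain Java. ExclusiveOptionsSketch and validate are hypothetical names, and the sketch assumes the node ID is null or empty when --node-id is not given. Picocli's @ArgGroup(exclusive = true) may be a more declarative alternative, though how it would fit the existing option layout has not been checked here.

```java
public class ExclusiveOptionsSketch {

  /**
   * Rejects the combination of a node ID with the failed-volume filter.
   * nodeId is assumed to be null or empty when --node-id was not given.
   */
  public static void validate(String nodeId, boolean nodesWithFailedVolumes) {
    boolean hasNodeId = nodeId != null && !nodeId.isEmpty();
    if (hasNodeId && nodesWithFailedVolumes) {
      throw new IllegalArgumentException(
          "--node-id and --nodes-with-failed-volumes are mutually exclusive");
    }
  }

  public static void main(String[] args) {
    // Either option alone is accepted.
    validate("d81a7da8-86d4-42ca-83e9-7015401ed11c", false);
    validate(null, true);
    try {
      // Both together fail fast instead of silently ignoring the filter.
      validate("d81a7da8-86d4-42ca-83e9-7015401ed11c", true);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```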
