Skip to content

HDDS-13133. Display Ratis state machine event timeline in OM web UI#10034

Draft
jojochuang wants to merge 6 commits intoapache:masterfrom
jojochuang:HDDS-13133
Draft

HDDS-13133. Display Ratis state machine event timeline in OM web UI#10034
jojochuang wants to merge 6 commits intoapache:masterfrom
jojochuang:HDDS-13133

Conversation

@jojochuang
Copy link
Copy Markdown
Contributor

@jojochuang jojochuang commented Apr 2, 2026

What changes were proposed in this pull request?

HDDS-13133. Display Ratis state machine event timeline in OM web UI

Please describe your PR in detail:

  • Add a new dashboard to OM web UI to show Ratis state machine events in chronological order.
  • At every Ratis state machine notification callback, record the notification messages and keep the last 100 of them.
  • A simple front end UI to show the state machine transitions.
  • Gemini CLI model gemini-3-flash-preview

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13133

How was this patch tested?

Unit tests to ensure OM and SCM state machine events are recorded.
Visited OM and SCM web UI to check for the output.

Screenshot 2026-04-02 at 2 50 23 PM Screenshot 2026-04-02 at 3 12 54 PM

Change-Id: I4febee20c124e4c738b8f646f1e347844d2b7346
… web UI

Change-Id: If8d7504ade66706ba1feb5fe05cd2def14c1ec5c
Change-Id: I7164f2d08a0c1ed6a8bf00ada697672f180bd642
Change-Id: Ia568c9ce99631d2f780a4d833d357748e51fa34b
Change-Id: Ib27d985d730869e99e38c27394fd858581d5b756
Change-Id: I025a3d1aab863c766f584b162931b9ec56bc79e0
@jojochuang jojochuang requested a review from szetszwo April 2, 2026 22:13
Copy link
Copy Markdown
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jojochuang , thanks for working on this! Reviewed the SCM part. Please see the comments inlined.

BTW, we may consider adding Ratis support for this feature:

  • Ratis may publish the events and store them in a queue.
  • Ozone may subscribe it and also query the queue.

Then, all Ozone components (OM, SCM, Datanodes) only have to display the event in the UI. We don't have to repeat the metrics code. What do you think?

SCMMetrics.class.getSimpleName();

private final List<String> ratisEvents = new ArrayList<>();
private static final int MAX_RATIS_EVENTS = 100;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make it configurable?

Comment on lines +218 to +219
SCMMetrics metrics = StorageContainerManager.getMetrics();
if (metrics != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that the returned metrics is never null. Add a field and a helper method.

  private final SCMMetrics metrics = StorageContainerManager.getMetrics();
  private void addRatisEvent(String eventMessage) {
    LOG.info(eventMessage);
    metrics.addRatisEvent(eventMessage);
  }

public void addRatisEvent(String event) {
synchronized (ratisEvents) {
if (ratisEvents.size() >= MAX_RATIS_EVENTS) {
ratisEvents.remove(0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use LinkedList and removeFirst() to avoid array copying.

@Metric("Ratis state machine events")
public String getRatisEvents() {
synchronized (ratisEvents) {
return String.join("\n", ratisEvents);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Should the metrics publish only the new events?
  • How often this method is involved by the metrics system? It could be inefficient if it keep creating the same, long string again and again.

LOG.info("current leader SCM steps down.");
SCMMetrics metrics = StorageContainerManager.getMetrics();
if (metrics != null) {
metrics.addRatisEvent("current leader SCM steps down.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include the id.

    addRatisEvent("This server " + getId() + " has stepped down from the Leader.");

Comment on lines 312 to +314
LOG.info("leader changed, yet current SCM is still follower.");
if (metrics != null) {
metrics.addRatisEvent("Leader changed to " + newLeaderId + ", yet current SCM is still follower.");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's include the id of this server. The log message and the

      addRatisEvent("Leader changed to " + newLeaderId
          + ", this server " + groupMemberId.getPeerId() + " remains a Follower.");

Comment on lines +440 to +442
"New peers " + newPeerIds +
(newListenersIds.isEmpty() ? "" : ", new listeners " + newListenersIds) +
" added at term index (" +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getPeersList() returns the current peers in the conf. It include both the newly added peers and the existing peers (unless they are removed).

    addRatisEvent("notifyConfigurationChanged at " + TermIndex.valueOf(term, index)
        + ": " + TextFormat.shortDebugString(newRaftConfiguration));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants