HDDS-13133. Display Ratis state machine event timeline in OM web UI#10034
HDDS-13133. Display Ratis state machine event timeline in OM web UI#10034jojochuang wants to merge 6 commits intoapache:masterfrom
Conversation
Change-Id: I4febee20c124e4c738b8f646f1e347844d2b7346
… web UI Change-Id: If8d7504ade66706ba1feb5fe05cd2def14c1ec5c
Change-Id: I7164f2d08a0c1ed6a8bf00ada697672f180bd642
Change-Id: Ia568c9ce99631d2f780a4d833d357748e51fa34b
Change-Id: I025a3d1aab863c766f584b162931b9ec56bc79e0
There was a problem hiding this comment.
@jojochuang , thanks for working on this! Reviewed the SCM part. Please see the comments inlined.
BTW, we may consider adding Ratis support for this feature:
- Ratis may publish the events and store them in a queue.
- Ozone may subscribe it and also query the queue.
Then, all Ozone components (OM, SCM, Datanodes) only have to display the event in the UI. We don't have to repeat the metrics code. What do you think?
| SCMMetrics.class.getSimpleName(); | ||
|
|
||
| private final List<String> ratisEvents = new ArrayList<>(); | ||
| private static final int MAX_RATIS_EVENTS = 100; |
There was a problem hiding this comment.
Let's make it configurable?
| SCMMetrics metrics = StorageContainerManager.getMetrics(); | ||
| if (metrics != null) { |
There was a problem hiding this comment.
Seems that the returned metrics is never null. Add a field and a helper method.
private final SCMMetrics metrics = StorageContainerManager.getMetrics(); private void addRatisEvent(String eventMessage) {
LOG.info(eventMessage);
metrics.addRatisEvent(eventMessage);
}| public void addRatisEvent(String event) { | ||
| synchronized (ratisEvents) { | ||
| if (ratisEvents.size() >= MAX_RATIS_EVENTS) { | ||
| ratisEvents.remove(0); |
There was a problem hiding this comment.
Use LinkedList and removeFirst() to avoid array copying.
| @Metric("Ratis state machine events") | ||
| public String getRatisEvents() { | ||
| synchronized (ratisEvents) { | ||
| return String.join("\n", ratisEvents); |
There was a problem hiding this comment.
- Should the metrics publish only the new events?
- How often this method is involved by the metrics system? It could be inefficient if it keep creating the same, long string again and again.
| LOG.info("current leader SCM steps down."); | ||
| SCMMetrics metrics = StorageContainerManager.getMetrics(); | ||
| if (metrics != null) { | ||
| metrics.addRatisEvent("current leader SCM steps down."); |
There was a problem hiding this comment.
Let's include the id.
addRatisEvent("This server " + getId() + " has stepped down from the Leader.");| LOG.info("leader changed, yet current SCM is still follower."); | ||
| if (metrics != null) { | ||
| metrics.addRatisEvent("Leader changed to " + newLeaderId + ", yet current SCM is still follower."); |
There was a problem hiding this comment.
Let's include the id of this server. The log message and the
addRatisEvent("Leader changed to " + newLeaderId
+ ", this server " + groupMemberId.getPeerId() + " remains a Follower.");| "New peers " + newPeerIds + | ||
| (newListenersIds.isEmpty() ? "" : ", new listeners " + newListenersIds) + | ||
| " added at term index (" + |
There was a problem hiding this comment.
getPeersList() returns the current peers in the conf. It include both the newly added peers and the existing peers (unless they are removed).
addRatisEvent("notifyConfigurationChanged at " + TermIndex.valueOf(term, index)
+ ": " + TextFormat.shortDebugString(newRaftConfiguration));
What changes were proposed in this pull request?
HDDS-13133. Display Ratis state machine event timeline in OM web UI
Please describe your PR in detail:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13133
How was this patch tested?
Unit tests to ensure OM and SCM state machine events are recorded.
Visited OM and SCM web UI to check for the output.