Operator record leader count before upgrade is inaccurate #6540

@mayjiang0203

Feature Request

Is your feature request related to a problem? Please describe:

Describe the feature you'd like:

For v1, once the PD /ready interface confirms that PD is ready, the operator starts evicting leaders from the TiKV store. Before doing so, it retrieves the store's current leader count from PD. However, although PD has completed loadRegion at that point, the leader count may not yet be up to date, because region heartbeats take some time to arrive and refresh it.

As a result, the TiKV leader count recorded at that moment may be inaccurate, often significantly lower than the actual number. Because the recorded count is used to determine when rebalancing has finished, an underestimated value can cause the next TiKV node to restart earlier than intended. That premature restart can then trigger an excessive number of leader transfers during the second TiKV restart, which in turn increases CDC lag.
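To make the failure mode concrete, here is a minimal Go sketch. It assumes the operator treats a restarted store as rebalanced once its current leader count reaches the recorded one. `rebalanceDone` and all the numbers are hypothetical and are not taken from the operator's code.

```go
package main

import "fmt"

// rebalanceDone is a hypothetical stand-in for the operator's check:
// a restarted store counts as rebalanced once its current leader count
// reaches the count that was recorded before eviction.
func rebalanceDone(recorded, current int) bool {
	return current >= recorded
}

func main() {
	// Illustrative numbers only.
	const actual = 1000      // leaders the store really held before eviction
	const stale = 200        // count PD reported before heartbeats caught up
	const afterRestart = 250 // leaders moved back shortly after the restart

	// With the stale value the check passes almost immediately,
	// so the next TiKV node would be restarted too early.
	fmt.Println(rebalanceDone(stale, afterRestart)) // true

	// With the accurate value the operator would keep waiting.
	fmt.Println(rebalanceDone(actual, afterRestart)) // false
}
```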

It would be helpful to support a new leader-ready API. This could perhaps be built on the new API interface in PD: tikv/pd#9852
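As a rough sketch of how the operator might use such an API, the Go code below gates the recording step on a leader-ready signal. The `pdClient` interface and its `LeaderReady` and `LeaderCount` methods, `recordLeaderCount`, and the `fakePD` test double are all hypothetical. They do not describe the real PD client or the interface proposed in tikv/pd#9852.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pdClient is a hypothetical interface, not the real PD client.
type pdClient interface {
	LeaderReady() (bool, error)              // assumed leader-ready signal
	LeaderCount(storeID uint64) (int, error) // assumed per-store leader count
}

var errNotReady = errors.New("leader counts not ready before timeout")

// recordLeaderCount polls the assumed leader-ready signal and reads the
// leader count only after PD reports that the counts are up to date.
func recordLeaderCount(c pdClient, storeID uint64, interval time.Duration, attempts int) (int, error) {
	for i := 0; i < attempts; i++ {
		ready, err := c.LeaderReady()
		if err != nil {
			return 0, err
		}
		if ready {
			return c.LeaderCount(storeID)
		}
		time.Sleep(interval)
	}
	return 0, errNotReady
}

// fakePD simulates a PD whose leader count is stale for the first few polls.
type fakePD struct {
	polls, readyAfter, stale, actual int
}

func (f *fakePD) LeaderReady() (bool, error) {
	f.polls++
	return f.polls > f.readyAfter, nil
}

func (f *fakePD) LeaderCount(uint64) (int, error) {
	if f.polls > f.readyAfter {
		return f.actual, nil
	}
	return f.stale, nil
}

func main() {
	pd := &fakePD{readyAfter: 3, stale: 200, actual: 1000}
	n, err := recordLeaderCount(pd, 1, time.Millisecond, 10)
	fmt.Println(n, err) // 1000 <nil>
}
```

With a gate like this, the operator would record the leader count only after PD signals that heartbeats have caught up, rather than right after /ready.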

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:
