Skip to content

[ARO-25354] Monitor fixes for races#4702

Open
hawkowl wants to merge 7 commits intomasterfrom
hawkowl/ARO-25186-2
Open

[ARO-25354] Monitor fixes for races#4702
hawkowl wants to merge 7 commits intomasterfrom
hawkowl/ARO-25186-2

Conversation

@hawkowl
Copy link
Copy Markdown
Collaborator

@hawkowl hawkowl commented Mar 20, 2026

Fixes ARO-25354

Reduces test races in the Monitor and reduces flakes.

@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@github-actions github-actions bot added the needs-rebase branch needs a rebase label Mar 20, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

@hawkowl hawkowl force-pushed the hawkowl/ARO-25186-2 branch from 9b0357e to 0c3abcb Compare March 24, 2026 04:19
@github-actions github-actions bot added needs-rebase branch needs a rebase and removed needs-rebase branch needs a rebase labels Mar 24, 2026
@github-actions
Copy link
Copy Markdown

Please rebase pull request.

func (mon *clusterChangeFeedResponder) deleteDoc(doc *api.OpenShiftClusterDocument) {
v, loaded := mon.docs.LoadAndDelete(doc.ID)
if loaded {
close(v.stop)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we guard this close for nil? For non-owned buckets, fixDoc can leave v.stop == nil (no worker started), but deleteDoc can still be reached for that cached doc, and close(nil) would panic. The previous implementation only closed when non-nil. Related context:

ARO-RP/pkg/monitor/cache.go

Lines 150 to 160 in 0c3abcb

func (mon *clusterChangeFeedResponder) fixDoc(v *cacheDoc) {
_, ours := mon.buckets[v.doc.Bucket]
if !ours && v.stop != nil {
close(v.stop)
v.stop = nil
} else if ours && v.stop == nil {
ch := make(chan struct{})
v.stop = ch
go mon.newWorker(ch, v.doc.ID)
}

@hawkowl hawkowl changed the title [WIP] Monitor fixes for races [ARO-25354] Monitor fixes for races Apr 7, 2026
@hawkowl hawkowl force-pushed the hawkowl/ARO-25186-2 branch from 0c3abcb to 0adea16 Compare April 7, 2026 01:04
@github-actions github-actions bot removed the needs-rebase branch needs a rebase label Apr 7, 2026
@hawkowl hawkowl marked this pull request as ready for review April 7, 2026 04:53
Copilot AI review requested due to automatic review settings April 7, 2026 04:53
@hawkowl hawkowl added bug Something isn't working ready-for-review flake go Pull requests that update Go code skippy pull requests raised by member of Team Skippy labels Apr 7, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses ARO-25354 by reducing race conditions and flakiness in Monitor-related changefeed processing, primarily by tracking “processed” vs “data updated” timestamps and by refactoring the monitor’s cluster cache to use a concurrent map.

Changes:

  • Extend the generic changefeed consumer contract to report whether any documents were retrieved in the current polling iteration.
  • Add “last data update” timestamps to changefeed-backed caches and update tests to wait on those signals instead of fake iterator consumption.
  • Refactor Monitor’s cluster cache into a dedicated changefeed responder backed by xsync.Map.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
pkg/util/changefeed/subscriptioncache.go Adds GetLastDataUpdate and tracks processed vs data-update timestamps.
pkg/util/changefeed/subscriptioncache_test.go Replaces iterator-consumption waits with timestamp-based waits.
pkg/util/changefeed/changefeed.go Changes consumer callback to OnAllPendingProcessed(bool) and plumbs “gotAny” through.
pkg/util/changefeed/changefeed_test.go Updates fake responder to new callback signature.
pkg/monitor/worker.go Introduces clusterChangeFeedResponder cache/responder and updates worker to read from it.
pkg/monitor/worker_test.go Updates changefeed wait conditions to timestamp-based signals.
pkg/monitor/monitor.go Wires the new cluster responder into Monitor and readiness checks.
pkg/monitor/monitor_test.go Updates bucket assertions to use the new responder state.
pkg/monitor/cache.go Moves cache doc lifecycle (upsert/delete/fix) onto the new responder with xsync.Map.
pkg/monitor/cache_test.go Updates cache tests to use the new responder and map.
pkg/mimo/scheduler/clustercache.go Aligns MIMO cluster cache changefeed timestamp tracking with the new “processed vs updated” split.
pkg/database/cosmosdb/subscriptions_ext.go Removes fake client AllIteratorsConsumed helper (tests no longer rely on it).
pkg/database/cosmosdb/openshiftcluster_ext.go Removes fake client AllIteratorsConsumed helper (tests no longer rely on it).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hawkowl hawkowl force-pushed the hawkowl/ARO-25186-2 branch from 0adea16 to a1a2bdb Compare April 8, 2026 09:54
Copilot AI review requested due to automatic review settings April 8, 2026 11:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hawkowl
Copy link
Copy Markdown
Collaborator Author

hawkowl commented Apr 9, 2026

/azp run ci

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working flake go Pull requests that update Go code ready-for-review skippy pull requests raised by member of Team Skippy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants