Default to Balanced durability mode in cluster mode #4885

Merged: AhmedSoliman merged 1 commit into main from pr4885 on Jun 4, 2026
Member

@AhmedSoliman AhmedSoliman commented Jun 4, 2026

When a cluster runs without a snapshot repository, we now default to the Balanced durability mode.
We decide whether we are in cluster mode based on the number of workers in the nodes configuration and on
whether partition replication is `> 1`. In that case we automatically disable trimming and `WARN` the user every
minute that they need to configure a snapshot store. This is safer than the previous default, which kept trimming
enabled (to save space and to reduce the risk of historical nodeset decay) but meant that partitions would fail to
move to other nodes because no snapshot was available.
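
As a rough illustration of the decision described above, the following Rust sketch models it with hypothetical types: `ClusterFacts`, `Defaults`, and `resolve_defaults` are illustrative names, not identifiers from this PR. It assumes that either signal (more than one worker, or partition replication `> 1`) is enough to count as cluster mode, and it takes the non-cluster default as a parameter because the description does not spell it out.

```rust
// Hypothetical model of the defaulting logic; not the actual server types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DurabilityMode {
    /// The previous default mentioned in the review below.
    ReplicaSetOnly,
    /// The new default for clusters without a snapshot repository.
    Balanced,
}

/// Inputs to the decision (assumed shape).
struct ClusterFacts {
    worker_count: usize,
    partition_replication: usize,
    has_snapshot_repository: bool,
}

#[derive(Debug, PartialEq, Eq)]
struct Defaults {
    durability_mode: DurabilityMode,
    trimming_enabled: bool,
    /// When true, the caller would log a WARN periodically (every minute per the PR).
    warn_missing_snapshot_store: bool,
}

/// Assumption: either signal alone marks the deployment as a cluster.
fn is_cluster_mode(facts: &ClusterFacts) -> bool {
    facts.worker_count > 1 || facts.partition_replication > 1
}

fn resolve_defaults(facts: &ClusterFacts, fallback: DurabilityMode) -> Defaults {
    if is_cluster_mode(facts) && !facts.has_snapshot_repository {
        // Cluster without a snapshot store: stop trimming and nag the operator.
        Defaults {
            durability_mode: DurabilityMode::Balanced,
            trimming_enabled: false,
            warn_missing_snapshot_store: true,
        }
    } else {
        // Assumed: everything else keeps the caller-supplied default with trimming on.
        Defaults {
            durability_mode: fallback,
            trimming_enabled: true,
            warn_missing_snapshot_store: false,
        }
    }
}
```

Passing the fallback in explicitly keeps the sketch from guessing what single-node deployments or clusters with a snapshot repository actually use.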

Bonus: This takes a couple of small steps toward fully removing the legacy "everywhere" replication from the
partition table. A TODO is left for a future proactive migration of existing clusters with old partition table metadata.

Fixes #4824


Stack created with Sapling. Best reviewed with ReviewStack.

@AhmedSoliman AhmedSoliman force-pushed the pr4885 branch 2 times, most recently from 4c9d612 to 28f67e2 Compare June 4, 2026 10:00
@AhmedSoliman
Member Author

@codex review

@AhmedSoliman AhmedSoliman marked this pull request as ready for review June 4, 2026 10:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 28f67e2693


Comment on lines +370 to +373
    partition_replication: Some(
        partition_table
            .replication_property(&Metadata::with_current(|m| m.nodes_config_ref()))
            .into(),

P2: Preserve `Everywhere` on configuration reads

For clusters that still have legacy `PartitionReplication::Everywhere` metadata, this now returns a concrete replication property instead of omitting the field. If a client reads the cluster configuration and writes it back unchanged (for example, while changing only `bifrost_provider`), `update_cluster_configuration` treats any `Some` value that is not already `Limit(current)` as a request to set `Limit` (`crates/admin/src/cluster_controller/service.rs:699-701`). Previously, the protobuf conversion encoded `Everywhere` as `None` (`crates/types/src/protobuf.rs:134-140`), which preserved the legacy dynamic behavior. The round-trip therefore freezes replication to the current worker count, so workers added later are no longer automatically included as they were under `Everywhere`.
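
The round-trip concern can be sketched in a few lines of Rust. This is a simplified, hypothetical model: the replication property is reduced to a plain `u32`, and `encode_old`, `encode_new`, and `apply_update` are illustrative stand-ins for the read and write paths named in the comment, not the real functions.

```rust
// Simplified, hypothetical model of the read/write round trip.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PartitionReplication {
    /// Legacy mode: replicate on every worker, including ones added later.
    Everywhere,
    /// Fixed replication factor (reduced to a plain number here).
    Limit(u32),
}

/// Old read path: `Everywhere` is reported as an absent field.
fn encode_old(stored: PartitionReplication) -> Option<u32> {
    match stored {
        PartitionReplication::Everywhere => None,
        PartitionReplication::Limit(n) => Some(n),
    }
}

/// New read path: always reports a concrete value, resolving `Everywhere`
/// against the current worker count.
fn encode_new(stored: PartitionReplication, worker_count: u32) -> Option<u32> {
    Some(match stored {
        PartitionReplication::Everywhere => worker_count,
        PartitionReplication::Limit(n) => n,
    })
}

/// Write path: any `Some` value that is not already the stored `Limit`
/// is treated as a request to set `Limit`.
fn apply_update(stored: PartitionReplication, requested: Option<u32>) -> PartitionReplication {
    match requested {
        Some(n) if stored != PartitionReplication::Limit(n) => PartitionReplication::Limit(n),
        _ => stored,
    }
}
```

Under this model, reading with `encode_old` and writing the result back leaves `Everywhere` untouched, while reading with `encode_new` on a three-worker cluster and writing it back yields `Limit(3)`.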


Member Author


We really want to get rid of the legacy `Everywhere` replication.


github-actions Bot commented Jun 4, 2026

Test Results

  8 files  ±0    8 suites  ±0   4m 56s ⏱️ +7s
 60 tests ±0   60 ✅ ±0  0 💤 ±0  0 ❌ ±0 
267 runs  ±0  267 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 6b47208. ± Comparison against base commit 93b38cc.

♻️ This comment has been updated with latest results.

Contributor

@tillrohrmann tillrohrmann left a comment


Thanks for changing the default DurabilityMode from ReplicaSetOnly to Balanced @AhmedSoliman. LGTM. +1 for merging.

@AhmedSoliman AhmedSoliman merged commit 6b47208 into main Jun 4, 2026
56 of 57 checks passed
@AhmedSoliman AhmedSoliman deleted the pr4885 branch June 4, 2026 13:48
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 4, 2026
Development

Successfully merging this pull request may close these issues.

Verification Compatibility Random failed with a log gap

2 participants