
Fix-6509 Improve collator strength benchmarking performance #7589

Open
Jayant-kernel wants to merge 4 commits into unicode-org:main from Jayant-kernel:fix-6509-collator-strength


@Jayant-kernel
Contributor

Jayant-kernel commented Feb 5, 2026

Fixes #6509

The Problem

Right now, the collator benchmarks run through all 5 strength levels (Primary, Secondary, Tertiary, Quaternary, Identical) on every test. Our test data (like the Polish and Latin names) only really differs at the Primary level, so each benchmark runs essentially the same comparison 5 times and gets the same result each time. This makes the benchmark suite slow without telling us anything useful about the higher strength levels.

What I Changed

1. Made existing benchmarks faster

  • Updated general benchmarks to use just tertiary strength (the default configuration) instead of all 5 levels
  • Each of these benchmarks used to run once per strength level, so running a single level should cut their runtime to roughly one fifth
  • I kept the all_strength array definition in the code (in case it's needed elsewhere), but it's no longer used in the benchmark loops
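
To illustrate the loop change described above, here is a minimal sketch using only the standard library. The `Strength` enum, `run_benchmark`, and both `bench_*` functions are hypothetical stand-ins, not the actual `icu_collator` benchmark code. The stand-in comparator ignores the strength entirely, which mirrors the situation where the data only differs at the Primary level.

```rust
/// Hypothetical mirror of the five strength levels named in this PR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strength {
    Primary,
    Secondary,
    Tertiary,
    Quaternary,
    Identical,
}

const ALL_STRENGTHS: [Strength; 5] = [
    Strength::Primary,
    Strength::Secondary,
    Strength::Tertiary,
    Strength::Quaternary,
    Strength::Identical,
];

/// Stand-in for one benchmark run (assumption: plain `str` ordering
/// replaces the real collator). Returns how many comparisons the sort made.
fn run_benchmark(data: &[&str], _strength: Strength) -> usize {
    let mut calls = 0;
    let mut names = data.to_vec();
    names.sort_by(|a, b| {
        calls += 1;
        a.cmp(b)
    });
    calls
}

/// Old shape: the same data is sorted once per strength level.
fn bench_all_strengths(data: &[&str]) -> usize {
    ALL_STRENGTHS
        .iter()
        .map(|&strength| run_benchmark(data, strength))
        .sum()
}

/// New shape: a single run at the default (tertiary) strength.
fn bench_tertiary_only(data: &[&str]) -> usize {
    run_benchmark(data, Strength::Tertiary)
}
```

Under these assumptions the old loop performs exactly five times as many comparisons as the new one, which is where the "roughly one fifth" estimate comes from. Real timings also include setup cost and may not scale as cleanly.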

2. Added realistic tests for higher strengths
Created three new test data files where the different strength levels actually matter:

  • TestNames_Secondary.txt – Names with accent differences (José vs Jose, Café vs Cafe, René vs Rene)
  • TestNames_Tertiary.txt – Names with case differences (McDonald vs MCDONALD, banana vs Banana)
  • TestNames_Quaternary.txt – Names with punctuation differences (can't vs cant, co-op vs coop, e-mail vs email)

Each file has its own dedicated benchmark that runs at the appropriate strength level, so we're actually testing scenarios where those strength differences matter.
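
As a rough illustration of why these files exercise different levels, the sketch below implements a toy multi-level comparison with the standard library only. It is not the Unicode Collation Algorithm and not the `icu_collator` API: `toy_compare`, `weights`, and the three-accent mapping are hypothetical, and quaternary (punctuation) handling is deliberately left out.

```rust
use std::cmp::Ordering;

/// Hypothetical subset of strength levels for the toy model.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strength {
    Primary,
    Secondary,
    Tertiary,
}

/// Toy weights per character: (base letter, accent flag, uppercase flag).
/// Only three accented letters are mapped; a real collator uses full tables.
fn weights(c: char) -> (char, u8, u8) {
    let tertiary = u8::from(c.is_uppercase());
    let lower = c.to_lowercase().next().unwrap_or(c);
    let (base, secondary) = match lower {
        '\u{e9}' => ('e', 1), // é
        '\u{e1}' => ('a', 1), // á
        '\u{f3}' => ('o', 1), // ó
        other => (other, 0),
    };
    (base, secondary, tertiary)
}

/// Compare level by level, stopping at the requested strength.
fn toy_compare(a: &str, b: &str, strength: Strength) -> Ordering {
    let wa: Vec<_> = a.chars().map(weights).collect();
    let wb: Vec<_> = b.chars().map(weights).collect();

    // Level 1: base letters only.
    let primary = wa.iter().map(|w| w.0).cmp(wb.iter().map(|w| w.0));
    if primary != Ordering::Equal || strength == Strength::Primary {
        return primary;
    }
    // Level 2: accents, consulted only when the base letters tie.
    let secondary = wa.iter().map(|w| w.1).cmp(wb.iter().map(|w| w.1));
    if secondary != Ordering::Equal || strength == Strength::Secondary {
        return secondary;
    }
    // Level 3: case, consulted only when letters and accents tie.
    wa.iter().map(|w| w.2).cmp(wb.iter().map(|w| w.2))
}
```

In this model, "Jose" and "José" tie at `Primary` and separate at `Secondary`. "banana" and "Banana" tie through `Secondary` and separate only at `Tertiary`. A pair such as "Adam" and "Ewa" is decided at the first level, so raising the strength never changes the result.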

Testing

  • Ran cargo bench -p icu_collator locally and everything compiles and runs correctly
  • The compiler shows all_strength as an unused variable, which confirms our changes are working as intended
  • The benchmark suite runs noticeably faster while providing better coverage of the different strength levels

Compress short benchmark entries to single lines per cargo fmt
Renamed all_strength to _all_strength to satisfy clippy deny(warnings)
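
For context on the rename, a leading underscore tells the Rust compiler that a binding is intentionally unused, so the `unused_variables` lint stays quiet even when warnings are denied. A minimal sketch, with a hypothetical function name and string stand-ins for the strength values:

```rust
// Denying the lint here makes an unused `all_strength` a hard error,
// similar in effect to building with warnings denied.
#[deny(unused_variables)]
fn benchmark_strengths() -> Vec<&'static str> {
    // Kept for reference but no longer iterated; the underscore prefix
    // marks it as intentionally unused.
    let _all_strength = ["Primary", "Secondary", "Tertiary", "Quaternary", "Identical"];

    // Only the default strength is benchmarked now.
    vec!["Tertiary"]
}
```

Deleting the array outright would also remove the warning. Keeping it under an underscore name is the option this PR takes.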
@Jayant-kernel
Contributor Author

@hsivonen @sffc @echeran @Manishearth @robertbastian
Please review the solution when you are free.

@hsivonen
Member

Please don't cause GitHub to re-send notification emails.


Development

Successfully merging this pull request may close these issues.

Improve collator strength benchmarking

2 participants