CodePointTrie support for normalizer and collator perf improvements by hsivonen · Pull Request #7768 · unicode-org/icu4x

hsivonen · 2026-03-11T16:35:15Z

Split out of #7526 and #7600. The code here needs to be published to crates.io before those changes can land, because the utf8_iter and utf16_iter crates need to depend on a version of icu_collections that has this code on crates.io.

Changelog

icu_collections: Add CodePointTrie getters for fusing lookup into iterating over text: getters by Latin1, ASCII, two-byte UTF-8, and three-byte UTF-8.

New methods: get8(), get7(), get_utf8_two_byte(), get_utf8_three_byte() on CodePointTrie and TypedCodePointTrie
New trait: AbstractCodePointTrie

icu_collections: Serde and databake support for typed CodePointTries

New impls: databake::Bake, BakeSize, serde::Serialize, serde::Deserialize for FastCodePointTrie and SmallCodePointTrie

icu_collections: Iterators by char and TrieValue pairs for Latin1, str, and delegate iterator over char.

New types: CharIndicesWithTrie, CharIndicesWithTrieDefaultForAscii, CharIterWithTrie, CharsWithTrie, CharsWithTrieDefaultForAscii, CharsWithTrieEx, Latin1CharIndicesWithTrie, Latin1CharsWithTrie
New trait: WithTrie
New extension traits on str: CharsWithTrieDefaultForAsciiEx, Latin1CharsWithTrieEx

Manishearth

Will take some time to properly review

Manishearth · 2026-03-11T16:53:15Z

+    }
+
+    #[inline(always)]
+    unsafe fn get_bit_prefix_suffix_assuming_fast_index(


issue: please document safety invariants (even if it's obvious from the function name)

edit: it's not; because there are invariants on bit_prefix and bit_suffix.

We should document this and ensure it's upheld by the callers.

Manishearth · 2026-03-11T16:56:18Z

+    pub unsafe fn get7(&self, ascii: u8) -> T {
+        debug_assert!(ascii < 128);
+        debug_assert!((ascii as usize) < self.data.len());
+        // SAFETY: Length of `self.data` checked in the constructor.


issue: We may add more ctors in the future. This should reference the safety invariants on data: just say // SAFETY: Allowed by data's safety invariant, update data's invariant to require that it has at least 128 elements, and update the constructor validation to say something like // data safety invariant upheld here
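A minimal sketch of the pattern requested here, assuming a toy `AsciiTable` type rather than the real `CodePointTrie` (the type, `try_new`, and the field layout are hypothetical): the invariant is documented on the field, the constructor notes where it is upheld, and the unsafe getter cites the field invariant instead of a specific constructor.

```rust
/// Toy container for illustration only; not the ICU4X `CodePointTrie`.
struct AsciiTable<T> {
    /// # Safety invariant
    ///
    /// `data.len() >= 128`. Unsafe code in this type relies on this.
    data: Vec<T>,
}

impl<T: Copy> AsciiTable<T> {
    /// Returns `None` if `data` is too short to cover the ASCII range.
    fn try_new(data: Vec<T>) -> Option<Self> {
        if data.len() < 128 {
            return None;
        }
        // `data` safety invariant upheld here: length checked above.
        Some(Self { data })
    }

    /// # Safety
    ///
    /// `ascii` must be less than 128.
    unsafe fn get7(&self, ascii: u8) -> T {
        debug_assert!(ascii < 128);
        // SAFETY: Allowed by `data`'s safety invariant (at least 128
        // elements) together with the caller's promise that `ascii < 128`.
        unsafe { *self.data.get_unchecked(usize::from(ascii)) }
    }
}
```

With this shape, a future constructor only has to uphold the field invariant; the `SAFETY` comment in `get7` stays valid without being edited.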

Manishearth · 2026-03-11T16:56:47Z

+        debug_assert!(low_six <= 0b111_111); // Safety invariant.
+        debug_assert!(high_five <= 0b11_111); // Safety invariant.
+        debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
+                                        // SAFETY: The highest character representable as a two-byte


nit: maybe introduce a newline so that this formats better

Manishearth · 2026-03-12T01:53:03Z

@@ -0,0 +1,1240 @@
+// This file is part of ICU4X. For terms of use, please see the file


This file is a lot of unsafe code and I'm not convinced it is justified. Can we reduce the amount of unsafe code in this PR by writing these iterators to wrap CharIndices? There will still be unsafe in this file, but it will be around CPT invariants rather than also around UTF8 decoding.

Separately we can try and justify additional unsafe using benchmarks if needed, and that would be a nice scoped PR that can be easily reviewed and benchmarked.

Fusing the trie lookup into UTF-8 decoding is the key point of this changeset: CPT in ICU4C has been designed so that its bit split lines up with the bits in the last UTF-8 trail byte, and we've been using it pessimally in ICU4X.

I guess I will need to port the UTF-16 NFC to NFD throughput benchmark to str and then get exact numbers for the effect here.
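The bit alignment mentioned above can be illustrated with a sketch. It assumes a toy, uncompressed trie (`ToyFastTrie` is hypothetical, not the ICU4X type) with a 6-bit shift, so the index is keyed by `code_point >> 6` and the offset within a data block is `code_point & 0x3F`. For a two-byte UTF-8 sequence those two quantities are exactly the payload bits of the lead byte and the trail byte, so the fused lookup never has to assemble the scalar value.

```rust
/// Toy stand-in for the fast part of a code point trie (illustration only).
/// Assumption: a 6-bit shift, so each index entry covers a 64-value block.
struct ToyFastTrie {
    index: Vec<u16>, // start of the data block for each `code_point >> 6`
    data: Vec<u8>,
}

const SHIFT: u32 = 6;
const MASK: u32 = (1 << SHIFT) - 1;

impl ToyFastTrie {
    /// Builds an uncompressed trie covering U+0000..=U+07FF.
    fn new(value_of: impl Fn(u32) -> u8) -> Self {
        let index = (0..32u16).map(|block| block * 64).collect();
        let data = (0..0x800u32).map(value_of).collect();
        Self { index, data }
    }

    /// Generic lookup: split the scalar value into block and offset.
    fn get(&self, code_point: u32) -> Option<u8> {
        let block = *self.index.get((code_point >> SHIFT) as usize)?;
        let offset = (code_point & MASK) as usize;
        self.data.get(usize::from(block) + offset).copied()
    }

    /// Fused lookup: the lead byte's low five bits are already
    /// `code_point >> 6`, and the trail byte's low six bits are already
    /// `code_point & 0x3F`, so no shift or re-split is needed.
    fn get_two_byte(&self, lead: u8, trail: u8) -> Option<u8> {
        let high_five = usize::from(lead & 0b1_1111);
        let low_six = usize::from(trail & 0b11_1111);
        let block = *self.index.get(high_five)?;
        self.data.get(usize::from(block) + low_six).copied()
    }
}
```

This sketch uses checked indexing throughout; the PR's getters instead rely on documented invariants to skip the bounds checks.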

Hmm, I see. I feel like using CharIndices (especially with its offset function) you might still be able to get the same benefits, but I understand if that was the point of this change.

In that case we should probably have more careful tracking of the invariant on the contained iterator whenever it is advanced.

I don't see how CharIndices would allow avoiding redundant math in the CPT queries.

Manishearth · 2026-03-12T02:00:47Z

+    pub fn get8(&self, latin1: u8) -> T {
+        let code_point = u32::from(latin1);
+        debug_assert!(code_point <= SMALL_TYPE_FAST_INDEXING_MAX);
+        // SAFETY: `u8` is always below `SMALL_TYPE_FAST_INDEXING_MAX` and,


suggestion (non blocking): worth documenting on those two constants that their precise values are extremely safety relevant and relied upon by many different checks in this file

Manishearth · 2026-03-12T02:04:52Z

+        debug_assert!(low_six <= 0b111_111); // Safety invariant.
+        debug_assert!(high_five <= 0b11_111); // Safety invariant.
+        debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
+                                        // SAFETY: The highest character representable as a two-byte


issue: the safety invariants on this function are not currently documented, but once they are, this comment should be in terms of those invariants

Added a line break.

Manishearth · 2026-03-12T02:05:43Z

+    ///
+    /// `low_six` must not have bit positions other than the lowest 6 set to 1.
+    ///
+    /// # Intended Invariant


question: what is this? Is this a non-safety-relevant invariant?

Perhaps explicitly say it is non-safety relevant.

Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>

Manishearth · 2026-03-24T15:58:59Z

Great, thanks for updating all those invariants, this is looking much better! I'll try and finish review today, so we can get this in for 2.2.

Manishearth

Overall this looks good.

I am in favor of landing this for the 2.2 release, which we're hoping to make next week. I have convinced myself there is no code where the safety is worse than before (the code for get_bit_prefix_suffix_assuming_fast_index was already lacking justification), and the newly introduced unsafe code is well done. Further cleanups/documentation can be performed (I can file followups for the main unsafe issues)

It would be nice to have as much of these comments addressed as possible, but we should also try to land this before next week.

Manishearth · 2026-03-24T22:36:42Z

+    ///
+    /// # Safety
+    ///
+    /// `high_ten` must not have bit positions other than the lowest 10 set to 1.


thought: we should consider using different types here, like u16 and u8. We can then as cast.

non blocking

Manishearth · 2026-03-24T22:39:12Z

+    /// # Safety
+    ///
+    /// `ascii` must be less than 128.
+    unsafe fn ascii(&self, ascii: u8) -> T;


question: should these be get_*?

I think they don't have to be, but this is a new public API so we should think about it.

If these are mostly internal facing we should namespace them as abstract_cpt_ascii or something.

+1; this is a scaffolding trait, and we've almost always been disappointed when we give scaffolding trait functions nice names. So many editors are happy to create an import when you type my_cpt.ascii(), but we don't want to make unsafe functions so easy to accidentally stumble upon. Clients should always start with the safe functions by default, and where unsafe functions deliver significant gains, they are available for clients who need them.

(here and elsewhere in this trait)

I deliberately picked non-get naming for these to make it clear without turbofishes what's from the trait and what's not.

These are public in the sense that they need to be visible to the utf8_iter and utf16_iter crates. I think it's not harmful for other code to call through the trait when calling through the trait isn't strictly necessary, so I'd prefer not to obfuscate these for that reason.

The niceness of naming doesn't affect safety: The non-trait get counterparts are safe or unsafe in the same situations as these. The names for the two and three-byte UTF-8 accessors are already rather obscure.

Do I understand correctly that the main issue is that ascii is nice, but ascii is also unsafe (as unsafe as get7)? It would be possible to say utf8_one_byte instead of ascii, but wouldn't that just be weird?

It's very frustrating that simple things remain perma-undecided/unstable in the Rust standard library: ascii could be safe if rust-lang/rust#110998 had been stabilized already.

If you prefix the trait fns with abstract_cpt_, then it is "clear without turbofishes what's from the trait and what's not", and it is more forward-compatible, too, because when rust-lang/rust#110998 eventually lands, you can add a safe version of the fn.
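A sketch of the prefixed-name proposal, assuming a toy sealed trait (`AbstractToyTrie`, `ToyTrie`, and `abstract_cpt_ascii_checked` are hypothetical names, not the PR's API): the unsafe scaffolding method carries the `abstract_cpt_` prefix, which leaves the short name free for a safe method later.

```rust
mod sealed {
    /// Private supertrait: code outside this crate cannot implement it.
    pub trait Seal {}
}

/// Toy trie covering only ASCII (illustration only).
pub struct ToyTrie {
    data: [u8; 128],
}

impl sealed::Seal for ToyTrie {}

pub trait AbstractToyTrie: sealed::Seal {
    /// # Safety
    ///
    /// `ascii` must be less than 128.
    unsafe fn abstract_cpt_ascii(&self, ascii: u8) -> u8;

    /// Hypothetical safe wrapper, showing that the prefix keeps room for
    /// safe additions without renaming the unsafe method.
    fn abstract_cpt_ascii_checked(&self, ascii: u8) -> Option<u8> {
        if ascii < 128 {
            // SAFETY: `ascii < 128` was checked on the line above.
            Some(unsafe { self.abstract_cpt_ascii(ascii) })
        } else {
            None
        }
    }
}

impl AbstractToyTrie for ToyTrie {
    unsafe fn abstract_cpt_ascii(&self, ascii: u8) -> u8 {
        debug_assert!(ascii < 128);
        // SAFETY: the caller promises `ascii < 128`; `data` has 128 elements.
        unsafe { *self.data.get_unchecked(usize::from(ascii)) }
    }
}
```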

Manishearth · 2026-03-24T22:43:21Z

+    }
+}
+
+impl<'slice, 'trie, T, V> Iterator for CharsWithTrieDefaultForAscii<'slice, 'trie, T, V>


question: any hope of sharing code between the CharsWithTrie and CharsWithTrieDefaultForAscii impls?

Seems straightforward: have a type CharsWithTrieWithDefaultHandling that has .next<F>() where it calls f(lead) in the default case.

Do you mean having CharsWithTrieWithDefaultHandling as the inner type of repr(transparent) CharsWithTrie and CharsWithTrieDefaultForAscii?

It seems to me that a macro would be simpler. Would that be OK?
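One possible shape of the shared-inner-type option discussed in this thread, as a sketch only. It assumes a simplified model that works on `char` via `str::Chars` instead of raw UTF-8 bytes, and uses a plain function pointer (`toy_lookup`) in place of a trie; `CharsWithLookup` and `next_with` are hypothetical names. The two public iterators differ only in the closure they pass for the ASCII case.

```rust
use core::str::Chars;

/// Hypothetical stand-in for a trie lookup.
fn toy_lookup(c: char) -> u32 {
    u32::from(c)
}

/// Shared core: the ASCII case is delegated to a caller-supplied closure.
struct CharsWithLookup<'a, V> {
    chars: Chars<'a>,
    lookup: fn(char) -> V,
}

impl<'a, V> CharsWithLookup<'a, V> {
    #[inline]
    fn next_with<F: FnOnce(char) -> V>(&mut self, ascii: F) -> Option<(char, V)> {
        let c = self.chars.next()?;
        let value = if c.is_ascii() { ascii(c) } else { (self.lookup)(c) };
        Some((c, value))
    }
}

/// Looks up every character, including ASCII.
struct CharsWithTrie<'a, V>(CharsWithLookup<'a, V>);

/// Returns `V::default()` for ASCII without consulting the lookup.
struct CharsWithTrieDefaultForAscii<'a, V>(CharsWithLookup<'a, V>);

impl<'a, V> CharsWithTrie<'a, V> {
    fn new(text: &'a str, lookup: fn(char) -> V) -> Self {
        Self(CharsWithLookup { chars: text.chars(), lookup })
    }
}

impl<'a, V> CharsWithTrieDefaultForAscii<'a, V> {
    fn new(text: &'a str, lookup: fn(char) -> V) -> Self {
        Self(CharsWithLookup { chars: text.chars(), lookup })
    }
}

impl<'a, V> Iterator for CharsWithTrie<'a, V> {
    type Item = (char, V);
    fn next(&mut self) -> Option<Self::Item> {
        let lookup = self.0.lookup; // fn pointers are `Copy`
        self.0.next_with(lookup)
    }
}

impl<'a, V: Default> Iterator for CharsWithTrieDefaultForAscii<'a, V> {
    type Item = (char, V);
    fn next(&mut self) -> Option<Self::Item> {
        self.0.next_with(|_| V::default())
    }
}
```

A macro would generate two near-identical bodies instead; the closure version keeps a single decoding body to audit, on the assumption that `next_with` is inlined into both callers.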

Manishearth · 2026-03-24T22:59:25Z

One note: This adds a lot of new APIs, and if we want to land this as new public APIs we will need to make sure they are

This adds:

New iterator types
A trait for abstracting over those iterator types
New get methods on CPT (get7, etc), some of them unsafe
A trait for abstracting over CPT types, with many methods

How much of this is needed by utf8_iter? Can we mark most of these new APIs as unstable for now and still make a release? I think I see a fair number of naming questions that we should spend time on (marked as discuss-priority to see if we can handle them Thursday)

cc @sffc to look at new APIs and naming as well

Manishearth · 2026-03-24T23:01:51Z

+    /// With debug assertions enabled, panics if the above safety invariants are
+    /// violated or `high_five` represents non-shortest form.
+    #[inline(always)]
+    pub unsafe fn get_utf8_two_byte(&self, high_five: u32, low_six: u32) -> T {


This is a new public function. Should these parameters be narrower types?

Manishearth · 2026-03-24T23:13:22Z

+/// Method naming intentionally differs from the method naming on
+/// those types in order to disambiguate.
+#[allow(private_bounds)] // Permit sealing
+pub trait AbstractCodePointTrie<'trie, T: TrieValue>: Seal {


So we already have a TypedCodePointTrie trait. Can we avoid duplicating a similar trait? How much is this needed? I understand that TypedCodePointTrie cannot abstract over CodePointTrie, but what is the use case for abstracting over all three here?

In retrospect, TypedCodePointTrie should have been designed as a trait with a single associated constant that inherited from AbstractCodePointTrie (probably CodePointTrieLike).

The use case for abstracting over all three is to make the code that uses AbstractCodePointTrie work not only in the non-serde config where UTS 46 trie is small and UAX 15 tries are fast but also in the serde case where both are untyped.

Manishearth

Given the large number of new APIs I'm actually going to wait for us to discuss them more. Might not be worth trying to make this land for 2.2, but perhaps we can resolve everything by Thursday.

If there's an MVP set of changes that enable utf8_iter integration we should try for that.

But also I'm open to doing out-of-cycle releases for this. I dislike doing nontrivial API additions in an out of cycle release but ..... I don't want to rush this release either.

sffc

Thanks. Preface: I entrust @Manishearth and others to judge the tradeoffs of the new unsafe abstractions. My comments below are under the assumption that the abstractions are well motivated.

sffc · 2026-03-25T07:51:06Z

+    /// With debug assertions enabled, panics if the above safety invariants are
+    /// violated or `high_five` represents non-shortest form.
+    #[inline(always)]
+    pub unsafe fn get_utf8_two_byte(&self, high_five: u32, low_six: u32) -> T {


Issue, here and below: public functions with confusing invariants are not great to advertise to clients (although if they need to be cross-crate, it's better to have them documented than not). I see that you have these functions on a trait, too; I would prefer if you would keep them only on the trait and not export the concrete fns.

sffc · 2026-03-25T07:53:55Z

+    /// `header.trie_type`, `index`, and `data` must
+    /// satisfy the invariants for the fields of the
+    /// same names on `CodePointTrie`.


Suggestion: I prefer the safety invariant on parts constructors to be more like

Suggested change

/// `header.trie_type`, `index`, and `data` must

/// satisfy the invariants for the fields of the

/// same names on `CodePointTrie`.

/// The parameters must have been returned from `SmallCodePointTrie::to_parts()`

because it is easier to verify this.

sffc · 2026-03-25T08:01:07Z

+    /// # Safety
+    ///
+    /// `ascii` must be less than 128.
+    unsafe fn ascii(&self, ascii: u8) -> T;


+1; this is a scaffolding trait, and we've almost always been disappointed when we give scaffolding trait functions nice names. So many editors are happy to create an import when you type my_cpt.ascii(), but we don't want to make unsafe functions so easy to accidentally stumble upon. Clients should always start with the safe functions by default, and where unsafe functions deliver significant gains, they are available for clients who need them.

(here and elsewhere in this trait)

Manishearth · 2026-03-25T17:07:38Z

Some answers to questions on usage:

The trait is used by utf8_iter to abstract over all CPTs: hsivonen/utf8_iter@main...cptrie

utf8_two_byte, etc are used inside manual UTF8 decoding code in utf8_iter. It is not used by normalizer/collator code as far as I can tell.

The utf8_iter code is mostly more iterators with unsafe decoding code, code that looks rather similar to the iterator code here already. In terms of net amount of unsafe code, that's a lot of unsafe that probably could be abstracted over: we have next() and next_back() impls for each type of encoding, in a "with default" and "normal" mode. The "with default" and "normal" mode are definitely similar enough to be abstracted over, maybe with an internal macro. next() and next_back() are not. The unvalidated and validated Utf8 code might be; I'm not sure.

We're already supporting additional encodings here with the Latin-1 iterator. It is an ICU4X norm to support potentially-invalid utf8/utf16 and often Latin-1. So I think if we are choosing to have all these iterators, it's fine for them to live in icu_collections. This will also mean you don't need to do the release dance: you can add this to icu_collections and immediately start using it in normalizer/collator without any.

With those changes, I think we no longer need any public non-hidden methods on AbstractCPT, nor do we need any of the public methods on the concrete types. We'll only need:

AbstractCodePointTrie (sealed, no visible methods)
All the iterators and extension traits.

This is a publicly-safe addition to the public API, and much more scoped since it's just the iterators.

With this, it might be nice to figure out a way to share code between the different iterator impls. My preference is still for this unsafe code to live here rather than split across icu_collections and utf8_iter regardless of whether we reduce the code, but I would like to investigate codesharing. I might try some things.

I also think we can get rid of the extension traits if we instead switch to having methods on the CPT types, like iter_str(), iter_utf8() (maybe zip_str(), or chars_with_str()?). Curious what @sffc thinks about that.

sffc · 2026-03-25T23:51:50Z

> I also think we can get rid of the extension traits if we instead switch to having methods on the CPT types, like iter_str(), iter_utf8() (maybe zip_str(), or chars_with_str()?). Curious what @sffc thinks about that.

My understanding based on @Manishearth's comment, which could be incorrect: these methods are public and used only by the utf8_iter crate. The proposal is to create an iterator that instead lives in icu_collections, such that these unsafe methods don't need to be exported across crate boundaries, and the exported APIs become more safe.

If my understanding is correct, then I am supportive in principle.

Manishearth · 2026-04-03T18:03:40Z

Copying a suggestion from chat:

Another way of doing this would be to extend Utf8Chars with a Utf8CharsWithHandler type that takes in a Handler trait that has unsafe fn handle_one_byte(ascii) -> V, unsafe fn handle_two_byte(high, low, point) -> V (may not actually have to be unsafe!), etc. Then CPT can have a private CPTHandler type that it uses: just wrapping Utf8CharsWithHandler<CPTHandler> lets you make all the iterators you want.

I'd very much support this extension: As I commented earlier I disliked the fact that the UTF8 code was being duplicated a bunch of times, and would love to see it being refactored.

Overall I do not think we have really explored the space of encapsulating the unsafe code in ways that reduce copy-pasting of the same unsafe code.
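A rough sketch of the handler idea from the suggestion above. Assumptions: the trait methods are safe here (the suggestion leaves that open), the signatures are simplified (no `point` argument), input is a validated `&str`, and three- and four-byte sequences fall back to `str::chars`. `ScalarHandler` is a hypothetical handler used only to make the example checkable; a CPT-backed handler would index into the trie instead.

```rust
/// Hypothetical handler trait: one callback per UTF-8 sequence shape.
trait Handler {
    type Value;
    fn handle_one_byte(&self, ascii: u8) -> Self::Value;
    fn handle_two_byte(&self, high_five: u8, low_six: u8) -> Self::Value;
    fn handle_other(&self, c: char) -> Self::Value;
}

/// Single decoding loop, written once and reused with any handler.
struct Utf8CharsWithHandler<'a, H> {
    rest: &'a str,
    handler: H,
}

impl<'a, H: Handler> Iterator for Utf8CharsWithHandler<'a, H> {
    type Item = (char, H::Value);

    fn next(&mut self) -> Option<Self::Item> {
        let rest = self.rest;
        let bytes = rest.as_bytes();
        let lead = *bytes.first()?;
        let (c, value, len) = if lead < 0x80 {
            (char::from(lead), self.handler.handle_one_byte(lead), 1)
        } else if lead < 0xE0 {
            // In a valid `str`, a non-ASCII lead below 0xE0 starts a
            // two-byte sequence, so exactly one trail byte follows.
            let high_five = lead & 0b1_1111;
            let low_six = bytes[1] & 0b11_1111;
            let scalar = (u32::from(high_five) << 6) | u32::from(low_six);
            let c = char::from_u32(scalar)?;
            (c, self.handler.handle_two_byte(high_five, low_six), 2)
        } else {
            // Three- and four-byte sequences: defer to the standard decoder.
            let c = rest.chars().next()?;
            (c, self.handler.handle_other(c), c.len_utf8())
        };
        self.rest = &rest[len..];
        Some((c, value))
    }
}

/// Toy handler that reconstructs the scalar value from the pieces.
struct ScalarHandler;

impl Handler for ScalarHandler {
    type Value = u32;
    fn handle_one_byte(&self, ascii: u8) -> u32 {
        u32::from(ascii)
    }
    fn handle_two_byte(&self, high_five: u8, low_six: u8) -> u32 {
        (u32::from(high_five) << 6) | u32::from(low_six)
    }
    fn handle_other(&self, c: char) -> u32 {
        u32::from(c)
    }
}
```

The decoding logic lives in one `next` body, and each handler sees only the pre-split bits it needs.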

Manishearth · 2026-04-03T18:07:30Z

And based on later discussion, I'm hoping we can try doing that work in utf8_iter. I have a pretty clear idea of what we need at this point, but won't be able to look at this for a few weeks.

sffc · 2026-04-16T16:30:28Z

We briefly discussed this today. I am okay so long as the unsafe code is appropriately packaged, without crossing crate boundaries, while deferring to @Manishearth on concerns about duplicating the unsafe blocks between icu_collections and utf8_iter. @hsivonen is waiting on @Manishearth for feedback on how to better structure the unsafe code to reduce the duplication with utf8_iter.

Manishearth · 2026-04-17T21:32:31Z

Hmm, I did not realize @hsivonen was waiting for me. I did offer to do some of the work here myself to get it started but that was if @hsivonen didn't have time to do it himself. I guess this is an indicator that he doesn't, or needs help understanding what I was going for? In that case, yes I can do this, but it won't be soon.

(and in the meantime we shouldn't block PRs like #7878 on that)

sffc · 2026-04-20T22:58:03Z

I think @hsivonen was assuming you were going to post a more detailed reply based on this comment:

> I have a pretty clear idea of what we need at this point, but won't be able to look at this for a few weeks.

Manishearth · 2026-04-20T23:05:54Z

Right, that comment was an update for the rest of the team posted after Henri and I had had a pretty in depth conversation on this on a call. I have some interest in trying to sketch a solution but I was also under the impression that I had given Henri enough to go on.

Either way not super important, I'll get to it eventually ...

Manishearth · 2026-04-27T20:52:21Z

hsivonen/utf8_iter#2

CodePointTrie support for normalizer and collator perf improvements

576bf49

hsivonen requested a review from echeran as a code owner March 11, 2026 16:35

hsivonen requested a review from Manishearth March 11, 2026 16:35

hsivonen mentioned this pull request Mar 11, 2026

Review normalizer changes #7528

Draft

Manishearth reviewed Mar 11, 2026

Manishearth reviewed Mar 12, 2026

robertbastian reviewed Mar 12, 2026

Comment thread components/collections/src/codepointinvliststringlist/mod.rs Outdated

Comment thread components/collections/src/codepointinvliststringlist/mod.rs Outdated

hsivonen and others added 4 commits March 17, 2026 11:14

Avoid validating whole string in inversion list lookup

dc5c98e

Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>

Adjust comments around fast ASCII asccess invariant

1df7592

Better formatting

35abb96

Rework UTF-8 iteration safety remarks

77f585b

hsivonen requested a review from a team as a code owner March 17, 2026 11:00

Manishearth mentioned this pull request Mar 18, 2026

Low-dependency ref-cast #7607

Open

Manishearth approved these changes Mar 24, 2026

Manishearth added the discuss-priority Discuss at the next ICU4X meeting label Mar 24, 2026

Manishearth reviewed Mar 24, 2026

Manishearth requested changes Mar 24, 2026

sffc reviewed Mar 25, 2026

Manishearth mentioned this pull request Apr 27, 2026

Introduce generic iteration code hsivonen/utf8_iter#2

Open

robertbastian removed the discuss-priority Discuss at the next ICU4X meeting label Apr 30, 2026
