Generate properties from the UCD by robertbastian · Pull Request #7904 · unicode-org/icu4x

robertbastian · 2026-04-22T14:41:06Z

Changelog

icu_provider_source:

Compute properties directly from the unicode data source, instead of from icuexport

Manishearth · 2026-04-24T14:45:21Z

Is there a reason we have a data diff at all here?

robertbastian · 2026-04-24T16:33:49Z

The diff for the BidiClass and Script CPTs is weird, but it's the same diff you get if you rebuild CPTs with icuexportdata, see #7299. The removed/added values are the trie's default values, and iterating through the trie yields the same results, so I don't think this diff is relevant.
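To illustrate why a different stored default need not change any lookup, here is a minimal sketch. It does not use ICU4X's `CodePointTrie`; `SparseMap` and `same_lookups` are hypothetical stand-ins that model a trie as explicit ranges plus a default value.

```rust
/// Hypothetical stand-in for a code point trie: explicit ranges plus a default.
/// This is NOT the ICU4X `CodePointTrie` API.
#[derive(Debug, PartialEq)]
struct SparseMap {
    default: u8,
    // Non-overlapping inclusive ranges with their value.
    ranges: Vec<(u32, u32, u8)>,
}

impl SparseMap {
    /// Returns the explicit value for `cp`, or the default if no range matches.
    fn get(&self, cp: u32) -> u8 {
        self.ranges
            .iter()
            .find(|&&(start, end, _)| start <= cp && cp <= end)
            .map(|&(_, _, value)| value)
            .unwrap_or(self.default)
    }
}

/// True if both maps return the same value for every code point in `0..=max`.
fn same_lookups(a: &SparseMap, b: &SparseMap, max: u32) -> bool {
    (0..=max).all(|cp| a.get(cp) == b.get(cp))
}
```

Two maps built this way can differ structurally (and so serialize differently) while agreeing on every code point, which is the situation described above.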

The diff for CCC parsing is because the UCD also defines numeric aliases, which I've added to the parser for good measure.
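A minimal sketch of what accepting numeric aliases could look like, assuming input lines shaped like the `ccc` entries of PropertyValueAliases.txt (`ccc; 230; A ; Above`). The function name `parse_ccc_names` is hypothetical and is not taken from this PR.

```rust
use std::collections::BTreeMap;

/// Parses `ccc` lines in the style of PropertyValueAliases.txt into a
/// name-to-value map. The numeric field is registered as a name too,
/// so "230" parses like "Above". Hypothetical helper, not the PR's code.
fn parse_ccc_names(data: &str) -> BTreeMap<String, u8> {
    let mut names = BTreeMap::new();
    for line in data.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        let mut fields = line.split(';').map(str::trim);
        if fields.next() != Some("ccc") {
            continue;
        }
        // For ccc, the second field is assumed to be the numeric value.
        let Some(Ok(value)) = fields.next().map(str::parse::<u8>) else {
            continue;
        };
        names.insert(value.to_string(), value);
        for name in fields.filter(|f| !f.is_empty()) {
            names.insert(name.to_string(), value);
        }
    }
    names
}
```

With the numeric strings in the map, the name-parsing data gains extra entries compared to a parser that only knows the short and long names, which would explain a diff of this kind.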

Manishearth

General shape is promising.

Manishearth · 2026-04-24T18:30:10Z

-        for (start, end) in &self.ranges {
-            builder.add_range32(start..=end);
+
+        // UTS #18 Annex C Compatibility Properties


thought: huh; I didn't realize we generate compound compat props.

That's data duplication. Though it's also not that important, I guess.

I'd love to get rid of them. We can still support them in the unicode set parser.

Manishearth · 2026-04-24T18:30:39Z

+
+        // UTS #18 Annex C Compatibility Properties
+        if name == "alnum" {
+            // \p{alpha}\p{digit} = \p{Alphabetic}\p{gc=Decimal_Number}


thought: would be very cool to use our unicodeset code to make this big function smaller

not actually a suggestion
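For reference, the composition in the diff comment (alnum as the union of Alphabetic and Decimal_Number) amounts to a range-set union. A minimal sketch, assuming each property is available as a list of inclusive code point ranges; `union_ranges` is a hypothetical helper and not an ICU4X API.

```rust
/// Merges two lists of inclusive code point ranges into a sorted,
/// non-overlapping union. Hypothetical helper, not an ICU4X API.
fn union_ranges(a: &[(u32, u32)], b: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut all: Vec<(u32, u32)> = a.iter().chain(b).copied().collect();
    all.sort_unstable();
    let mut merged: Vec<(u32, u32)> = Vec::new();
    for (start, end) in all {
        match merged.last_mut() {
            // Overlapping or directly adjacent: extend the previous range.
            Some(last) if start <= last.1.saturating_add(1) => last.1 = last.1.max(end),
            _ => merged.push((start, end)),
        }
    }
    merged
}
```

Passing the ASCII letter ranges and the ASCII digit range yields three disjoint ranges, which is the ASCII slice of what `alnum` would contain.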

sffc · 2026-04-25T07:26:52Z

Nit: "Changelog: N/A" is not correct. It is relevant to clients that their data comes from a --ucd-root instead of a --icuexportdata-root. Please fix

sffc

I'm inclined to defer to @Manishearth's review, and ideally @markusicu could take a look, too. I'd like it, though, if we could get this to a zero diff. I don't understand why there are diffs in bidi, script, and canonical combining class. Please try to make this a clean diff if possible or else write up in more detail why the diff is (1) benign and (2) unavoidable.

sffc · 2026-04-25T07:56:26Z

> The diff for the BidiClass and Script CPTs is weird, but it's the same diff you get if you rebuild CPTs with icuexportdata, see #7299. The removed/added values are the trie's default values, and iterating through the trie yields the same results, so I don't think this diff is relevant.

Please try to land that PR before this one in order to avoid the data diff.

> The diff for CCC parsing is because the UCD also defines numeric aliases, which I've added to the parser for good measure.

I prefer also splitting this out into a separate PR.

robertbastian · 2026-04-25T08:33:18Z

> Please try to land that PR before this one in order to avoid the data diff.

That PR is not landable because of datagen performance. I'm not going to invest time improving that just to immediately replace it again.

sffc · 2026-04-25T09:39:52Z

I don't understand: you are needing to build all the same CPTs here as you do in that other PR. So why does only one of them have performance problems?

robertbastian · 2026-04-25T11:03:47Z

Because in this PR I've implemented caching inside the UCD data source.
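The PR's actual caching code is not shown in this thread. As a rough sketch of the general idea (parse each source file at most once and share the result), assuming a mutex-guarded map keyed by path; `ParsedCache` and `get_or_load` are hypothetical names:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache that parses each source file at most once.
/// `load` stands in for whatever reads and parses the file.
struct ParsedCache<T> {
    entries: Mutex<HashMap<String, Arc<T>>>,
}

impl<T> ParsedCache<T> {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached value for `path`, calling `load` only on a miss.
    fn get_or_load<E>(
        &self,
        path: &str,
        load: impl FnOnce(&str) -> Result<T, E>,
    ) -> Result<Arc<T>, E> {
        let mut entries = self.entries.lock().expect("cache mutex poisoned");
        if let Some(hit) = entries.get(path) {
            return Ok(Arc::clone(hit));
        }
        // Errors are not cached, so a failed load is retried next time.
        let value = Arc::new(load(path)?);
        entries.insert(path.to_string(), Arc::clone(&value));
        Ok(value)
    }
}
```

Holding the lock while loading keeps the sketch simple and guarantees a single parse per file, at the cost of serializing concurrent loads.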

Manishearth · 2026-04-25T16:20:34Z

Would it be possible to make a small PR (maybe one that adds defaults to TrieValue) that has temporary hacks leading to the same diffs? It would be nice to have a programmatic understanding of what's changed. But I don't think we should spend too much time on this.

robertbastian · 2026-04-26T09:04:11Z

I've updated the Script default in the other PR. The diffs are now byte-identical.

Manishearth · 2026-04-27T16:24:29Z

Nice macro.

Manishearth · 2026-04-27T16:38:10Z

+                _ => "ucd/PropList.txt",
+            };
+
+            for line in self.unicode()?.read_to_string(file)?.lines() {


issue: this type of UCD parsing code shows up in multiple spots: can we make a shared helper?
(or maybe two, one for binary and one for enum props)

Having a separate ucd.rs with all that code would be nice.

They are somewhat different formats (especially things like emoji-sequences) so maybe it's not easy to share the helper, but maybe they can all be in one file at least.

I thought about this, but:

- we can isolate the comment stripping, but apparently some comments are semantically relevant, so that might be confusing
- there are enough edge cases, like CCC, that this would get quite complicated
- I don't want to create intermediate data structures if possible
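For concreteness, the kind of shared helper being discussed might look like the sketch below. It is an assumption about shape only: `parse_ucd_line` is a hypothetical name, and the sketch deliberately ignores the edge cases raised above (semantically relevant comments, CCC, files with other layouts).

```rust
use std::ops::RangeInclusive;

/// Hypothetical shared helper: splits one UCD-style data line into a code
/// point range and its remaining `;`-separated fields.
/// Returns `None` for blank lines, comment-only lines, and malformed ranges.
/// Note: this drops ALL comments, so comment-borne data such as `@missing`
/// lines would need separate handling.
fn parse_ucd_line(line: &str) -> Option<(RangeInclusive<u32>, Vec<&str>)> {
    let line = line.split('#').next()?.trim();
    if line.is_empty() {
        return None;
    }
    let mut fields = line.split(';').map(str::trim);
    let cp_range = fields.next()?;
    // A range is either a single code point ("0041") or "START..END".
    let (start, end) = cp_range.split_once("..").unwrap_or((cp_range, cp_range));
    let start = u32::from_str_radix(start, 16).ok()?;
    let end = u32::from_str_radix(end, 16).ok()?;
    Some((start..=end, fields.collect()))
}
```

Returning borrowed fields avoids building an intermediate owned structure per line, which goes some way toward the third objection, though callers would still need per-property logic on top.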

Manishearth · 2026-04-27T16:41:08Z

+
+            for line in self.unicode()?.read_to_string(file)?.lines() {
+                let line = line.split('#').next().unwrap().trim();
+                if line.is_empty() {


issue: These files have @missing rules that are also supposed to be parsed. As far as I can tell, each file has only a single such rule per property, but I'm not sure if that's guaranteed, and I'm not sure if that rule always matches the enum default we have here. Probably.

We should at the very least think about this and add a comment here mentioning it. I've already asked Robin for clarity.

afaict only DerivedNormalizationProps.txt has @missing. I'll add it

> As far as I can tell, each file has only a single such rule per property,

Nope, and that’s explicitly documented: https://www.unicode.org/reports/tr44/#Missing_Conventions

> Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property.

And DerivedEastAsianWidth.txt, which is in this PR, is one such file.

That also says

> An @missing line is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.

which doesn't match the data.

ah, the property I looked at wasn't binary, it was enumerated over Yes and No. of course

(It does, it’s just that two enumerated properties—NFD_QC and NFKD_QC—have exactly two values Y=Yes and N=No. Interestingly, they do not have the aliases T=True and F=False that real binary properties have.)

To @Manishearth’s original comment: The @missing support is there in this PR already, in enum_codepointtrie.rs, which is where it is applicable:

icu4x/provider/source/src/properties/enum_codepointtrie.rs

Lines 115 to 122 in 556232d

for line in self.unicode()?.read_to_string(&file)?.lines() {
    let line = line.strip_prefix("# @missing: ").unwrap_or(line);
    let line = line.split('#').next().unwrap().trim();
    if line.is_empty() {
        continue;
    }
    let mut fields = line.split(';');
    let cp_range = fields.next().unwrap().trim();
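The effect of that `strip_prefix` trick can be shown in a standalone sketch: `@missing` lines are read as ordinary data lines, in file order, and later assignments win. This assumes that narrower `@missing` lines follow the broad one and that explicit data follows both, which is my reading of the UAX #44 text linked above. `parse_assignments` and `lookup` are hypothetical names, and the sample input in the usage note is illustrative, not copied from a UCD file.

```rust
/// Collects `(start, end, value)` assignments in file order, treating
/// `# @missing:` lines as ordinary data lines (as the quoted snippet does).
fn parse_assignments(data: &str) -> Vec<(u32, u32, String)> {
    let mut out = Vec::new();
    for line in data.lines() {
        let line = line.strip_prefix("# @missing: ").unwrap_or(line);
        let line = line.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let mut fields = line.split(';').map(str::trim);
        let (Some(cp_range), Some(value)) = (fields.next(), fields.next()) else {
            continue;
        };
        let (start, end) = cp_range.split_once("..").unwrap_or((cp_range, cp_range));
        let (Ok(start), Ok(end)) =
            (u32::from_str_radix(start, 16), u32::from_str_radix(end, 16))
        else {
            continue;
        };
        out.push((start, end, value.to_string()));
    }
    out
}

/// Looks up a code point; the LAST matching assignment wins, so narrower
/// `@missing` lines and explicit data override the broad default before them.
fn lookup(assignments: &[(u32, u32, String)], cp: u32) -> Option<&str> {
    assignments
        .iter()
        .rev()
        .find(|(start, end, _)| (*start..=*end).contains(&cp))
        .map(|(_, _, value)| value.as_str())
}
```

Given an input with `# @missing: 0000..10FFFF; Neutral`, then `# @missing: 20000..2FFFD; Wide`, then a data line `0020 ; Na`, looking up U+0041 gives `Neutral`, U+2A6E0 gives `Wide`, and U+0020 gives `Na`. Note that `@missing` lines use long value names while data lines use short ones, so a real implementation still has to resolve aliases.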

Manishearth

Overall I like this. I still would prefer if all the for loops that handle UCD formats were in a separate file in one place (even if they are half a dozen different functions with slightly different APIs), in case we need to fix bugs around that. Markus/Elango have indicated that there are nuances there, and @missing is one such nuance we messed up here once already, and things are liable to change in the future.

But I'm happy to do that myself as a followup.

Manishearth · 2026-04-28T22:57:10Z

-                    })
+                    #[cfg(not(any(feature = "use_wasm", feature = "use_icu4c")))]
+                    return Err(DataError::custom(
+                        "icu_provider_source must be built with use_icu4c or use_wasm to build properties data",


question: is this change something we care about?

I think this is fine, kind of inevitable, wasm is default so I expect most users to just use this without noticing.

Shane also agrees this is fine.

Manishearth · 2026-04-28T22:58:13Z

                    let trie = map
                        .into_iter()
+                        // Filter CCC's numeric names.
+                        // TODO: Don't


nit: followup issue

Manishearth · 2026-04-28T22:59:07Z

+                    .read_to_string("ucd/ScriptExtensions.txt")?
+                    .lines()
+                {
+                    let line = line.split('#').next().unwrap().trim();


observation: this code is more complex than the others

Manishearth · 2026-04-28T23:00:36Z

-    pub(crate) short: Option<String>,
-    #[serde(default)]
-    pub(crate) aliases: Vec<String>,
+    pub(crate) short: String,


observation: so we still use this data for ICU4C discriminants and the short name?

Manishearth · 2026-04-29T00:19:53Z

 }

-fn hardcoded_segmenter_provider() -> SourceDataProvider {
+fn unicode_15_1() -> &'static SourceDataProvider {


nit: document as being for segmenter to use

Manishearth · 2026-04-29T00:20:23Z

-        if p.left.is_none() && p.right.is_none() {
-            // If any values aren't set, this is builtin type.
-            simple_properties_count += 1;
+    match &*segmenter.segmenter_type {


I like this new code

robertbastian force-pushed the ucd branch 8 times, most recently from e5335d1 to debc7d6 April 23, 2026 17:11

robertbastian added and removed the discuss label (Discuss at a future ICU4X-SC meeting) Apr 23, 2026

robertbastian force-pushed the ucd branch 3 times, most recently from 3370232 to 3aa43c4 April 23, 2026 22:53

robertbastian marked this pull request as ready for review April 24, 2026 08:30

robertbastian requested review from a team, Manishearth, echeran and sffc as code owners April 24, 2026 08:30

robertbastian force-pushed the ucd branch from 3aa43c4 to ae571bf April 24, 2026 08:34

robertbastian force-pushed the ucd branch from ae571bf to ef3304e April 24, 2026 18:03

Manishearth reviewed Apr 24, 2026

sffc reviewed Apr 25, 2026

Comment thread provider/source/data/debug/property/name/parse/canonical/combining/class/v1.json

robertbastian force-pushed the ucd branch from 595f08f to 3c3c671 April 27, 2026 11:20

Manishearth reviewed Apr 27, 2026

robertbastian and others added 4 commits April 28, 2026 12:03

make segmenter data dependencies more granular

c7c0dcc

read ucd

124fc53

Filter CCC's numeric names

9942709

Co-authored-by: Copilot <copilot@github.com>

move default

0628f2a

robertbastian force-pushed the ucd branch from 3c3c671 to 0628f2a April 28, 2026 10:03

@missing for binary props

ce74f86

robertbastian requested review from Manishearth, eggrobin and sffc and removed request for echeran April 28, 2026 10:15

Revert "@missing for binary props"

556232d

This reverts commit ce74f86.

Manishearth approved these changes Apr 29, 2026

robertbastian merged commit 56ebf1a into unicode-org:main Apr 29, 2026
64 of 65 checks passed

robertbastian deleted the ucd branch April 29, 2026 10:10

robertbastian mentioned this pull request Apr 29, 2026

Build CodePointTries in datagen #7299

Closed

Manishearth mentioned this pull request May 7, 2026

Collect UCD-format parsing code in one file #7947

Open
