Generate properties from the UCD#7904

Merged
robertbastian merged 6 commits into unicode-org:main from robertbastian:ucd
Apr 29, 2026

Conversation

@robertbastian
Member

@robertbastian robertbastian commented Apr 22, 2026

#4602

Changelog

icu_provider_source:

  • Compute properties directly from the unicode data source, instead of from icuexport

@robertbastian robertbastian force-pushed the ucd branch 8 times, most recently from e5335d1 to debc7d6, on April 23, 2026 17:11
@robertbastian robertbastian added and then removed the discuss label (Discuss at a future ICU4X-SC meeting) on April 23, 2026
@robertbastian robertbastian force-pushed the ucd branch 3 times, most recently from 3370232 to 3aa43c4, on April 23, 2026 22:53
@robertbastian robertbastian marked this pull request as ready for review April 24, 2026 08:30
@robertbastian robertbastian requested review from a team, Manishearth, echeran and sffc as code owners April 24, 2026 08:30
@Manishearth
Member

Is there a reason we have a data diff at all here?

@robertbastian
Member Author

The diff for the BidiClass and Script CPTs is weird, but it's the same diff you get if you rebuild CPTs with icuexportdata, see #7299. The removed/added values are the trie's default values, and iterating through the trie yields the same results, so I don't think this diff is relevant.

The diff for CCC parsing is because the UCD also defines numeric aliases, which I've added to the parser for good measure.
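As an illustration of the numeric aliases mentioned above: in the UCD's PropertyValueAliases.txt, each Canonical_Combining_Class value is listed with its number as well as a short and a long name (for example `ccc; 230; A; Above`). A minimal sketch of a parser that accepts all three spellings might look as follows. The `parse_ccc` function and its four-entry table are hypothetical and are not the code in this PR.

```rust
/// A few Canonical_Combining_Class entries in the form
/// (numeric value, short name, long name), as listed in
/// PropertyValueAliases.txt. The real table is much longer.
const CCC_ALIASES: &[(u8, &str, &str)] = &[
    (0, "NR", "Not_Reordered"),
    (9, "VR", "Virama"),
    (220, "B", "Below"),
    (230, "A", "Above"),
];

/// Hypothetical helper: resolves a CCC name given as a short name,
/// a long name, or a numeric alias such as "230".
fn parse_ccc(name: &str) -> Option<u8> {
    let name = name.trim();
    // Numeric alias: CCC values fit in a u8.
    if let Ok(n) = name.parse::<u8>() {
        return Some(n);
    }
    // Otherwise compare against the short and long names, ignoring ASCII case.
    CCC_ALIASES
        .iter()
        .find(|(_, short, long)| {
            short.eq_ignore_ascii_case(name) || long.eq_ignore_ascii_case(name)
        })
        .map(|(n, _, _)| *n)
}
```

With this shape, `parse_ccc("230")`, `parse_ccc("A")`, and `parse_ccc("Above")` all resolve to the same value, which is the behavior the numeric aliases are meant to allow.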

Member

@Manishearth Manishearth left a comment

General shape is promising.

Comment thread components/properties/src/trievalue.rs Outdated
for (start, end) in &self.ranges {
    builder.add_range32(start..=end);

// UTS #18 Annex C Compatibility Properties
Member

thought: huh; I didn't realize we generate compound compat props.

That's data duplication. Though it's also not that important, I guess.

Member Author

I'd love to get rid of them. We can still support them in the unicode set parser.


// UTS #18 Annex C Compatibility Properties
if name == "alnum" {
    // \p{alpha}\p{digit} = \p{Alphabetic}\p{gc=Decimal_Number}
Member

thought: would be very cool to use our unicodeset code to make this big function smaller

not actually a suggestion
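The composition noted in the snippet above (alnum as the union of Alphabetic and gc=Decimal_Number) can be illustrated with plain range lists. This is only a sketch: `union_ranges` is a hypothetical helper, and the two inputs are tiny ASCII-only excerpts rather than the real property data or the unicodeset code.

```rust
/// Merges two lists of inclusive code point ranges into a sorted,
/// non-overlapping union. Overlapping and adjacent ranges are coalesced.
fn union_ranges(a: &[(u32, u32)], b: &[(u32, u32)]) -> Vec<(u32, u32)> {
    let mut all: Vec<(u32, u32)> = a.iter().chain(b.iter()).copied().collect();
    all.sort_unstable();
    let mut out: Vec<(u32, u32)> = Vec::new();
    for (start, end) in all {
        match out.last_mut() {
            // Overlapping or directly adjacent: extend the previous range.
            Some(last) if start <= last.1.saturating_add(1) => {
                last.1 = last.1.max(end);
            }
            _ => out.push((start, end)),
        }
    }
    out
}

/// ASCII-only excerpts standing in for \p{Alphabetic} and
/// \p{gc=Decimal_Number}; the real sets are far larger.
const ALPHA_ASCII: &[(u32, u32)] = &[(0x41, 0x5A), (0x61, 0x7A)];
const DIGIT_ASCII: &[(u32, u32)] = &[(0x30, 0x39)];
```

Calling `union_ranges(ALPHA_ASCII, DIGIT_ASCII)` yields the three ranges for `0-9`, `A-Z`, and `a-z` in code point order.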

@sffc
Member

sffc commented Apr 25, 2026

Nit: "Changelog: N/A" is not correct. It is relevant to clients that their data comes from a --ucd-root instead of a --icuexportdata-root. Please fix

Member

@sffc sffc left a comment

I'm inclined to defer to @Manishearth's review, and ideally @markusicu could take a look, too. I'd like it, though, if we could get this to a zero diff. I don't understand why there are diffs in bidi, script, and canonical combining class. Please try to make this a clean diff if possible, or else write up in more detail why the diff is (1) benign and (2) unavoidable.

@sffc
Member

sffc commented Apr 25, 2026

> The diff for the BidiClass and Script CPTs is weird, but it's the same diff you get if you rebuild CPTs with icuexportdata, see #7299. The removed/added values are the trie's default values, and iterating through the trie yields the same results, so I don't think this diff is relevant.

Please try to land that PR before this one in order to avoid the data diff.

> The diff for CCC parsing is because the UCD also defines numeric aliases, which I've added to the parser for good measure.

I prefer also splitting this out into a separate PR.

@robertbastian
Member Author

> Please try to land that PR before this one in order to avoid the data diff.

That PR is not landable because of datagen performance. I'm not going to invest time improving that just to immediately replace it again.

@sffc
Member

sffc commented Apr 25, 2026

I don't understand: you need to build all the same CPTs here as you do in that other PR. So why does only one of them have performance problems?

@robertbastian
Member Author

Because in this PR I've implemented caching inside the UCD data source.
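A minimal sketch of what caching inside a data source can look like, assuming a file-backed source whose contents are read once and then reused for every property built from them. The `CachingSource` type and its loader closure are hypothetical and do not reflect the actual implementation in icu_provider_source.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical data source that memoizes file contents by path, so that
/// building many properties from the same UCD file loads it only once.
struct CachingSource<F: Fn(&str) -> String> {
    load: F,
    cache: Mutex<HashMap<String, Arc<str>>>,
}

impl<F: Fn(&str) -> String> CachingSource<F> {
    fn new(load: F) -> Self {
        Self {
            load,
            cache: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached contents, calling the loader on the first request.
    fn read_to_string(&self, path: &str) -> Arc<str> {
        let mut cache = self.cache.lock().unwrap();
        if let Some(hit) = cache.get(path) {
            return Arc::clone(hit);
        }
        let contents: Arc<str> = Arc::from((self.load)(path));
        cache.insert(path.to_owned(), Arc::clone(&contents));
        contents
    }
}
```

The sketch holds the lock while loading, which keeps it short but serializes concurrent loads; a real implementation may make a different trade-off.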

@Manishearth
Member

Would it be possible to make a small PR (maybe one that adds defaults to TrieValue) that has temporary hacks leading to the same diffs? It would be nice to have a programmatic understanding of what's changed. But I don't think we should spend too much time on this.

@robertbastian
Member Author

I've updated the Script default in the other PR. The diffs are now byte-identical.
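Why a diff in default values can be benign is easier to see with a toy model. The sketch below is not the CodePointTrie format; it assumes a simple structure with explicit ranges plus a default value, and shows two encodings that differ in their stored fields yet agree on every lookup. All names are hypothetical.

```rust
/// Toy stand-in for a trie: explicit inclusive ranges plus a default value
/// returned for every code point not covered by a range.
#[derive(Debug, PartialEq)]
struct ToyTrie {
    default: u8,
    ranges: Vec<(u32, u32, u8)>,
}

impl ToyTrie {
    fn get(&self, cp: u32) -> u8 {
        self.ranges
            .iter()
            .find(|(start, end, _)| (*start..=*end).contains(&cp))
            .map(|(_, _, value)| *value)
            .unwrap_or(self.default)
    }
}

/// Default 0, with the value 1 stored explicitly for A..Z.
fn encoding_a() -> ToyTrie {
    ToyTrie { default: 0, ranges: vec![(0x41, 0x5A, 1)] }
}

/// Default 1, with the value 0 stored explicitly everywhere else.
fn encoding_b() -> ToyTrie {
    ToyTrie {
        default: 1,
        ranges: vec![(0x00, 0x40, 0), (0x5B, 0x10FFFF, 0)],
    }
}
```

Comparing the two structs field by field reports a difference, while iterating over all code points and comparing `get` results does not.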

@Manishearth
Member

Nice macro.

Comment thread provider/icu4x-datagen/src/main.rs Outdated
Comment thread provider/source/src/properties/bidi.rs
    _ => "ucd/PropList.txt",
};

for line in self.unicode()?.read_to_string(file)?.lines() {
Member

issue: this type of UCD parsing code shows up in multiple spots: can we make a shared helper?
(or maybe two, one for binary and one for enum props)

Having a separate ucd.rs with all that code would be nice.

They are somewhat different formats (especially things like emoji-sequences) so maybe it's not easy to share the helper, but maybe they can all be in one file at least.

Member Author

I thought about this, but

  • we can isolate the comment stripping, but apparently some comments are semantically relevant, so that might be confusing
  • there are enough edge cases, like CCC, that this would get quite complicated
  • I don't want to create intermediate data structures if possible
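For illustration, the comment stripping and field splitting from the first bullet could be isolated roughly as below. This is a sketch that assumes the common `range ; value # comment` line shape. `parse_ucd_line` is a hypothetical name, and the edge cases listed above (semantically relevant comments, CCC) are deliberately not handled.

```rust
/// Parses one line of a semicolon-separated UCD file into an inclusive
/// code point range and the remaining trimmed fields. Returns `None` for
/// blank lines, comment-only lines, and lines with a malformed range.
fn parse_ucd_line(line: &str) -> Option<((u32, u32), Vec<&str>)> {
    // Drop everything after the first '#'.
    let line = line.split('#').next().unwrap_or("").trim();
    if line.is_empty() {
        return None;
    }
    let mut fields = line.split(';').map(str::trim);
    let cp_range = fields.next()?;
    // A range is either a single code point ("00AA") or "start..end".
    let (start, end) = match cp_range.split_once("..") {
        Some((s, e)) => (s, e),
        None => (cp_range, cp_range),
    };
    let start = u32::from_str_radix(start, 16).ok()?;
    let end = u32::from_str_radix(end, 16).ok()?;
    Some(((start, end), fields.collect()))
}
```

Returning borrowed fields avoids an owned intermediate structure per line, which is one way to address the third bullet, though each call still allocates a small `Vec`.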


for line in self.unicode()?.read_to_string(file)?.lines() {
    let line = line.split('#').next().unwrap().trim();
    if line.is_empty() {
Member

issue: These files have @missing rules that are also supposed to be parsed. As far as I can tell, each file has only a single such rule per property, but I'm not sure if that's guaranteed, and I'm not sure if that rule always matches the enum default we have here. Probably.

We should at the very least think about this and add a comment here mentioning it. I've already asked Robin for clarity.

Member Author

afaict only DerivedNormalizationProps.txt has @missing. I'll add it

Member

> As far as I can tell, each file has only a single such rule per property,

Nope, and that’s explicitly documented: https://www.unicode.org/reports/tr44/#Missing_Conventions

> Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property.

And DerivedEastAsianWidth.txt, which is in this PR, is one such file.
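To make the consequence concrete, here is a sketch of resolving a value when a file carries several `@missing` lines for one property. It assumes the `@missing` lines are applied in file order, so that a later, narrower default overrides an earlier, broader one, and that explicit data lines override all defaults. The input is a hand-written excerpt in the style of DerivedEastAsianWidth.txt, not the real file, and `resolve` is a hypothetical helper.

```rust
/// Parses "start..end" or a single code point, both in hexadecimal.
fn parse_range(s: &str) -> Option<(u32, u32)> {
    let (a, b) = s.split_once("..").unwrap_or((s, s));
    Some((u32::from_str_radix(a, 16).ok()?, u32::from_str_radix(b, 16).ok()?))
}

/// Looks up the value for `cp`. Defaults from `@missing` lines are applied in
/// file order (assumption: later lines win), and explicit data lines win over
/// any default.
fn resolve(data: &str, cp: u32) -> Option<String> {
    let mut default: Option<String> = None;
    let mut explicit: Option<String> = None;
    for line in data.lines() {
        let (is_missing, line) = match line.strip_prefix("# @missing: ") {
            Some(rest) => (true, rest),
            None => (false, line),
        };
        // Strip trailing comments, then split "range ; value".
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((range, value)) = line.split_once(';') else { continue };
        let Some((start, end)) = parse_range(range.trim()) else { continue };
        if (start..=end).contains(&cp) {
            let slot = if is_missing { &mut default } else { &mut explicit };
            *slot = Some(value.trim().to_owned());
        }
    }
    explicit.or(default)
}

/// Hand-written excerpt for illustration only.
const SAMPLE: &str = "\
# @missing: 0000..10FFFF; Neutral
# @missing: 3400..4DBF; Wide
0020          ; Na # Zs       SPACE
";
```

In this excerpt a code point inside 3400..4DBF with no explicit line picks up the narrower default, while an unlisted code point elsewhere falls back to the file-wide one.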

Member Author

That also says

> An @missing line is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.

which doesn't match the data.

Member Author

Ah, the property I looked at wasn't binary; it was enumerated over Yes and No. Of course.

Member

(It does, it’s just that two enumerated properties—NFD_QC and NFKD_QC—have exactly two values Y=Yes and N=No. Interestingly, they do not have the aliases T=True and F=False that real binary properties have.)
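The distinction can be sketched in code. Assuming the value aliases as described in the comment above (binary properties accept Y, Yes, T, and True, while the enumerated NFD_QC and NFKD_QC accept only Y/Yes and N/No), two hypothetical parsers would differ as follows. Matching is exact here for brevity; the names are illustrative only.

```rust
/// Parses a value of a true binary property, which has the aliases
/// Y/Yes/T/True and N/No/F/False.
fn parse_binary_value(s: &str) -> Option<bool> {
    match s {
        "Y" | "Yes" | "T" | "True" => Some(true),
        "N" | "No" | "F" | "False" => Some(false),
        _ => None,
    }
}

/// The two values of the enumerated NFD_QC and NFKD_QC properties.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecompositionQuickCheck {
    Yes,
    No,
}

/// Parses an NFD_QC / NFKD_QC value; T/True and F/False are not aliases here.
fn parse_decomposition_qc(s: &str) -> Option<DecompositionQuickCheck> {
    match s {
        "Y" | "Yes" => Some(DecompositionQuickCheck::Yes),
        "N" | "No" => Some(DecompositionQuickCheck::No),
        _ => None,
    }
}
```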

Member

@eggrobin eggrobin Apr 28, 2026

To @Manishearth’s original comment: The @missing support is there in this PR already, in enum_codepointtrie.rs, which is where it is applicable:

for line in self.unicode()?.read_to_string(&file)?.lines() {
    let line = line.strip_prefix("# @missing: ").unwrap_or(line);
    let line = line.split('#').next().unwrap().trim();
    if line.is_empty() {
        continue;
    }
    let mut fields = line.split(';');
    let cp_range = fields.next().unwrap().trim();

@robertbastian robertbastian requested review from Manishearth, eggrobin and sffc and removed request for echeran April 28, 2026 10:15
Member

@Manishearth Manishearth left a comment

Overall I like this. I would still prefer that all the for loops handling UCD formats live in one separate file (even if they are half a dozen different functions with slightly different APIs), in case we need to fix bugs around that. Markus/Elango have indicated that there are nuances there, and @missing is one such nuance we already messed up here once, and things are liable to change in the future.

But I'm happy to do that myself as a followup.

})
#[cfg(not(any(feature = "use_wasm", feature = "use_icu4c")))]
return Err(DataError::custom(
    "icu_provider_source must be built with use_icu4c or use_wasm to build properties data",
Member

question: is this change something we care about?

I think this is fine and kind of inevitable; wasm is the default, so I expect most users to just use this without noticing.

Member

Shane also agrees this is fine.

let trie = map
    .into_iter()
    // Filter CCC's numeric names.
    // TODO: Don't
Member

nit: followup issue

    .read_to_string("ucd/ScriptExtensions.txt")?
    .lines()
{
    let line = line.split('#').next().unwrap().trim();
Member

observation: this code is more complex than the others
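Part of the extra complexity comes from the value field: in ScriptExtensions.txt it is a space-separated list of script codes rather than a single value. A rough sketch of parsing one such line follows. `parse_scx_line` is a hypothetical helper, and the line used in the usage note is illustrative.

```rust
/// Parses a ScriptExtensions-style line of the form
/// "range ; Scr1 Scr2 ... # comment" into an inclusive code point range
/// and the list of script codes. Returns `None` for comments and blanks.
fn parse_scx_line(line: &str) -> Option<((u32, u32), Vec<&str>)> {
    // Drop the trailing comment, if any.
    let line = line.split('#').next().unwrap_or("").trim();
    let (range, scripts) = line.split_once(';')?;
    let range = range.trim();
    // A range is either a single code point or "start..end".
    let (start, end) = range.split_once("..").unwrap_or((range, range));
    let start = u32::from_str_radix(start, 16).ok()?;
    let end = u32::from_str_radix(end, 16).ok()?;
    // Unlike single-valued properties, the value is a list of codes.
    Some(((start, end), scripts.split_whitespace().collect()))
}
```

For a line such as `102E0 ; Arab Copt # Mn COPTIC EPACT THOUSANDS MARK`, this returns the single code point range together with the two codes `Arab` and `Copt`.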

pub(crate) short: Option<String>,
#[serde(default)]
pub(crate) aliases: Vec<String>,
pub(crate) short: String,
Member

observation: so we still use this data for ICU4C discriminants and the short name?

}

fn hardcoded_segmenter_provider() -> SourceDataProvider {
fn unicode_15_1() -> &'static SourceDataProvider {
Member

nit: document as being for segmenter to use

if p.left.is_none() && p.right.is_none() {
    // If any values aren't set, this is builtin type.
    simple_properties_count += 1;
    match &*segmenter.segmenter_type {
Member

I like this new code

@robertbastian robertbastian merged commit 56ebf1a into unicode-org:main Apr 29, 2026
64 of 65 checks passed
@robertbastian robertbastian deleted the ucd branch April 29, 2026 10:10

4 participants