Generate properties from the UCD#7904
Conversation
e5335d1 to
debc7d6
Compare
3370232 to
3aa43c4
Compare
|
Is there a reason we have a data diff at all here? |
|
The diff for the BidiClass and Script CPTs is weird, but it's the same diff you get if you rebuild CPTs with icuexportdata, see #7299. The removed/added values are the trie's default values, and iterating through the trie yields the same results, so I don't think this diff is relevant. The diff for CCC parsing is because the UCD also defines numeric aliases, which I've added to the parser for good measure. |
Manishearth
left a comment
There was a problem hiding this comment.
General shape is promising.
| for (start, end) in &self.ranges { | ||
| builder.add_range32(start..=end); | ||
|
|
||
| // UTS #18 Annex C Compatibility Properties |
There was a problem hiding this comment.
thought: huh; I didn't realize we generate compound compat props.
That's data duplication. Though it's also not that important i guess.
There was a problem hiding this comment.
I'd love to get rid of them. We can still support them in the unicode set parser.
|
|
||
| // UTS #18 Annex C Compatibility Properties | ||
| if name == "alnum" { | ||
| // \p{alpha}\p{digit} = \p{Alphabetic}\p{gc=Decimal_Number} |
There was a problem hiding this comment.
thought: would be very cool to use our unicodeset code to make this big function smaller
not actually a suggestion
|
Nit: "Changelog: N/A" is not correct. It is relevant to clients that their data comes from a |
sffc
left a comment
There was a problem hiding this comment.
I'm inclined to defer to @Manishearth's review, and ideally @markusicu could take a look, too. I'd like though if we can get this to a zero diff. I don't understand why there are diffs in bidi, script, and canonical combining class. Please try to make this a clean diff if possible or else write up in more detail why the diff is (1) benign and (2) unavoidable.
Please try to land that PR before this one in order to avoid the data diff.
I prefer also splitting this out into a separate PR. |
That PR is not landeable because of datagen performance. I'm not going to invest time improving that just to immediately replace it again. |
|
I don't understand: you are needing to build all the same CPTs here as you do in that other PR. So why does only one of them have performance problems? |
|
Because in this PR I've implemented caching inside the UCD data source. |
|
Would it be possible to make a small PR (maybe one that adds defaults to TrieValue) that has temporary hacks leading to the same diffs? It would be nic eto have a programmatic understanding of what's changed. But I don't think we should spend too much time on this. |
|
I've updated the Script default in the other PR. The diffs are now byte-identical. |
|
Nice macro. |
| _ => "ucd/PropList.txt", | ||
| }; | ||
|
|
||
| for line in self.unicode()?.read_to_string(file)?.lines() { |
There was a problem hiding this comment.
issue: this type of UCD parsing code shows up in multiple spots: can we make a shared helper?
(or maybe two, one for binary and one for enum props)
Having a separate ucd.rs with all that code would be nice.
They are somewhat different formats (especially things like emoji-sequences) so maybe it's not easy to share the helper, but maybe they can all be in one file at least.
There was a problem hiding this comment.
I though about this, but
- we can isolate the comment stripping, but apparently some comments are semantically relevant, so that might be confusing
- there are enough edge cases, like CCC, that this would get quite complicated
- I don't want to create intermediate data structures if possible
|
|
||
| for line in self.unicode()?.read_to_string(file)?.lines() { | ||
| let line = line.split('#').next().unwrap().trim(); | ||
| if line.is_empty() { |
There was a problem hiding this comment.
issue: These files have @missing rules that are also supposed to be parsed. As far as I can tell, each file has only a single such rule per property, but I'm not sure if that's guaranteed, and I'm not sure if that rule always matches the enum default we have here. Probably.
We should at the very least think about this and add a comment here mentioning it. I've already asked Robin for clarity.
There was a problem hiding this comment.
afaict only DerivedNormalizationProps.txt has @missing. I'll add it
There was a problem hiding this comment.
As far as I can tell, each file has only a single such rule per property,
Nope, and that’s explicitly documented: https://www.unicode.org/reports/tr44/#Missing_Conventions
Starting with Version 15.0, some data files in the UCD may contain multiple @missing lines defined for the same property.
And DerivedEastAsianWidth.txt, which is in this PR, is one such file.
There was a problem hiding this comment.
That also says
An
@missingline is never provided for a binary property, because the default value for binary properties is always "N" and need not be defined redundantly for each binary property.
which doesn't match the data.
There was a problem hiding this comment.
ah, the property I looked at wasn't binary, it was enumerated over Yes and No. of course
There was a problem hiding this comment.
(It does, it’s just that two enumerated properties—NFD_QC and NFKD_QC—have exactly two values Y=Yes and N=No. Interestingly, they do not have the aliases T=True and F=False that real binary properties have.)
There was a problem hiding this comment.
To @Manishearth’s original comment: The @missing support is there in this PR already, in enum_codepointtrie.rs which is where it is applicable:
icu4x/provider/source/src/properties/enum_codepointtrie.rs
Lines 115 to 122 in 556232d
Co-authored-by: Copilot <copilot@github.com>
This reverts commit ce74f86.
Manishearth
left a comment
There was a problem hiding this comment.
Overall I like this. I still would prefer if all UCD-format handling for loops were in a separate file in one place (even if they are half a dozen different functions with slightly different APIs), in case we need to fix bugs around that. Markus/Elango have indicated that there are nuances there, and @missing is one such nuance we messed up here once already, and things are liable to change in the future.
But I'm happy to do that myself as a followup.
| }) | ||
| #[cfg(not(any(feature = "use_wasm", feature = "use_icu4c")))] | ||
| return Err(DataError::custom( | ||
| "icu_provider_source must be built with use_icu4c or use_wasm to build properties data", |
There was a problem hiding this comment.
question: is this change something we care about?
I think this is fine, kind of inevitable, wasm is default so I expect most users to just use this without noticing.
There was a problem hiding this comment.
Shane also agrees this is fine.
| let trie = map | ||
| .into_iter() | ||
| // Filter CCC's numeric names. | ||
| // TODO: Don't |
| .read_to_string("ucd/ScriptExtensions.txt")? | ||
| .lines() | ||
| { | ||
| let line = line.split('#').next().unwrap().trim(); |
There was a problem hiding this comment.
observation: this code is more complex than the others
| pub(crate) short: Option<String>, | ||
| #[serde(default)] | ||
| pub(crate) aliases: Vec<String>, | ||
| pub(crate) short: String, |
There was a problem hiding this comment.
observation: so we still use this data for ICU4C discriminants and the short name?
| } | ||
|
|
||
| fn hardcoded_segmenter_provider() -> SourceDataProvider { | ||
| fn unicode_15_1() -> &'static SourceDataProvider { |
There was a problem hiding this comment.
nit: document as being for segmenter to use
| if p.left.is_none() && p.right.is_none() { | ||
| // If any values aren't set, this is builtin type. | ||
| simple_properties_count += 1; | ||
| match &*segmenter.segmenter_type { |
#4602
Changelog
icu_provider_source:unicodedata source, instead of fromicuexport