Replace unihan and ucd by a unified unicode data source#7882
Replace unihan and ucd by a unified unicode data source#7882robertbastian merged 6 commits intounicode-org:mainfrom
Conversation
b2607f4 to
0cb76c4
Compare
|
The issue is that UCD doesn't come in a standard distribution package, whereas the Unihan database is conveniently available as a single zip file. The UCD files usually need to be downloaded one by one, making the UCD a bit of an outlier compared to the Unihan setup."
What confusion? We discussed this at length with multiple parties including directly with the Unicode tools team when adding these sources in 2.2. |
I think which one is an outlier is a matter of opinion here. From the PAG side I find Unihan to be the outlier: it is a large bundle of UCD-like data files that is all in one zip file for organizational reasons (in both senses: "the org structure of unicode" and "this avoids clutter")
What is "standard distribution package"? The UCD does come in a zip file: https://www.unicode.org/Public/17.0.0/ucd/UCD.zip . Is there anything missing there? But yes, I think we asked UCD people about our design and they seemed to be in favor. I'm not sure which thread it was. |
|
I may be missing some context here, but I think one concern with fully collapsing Unihan into UCD is that, in practice, they may need different stability expectations even if they come from the same upstream Unicode release. For example, one use case for Because of that, I think the question is not only whether Unihan is conceptually part of the UCD, but also whether ICU4X needs to preserve a way to express different versioning/ stability requirements for those inputs in practice. If that use case matters, keeping some separate override/config surface for Unihan seems helpful, even if the underlying source is still considered part of the broader UCD. |
|
Right, I think that is the key question, and I think that was a part of the original discussion. Being able to override unihan source is important. I think |
0cb76c4 to
a0a9ffd
Compare
|
Segmenter being on an old Unicode version should not be handled by a data source split. Even currently, we hardcode an older version of icuexportdata inside
Yes, but that is just the UCD! We need access to all files under https://www.unicode.org/Public/17.0.0. Even currently the one file that the code is loading from the "ucd" data source is not actually from the UCD, it's https://www.unicode.org/Public/17.0.0/security/IdentifierStatus.txt. The UCD is only a part of the Unicode data source, and we need the whole Unicode data source, which does not come in a zip file. The data in https://www.unicode.org/Public/17.0.0/ucd/UCD.zip also does not match what is listed in https://www.unicode.org/Public/17.0.0/ucd, so from our perspective they're not interchangeable representations of the same data source.
We asked the wrong question. We asked "given that we have to model UCD and Unihan as separate data sources, how should we name them?". But that was a stupid question, because all Unicode files under https://www.unicode.org/Public/17.0.0/ represent a single versioned data source, whether they're zipped or not. |
a0a9ffd to
1f3750c
Compare
That's an interesting point: I'm not so sure it applies to Unihan as much since the mapping in question ought to remain stable except in the case of bugs (where divergence is fine). This is different from the segmenter thing where the algorithm actually does need the exact same tables. With segmenter, there is no reason to diverge. Here I think divergence is fine, but I could be wrong. So we probably shouldn't be hardcoding versions of Unihan here. |
sffc
left a comment
There was a problem hiding this comment.
What are you asking me to review? Are you asking me to review the code or are you asking for feedback on your proposed shape?
One caveat is that we should not call this the "unicode" data source. All of our data sources, except for tzdb, are Unicode data sources. The fact that the Unicode Technical Committee claimed the name "unicode" for their data files and specification is a branding issue that we have been told by higher-ups not to propagate further.
You mark your discussion comments as reviews, so I'm putting you back in review to respond to my comments.
I'm open to other names. "Unicode" seems to be the natural name for files at https://unicode.org/Public/, "UCD" is not correct for files outside the Also, the name "unicode" for the data source is internal, so it's not "branding" that I "propagate" with this PR. |
|
I do not want to end up with one or more data sources per subdirectory in https://unicode.org/Public/17.0.0/, which is where the current path of having separate "ucd" and "unihan" data sources is taking us. This is both a versioning and a UX nightmare; for users there should be a single Unicode version, explicit file system inputs are mainly used for testing, where we can adapt our test data to the shape we need, and for vendoring, where 7 different data sources just add room for errors. |
sffc
left a comment
There was a problem hiding this comment.
The idea with having a single --ucd-tag was that the tag would be used for all data sources on unicode.org/Public. The problem you talk about with version drift does not exist because there is just the one tag flag.
I proposed --ucd-tag because I consider everything in unicode.org/Public as part of the UCD. The directory ucd contains certain tables, but everything else in there like confusables, emoji charts, etc., is all an artifact of the UTC and therefore I call it the UCD.
--ucd-root and --unihan-root seem appropriate. We have infra for reading zip files when they are a root. Removing --unihan-root and considering Unihan files to be {ucd-root}/ucd/Unihan.zip/something.txt is an interesting design. I wish it had been brought up when we were previously discussing this.
I leave my comments as PR reviews in order to clear out my backlog. I don't have much more to add.
But it does exist because
According to https://www.unicode.org/ucd/ that seems to be correct. I can rename the relevant identifiers, but it's not going to substantially change this PR. Currently, the UCD source has actually not been properly implemented, the segmenter radicals code always uses hardcoded test data, so this PR is actually needed to address that omission.
Just because we have the infra doesn't mean they should be modeled that way. There's another zip file at https://www.unicode.org/Public/17.0.0/security/uts39-data-17.0.0.zip, by your logic that needs to get its own root argument as well. |
Yeah, "UCD" is ambiguously used and I'm not happy about that. |
sffc
left a comment
There was a problem hiding this comment.
Most users will use --ucd-tag (or leave it at the default). The only problem we solve by merging --ucd-root and --unihan-root is a power user who actually goes through the trouble of setting up those directories (a nontrivial task especially for UCD that doesn't have a single downloadable artefact). I'm not convinced this is worth the increased complexity of handling a zip file in the middle of a resource tree.
|
Please give this an in-depth review. Modelling Unihan as part of the UCD is only part of this change, the bulk is to actually make the UCD a working data source. |
|
|
||
| let raw_content = self | ||
| .unicode()? | ||
| .read_to_string("ucd/Unihan.zip/Unihan_IRGSources.txt")?; |
There was a problem hiding this comment.
Issue: I'd like to support unzipped unihan data. Should it be ucd/Unihan/Unihan_IRGSources.txt where we automatically check for .zip files if the directory doesn't exist?
There was a problem hiding this comment.
The reason why I don't want to support unzipped Unihan data, and why Unicode presumably doesn't distribute unzipped Unihan data is because the raw files are absolutely massive. This file is 13.4MB unzipped, but 1.9MB zipped.
There was a problem hiding this comment.
CLDR JSON is like 300 MB unzipped, but we support both zip and unzip versions.
There was a problem hiding this comment.
no single CLDR JSON file that we need to include in the repo as test data is 13MB
| [ | ||
| ("security/IdentifierStatus.txt", include_bytes!("../../tests/data/ucd/security/IdentifierStatus.txt").as_slice()) | ||
| ("security/IdentifierStatus.txt", include_bytes!("../../tests/data/unicode/security/IdentifierStatus.txt").as_slice()), | ||
| ("ucd/Unihan.zip", include_bytes!("../../tests/data/unicode/ucd/Unihan.zip").as_slice()) |
There was a problem hiding this comment.
Issue: We should not pull in the whole Unihan.zip file and add it to the repo
There was a problem hiding this comment.
we don't, download-repo-sources puts only the required files inside
There was a problem hiding this comment.
that's clever, thanks. I still don't want to add a zip file to the repo.
| let irg_path = out_root.join("tests/data/unihan/Unihan_IRGSources.txt"); | ||
| let file = File::open(&irg_path)?; | ||
| let reader = io::BufReader::new(file); | ||
| let filtered_content: String = reader |
There was a problem hiding this comment.
Observation: you removed the filtering. We should have the filtering if we land a txt source in the repo, which I think we should, rather than a zip.
There was a problem hiding this comment.
no we should not have the filtering, because we should test with real data
There was a problem hiding this comment.
It is real data. It is like removing JSON files we don't use. This file is basically like a CSV file and it makes sense that we would only include the rows of the CSV that we need.
If there was a CLDR JSON file where we use only 10%, then similarly I would be open to removing the parts that we don't reference.
|
I brought this up at the UTC meeting, which started a conversation about what is the correct term: https://github.com/unicode-org/properties/issues/546 |
I don't seem to have access to that |
|
There was some discussion at the UTC meeting today about the confusion of the UCD’s scope (ie, which files under a directory like https://unicode.org/Public/17.0.0/ belongs to the UCD). My impression is that we who maintain the UCD and manage the Unicode Standard’s releases have a pretty consistent understanding, that is, only https://unicode.org/Public/17.0.0/ucd/ is the UCD. You guys already know, files under such directories can be retrieved by downloading these two mutually exclusive ZIP files:
(Yes, in our documentation, eg, https://unicode.org/reports/tr44/ and https://unicode.org/ucd/, there’s ambiguous and/or outdated language. But there’s no need to keep talking about the alternative interpretation of language like “The latest version of the UCD is always located on the Unicode website at: https://www.unicode.org/Public/UCD/latest/”.) The data files published under directories like https://unicode.org/Public/17.0.0/ can be understood as what a Unicode Standard “release” consists of, and that’s why they’re under the same Unicode Standard version number, but they’re not necessarily all part of the Unicode Standard (and the UCD is an even smaller scope than the Unicode Standard):
For the data files that are not part of the UCD, yeah, unfortunately you need to download them one by one. We welcome suggestions and contributions about how to improve the developer experience in this area. |
|
(Sorry that I didn’t see Shane’s comment when I was writing mine.) |
b66cc22 to
477dd13
Compare
d977add to
1120982
Compare
Split from #7882 ## Changelog N/A
1120982 to
cfb69c2
Compare
|
FWIW ICU has this data in a directory called unidata |
53289f9 to
30f27e8
Compare
sffc
left a comment
There was a problem hiding this comment.
There are two concerns I have with this PR:
- Landing binary data in the repo, especially when the binary data is actually text data hidden in a zip file
- Agree as a WG to deprecate the flag that we had newly added in 2.2
|
If you'd rather have test data that doesn't match the real data, sure. I hope you won't regret this in the future. |
f95a1d2 to
f585303
Compare
sffc
left a comment
There was a problem hiding this comment.
Praise: I like the automatic handling of zip or non-zip files.
This is LGTM but it touches API. We should follow up with the WG on the impact.
| if let (Some(unihan_zip), Some(unihan_path)) = | ||
| (self.unihan_zip.as_ref(), file.strip_prefix("ucd/unihan/")) | ||
| { | ||
| Ok(unihan_zip.file_exists(unihan_path)?) |
There was a problem hiding this comment.
Observation: if there is a zip file, it makes this code never look at the ucd/unihan directory, even if it exists.
https://unicode.org/Public/{version} represents a single data source. I think the confusion here was that this data source contains some zip files, whereas so far we only had data sources that are available as zip files or directories of text files.
Changelog
icu_provider_source: Deprecated the unihan data source