Skip to content

Replace unihan and ucd by a unified unicode data source#7882

Merged
robertbastian merged 6 commits intounicode-org:mainfrom
robertbastian:unicodecache
Apr 28, 2026
Merged

Replace unihan and ucd by a unified unicode data source#7882
robertbastian merged 6 commits intounicode-org:mainfrom
robertbastian:unicodecache

Conversation

@robertbastian
Copy link
Copy Markdown
Member

@robertbastian robertbastian commented Apr 17, 2026

https://unicode.org/Public/{version} represents a single data source. I think the confusion here was that this data source contains some zip files, whereas so far we only had data sources that are available as zip files or directories of text files.

Changelog

icu_provider_source: Deprecated the unihan data source

@robertbastian robertbastian requested review from a team, Manishearth and sffc as code owners April 17, 2026 13:04
@robertbastian robertbastian force-pushed the unicodecache branch 2 times, most recently from b2607f4 to 0cb76c4 Compare April 17, 2026 13:27
@sffc
Copy link
Copy Markdown
Member

sffc commented Apr 17, 2026

The issue is that UCD doesn't come in a standard distribution package, whereas the Unihan database is conveniently available as a single zip file. The UCD files usually need to be downloaded one by one, making the UCD a bit of an outlier compared to the Unihan setup."

I think the confusion here

What confusion? We discussed this at length with multiple parties including directly with the Unicode tools team when adding these sources in 2.2.

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ not correctly convinced by motivation

@Manishearth
Copy link
Copy Markdown
Member

The UCD files usually need to be downloaded one by one, making the UCD a bit of an outlier compared to the Unihan setup.

I think which one is an outlier is a matter of opinion here. From the PAG side I find Unihan to be the outlier: it is a large bundle of UCD-like data files that is all in one zip file for organizational reasons (in both senses: "the org structure of unicode" and "this avoids clutter")

The issue is that UCD doesn't come in a standard distribution package

What is "standard distribution package"? The UCD does come in a zip file: https://www.unicode.org/Public/17.0.0/ucd/UCD.zip . Is there anything missing there?

But yes, I think we asked UCD people about our design and they seemed to be in favor. I'm not sure which thread it was.

@opnuub
Copy link
Copy Markdown
Contributor

opnuub commented Apr 17, 2026

I may be missing some context here, but I think one concern with fully collapsing Unihan into UCD is that, in practice, they may need different stability expectations even if they come from the same upstream Unicode release.

For example, one use case for Unihan.zip in this repo is extracting the codepoint_id -> kRSUnicode radical_id mapping to train the Chinese word segmentation model. At inference time, that model depends on the same mapping remaining stable, so there is value in pinning Unihan to a stable version. By contrast, other UCD data may reasonably want to track a newer Unicode release for unrelated functionality.

Because of that, I think the question is not only whether Unihan is conceptually part of the UCD, but also whether ICU4X needs to preserve a way to express different versioning/ stability requirements for those inputs in practice. If that use case matters, keeping some separate override/config surface for Unihan seems helpful, even if the underlying source is still considered part of the broader UCD.

@Manishearth
Copy link
Copy Markdown
Member

Manishearth commented Apr 17, 2026

Right, I think that is the key question, and I think that was a part of the original discussion. Being able to override unihan source is important.

I think ucd-root should automatically expose unihan data if supplied, but unihan-root makes sense to have as an override option.

@robertbastian
Copy link
Copy Markdown
Member Author

Segmenter being on an old Unicode version should not be handled by a data source split. Even currently, we hardcode an older version of icuexportdata inside icu_provider_source to support segmenter, we can and should do the same with the Unicode data.

The UCD does come in a zip file: https://www.unicode.org/Public/17.0.0/ucd/UCD.zip

Yes, but that is just the UCD! We need access to all files under https://www.unicode.org/Public/17.0.0. Even currently the one file that the code is loading from the "ucd" data source is not actually from the UCD, it's https://www.unicode.org/Public/17.0.0/security/IdentifierStatus.txt. The UCD is only a part of the Unicode data source, and we need the whole Unicode data source, which does not come in a zip file.

The data in https://www.unicode.org/Public/17.0.0/ucd/UCD.zip also does not match what is listed in https://www.unicode.org/Public/17.0.0/ucd, so from our perspective they're not interchangeable representations of the same data source.

But yes, I think we asked UCD people about our design and they seemed to be in favor. I'm not sure which thread it was.

We asked the wrong question. We asked "given that we have to model UCD and Unihan as separate data sources, how should we name them?". But that was a stupid question, because all Unicode files under https://www.unicode.org/Public/17.0.0/ represent a single versioned data source, whether they're zipped or not.

@Manishearth
Copy link
Copy Markdown
Member

Segmenter being on an old Unicode version should not be handled by a data source split. Even currently, we hardcode an older version of icuexportdata inside icu_provider_source to support segmenter, we can and should do the same with the Unicode data.

That's an interesting point: I'm not so sure it applies to Unihan as much since the mapping in question ought to remain stable except in the case of bugs (where divergence is fine). This is different from the segmenter thing where the algorithm actually does need the exact same tables.

With segmenter, there is no reason to diverge. Here I think divergence is fine, but I could be wrong. So we probably shouldn't be hardcoding versions of Unihan here.

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you asking me to review? Are you asking me to review the code or are you asking for feedback on your proposed shape?

One caveat is that we should not call this the "unicode" data source. All of our data sources, except for tzdb, are Unicode data sources. The fact that the Unicode Technical Committee claimed the name "unicode" for their data files and specification is a branding issue that we have been told by higher-ups not to propagate further.

@robertbastian
Copy link
Copy Markdown
Member Author

robertbastian commented Apr 21, 2026

What are you asking me to review? Are you asking me to review the code or are you asking for feedback on your proposed shape?

You mark your discussion comments as reviews, so I'm putting you back in review to respond to my comments.

One caveat is that we should not call this the "unicode" data source. All of our data sources, except for tzdb, are Unicode data sources. The fact that the Unicode Technical Committee claimed the name "unicode" for their data files and specification is a branding issue that we have been told by higher-ups not to propagate further.

I'm open to other names. "Unicode" seems to be the natural name for files at https://unicode.org/Public/, "UCD" is not correct for files outside the ucd directory.

Also, the name "unicode" for the data source is internal, so it's not "branding" that I "propagate" with this PR.

@robertbastian
Copy link
Copy Markdown
Member Author

I do not want to end up with one or more data sources per subdirectory in https://unicode.org/Public/17.0.0/, which is where the current path of having separate "ucd" and "unihan" data sources is taking us. This is both a versioning and a UX nightmare; for users there should be a single Unicode version, explicit file system inputs are mainly used for testing, where we can adapt our test data to the shape we need, and for vendoring, where 7 different data sources just add room for errors.

@robertbastian robertbastian requested a review from sffc April 21, 2026 09:13
Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea with having a single --ucd-tag was that the tag would be used for all data sources on unicode.org/Public. The problem you talk about with version drift does not exist because there is just the one tag flag.

I proposed --ucd-tag because I consider everything in unicode.org/Public as part of the UCD. The directory ucd contains certain tables, but everything else in there like confusables, emoji charts, etc., is all an artifact of the UTC and therefore I call it the UCD.

--ucd-root and --unihan-root seem appropriate. We have infra for reading zip files when they are a root. Removing --unihan-root and considering Unihan files to be {ucd-root}/ucd/Unihan.zip/something.txt is an interesting design. I wish it had been brought up when we were previously discussing this.

I leave my comments as PR reviews in order to clear out my backlog. I don't have much more to add.

@robertbastian
Copy link
Copy Markdown
Member Author

The problem you talk about with version drift does not exist because there is just the one tag flag.

But it does exist because --ucd-root and --unihan-root are separate data sources.

I consider everything in unicode.org/Public as part of the UCD

According to https://www.unicode.org/ucd/ that seems to be correct.

I can rename the relevant identifiers, but it's not going to substantially change this PR. Currently, the UCD source has actually not been properly implemented, the segmenter radicals code always uses hardcoded test data, so this PR is actually needed to address that omission.

--ucd-root and --unihan-root seem appropriate. We have infra for reading zip files when they are a root.

Just because we have the infra doesn't mean they should be modeled that way. There's another zip file at https://www.unicode.org/Public/17.0.0/security/uts39-data-17.0.0.zip, by your logic that needs to get its own root argument as well.

@robertbastian robertbastian requested a review from sffc April 21, 2026 21:24
@Manishearth
Copy link
Copy Markdown
Member

According to https://www.unicode.org/ucd/ that seems to be correct.

Yeah, "UCD" is ambiguously used and I'm not happy about that.

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most users will use --ucd-tag (or leave it at the default). The only problem we solve by merging --ucd-root and --unihan-root is a power user who actually goes through the trouble of setting up those directories (a nontrivial task especially for UCD that doesn't have a single downloadable artefact). I'm not convinced this is worth the increased complexity of handling a zip file in the middle of a resource tree.

@robertbastian
Copy link
Copy Markdown
Member Author

Please give this an in-depth review. Modelling Unihan as part of the UCD is only part of this change, the bulk is to actually make the UCD a working data source.

@robertbastian robertbastian requested a review from sffc April 22, 2026 16:42
Comment thread provider/source/src/segmenter/unihan.rs Outdated

let raw_content = self
.unicode()?
.read_to_string("ucd/Unihan.zip/Unihan_IRGSources.txt")?;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: I'd like to support unzipped unihan data. Should it be ucd/Unihan/Unihan_IRGSources.txt where we automatically check for .zip files if the directory doesn't exist?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I don't want to support unzipped Unihan data, and why Unicode presumably doesn't distribute unzipped Unihan data is because the raw files are absolutely massive. This file is 13.4MB unzipped, but 1.9MB zipped.

Copy link
Copy Markdown
Member

@sffc sffc Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLDR JSON is like 300 MB unzipped, but we support both zip and unzip versions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no single CLDR JSON file that we need to include in the repo as test data is 13MB

Comment thread provider/source/src/tests/data.rs Outdated
Comment thread provider/source/src/tests/data.rs Outdated
[
("security/IdentifierStatus.txt", include_bytes!("../../tests/data/ucd/security/IdentifierStatus.txt").as_slice())
("security/IdentifierStatus.txt", include_bytes!("../../tests/data/unicode/security/IdentifierStatus.txt").as_slice()),
("ucd/Unihan.zip", include_bytes!("../../tests/data/unicode/ucd/Unihan.zip").as_slice())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: We should not pull in the whole Unihan.zip file and add it to the repo

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't, download-repo-sources puts only the required files inside

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's clever, thanks. I still don't want to add a zip file to the repo.

let irg_path = out_root.join("tests/data/unihan/Unihan_IRGSources.txt");
let file = File::open(&irg_path)?;
let reader = io::BufReader::new(file);
let filtered_content: String = reader
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: you removed the filtering. We should have the filtering if we land a txt source in the repo, which I think we should, rather than a zip.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no we should not have the filtering, because we should test with real data

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is real data. It is like removing JSON files we don't use. This file is basically like a CSV file and it makes sense that we would only include the rows of the CSV that we need.

If there was a CLDR JSON file where we use only 10%, then similarly I would be open to removing the parts that we don't reference.

@robertbastian robertbastian requested a review from sffc April 22, 2026 21:18
@sffc
Copy link
Copy Markdown
Member

sffc commented Apr 22, 2026

I brought this up at the UTC meeting, which started a conversation about what is the correct term: https://github.com/unicode-org/properties/issues/546

@robertbastian
Copy link
Copy Markdown
Member Author

I brought this up at the UTC meeting, which started a conversation about what is the correct term: https://github.com/unicode-org/properties/issues/546

I don't seem to have access to that

@lianghai
Copy link
Copy Markdown
Member

There was some discussion at the UTC meeting today about the confusion of the UCD’s scope (ie, which files under a directory like https://unicode.org/Public/17.0.0/ belongs to the UCD).

My impression is that we who maintain the UCD and manage the Unicode Standard’s releases have a pretty consistent understanding, that is, only https://unicode.org/Public/17.0.0/ucd/ is the UCD. You guys already know, files under such directories can be retrieved by downloading these two mutually exclusive ZIP files:

(Yes, in our documentation, eg, https://unicode.org/reports/tr44/ and https://unicode.org/ucd/, there’s ambiguous and/or outdated language. But there’s no need to keep talking about the alternative interpretation of language like “The latest version of the UCD is always located on the Unicode website at: https://www.unicode.org/Public/UCD/latest/”.)

The data files published under directories like https://unicode.org/Public/17.0.0/ can be understood as what a Unicode Standard “release” consists of, and that’s why they’re under the same Unicode Standard version number, but they’re not necessarily all part of the Unicode Standard (and the UCD is an even smaller scope than the Unicode Standard):

  • Unicode Standard
    • UCD
      • Unihan
    • UAXes
  • Synced UTSes

For the data files that are not part of the UCD, yeah, unfortunately you need to download them one by one. We welcome suggestions and contributions about how to improve the developer experience in this area.

@lianghai
Copy link
Copy Markdown
Member

(Sorry that I didn’t see Shane’s comment when I was writing mine.)

@robertbastian robertbastian force-pushed the unicodecache branch 4 times, most recently from b66cc22 to 477dd13 Compare April 23, 2026 17:04
@robertbastian robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Apr 23, 2026
@robertbastian robertbastian force-pushed the unicodecache branch 3 times, most recently from d977add to 1120982 Compare April 23, 2026 22:27
robertbastian added a commit that referenced this pull request Apr 23, 2026
Split from #7882 

## Changelog

N/A
@robertbastian robertbastian requested a review from sffc April 23, 2026 22:56
@robertbastian
Copy link
Copy Markdown
Member Author

FWIW ICU has this data in a directory called unidata

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two concerns I have with this PR:

  1. Landing binary data in the repo, especially when the binary data is actually text data hidden in a zip file
  2. Agree as a WG to deprecate the flag that we had newly added in 2.2

@robertbastian
Copy link
Copy Markdown
Member Author

If you'd rather have test data that doesn't match the real data, sure. I hope you won't regret this in the future.

@robertbastian robertbastian requested a review from sffc April 27, 2026 09:57
Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: I like the automatic handling of zip or non-zip files.

This is LGTM but it touches API. We should follow up with the WG on the impact.

if let (Some(unihan_zip), Some(unihan_path)) =
(self.unihan_zip.as_ref(), file.strip_prefix("ucd/unihan/"))
{
Ok(unihan_zip.file_exists(unihan_path)?)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: if there is a zip file, it makes this code never look at the ucd/unihan directory, even if it exists.

@robertbastian robertbastian merged commit 7252653 into unicode-org:main Apr 28, 2026
34 checks passed
@robertbastian robertbastian deleted the unicodecache branch April 28, 2026 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discuss Discuss at a future ICU4X-SC meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants