Replace unihan and ucd by a unified unicode data source by robertbastian · Pull Request #7882 · unicode-org/icu4x

robertbastian · 2026-04-17T13:04:50Z

https://unicode.org/Public/{version} represents a single data source. I think the confusion here was that this data source contains some zip files, whereas so far we only had data sources that are available as zip files or directories of text files.

Changelog

icu_provider_source: Deprecated the unihan data source

sffc · 2026-04-17T16:45:28Z

The issue is that UCD doesn't come in a standard distribution package, whereas the Unihan database is conveniently available as a single zip file. The UCD files usually need to be downloaded one by one, making the UCD a bit of an outlier compared to the Unihan setup.

I think the confusion here

What confusion? We discussed this at length with multiple parties including directly with the Unicode tools team when adding these sources in 2.2.

sffc

^ not currently convinced by the motivation

Manishearth · 2026-04-17T21:29:53Z

The UCD files usually need to be downloaded one by one, making the UCD a bit of an outlier compared to the Unihan setup.

I think which one is the outlier is a matter of opinion here. From the PAG side I find Unihan to be the outlier: it is a large bundle of UCD-like data files that is all in one zip file for organizational reasons (in both senses: "the org structure of unicode" and "this avoids clutter").

The issue is that UCD doesn't come in a standard distribution package

What is "standard distribution package"? The UCD does come in a zip file: https://www.unicode.org/Public/17.0.0/ucd/UCD.zip . Is there anything missing there?

But yes, I think we asked UCD people about our design and they seemed to be in favor. I'm not sure which thread it was.

opnuub · 2026-04-17T22:32:51Z

I may be missing some context here, but I think one concern with fully collapsing Unihan into UCD is that, in practice, they may need different stability expectations even if they come from the same upstream Unicode release.

For example, one use case for Unihan.zip in this repo is extracting the codepoint_id -> kRSUnicode radical_id mapping to train the Chinese word segmentation model. At inference time, that model depends on the same mapping remaining stable, so there is value in pinning Unihan to a stable version. By contrast, other UCD data may reasonably want to track a newer Unicode release for unrelated functionality.

Because of that, I think the question is not only whether Unihan is conceptually part of the UCD, but also whether ICU4X needs to preserve a way to express different versioning/stability requirements for those inputs in practice. If that use case matters, keeping some separate override/config surface for Unihan seems helpful, even if the underlying source is still considered part of the broader UCD.
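For readers unfamiliar with the mapping mentioned above, here is a minimal sketch of how a code point to radical table could be read from Unihan-style lines. It assumes the tab-separated `U+XXXX<TAB>field<TAB>value` layout, takes only the first value when several are listed, and drops the apostrophes that can follow a radical number. The function name `parse_krs_unicode` is hypothetical and this is not the code ICU4X uses.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: maps each code point to the radical number of its
/// first `kRSUnicode` value. Lines that do not match the expected shape
/// are skipped rather than reported.
fn parse_krs_unicode(input: &str) -> BTreeMap<u32, u16> {
    let mut map = BTreeMap::new();
    for line in input.lines() {
        let line = line.trim();
        // Skip blank lines and comment lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let mut fields = line.split('\t');
        let (Some(cp), Some(field), Some(value)) =
            (fields.next(), fields.next(), fields.next())
        else {
            continue;
        };
        if field != "kRSUnicode" {
            continue;
        }
        // Code points are written as `U+` followed by hex digits.
        let Some(hex) = cp.strip_prefix("U+") else {
            continue;
        };
        let Ok(cp) = u32::from_str_radix(hex, 16) else {
            continue;
        };
        // Assumption: only the first space-separated value is used.
        let Some(first) = value.split_whitespace().next() else {
            continue;
        };
        // A value looks like `radical.strokes`; the radical may carry
        // trailing apostrophes, which are dropped here.
        let Some((radical, _strokes)) = first.split_once('.') else {
            continue;
        };
        if let Ok(radical) = radical.trim_end_matches('\'').parse::<u16>() {
            map.insert(cp, radical);
        }
    }
    map
}
```

A model trained against such a table would need the same table at inference time, which is the stability concern raised in this comment.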

Manishearth · 2026-04-17T23:04:08Z

Right, I think that is the key question, and I think that was a part of the original discussion. Being able to override unihan source is important.

I think ucd-root should automatically expose unihan data if supplied, but unihan-root makes sense to have as an override option.

robertbastian · 2026-04-20T08:18:57Z

Segmenter being on an old Unicode version should not be handled by a data source split. Even currently, we hardcode an older version of icuexportdata inside icu_provider_source to support segmenter, we can and should do the same with the Unicode data.

The UCD does come in a zip file: https://www.unicode.org/Public/17.0.0/ucd/UCD.zip

Yes, but that is just the UCD! We need access to all files under https://www.unicode.org/Public/17.0.0. Even currently the one file that the code is loading from the "ucd" data source is not actually from the UCD, it's https://www.unicode.org/Public/17.0.0/security/IdentifierStatus.txt. The UCD is only a part of the Unicode data source, and we need the whole Unicode data source, which does not come in a zip file.

The data in https://www.unicode.org/Public/17.0.0/ucd/UCD.zip also does not match what is listed in https://www.unicode.org/Public/17.0.0/ucd, so from our perspective they're not interchangeable representations of the same data source.

But yes, I think we asked UCD people about our design and they seemed to be in favor. I'm not sure which thread it was.

We asked the wrong question. We asked "given that we have to model UCD and Unihan as separate data sources, how should we name them?". But that was a stupid question, because all Unicode files under https://www.unicode.org/Public/17.0.0/ represent a single versioned data source, whether they're zipped or not.
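To illustrate the "single versioned data source" view, the sketch below derives every file location from one version string and a path relative to the release directory. The helper name `unicode_public_url` is hypothetical; only the URL layout is taken from the links quoted in this thread.

```rust
/// Hypothetical helper: builds the URL of a file inside a versioned
/// release directory on unicode.org, e.g. `17.0.0` plus
/// `security/IdentifierStatus.txt`.
fn unicode_public_url(version: &str, relative_path: &str) -> String {
    format!(
        "https://www.unicode.org/Public/{}/{}",
        // Tolerate stray slashes around either component.
        version.trim_matches('/'),
        relative_path.trim_start_matches('/')
    )
}
```

With this shape, `ucd/Unihan.zip` and `security/IdentifierStatus.txt` are simply two paths under the same version, so they cannot drift apart.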

Manishearth · 2026-04-20T15:17:33Z

Segmenter being on an old Unicode version should not be handled by a data source split. Even currently, we hardcode an older version of icuexportdata inside icu_provider_source to support segmenter, we can and should do the same with the Unicode data.

That's an interesting point: I'm not so sure it applies to Unihan as much since the mapping in question ought to remain stable except in the case of bugs (where divergence is fine). This is different from the segmenter thing where the algorithm actually does need the exact same tables.

With segmenter, there is no reason to diverge. Here I think divergence is fine, but I could be wrong. So we probably shouldn't be hardcoding versions of Unihan here.

sffc

What are you asking me to review? Are you asking me to review the code or are you asking for feedback on your proposed shape?

One caveat is that we should not call this the "unicode" data source. All of our data sources, except for tzdb, are Unicode data sources. The fact that the Unicode Technical Committee claimed the name "unicode" for their data files and specification is a branding issue that we have been told by higher-ups not to propagate further.

robertbastian · 2026-04-21T09:10:06Z

What are you asking me to review? Are you asking me to review the code or are you asking for feedback on your proposed shape?

You mark your discussion comments as reviews, so I'm putting you back in review to respond to my comments.

One caveat is that we should not call this the "unicode" data source. All of our data sources, except for tzdb, are Unicode data sources. The fact that the Unicode Technical Committee claimed the name "unicode" for their data files and specification is a branding issue that we have been told by higher-ups not to propagate further.

I'm open to other names. "Unicode" seems to be the natural name for files at https://unicode.org/Public/, "UCD" is not correct for files outside the ucd directory.

Also, the name "unicode" for the data source is internal, so it's not "branding" that I "propagate" with this PR.

robertbastian · 2026-04-21T09:13:25Z

I do not want to end up with one or more data sources per subdirectory in https://unicode.org/Public/17.0.0/, which is where the current path of having separate "ucd" and "unihan" data sources is taking us. This is both a versioning and a UX nightmare. For users there should be a single Unicode version; explicit file system inputs are mainly used for testing, where we can adapt our test data to the shape we need, and for vendoring, where 7 different data sources just add room for errors.

sffc

The idea with having a single --ucd-tag was that the tag would be used for all data sources on unicode.org/Public. The problem you talk about with version drift does not exist because there is just the one tag flag.

I proposed --ucd-tag because I consider everything in unicode.org/Public as part of the UCD. The directory ucd contains certain tables, but everything else in there like confusables, emoji charts, etc., is all an artifact of the UTC and therefore I call it the UCD.

--ucd-root and --unihan-root seem appropriate. We have infra for reading zip files when they are a root. Removing --unihan-root and considering Unihan files to be {ucd-root}/ucd/Unihan.zip/something.txt is an interesting design. I wish it had been brought up when we were previously discussing this.

I leave my comments as PR reviews in order to clear out my backlog. I don't have much more to add.

robertbastian · 2026-04-21T21:24:53Z

The problem you talk about with version drift does not exist because there is just the one tag flag.

But it does exist because --ucd-root and --unihan-root are separate data sources.

I consider everything in unicode.org/Public as part of the UCD

According to https://www.unicode.org/ucd/ that seems to be correct.

I can rename the relevant identifiers, but it's not going to substantially change this PR. Currently, the UCD source has not actually been properly implemented: the segmenter radicals code always uses hardcoded test data, so this PR is needed to address that omission.

--ucd-root and --unihan-root seem appropriate. We have infra for reading zip files when they are a root.

Just because we have the infra doesn't mean they should be modeled that way. There's another zip file at https://www.unicode.org/Public/17.0.0/security/uts39-data-17.0.0.zip; by your logic that needs to get its own root argument as well.

Manishearth · 2026-04-21T21:42:45Z

According to https://www.unicode.org/ucd/ that seems to be correct.

Yeah, "UCD" is ambiguously used and I'm not happy about that.

sffc

Most users will use --ucd-tag (or leave it at the default). The only problem we solve by merging --ucd-root and --unihan-root is a power user who actually goes through the trouble of setting up those directories (a nontrivial task especially for UCD that doesn't have a single downloadable artefact). I'm not convinced this is worth the increased complexity of handling a zip file in the middle of a resource tree.

robertbastian · 2026-04-22T16:42:29Z

Please give this an in-depth review. Modelling Unihan as part of the UCD is only part of this change; the bulk is to actually make the UCD a working data source.

sffc · 2026-04-22T20:58:25Z

+
+        let raw_content = self
+            .unicode()?
+            .read_to_string("ucd/Unihan.zip/Unihan_IRGSources.txt")?;


Issue: I'd like to support unzipped unihan data. Should it be ucd/Unihan/Unihan_IRGSources.txt where we automatically check for .zip files if the directory doesn't exist?

The reason I don't want to support unzipped Unihan data, and presumably the reason Unicode doesn't distribute unzipped Unihan data, is that the raw files are absolutely massive. This file is 13.4MB unzipped, but 1.9MB zipped.

CLDR JSON is like 300 MB unzipped, but we support both zip and unzip versions.

no single CLDR JSON file that we need to include in the repo as test data is 13MB
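The fallback proposed in this thread (look for the plain file first, then for a `.zip` file named after one of its parent directories) could be sketched as follows. This is an in-memory illustration under stated assumptions, not the code from this PR: `MockTree`, `Location`, and `resolve` are hypothetical names, and a real implementation would consult the file system and a zip reader instead of sets of strings.

```rust
use std::collections::{HashMap, HashSet};

/// Where a logical path was found (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Location {
    /// The path exists as a plain file.
    Plain(String),
    /// The path exists as `entry` inside the zip file `archive`.
    InZip { archive: String, entry: String },
}

/// In-memory stand-in for a data source root.
struct MockTree {
    /// Plain files, keyed by path relative to the root.
    files: HashSet<String>,
    /// Zip files, keyed by path, each with the set of entries it contains.
    zips: HashMap<String, HashSet<String>>,
}

impl MockTree {
    /// Resolves `path`, preferring a plain file and otherwise treating
    /// each ancestor directory `dir` as a possible `dir.zip` archive.
    fn resolve(&self, path: &str) -> Option<Location> {
        if self.files.contains(path) {
            return Some(Location::Plain(path.to_string()));
        }
        let mut end = path.len();
        // Walk the separators from right to left: `a/b/c.txt` tries
        // `a/b.zip` with entry `c.txt`, then `a.zip` with entry `b/c.txt`.
        while let Some(idx) = path[..end].rfind('/') {
            let archive = format!("{}.zip", &path[..idx]);
            let entry = &path[idx + 1..];
            let found = self
                .zips
                .get(&archive)
                .map_or(false, |entries| entries.contains(entry));
            if found {
                return Some(Location::InZip {
                    archive,
                    entry: entry.to_string(),
                });
            }
            end = idx;
        }
        None
    }
}
```

In this sketch an unpacked `ucd/Unihan/` directory wins over `ucd/Unihan.zip` when both are present; the opposite order is equally possible and is a design choice.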

sffc · 2026-04-22T21:00:53Z

                    [
-                        ("security/IdentifierStatus.txt", include_bytes!("../../tests/data/ucd/security/IdentifierStatus.txt").as_slice())
+                        ("security/IdentifierStatus.txt", include_bytes!("../../tests/data/unicode/security/IdentifierStatus.txt").as_slice()),
+                        ("ucd/Unihan.zip", include_bytes!("../../tests/data/unicode/ucd/Unihan.zip").as_slice())


Issue: We should not pull in the whole Unihan.zip file and add it to the repo

we don't, download-repo-sources puts only the required files inside

that's clever, thanks. I still don't want to add a zip file to the repo.

sffc · 2026-04-22T21:11:12Z

-    let irg_path = out_root.join("tests/data/unihan/Unihan_IRGSources.txt");
-    let file = File::open(&irg_path)?;
-    let reader = io::BufReader::new(file);
-    let filtered_content: String = reader


Observation: you removed the filtering. We should have the filtering if we land a txt source in the repo, which I think we should, rather than a zip.

no we should not have the filtering, because we should test with real data

It is real data. It is like removing JSON files we don't use. This file is basically like a CSV file and it makes sense that we would only include the rows of the CSV that we need.

If there was a CLDR JSON file where we use only 10%, then similarly I would be open to removing the parts that we don't reference.
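The "only the rows we need" idea could be sketched as below. It assumes tab-separated `U+XXXX<TAB>field<TAB>value` lines and keeps comment lines so that any header in the file survives. The name `filter_unihan_fields` is hypothetical, and the criteria used by the filtering code removed in this diff are not shown in this thread, so this is an illustration rather than a reconstruction.

```rust
/// Hypothetical helper: keeps comment lines and every data line whose
/// second tab-separated column is one of `keep_fields`; all other lines
/// (including blank ones) are dropped.
fn filter_unihan_fields(input: &str, keep_fields: &[&str]) -> String {
    let mut out = String::new();
    for line in input.lines() {
        let is_comment = line.starts_with('#');
        // The field name is the second column of a data line.
        let is_wanted = line
            .split('\t')
            .nth(1)
            .map_or(false, |field| keep_fields.contains(&field));
        if is_comment || is_wanted {
            out.push_str(line);
            out.push('\n');
        }
    }
    out
}
```

Calling it with `&["kRSUnicode"]` would keep just the radical-stroke rows, which is the trade-off debated here: a much smaller text file in the repo, at the cost of test data that is a subset of the upstream file.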

sffc · 2026-04-22T22:09:47Z

I brought this up at the UTC meeting, which started a conversation about what is the correct term: https://github.com/unicode-org/properties/issues/546

robertbastian · 2026-04-22T22:13:46Z

I brought this up at the UTC meeting, which started a conversation about what is the correct term: https://github.com/unicode-org/properties/issues/546

I don't seem to have access to that

lianghai · 2026-04-22T22:37:17Z

There was some discussion at the UTC meeting today about the confusion over the UCD’s scope (ie, which files under a directory like https://unicode.org/Public/17.0.0/ belong to the UCD).

My impression is that those of us who maintain the UCD and manage the Unicode Standard’s releases have a pretty consistent understanding, that is, only https://unicode.org/Public/17.0.0/ucd/ is the UCD. As you already know, the files under such a directory can be retrieved by downloading these two mutually exclusive ZIP files:

https://unicode.org/Public/17.0.0/ucd/UCD.zip (This is actually the UCD excluding the Unihan part.)
https://unicode.org/Public/17.0.0/ucd/Unihan.zip (This is the UCD’s Unihan part.)

(Yes, in our documentation, eg, https://unicode.org/reports/tr44/ and https://unicode.org/ucd/, there’s ambiguous and/or outdated language. But there’s no need to keep talking about the alternative interpretation of language like “The latest version of the UCD is always located on the Unicode website at: https://www.unicode.org/Public/UCD/latest/”.)

The data files published under directories like https://unicode.org/Public/17.0.0/ can be understood as what a Unicode Standard “release” consists of, and that’s why they’re under the same Unicode Standard version number, but they’re not necessarily all part of the Unicode Standard (and the UCD is an even smaller scope than the Unicode Standard):

- Unicode Standard
  - UCD
    - Unihan
  - UAXes
- Synced UTSes

For the data files that are not part of the UCD, yeah, unfortunately you need to download them one by one. We welcome suggestions and contributions about how to improve the developer experience in this area.

lianghai · 2026-04-22T22:47:54Z

(Sorry that I didn’t see Shane’s comment when I was writing mine.)

robertbastian · 2026-04-23T23:23:31Z

FWIW ICU has this data in a directory called unidata

sffc

There are two concerns I have with this PR:

- Landing binary data in the repo, especially when the binary data is actually text data hidden in a zip file
- Needing to agree as a WG to deprecate the flag that we newly added in 2.2

robertbastian · 2026-04-27T09:57:18Z

If you'd rather have test data that doesn't match the real data, sure. I hope you won't regret this in the future.

sffc

Praise: I like the automatic handling of zip or non-zip files.

This is LGTM but it touches API. We should follow up with the WG on the impact.

sffc · 2026-04-28T09:56:14Z

+        if let (Some(unihan_zip), Some(unihan_path)) =
+            (self.unihan_zip.as_ref(), file.strip_prefix("ucd/unihan/"))
+        {
+            Ok(unihan_zip.file_exists(unihan_path)?)


Observation: if there is a zip file, it makes this code never look at the ucd/unihan directory, even if it exists.

robertbastian requested review from a team, Manishearth and sffc as code owners April 17, 2026 13:04

robertbastian force-pushed the unicodecache branch 2 times, most recently from b2607f4 to 0cb76c4 Compare April 17, 2026 13:27

sffc requested changes Apr 17, 2026

robertbastian force-pushed the unicodecache branch from 0cb76c4 to a0a9ffd Compare April 20, 2026 08:13

robertbastian requested a review from sffc April 20, 2026 08:19

robertbastian force-pushed the unicodecache branch from a0a9ffd to 1f3750c Compare April 20, 2026 08:46

sffc reviewed Apr 20, 2026

robertbastian requested a review from sffc April 21, 2026 09:13

sffc reviewed Apr 21, 2026

robertbastian requested a review from sffc April 21, 2026 21:24

sffc reviewed Apr 21, 2026

robertbastian requested a review from sffc April 22, 2026 16:42

sffc requested changes Apr 22, 2026

robertbastian requested a review from sffc April 22, 2026 21:18

robertbastian mentioned this pull request Apr 23, 2026

Simplify test sources #7906

Merged

robertbastian force-pushed the unicodecache branch 4 times, most recently from b66cc22 to 477dd13 Compare April 23, 2026 17:04

robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Apr 23, 2026

robertbastian force-pushed the unicodecache branch 3 times, most recently from d977add to 1120982 Compare April 23, 2026 22:27

robertbastian added a commit that referenced this pull request Apr 23, 2026

Simplify test sources (#7906)

e85cdb2

Split from #7882 ## Changelog N/A

robertbastian force-pushed the unicodecache branch from 1120982 to cfb69c2 Compare April 23, 2026 22:52

robertbastian requested a review from sffc April 23, 2026 22:56

robertbastian added 4 commits April 24, 2026 19:59

Add HTTP functionality to AbstractFs

48de323

Replace unihan and ucd by a unified unicode data source

95b6951

rename

e0a162e

Revert "rename"

30f27e8

This reverts commit cfb69c2.

robertbastian force-pushed the unicodecache branch from 53289f9 to 30f27e8 Compare April 24, 2026 18:03

sffc reviewed Apr 24, 2026

robertbastian mentioned this pull request Apr 25, 2026

Generate properties from the UCD #7904

Merged

robertbastian requested a review from sffc April 27, 2026 09:57

robertbastian added 2 commits April 27, 2026 12:43

don't use zip files implicitly

703a61a

docs

f585303

robertbastian force-pushed the unicodecache branch from f95a1d2 to f585303 Compare April 27, 2026 10:43

sffc approved these changes Apr 28, 2026

robertbastian merged commit 7252653 into unicode-org:main Apr 28, 2026
34 checks passed

robertbastian deleted the unicodecache branch April 28, 2026 10:00
