Support numbering system overrides in datetime#7905
Support numbering system overrides in datetime#7905Manishearth merged 65 commits intounicode-org:mainfrom
Conversation
4e18fdf to
7ebc672
Compare
robertbastian
left a comment
There was a problem hiding this comment.
looks cool
tell your agent to not use Display, and to not allocate intermediate strings
| let s = o.format_number(year_val as usize); | ||
| w.with_part(PART, |w| w.write_str(&s))?; |
There was a problem hiding this comment.
issue: you're allocating an intermediate string. instead, let format_number write to w
| } | ||
|
|
||
| /// <https://github.com/unicode-org/cldr/blob/main/common/rbnf/root.xml#L522> | ||
| fn format_hanidec(number: usize) -> String { |
There was a problem hiding this comment.
don't allocate this intermediate string. we implement an allocation-free Writeable for the numeric types, you could wrap that to replace the digits
There was a problem hiding this comment.
Yeah, I was hoping to not have to do that. The hebrew stuff specifically is complex enough to benefit from intermediates.
There was a problem hiding this comment.
I made this use Writeables for everything except Hebrew.
There was a problem hiding this comment.
Drive-by: Datetime formatting is zero-alloc. Adding an allocation is a non-starter for me.
| 0 | ||
| ], | ||
| "elements": [ | ||
| "ddddddddddddddddd日", |
There was a problem hiding this comment.
that's a lot of ds, can we still tweak the encoding? I find it hard to believe that we've used all the other lengths
There was a problem hiding this comment.
This is based on preexisting agreement with CLDR
icu4x/components/datetime/src/provider/fields/length.rs
Lines 62 to 66 in 3ba5bf6
There was a problem hiding this comment.
well, we could still use 7, and if CLDR ever uses 7, shift that to 8 until we shift something to 16
There was a problem hiding this comment.
That's a data stability issue.
I consider this encoding out of scope for this PR. We got agreement on this a while ago, if we want to change that we can get agreement on that later. If we don't care about data stability, then we can change from 17 to 7 as well. If we do, then we will be sorry picking 7.
Note that these are almost never stored as strings, they are just stored as "17 d" in the baked data. It's only the rendered/JSON patterns.
There was a problem hiding this comment.
CLDR defined >16 (17 and higher) as the private use area.
These private-use field lengths are basically never stringified except in FsDataProvider JSON data.
|
Using Display and intermediate strings were actually deliberate choices to keep the impl simple, especially for Hebrew. There are probably ways to optimize |
|
I will try to make it use writeables, though. |
We've put a lot of effort into keeping this code alloc free. Please try. Your code already does sequential pushes into a string, it shouldn't be hard to change that to writeable. |
The Hebrew code has different behavior based on how many letters there are in the final string, and involve insertion of a mark inside the string. So yes, I could switch to writeable for most of this (and it's a followup I was hoping to play around with later), but I really don't want to try to do that for Hebrew just yet. |
|
To be clear, it's possible to write the Hebrew stuff in an alloc-free way, just that it would further complicate code that is already pretty complicated. |
Done |
cbf8bac to
1911b22
Compare
1911b22 to
90bddd9
Compare
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
Co-authored-by: Robert Bastian <4706271+robertbastian@users.noreply.github.com>
|
I discussed with @sffc about the errors: I don't really want to introduce new error types for "we didn't implement a full RBNF here". Instead, I made the romanlow/hebrew implementations fully match what is specced in CLDR (which eventually falls back to the default), and filed a TODO for hanidays #7922. My general position is we shouldn't be returning an error here, this is more of a GIGO situation and Latin fallback is reasonable. |
| return w.write_char('n'); // null | ||
| } | ||
| if n >= 5000 { | ||
| // romanlow falls back to the default past 5000. |
There was a problem hiding this comment.
the rules here says 5000: =#,##0=;. To me that looks like you should use the DecimalFormatter with grouping.
There was a problem hiding this comment.
We can't until we know which DecimalFormatter to use, that is part of
https://unicode-org.atlassian.net/browse/CLDR-19424
There was a problem hiding this comment.
I also don't particularly think we should optimize for this GIGO case.
| Ok(()) | ||
| } | ||
|
|
||
| /// <https://github.com/unicode-org/cldr/blob/main/common/rbnf/root.xml#L522> |
There was a problem hiding this comment.
| /// <https://github.com/unicode-org/cldr/blob/main/common/rbnf/root.xml#L522> | |
| /// <https://github.com/unicode-org/cldr/blob/fb0b4f0cb809cac10e8539dcba669c1d27d8e70c/common/main/root.xml#L3169-L3171> |
this seems to be locale-sensitive?
There was a problem hiding this comment.
Not exactly, and that's somewhat of an open question.
There was a problem hiding this comment.
ah I looked at the wrong thing, the symbols to use with hanidec are locale dependent. as we don't do grouping, that doesn't matter.
link should be https://github.com/unicode-org/cldr/blob/c6d4b3579d2ee196ad0f9c3a9adb608a55ddf99b/common/supplemental/numberingSystems.xml#L39
34101f2 to
22483c1
Compare
Fixes #1225
Multiple of these are used in a year or day context, so the original suggestion of "generate string month data for these" does not work. In theory, we could handle most of these in one-off ways:
jpnyearis a simple hardcodedif, the month-only ones like romanlow can go in data, etc. But since we need this infrastructure anyway for hanidec/hebr years, I don't think it's worth implementing this three different ways.This is a pretty nontrivial PR that was also almost entirely agent-written. I spent a lot of work talking to the agent, linking it to docs, and getting it to diagnose various issues. In the actual code, most of my work was minor cleanups and reorganizing the
jjhistory.(I used an agent instead of doing this myself because I am attending UTC this week and I cannot do focused coding work but I can prompt agents)
The agent was pretty good at doing its own printf debugging to figure out the source of various bugs. The initial naïve implementation of just applying the preexisting pattern code was insufficient, there were a number of bugs, and it took a lot of back-and-forth with the agent to diagnose them. There were a lot of cycles of the agent implementing an incorrect fix (e.g. by force-overriding things in datagen) and me saying we don't want to do that, undoing the change, and trying again with more prompting.
A thing I did not do (that the agent wanted to do) was support explicit user-driven overrides, like
he-u-calendar-hebrew-nu-hebr. I didn't do this because I'd need to supporthanidaysfor a wider range. However, it's not a complicated change to make.I have been reviewing this code extensively as I go on. However, this is a draft PR because I haven't done a proper end to end review. I also want to triple check the hebrew code against sources: it matches my understanding of the formatting but it might not be 100% correct.Changelog
icu_datetime: Support numbering system overrides for datetime patterns when found in data