Releases: SCKelemen/unicode
v6.2.0 — module path fix to /v6 (first working v6 release)
[v6.2.0] - 2026-05-20
Module path migration: github.com/SCKelemen/unicode → github.com/SCKelemen/unicode/v6.
Fixed
- Module path now declares major version suffix. Tags `v2.0.0` through `v6.1.0` were unreachable via `go get` because the `go.mod` declared `module github.com/SCKelemen/unicode` without the required `/v6` suffix. Go's module proxy refuses to serve v2+ tags at a path with no major-version suffix, so the v6 line was effectively unusable by downstream consumers.
  This release fixes the path declaration:
  module github.com/SCKelemen/unicode/v6
  All internal imports between subpackages (`uax9`, `uax11`, `uax14`, `uax24`, `uax29`, `uax31`, `uax50`, `uts15`, `uts39`, `uts51`) have been rewritten to use the `/v6` path.
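For readers unfamiliar with the rule behind this fix, here is a minimal sketch of the semantic import versioning check: a v2+ tag is only usable when the module path ends in the matching `/vN` suffix, while v0 and v1 tags must not have one. The helper names (`pathMatchesTag`, `majorOf`, `hasMajorSuffix`, `allDigits`) are hypothetical and belong to neither this module nor the Go toolchain. The sketch also ignores gopkg.in-style paths and `+incompatible` versions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allDigits reports whether s is non-empty and consists only of ASCII digits.
func allDigits(s string) bool {
	if s == "" {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return false
		}
	}
	return true
}

// majorOf extracts the major version from a tag such as "v6.2.0".
// It returns -1 when the tag does not start with "v<digits>".
func majorOf(tag string) int {
	if !strings.HasPrefix(tag, "v") {
		return -1
	}
	head, _, _ := strings.Cut(tag[1:], ".")
	if !allDigits(head) {
		return -1
	}
	n, err := strconv.Atoi(head)
	if err != nil {
		return -1
	}
	return n
}

// hasMajorSuffix reports whether path ends in "/v<digits>".
func hasMajorSuffix(path string) bool {
	i := strings.LastIndex(path, "/v")
	return i >= 0 && allDigits(path[i+2:])
}

// pathMatchesTag reports whether modulePath carries the suffix expected for
// tag: no "/vN" suffix for v0 and v1, exactly "/vN" for major version N >= 2.
func pathMatchesTag(modulePath, tag string) bool {
	major := majorOf(tag)
	if major < 0 {
		return false
	}
	if major < 2 {
		return !hasMajorSuffix(modulePath)
	}
	return strings.HasSuffix(modulePath, "/v"+strconv.Itoa(major))
}

func main() {
	for _, c := range []struct{ path, tag string }{
		{"github.com/SCKelemen/unicode", "v1.1.1"},
		{"github.com/SCKelemen/unicode", "v6.1.0"},
		{"github.com/SCKelemen/unicode/v6", "v6.2.0"},
	} {
		fmt.Printf("%s @ %s -> %v\n", c.path, c.tag, pathMatchesTag(c.path, c.tag))
	}
}
```

Only the second combination fails the check, which corresponds to the situation the earlier v6 tags were in.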
Migration for downstream consumers
go get github.com/SCKelemen/unicode/v6@v6.2.0
Update imports:
import "github.com/SCKelemen/unicode/v6/uax29"
Source-level API is unchanged from v6.1.0. This is purely a packaging fix.
Historical tags
The v6.0.0 and v6.1.0 git tags remain in the repository as historical artifacts of the v6 development line. They are not consumable via go get because of the path mismatch, and they will not be re-tagged: re-tagging would require a destructive force-push, which is against this repo's policy. Use v6.2.0 or later.
The v1.x line remains available at github.com/SCKelemen/unicode for consumers who have not yet migrated. v1.1.1 is the most recent v1 release.
v6.1.0
Full Changelog: v1.1.1...v6.1.0
v6.0.0: Memory Optimization and ASCII Fast Paths
Version 6.0.0 focuses on memory optimization and ASCII fast paths to dramatically improve performance for common cases while maintaining 100% Unicode conformance.
🚀 Performance Improvements
ASCII Fast Paths (100x+ Speedups)
UTS #15 (Normalization):
- ASCII NFC normalization: 129x faster (7.68 ns/op vs 995 ns/op)
- ASCII NFKC normalization: 144x faster (7.72 ns/op vs 1,115 ns/op)
- 🎯 ASCII text normalization is essentially FREE (single `isASCII()` check)
UTS #39 (Security):
- ASCII mixed-script check: 34x faster (4.18 ns/op vs 142 ns/op)
- ASCII safe identifier check: 3.7x faster (74.7 ns/op vs 277 ns/op)
Unicode Text Improvements
UTS #39 (Security):
- Skeleton algorithm: 2.5x faster (174 ns/op vs 430 ns/op)
- Confusable detection: 1.7x faster (502 ns/op vs 874 ns/op)
UTS #15 (Normalization):
- NFKC: 8% faster (5,390 ns/op vs 5,877 ns/op)
- NFKD: 6% faster (3,135 ns/op vs 3,337 ns/op)
💾 Memory Improvements
Type Size Reductions
| Component | Before | After | Savings |
|---|---|---|---|
| combiningClassMap (UTS #15) | ~15.5 KB | ~7.75 KB | 50% (7.75 KB) |
| Script type (UAX #24) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |
| BreakClass type (UAX #14) | 8 bytes/value | 1 byte/value | 87.5% (7 bytes) |
🎯 All runtime structures using these types are 50-87.5% smaller with better CPU cache density.
🔧 Technical Changes
Type Reductions
- UTS #15: `combiningClassMap` changed from `map[rune]int` to `map[rune]uint8`
  - Unicode combining classes range 0-240, fit perfectly in uint8 (0-255)
- UAX #24: `Script` type changed from `int` to `uint8`
  - 176 Unicode scripts fit comfortably in uint8 (0-255)
- UAX #14: `BreakClass` type changed from `int` to `uint8`
  - 66 break classes fit in uint8 (0-255)
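The per-value sizes quoted above can be checked with `unsafe.Sizeof`. The sketch below uses hypothetical stand-in types (`scriptWide`, `scriptNarrow`) instead of the real `Script` and `BreakClass` declarations, and the element count is arbitrary. On a 64-bit platform `int` is 8 bytes, which is where the 8-to-1 ratio comes from.

```go
package main

import (
	"fmt"
	"math"
	"unsafe"
)

// Hypothetical stand-ins for the declarations before and after the change;
// the real types live in the uax24 and uax14 packages.
type scriptWide int
type scriptNarrow uint8

// wideSize and narrowSize return the per-value size in bytes.
func wideSize() uintptr   { return unsafe.Sizeof(scriptWide(0)) }
func narrowSize() uintptr { return unsafe.Sizeof(scriptNarrow(0)) }

// tableBytes returns the backing-array size of an n-element slice whose
// elements are elemSize bytes each.
func tableBytes(n int, elemSize uintptr) uintptr {
	return uintptr(n) * elemSize
}

func main() {
	const n = 1000 // arbitrary element count, for illustration only

	fmt.Printf("int-backed:   %d bytes/value, %d bytes per %d values\n",
		wideSize(), tableBytes(n, wideSize()), n)
	fmt.Printf("uint8-backed: %d bytes/value, %d bytes per %d values\n",
		narrowSize(), tableBytes(n, narrowSize()), n)

	// The largest values quoted in the notes all fit below 255.
	fmt.Println("240 fits:", 240 <= math.MaxUint8)
	fmt.Println("176 fits:", 176 <= math.MaxUint8)
	fmt.Println("66 fits: ", 66 <= math.MaxUint8)
}
```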
ASCII Fast Paths
UTS #15 (Normalization):
- Added `isASCII()` check to NFC, NFD, NFKC, NFKD functions
- ASCII text is already normalized in all forms
- Avoids expensive decomposition/composition operations
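The fast path described above can be sketched as follows. This is an illustration of the technique, not the library's code: `withASCIIFastPath` is a hypothetical wrapper, and the "slow" normalizer passed to it is a placeholder for the real decomposition and composition work.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte of s is below utf8.RuneSelf (0x80).
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// withASCIIFastPath wraps a full normalizer so that pure-ASCII input is
// returned unchanged. This is safe for NFC, NFD, NFKC and NFKD alike,
// because ASCII characters have no decompositions and combining class 0.
func withASCIIFastPath(slow func(string) string) func(string) string {
	return func(s string) string {
		if isASCII(s) {
			return s
		}
		return slow(s)
	}
}

func main() {
	slowCalls := 0
	// Placeholder for the expensive path; it only counts invocations.
	nfc := withASCIIFastPath(func(s string) string {
		slowCalls++
		return s
	})

	nfc("hello, world")
	nfc("caf\u00e9")
	fmt.Println("slow path taken:", slowCalls, "time(s)")
}
```

The byte loop costs one comparison per input byte and allocates nothing, which is consistent with the single-digit ns/op figures reported for short ASCII strings.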
UTS #39 (Security):
- ASCII fast paths in `IsMixedScript()` - ASCII is single-script (Latin)
- ASCII fast paths in `IsSafeIdentifier()` - ASCII identifiers only need validation
- Skips expensive script analysis for common identifiers
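To show what kind of work the ASCII shortcut skips, here is a simplified mixed-script check. It is a sketch under loud assumptions: `isMixedScript` and `scriptOf` are hypothetical names, the script data comes from Go's standard `unicode` package (which may track an older Unicode version than this library), and it does not implement the augmented script sets of UTS #39, so Japanese text mixing Han and kana would be reported as mixed here.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isASCII reports whether s contains only bytes below 0x80.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// scriptOf returns the name of the script table containing r, or "" if no
// table matches. Each code point belongs to at most one script, so the
// random map iteration order does not affect the result.
func scriptOf(r rune) string {
	for name, table := range unicode.Scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return ""
}

// isMixedScript reports whether s uses more than one script, ignoring
// Common and Inherited characters. Simplified relative to UTS #39.
func isMixedScript(s string) bool {
	if isASCII(s) {
		return false // fast path: ASCII letters are all Latin
	}
	seen := ""
	for _, r := range s {
		sc := scriptOf(r)
		if sc == "" || sc == "Common" || sc == "Inherited" {
			continue
		}
		if seen == "" {
			seen = sc
		} else if sc != seen {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isMixedScript("paypal"))       // pure ASCII
	fmt.Println(isMixedScript("p\u0430ypal")) // U+0430 is CYRILLIC SMALL LETTER A
}
```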
🌍 Real-World Impact
Typical web application (mostly ASCII identifiers):
- Variable name validation: 34x faster
- URL normalization: 129x faster
- Username security checks: 3.7x faster
International text (mixed Unicode):
- Skeleton computation: 2.5x faster
- Text normalization: 3-8% faster
- Confusable detection: 1.7x faster
✅ Conformance Maintained
100% conformance maintained on all official Unicode test suites:
- UTS #15: 20,034/20,034 normalization tests passing
- UAX #24: 159,866/159,866 script property tests passing
- UTS #39: 6,565/6,565 confusable mappings verified
- Total: All 207,333 tests passing
🎯 Key Benefits
✅ ASCII normalization: 129-144x faster (essentially free)
✅ ASCII security checks: 34x faster
✅ Skeleton algorithm: 2.5x faster for all text
✅ Confusable detection: 1.7x faster for all text
✅ Memory footprint: ~15-20 KB saved, 50-87.5% reduction in type sizes
✅ Conformance: 100% maintained
📝 Design Philosophy
The optimizations excel at what matters most:
- Common case (ASCII) is blazingly fast (100x+ speedups)
- Full Unicode support still provides solid improvements (1.7-2.5x)
- 100% correctness maintained everywhere
🔨 Breaking Changes
None. All changes are backwards compatible.
📦 Installation
go get github.com/SCKelemen/unicode/uts15@v6.0.0
go get github.com/SCKelemen/unicode/uax24@v6.0.0
go get github.com/SCKelemen/unicode/uts39@v6.0.0
🙏 Benchmarks
All benchmarks run on Apple M4 Pro. See the README for detailed benchmark results and methodology.
v5.0.0 - Rule-Based Line Breaking Architecture 🐻
🎯 Major Achievement: 100% UAX #14 Conformance
This release extends the rule-based state machine architecture from UAX #29 (v4.0.0) to UAX #14 (Line Breaking Algorithm), achieving 100% conformance on all 19,338 official Unicode tests.
✨ What's New
Rule-Based Line Breaking Implementation
UAX #14 now uses a clean, rule-based architecture that directly maps to the Unicode Standard specification:
- LineBreakContext abstraction: Clean navigation API with helper methods
  - `SkipBackward`/`SkipForward`: Skip over combining marks (LB9 rule)
  - `FindForward`/`FindBackward`: Search for target classes
  - `MatchSequence`: Pattern matching for rule sequences
- 59 Named rule functions: Each Unicode rule (LB4, LB5, LB8, LB21, etc.) becomes a named, testable function
- Declarative rule chains: First-match-wins strategy with clear precedence
- Pair table fallback: Common cases handled by efficient 2,064-entry lookup table
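A first-match-wins rule chain with a pair-table fallback could look roughly like the sketch below. All names here (`rule`, `chain`, `pairTable`, `decide`) are hypothetical, the rules see only the two adjacent classes instead of a full context object, and only three rules plus one table entry are shown. The rule semantics follow UAX #14: LB4 (`BK !`), LB7 (`× SP`), LB18 (`SP ÷`), LB28 (`AL × AL`) and LB31 (break everywhere else).

```go
package main

import "fmt"

type class uint8

const (
	clAL class = iota // ordinary alphabetic
	clSP              // space
	clBK              // mandatory break
)

type decision uint8

const (
	noMatch    decision = iota // rule does not apply; try the next one
	allowed                    // break opportunity (÷)
	prohibited                 // no break (×)
	mandatory                  // forced break (!)
)

// rule is a named function so each spec rule can be tested in isolation.
type rule struct {
	name  string
	apply func(before, after class) decision
}

// chain is evaluated in order; the first rule that matches decides.
var chain = []rule{
	{"LB4", func(before, _ class) decision {
		if before == clBK {
			return mandatory
		}
		return noMatch
	}},
	{"LB7", func(_, after class) decision {
		if after == clSP {
			return prohibited
		}
		return noMatch
	}},
	{"LB18", func(before, _ class) decision {
		if before == clSP {
			return allowed
		}
		return noMatch
	}},
}

// pairTable covers simple class pairs that need no context (here: LB28).
var pairTable = map[[2]class]decision{
	{clAL, clAL}: prohibited,
}

// decide returns the break decision between two classes and its source.
func decide(before, after class) (decision, string) {
	for _, r := range chain {
		if d := r.apply(before, after); d != noMatch {
			return d, r.name
		}
	}
	if d, ok := pairTable[[2]class{before, after}]; ok {
		return d, "pair table"
	}
	return allowed, "LB31" // default: break everywhere else
}

func main() {
	d, by := decide(clBK, clSP)
	fmt.Println(d == mandatory, by) // LB4 wins over LB7 because it comes first
}
```

Returning the rule name alongside the decision is one way to get the kind of traceability these notes describe.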
100% Conformance Fixes
Achieved perfect conformance by fixing these edge cases:
- French guillemet separators (`»word«` pattern)
  - Pattern: « SP ÷ AL when part of emphasis, not quotation
  - U+00AB/U+00BB require special break handling
- German quotes (`„..."` and `‚...'` patterns)
  - ClassQU_Pi acts as closing quote (not opening)
  - U+201E/U+201A (ClassOP) open, U+201C/U+2018 (ClassQU_Pi) close
- Hebrew MAQAF (U+05BE hyphen)
  - HL × HH ÷ HL pattern for Hebrew hyphen
  - New `ruleLB21_HH_Break` handles (HL | AL) × HH ÷ HL
- Regional indicators with combining marks
  - RI × CM × RI sequences
  - `ruleLB30a` now skips CM/ZWJ when counting RIs
- Extended pictographic × emoji modifier
  - Reserved emoji ranges (U+1F000-U+1FFFD)
  - `ruleLB30b` checks `isExtendedPictographic` for any base class
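The regional-indicator fix is the easiest of these to illustrate. LB30a keeps RIs in pairs, so a break before an RI is prohibited when an odd number of RIs precede it. The sketch below counts backwards and skips CM and ZWJ, which attach to the preceding character under LB9. The function names and the reduced class set are hypothetical and not taken from `linebreak_rules.go`.

```go
package main

import "fmt"

type class uint8

const (
	clOther class = iota
	clRI          // regional indicator
	clCM          // combining mark
	clZWJ         // zero width joiner
)

// countRIBefore counts the run of regional indicators ending just before
// position i, looking through any CM or ZWJ attached to them.
func countRIBefore(cls []class, i int) int {
	n := 0
	for j := i - 1; j >= 0; j-- {
		switch cls[j] {
		case clRI:
			n++
		case clCM, clZWJ:
			continue // part of the preceding character (LB9)
		default:
			return n
		}
	}
	return n
}

// noBreakBetweenRI reports whether LB30a prohibits a break before cls[i]:
// cls[i] must be an RI and an odd number of RIs must precede it.
func noBreakBetweenRI(cls []class, i int) bool {
	if i <= 0 || i >= len(cls) || cls[i] != clRI {
		return false
	}
	return countRIBefore(cls, i)%2 == 1
}

func main() {
	seq := []class{clRI, clCM, clRI, clRI}
	for i := 1; i < len(seq); i++ {
		fmt.Printf("before index %d: no-break=%v\n", i, noBreakBetweenRI(seq, i))
	}
}
```

Without the skip, the combining mark in `RI CM RI` would reset the count to zero and the flag pair would be split.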
📊 Test Results
Total tests: 19,338
Passed: 19,338 (100.0%)
Failed: 0 (0.0%)
🏗️ Architecture Benefits
Before (Original Implementation)
- 1,112-line monolithic function
- Complex inline conditionals
- Difficult to debug and extend
After (Rule-Based Implementation)
- Isolated, independently testable rule functions
- Direct spec mapping (ruleLB4, ruleLB21, etc.)
- Clear documentation with spec links
- Easy to add new rules without refactoring
- No massive conditional chains
⚡ Performance Impact
The rule-based implementation is 2-3x slower due to abstraction overhead:
| Text Length | Original | Rule-Based | Change |
|---|---|---|---|
| Short (10 chars) | 494 ns/op | 1,360 ns/op | 2.75x slower |
| Medium (64 chars) | 3,934 ns/op | 9,374 ns/op | 2.38x slower |
| Long (45 chars) | 2,138 ns/op | 5,209 ns/op | 2.44x slower |
Trade-off: Performance remains excellent for text layout (thousands of characters per millisecond). The maintainability benefits far outweigh the performance cost for this use case.
🐻 License Update
Updated to BearWare 1.0 - MIT License with bear emojis:
- Less corporate feel
- Easy to detect in the wild
- Shows we're weekend warriors, not a corporation
📦 New Files
- `uax14/context.go` - LineBreakContext abstraction (354 lines)
- `uax14/linebreak_rules.go` - Rule-based implementation (1,786 lines, 59 rule functions)
- `uax14/linebreak_rules_test.go` - Test suite with conformance tests
- `uax14/LINEBREAK_RULES.md` - Comprehensive rule documentation
- `LICENSE` - BearWare 1.0 license with bear emoji ASCII art
🔧 Breaking Changes
None - the original implementation remains available as FindLineBreakOpportunities. The new rule-based implementation is exposed via FindLineBreakOpportunitiesWithRules for testing and comparison.
🎓 What This Means
This architecture provides:
- Direct spec mapping: Rule functions named after Unicode spec rules
- Independent testing: Each rule can be tested and traced independently
- Clear debugging: Rule execution can be logged to understand break decisions
- Easy updates: New Unicode versions can add rules without refactoring
- Reduced complexity: No massive conditional chains or inline state tracking
This matches the successful pattern from UAX #29 v4.0.0, providing consistency across the codebase.
🙏 Acknowledgments
This release demonstrates rigorous engineering while maintaining a personal, accessible approach. Made with care by weekend warriors. 🐻
Full Changelog: v4.0.0...v5.0.0
v4.0.0: Rule-Based State Machine Architecture
This release focuses on code quality and maintainability through rule-based state machine architecture for all break detection algorithms.
New Features
- BreakContext abstractions: `GraphemeBreakContext`, `WordBreakContext`, `SentenceBreakContext` provide clean navigation APIs
- Named rule functions: Each Unicode rule (GB3, WB5, SB8, etc.) becomes a named function with clear semantics
- Declarative rule chains: Rules checked in order with first-match-wins strategy
- Maintained hierarchical optimization: Words checked only at grapheme boundaries, sentences only at word boundaries
Code Organization
New files implementing the rule-based architecture:
- `context.go` - Break context abstractions with navigation methods (661 lines)
- `grapheme_rules.go` - Grapheme breaking rules (ruleGB3 through ruleGB12_13, 308 lines)
- `word_rules.go` - Word breaking rules (ruleWB3 through ruleWB15_16, 376 lines)
- `sentence_rules.go` - Sentence breaking rules (ruleSB3 through ruleSB11, 244 lines)
- `single_pass.go` - Cleaned up to use rule-based implementations (96 lines vs 574 lines)
Performance (Apple M4 Pro)
Rule-based grapheme breaking alone:
| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Speedup |
|---|---|---|---|
| Short (33 chars) | 1,882 ns/op | 1,183 ns/op | 1.59x |
| Medium (86 chars) | 8,759 ns/op | 3,041 ns/op | 2.88x |
| Long (467 chars) | 168,060 ns/op | 15,170 ns/op | 11.08x |
Single-Pass API:
| Text Length | v3.0.0 Inline | v4.0.0 Rule-Based | Change |
|---|---|---|---|
| Short (33 chars) | 2,197 ns/op | 2,717 ns/op | 1.24x slower |
| Medium (86 chars) | 9,636 ns/op | 6,647 ns/op | 1.45x faster |
| Long (467 chars) | 188,982 ns/op | 32,200 ns/op | 5.87x faster |
Single-Pass vs Three Separate Passes (v4.0.0):
| Text Length | Single Pass | Three Separate | Speedup |
|---|---|---|---|
| Short (33 chars) | 2,717 ns/op | 3,380 ns/op | 1.24x |
| Medium (86 chars) | 6,647 ns/op | 14,312 ns/op | 2.15x |
| Long (467 chars) | 32,200 ns/op | 239,624 ns/op | 7.44x |
Key findings:
- Rule-based grapheme breaking provides 1.6-11x speedup over inline implementation
- Performance improvements increase dramatically with text length
- Single-pass API maintains significant advantage over three separate calls
- Medium and long texts benefit most from rule-based architecture
Benefits
- Readability: Rules directly match Unicode Standard specification
- Maintainability: Easy to understand, modify, and extend
- Debuggability: Each rule can be tested and traced independently
Conformance
100% conformance maintained on all official Unicode test suites:
- Grapheme: 766/766 tests passing
- Word: 1,944/1,944 tests passing
- Sentence: 512/512 tests passing
Installation
go get github.com/SCKelemen/unicode/uax29@v4.0.0
v3.0.0: Hierarchical Break Detection
Performance Improvements
Version 3.0.0 implements hierarchical optimization for the single-pass FindAllBreaks() API introduced in v2.0.0.
Hierarchical Break Detection
Leverages the natural subset relationships between break types:
- Words ⊆ Graphemes: Word breaks only checked at grapheme cluster boundaries
- Sentences ⊆ Words: Sentence breaks only checked at word boundaries
This eliminates redundant checks and significantly improves performance.
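The pruning idea can be sketched independently of the actual break rules. In the example below the three predicates are toy stand-ins, and `hierarchicalBreaks` is a hypothetical name, not the signature of `FindAllBreaks()`. The word predicate runs only where a grapheme boundary was found, and the sentence predicate only where a word boundary was found.

```go
package main

import "fmt"

// hierarchicalBreaks scans the interior positions 1..n-1 of a text of
// length n. Because every word boundary is a grapheme boundary and every
// sentence boundary is a word boundary, the more expensive checks are
// skipped as soon as a cheaper one fails.
func hierarchicalBreaks(n int, isGrapheme, isWord, isSentence func(pos int) bool) (graphemes, words, sentences []int) {
	for pos := 1; pos < n; pos++ {
		if !isGrapheme(pos) {
			continue // cannot be a word or sentence boundary either
		}
		graphemes = append(graphemes, pos)
		if !isWord(pos) {
			continue // cannot be a sentence boundary
		}
		words = append(words, pos)
		if isSentence(pos) {
			sentences = append(sentences, pos)
		}
	}
	return graphemes, words, sentences
}

func main() {
	wordChecks, sentenceChecks := 0, 0

	// Toy predicates standing in for real rule evaluation.
	isGrapheme := func(pos int) bool { return pos%2 == 0 }
	isWord := func(pos int) bool { wordChecks++; return pos%4 == 0 }
	isSentence := func(pos int) bool { sentenceChecks++; return pos == 8 }

	g, w, s := hierarchicalBreaks(10, isGrapheme, isWord, isSentence)
	fmt.Println("graphemes:", g, "words:", w, "sentences:", s)
	fmt.Println("word checks:", wordChecks, "sentence checks:", sentenceChecks)
}
```

With these toy predicates there are 9 candidate positions, but the word predicate runs 4 times and the sentence predicate twice.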
Benchmark Results
Performance on Apple M4 Pro comparing v3.0.0 single-pass vs three separate function calls:
| Text Length | v2.0.0 Three Passes | v3.0.0 Single Pass | Speedup |
|---|---|---|---|
| Short (33 chars) | 3,457 ns/op | 2,197 ns/op | 1.57x |
| Medium (86 chars) | 16,191 ns/op | 9,636 ns/op | 1.68x |
| Long (467 chars) | 423,491 ns/op | 188,982 ns/op | 2.24x |
Key benefits:
- Speedup increases with text length (hierarchical pruning more effective on longer text)
- Single UTF-8 decode and classification pass
- Pre-classified data reused across all three break types
- No additional memory allocations compared to v2.0.0
Conformance
Maintains 100% conformance on all official Unicode 17.0.0 test suites:
- Grapheme: 766/766 tests passing
- Word: 1,944/1,944 tests passing
- Sentence: 512/512 tests passing
Breaking Changes
None - all existing APIs remain backward compatible.
v2.0.0: Table-Driven O(log n) Architecture
Performance Improvements
Version 2.0.0 focuses on performance optimization while maintaining 100% conformance with Unicode standards.
Table-Driven Binary Search
All packages now use table-driven O(log n) binary search for character classification, replacing sequential O(n) checks:
- UAX #9: Bidi class lookup optimized with 3,060 precomputed ranges from `DerivedBidiClass.txt`
- UAX #29: Unified packed data structure with 4,673 ranges encoding all three break types (grapheme, word, sentence) in 16-bit format
Performance: Character classification now runs at ~60-100 ns/op with 0 allocations on Apple M4 Pro.
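A table-driven lookup of this kind can be sketched with `sort.Search` over a sorted slice of ranges. The `classRange` type, the `lookup` function and the four-entry table are hypothetical and hand-written for illustration. The generated tables in this library are far larger, and their layout, including the 16-bit packing used for UAX #29, is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
)

// classRange maps an inclusive code point range to a Bidi class name.
type classRange struct {
	lo, hi rune
	class  string
}

// bidiRanges must be sorted by lo and must not overlap.
var bidiRanges = []classRange{
	{0x0041, 0x005A, "L"},  // LATIN CAPITAL LETTER A..Z
	{0x0061, 0x007A, "L"},  // LATIN SMALL LETTER A..Z
	{0x05D0, 0x05EA, "R"},  // HEBREW LETTER ALEF..TAV
	{0x0660, 0x0669, "AN"}, // ARABIC-INDIC DIGIT ZERO..NINE
}

// lookup finds the class of r in O(log n) comparisons and without
// allocating. The boolean is false when r falls in a gap between ranges.
func lookup(r rune) (string, bool) {
	// First range whose upper bound is not below r.
	i := sort.Search(len(bidiRanges), func(i int) bool {
		return bidiRanges[i].hi >= r
	})
	if i < len(bidiRanges) && bidiRanges[i].lo <= r {
		return bidiRanges[i].class, true
	}
	return "", false
}

func main() {
	for _, r := range []rune{'A', 'z', '[', 0x05D0, 0x0665} {
		c, ok := lookup(r)
		fmt.Printf("U+%04X -> %q (found=%v)\n", r, c, ok)
	}
}
```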
Generated Unicode Data
All Unicode property data is now generated directly from official Unicode 17.0.0 data files:
- Download from unicode.org during build
- Parse property files (`DerivedBidiClass.txt`, `GraphemeBreakProperty.txt`, etc.)
- Generate optimized Go code with binary search tables
- Ensures correctness and synchronization with Unicode standard
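The parsing step can be illustrated with a small line parser for the common UCD property file layout (`code point or range ; value # comment`). This is a sketch only: `parseLine` is a hypothetical name, the generator in this repository may be structured quite differently, and files with more than two fields are only read up to the second field.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine parses one line of a UCD property file such as
// "0041..005A    ; L # ..." or "00AA          ; L # ...".
// ok is false for blank lines, comment-only lines and malformed input.
func parseLine(line string) (lo, hi rune, value string, ok bool) {
	// Everything after '#' is a comment.
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i]
	}
	fields := strings.Split(line, ";")
	if len(fields) < 2 {
		return 0, 0, "", false
	}
	rng := strings.TrimSpace(fields[0])
	value = strings.TrimSpace(fields[1])

	// A single code point is treated as a one-element range.
	first, last, isRange := strings.Cut(rng, "..")
	if !isRange {
		last = first
	}
	a, errA := strconv.ParseUint(first, 16, 32)
	b, errB := strconv.ParseUint(last, 16, 32)
	if errA != nil || errB != nil || a > b || value == "" {
		return 0, 0, "", false
	}
	return rune(a), rune(b), value, true
}

func main() {
	lines := []string{
		"# DerivedBidiClass-style sample input",
		"0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
		"00AA          ; L # Lo       FEMININE ORDINAL INDICATOR",
		"",
	}
	for _, l := range lines {
		if lo, hi, v, ok := parseLine(l); ok {
			fmt.Printf("{0x%04X, 0x%04X, %q},\n", lo, hi, v)
		}
	}
}
```

Emitting the parsed ranges as Go literals, as `main` does here, is the simplest form of the code-generation step. Sorting and merging adjacent ranges with equal values would follow before the table is written out.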
Single-Pass API
UAX #29 provides a new FindAllBreaks() API that computes grapheme, word, and sentence boundaries in a single traversal.
Conformance
Maintains 100% Unicode conformance on all official test suites:
- UAX #9: 513,494/513,494 tests passing
- UAX #14: 19,338/19,338 tests passing
- UAX #29: 3,222/3,222 tests passing (766+1944+512)
- UTS #51: 5,223/5,223 tests passing
v1.0.0 - Unicode 17.0.0 Implementations
🎉 First stable release of Unicode Standard Annexes implementations in Go!
📦 Packages
UAX #11: East Asian Width
- Character width classification for terminal emulators
- Context-aware width resolution for ambiguous characters
- Display width calculations for CJK text
- Unicode 17.0.0 conformance
UTS #51: Unicode Emoji
- Six emoji properties (Emoji, Emoji_Presentation, etc.)
- Terminal width calculation for emoji
- Sequence validation (keycap, tag, modifier, flag, ZWJ)
- 100% conformance (5,223/5,223 tests passing)
UAX #50: Vertical Text Layout
- Vertical orientation properties for East Asian typography
- Four orientation values (Upright, Rotated, and Transformed with upright or rotated fallback: U, R, Tu, Tr)
- Mixed-script vertical text support
- Unicode 17.0.0 conformance
UAX #9: Bidirectional Algorithm
- Bidirectional text reordering for LTR/RTL scripts
- Full isolating run sequences (BD13)
- Bracket pair handling (N0 rule)
- 100% conformance (513,494/513,494 tests passing)
UAX #14: Line Breaking Algorithm
- Line break opportunity detection
- Three hyphenation modes (none, manual, auto)
- CJK ideographic text support
- 100% conformance (19,338/19,338 tests passing)
UAX #29: Text Segmentation
- Grapheme cluster boundaries (user-perceived characters)
- Word boundaries for text selection
- Sentence boundaries for text processing
- 100% conformance (3,222/3,222 tests passing)
🏆 Achievements
- 541,277/541,277 total tests passing across all packages
- 100% conformance on all testable specifications
- Zero external dependencies - standard library only
- Unicode 17.0.0 - latest Unicode version
- Clean commit history - logical progression from first principles
📜 License
BearWare 1.0 (MIT Compatible) - 🐻🌲🐻❄️ Help the bear. 🐻❄️🌲🐻
🙏 Acknowledgments
Unicode® is a registered trademark of Unicode, Inc.
All Unicode data files are copyright © Unicode, Inc.