Thursday, May 29, 2025

ICU4X 2.0 released!

At the intersection of human and computer languages, internationalization (i18n) continues to play a pivotal role in modern software. Evolving i18n libraries means higher-quality experiences, improved performance, and support for digitally disadvantaged languages.


ICU4X is Unicode's modern, lightweight, portable, and secure i18n library. Built from the ground up, its binary size and memory footprint are 50-90% smaller than those of ICU4C. It is memory-safe and written in Rust, with interfaces for C++, JavaScript, and TypeScript; Python, Dart, and Kotlin are in the pipeline. Mozilla Firefox, Google Pixel Watch, core Android, numerous Flutter apps, and more clients are already using ICU4X.


After 6 months of iterating on beta releases and a soft launch earlier this month, the ICU4X Technical Committee is happy to announce ICU4X 2.0. This release brings a new paradigm for locale objects, a rewritten DateTime component, overhauled C++/C/JS interfaces, the latest locale data, and much more.

Date, Time, and Time Zone Formatting

ICU4X 2.0 implements the new semantic datetime skeletons specification in UTS 35. An evolution from previous datetime APIs, the ICU4X DateTime component draws on decades of experience with what developers need from datetime formatting.


With ICU4X 2.0, users pick a "field set" and fine-tune it with "options". There is a fixed number of field sets, which together represent all valid combinations of fields.


Users of ICU and JavaScript are familiar with "classical" datetime skeletons and components bags, respectively. The following table illustrates the correlation with semantic datetime skeletons:


ICU Classical Skeleton | ECMA-402 Components Bag | ICU4X 2.0 Rust Code
yMMMd | { year: "numeric", month: "short", day: "numeric" } | fieldsets::YMD::medium()
MdEjm | { month: "numeric", day: "numeric", weekday: "short", hour: "numeric", minute: "numeric" } | fieldsets::MDE::short().time_hm()
jmsV | { hour: "numeric", minute: "numeric", second: "numeric", timeZoneName: "shortGeneric" } | fieldsets::T::hms().zone(zone::GenericShort)


Semantic datetime skeletons, called "field sets with options" in ICU4X, have numerous advantages:


  1. Easier to understand and harder to make mistakes. For example, a common error in ICU skeletons is to write an incorrect skeleton string such as "YMd" or "ymd" instead of the correct "yMd".
  2. Enables new formatting options not possible with components bags or skeletons:
    • Year style: the era, such as "BCE", can be automatically inserted
    • Time precision: the minute can be hidden if it is zero
  3. Prevents nonsensical combinations of fields and options. For example, the ICU4X API prevents "month with minute", which would render December 5 at 7:10 as the misleading “December 10”.
  4. Well-suited for data slicing, allowing for minimal data overhead. For example, apps won’t carry weekday names if they are formatting with only a year/month/day or time field set.
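
To make advantages 3 and 4 concrete, here is a minimal sketch of how a closed set of field combinations can be modeled with plain Rust enums. This is not the ICU4X API: the names FieldSet, DateFields, TimePrecision, with_time, fields, and needs_weekday_names are hypothetical and exist only for illustration. Because time precision always starts at the hour, "month with minute" simply cannot be constructed, and the field set itself tells you whether weekday names are needed.

```rust
/// Overall length of the formatted output (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Length {
    Long,
    Medium,
    Short,
}

/// The only date field combinations this sketch allows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DateFields {
    YearMonthDay,
    MonthDayWeekday,
}

/// Time precision always starts at the hour, so a minute can never
/// appear without its hour.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimePrecision {
    Hour,
    Minute,
    Second,
}

/// A field set plus its options.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FieldSet {
    pub date: DateFields,
    pub length: Length,
    pub time: Option<TimePrecision>,
}

impl FieldSet {
    /// Starts from a date-only field set.
    pub fn new(date: DateFields, length: Length) -> Self {
        FieldSet { date, length, time: None }
    }

    /// Adds a time of day at the given precision.
    pub fn with_time(mut self, precision: TimePrecision) -> Self {
        self.time = Some(precision);
        self
    }

    /// Lists the fields that would be displayed.
    pub fn fields(&self) -> Vec<&'static str> {
        let mut out = match self.date {
            DateFields::YearMonthDay => vec!["year", "month", "day"],
            DateFields::MonthDayWeekday => vec!["month", "day", "weekday"],
        };
        match self.time {
            None => {}
            Some(TimePrecision::Hour) => out.push("hour"),
            Some(TimePrecision::Minute) => out.extend_from_slice(&["hour", "minute"]),
            Some(TimePrecision::Second) => {
                out.extend_from_slice(&["hour", "minute", "second"])
            }
        }
        out
    }

    /// Data slicing: only field sets that show a weekday need weekday names.
    pub fn needs_weekday_names(&self) -> bool {
        matches!(self.date, DateFields::MonthDayWeekday)
    }
}
```

In this sketch, FieldSet::new(DateFields::YearMonthDay, Length::Medium) reports that it needs no weekday names, which is the kind of information a data-slicing tool could use to leave those strings out of a build.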

Locale Preferences

ICU4X 2.0 introduces Preferences objects, a new paradigm for locale and user preference resolution in component constructors.


The new structures enable richer, type-safe management of user preferences coming from different sources, including locales and other preferences objects. String-based locales are still supported as well.


Locale Identifier String

ICU4X 2.0 Rust Code*

en-US-u-hc-h23

// en-US, with the hour cycle (-u-hc) set to h23
let mut p = Preferences::from(LanguageIdentifier {
    language: language!("en"),
    region: Some(region!("US")),
    ..Default::default()
});
p.hour_cycle = Some(HourCycle::H23);

zh-Hant-TW-u-ca-roc

// zh-Hant-TW, with the calendar (-u-ca) set to roc
let mut p = Preferences::from(LanguageIdentifier {
    language: language!("zh"),
    script: Some(script!("Hant")),
    region: Some(region!("TW")),
    ..Default::default()
});
p.calendar_algorithm = Some(CalendarAlgorithm::Roc);

ar-EG-u-nu-latn-fw-sun

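// ar-EG, with the numbering system (-u-nu) set to latn and the first day (-u-fw) set to sun
let mut p = Preferences::from(LanguageIdentifier {
    language: language!("ar"),
    region: Some(region!("EG")),
    ..Default::default()
});
p.numbering_system = Some(value!("latn").try_into().unwrap());
p.first_day = Some(FirstDay::Sun);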


* The type name "Preferences" is a placeholder for the formatter-specific preferences object, such as DecimalFormatterPreferences, a structured object containing all the pieces of a locale required for number formatting: information on the language, script, region, variant, and numbering system preference, but not irrelevant pieces like calendar system.
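
The idea of combining preferences from different sources can be sketched with a struct of optional fields, where a later source overrides only what it explicitly sets. The following is an illustration under stated assumptions, not ICU4X code: FormatterPreferences, its fields, and the extend method shown here are hypothetical stand-ins written against the Rust standard library only.

```rust
/// Hypothetical hour cycle preference (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HourCycle {
    H12,
    H23,
}

/// Hypothetical preferences bag: every field is optional, so "unset"
/// is distinguishable from any concrete value.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct FormatterPreferences {
    pub language: Option<String>,
    pub region: Option<String>,
    pub hour_cycle: Option<HourCycle>,
    pub numbering_system: Option<String>,
}

impl FormatterPreferences {
    /// Overlays `other` on `self`: fields set in `other` win, and
    /// fields left unset in `other` keep the value from `self`.
    pub fn extend(&mut self, other: FormatterPreferences) {
        if other.language.is_some() {
            self.language = other.language;
        }
        if other.region.is_some() {
            self.region = other.region;
        }
        if other.hour_cycle.is_some() {
            self.hour_cycle = other.hour_cycle;
        }
        if other.numbering_system.is_some() {
            self.numbering_system = other.numbering_system;
        }
    }
}
```

A typical flow in this sketch would start from preferences derived from a locale string, then call extend with preferences read from operating system or app settings, so that an explicit user choice such as a 24-hour clock takes priority without discarding the language and region.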

Cross Programming Language Improvements

The foreign function interface (FFI) has been overhauled with major ergonomic improvements. Key changes include:


  • Separate constructors in FFI for built-in compiled data and data from an explicit data provider, enabling better dead-code elimination for non-Rust clients.

  • C/C++

    • Namespacing: ICU4X types are exported in a namespace, so code refers to "icu4x::DateTimeFormatter" instead of "ICU4XDateTimeFormatter".

    • Smart pointers: ICU4X types are returned within std::unique_ptr instead of internally containing an allocation, allowing more flexible usage with other reference strategies.

    • Versioned ABI: structs that are #[non_exhaustive] in Rust (and methods that use them) are now versioned both in the ABI and in headers, allowing them to evolve safely in future versions.

  • JavaScript

    • Enums: enum representation changed from strings to classes. Strings can still be used in the constructor.

    • Structs: objects can now be used wherever structs (such as options bags) are required.

    • Special methods: constructors, iterators, getters, and setters are now exposed idiomatically.

    • Documentation: typedoc-generated documentation is now much more readable.

    • ICU4X is now published as an NPM package: https://www.npmjs.com/package/icu

Other Cross-Cutting Changes

Additional changes you may encounter when upgrading from 1.5 to 2.0:


  1. Many Rust types have gained separate owned and borrowed variants; for example, there are now both "Collator" and "CollatorBorrowed". The borrowed variant is slightly more efficient; it can be created statically from compiled data or derived from the owned variant.
  2. Our internal data storage type has a more efficient binary representation (see the zerovec crate). This means that postcard data generated with ICU4X 1.5 will not work with 2.0.
  3. The icu_locid and icu_locid_transform crates were re-organized into icu_locale and icu_locale_core. This means that icu_locid and icu_locid_transform will be forever at 1.5. If you currently depend directly on icu_locid or icu_locid_transform, you need to switch to icu_locale or icu_locale_core.
  4. The icu_calendar crate now focuses only on calendrical calculations, and a new crate, icu_time, contains pieces from icu_calendar and icu_timezone. The icu_timezone crate will be forever at 1.5. If you currently depend directly on icu_timezone, you need to switch to icu_time.
  5. The icu_datagen crate was split into several sub-crates. If you currently depend directly on icu_datagen, you need to switch to icu_provider_source, icu_provider_export, and/or the icu4x-datagen binary crate.
  6. Performance improvements in multiple components. For example, the normalizer got a data rearrangement that benefits non-NFD normalizations, and the collator now has an identical prefix optimization.
  7. Input types for formatters are now re-exported from the formatter crate to reduce the number of explicit Cargo.toml dependencies.
  8. All crates are updated to the latest CLDR (47) and Unicode (16) versions.
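
The owned and borrowed split in item 1 follows a general Rust pattern that can be sketched without ICU4X itself. The example below is an illustration only: Sorter, SorterBorrowed, COMPILED_ALPHABET, from_runtime_data, and the toy comparison logic are all hypothetical and are not the Collator API. It shows the two properties the list describes: the borrowed variant can be created statically from compiled data, and it can also be derived from the owned variant.

```rust
use std::cmp::Ordering;

/// Stand-in for data compiled into the binary (hypothetical).
const COMPILED_ALPHABET: &str = "abc";

/// Borrowed variant: a cheap, copyable view over data that lives elsewhere.
#[derive(Debug, Clone, Copy)]
pub struct SorterBorrowed<'a> {
    alphabet: &'a str,
}

/// Owned variant: holds data that was loaded at runtime.
#[derive(Debug, Clone)]
pub struct Sorter {
    alphabet: String,
}

impl SorterBorrowed<'static> {
    /// A `const fn`, so it can initialize `static` items from compiled data.
    pub const fn new() -> Self {
        SorterBorrowed { alphabet: COMPILED_ALPHABET }
    }
}

impl Sorter {
    /// Builds the owned variant from data obtained at runtime.
    pub fn from_runtime_data(alphabet: String) -> Self {
        Sorter { alphabet }
    }

    /// Derives the borrowed variant from the owned one without copying.
    pub fn as_borrowed(&self) -> SorterBorrowed<'_> {
        SorterBorrowed { alphabet: &self.alphabet }
    }
}

impl<'a> SorterBorrowed<'a> {
    /// Orders two characters by their position in the alphabet;
    /// characters missing from the alphabet sort first.
    pub fn compare(&self, a: char, b: char) -> Ordering {
        self.alphabet.find(a).cmp(&self.alphabet.find(b))
    }
}

/// Created statically, with no allocation and no runtime loading.
pub static DEFAULT_SORTER: SorterBorrowed<'static> = SorterBorrowed::new();
```

In this sketch all comparison logic lives on the borrowed type, and the owned type exists only to keep runtime-loaded data alive, which is one common way to arrange such a pair.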

Get started with ICU4X 2.0

ICU4X's new website, icu4x.unicode.org, now hosts tutorials, documentation, and more. The website reflects the current release, with previous releases also available.


Check out our quickstart tutorial, interactive demo, or C++, TypeScript, and (experimental) Dart documentation.


As before, the Rust crate is available at crates.io, with documentation at docs.rs.


Please post any questions via GitHub Discussions.

----------------------------------------------

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

Tuesday, May 20, 2025

Unicode 17.0 Beta Review Open


The beta review period for Unicode® 17.0 has started and is open until July 1, 2025.


The beta is intended primarily for review of character property data and changes to algorithm specifications (Unicode Standard Annexes and certain Unicode Technical Standards that are synchronized with the Unicode Standard). Also, a complete draft of the core specification text is available for review during the beta period.


At this phase of a release, the character repertoire is considered stable. No new characters will be added. Characters could still be removed, and character names or code points could be changed, but such changes would require strong justification.

For this release, 4,847 new characters have been added, bringing the total number of encoded characters in Unicode 17.0 to 159,845. The largest set of added characters is in the new CJK Unified Ideographs Extension J block, with 4,298 new CJK unified ideographs, which increases the number of CJK unified ideographs to over 100,000. The new additions also include characters for the following five new scripts:


  • Beria Erfe is a modern-use script used in central Africa.

  • Chisoi is a modern-use script used in northeast India.

  • Tolong Siki is a modern-use script used in northeast India.

  • Tai Yo is the traditional script of Tai Yo communities in northern Vietnam.

  • Sidetic is an historic script used in ancient Anatolia.


In addition to new CJK unified ideographs, nearly 2,500 already-encoded CJK ideographs were horizontally extended, adding source references and glyphs reflecting use of those ideographs in China and Korea.


Another notable character addition is the SAUDI RIYAL SIGN, recently created by the Saudi Central Bank for its riyal currency.


See The Pipeline and the delta code charts for details on all of the new characters.


In addition to new characters, there are some significant character property and algorithm changes, including the following:



Also note that locations of data files for synchronized UTSes have been changed. See the Unicode 17.0 Beta landing page for other noteworthy property and algorithm changes. For full details regarding the Beta, see Public Review Issue #526. Feedback should be reported under PRI #526 using the Unicode Contact Form by July 1, 2025.




Tuesday, May 6, 2025

Unicode Technology Workshop 2025 — Call for Submissions Now Open!

 📣 Call for Submissions Now Open!

Unicode is pleased to announce that session proposals for UTW 2025 are now being accepted!

We are seeking proposals for workshops, seminars, case studies, and tutorials that center around:

  • Unicode i18n libraries
  • Locale data frameworks
  • Globalization tooling
  • Localization pipelines
  • Input methods
  • Character encoding
  • Text rendering

…and more!

Tutorial topics might include font design and Unicode properties, an introduction to software internationalization (i18n), and how best to support bidirectional text.


Come connect with other Unicode users, share your knowledge and experience, and help us envision the future of Unicode technology. You will come away with deeper knowledge on how to solve tough problems in the i18n and l10n space and how to engineer products that work better for global users. Program and product managers who work with engineering teams are also strongly encouraged to join and propose sessions.


The deadline for submissions is 5:00 PM PT on June 30, 2025. Proposals will be reviewed in July, and session hosts will be notified in late July.

‼️Note: To encourage maximum collaboration amongst the attendees, this is an in-person-only event.

🗓️ Mark Your Calendars for Key Dates!

By May 16 - Early Bird Registration for Tutorials and UTW 2025 Opens

June 30 - Call for Submissions Closes - All Proposals, including Tutorials, Due

July 21 - Program Committee Notifications Go Out

August 11 - Early Bird Registration for Tutorials and UTW 2025 Closes

August 12 - Regular Registration for Tutorials and UTW 2025 Opens


See you there!


🫶 Sponsorship Opportunities

Sponsorship opportunities are available at various levels. Sponsorship benefits include complimentary registrations, opportunities to lead a session or workshop, recognition on the event website, program and event materials, visibility on social media, and much more. Specific offerings vary by sponsorship level.


If you want to demonstrate your industry leadership, enhance your brand, share your knowledge, promote your products and services, and foster community building, contact events@unicode.org today to learn more. Sponsorship discounts are available to Unicode Full and Supporting Members.

If you have any questions, please contact us at UTW2025@unicode.org


