By Robert Bastian, ICU4X Technical Committee
Across the globe, people are coming online with smaller and more
varied devices including smartphones, smart watches, and gadgets. An offshoot of
the International Components for Unicode (ICU) Committee, the ICU4X Committee is
responsible for enabling these next-generation devices to communicate with their
users in thousands of languages. Written in Rust, ICU4X brings lightweight,
modular, and secure internationalization libraries to low-resource devices and
many programming languages.
Since our last release in April 2023, the ICU4X team has been busy
building additional features and improving the usability of the library. Today
we're happy to announce the 1.3 release, including built-in data, a new datagen
API, the first stable release of the case mapping component, support for more
calendar systems, a technology preview of rule-based transliteration, and more.
We have heard feedback that ICU4X's data pipeline, while allowing
powerful customization, has a significant learning curve. In ICU4X 1.3 we are
therefore introducing a new feature called "compiled data", where we ship data
generated from the latest CLDR and ICU versions in the library. This means that
every ICU4X type gains a new constructor that does not take a data provider
argument, but instead uses the compiled data. This data uses our existing
"baked data" format, which, being plain Rust code, allows the compiler to perform
optimizations and granularly exclude unnecessary data. In fact, programs that
do not use any of the new constructors will not see a binary size difference
even with the compiled_data Cargo feature enabled (it is enabled by default).
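The two constructor styles can be pictured with a small, std-only sketch. Everything in it (`Greeter`, `GreetingData`, `DataProvider`, `COMPILED_DATA`, `new`, `new_with_provider`) is a hypothetical stand-in chosen for illustration and is not taken from the ICU4X API; it only shows the general shape of "data baked in as Rust code" next to "data supplied by a provider".

```rust
/// Hypothetical data record standing in for the locale data a component needs.
#[derive(Debug, Clone, Copy, PartialEq)]
struct GreetingData {
    hello: &'static str,
}

/// Hypothetical provider trait: the caller supplies the data.
trait DataProvider {
    fn load(&self) -> GreetingData;
}

/// "Baked" data is plain Rust code: a constant the compiler can optimize,
/// and leave out of the binary if nothing references it.
const COMPILED_DATA: GreetingData = GreetingData { hello: "Hello" };

struct Greeter {
    data: GreetingData,
}

impl Greeter {
    /// Constructor without a provider argument: uses the compiled data.
    fn new() -> Self {
        Greeter { data: COMPILED_DATA }
    }

    /// Provider-based constructor for callers who bring their own data.
    fn new_with_provider(provider: &dyn DataProvider) -> Self {
        Greeter { data: provider.load() }
    }

    fn greet(&self, name: &str) -> String {
        format!("{}, {}!", self.data.hello, name)
    }
}

/// A custom provider that supplies different data at runtime.
struct FrenchProvider;

impl DataProvider for FrenchProvider {
    fn load(&self) -> GreetingData {
        GreetingData { hello: "Bonjour" }
    }
}

fn main() {
    println!("{}", Greeter::new().greet("world"));
    println!("{}", Greeter::new_with_provider(&FrenchProvider).greet("monde"));
}
```

In this sketch, a program that only ever calls `new_with_provider` never references `COMPILED_DATA`, which is the property that lets unused baked data be excluded.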
In addition to adding compiled data, we have also revamped our data
generation API, icu_datagen. The new API is more ergonomic, allows for more
flexible data generation, such as choosing which segmentation models to include,
and also better optimizes the size of the generated data. For example, with the
new "fallback mode" flag, data can be generated under the assumption that locale
fallback is going to be used at runtime. This way, data for e.g. en-CA does not
have to be included if it matches the data for en, because at runtime en will be
tried if en-CA doesn't exist. This mode of data deduplication is already used for
compiled data, which comes with built-in fallback.
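The deduplication idea can be sketched in a few lines of std-only Rust. This is a simplified model, not the icu_datagen implementation: the `parent`, `lookup`, and `deduplicate` helpers are hypothetical, the fallback chain is assumed to be "drop the last subtag", and the data values are placeholders.

```rust
use std::collections::BTreeMap;

/// Returns the parent locale by dropping the last subtag ("en-CA" -> "en").
/// Simplifying assumption: real locale fallback is more involved than this.
fn parent(locale: &str) -> Option<&str> {
    locale.rfind('-').map(|i| &locale[..i])
}

/// Looks up data for `locale`, walking up the fallback chain until a hit.
fn lookup<'a>(data: &'a BTreeMap<String, String>, locale: &str) -> Option<&'a str> {
    let mut current = Some(locale);
    while let Some(loc) = current {
        if let Some(value) = data.get(loc) {
            return Some(value.as_str());
        }
        current = parent(loc);
    }
    None
}

/// Drops every entry whose value equals what runtime fallback would find anyway.
fn deduplicate(full: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    let mut slim = BTreeMap::new();
    // BTreeMap iterates in sorted order, so "en" is visited before "en-CA".
    for (locale, value) in full {
        let inherited = parent(locale).and_then(|p| lookup(&slim, p));
        if inherited != Some(value.as_str()) {
            slim.insert(locale.clone(), value.clone());
        }
    }
    slim
}

fn main() {
    let mut full = BTreeMap::new();
    // Placeholder values; real locale data would go here.
    for (locale, value) in [("en", "data-A"), ("en-CA", "data-A"), ("en-GB", "data-B")] {
        full.insert(locale.to_string(), value.to_string());
    }
    let slim = deduplicate(&full);
    // "en-CA" is gone from the slim map but still resolves through "en".
    println!("{:?}", slim.keys().collect::<Vec<_>>());
    println!("{:?}", lookup(&slim, "en-CA"));
}
```

The trade-off is that the slimmed-down data is only correct when lookups actually perform fallback, which is why the flag is described as an assumption about runtime behavior.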
ICU4X 1.3 also stabilizes a new component: casemapping. Many
scripts are bicameral, meaning they have an upper and lower case. Casemapping
allows for converting between upper, lower, and title case, and the related
casefolding operation allows for performing case-insensitive string matching.
These operations can be rather nuanced and locale-dependent: for example, the
letter “i” capitalizes to “İ” in Turkish, and modern Greek removes accents and
adds diæreses when uppercasing.
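To make the locale dependence concrete, here is a minimal std-only sketch of the Turkish case. It does not use the ICU4X casemapping component; `Lang`, `uppercase`, and `eq_ignore_case` are hypothetical helpers, and the folding shown is a simplification (lowercasing both sides) rather than full Unicode case folding.

```rust
/// Hypothetical, minimal language tag for this sketch.
#[derive(Clone, Copy, PartialEq)]
enum Lang {
    Root,
    Turkish,
}

/// Uppercases `input`, special-casing the Turkish dotted i.
fn uppercase(input: &str, lang: Lang) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if lang == Lang::Turkish && ch == 'i' {
            // Turkish keeps the dot: i (U+0069) -> İ (U+0130).
            out.push('\u{0130}');
        } else {
            // `char::to_uppercase` is locale-independent and may yield
            // more than one char (for example, ß -> SS).
            out.extend(ch.to_uppercase());
        }
    }
    out
}

/// Case-insensitive comparison via a simplified fold (lowercase both sides).
/// Assumption: real case folding relies on dedicated Unicode folding data.
fn eq_ignore_case(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    println!("{}", uppercase("istanbul", Lang::Root)); // ISTANBUL
    println!("{}", uppercase("istanbul", Lang::Turkish)); // İSTANBUL
    println!("{}", eq_ignore_case("Unicode", "UNICODE")); // true
}
```

A single hard-coded rule like this does not scale, which is the motivation for a data-driven component that handles such cases per locale.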
This release also completes the set of calendars to include all
CLDR calendars. In addition to the Gregorian, Thai Solar Buddhist, Coptic,
Ethiopian, Indian National (Śaka), and Japanese calendars that have been
supported since 1.0, ICU4X now also supports the Chinese, Korean (Dangi),
Hebrew, Persian (Solar Hijri), R.O.C., and four variants of the Islamic calendar
(civil, observational, tabular, and Umm al-Qura). This support includes
formatting, though formatting for Chinese and Korean is currently in a preview
state.
We're also launching a transliteration API as a technical preview.
Transliteration is the conversion between scripts, such as from Arabic to Latin,
preserving pronunciation as far as possible. CLDR supports many
transliterations, and this release brings these CLDR transliterations to ICU4X.
While data generation is not yet available, users can construct transliterators
at runtime to convert between any scripts supported by CLDR.
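As a rough illustration of what "rule-based" and "constructed at runtime" mean, the following std-only sketch builds a toy transliterator from a list of (source, target) pairs. It is not the ICU4X transliteration API and does not use CLDR rule syntax; the `Transliterator` type and its methods are hypothetical, and the handful of Cyrillic-to-Latin rules is only a sample.

```rust
/// A toy rule-based transliterator built at runtime from (source, target) pairs.
/// Assumption: this is NOT CLDR rule syntax, only an illustration of the idea.
struct Transliterator {
    rules: Vec<(String, String)>,
}

impl Transliterator {
    fn new(rules: &[(&str, &str)]) -> Self {
        let mut rules: Vec<(String, String)> = rules
            .iter()
            .map(|(s, t)| (s.to_string(), t.to_string()))
            .collect();
        // Try longer source sequences first so multi-character rules win.
        rules.sort_by(|a, b| b.0.chars().count().cmp(&a.0.chars().count()));
        Transliterator { rules }
    }

    fn transliterate(&self, input: &str) -> String {
        let mut out = String::new();
        let mut rest = input;
        'outer: while let Some(ch) = rest.chars().next() {
            for (source, target) in &self.rules {
                if !source.is_empty() && rest.starts_with(source.as_str()) {
                    out.push_str(target);
                    rest = &rest[source.len()..];
                    continue 'outer;
                }
            }
            // No rule matched: copy the character through unchanged.
            out.push(ch);
            rest = &rest[ch.len_utf8()..];
        }
        out
    }
}

fn main() {
    // A few sample Cyrillic-to-Latin rules, supplied at runtime.
    let cyrillic_to_latin = Transliterator::new(&[
        ("М", "M"),
        ("о", "o"),
        ("с", "s"),
        ("к", "k"),
        ("в", "v"),
        ("а", "a"),
    ]);
    println!("{}", cyrillic_to_latin.transliterate("Москва")); // Moskva
}
```

Real transliteration rules also depend on context (neighboring characters, word boundaries), which this flat lookup table deliberately leaves out.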
Finally, ICU4X 1.3 brings a number of smaller features to other
components. The experimental display names component now supports formatting
language identifiers, in addition to language, script, and region display names;
there are performance improvements across the board; and some APIs such as
LocaleFallbacker have been moved to better locations.
Read the full ICU4X 1.3 release notes and then the ICU4X tutorial to start
using ICU4X in your project.
Support Unicode
To support Unicode’s mission to ensure everyone can communicate in
their languages across all devices, please consider adopting a character,
making a gift of stock, or making a donation. As Unicode, Inc. is a US-based
open source, open standards, non-profit, 501(c)(3) organization, your
contribution may be eligible for a tax deduction. Please consult with a tax
advisor for details.