Monday, April 17, 2023

ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices

By Shane Carr, Chair of the ICU4X Subcommittee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with each other in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our first big release in September 2022, the ICU4X team has been busy building additional features and infrastructure. Today, the team is excited to announce ICU4X 1.2, featuring the first stable release of the Segmenter component, more Unicode properties, property names, a technology preview of language and script display names, HarfBuzz bindings, CLDR 43, full compliance with the Unicode Bidirectional Algorithm (UAX #9), and many smaller features and improvements to the ICU4X components.

Text segmentation is the process of dividing strings into meaningful units, such as words, sentences, or grapheme clusters (characters). It is a fundamental task in a wide range of applications, including cursor movement, highlighting spans of text, evaluating text for spelling and grammatical correctness, information retrieval, and text layout.

ICU4X 1.2 supports two standards: Unicode Text Segmentation (UAX #29) for word, sentence, and grapheme cluster segmentation, and the Unicode Line Breaking Algorithm (UAX #14) for line segmentation.
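To make the idea of "dividing strings into meaningful units" concrete, here is a minimal sketch of a segmenter that reports breakpoints as byte offsets into the string. It is a toy, not UAX #29 and not the ICU4X API: the names `ToyClass`, `classify`, and `toy_word_breaks` are hypothetical, and the only rule is "break wherever the coarse character class changes, and around each punctuation character". It uses only the Rust standard library.

```rust
/// Coarse character classes for this toy segmenter (hypothetical, not UAX #29).
#[derive(Clone, Copy, PartialEq, Eq)]
enum ToyClass {
    Word,
    Space,
    Other,
}

/// Assigns each character to one of the toy classes.
fn classify(c: char) -> ToyClass {
    if c.is_alphanumeric() {
        ToyClass::Word
    } else if c.is_whitespace() {
        ToyClass::Space
    } else {
        ToyClass::Other
    }
}

/// Returns the byte offsets of toy word boundaries.
/// The result always starts with 0 and, for non-empty input, ends with `text.len()`.
fn toy_word_breaks(text: &str) -> Vec<usize> {
    let mut breaks = vec![0];
    let mut prev: Option<ToyClass> = None;
    for (idx, ch) in text.char_indices() {
        let class = classify(ch);
        if let Some(p) = prev {
            // Break when the class changes; punctuation always stands alone.
            if p != class || class == ToyClass::Other {
                breaks.push(idx);
            }
        }
        prev = Some(class);
    }
    if !text.is_empty() {
        breaks.push(text.len());
    }
    breaks
}
```

Under these toy rules, `toy_word_breaks("Hello World")` yields the offsets 0, 5, 6, and 11, so the segments are "Hello", " ", and "World". A real segmenter has to handle far more than this, including scripts such as Thai that do not put spaces between words, which is where rule tables, dictionaries, or models come in.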

Given ICU4X's focus on being lightweight for deployment in resource-constrained environments, the team focused on ways to reduce data size versus ICU4C. The highest-impact differences come from the use of runtime tailoring (reducing the number of rule tables) and machine learning models (eliminating the need for Southeast Asian word dictionaries). Overall, ICU4X data for segmentation is 20.1% smaller than the equivalent data in ICU4C, and 60.7% smaller for line break segmentation.

In addition to being smaller in size, ICU4X's segmenters are faster than their ICU4C equivalents: the line and word segmenters are 19.1% and 52.2% faster, respectively, in non-complex scripts, and 46.9% and 32.1% faster in Chinese.

The machine learning models in ICU4X are used for word and line breaking in Southeast Asian languages including Thai, Lao, Khmer, and Myanmar. The models use an LSTM, are trained on large datasets, and achieve high accuracy while retaining small model size. By leveraging modern computer architecture features such as SIMD, the team optimized the performance of the LSTM inference to be about 3× faster than the naive implementation. However, the dictionary model remains the fastest, about two orders of magnitude faster than the LSTM. ICU4X offers both types of models so that clients can choose between them.
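For readers unfamiliar with LSTMs, the sketch below shows a single time step of a textbook LSTM cell, written with plain scalar loops and no SIMD. It is an illustration under stated assumptions, not ICU4X's implementation: the names `Gate`, `LstmCell`, `zeros`, and `step` are hypothetical, the weights are ordinary nested vectors, and nothing here reflects the architecture or parameters of the actual ICU4X models.

```rust
/// Logistic sigmoid, used for the input, forget, and output gates.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Parameters of one gate (hypothetical layout): `w` acts on the input,
/// `u` acts on the previous hidden state, `b` is the bias.
struct Gate {
    w: Vec<Vec<f32>>, // hidden x input
    u: Vec<Vec<f32>>, // hidden x hidden
    b: Vec<f32>,      // hidden
}

impl Gate {
    /// All-zero parameters, handy for experiments.
    fn zeros(hidden: usize, input: usize) -> Self {
        Gate {
            w: vec![vec![0.0; input]; hidden],
            u: vec![vec![0.0; hidden]; hidden],
            b: vec![0.0; hidden],
        }
    }

    /// Computes w*x + u*h + b before any activation function is applied.
    fn pre_activation(&self, x: &[f32], h: &[f32]) -> Vec<f32> {
        self.b
            .iter()
            .enumerate()
            .map(|(row, bias)| {
                let wx: f32 = self.w[row].iter().zip(x).map(|(a, v)| a * v).sum();
                let uh: f32 = self.u[row].iter().zip(h).map(|(a, v)| a * v).sum();
                wx + uh + *bias
            })
            .collect()
    }
}

/// A textbook LSTM cell with four gates.
struct LstmCell {
    input: Gate,
    forget: Gate,
    candidate: Gate,
    output: Gate,
}

impl LstmCell {
    /// One time step: returns the new hidden state and the new cell state.
    fn step(&self, x: &[f32], h_prev: &[f32], c_prev: &[f32]) -> (Vec<f32>, Vec<f32>) {
        let i = self.input.pre_activation(x, h_prev);
        let f = self.forget.pre_activation(x, h_prev);
        let g = self.candidate.pre_activation(x, h_prev);
        let o = self.output.pre_activation(x, h_prev);

        let mut h = Vec::with_capacity(c_prev.len());
        let mut c = Vec::with_capacity(c_prev.len());
        for k in 0..c_prev.len() {
            // c_t = sigmoid(f) * c_{t-1} + sigmoid(i) * tanh(g)
            let c_k = sigmoid(f[k]) * c_prev[k] + sigmoid(i[k]) * g[k].tanh();
            c.push(c_k);
            // h_t = sigmoid(o) * tanh(c_t)
            h.push(sigmoid(o[k]) * c_k.tanh());
        }
        (h, c)
    }
}
```

In a segmentation setting, a model of this kind would be run across the characters or grapheme clusters of the text, and a further layer (not shown) would turn each hidden state into a break or no-break decision. The inner dot products in `pre_activation` dominate the cost of each step, which is why this kind of loop is a natural target for SIMD.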

Another focus of ICU4X 1.2 has been to support your text layout stack. A text layout engine requires more than the scope of either ICU4C or ICU4X, but any layout engine requires at least two ICU features: line break segmentation and the ability to correctly order bidirectional text. ICU4X 1.2 supports the segmentation and bidirectional text needs of Skia’s SkParagraph and HarfBuzz.

Finally, ICU4X 1.2 brings a number of smaller features to other components. The experimental Display Names component now supports language and script display names, in addition to region display names; the Properties component supports converting UCD property and value enum discriminants to their long and short names, and vice-versa; and all components have been upgraded to support CLDR 43.

Read the full ICU4X 1.2 release notes and then the ICU4X tutorial to start using ICU4X in your project.

To learn more about the latest release, be sure to attend our ICU4X Virtual Open House this Wednesday, April 19th at 9am PT.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.
