Thursday, September 29, 2022

Unicode CLDR v42 Beta — Spec Review

[beta image] The Unicode CLDR v42 Beta is now available for specification review and integration testing. The release is planned for Oct 19, but any feedback on the specification needs to be submitted well in advance of that date. The specification is available at Draft LDML Modifications. The biggest change is the new Person Names Formatting section.

The beta has already been integrated into the development version of ICU. We would especially appreciate feedback from ICU users and non-ICU consumers of CLDR data, and on Migration issues. 


Feedback can be filed at CLDR Tickets.


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.) For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.


In CLDR 42, the focus is on:


  1. Locale coverage. The following locales now have higher coverage levels:

    1. Modern: Igbo (ig), yo (Yoruba)

    2. Moderate: Chuvash (cv), Xhosa (xh)

    3. Basic: Haryanvi (bgc), Bhojpuri (bho), Rajasthani (raj), Tigrinya (ti)

  2. Formatting Person Names. Added data and structure for formatting people's names. For more information on why this feature is being added and what it does, see Background.

  3. Emoji 15.0 Support. Added short names, keywords, and sort-order for the new Unicode 15.0 emoji.

  4. Coverage, Phase 2. Added additional language names and other items to the Modern coverage level, for more consistency (and utility) across platforms.

  5. Unicode 15.0 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.


There are many other changes: to find out more, see the draft CLDR v42 release page, which has information on accessing the date, reviewing charts of the changes, and — importantly — Migration issues.


In version 42, the following levels were reached:


Level

Languages

Locales*

Notes

Modern

95

369

Suitable for full UI internationalization

Afrikaans‎, ‎… Čeština‎, ‎… Dansk‎, ‎… Eesti‎, ‎… Filipino‎, ‎… Gaeilge‎, ‎… Hrvatski‎, ‎Indonesia‎, ‎… Jawa‎, ‎Kiswahili‎, ‎Latviešu‎, ‎… Magyar‎, ‎…Nederlands‎, ‎… O‘zbek‎, ‎Polski‎, ‎… Română‎, ‎Slovenčina‎, ‎… Tiếng Việt‎, ‎… Ελληνικά‎, ‎Беларуская‎, ‎… ‎ᏣᎳᎩ‎, ‎ Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, … አማርኛ‎, ‎नेपाली‎, … ‎অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, ‎தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, ‎မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, ‎… 日本語‎, ‎…

Moderate

6

11

Suitable for full “document content” internationalization, such as formats in a spreadsheet.

Binisaya, … ‎Èdè Yorùbá, ‎Føroyskt, ‎Igbo, ‎IsiZulu, ‎Kanhgág, ‎Nheẽgatu, ‎Runasimi, ‎Sardu, ‎Shqip, ‎سنڌي, …

Basic

29

43

Suitable for locale selection, such as choice of language in mobile phone settings.

Asturianu, ‎Basa Sunda, ‎Interlingua, ‎Kabuverdianu, ‎Lea Fakatonga, ‎Rumantsch, ‎Te reo Māori, ‎Wolof, ‎Босански (Ћирилица), ‎Татар, ‎Тоҷикӣ, ‎Ўзбекча (Кирил), ‎کٲشُر, ‎कॉशुर (देवनागरी), ‎…, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, ‎粤语 (简体)‎

* Locales are variants for different countries or scripts.



Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Announcing ICU4X 1.0

ICU Logo

 


I. Introduction

Hello! Ndeewo! Molweni! Салам! Across the world, people are coming online with smartphones, smart watches, and other small, low-resource devices. The technology industry needs an internationalization solution for these environments that scales to dozens of programming languages and thousands of human languages.


Enter ICU4X. As the name suggests, ICU4X is an offshoot of the industry-standard i18n library published by the Unicode Consortium, ICU (International Components for Unicode), which is embedded in every major device and operating system.


This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution.


Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. With ICU4X, clients can inspect their compiled code to easily build small, optimized locale data packs and then load those data packs on the fly, enabling applications to scale to more languages than ever before. Even when platform i18n is available, ICU4X is suitable as a polyfill to add additional features or languages. It does this while using very little RAM and CPU, helping extend devices' battery life.


Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. More programming languages can be added by writing plugins, without needing to touch core i18n logic. ICU4X also allows data files to be updated independently of code, making it easier to roll out Unicode updates.


Secure: Rust's type system and ownership model guarantee memory-safety and thread-safety, preventing large classes of bugs and vulnerabilities.


How does ICU4X achieve these goals, and why did the team choose to write ICU4X over any number of alternatives?


II. Why ICU4X?

You may still be wondering, what led the Unicode Consortium to choose a new Rust-based library as the solution to these problems?

II.A. Why a new library?

The Unicode Consortium also publishes ICU4C and ICU4J, i18n libraries written for C/C++ and Java. Why write a new library from scratch? Wouldn’t that increase the ongoing maintenance burden? Why not focus our efforts on improving ICU4C and/or ICU4J instead?


ICU4X solves a different problem for different types of clients. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments. ICU4X is a product that has long been missing from Unicode's portfolio.


Early on, the team evaluated whether ICU4X's goals could have been achieved by refactoring ICU4C or ICU4J. We found that:


  1. ICU4C has already gone through a period of optimization for tree shaking and data size. Despite these efforts, we continue to have stakeholders saying that ICU4C is too large for their resource-constrained environment. Getting further improvements in ICU4C would amount to rewrites of much of ICU4C's code base, which would need to be done in a way that preserves backwards compatibility. This would be a large engineering effort with an uncertain final result. Furthermore, writing a new library allows us to additionally optimize for modern UTF-8-native environments.

  2. Except for JavaScript via j2cl, Java is not a suitable source language for portability to low-resource environments like wearables. Further, ICU4J has many interdependent parts that would require a large amount of effort to bring to a state where it could be a viable j2cl source.

  3. Some of our stakeholders (Firefox and Fuchsia) are drawn to Rust's memory safety. Like most complex C++ projects, ICU4C has had its share of CVEs, mostly relating to memory safety. Although C++ diagnostic tools are improving, Rust has very strong guarantees that are impossible in other software stacks.


For all these reasons, we decided that a Rust-based library was the best long-term choice.

II.B. Why use ICU4X when there is i18n in the platform?

Many of the same people who work on ICU4X also work to make i18n available in the platform (browser, mobile OS, etc.) through APIs such as the ECMAScript Intl object, android.icu, and other smartphone native libraries. ICU4X complements the platform-based solutions as the ideal polyfill:


  1. Some platform i18n features take 5 or more years to gain wide enough availability to be used in client-side applications. ICU4X can bridge the gap.

  2. ICU4X can enable clients to add more locales than those available in the platform.

  3. Some clients prefer identical behavior of their app across multiple devices. ICU4X can give them this level of consistency.

  4. Eventually, we hope that ICU4X will back platform implementations in ECMAScript and elsewhere, providing a maximal amount of consistency when ICU4X is also used as a polyfill.


II.C Why pluggable data?

One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. The ICU4X data provider supports the following use cases:


  1. Data files that are readable by both older and newer versions of the code; for more detail on how this works, see ICU4X Data Versioning Design

  2. Data files that can be swapped in and out at runtime, making it easy to upgrade Unicode, CLDR, or time zone database versions. Swapping in new data can be done at runtime without needing to restart the application or clear internal caches.

  3. Multiple data sources. For example, some data may be baked into the app, some may come from the operating system, and some may come from an HTTP service.

  4. Customizable data caches. We recognize that there is no "one size fits all" approach to caching, so we allow the client to configure their data pipeline with the appropriate type of cache.

  5. Fully configurable data fallbacks and overlays. Individual fields of ICU4X data can be selectively overridden at runtime.



III. How We Made ICU4X Lightweight

There are three factors that combine to make code lightweight: small binary size, low memory usage, and deliberate performance optimizations. For all three, we have metrics that are continuously measured on GitHub Actions continuous integration (CI).


III.A. Small Binary Size

Internationalization involves a large number of components with many interdependencies. To combat this problem, ICU4X optimizes for "tree shaking" (dead code elimination) by:


  1. Minimizing the number of dependencies of each individual component.

  2. Using static types in ways that scope functions to the pieces of data they need.

  3. Splitting functions and classes that pull in more data than they need into multiple, smaller pieces.


Developers can statically link ICU4X and run a tree-shaking tool like LLVM link-time optimization (LTO) to produce a very small amount of compiled code, and then they can run our static analysis tool to build an optimally small data file for it.


In addition to static analysis, ICU4X supports dynamic data loading out of the box. This is the ultimate solution for supporting hundreds of languages, because new locale data can be downloaded on the fly only when they are needed, similar to message bundles for UI strings.

III.B. Low Memory Usage

At its core, internationalization transforms inputs to human-readable outputs, using locale-specific data. ICU4X introduces novel strategies for runtime loading of data involving zero memory allocations:


  1. Supports Postcard-format resource files for dynamically loaded, zero-copy deserialized data across all architectures.

  2. Supports compile-time linking of required data without deserialization overhead via DataBake.

  3. Data schema is designed so that individual components can use the immutable locale data directly with minimal post-processing, greatly reducing the need for internal caches.

  4. Explicit "data provider" argument to each function that requires data, making it very clear when data is required.


ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers.


III.C. Deliberate Performance Optimizations

Reducing CPU usage improves latency and battery life, important to most clients. ICU4X achieves low CPU usage by:


  1. Writing in Rust, a high-performance language.

  2. Utilizing zero-copy deserialization.

  3. Measuring every change against performance benchmarks.


The ICU4X team uses a benchmark-driven approach to achieve highly competitive performance numbers: newly added components should have benchmarks, and future changes to those components should avoid regressing on those benchmarks.


Although we always seek to improve performance, we do so deliberately. There are often space/time tradeoffs, and the team takes a balanced approach. For example, if improving performance requires increasing or duplicating the data requirements, we tend to favor smaller data, like we've done in the normalizer and collator components. In the segmenter components, we offer two modes: a machine learning LSTM segmenter with lower data size but heavier CPU usage, and a dictionary-based segmenter with larger data size but faster. (There is ongoing work to make the LSTM segmenter require fewer CPU resources.)


IV. How We Made ICU4X Portable

The software ecosystem continually evolves with new programming languages. The "X" in ICU4X is a nod to the second main design goal: portability to many different environments.


ICU4X is Unicode's first internationalization library to have official wrappers in more than one target language. We do this with a tool we designed called Diplomat, which generates idiomatic bindings in many programming languages that encourage i18n best practices. Thanks to Diplomat, these bindings are easy to maintain, and new programming languages can be added without needing i18n expertise.


Under the covers, ICU4X is written in no_std Rust (no system dependencies) wrapped in a stable ABI that Diplomat bindings invoke across foreign function interface (FFI) or WebAssembly (WASM). We have some basic tutorials for using ICU4X from C++ and JavaScript/TypeScript.



V. What’s next?

ICU4X represents an exciting new step in bringing internationalized software to more devices, use cases, and programming languages. A Unicode working group is hard at work on expanding ICU4X’s feature set over time so that it becomes more useful and performant; we are eager to learn about new use cases and have more people contribute to the project.


Have questions?  You can contact us on the ICU4X discussion forum!


Want to try it out? See our tutorials, especially our Intro tutorial!

Interested in getting involved? See our Contribution Guide.


Want to stay posted on future ICU4X updates? Sign up for our low-traffic announcements list, icu4x-announce@unicode.org!




Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Wednesday, September 21, 2022

New Online Event – Overview of Internationalization and Unicode Projects

The Unicode Consortium is excited to invite you to our upcoming online event, “Overview of Internationalization and Unicode Projects.”

During this ~2-hour event, hear pre-recorded sessions from some of the experts working to ensure that everyone can fully communicate and collaborate in their languages across all software and services. Unicode representatives will be available for live Q&A for the last 30-40 minutes and our emcee throughout will be Elango Cheran of Google.

Topics and speakers include:
  1. An Introduction to Internationalization (i18n) - Addison Phillips, Internationalization Engineer
  2. Overview of the Unicode Consortium: History and Future - Mark Davis, Cofounder and President
  3. Scripts and Character Encoding - Deborah Anderson, Chair of the Script Ad Hoc Committee
  4. The Common Locale Data Repository (CLDR) - Mark Davis and Annemarie Apple, Chair and Vice Chair of the CLDR Committee
  5. International Components for Unicode (ICU) - Markus Scherer, Chair of ICU Committee
  6. Bringing Internationalization to More Programming Languages and Resource-Constrained Environments (ICU4X) - Shane Carr, Chair of ICU4X Subcommittee
Date Wednesday, September 28th, 2022
Time 9:30am (California)/12:30pm (New York)/16:30 (UTC)/17:30 (London)
Location
and Cost
Online, free to attend
Registration    Register here. Please freely share this link with colleagues and anyone else who may be interested. Registration will also ensure you will receive updates for future Unicode events.

The recording and a playlist will be available on YouTube later this year for anyone who is unable to attend or if attendees want to share the information with others. Depending on community interest, Unicode project leaders will also be available in November and December for virtual “Office Hours” to talk more in depth and answer specific questions.

The link to share with your networks is: https://us06web.zoom.us/webinar/register/WN_ViDf3YFyS7WiAXnHYp88kw

Thanks and hope to see many of you on the 28th!


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Tuesday, September 13, 2022

Announcing The Unicode® Standard, Version 15.0

[Nag Mundari image] Version 15.0 of the Unicode Standard is now available, including the core specification, annexes, and data files. This version adds 4,489 characters, bringing the total to 149,186 characters. These additions include two new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs. The new scripts and characters in Version 15.0 add support for modern language groups including:
  • Nag Mundari, a modern script used to write Mundari, a language spoken in India
  • A Kannada character used to write Konkani, Awadhi, and Havyaka Kannada in India
  • Kaktovik numerals, devised by speakers of Iñupiaq in Kaktovik, Alaska for the counting systems of the Inuit and Yupik languages
Among the popular symbol additions are 20 new emoji, including hair pick, maracas, jellyfish, khanda, and pink heart. For the full list of new emoji characters, see emoji additions for Unicode 15.0, and Emoji Counts. For a detailed description of support for emoji characters by the Unicode Standard, see UTS #51, Unicode Emoji.

[Image credit Noto Emoji]

Other symbol and notational additions include:
Support for other languages and scholarly work includes:
  • Kawi, a historical script found in Southeast Asia, used to write Old Javanese and other languages
  • Three additional characters for the Arabic script to support Quranic marks used in Turkey
  • Three Khojki characters found in handwritten and printed documents
  • Ten Devanagari characters used to represent auspicious signs found in inscriptions and manuscripts
  • Six Latin letters used in Malayalam transliteration
  • Sixty-three Cyrillic modifier letters used in phonetic transcription
Important chart font updates include:
  • A set of updated glyphs for Egyptian hieroglyphs, in addition to standardized variation sequences to support rotated glyphs found in texts
  • Improved glyphs for Unified Canadian Aboriginal Syllabics, which provide better support for Carrier and other languages
  • A new Wancho font, with improved and simplified shapes
Updates to the CJK blocks add:
  • 4,192 ideographs in the new CJK Unified Ideographs Extension H block
  • One ideograph in the CJK Unified Ideographs Extension C block
Unicode properties and specifications determine the behavior of text on computers and phones. The following six Unicode Standard Annexes and Technical Standards have noteworthy updates for Version 15.0:
  • UAX #9, Unicode Bidirectional Algorithm, amends the note in UAX9-C2 to emphasize the use of higher-level protocols to mitigate potential source code spoofing attacks.
  • UAX #31, Unicode Identifier and Pattern Syntax, provides more guidance on profiles for default identifiers, clarifies the use of default ignorable code points in identifiers, and discusses the relationship between Pattern_White_Space and bidirectional ordering issues in programming languages.
  • UAX #38, Unicode Han Database, adds the kAlternateTotalStrokes property. The kCihaiT property’s category was changed to Dictionary Indices, the kKangXi property was expanded, and Sections 3.0, 3.10, and 4.5 were added.
  • UTS #39, Unicode Security Mechanisms, changes the zero width joiner (ZWJ) and zero width non-joiner (ZWNJ) characters from Identifier_Status=Allowed to Identifier_Status=Restricted; they are therefore no longer allowed by the General Security Profile by default.
  • UAX #45, U-Source Ideographs, has records for new ideographs in its data file, “ExtH” was added as a new status, the status identifiers for the existing CJK Unified Ideographs blocks were improved, and Section 2.5 was added.
  • UTS #46, Unicode IDNA Compatibility Processing, clarified the edge case of the empty label in ToASCII and added documentation regarding the new IDNA derived property data files.

About the Unicode Standard

The Unicode Standard provides the basis for processing, storage and seamless data interchange of text data in any language in all modern software and information technology protocols. It provides a uniform, universal architecture and encoding for all languages of the world, with over 140,000 characters currently encoded.

Unicode is required by modern standards such as XML, Java, C#, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is a fundamental component of all modern software.

For additional information on the Unicode Standard, please visit https://home.unicode.org/.

About the Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations, many in the computer and information processing industry. For a complete member list go to https://home.unicode.org/membership/members/.
For more information, please contact the Unicode Consortium https://home.unicode.org/connect/contact-unicode/.


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]