The Unicode Blog: October 2023

Tuesday, October 31, 2023

ICU 74 Released

Unicode® ICU 74 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages. It implements the latest versions of both the Unicode Standard and the Unicode locale data (CLDR). ICU 74 updates to Unicode 15.1, and to CLDR 44 locale data with various additions and corrections.

ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements. They subsume the changes for the ICU 73.2 and CLDR 43.1 maintenance releases.

Unicode 15.1 adds source code security mechanisms, improves line breaking for Southeast Asian scripts, and adds important CJK unified ideographs.

CLDR 44 has added or improved data for a number of languages that have been newly added to ICU, and has improved measurement unit handling, conversion, and formatting.

ICU 74 implements these improvements, adds new C APIs for locale handling, adds a plug-in API for word segmentation, and switches the Java build system to Maven.

For details, please see https://icu.unicode.org/download/74.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)(3) organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Unicode CLDR v44 available

Unicode CLDR version 44 is now available and has been integrated into version 74 of ICU. In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

There are many other changes: to find out more, see the CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g. in a spreadsheet
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g. choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

We are currently planning for CLDR version 45 to be a closed release with no submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

Monday, October 30, 2023

Update from Unicode’s Community Engagement Team

By Elango Cheran, Vice Chair of the Community Engagement Team

If you have been receiving the Unicode Newsletter or following us on LinkedIn, Twitter, and the Friends of Unicode on Facebook, you have likely seen information about new Unicode events and resources.

These events and tools are the result of the coordinated efforts of Unicode’s Community Engagement (CE) Team. This team was formed in March of 2022, after Unicode and our logistics partner for the Internationalization and Unicode Conference decided to go in different directions.

A small group of volunteers and Unicode staff saw this as an opportunity to explore different types of events and see how we could reach a more global audience. Since July of last year, we have held seven online events. We deliberately kept these initial events online to ensure broader reach and access to knowledge, which is different from how Unicode has approached events in the past. The recent online events have drawn in hundreds of new people from 65+ countries. We have also published and promoted the recording of each event on the Unicode YouTube channel.

And all that is to say, there is more to do!

As a small non-profit that depends on volunteers, we started modestly and have been pushing our boundaries, experimenting with our tools, and expanding our capabilities with each event.

The upcoming Unicode Technology Workshop is a natural extension of that experimentation. While this is an in-person event in California, we hope to take the lessons learned and apply this model to additional in-person events in other parts of the globe.

I am personally thankful for the opportunity to help Unicode connect with a more global audience, given how foundational and impactful Unicode’s work is on people, languages, and communities around the world. The work of our team is made possible by my committed colleagues, some of whom are from organizations such as Google, UC Berkeley, and Spotify.

It is encouraging to see the growing interest in events, as well as in partnering with Unicode to launch such efforts. If you have ideas on types of outreach programming or educational tools that would help you or others on your internationalization journey, please reach out to us via events@unicode.org.

Friday, October 13, 2023

Unicode Technology Workshop on November 7-8 – Update on Sessions!

By the Unicode Technology Workshop Steering Committee

The Unicode Technology Workshop (UTW) is the internationalization event you want to attend this year.

Hear from internationalization experts from Adobe, Google, Meta, Square, UC Berkeley, and many more. Sessions include workshops, seminars, free-form discussions, and lightning talks centered around i18n libraries, locale data updates, globalization tooling, localization pipelines, input methods, and text rendering. Day 2 includes unconference sessions driven by attendees.

Topics for the workshops and seminars include:

Introduction to Unicode and Beyond
Intro to ICU4X Workshop
PersonNames in the Real World
Unicode Guide to Internationalization
Internationalizing the Internet’s Domain Name System
The First Steps to Go Global
Fixing Input Methods for Abugida Scripts
Automatic Grammar Agreement in Message Formatting
Prove it! Data Driven Conformance Testing
Unicode Source Code Handling
Script Encoding Initiative: Past and Future
Character to Glyph: how Unicode® Text Makes it to Your Screen
Critical Values for I18n Testing
ADLaM, the Power of a Script: Evolution, Community Impact & Challenges Post-Unicode
{ }: MessageFormat v2
What's New in CLDR and ICU
Unicode Properties & Algorithms
🔥😮‍💨🍄🪦💀🐷🐙😤
Lessons Launching Dozens of New Languages in a UI
Locale Aware Units and Units Inflection
“Ask Unicode Anything” with Mark Davis, Unicode’s Cofounder and CTO

Attendees will be encouraged at the event to bring up topics for unconference sessions and lightning talks.

Network with developers and users to help shape the future of Unicode technology. Expect two days of community building around the Unicode technology that makes software work for billions of people across all devices.

When and where: November 7-8, 2023. Bay Area (Hosted at Google). In-Person only!

Register Now at https://www.unicode.org/events/event-registration.html.

Friday, October 6, 2023

ICU4X 1.3: Now With Built-In Data, Case Mapping, Additional Calendar Systems, And More

By Robert Bastian, ICU4X Technical Committee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our last release in April 2023, the ICU4X team has been busy building additional features and improving the usability of the library. Today we're happy to announce the 1.3 release, including built-in data, a new datagen API, the first stable release of the case mapping component, support for more calendar systems, a technology preview of rule-based transliteration, and more.

We have heard feedback that ICU4X's data pipeline, while allowing powerful customization, has a significant learning curve. In ICU4X 1.3 we are therefore introducing a new feature called "compiled data", where we ship data generated from the latest CLDR and ICU versions in the library. This means that every ICU4X type gains a new constructor that does not take a data provider argument, but instead uses the compiled data. This data uses our existing "baked data" format, which, being plain Rust code, allows the compiler to perform optimizations and granularly exclude unnecessary data. In fact, programs that do not use any of the new constructors will not see a binary size difference even with the compiled_data Cargo feature enabled (it is enabled by default).

In addition to adding compiled data, we have also revamped our data generation API icu_datagen. The new API is more ergonomic, allows for more flexible data generation, such as choosing which segmentation models to include, and also better optimizes the size of the generated data. For example, with the new "fallback mode" flag, data can be generated under the assumption that locale fallback is going to be used at runtime. This way, data for e.g. en-CA does not have to be included if it matches the data for en, because at runtime en will be tried if en-CA doesn't exist. This mode of data deduplication is already used for compiled data, which comes with built-in fallback.
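To make the deduplication idea concrete, here is a minimal sketch in Python. It is not the icu_datagen implementation: the function names, the sample patterns, and the simple "drop the last subtag" fallback chain are all assumptions made for illustration, and real CLDR fallback has more rules than this.

```python
from typing import Dict, Optional


def parent_locale(locale: str) -> Optional[str]:
    """Simplified fallback step: 'en-CA' -> 'en' -> 'und' (root) -> None."""
    if locale == "und":
        return None
    head, sep, _ = locale.rpartition("-")
    return head if sep else "und"


def lookup(data: Dict[str, str], locale: str) -> str:
    """Runtime lookup: walk up the fallback chain until some data is found."""
    current: Optional[str] = locale
    while current is not None:
        if current in data:
            return data[current]
        current = parent_locale(current)
    raise KeyError(locale)


def deduplicate(data: Dict[str, str]) -> Dict[str, str]:
    """Drop every entry whose value would be found anyway via fallback."""
    kept = {}
    for locale, value in data.items():
        parent = parent_locale(locale)
        # Keep the entry if it has no parent or differs from what it inherits.
        if parent is None or lookup_or_none(data, parent) != value:
            kept[locale] = value
    return kept


def lookup_or_none(data: Dict[str, str], locale: str) -> Optional[str]:
    try:
        return lookup(data, locale)
    except KeyError:
        return None


# Hypothetical sample data: one date pattern per locale.
full = {
    "und": "y MMM d",
    "en": "MMM d, y",
    "en-CA": "MMM d, y",   # same as "en", so it can be dropped
    "en-GB": "d MMM y",    # differs from "en", so it must stay
}
slim = deduplicate(full)
```

After deduplication, `slim` no longer has an entry for en-CA, yet `lookup(slim, "en-CA")` still returns the en pattern, which is the trade the fallback mode makes: smaller data in exchange for a fallback walk at runtime.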

ICU4X 1.3 also stabilizes a new component: casemapping. Many scripts are bicameral, meaning they have an upper and lower case. Casemapping allows for converting between upper, lower, and title case, and the related casefolding operation allows for performing case-insensitive string matching. These operations can be rather nuanced and locale-dependent: for example, the letter “i” capitalizes to “İ” in Turkish, and modern Greek removes accents and adds diæreses when uppercasing.
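The Turkish example is easy to reproduce with Python's built-in string methods, which are locale-independent. The sketch below is only an illustration of why a locale-aware API is needed; `upper_tr` and `lower_tr` are hypothetical helpers, not part of ICU4X or of Python, and they cover only the dotted and dotless "i".

```python
def upper_tr(text: str) -> str:
    """Uppercase with the Turkish rule: i -> İ (U+0130); ı (U+0131) -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def lower_tr(text: str) -> str:
    """Lowercase with the Turkish rule: İ (U+0130) -> i; I -> ı (U+0131)."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


def same_ignoring_case(a: str, b: str) -> bool:
    """Case-insensitive comparison using case folding rather than lower()."""
    return a.casefold() == b.casefold()


# The default mapping ignores the locale entirely.
default_upper = "istanbul".upper()      # "ISTANBUL"
turkish_upper = upper_tr("istanbul")    # "İSTANBUL"
turkish_lower = lower_tr("ISPARTA")     # "ısparta"

# Case folding maps "ß" to "ss", so these two spellings compare equal.
folded_match = same_ignoring_case("Straße", "STRASSE")
```

The Greek rules mentioned above (removing accents, adding diaereses) are not attempted here; they depend on context in ways a simple character replacement cannot express.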

This release also completes the set of calendars to include all CLDR calendars. In addition to the Gregorian, Thai Solar Buddhist, Coptic, Ethiopian, Indian National (Śaka), and Japanese calendars that have been supported since 1.0, ICU4X now also supports the Chinese, Korean (Dangi), Hebrew, Persian (Solar Hijri), R.O.C., and four variants of the Islamic calendar (civil, observational, tabular, and Umm al-Qura). This support includes formatting, though formatting for Chinese and Korean is currently in a preview state.
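Some of these calendars differ from Gregorian only in how years are counted, which a few lines of Python can show. This is a rough sketch, not how ICU4X models calendars: the names below are made up for illustration, and it ignores eras before each epoch as well as historical differences in when the year starts.

```python
from datetime import date

# Year offsets relative to the Gregorian year; month and day are unchanged.
YEAR_OFFSETS = {
    "roc": -1911,      # R.O.C. year 1 corresponds to 1912 CE
    "buddhist": 543,   # Thai Solar Buddhist: 2023 CE is 2566 BE
}


def to_calendar_year(d: date, calendar: str) -> int:
    """Return the year number of a Gregorian date in the given calendar."""
    return d.year + YEAR_OFFSETS[calendar]


release_day = date(2023, 10, 6)
roc_year = to_calendar_year(release_day, "roc")            # 112
buddhist_year = to_calendar_year(release_day, "buddhist")  # 2566
```

Calendars such as Hebrew, Chinese, or the Islamic variants have their own month and leap rules and cannot be reduced to an offset like this.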

We're also launching a transliteration API as a technology preview. Transliteration is the conversion between scripts, such as from Arabic to Latin, preserving pronunciation as far as possible. CLDR supports many transliterations, and this release brings these CLDR transliterations to ICU4X. While data generation is not yet available, users can construct transliterators at runtime to convert between any scripts supported by CLDR.
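For readers new to the idea, rule-based transliteration can be pictured as a table of source-to-target replacements applied left to right, preferring the longest match. The Python sketch below is a toy under that assumption: the rule table is a small hand-written Greek-to-Latin sample, not CLDR's actual rules, and real CLDR transforms also support context, variables, and chained passes.

```python
from typing import Dict

# Illustrative rules only (unaccented lowercase Greek); not taken from CLDR.
GREEK_TO_LATIN: Dict[str, str] = {
    "ου": "ou",
    "α": "a", "γ": "g", "ε": "e", "θ": "th", "λ": "l",
    "ν": "n", "ο": "o", "ρ": "r", "σ": "s", "ς": "s",
}


def transliterate(text: str, rules: Dict[str, str]) -> str:
    """Apply the longest matching rule at each position; copy unknown characters."""
    sources = sorted(rules, key=len, reverse=True)  # try longer sources first
    out = []
    i = 0
    while i < len(text):
        for src in sources:
            if text.startswith(src, i):
                out.append(rules[src])
                i += len(src)
                break
        else:
            # No rule matched here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


word = transliterate("λογος", GREEK_TO_LATIN)     # "logos"
sky = transliterate("ουρανος", GREEK_TO_LATIN)    # "ouranos"
```

Trying longer sources first is what lets "ου" become "ou" as a unit instead of being split into two single-letter rules.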

Finally, ICU4X 1.3 brings a number of smaller features to other components. The experimental display names component now supports formatting language identifiers, in addition to language, script, and region display names; there are performance improvements across the board; and some APIs such as LocaleFallbacker have been moved to better locations.

Read the full ICU4X 1.3 release notes and then the ICU4X tutorial to start using ICU4X in your project.

Thursday, October 5, 2023

Unicode CLDR v44 Beta available for specification review

The Unicode CLDR v44 Beta is now available for specification review and integration testing. The release is planned for November 1st, but any feedback on the specification needs to be submitted well in advance of that date. The specification is available at Draft LDML Modifications. The biggest change is the new Person Names Formatting section.

The beta has already been integrated into the development version of ICU. We would especially appreciate feedback from ICU users and non-ICU consumers of CLDR data, and on Migration issues.

Feedback can be filed at CLDR Tickets.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

There are many other changes: to find out more, see the draft CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g. in a spreadsheet
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g. choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …
