Thursday, January 16, 2025

MessageFormat 2.0 Final Candidate: Review Requested


Unicode recently released CLDR 46.1, a special interim release of CLDR that focuses on the Final Candidate release of MessageFormat 2.0. There are a few other changes, which are summarized at the end of this post.

MessageFormat 2.0 is a significant evolution from ICU MessageFormat 1.0. It is more powerful in its ability to represent localizable messages, and it also strives to make those messages easier to translate. Unlike its predecessor, it is:

  • not defined through its ICU API, but by a specification that can be adopted by a wide range of implementations, and it already has non-ICU implementations.

  • designed for extensibility: new functions and options can easily be added.     

The specification has been developed by the MessageFormat Working Group over the past five years and is open for public comment. It encompasses all the capabilities of the MessageFormat 1.0 syntax and is designed to handle messages in other existing message formats via its data model. Please review the specification before its finalization, and supply feedback on any areas where it does not meet this goal.

It is important to supply feedback on the Final Candidate by February 12, but ideally as early as possible.

While the structure is designed to be very extensible, once the Final Candidate is released as an approved version (in mid-March 2025), stability constraints will prevent incompatible changes to syntax and semantics of MessageFormat 2.0.

To supply feedback, file an issue at: Unicode Message Format Issues — GitHub.

Tech preview implementations of MessageFormat 2.0 include Java, C++ and JavaScript. People can try these out with their implementations to see if there are any issues.

In addition to the MessageFormat 2.0 Final Candidate, there are a few other changes in the CLDR 46.1 release, specifically:

  • More explicit well-formedness and validity constraints for unit of measurement identifiers

  • Addition of derived emoji annotations that were missing: emoji with skin tones facing right

  • Fixes to make the ja, ko, yue, zh datetimeSkeletons useful for generating the standard patterns

  • Improved date/time test data


For more information, see 46.1 Changes


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Wednesday, December 4, 2024

Heiltsuk Revitalization: Introducing New Letters for the Haíɫzaqv Language


By Kevin King, Type Designer & Researcher, Typotheque



Indigenous people in British Columbia who speak the Haíɫzaqv language were missing characters needed to render the language correctly in writing. As part of an ongoing partnership between the Haíɫzaqv (Heiltsuk) Nation and Typotheque under a Memorandum of Understanding, Heiltsuk Revitalization and Typotheque have created this video to present the story of how new Latin script characters came to be published and included in Unicode Version 16.0.



This represents a major achievement for language reclamation and sovereignty for the Haíɫzaqv Nation, as it provided complete representation of their orthography within the Unicode Standard, removing significant barriers to digital language access. Alongside this announcement, Typotheque has prepared localized fonts – free and available in perpetuity to all community members – as well as updated keyboard layouts.

With this roadblock removed, Heiltsuk Revitalization and Typotheque look forward to collaborating in partnership on actions that further extend the impact of this successful character addition to Unicode, which enables greater access to and engagement with Haíɫzaqvḷa (the Heiltsuk language) now and in the future.

To learn more, please read the full announcement on the Heiltsuk Revitalization website, and visit the Typotheque Indigenous North American Typography Research project’s website here.

Monday, December 2, 2024

🎁 Giving Tuesday is December 3, 2024


What is Giving Tuesday?

Giving Tuesday is a global generosity movement unleashing the power of people and organizations to transform their communities and the world.

This Giving Tuesday, join us in supporting the technology that ensures billions can communicate seamlessly across platforms. Your donation fuels innovation and inclusivity in global digital standards.

Ways to Support Unicode's Mission

As a non-profit, open-source, open-standards organization, the Unicode Consortium is funded by membership fees and donations from individuals, corporations, and other organizations.

This Giving Tuesday, there are many ways you can support Unicode:

Join us in building a digital world where everyone can communicate—no matter the language or platform. Together, we can make a lasting impact.

Learn how your funding fuels Unicode's mission!

Thank you for your continued support.



Tuesday, November 26, 2024

UTC #181 Highlights

Unicode Technical Committee (UTC) meeting #181 was held November 6 – 8 in Cupertino, hosted by Apple. Here are some highlights.

Starting the Unicode 17.0 cycle

UTC approved a plan and timeline for the Unicode 17.0 release. Here’s a summary of the timeline:

 

  • November 2024: UTC #181 approved new character repertoire
  • January 2025: UTC #182 will finalize content for the alpha release
  • February – March: alpha release for public review
  • April: UTC #183 will finalize content for the beta release
  • May – June: beta release for public review
  • July: UTC #184 will finalize 17.0 content
  • September: Unicode 17.0 release

 

Unicode 17.0 character and emoji repertoire

UTC #179 had previously approved 4,301 CJK ideographs for Unicode 17.0, including the addition of the CJK Unified Ideographs Extension J block. At this UTC meeting, a number of additional characters and symbols were approved for Unicode 17.0, including five new scripts:

 

  • Beria Erfe is a modern-use script used for the Zaghawa language in eastern Africa.
  • Chisoi is a modern-use script used for the Kurmali language in eastern India.
  • Sidetic is an historic script that was used in ancient Anatolia.
  • Tai Yo is the traditional script for the Tai Yo language, spoken in Vietnam and Laos.
  • Tolong Siki is a modern-use script used for the Kurukh language in eastern India.

 

A few changes were made to the approved new CJK ideographs repertoire: two ideographs from the CJK Extension J block were removed, while four ideographs were added. UTC also approved 297 other non-emoji character additions for already encoded scripts or symbol blocks.

 

UTC #181 also approved 8 new emoji characters for Unicode 17.0, along with a number of emoji ZWJ sequences; see document L2/24-226R for details.

 

Besides characters approved for Unicode 17.0, code points were provisionally assigned for 365 new characters that are candidates for encoding in a future Unicode version.

 

See the Pipeline page for all characters currently approved for Unicode 17.0, along with code points provisionally assigned for future encoding.

 

Algorithm specs

UTC approved some significant changes related to algorithm specifications for Unicode 17.0. Notably, in UAX #14, a new Line_Break property value, Unambiguous_Hyphen, was approved, along with related changes to various rules of the line-breaking algorithm. Also, for UTS #10, Unicode Collation Algorithm, information about conformance tests had previously been published in a companion document, but this will be incorporated into UTS #10 for version 17.0. New public review issues will be posted soon to get feedback on the planned changes.

 

UTC also approved proposed drafts for two new algorithm specifications:

 

  • Proposed Draft UTS #58, Unicode Linkification: this proposed standard will specify a mechanism for detecting URLs that contain Unicode characters.
  • Proposed Draft UTR #59, East Asian Spacing: this proposed technical report will specify an algorithm for established typographic conventions in East Asian text for spacing between runs of text from different scripts.

 

A public review issue has been posted for review of PD UTS #58 (see PRI #509). A public review issue for PD UTR #59 will be posted soon.

 

Update on Text Terminal Working Group

At UTC #175, a temporary working group was formed to work on improving support for Unicode text in text terminal environments. After a slow start due to the original chairperson no longer being available, Fraser Gordon was chosen as a new chair for the group, and it has started to function with several interested participants. Fraser Gordon reported on the group’s activity and requested feedback from UTC on some technical questions the working group was facing, including whether it could be in scope to propose requirements for fonts or a text protocol for signaling between applications and terminals — UTC feedback was that either of these could be considered. See L2/24-264 for more details.

 

UTC coming to Eastern US

Earlier this year, UTC started discussing the possibility of trying new locations to make it easier for people in other regions or time zones to participate. Between interest from people in many parts of the world and travel constraints on regular participants, there is no perfect answer. However, we received a generous offer from the University of New Hampshire to host a meeting there, and so UTC has decided to switch the location of the July 2025 meeting from Redmond, WA to Manchester, New Hampshire (about an hour's drive north of Boston). Some preliminary logistical information will be provided soon to give plenty of time to consider travel plans.

 

For complete details on outcomes from UTC #181, see the draft minutes.


Feedback Requested on Proposed Draft UTS #58 Unicode Linkification

Feedback is requested on Proposed Draft UTS #58 Unicode Linkification, especially by technologists working with browsers and any programs that automatically apply links to URLs, such as email programs. 

So what is Linkification?

With most email programs, when someone pastes in the plain text:
The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.
and sends it to someone else, the recipient sees the same text with the URL turned into a clickable link:
The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.

URLs are also “linkified” in many other applications, such as when pasting into a word processor (triggered by typing a space afterward, for example).

Problem

However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.
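The failure described above can be reproduced with a small sketch. Both patterns below are hypothetical and deliberately naive; neither is the mechanism specified in the proposed UTS #58. The first only accepts ASCII URL characters, so it stops at the first non-ASCII character after the domain name. The second accepts any run of non-whitespace characters, which is one of many possible (and mutually inconsistent) choices about where to stop.

```python
import re

TEXT = (
    "The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン "
    "contains information about Albert Einstein."
)

# Hypothetical ASCII-only matcher: accepts only ASCII URL characters,
# so matching ends at the first non-ASCII character.
ASCII_ONLY = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Hypothetical broader matcher: accepts everything up to the next whitespace.
# This is an illustration only, not the UTS #58 algorithm.
NON_SPACE = re.compile(r"https?://\S+")


def find_links(pattern, text):
    """Return every substring of `text` that `pattern` treats as a URL."""
    return [match.group(0) for match in pattern.finditer(text)]


print(find_links(ASCII_ONLY, TEXT))  # link is cut off right after "/wiki/"
print(find_links(NON_SPACE, TEXT))   # link includes the Japanese page title
```

Two implementations that differ only in this one pattern already produce different links for the same text, which is the kind of divergence a shared specification is meant to remove.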

The linkification process for URLs is already fragmented, with different implementations producing very different results, and the fragmentation is amplified by the addition of non-ASCII characters, which often have very different behavior. That is, developers’ lack of familiarity with the behavior of non-ASCII characters has caused the different implementations of linkification to splinter. Yet non-ASCII characters are very important for readability. People do not want to see the above URL expressed in escaped ASCII.
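To see what the escaped ASCII form of the example URL looks like, the following sketch percent-encodes it with Python's standard `urllib.parse.quote`. The choice of `safe=":/"` is an assumption made for this example so that the scheme and path separators stay readable; it is not taken from the proposed UTS #58, which defines its own minimal escaping rules.

```python
from urllib.parse import quote, unquote

url = "https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン"

# Percent-encode each non-ASCII character as its UTF-8 bytes (%XX per byte),
# leaving ":" and "/" untouched so the URL structure is preserved.
escaped = quote(url, safe=":/")

# Decoding restores the original, readable form.
restored = unquote(escaped)

print(escaped)
print(restored)
```

Each of the 14 characters in the page title becomes three `%XX` triplets, so the escaped path is much longer than the original and unreadable to a Japanese speaker.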

Proposed Solution

This proposed draft Unicode Technical Standard #58 Unicode Linkification specifies a standard mechanism for detecting URLs embedded in plain text — in particular, detecting URLs containing non-ASCII characters. It also defines the minimally necessary escaping of non-ASCII code points in the Path, Query, and Fragment portions of a URL that aligns with the mechanism for detecting URLs.

How to Provide Feedback

For information about how to discuss this Public Review Issue and how to supply formal feedback, please see the feedback and discussion instructions. The closing date is 2025 January 02 for this draft, but this is only the first step towards approval.


_________________________________________________

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock


As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Announcing ICU4X 2.0 Beta 1 (and UTW 2024 recording)

Across the globe, people are coming online with smaller and more varied devices including smartphones, smartwatches, and gadgets. An offshoot of the International Components for Unicode (ICU) Technical Committee, the ICU4X Committee, is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.


The ICU4X-TC is happy to now announce the release of ICU4X 2.0 Beta 1. Learn more about it in our UTW 2024 presentation: 2024 ICU4X 2.0: Next Level i18n


This release includes a rewritten datetime component, type-safe preferences in all constructors, CLDR 46 and Unicode 16 data, new experimental duration and unit formatting components, an all-new WebAssembly demo, and improvements to many other components including locale tailoring in segmenter, algorithmic plural selection, and IXDTF parsing for zoned datetimes.


This release includes breaking changes. The most common ones you will encounter are:

  1. All constructors take a preference bag by value instead of a `&DataLocale`.

  2. Many functions had subtle renames, such as `try_from_bytes` becoming `try_from_utf8`.

  3. The datetime component was rewritten, and call sites will need to be migrated.

Refer to the latest documentation for more information. Please also ask questions on GitHub:


https://github.com/unicode-org/icu4x/discussions/5872

This is a beta release, meaning that the team expects this to be mostly compatible with the upcoming 2.0 final release, but there is still room to make changes. Please send feedback by creating an issue or discussion on GitHub.


Tuesday, October 29, 2024

Script Encoding and Cultural Identity: Navigating Digital Exclusion

By Maroua Bezzaoui, SILICON Intern

During the summer of 2024, Unicode’s internship program included interns from Stanford University, Northeastern University, and Google’s Summer of Code. Several of the interns have shared their experiences. The second featured piece is from Maroua Bezzaoui at Stanford University.

Friday, October 25, 2024

ICU 76 Released

Unicode® ICU 76 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

ICU 76 updates to Unicode 16 (blog), including new characters and scripts, emoji, collation & IDNA changes, and corresponding APIs and implementations. It also updates to CLDR 46 (beta blog) locale data with new locales, significant updates to existing locales, and various additions and corrections. For example, the CLDR and Unicode default sort orders are now very nearly the same.

Most of the java.time (Temporal) types can now be formatted directly using the existing ICU4J date/time formatting classes.

There are some new APIs to make ICU easier to use with modern C++ and Java patterns. Most of the C/C++ APIs added for this purpose are implemented as C++ header-only APIs that are usable on top of binary-stable C APIs, which is a first for ICU.

The Java and C++ technology preview implementations of the (also in tech preview) CLDR MessageFormat 2.0 specification have been updated to match recent changes.

ICU 76 and CLDR 46 are major releases, including a new version of Unicode and major locale data improvements.

For details, please see
https://unicode-org.github.io/icu/download/76.html.

