Tuesday, November 26, 2024

UTC #181 Highlights

Unicode Technical Committee (UTC) meeting #181 was held November 6 – 8 in Cupertino, hosted by Apple. Here are some highlights.

Starting the Unicode 17.0 cycle

UTC approved a plan and timeline for the Unicode 17.0 release. Here’s a summary of the timeline:

 

  • November 2024: UTC #181 approved new character repertoire
  • January 2025: UTC #182 will finalize content for the alpha release
  • February – March: alpha release for public review
  • April: UTC #183 will finalize content for the beta release
  • May – June: beta release for public review
  • July: UTC #184 will finalize 17.0 content
  • September: Unicode 17.0 release

 

Unicode 17.0 character and emoji repertoire

UTC #179 had previously approved 4,301 CJK ideographs for Unicode 17.0, including the addition of the CJK Unified Ideographs Extension J block. At this UTC meeting, a number of additional characters and symbols were approved for Unicode 17.0, including five new scripts:

 

  • Beria Erfe is a modern-use script used for the Zaghawa language in eastern Africa.
  • Chisoi is a modern-use script used for the Kurmali language in eastern India.
  • Sidetic is an historic script that was used in ancient Anatolia.
  • Tai Yo is the traditional script for the Tai Yo language, spoken in Vietnam and Laos.
  • Tolong Siki is a modern-use script used for the Kurukh language in eastern India.

 

A few changes were made to the approved new CJK ideographs repertoire: two ideographs from the CJK Extension J block were removed, while four ideographs were added. UTC also approved 297 other non-emoji character additions for already encoded scripts or symbol blocks.

 

UTC #181 also approved 8 new emoji characters for Unicode 17.0, along with a number of emoji ZWJ sequences; see document L2/24-226R for details.

 

Besides characters approved for Unicode 17.0, code points were provisionally assigned for 365 new characters that are candidates for encoding in a future Unicode version.

 

See the Pipeline page for all characters currently approved for Unicode 17.0, along with code points provisionally assigned for future encoding.

 

Algorithm specs

UTC approved some significant changes related to algorithm specifications for Unicode 17.0. Notably, in UAX #14, a new Line_Break property value, Unambiguous_Hyphen, was approved along with related changes to various rules of the line-breaking algorithm. Also, for UTS #10, Unicode Collation Algorithm, information about conformance tests had previously been published in a companion document; for version 17.0 it will be incorporated into UTS #10 itself. New public review issues will be posted soon to get feedback on the planned changes.

 

UTC also approved proposed drafts for two new algorithm specifications:

 

  • Proposed Draft UTS #58, Unicode Linkification: this proposed standard will specify a mechanism for detecting URLs that contain Unicode characters.
  • Proposed Draft UTR #59, East Asian Spacing: this proposed technical report will specify an algorithm for established typographic conventions in East Asian text for spacing between runs of text from different scripts.

 

A public review issue has been posted for review of PD UTS #58 (see PRI #509). A public review issue for PD UTR #59 will be posted soon.

 

Update on Text Terminal Working Group

At UTC #175, a temporary working group was formed to work on improving support for Unicode text in text terminal environments. After a slow start due to the original chairperson no longer being available, Fraser Gordon was chosen as the new chair, and the group has started to function with several interested participants. Fraser Gordon reported on the group’s activity and requested feedback from UTC on some technical questions the working group was facing, including whether it could be in scope to propose requirements for fonts or a text protocol for signaling between applications and terminals. UTC’s feedback was that either of these could be considered. See L2/24-264 for more details.

 

UTC coming to Eastern US

Earlier this year, UTC started discussing the possibility of trying new locations to make it easier for people in other regions or time zones to participate. With interested people in many parts of the world and travel constraints on regular participants, there is no perfect answer. However, we received a generous offer from the University of New Hampshire to host a meeting there, and so UTC has decided to switch the location of the July 2025 meeting from Redmond, WA to Manchester, New Hampshire (about an hour’s drive north of Boston). Some preliminary logistical information will be provided soon to give plenty of time to consider travel plans.

 

For complete details on outcomes from UTC #181, see the draft minutes.


Feedback Requested on Proposed Draft UTS #58 Unicode Linkification

Feedback is requested on Proposed Draft UTS #58 Unicode Linkification, especially by technologists working with browsers and any programs that automatically apply links to URLs, such as email programs. 

So what is Linkification?

With most email programs, when someone pastes in the plain text:
The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン contains information about Albert Einstein.
and sends it to someone else, the recipient sees the same text with the URL turned into a clickable link.

URLs are also “linkified” in many other applications, such as when pasting into a word processor (triggered by typing a space afterward, for example).

Problem

However, many products (many text messaging apps, video messaging chats, etc.) completely fail to recognize any non-ASCII characters past the domain name. And even among those that do recognize such non-ASCII characters, there are gratuitous differences in where they stop linkifying.
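As a rough illustration of this failure, the following Python sketch compares a hypothetical ASCII-only URL pattern with an equally hypothetical whitespace-delimited one. Neither is the mechanism specified in UTS #58, and neither is taken from any particular product; the patterns and the helper name `first_link` are assumptions made for this example.

```python
import re

# Hypothetical ASCII-only pattern of the kind a simple linkifier might use.
ASCII_URL = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Hypothetical alternative: take everything up to the next whitespace.
GREEDY_URL = re.compile(r"https?://\S+")


def first_link(pattern, text):
    """Return the first substring the pattern would linkify, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


japanese = ("The page https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン "
            "contains information about Albert Einstein.")
sentence = "Details are at https://example.com/a."

# The ASCII-only pattern stops at the first non-ASCII character of the path.
print(first_link(ASCII_URL, japanese))   # https://ja.wikipedia.org/wiki/
# The greedy pattern keeps the full path ...
print(first_link(GREEDY_URL, japanese))
# ... but also swallows the sentence-final period.
print(first_link(GREEDY_URL, sentence))  # https://example.com/a.
```

The two patterns disagree on both inputs, which is the kind of divergence a shared specification is meant to remove.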

The linkification process for URLs is already fragmented, with different implementations producing very different results, and the fragmentation is amplified by the addition of non-ASCII characters, which often behave very differently. That is, developers’ lack of familiarity with the behavior of non-ASCII characters has caused the different implementations of linkification to splinter. Yet non-ASCII characters are very important for readability. People do not want to see the above URL expressed in escaped ASCII:
https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%AB%E3%83%99%E3%83%AB%E3%83%88%E3%83%BB%E3%82%A2%E3%82%A4%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3
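The escaped form of that URL can be reproduced with Python’s standard `urllib.parse.quote`, which percent-encodes the UTF-8 bytes of each non-ASCII character. This is only a sketch of generic percent-encoding, not the minimal escaping that UTS #58 defines; the `safe` set chosen here is an assumption for the example.

```python
from urllib.parse import quote, unquote

url = "https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン"

# Keep ':' and '/' literal so the scheme and path separators survive;
# every non-ASCII character here becomes three %XX escapes (its UTF-8 bytes).
escaped = quote(url, safe=":/")
print(escaped)
print(len(url), "->", len(escaped), "characters")

# Unescaping restores the readable form.
assert unquote(escaped) == url
```

Each katakana character expands from one character to nine, which is why the escaped path is so much harder to read.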

Proposed Solution

This proposed draft Unicode Technical Standard #58 Unicode Linkification specifies a standard mechanism for detecting URLs embedded in plain text — in particular, detecting URLs containing non-ASCII characters. It also defines the minimally necessary escaping of non-ASCII code points in the Path, Query, and Fragment portions of a URL that aligns with the mechanism for detecting URLs.

How to Provide Feedback

For information about how to discuss this Public Review Issue and how to supply formal feedback, please see the feedback and discussion instructions. The closing date is 2025 January 02 for this draft, but this is only the first step towards approval.


_________________________________________________

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock


As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Announcing ICU4X 2.0 Beta 1 (and UTW 2024 recording)

Across the globe, people are coming online with smaller and more varied devices including smartphones, smartwatches, and gadgets. An offshoot of the International Components for Unicode (ICU) Technical Committee, the ICU4X Committee, is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.


The ICU4X-TC is happy to now announce the release of ICU4X 2.0 Beta 1. Learn more about it in our UTW 2024 presentation: 2024 ICU4X 2.0: Next Level i18n


This release includes a rewritten datetime component, type-safe preferences in all constructors, CLDR 46 and Unicode 16 data, new experimental duration and unit formatting components, an all-new WebAssembly demo, and improvements to many other components including locale tailoring in segmenter, algorithmic plural selection, and IXDTF parsing for zoned datetimes.


This release includes breaking changes. The most common ones you will encounter are:

  1. All constructors take a preference bag by value instead of a `&DataLocale`.

  2. Many functions had subtle renames, such as `try_from_bytes` becoming `try_from_utf8`.

  3. The datetime component was rewritten, and call sites will need to be migrated.

Refer to the latest documentation for more information. Please also ask questions on GitHub:


https://github.com/unicode-org/icu4x/discussions/5872

This is a beta release, meaning that the team expects this to be mostly compatible with the upcoming 2.0 final release, but there is still room to make changes. Please send feedback by creating an issue or discussion on GitHub.
