Friday, February 7, 2025

Unicode CLDR 47 Alpha Now Available for Testing

 

The Unicode CLDR 47 Alpha is now available for integration testing. 


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.


The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.


CLDR 47 focused on MessageFormat 2.0 and tooling for an expansion of DDL support. It was a closed cycle: locale data changes were limited to bug fixes and the addition of new locales, mostly regional variants.

RBNF improvements and Transforms


CLDR added Gujarati RBNF support, which provides number spell out functionality, and made improvements to many other languages.


Transforms were also improved in the CLDR 46.1 and 47 releases. The changes include:

  • Adding a Hant-Latn transliterator

  • Aliasing Hans-Latn to Hani-Latn

  • Improvements to several other transliterators

More regional variants


Over the past few years there has been an increasing number of requests for locales to be added to languages, such as English, when they are commonly used in a region as a lingua franca.


CLDR has been adding child locales to support these requests and has begun restructuring inheritance to allow for better maintenance of shared regional data, such as currency symbols and metazone names.
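As a rough illustration of why the inheritance structure matters for shared regional data, the sketch below resolves a value by walking a chain of parent locales. The parent table, the locale en_XX, the keys, and the values are all hypothetical placeholders; CLDR's actual parent relationships and data are defined in its own files.

```python
from typing import Dict, List, Optional

# Hypothetical parent table: a made-up regional variant (en_XX) inherits
# from a shared regional parent, then from the language, then from root.
PARENTS: Dict[str, str] = {"en_XX": "en_001", "en_001": "en", "en": "root"}

# Hypothetical data: placeholder values only, not real CLDR content.
DATA: Dict[str, Dict[str, str]] = {
    "root": {"metazone_name": "value-from-root"},
    "en": {"metazone_name": "value-from-en", "currency_symbol": "value-from-en"},
    "en_001": {"currency_symbol": "value-from-en_001"},
    "en_XX": {},  # the child locale carries no data of its own
}


def inheritance_chain(locale: str) -> List[str]:
    """Return the locale followed by each of its ancestors, ending at root."""
    chain = [locale]
    while locale != "root":
        locale = PARENTS[locale]
        chain.append(locale)
    return chain


def lookup(locale: str, key: str) -> Optional[str]:
    """Return the first value found for key along the inheritance chain."""
    for candidate in inheritance_chain(locale):
        if key in DATA.get(candidate, {}):
            return DATA[candidate][key]
    return None
```

In this sketch the child locale stores nothing itself: a value kept once in the shared parent is picked up by every child that inherits from it.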

46.1 Improvements


CLDR 46.1 was a special interim release of CLDR that focused on MessageFormat 2.0. It included a few additional changes:

  • More explicit well-formedness and validity constraints for unit of measurement identifiers

  • Addition of derived emoji annotations that were missing: emoji with skin tones facing right

  • Fixes to make the ja, ko, yue, zh datetimeSkeletons useful for generating the standard patterns

  • Improved date/time test data


For more information, see 46.1 Changes.

Tooling changes


Many tooling changes are difficult to accommodate in a data-submission release, including performance work and UI improvements. The changes in CLDR 47 provide faster turnaround for linguists and higher data quality. They are targeted at the CLDR 48 submission period, starting in April 2025.

For more information


See the draft CLDR v47 Release Note, which has information on accessing the data, reviewing charts of the changes, and — importantly — Migration issues.


_______________________________________________

Tuesday, February 4, 2025

Unicode Welcomes New Board Chair and Members

Mark Davis Steps Down as Board Chair

Mark Davis, co-founder of the Unicode Consortium and a pivotal figure in global digital communication, has transitioned from Chair of the Board of Directors to a continuing role on the Board. He will also remain Chair of the CLDR Technical Committee and Chief Technology Officer (CTO) for the Consortium. This transition reflects Davis’s commitment to ensuring a smooth leadership transition and his continuing dedication to the Consortium’s mission of enabling everyone around the globe and across all technology platforms to seamlessly communicate and collaborate in their own languages.

During Davis’s tenure as Board Chair, the Unicode Consortium solidified its position as a crucial global organization. Under his leadership, the Consortium standardized character encoding for modern and historical scripts and symbols, including the now ubiquitous emoji. He also founded two vital projects that have become core pillars of the Consortium: CLDR (Common Locale Data Repository), providing structured data for internationalization, and ICU (International Components for Unicode), delivering production-ready code libraries. These projects, along with the Unicode encoding, are fundamental to virtually all modern phones and operating systems, enabling billions of people worldwide to communicate in their native languages.

“One of the most satisfying accomplishments in a career is to find successors who take on challenging positions – and achieve even greater impact,” said Mark Davis. “Markus Scherer has done that for ICU; Jennifer Daniel for the Emoji working group (aka ESC), and Toral Cowieson as CEO of the Consortium. Having worked with Cathy during pivotal years in the development of the Unicode encoding, I'm confident that her talents and skills will make her an exceptional Chair of the Board.”

Cathy Wissink Elected as Board Chair

We are pleased to announce that the Unicode Board of Directors voted unanimously to appoint Cathy Wissink as the new chair. Wissink is a 30-year veteran of the technology industry, focused primarily at the intersection of international markets and innovation. The bulk of her career was spent at Microsoft, in diverse roles ranging from engineering to government affairs to product certification. She’s no stranger to the Unicode Consortium, having worked on Unicode and internationalization implementation from the earliest versions of 32-bit Windows through Windows 7. Wissink also led Microsoft’s participation in the Unicode Technical Committee from 2000-2005 and served as UTC vice-chair and INCITS/L2 chair from 2002-2005. 

"I am grateful for the trust that Mark and the board have placed in me as the incoming chair of the Board of Directors for the Unicode Consortium," said Wissink. "Unicode's products and standards are essential to global digital communications, and as innovation progresses and languages evolve, there is still significant work to enable all languages in digital spaces. I look forward to collaborating with Mark Davis and Toral Cowieson, as well as the broader community of technologists, linguists, and specialists to advance Unicode's mission."

Welcome to new Board Members, John Tinsley and Manuela Giese

John Tinsley is the VP of AI Solutions at Unicode member company Translated. He’s a computer scientist with more than 15 years of experience in the localization industry. Prior to Translated, he founded Iconic Translation Machines, an award-winning language technology software business that pioneered the commercial deployment of Neural Machine Translation technology. John led the business for almost a decade before selling it to RWS in 2020 in one of the largest technology deals in the language industry.

He holds a PhD in Computer Science and a degree in Applied Computational Linguistics and is a regular public speaker on topics related to language, translation, and business.

Manuela Giese is a Principal Group Manager at Microsoft. She has spent the last 25 years working on various aspects of localization across content types and languages including complex scripts; she still has fond memories of managing complex script languages through localization deliveries in the earlier days of Unicode support.

In recent years, she has been more focused on business horizontals supporting localization models and their challenges. She is passionate about language and culture and how both intersect with equality and gender. She has spent significant time in Europe, South America, and the US and currently resides on the ancestral homeland of the Nooksack, Lummi, and other Coast Salish peoples.

Unicode would also like to thank Ayman Aldahleh for his many years of service to the Unicode Board, including as Secretary until his retirement in January. 

The full list of Board members is available here.
_______________________________________________

Unicode 17.0 Alpha Review Opens for Feedback

The repertoire for Unicode Version 17.0 is now open for early review and comment. During alpha review, the repertoire is reasonably mature and stable but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be considered.

For the alpha review, preliminary data files are also available, with data covering existing and new character repertoire. In addition, a draft for the core specification is available, with new block descriptions for some of the newly added blocks and scripts.

The primary focus for the alpha review should be on the new character repertoire. This early review is provided so that reviewers may consider the repertoire and data file issues prior to the start of beta review (currently scheduled to start in May 2025). Once beta review begins, the repertoire, code points, and character names will all be locked down, and no longer be subject to changes.


Notable Changes

The planned repertoire for Unicode 17.0 adds 4,836 new characters, bringing the total number of characters to 159,834. The most significant addition for this release is the new CJK Unified Ideographs Extension J block with 4,298 characters. Also added are five new scripts: Beria Erfe, Chisoi, Sidetic, Tolong Siki, and Tai Yo. See The Pipeline and the delta code charts for details.
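The figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this post; the variable names are illustrative.

```python
# Numbers quoted in the post for the Unicode 17.0 alpha repertoire.
new_characters = 4_836
total_after = 159_834
extension_j = 4_298

# Implied total before the additions; this agrees with the 154,998
# characters published for Unicode 16.0.
total_before = total_after - new_characters

# New characters outside CJK Unified Ideographs Extension J.
outside_extension_j = new_characters - extension_j

print(total_before, outside_extension_j)
```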

In addition to new characters, 46 standardized variation sequences will be added for rotational variants of Egyptian Hieroglyphs and “Sibe” forms of quotation marks.

Unicode Emoji 17.0 will include eight new emoji characters plus additional RGI sequences—see PRI #515 Unicode Emoji 17.0 Alpha Repertoire.

Feedback for the alpha review should be reported under PRI #514 using the Unicode contact form by April 2, 2025.

_______________________________________________

Friday, January 31, 2025

Highlights from UTC #182

By Peter Constable, UTC Chair


A huge thank you to Google for hosting Unicode Technical Committee (UTC) meeting #182 last week, January 22 – 24, in Sunnyvale, CA!

For complete details, see the draft UTC #182 minutes.

Unicode 17.0 alpha repertoire

UTC took technical decisions for the Unicode 17.0 alpha, which will be released for public review next week. No changes were made to the new character repertoire approved for Unicode 17.0 at UTC #181, but changes were made to some details for certain characters.

 

  • Name changes were approved for some new characters (three Arabic honorific characters and Tolong Siki letters).
  • For Tangut, the glyph and stroke count will be changed for one character, and the default UCA ordering for Tangut components will be changed.
  • Four variation sequences will be added for “Sibe” forms of quotation marks (U+2018, U+2019, U+201C, U+201D).
  • For CJK, some representative glyphs for 11 characters will be changed (one “G” source and ten “V” source). Also, 1,685 “G” source references will be updated. Various Unihan property value changes were also approved.

 

Data files

A significant change was approved for data files for The Unicode Standard and the other version-synchronized standards, UTS #10, UTS #39, UTS #46 and UTS #51. Up through Version 16.0, data files for The Unicode Standard have been published in version folders on the website in the /Public/ folder (e.g., https://www.unicode.org/Public/16.0.0/), while the data files for synchronized UTSes have been in separate, UTS-specific folders, each with version subfolders (for example, https://www.unicode.org/Public/emoji/16.0/). Starting with Unicode 17.0, data files for the synchronized UTSes will also be published within a version subfolder under /Public/.

 

For example, instead of UCD and UTS #51 data files being organized as follows:

 

https://www.unicode.org/Public/17.0/ucd

https://www.unicode.org/Public/emoji/17.0/

 

they will instead be organized like this:

 

https://www.unicode.org/Public/17.0/ucd

https://www.unicode.org/Public/17.0/emoji/

 

This is close to what has been done in the Public/draft folder for pre-release data files. The organization in that folder will be adjusted for the Unicode 17.0 alpha to match what will be used for the release.
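For consumers that build download URLs in code, the change amounts to swapping the order of two path segments for the synchronized UTS data. The following is a minimal sketch, assuming the folder names shown in the examples above; the function names are hypothetical, the version folder name is passed in as-is, and every URL is given a trailing slash for uniformity.

```python
BASE = "https://www.unicode.org/Public"


def legacy_url(component: str, version_folder: str) -> str:
    """Layout through Version 16.0: UCD sits under the version folder,
    while synchronized UTS data sits under a component-specific folder."""
    if component == "ucd":
        return f"{BASE}/{version_folder}/ucd/"
    return f"{BASE}/{component}/{version_folder}/"


def unified_url(component: str, version_folder: str) -> str:
    """Layout starting with Unicode 17.0: every component sits under
    the version folder."""
    return f"{BASE}/{version_folder}/{component}/"


# Reproduces the emoji example from the post.
print(legacy_url("emoji", "17.0"))
print(unified_url("emoji", "17.0"))
```

Only the non-UCD components move; the UCD location is the same in both layouts.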

 

Property/data changes

Some significant property changes were approved for Unicode 17.0, including the following:

 

  • The Identifier_Type property defined in UTS #39 is used by some identifier systems to limit the set of valid identifiers. In Version 16.0, all CJK ideographs have had a property value that makes them valid in such identifier systems. UTC #182 approved a change to the Identifier_Type value for a large number of CJK ideographs to make them invalid, matching what ICANN has done for IDNA root zone labels.

 

  • The Extended_Pictographic code point property was created to make segmentation behaviours defined in UAX #14 and UAX #29 forward compatible for future emoji characters. When it was created in Unicode 11.0, all unassigned code points in the range 1F000..1FFFD were given this property. When non-emoji characters are assigned in that range, they should not have that property, but UTC has not consistently removed it from those code points. This will be corrected in Unicode 17.0.
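To get a feel for the range involved in the Extended_Pictographic item, the sketch below lists the code points in 1F000..1FFFD that the local Python installation still reports as unassigned. This is only an approximation under stated assumptions: Python's standard library does not expose Extended_Pictographic itself, so the code checks General_Category instead, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Range named in the post for the Extended_Pictographic reservation.
RANGE_START, RANGE_END = 0x1F000, 0x1FFFD


def unassigned_in_range(start: int = RANGE_START, end: int = RANGE_END) -> list:
    """Return code points whose General_Category is Cn (unassigned)
    according to the Unicode data bundled with this Python."""
    return [
        cp
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) == "Cn"
    ]


reserved = unassigned_in_range()
# Report which Unicode version the answer is based on.
print(unicodedata.unidata_version, len(reserved))
```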

 

UTC #182 also authorized a proposed draft for a possible new UAX #60 to document data for non-CJK ideographs based on L2/25-052; a public review issue for this will be posted soon. This would be analogous to UAX #38 but apply to ideographic scripts such as Nüshu and Tangut.

 

Please review!


UTC invites feedback on the following proposed specs:

  • PRI #509, Proposed Draft UTS #58, Unicode Link Detection and Serialization
  • PRI #510, Proposed Draft UTR #59, East Asian Spacing

As mentioned above, Identifier_Type property values for CJK characters are being changed based on analysis provided by ICANN. Other documents submitted to UTC propose other Identifier_Type changes based on similar analysis. UTC invites review and feedback on these documents; see the following public review issue for details:
  • PRI #517, Review of Identifier_Type for existing characters
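For reviewers who want to see how an Identifier_Type change can affect an identifier system, here is a minimal sketch. It assumes a profile that accepts only characters whose Identifier_Type is Recommended or Inclusion and rejects anything missing from the table. The sample lines are a hypothetical excerpt in the usual "range ; value # comment" data-file layout, and the function names are illustrative, not part of any Unicode API.

```python
# Identifier_Type values treated as valid in this sketch (an assumption
# about the profile in use; see UTS #39 for the actual definitions).
ALLOWED_TYPES = {"Recommended", "Inclusion"}

# Hypothetical excerpt; U+E000 is a stand-in for an ideograph whose
# value is being changed, not a real data-file entry.
SAMPLE = """\
0061..007A ; Recommended   # a..z
4E00       ; Recommended   # a common ideograph
E000       ; Uncommon_Use  # hypothetical stand-in
"""


def parse_types(text: str) -> dict:
    """Map each code point to its set of Identifier_Type values."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        code_points, values = (part.strip() for part in line.split(";", 1))
        first, _, last = code_points.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = set(values.split())
    return table


def is_valid_identifier(text: str, table: dict) -> bool:
    """True if every character has at least one allowed Identifier_Type."""
    return all(table.get(ord(ch), set()) & ALLOWED_TYPES for ch in text)


table = parse_types(SAMPLE)
print(is_valid_identifier("abc\u4e00", table))
print(is_valid_identifier("ab\ue000", table))
```

Under a profile like this, an identifier that was previously accepted would start being rejected once the value for one of its characters moves out of the allowed set, which is why feedback from identifier implementers is useful.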

Thursday, January 16, 2025

MessageFormat 2.0 Final Candidate: Review Requested


Unicode recently released CLDR 46.1, a special interim release of CLDR that focuses on the Final Candidate release of MessageFormat 2.0. There are a few other changes, which are summarized at the end of this post.

MessageFormat 2.0 is a significant evolution from ICU MessageFormat 1.0. It is both more powerful in its abilities to represent localizable messages, and also strives to make those messages easier to translate. Unlike its predecessor, it is:

  • not defined through its ICU API, but by a specification that a wide range of implementations can follow; non-ICU implementations already exist.

  • designed for extensibility: new functions and options can easily be added.     

The specification has been developed by the MessageFormat Working Group over the past five years and is open for public comment. It encompasses all the capabilities of the MessageFormat 1.0 syntax and is designed to handle messages in other existing message formats via its data model. Please review the specification before its finalization, and supply feedback on any areas where it does not meet this goal.

It is important to supply feedback on the Final Candidate by February 12, but ideally as early as possible.

While the structure is designed to be very extensible, once the Final Candidate is released as an approved version (in mid-March 2025), stability constraints will prevent incompatible changes to syntax and semantics of MessageFormat 2.0.

To supply feedback, file an issue at: Unicode Message Format Issues — GitHub.

Tech preview implementations of MessageFormat 2.0 include Java, C++ and JavaScript. People can try these out with their implementations to see if there are any issues.

In addition to the MessageFormat 2.0 Final Candidate, there are a few other changes in the CLDR 46.1 release, specifically:

  • More explicit well-formedness and validity constraints for unit of measurement identifiers

  • Addition of derived emoji annotations that were missing: emoji with skin tones facing right

  • Fixes to make the ja, ko, yue, zh datetimeSkeletons useful for generating the standard patterns

  • Improved date/time test data


For more information, see 46.1 Changes.


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Wednesday, December 4, 2024

Heiltsuk Revitalization: Introducing New Letters for the Haíɫzaqv Language


By Kevin King, Type Designer & Researcher, Typotheque



Indigenous people in British Columbia who speak the Haíɫzaqv language were missing characters needed to render the language correctly in writing. As part of an ongoing partnership between the Haíɫzaqv (Heiltsuk) Nation and Typotheque under a Memorandum of Understanding, Heiltsuk Revitalization and Typotheque have created this video to present the story of how the publication of new Latin script characters was achieved and how they were included in Unicode Version 16.0.



This represents a major achievement for language reclamation and sovereignty for the Haíɫzaqv Nation, as it provided complete representation of their orthography within the Unicode Standard, removing significant barriers to digital language access. Alongside this announcement, Typotheque has prepared localized fonts – free and available in perpetuity to all community members – as well as updated keyboard layouts.

With this roadblock removed, Heiltsuk Revitalization and Typotheque look forward to collaborating in partnership on actions that further extend the impact of this successful character addition to Unicode, which enables greater access to and engagement with Haíɫzaqvḷa (the Heiltsuk language) for the present and future.

To learn more, please read the full announcement on the Heiltsuk Revitalization website, and visit the Typotheque Indigenous North American Typography Research project’s website here.

Monday, December 2, 2024

🎁 Giving Tuesday is December 3, 2024


What is Giving Tuesday?

Giving Tuesday is a global generosity movement unleashing the power of people and organizations to transform their communities and the world.

This Giving Tuesday, join us in supporting the technology that ensures billions can communicate seamlessly across platforms. Your donation fuels innovation and inclusivity in global digital standards.

Ways to Support Unicode's Mission

As a non-profit, open-source, open-standards organization, the Unicode Consortium is funded by membership fees and donations from individuals, corporations, and other organizations.

This Giving Tuesday, there are many ways you can support Unicode:

Join us in building a digital world where everyone can communicate—no matter the language or platform. Together, we can make a lasting impact.

Learn how your funding fuels Unicode's mission!

Thank you for your continued support.



Tuesday, November 26, 2024

UTC #181 Highlights

Unicode Technical Committee (UTC) meeting #181 was held November 6 – 8 in Cupertino, hosted by Apple. Here are some highlights.

Starting the Unicode 17.0 cycle

UTC approved a plan and timeline for the Unicode 17.0 release. Here’s a summary of the timeline:

 

  • November 2024: UTC #181 approved new character repertoire
  • January 2025: UTC #182 will finalize content for the alpha release
  • February – March: alpha release for public review
  • April: UTC #183 will finalize content for the beta release
  • May – June: beta release for public review
  • July: UTC #184 will finalize 17.0 content
  • September: Unicode 17.0 release

 

Unicode 17.0 character and emoji repertoire

UTC #179 had previously approved 4,301 CJK ideographs for Unicode 17.0, including the addition of the CJK Unified Ideographs Extension J block. At this UTC meeting, a number of additional characters and symbols were approved for Unicode 17.0, including five new scripts:

 

  • Beria Erfe is a modern-use script used for the Zaghawa language in eastern Africa.
  • Chisoi is a modern-use script used for the Kurmali language in eastern India.
  • Sidetic is an historic script that was used in ancient Anatolia.
  • Tai Yo is the traditional script for the Tai Yo language, spoken in Vietnam and Laos.
  • Tolong Siki is a modern-use script used for the Kurukh language in eastern India.

 

A few changes were made to the approved new CJK ideographs repertoire: two ideographs from the CJK Extension J block were removed, while four ideographs were added. UTC also approved 297 other non-emoji character additions for already encoded scripts or symbol blocks.

 

UTC #181 also approved 8 new emoji characters for Unicode 17.0, along with a number of emoji ZWJ sequences; see document L2/24-226R for details.

 

Besides characters approved for Unicode 17.0, code points were provisionally assigned for 365 new characters that are candidates for encoding in a future Unicode version.

 

See the Pipeline page for all characters currently approved for Unicode 17.0, along with code points provisionally assigned for future encoding.

 

Algorithm specs

UTC approved some significant changes related to algorithm specifications for Unicode 17.0. Notably, in UAX #14, a new Line_Break property value, Unambiguous_Hyphen, was approved, along with related changes to various rules of the line-breaking algorithm. Also, for UTS #10, Unicode Collation Algorithm, information about conformance tests had previously been published in a companion document, but this will be incorporated into UTS #10 for version 17.0. New public review issues will be posted soon to get feedback on the planned changes.

 

UTC also approved proposed drafts for two new algorithm specifications:

 

  • Proposed Draft UTS #58, Unicode Linkification: this proposed standard will specify a mechanism for detecting URLs that contain Unicode characters.
  • Proposed Draft UTR #59, East Asian Spacing: this proposed technical report will specify an algorithm for established typographic conventions in East Asian text for spacing between runs of text from different scripts.

 

A public review issue has been posted for review of PD UTS #58 (see PRI #509). A public review issue for PD UTR #59 will be posted soon.

 

Update on Text Terminal Working Group

At UTC #175, a temporary working group was formed to work on improving support for Unicode text in text terminal environments. After a slow start due to the original chairperson no longer being available, Fraser Gordon was chosen as a new chair for the group, and it has started to function with several interested participants. Fraser Gordon reported on the group’s activity and requested feedback from UTC on some technical questions the working group was facing, including whether it could be in scope to propose requirements for fonts or a text protocol for signaling between applications and terminals — UTC feedback was that either of these could be considered. See L2/24-264 for more details.

 

UTC coming to Eastern US

Earlier this year, UTC started discussing the possibility of trying new locations to make it easier for people in other regions or time zones to participate. Between having people interested from many parts of the world as well as travel constraints on regular participants, there is no perfect answer. However, we received a generous offer from the University of New Hampshire to host a meeting there, and so UTC has decided to switch the location of the July 2025 meeting from Redmond, WA to Manchester, New Hampshire (about an hour's drive north of Boston). Some preliminary logistics information will be provided soon to give plenty of time to consider travel plans.

 

For complete details on outcomes from UTC #181, see the draft minutes.