Tuesday, February 3, 2026

Highlights from UTC Meeting #186

By: Peter Constable, Chair of the Unicode Technical Committee

The Unicode Technical Committee (UTC) met January 21 – 23 in Sunnyvale, CA. Thanks to Unicode member organization Google for hosting. Here are some highlights.

Progress on Unicode 18.0

Version 18.0 of the Unicode Standard is being prepared for publication in September of this year. At meetings 184 and 185, UTC had approved 12,995 characters for encoding in version 18.0. At this meeting, some additional characters were approved for this version. One of these new characters is the Omani rial sign, a currency symbol recently created by the Omani Central Bank. Other additions include 51 mathematical symbols and 10 standardized variation sequences proposed by the PHILIUMM Project.

UTC authorized the Unicode 18.0 Alpha preview release, which will be available February 10 for public review.

Future additions

A typical step in the process for encoding new characters is provisional assignment of code points for characters that UTC has deemed eligible for encoding. This allows working groups to begin development of content — property data, code charts, text for the core spec — for a future version. At this meeting, code points were provisionally assigned for several characters including three new scripts: Leke script, used in SE Asia; and Mwangwego and Shaaldaa scripts, used in Eastern Africa.

New UTS on links

UTC approved a new Unicode Technical Standard, UTS #58 Unicode Link Detection and Serialization. This standard includes character data, and this first version includes data for characters in Unicode 17.0. Starting with Unicode 18.0, this will become a synchronized standard, with a new version released together with each new version of the Unicode Standard.

New joint working group for orthographic sequences

At UTC #185, the Government of India proposed that Unicode develop specifications for orthographically valid cluster sequences for Hindi and other language orthographies. (See L2/26-061.) Such work would overlap the scopes of both the Unicode Technical Committee and the CLDR Technical Committee: the specs would deal with character sequences in a manner similar to UAX #29, Unicode Text Segmentation, which is maintained by UTC; but each document would cover the orthography of a specific language, which puts this in the scope of CLDR-TC.
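To make the idea of a cluster sequence concrete, here is a minimal Python sketch using only the standard-library `unicodedata` module. The sample syllable is our own illustration, not something taken from the proposal. It shows that a single written Hindi syllable is stored as a sequence of several code points, which is the kind of sequence such specs would need to describe.

```python
import unicodedata

# Illustrative sample (an assumption, not taken from the proposal):
# the Devanagari syllable "kshi", written with escapes to avoid ambiguity.
SYLLABLE = "\u0915\u094D\u0937\u093F"


def describe(text):
    """Return (code point, name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for row in describe(SYLLABLE):
        print(*row)
```

The four code points are a consonant, a virama, a second consonant, and a dependent vowel sign. Whether a given sequence of this kind is valid for a particular language is what an orthography-specific document would have to define; this sketch only lists the parts.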

After UTC #185, Unicode leaders discussed options and proposed formation of a joint working group (JWG) between CLDR-TC and UTC. (See L2/26-045.) At this UTC meeting, this JWG was approved by UTC. It was similarly approved by CLDR-TC at one of their recent meetings. This new JWG will get organized and begin working during the next quarter.

Metadata embedded in “plain” text

It recently came to light that another organization, C2PA, has developed a specification to embed AI-related metadata into “unstructured” (i.e., “plain”) text. (See L2/26-042.) This has been motivated by the EU AI Act (AIA), which goes into enforcement in August of this year. Article 50 of the AIA obligates vendors to “mark” AI-generated content with machine-readable metadata so that the content is detectable as artificially generated. This requirement applies to text content as well as other content types. However, Article 50 doesn’t specify what would count as “marking” of text, nor does it distinguish between different text formats: does it apply to generated source code? SMS messages? File names? C2PA has taken a conservative approach, anticipating that the EU might enforce the requirement on any AI-generated text.

Unfortunately, the scheme added to the C2PA specification embeds sequences of Unicode variation selector characters in a manner that does not conform to the Unicode Standard.
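As a rough illustration of how such text could be flagged, here is a minimal Python sketch. It relies on two facts from the Unicode Standard: the variation selectors VS1 to VS256 occupy U+FE00..U+FE0F and U+E0100..U+E01EF, and a variation sequence consists of a base character followed by a single variation selector. A run of two or more consecutive selectors therefore cannot be a variation sequence. The function name is hypothetical, and this is only a heuristic: it does not implement the C2PA scheme or a full conformance check.

```python
import re

# Variation selectors VS1..VS16 and VS17..VS256. The Mongolian free
# variation selectors are deliberately left out of this sketch.
_VS = "\uFE00-\uFE0F\U000E0100-\U000E01EF"

# Two or more selectors in a row: a variation sequence has exactly one.
_VS_RUN = re.compile(f"[{_VS}]{{2,}}")


def find_selector_runs(text):
    """Return (start index, run length) for each run of 2+ selectors."""
    return [(m.start(), len(m.group())) for m in _VS_RUN.finditer(text)]
```

For ordinary text such as an emoji followed by one VS16, the function returns an empty list. Text carrying a payload of consecutive selectors yields one entry per run.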

UTC discussed this situation together with a representative from C2PA. On the one hand, the discussion brought to light that the text of the Unicode Standard wasn’t sufficiently clear about conformance requirements in relation to variation sequences. On the other hand, UTC was clear that this scheme is non-conformant. Other concerns were mentioned, including that it is a contradiction in terms to say that “unstructured” text can contain metadata. An outcome of this discussion was to recommend that Unicode establish a liaison relationship with C2PA, and that the topic be discussed further between the two organizations.

For complete details on outcomes from UTC #186, see the draft minutes.

----------------------------------------------

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock


Friday, January 30, 2026

Unicode ICU 78.2 and CLDR 48.1 released

There are new maintenance releases of ICU and CLDR, with some small but significant changes. To find out more and to download these releases, go to: 

CLDR and ICU are planning an additional maintenance release in March instead of a major release.

The next major releases, CLDR 49 and ICU 79, are planned for October and will include the data from the next CLDR general submission period which is planned to start in early Q2 2026, as well as Unicode 18.


Friday, December 19, 2025

Opening CLDR Survey Tool early for DDL locales

We are announcing an early submission window for the CLDR Survey Tool, exclusively for Digitally Disadvantaged Languages (DDLs). These include languages across the world that lack full digital support, such as Qʼeqchiʼ with about 1.3M speakers, and many more.


The early submission window will allow more time for individuals and organizations that make DDL contributions, providing crucial data to close the digital support gap. The data will go into the CLDR v50 release, targeted at October 2026. Languages maintained by the CLDR Technical Committee are not available during this special window. They will be available for submission in Q2 2026.


See DDL: Help Center for more information on how to contribute to a DDL language.


If your language is not yet in CLDR, organizations can submit a formal request to add it; see adding a new language.


CLDR Organizations are needed for approval of CLDR data, so that it can be picked up by libraries, applications, programming languages, and operating systems. To register a new CLDR Organization, see adding an organization to CLDR. Individuals can also request languages and submit/approve data; however, the data cannot reach even Basic coverage without at least one CLDR Organization supporting it.



What is CLDR?


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Contributors supply data for their languages via the online Survey Tool. This data is widely used to support much of the world’s software and is also a factor in determining which languages are supported on mobile phones and computer operating systems.


The Survey Tool opened on December 18, 2025 for DDL languages. The tool will remain open for data submission and correction until July 2026. A public alpha will make the draft data available in early August 2026. Data contributed at this time will be scheduled for publication and available for use in October 2026.


Each additional CLDR language starts with a small set of Core Data, such as a list of characters used in the language. Submitters of new languages commit to bringing the coverage up to a minimum of Basic coverage (very basic formats for dates, times, numbers, and endonyms). 


Once a language reaches Basic coverage, it will have the minimum support for use in language selection, such as on mobile devices. That is the first step; for broader support the Moderate level is typically required. 


If you would like to contribute missing data for your language, see Survey Tool Accounts.


Tuesday, December 2, 2025

UTC #185 Highlights

Unicode Technical Committee meeting #185 was held October 27 – 29 in Cupertino, CA, hosted by Apple. Here are some highlights.

Starting the Unicode 18.0 cycle

As we've been following an annual September release cycle for the Unicode Standard, the Q4 UTC meeting is the first meeting during a new cycle. While some decisions targeting the release might have been taken at a previous meeting, this is the first meeting in which the next release has particular focus. One of the decisions taken is to plan out the key milestones and dates for the next new cycle. Here's a summary of the timeline for Unicode 18.0:

  • November 2025: UTC #185 approved new character repertoire

  • January 2026: UTC #186 will finalize content for the alpha release

  • February – March: alpha release open for public review

  • April: UTC #187 will review alpha feedback and finalize content for the beta release

  • May – June: beta release open for public review

  • July: UTC #188 will finalize 18.0 content

  • September: Unicode 18.0 release

Unicode 18.0 character and emoji repertoire

During a release cycle, the primary focus for the alpha review is on the new character repertoire. The repertoire for the alpha review can be updated at the January UTC meeting; but we like to have that planned repertoire largely determined by the Q4 meeting so that working groups can focus early on preparing content that will be needed for the alpha.

UTC #184 had approved around 60 characters for publication in Unicode 18.0. (Some of those had been planned for Unicode 17.0 but, for various reasons, needed to be postponed.) These included the UAE Dirham sign, and the first tranche of a large set of symbols from the writings of Gottfried Leibniz for which proposals are in development. At UTC #185, nearly 13,000 additional characters were approved for encoding in Unicode 18.0. 

The approved additions include encoding of Small Seal script ("Seal"), a repertoire of 11,328 ideographic characters. Seal is distinct from modern Han ideographs (aka, "CJK"), but is an important precursor of CJK resulting from the first efforts to standardize writing across Chinese-speaking regions during China's Qin Dynasty. As such, Seal has important cultural significance in China and for Chinese speakers throughout the world.

Other additions included 1,276 characters allocated in three new blocks: Archaic Cuneiform Numerals, with 311 Cuneiform characters from the fourth millennium BCE; and Jurchen and Jurchen Radicals, with 965 ideographic characters that were used for writing the Jurchen language in the 12th – 13th century CE.

In addition, 321 other characters were approved as additions to a number of existing blocks. This includes many characters for Arabic and Latin scripts, many characters used in phonetic transcription, a number of symbols used in music notation, and a second set of the Leibniz symbols.

Finally, the new characters approved for Unicode 18.0 include nine new emoji characters. Note that many emoji are represented as character sequences, so mentioning the new emoji characters doesn't provide a complete picture. Look for more information about Unicode 18.0 emoji in the coming months.
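The counts quoted above can be cross-checked with a few lines of Python. This is only a sanity check on the figures in this post, under the assumption (not stated in the post) that the listed categories do not overlap and together cover everything approved at UTC #185.

```python
# Counts quoted in this post for additions approved at UTC #185.
approved_at_185 = {
    "Small Seal": 11_328,
    "Archaic Cuneiform Numerals": 311,
    "Jurchen and Jurchen Radicals": 965,
    "additions to existing blocks": 321,
    "emoji characters": 9,
}

# The three new blocks other than Small Seal.
new_blocks = (
    approved_at_185["Archaic Cuneiform Numerals"]
    + approved_at_185["Jurchen and Jurchen Radicals"]
)

# Everything approved at UTC #185, assuming the categories are disjoint.
total_185 = sum(approved_at_185.values())

print(new_blocks)  # 1276
print(total_185)   # 12934
```

Under that assumption, the new-block figure matches the 1,276 given above, and the total of 12,934 is consistent with "nearly 13,000".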

CJK & Unihan

UTC works on CJK character encoding in collaboration with IRG (Ideographic Research Group), a working group under ISO/IEC JTC 1/SC 2. There are over 100,000 CJK ideographs now encoded in Unicode, and with such a large repertoire of characters there are refinements to the already-encoded characters that continue to be made. At UTC #185, recommendations arising from a recent IRG meeting were reviewed, and a number of changes were approved for Unicode 18.0. Some of these are technical details that are not so visible, such as corrections to source references for certain characters (the references cited when the characters were encoded providing evidence of their usage and identity as distinct characters). Among the significant and visible changes approved by UTC are over 700 horizontal extensions, which will be reflected in the Unicode 18.0 code charts with additional glyphs for already-encoded characters.

For complete details on outcomes from UTC #185, see the draft minutes.

About the Unicode Standard

The world relies on digital communications. The Unicode Standard is a vital building block for global digital communications, providing the encoding for more than 155,000 characters used by thousands of languages and scripts throughout the world. 

Each character (letter, diacritic, symbol, emoji, etc.) is represented by a unique numeric code and has defined property data that determines how it behaves in several text processing algorithms.
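As a small illustration, Python's standard-library `unicodedata` module exposes some of this property data. The sketch below is not part of the Standard itself; the helper name is hypothetical, and the sample character is our own choice. It looks up the code point, name, general category, and decomposition of one character, then shows normalization as one algorithm that consumes the decomposition data.

```python
import unicodedata


def properties(ch):
    """Collect a few standard properties for a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


if __name__ == "__main__":
    sample = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
    print(properties(sample))
    # Normalization Form D applies the canonical decomposition.
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", sample)])
```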

With this combination, the Unicode Standard provides the foundation for implementations to support the world's writing systems, enabling billions of people across the globe to seamlessly communicate with one another across platforms and devices. The Standard is also the foundation for the suite of code, libraries, data, and products that the Unicode Consortium delivers for robust language support.





Wednesday, November 5, 2025

Introducing the Unicode Inflection Library Technical Preview Release

The problem of linguistic inflection has long been a barrier to effective software internationalization. The problem is even more visible today with multimodal UIs. In many languages, word forms change (inflect) based on grammatical context, creating a significant challenge for developers aiming to build truly global applications. Getting the wrong word inflection can be as bad as using the wrong preposition in English.

Today, the Unicode Consortium is announcing a major step forward with the Technical Preview Release of the Unicode Inflection Library. It provides direct access through C and C++ APIs, or can be used in conjunction with Message Format 2.0 functionality.

This library is designed to solve a problem that is particularly acute in languages with a large number of inflectional forms, such as the Slavic, Germanic, Romance, Semitic, Indic and agglutinative families of languages.

The issue extends beyond common words like adjectives, nouns, and verbs. In many of these languages, proper nouns—including geo-location names, brands, and people’s names—can also inflect. This complexity affects a large number of users and has been largely unaddressed by the industry, which has typically opted for narrow, language-specific solutions. Even languages like French require handling inflection for gender and number, demonstrating the problem is not limited to a few specific language families.

The Unicode Inflection Library provides a robust and standardized approach to this challenge. It leverages extensive data sets to handle complex grammatical transformations, enabling more accurate text generation, search functionality, and natural language processing. A key resource for this project is the availability of comprehensive lexicons from the Wikidata project, which provide the foundational data necessary for these operations.

Get Started and Participate

This is a community effort. We invite developers and linguists to explore the library's capabilities and contribute to its development. A detailed tutorial is available to help you get started:

Tutorial: https://github.com/unicode-org/inflection/wiki/Tutorial

Release: https://github.com/unicode-org/inflection/releases/tag/Inflection-0.1

Your feedback and contributions are critical for refining the library's rules, expanding language coverage, and ensuring its performance. By participating, you will help build a foundational tool that will make the digital world more accessible and linguistically accurate for hundreds of millions of users.




Thursday, October 30, 2025

Unicode CLDR 48 available

Unicode CLDR 48 is now available and has been integrated into version 78 of ICU.


Some of the most significant changes in this release are the following (for more detail, see the CLDR 48 release note page):

  • Updated for Unicode 17, including new names and search terms for new emoji, new sort order, and Han→Latin romanization additions for many characters.

  • Updated to the latest external standards and data sources, such as the language subtag registry, UN M49 macro regions, ISO 4217 currencies, etc.

  • Many enhancements of the CLDR specification (LDML)

  • Many additions to language data including:

    • Likely Subtags, for deriving the likely script and region from the language (used in many processes)

  • New formatting options:

    • Rational number formats added, allowing for formats like “5½” in tech preview

    • For timezones, usesMetazone adds two new attributes, stdOffset and dstOffset, so that implementations can use either “main” or “rearguard” TZDB data

    • Combination formats added for relative dates + times, such as “tomorrow at 12:30”

    • Additional units added for scientific contexts (coulombs, farads, teslas, etc.) and for English systems (fortnights, imperial pints, etc.)

  • Many corrections and updates for Metazone data and calendar eras (including removal of eras and fixes to start dates)

  • This is the first release where the new CLDR Organization process is in place for DDL languages. As a result, several locales were able to reach higher levels (see below).

See the CLDR 48 release note page for information on accessing the data, reviewing charts of the changes, and — importantly — Migration issues.


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). All major browsers and modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems. 

Locale Coverage Levels

Level      Count   With Script   Regional Variants   Usage
Modern     104     5             305                 Suitable for full UI internationalization
Moderate   13      0             1                   Suitable for “document content” internationalization, e.g. in spreadsheet
Basic      57      10            22                  Suitable for locale selection, e.g. choice of language on mobile phone

Changes in coverage


±     New Level   Locales
📈    Modern      Akan, Bashkir, Chuvash, Kazakh (Arabic), Romansh, Shan, Quechua
📈    Moderate    Anii, Esperanto
📈    Basic       Buriat, Piedmontese, Sicilian, Tuvinian
📉    Basic*      Baluchi (Latin), Kurdish

