Thursday, October 2, 2025

Unicode CLDR 48 Beta available for specification review

The Unicode CLDR 48 Beta is now available for specification review and integration testing. The release is planned for October 29th, 2025, but any feedback on the specification needs to be submitted well in advance of that date. The beta specification is available at Draft LDML Modifications. See also the Migration section of the new release page.


CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)


Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.


The beta has already been integrated into the development versions of ICU 78 and ICU4X. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Requesting Changes.

The following are some of the most significant changes to the specification (LDML).

Locale Identifiers and Names

  • Display Name Elements - Described the usage of the language element menu values core and extension, and alt="menu". Also revamped the description of how to construct names for locale IDs, for clarity.

Misc.

  • Character Elements - Added new exemplar types.

  • Person Name Validation - Added guidance for validating person names.

DateTime formats

  • Element dateTimeFormat - Added a new type relative for relative date/times, such as “tomorrow at 10:00”, and updated the guidelines for using the different dateTimeFormat types.

  • timeZoneNames Elements Used for Fallback - Added the gmtUnknownFormat to indicate when the timezone is unknown.

  • Metazone Names - Added the stdOffset and dstOffset attributes to usesMetazone to specify which offset is considered standard time and which offset is considered daylight time.

  • Time Zone Format Terminology - Added the Localized GMT format (and removed the Specific location format). This affects the behavior of the z timezone format symbol. There is also now a mechanism for finding the region code from a short timezone identifier, which is used for the non-location formats (generic or specific).

  • Calendar Data - Specified more precisely the meaning of the era attributes in supplemental data, and how to determine the transition point in time between eras.
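
As a rough illustration of the new relative type, the sketch below combines a relative day name with a formatted time using a CLDR-style combining pattern, in which {1} stands for the date part and {0} for the time part. The pattern string, the RELATIVE_DAYS table, and both function names are assumptions made for this example; real values come from the locale data, and this is not the algorithm from the specification.

```python
from datetime import date, time

# Hypothetical English data; real strings live in the CLDR locale files.
RELATIVE_DAYS = {-1: "yesterday", 0: "today", 1: "tomorrow"}
RELATIVE_PATTERN = "{1} 'at' {0}"  # assumed pattern: {1} = date, {0} = time


def apply_pattern(pattern: str, time_text: str, date_text: str) -> str:
    """Substitute {0}/{1} and strip the single quotes around literal text.

    Simplification: doubled quotes ('') are not treated as an escaped quote.
    """
    out = []
    in_quote = False
    i = 0
    while i < len(pattern):
        if pattern[i] == "'":
            in_quote = not in_quote
            i += 1
        elif not in_quote and pattern.startswith("{0}", i):
            out.append(time_text)
            i += 3
        elif not in_quote and pattern.startswith("{1}", i):
            out.append(date_text)
            i += 3
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


def format_relative(target: date, at: time, today: date) -> str:
    """Format e.g. 'tomorrow at 10:00', falling back to an ISO date."""
    offset = (target - today).days
    day_text = RELATIVE_DAYS.get(offset, target.isoformat())
    time_text = f"{at.hour}:{at.minute:02d}"
    return apply_pattern(RELATIVE_PATTERN, time_text, day_text)


print(format_relative(date(2025, 10, 3), time(10, 0), today=date(2025, 10, 2)))
```

Days outside the small table fall back to a plain ISO date here; a real implementation would instead switch to one of the other dateTimeFormat types.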

Numbers

  • Plural rules syntax - Added substantial clarifications and new examples. The order of execution is also clearly specified.

  • Compact Number Formats - Specified the mechanism for formatting compact numbers more precisely.

  • Rational Numbers - Added support for formatting fractions like 5½.
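
To make the idea of a rational number format concrete, here is a minimal sketch that renders a fraction as a whole part followed by a precomposed Unicode fraction character where one exists. The VULGAR table, the fallback using FRACTION SLASH, and the function name format_rational are assumptions for illustration only; the actual patterns and rules are defined by the CLDR data and specification.

```python
from fractions import Fraction

# A few precomposed Unicode vulgar fractions (illustrative subset).
VULGAR = {
    Fraction(1, 2): "½",
    Fraction(1, 4): "¼",
    Fraction(3, 4): "¾",
    Fraction(1, 3): "⅓",
    Fraction(2, 3): "⅔",
}


def format_rational(value: Fraction) -> str:
    """Format a Fraction as whole part plus fractional part, e.g. 5½."""
    sign = "-" if value < 0 else ""
    value = abs(value)
    whole = value.numerator // value.denominator
    rest = value - whole
    if rest == 0:
        return f"{sign}{whole}"
    if rest in VULGAR:
        frac, sep = VULGAR[rest], ""
    else:
        # Fallback: numerator, FRACTION SLASH (U+2044), denominator,
        # separated from the whole part by a space to avoid ambiguity.
        frac, sep = f"{rest.numerator}\u2044{rest.denominator}", " "
    if whole == 0:
        return f"{sign}{frac}"
    return f"{sign}{whole}{sep}{frac}"


print(format_rational(Fraction(11, 2)))
```

The space before the fallback form is a design choice of this sketch: without it, 5 and 3⁄7 would run together and read as 53⁄7.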

Units of Measurement

  • Unit Syntax - Simplified the EBNF product_unit and added an additional well-formedness constraint for mixed units.

  • Unit Identifier Normalization - Modified the normalization process.

  • Mixed Units - Modified the guidance for handling precision.
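
For readers unfamiliar with mixed units such as foot-and-inch, the following sketch shows why precision handling needs care: rounding each component separately can produce results like "5 ft 12 in". The approach shown (round the total in the smallest unit first, then split) is one common technique and is an assumption of this example, not necessarily the revised guidance in the specification; the function name is hypothetical.

```python
INCHES_PER_FOOT = 12
METERS_PER_INCH = 0.0254  # exact by definition


def to_feet_and_inches(meters: float, inch_decimals: int = 0) -> tuple[int, float]:
    """Convert a non-negative length in meters to (feet, inches).

    Rounding is applied once, to the total expressed in the smallest
    unit, so the inch component can never round up to a full 12.
    """
    if meters < 0:
        raise ValueError("this sketch only handles non-negative lengths")
    total_inches = round(meters / METERS_PER_INCH, inch_decimals)
    feet, inches = divmod(total_inches, INCHES_PER_FOOT)
    return int(feet), inches


print(to_feet_and_inches(1.75))
print(to_feet_and_inches(1.82))
```

With 1.82 m the total is about 71.65 inches; rounding that total first yields 6 feet 0 inches, whereas splitting first and rounding the inch part afterwards would yield 5 feet 12 inches.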

MessageFormat

  • Syntax and data model errors - Now prioritized over other errors.

  • Default Bidi Strategy - Now required, and the default.

  • Function :offset - Made Stable. (It was previously draft, and named :math.)

  • Draft functions :datetime, :date, and :time - Updated to build on top of semantic skeletons.

  • Draft function :percent - Added.

There are many more changes that are important to implementations, such as changes to certain identifier syntax and various algorithms. See the Modifications section of the specification for details.

For more details, see the draft CLDR 48 release page, which has information on the changes to data and structure, accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

----------------------------------------------

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

Friday, September 19, 2025

Unicode CLDR 48 Alpha available for testing

The Unicode CLDR 48 Alpha is now available for integration testing. 

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development versions of ICU 78 and ICU4X. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed via CLDR Tickets.

Some of the most significant changes in this release are the following (for more detail, see the CLDR 48 release note page):

  • Updated for Unicode 17, including new names and search terms for new emoji, new sort order, and Han→Latin romanization additions for many characters.
  • Updated to the latest external standards and data sources, such as the language subtag registry, UN M49 macro regions, ISO 4217 currencies, etc.
  • Many enhancements of the CLDR specification (LDML) are due for addition by Oct 1.
  • Many additions to language data including:
    • Likely Subtags, for deriving the likely script and region from the language (used in many processes)
    • Language populations in countries: significant updates to improve accuracy and maintainability
  • New formatting options
    • Rational number formats added, allowing for formats like “5½”
    • For timezones, usesMetazone adds two new attributes stdOffset and dstOffset so that implementations can use either “vanguard” or “rearguard” TZDB data sources
    • Combination formats added for relative dates + times, such as “tomorrow at 12:30”
    • Additional units added for scientific contexts (coulombs, farads, teslas, etc.) and for English systems (fortnights, imperial pints, etc.)
  • Many corrections and updates for Metazone data and for calendars (including removal of eras and fixes to start dates).
  • This is the first release where the new CLDR Organization process is in place for DDL languages. As a result, several locales were able to reach higher levels (see below).
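
The Likely Subtags item above can be illustrated with a small sketch of how a likely script and region might be derived from a language code. The LIKELY table is a tiny hand-picked subset written for this example, the lookup order is simplified, and the function names are hypothetical; the full data and the normative algorithm are in CLDR itself.

```python
# Tiny illustrative subset; the real CLDR data set is far larger.
LIKELY = {
    "en": "en_Latn_US",
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "sr": "sr_Cyrl_RS",
}


def parse(tag: str) -> tuple[str, str, str]:
    """Split a simple locale tag into (language, script, region)."""
    parts = tag.replace("-", "_").split("_")
    lang, script, region = parts[0].lower(), "", ""
    for p in parts[1:]:
        if len(p) == 4 and p.isalpha():
            script = p.title()
        elif (len(p) == 2 and p.isalpha()) or (len(p) == 3 and p.isdigit()):
            region = p.upper()
    return lang, script, region


def add_likely_subtags(tag: str) -> str:
    """Fill in a missing script and/or region; return tag unchanged if unknown."""
    lang, script, region = parse(tag)
    # Simplified lookup order: most specific key first, bare language last.
    candidates = [
        "_".join(x for x in (lang, script, region) if x),
        "_".join(x for x in (lang, region) if x),
        "_".join(x for x in (lang, script) if x),
        lang,
    ]
    for key in candidates:
        if key in LIKELY:
            _, m_script, m_region = parse(LIKELY[key])
            return "_".join((lang, script or m_script, region or m_region))
    return tag


for t in ("zh", "zh-TW", "en_GB", "sr_Latn"):
    print(t, "->", add_likely_subtags(t))
```

Subtags supplied by the caller are kept and only the missing ones are filled in, which is why en_GB keeps its region and sr_Latn keeps its script.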


Locale Coverage Levels

Level     Count  With Script  Regional Variants  Usage
Modern    104    5            305                Suitable for full UI internationalization
Moderate  13     0            1                  Suitable for “document content” internationalization, e.g. in spreadsheets
Basic     57     10           22                 Suitable for locale selection, e.g. choice of language on a mobile phone


Changes in coverage 

±    New Level  Locales
📈   Modern     Akan, Bashkir, Chuvash, Kazakh (Arabic), Romansh, Shan, Quechua
📈   Moderate   Anii, Esperanto
📈   Basic      Buriat, Piedmontese, Sicilian, Tuvinian
📉   Basic*     Baluchi (Latin), Kurdish


For the details, see the CLDR 48 release note page, which has information on accessing the data and reviewing charts of the changes and, importantly, will cover Migration issues.




Tuesday, September 9, 2025

Unicode 17.0 Release Announcement


Announcing The Unicode® Standard, Version 17.0

The Unicode Standard is the foundation for all global digital communications, providing the encoding for text content used in all devices. The latest version of the standard, Version 17.0, is now available! This is a major update that includes new characters and code charts, updated data files, an updated Core Specification, and updated annexes and synchronized standards that cover implementation details for important aspects of text processing.     


This version adds 4,803 new characters, including four new scripts and eight new emoji characters, as well as many other characters and symbols, bringing the total of encoded characters to 159,801.

One of the newly encoded symbols is SAUDI RIYAL SIGN. Addition of this in Unicode 17.0 will allow interoperable support for the new symbol announced earlier this year by the Saudi Central Bank to represent their riyal currency.

The new additions also include 4,298 additional CJK unified ideographs in a new block, CJK Unified Ideographs Extension J, as well as 18 other CJK ideographs added to the existing Extension C and Extension E blocks. This increases the number of encoded CJK ideographs to over 100,000! Also, nearly 2,500 already-encoded CJK ideographs are horizontally extended by the addition of source references and glyphs reflecting use of those ideographs in China and Korea.

The following four new scripts increase the total number of supported scripts in the Unicode Standard to 172:
  • Beria Erfe is a modern-use script used by Zaghawa communities in central Africa.
  • Tolong Siki is a modern-use script used by Kurukh communities in northeast India.
  • Tai Yo is the traditional script of Tai Yo communities in northern Vietnam.
  • Sidetic is an historic script used in ancient Anatolia.
Support for these in Unicode is the key initial step in bridging the digital divide for users of these scripts. 

See the delta code charts for details on all the new scripts and characters. For additional details regarding new emoji, see Emoji Recently Added, v17.0. For complete details on Unicode Version 17.0, see https://www.unicode.org/versions/Unicode17.0.0/



Wednesday, July 30, 2025

Highlights from UTC Meeting #184

The Unicode Technical Committee (UTC) meeting #184 was held last week, July 22–24, in Redmond, Washington, hosted by Microsoft. Here are some highlights.

Finalizing Unicode 17.0

The top priority was to finalize technical decisions for Unicode 17.0 in preparation for a September 9 release. Beta feedback and a small number of new proposals were considered, and various decisions affecting Unicode 17.0 were taken. 

The most significant change from the Unicode 17.0 Beta is the removal of 44 characters, based on feedback requesting more time to review these characters and the associated proposals:

  • 09FF BENGALI LETTER SANSKRIT BA
  • 0B53 ORIYA SIGN DOT ABOVE
  • 0B54 ORIYA SIGN DOUBLE DOT ABOVE
  • 1FADD APPLE CORE
  • 40 Chisoi script characters and the Chisoi block at 16D80..16DAF

These characters have been postponed to Unicode 18.0. With this change, the total number of new characters for Unicode 17.0 will be 4,803, including CJK Extension J and four new scripts.

Glyph changes were also approved for 21 characters, all of which were encoded in earlier versions.

Certain character property changes were also approved. These include a change to the Word_Break property for 00B8 CEDILLA to accommodate orthographic usage for SENĆOŦEN, an indigenous language spoken in Western Canada. In relation to identifiers and security, the seven scripts added in Unicode 16.0 (Garay, Gurung Khema, Kirat Rai, Ol Onal, Sunuwar, Todhri, and Tulu-Tigalari) will be classified in UAX #31 as Excluded Scripts (Table 4), which means that these will not be included in the General Security Profile for secure identifiers.

First characters approved for Unicode 18.0

The tentative plan for new characters to be added in the next Unicode version is usually decided at the fall UTC meeting. The first approvals for Unicode 18.0, however, were decided last week at UTC #184. These include the 44 characters postponed from Unicode 17.0, mentioned above, as well as U+20CE UAE DIRHAM SIGN and 16 geometric symbols used in the manuscripts of the 17th-century polymath Gottfried Wilhelm Leibniz.

As typically happens at each UTC meeting, several code points were provisionally assigned for other new characters that will be candidates for future versions. 

For characters approved for Unicode 18.0 or provisionally assigned for future versions, see https://www.unicode.org/alloc/Pipeline.html#future.

Text Terminal Working Group progress

A temporary working group was created at UTC #175 to work on improved support for Unicode text in text-only terminal environments, particularly for scripts requiring advanced layout. Progress was initially hindered by changes in the availability of key participants, but the working group is now meeting regularly.

To scope the project, they will prioritize scripts classified in UAX #31 as Recommended. These include a number of scripts for which examples of fixed-width text have not been readily available, and the working group would welcome contributions from anyone with knowledge of prior art for fixed-width Indic text.

For complete details from UTC #184, see the draft minutes.

About the Unicode Standard

The world relies on digital communications. The Unicode Standard is one of the building blocks for global digital communications, providing the encoding for more than 155,000 characters used by thousands of languages and scripts throughout the world. 


Each character (letter, diacritic, symbol, emoji, etc.) is represented by a unique numeric code and has defined property data that determines how it behaves in several text processing algorithms.


With this combination, The Unicode Standard provides the foundation for implementations to support the world's writing systems, enabling billions of people across the globe to seamlessly communicate with one another across platforms and devices. The Standard is also the foundation for the suite of code, libraries, data, and products that the Unicode Consortium delivers for robust language support.
