The Unicode Blog: 2023

Wednesday, November 15, 2023

Looking to give back differently for this #GivingTuesday?

Adopt a Character or Emoji to give it the attention it deserves!

Now you can adopt a character and show off your hobby or business, favorite sport, or love. For that special someone who seems to have everything, you can also give a unique gift.

Allergies? 🤧 Traveling? ✈️ No worries, the cat emoji 😺 has no fur and requires no feeding! The dog emoji 🐶? No need to go out for a 3 am walk! Looking to be a Scrabble champ? The strong and fast letter Z is right for you!

Your good friend is studying to be a doctor. How about the stethoscope emoji as a gift? 🩺 Or even an emoji to support your favorite college football team this season! 🏈

With nearly 150,000 characters there's something for everyone! The possibilities are endless! It's also a tax-deductible donation in the United States, to the extent allowed by law. Your company may also provide matching funds.

☯🏏 🏈 ⚽ 🔥🎁💍爱戀🥳 🙌 🎂💗💟₨ ₪ € ₭ ₱🥰 😍♕Ωπ

About Adopt-a-Character

The Adopt-a-Character program was launched in 2015 to support Unicode's mission to ensure everyone can communicate in their own languages. Adopt-a-Character funds have supported work on historic scripts, including Old Uyghur, Old Sogdian, Sogdian, Seal Script (China), Mayan Hieroglyphs, and Egyptian Hieroglyphs. Additional support has been provided to encode the modern scripts Hanifi Rohingya, Tolong Siki, and Sunuwar, among others.

Characters can be adopted at three levels:

Gold - $5,000

For any particular character there can only be one Gold adoption! Be the only!

Silver - $1,000
For any particular character there can only be five Silver adoptions! Be one of the five to adopt your favorite characters as a Silver adopter!

Bronze - $100
For any character, there are an unlimited number of Bronze-level adoptions! Also a wonderful option!

Each adoption is recognized with a digital badge that you (or your recipient!) can proudly share via your social channels and via websites. Adoptions also come with a digital certificate that you can print to display or email to your giftee!

About the Unicode Consortium

The Unicode Consortium is the premier 501(c)3 non-profit, open source, open standards body for the internationalization of software and services. Unicode is arguably the most widely deployed software in the world, available across 20 billion devices and counting! At its core, Unicode enables people around the world to communicate in any language.

And - if you want to simply make a donation to support Unicode’s work, you can do that, too!

This Giving Tuesday, let's come together to continue to celebrate and preserve linguistic diversity. Adopt a character and make a difference!

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Monday, November 13, 2023

UTC #177 Highlights

by Peter Constable, UTC Chair

Unicode Technical Committee (UTC) meeting #177 was held November 1 to 3 in Cupertino, California, hosted by Apple. Here are some highlights from the meeting.

Starting the Unicode 16.0 cycle

UTC approved a plan and timeline for the Unicode 16.0 release. Here’s a summary of the timeline:

January 2024: UTC #178 will finalize content for the alpha release
February – March: alpha release for public review
April: UTC #179 will finalize content for the beta release
May – June: beta release for public review
July: UTC #180 will finalize 16.0 content
September: Unicode 16.0 release

UTC is still adjusting to changes in how work for each release is managed. So, while this will be a “full” release, UTC will be conservative about taking on too many changes, particularly to algorithm specifications (UAXes, UTSes). Also, a new format for the core text will be used in this release: instead of PDF, it will be published using Web technologies (HTML, etc.). To get early validation on format changes, the alpha release will include a sampling of content from the core text.

Unicode 16.0 character and emoji repertoire

UTC had previously approved 1,179 characters for encoding in Unicode 16.0. At this UTC meeting, 15 additional characters were approved for version 16.0, including seven emoji characters. UTC has been planning to include nearly 4,000 additional Egyptian Hieroglyphs in Unicode 16.0. The proposal was discussed, and a small revision was requested. It’s expected these will be approved for Unicode 16.0 at the next UTC meeting. Apart from the additional hieroglyphs, we expect no further characters will be added to the Unicode 16.0 repertoire.

Besides the characters approved for Unicode 16.0, code points were provisionally assigned for 184 new characters that are candidates for encoding in a future Unicode version.

See the Pipeline page for all characters currently approved for Unicode 16.0, along with code points provisionally assigned for future encoding.

Future of UAX #42, UCD in XML

UAX #42, Unicode Character Database in XML (UCDXML), was originally developed by Eric Muller. He and Laurentiu Iancu maintained UCDXML through many versions, and we’re very grateful for this contribution. Eric and Laurentiu are no longer available to maintain this, however, and no others have volunteered to take over maintenance. After discussion over several months in UTC and in the Properties and Algorithms working group, UTC has concluded the best option for the future of UAX #42 is to stabilize it, with data frozen at Unicode 15.1. A Public Review Issue will be posted to get feedback on this plan.

Future maintenance of UCS repertoire

UTC discussed a proposal for ISO/IEC JTC 1/SC 2 to adopt a different process for future maintenance of the repertoire of ISO/IEC 10646, using a maintenance agency rather than the process for developing entirely new standards that has been used in the past. It was felt that this would be more agile and would align better with how expert input has guided actual encoding decisions for several years now. This proposal will be formally submitted to JTC 1/SC 2 as a proposal from the US national standards body.

Full details on these and other outcomes are provided in the minutes—see L2/23-231.

Wednesday, November 1, 2023

What do a leafless tree, a fingerprint, and a harp have in common?

This is not the setup for a riddle. This is Emoji 16.0.

By Jennifer Daniel, Chair of the ESC

This week, the Unicode Technical Committee gathered for our last meeting of 2023 to discuss the encoding, data files, and list of characters related to digitizing the world’s languages. Amongst the topics discussed were emoji, and as a result seven new characters are on their way for inclusion into the Unicode Standard, into your keyboards, and into your hearts ;-)

The final recommendations culminated in seven emoji: one emoji per major category.

An incredibly powerful aspect of written language is that it consists of a finite number of characters that can “do it all”. And yet, as the emoji ecosystem has matured over time our keyboards have ballooned and emoji categories are about to hit — or have hit — a level of saturation. Upon reflecting on how emoji are used, the Unicode Emoji Subcommittee (ESC) has entered a new era where the primary way for emoji to move forward is not merely to add more of them to the Unicode Standard, but to consider how the ones added provide the most linguistic flexibility. As a result, the ESC approves fewer and fewer emoji proposals every year.

The few that are added this year have demonstrated their adaptability in different contexts. Take, for example, fingerprint, which is commonly used to represent multiple concepts. Fingerprints are a symbol of identity (unique as you), security (as a passkey), and forensics (what crime show logo is complete without a fingerprint?). While we think of fingerprints as a relatively modern phenomenon, according to Forensics Digest the earliest use of fingerprints dates back to 1000 B.C.

In fact, all of this year’s emoji candidates have deep roots in history. Harps have been known since antiquity in Asia, Africa, and Europe, dating back at least as early as 3000 BCE. Today the harp has political, sporting, corporate, and religious symbolism 👼. Leafless trees have been around as long as ... well, trees (and poetry!), I suppose. Leafless trees literally represent droughts or winter and metaphorically indicate a state of barrenness and death.

Shovel isn’t just another noun — sure, yes, it’s a tool commonly found in your shed — in our keyboards, however, it’s also a verb. Digging yourself out of a hole, digging yourself into a hole, shoveling 💩, it does it all. But wait, there’s more. Splatter is one of those stealth emoji that when you look at you might be thinking, “really, another sex emoji?” (To be honest, show me someone who doesn’t think an emoji is a sex emoji and I’ll show you someone who lacks imagination). Splatter is a spill. Splatter is expressive. Splatter is soft — a perfect counterpoint to collision 💥 — the bouba to 💥’s kiki.

When can you get these new emoji?

A simple question that deserves a simple answer. Alas, you’re dealing with Unicode so the answer is complex. Did you know it can take up to two years to encode an emoji? It’s true. If we want the symbols we digitize to truly “just work” across the entirety of not just the Internet but all digital surfaces … it takes time. So, don’t expect to see these characters anytime soon. In fact, although the previous batch of emoji (phoenix, lime, broken chain, etc.) was approved last year, they still haven’t landed on your device of choice, but they are well on their way to popping up in the first half of 2024.

Emoji 16.0 has a long road ahead and will appear on most devices in May-June 2025.

Tuesday, October 31, 2023

ICU 74 Released

Unicode® ICU 74 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 74 updates to Unicode 15.1, and to CLDR 44 locale data with various additions and corrections.

ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements. They subsume the changes for the ICU 73.2 and CLDR 43.1 maintenance releases.

Unicode 15.1 adds source code security mechanisms, improves line breaking for southeast Asian scripts, and adds important CJK unified ideographs.

CLDR 44 has added or improved data for a number of languages that have been newly added to ICU, and has improved measurement unit handling, conversion, and formatting.

ICU 74 implements these improvements, adds new C APIs for locale handling, adds a plug-in API for word segmentation, and switches the Java build system to Maven.

For details, please see https://icu.unicode.org/download/74.

Unicode CLDR v44 available

Unicode CLDR version 44 is now available and has been integrated into version 74 of ICU. In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

There are many other changes: to find out more, see the CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g. in spreadsheets
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g. choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

We are currently planning for CLDR version 45 to be a closed release with no submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.

Monday, October 30, 2023

Update from Unicode’s Community Engagement Team

By Elango Cheran, Vice Chair of the Community Engagement Team

If you have been receiving the Unicode Newsletter or following us on LinkedIn, Twitter, and the Friends of Unicode on Facebook, you have likely seen information about new Unicode events and resources.

These events and tools are the result of the coordinated efforts of Unicode’s Community Engagement (CE) Team. This team was formed in March of 2022 after our logistics partner for the Internationalization and Unicode Conference and Unicode decided to go in different directions.

A small group of volunteers and Unicode staff saw this as an opportunity to explore different types of events and see how we could reach a more global audience. Since July of last year, we have held seven online events. We conscientiously maintained a focus on the online medium for these initial events to ensure broader reach and access to knowledge, which is different from how Unicode has approached events in the past. From the recent online events, we have drawn in hundreds of new people from 65+ countries. And for each of those, we have made available and promoted the recordings on the Unicode YouTube channel.

And all that is to say, there is more to do!

As a small non-profit that depends on volunteers, we started modestly and have been pushing our boundaries, experimenting with our tools, and expanding our capabilities with each event.

The upcoming Unicode Technology Workshop is a natural extension of that experimentation. While this is in California as an in-person event, we hope that we can take lessons learned and apply this model to additional in-person events in other parts of the globe.

I am personally thankful for the opportunity to help Unicode connect with a more global audience, given how foundational and impactful Unicode’s work is on people, languages, and communities around the world. The work of our team is made possible by my committed colleagues, some of whom are from organizations such as Google, UC Berkeley, and Spotify.

It is encouraging to see the growing interest in events, as well as interest in people partnering with Unicode to launch such efforts. If you have ideas on types of outreach programming or educational tools that would help you or others on your internationalization journey, please reach out to us via events@unicode.org.

Friday, October 13, 2023

Unicode Technology Workshop on November 7-8 – Update on Sessions!

By the Unicode Technology Workshop Steering Committee

The Unicode Technology Workshop (UTW) is the internationalization event you want to attend this year.

Hear from internationalization experts from Adobe, Google, Meta, Square, UC Berkeley, and many more. Sessions include workshops, seminars, free-form discussions, and lightning talks centered around i18n libraries, locale data updates, globalization tooling, localization pipelines, input methods, and text rendering. Day 2 includes unconference sessions driven by attendees.

Topics for the workshops and seminars include:

Introduction to Unicode and Beyond
Intro to ICU4X Workshop
PersonNames in the Real World
Unicode Guide to Internationalization
Internationalizing the Internet’s Domain Name System
The First Steps to Go Global
Fixing Input Methods for Abugida Scripts
Automatic Grammar Agreement in Message Formatting
Prove it! Data Driven Conformance Testing
Unicode Source Code Handling
Script Encoding Initiative: Past and Future
Character to Glyph: how Unicode® Text Makes it to Your Screen
Critical Values for I18n Testing
ADLaM, the Power of a Script: Evolution, Community Impact & Challenges Post-Unicode
{ }: MessageFormat v2
What's New in CLDR and ICU
Unicode Properties & Algorithms
🔥😮‍💨🍄🪦💀🐷🐙😤
Lessons Launching Dozens of New Languages in a UI
Locale Aware Units and Units Inflection
“Ask Unicode Anything” with Mark Davis, Unicode’s Cofounder and CTO

Attendees will be encouraged at the event to bring up topics for unconference sessions and lightning talks.

Network with developers and users to help shape the future of Unicode technology. Expect two days of community building around the Unicode technology that makes software work for billions of people across all devices.

When and where: November 7-8, 2023. Bay Area (Hosted at Google). In-Person only!

Register Now at https://www.unicode.org/events/event-registration.html.

Friday, October 6, 2023

ICU4X 1.3: Now With Built-In Data, Case Mapping, Additional Calendar Systems, And More

By Robert Bastian, ICU4X Technical Committee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with their users in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our last release in April 2023, the ICU4X team has been busy building additional features and improving the usability of the library. Today we're happy to announce the 1.3 release, including built-in data, a new datagen API, the first stable release of the case mapping component, support for more calendar systems, a technology preview of rule-based transliteration, and more.

We have heard feedback that ICU4X's data pipeline, while allowing powerful customization, has a significant learning curve. In ICU4X 1.3 we are therefore introducing a new feature called "compiled data", where we ship data generated from the latest CLDR and ICU versions in the library. This means that every ICU4X type gains a new constructor that does not take a data provider argument, but instead uses the compiled data. This data uses our existing "baked data" format, which, being plain Rust code, allows the compiler to perform optimizations and granularly exclude unnecessary data. In fact, programs that do not use any of the new constructors will not see a binary size difference even with the compiled_data Cargo feature enabled (it is enabled by default).

In addition to adding compiled data, we have also revamped our data generation API icu_datagen. The new API is more ergonomic, allows for more flexible data generation, such as choosing which segmentation models to include, and also better optimizes the size of the generated data. For example, with the new "fallback mode" flag, data can be generated under the assumption that locale fallback is going to be used at runtime. This way, data for e.g. en-CA does not have to be included if it matches the data for en, because at runtime en will be tried if en-CA doesn't exist. This mode of data deduplication is already used for compiled data, which comes with built-in fallback.
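The idea behind fallback-aware deduplication can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the icu_datagen implementation: the helper names (`fallback_chain`, `deduplicate`, `lookup`) and the sample date patterns are hypothetical, and the fallback here simply drops trailing subtags, whereas real locale fallback in CLDR and ICU4X has more rules.

```python
def fallback_chain(locale: str) -> list[str]:
    """Return the locale followed by its parents, e.g. en-CA -> en -> und."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if chain[-1] != "und":
        chain.append("und")  # "und" stands in for the root locale
    return chain


def deduplicate(data: dict[str, str]) -> dict[str, str]:
    """Drop entries whose value is what fallback would find anyway."""
    kept: dict[str, str] = {}
    for locale, value in data.items():
        # Value of the nearest parent that has data, or None if there is none.
        inherited = next(
            (data[p] for p in fallback_chain(locale)[1:] if p in data),
            None,
        )
        if inherited != value:
            kept[locale] = value
    return kept


def lookup(data: dict[str, str], locale: str) -> str:
    """Resolve a locale at runtime by walking its fallback chain."""
    for candidate in fallback_chain(locale):
        if candidate in data:
            return data[candidate]
    raise KeyError(locale)


# Hypothetical sample data: en-CA repeats the value stored for en.
full = {"und": "y-MM-dd", "en": "M/d/y", "en-CA": "M/d/y", "en-GB": "dd/MM/y"}
slim = deduplicate(full)
```

In this sketch `slim` no longer contains an entry for en-CA, yet `lookup(slim, "en-CA")` still resolves to the same value through en.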

ICU4X 1.3 also stabilizes a new component: casemapping. Many scripts are bicameral, meaning they have an upper and lower case. Casemapping allows for converting between upper, lower, and title case, and the related casefolding operation allows for performing case-insensitive string matching. These operations can be rather nuanced and locale-dependent: for example, the letter “i” capitalizes to “İ” in Turkish, and modern Greek removes accents and adds diæreses when uppercasing.
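To make the distinction between case mapping and case folding concrete, here is a small Python sketch. It uses Python's built-in string methods rather than ICU4X, and those methods apply only the locale-independent default mappings. The `turkish_upper` helper is a hypothetical, deliberately minimal tailoring written for this illustration; it is not a complete Turkish case mapping.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison via default case folding (no normalization)."""
    return a.casefold() == b.casefold()


def turkish_upper(text: str) -> str:
    """Hypothetical tailoring: map dotted 'i' to U+0130 before default uppercasing."""
    return text.replace("i", "\u0130").upper()


# Default uppercasing can change the length of a string.
german = "straße".upper()

# Case folding lets differently cased spellings compare as equal.
same = caseless_equal("Straße", "STRASSE")

# The default mapping knows nothing about Turkish, so 'i' becomes plain 'I'.
default_i = "i".upper()
tailored = turkish_upper("istanbul")
```

The tailored result starts with U+0130 (İ), while the default mapping of the same word would start with a plain I, which is why locale-aware casing needs a dedicated API.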

This release also completes the set of calendars to include all CLDR calendars. In addition to the Gregorian, Thai Solar Buddhist, Coptic, Ethiopian, Indian National (Śaka), and Japanese calendars that have been supported since 1.0, ICU4X now also supports the Chinese, Korean (Dangi), Hebrew, Persian (Solar Hijri), R.O.C., and four variants of the Islamic calendar (civil, observational, tabular, and Umm al-Qura). This support includes formatting, though formatting for Chinese and Korean is currently in a preview state.

We're also launching a transliteration API as a technical preview. Transliteration is the conversion between scripts, such as from Arabic to Latin, preserving pronunciation as far as possible. CLDR supports many transliterations, and this release brings these CLDR transliterations to ICU4X. While data generation is not yet available, users can runtime-construct transliterators to convert between any scripts supported by CLDR.

Finally, ICU4X 1.3 brings a number of smaller features to other components. The experimental display names component now supports formatting language identifiers, in addition to language, script, and region display names; there are performance improvements across the board; and some APIs such as LocaleFallbacker have been moved to better locations.

Read the full ICU4X 1.3 release notes and then the ICU4X tutorial to start using ICU4X in your project.

Thursday, October 5, 2023

Unicode CLDR v44 Beta available for specification review

The Unicode CLDR v44 Beta is now available for specification review and integration testing. The release is planned for November 1st, but any feedback on the specification needs to be submitted well in advance of that date. The specification is available at Draft LDML Modifications. The biggest change is the new Person Names Formatting section.

The beta has already been integrated into the development version of ICU. We would especially appreciate feedback from ICU users and non-ICU consumers of CLDR data, and on Migration issues.

Feedback can be filed at CLDR Tickets.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

There are many other changes: to find out more, see the draft CLDR v44 release page, which has information on accessing the data, reviewing charts of the changes, and, importantly, Migration issues.

In version 44, the following levels were reached:

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g. in spreadsheets
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g. choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

Friday, September 22, 2023

Unicode Version 15.1 – Tips for Implementers

The Unicode Version 15.1 release includes the UCD (Unicode Character Database), Code Charts, and Annexes, but the Core Specification is unchanged from Unicode Version 15.0. In addition to new characters, a small number of errata were fixed, along with improved representative glyphs.

Implementers should also take careful note of important changes that were made to the following UAXes:

For UAX #9 (Unicode Bidirectional Algorithm), the text for BD16, the interaction of control flow between W4 through W6, the use of sos, and the treatment of AN/EN with brackets in N0 were clarified, and a reference to UTS #55 was added.
For UAX #14 (Unicode Line Breaking Algorithm), line breaking at orthographic syllable boundaries was added, the handling of French-style quotation marks was improved, and allowed tailorings were more clearly characterized.
For UAX #29 (Unicode Text Segmentation), explicit conformance rules were added, support for ConjunctLinker clusters was added, the definition of “crlf” was updated, and multiple changes were made to the table of Word_Break Property Values.
For UAX #31 (Unicode Identifiers and Syntax), multiple changes were made to Section 2, Section 4 was completely rewritten, Section 7 was added, limited contexts for joining controls were moved to UTS #39, and a reference to UTS #55 was added.
For UAX #38 (Unicode Han Database), 6 new provisional properties were added, 7 provisional properties were removed, the syntax of several properties was updated, and the description of several properties was improved.
For UAX #45 (U-Source Ideographs), records for 39 new ideographs were added to its data file, Section 3 was added, “ExtI” was added as a new status, two obsolete status values were removed, and four status values were improved.

🌻🌻🌻🌻🌻 SUPPORT UNICODE 🌻🌻🌻🌻🌻

Finally, if you are already a contributor — or member of Unicode (or your company or organization is), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we accomplish is only possible because of supporters like you.

Make your adoption today!

Thursday, September 14, 2023

Unicode CLDR v44 Alpha available for testing

The Unicode CLDR v44 Alpha is now available for integration testing.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:

Sep 27 — Beta (data)
Oct 04 — Beta2 (spec)
Nov 01 — Release

In CLDR 44, the focus is on:

Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang

v44 Level	Langs	Usage
Modern	95	Suitable for full UI internationalization
	čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate	13	Suitable for “document content” internationalization, e.g. in spreadsheets
	brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic	50	Suitable for locale selection, e.g. choice of language on a mobile phone
	asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.

Background

Unicode encompasses letters and symbols (over 149,000 in Unicode 15.1) from across the world’s writing systems, so it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫
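A quick way to see what is going on in such a string is to look at the character names. The Python sketch below is a rough heuristic for illustration only: it treats the first word of each letter's Unicode name as a stand-in for its script, which is an assumption of this example and not the detection mechanism defined in the Unicode specifications. The function names are hypothetical.

```python
import unicodedata


def rough_scripts(text: str) -> set[str]:
    """Approximate the scripts in use by the first word of each letter's name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def looks_mixed(text: str) -> bool:
    """Flag text whose letters appear to come from more than one script."""
    return len(rough_scripts(text)) > 1


# The second letter is written as an escape so the difference is visible here:
# U+0430 is CYRILLIC SMALL LETTER A, not LATIN SMALL LETTER A (U+0061).
spoofed = "p\u0430ypal.com"
genuine = "paypal.com"

spoofed_scripts = rough_scripts(spoofed)
genuine_scripts = rough_scripts(genuine)
```

Here the spoofed string reports both LATIN and CYRILLIC, while the genuine one reports only LATIN. A production check would rely on the Script and Script_Extensions properties and the confusables data instead of name prefixes.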

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that looks secure to human reviewers but actually contains hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.

Examples

Line-break spoofs can cause what appears to be a line of code to be actually commented out, as far as the compiler is concerned. This can happen with C11, for example:

To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler will interpret this as one line consisting only of a comment!
The “pаypal.com” above is an example of a confusable spoof.
As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
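As a rough illustration of the line-break case, the following Python sketch scans source text for characters that some tools display as a line break but that a compiler may not treat as one. The set of characters, the function name, and the sample source are all assumptions made for this example; a real checker would follow the guidance in UTS #55.

```python
# Characters that may be displayed as line breaks but are not "\n".
# This particular selection is an illustrative assumption.
UNUSUAL_LINE_BREAKS = {
    "\u000b": "LINE TABULATION",
    "\u000c": "FORM FEED",
    "\u0085": "NEXT LINE",
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}

def find_unusual_line_breaks(source):
    """Return (line, column, name) for each unusual line break found.

    Lines are counted by "\n" only, so a U+2028 inside a comment is
    reported on the same line as the comment that swallows it.
    """
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in UNUSUAL_LINE_BREAKS:
                hits.append((line_no, col, UNUSUAL_LINE_BREAKS[ch]))
    return hits

# Hypothetical source: a reviewer may see two lines, but everything up to
# the first "\n" is one comment line if U+2028 is not a line terminator.
sample = "// check the password\u2028grant_access();\nreturn 0;\n"
print(find_unusual_line_breaks(sample))
```

Running a check like this in code review tooling or a pre-commit hook is one possible way to surface such characters before they reach a compiler.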

Such code might not even be malicious. It is all too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs, or simply to write code that is very hard to understand. Here’s an example:

The text “Error: {0} {1}", message” becomes RTL in translation.


The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.

Process

In response to this problem, the Consortium started a project in early 2022 to put together a cross-functional group of experts in Unicode processing, programming languages, and software development tooling to address these problems. That project resulted in the Source Code Working Group (SCWG), which brought together a set of experts to work through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of the use of the important higher-level protocol HL4, and emphasized its use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. UAX #31 provided important guidance on profiles for default identifiers and clarified that the requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages and is relevant to issues of bidirectional ordering and potential spoofing attacks.

Impact

The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one new useful API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.
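To see why identifiers such as Aא1 and A1א need special handling, this small Python sketch lists the characters of each identifier in logical (stored) order together with their bidirectional classes. It is not ICU’s bidiSkeleton function and does not reproduce its behavior; the helper name logical_order is made up for illustration.

```python
import unicodedata

def logical_order(identifier):
    """Describe each character in stored order: code point, bidi class, name."""
    return [
        f"U+{ord(ch):04X} [{unicodedata.bidirectional(ch)}] {unicodedata.name(ch)}"
        for ch in identifier
    ]

first = "A\u05D01"   # A, HEBREW LETTER ALEF, 1
second = "A1\u05D0"  # A, 1, HEBREW LETTER ALEF

for name in (first, second):
    print(logical_order(name))

# Same characters, different order: two distinct identifiers to a compiler.
print(first == second, sorted(first) == sorted(second))
```

Both strings contain the same three characters, so only their stored order distinguishes them, and that order is exactly what bidirectional display can obscure.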

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!

Tuesday, September 12, 2023

Announcing The Unicode® Standard, Version 15.1

Version 15.1 of the Unicode Standard is now available. This minor version update includes updated code charts, data files and annexes. The core specification is unchanged from Unicode Version 15.0.

This version adds 627 characters, bringing the total number of characters to 149,813. The additions include 622 CJK unified ideographs in a new block, CJK Unified Ideographs Extension I. These new ideographs are urgently needed in China for use in public service databases, and are expected to be included in a forthcoming amendment to China’s GB 18030-2022 standard. The other new characters are five ideographic description characters that enhance the ability to describe rare or not-yet-encoded CJK ideographs.
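The counts above can be sanity-checked with a few lines of Python. The code point ranges used here are not given in this post; they are taken from my reading of the Unicode 15.1 code charts (CJK Unified Ideographs Extension I at U+2EBF0 to U+2EE5D, and the new ideographic description characters at U+2FFC to U+2FFF plus U+31EF), so treat them as assumptions.

```python
# Ranges assumed from the 15.1 code charts, not stated in the announcement.
EXTENSION_I = range(0x2EBF0, 0x2EE5D + 1)
NEW_IDCS = list(range(0x2FFC, 0x2FFF + 1)) + [0x31EF]

added = len(EXTENSION_I) + len(NEW_IDCS)
total_15_1 = 149_813             # total stated in the announcement
implied_15_0 = total_15_1 - added  # characters before this release

print(len(EXTENSION_I), len(NEW_IDCS), added, implied_15_0)
```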

There are six completely new emoji, including a phoenix, a lime, and (finally) an edible mushroom. For 108 people emoji, you can now switch the direction that they are facing (for example, person walking facing right versus facing left).

Security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms. These updates complement the release of a new Unicode Technical Standard, UTS #55, Unicode Source Code Handling.

The new characters are limited to three blocks, and the code charts for several other blocks have changed. The most significant change to charts is for the CJK Unified Ideographs, CJK Unified Ideographs Extension A and CJK Unified Ideographs Extension B blocks with the addition of representative glyphs and source references for over 24,000 KP-source (North Korea) ideographs. There are also many other glyph corrections and improvements—see the 15.1 delta code charts for details.

Significant updates have been made to UAX #14, Unicode Line Breaking Algorithm and UAX #29, Unicode Text Segmentation adding better support for scripts of South and Southeast Asia, including grapheme cluster support for aksaras and consonant conjuncts, and line breaking at orthographic syllable boundaries.

For complete details on Unicode Version 15.1, see https://www.unicode.org/versions/Unicode15.1.0/.

Friday, September 1, 2023

NEW Virtual Event - Open House on Script and Character Encoding

Registration is Now Open!

The Unicode Standard aims to make the scripts used to write the languages of the world accessible on computers and devices. However, the process of getting characters and scripts into the Unicode Standard has often been puzzling. How does one successfully propose a script or a handful of characters? How are decisions made?

Join us for a virtual Open House event, where you will be able to ask these (and other) script and character encoding questions to seasoned Unicode experts.

When: Tuesday, Oct 17, 2023 at 11am-12pm Pacific Time (California)

Supporting Resources

Documenting and Preserving Languages: A Talk on Character Encoding, Keyboards, and Fonts by Deborah Anderson and Andrew Glass
Scripts and Character Encoding by Deborah Anderson, Script Ad Hoc Group Chair
Other Script and Character Encoding-related talks on the Unicode YouTube Channel

Monday, August 21, 2023

Volunteer Spotlight!

Lorna Evans, SIL International

Lorna first became involved with Unicode in 2000 as a conference participant. Her enthusiasm led her to volunteer as a lecturer at a Unicode conference, and for the past several years as an active proposal contributor and committee member. Lorna’s heart and passion are to assist digitally disadvantaged communities by bringing their language fonts, characters, and sounds to the Unicode standard.

Lorna began her language journey typesetting Bibles in Ethiopia in 1990 and was fascinated by the multiple fonts and characters needing representation. When she heard that Unicode was moving to support Ethiopic characters, she had to get involved.

When asked what she is most proud of, Lorna said, “Anytime I do a proposal to Unicode, it feels like the most important thing (for that language community).” She thrives on research and feels with each proposal, she is bringing digital access to people who need it most. Lorna is completely self-taught and currently focused on documenting Arabic script. She describes SIL International, an Associate member of the Unicode Consortium and her current employer, as the former kings of creating custom encoded fonts and she is working diligently to help SIL transition to Unicode.

As for her time involved with other Unicode staff and volunteers, she has enjoyed the camaraderie and attending technical committee and editorial meetings. Lorna is an active member of the Script Ad Hoc Subcommittee, as well as the primary representative to Unicode for SIL International.

Lorna shared that she grew up in Bolivia and says that salteñas, a savory pastry filled with beef stew, is still her favorite food.

Editor’s Note: We appreciate and thank Lorna for taking time to tell us a little about herself as well as her years of contributions.

🌻🌻🌻🌻🌻 SUPPORT UNICODE 🌻🌻🌻🌻🌻

Finally, if you are already a contributor — or member of Unicode (or your company or organization is), thank you, Danke, děkujeme, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we accomplish is only possible because of supporters like you.

To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation.

As Unicode, Inc. is a US-based open source, open standards, non-profit 501(c)(3) organization, your contribution may be eligible for a tax deduction.

Please consult with a tax advisor for details.

Make your adoption today!

Tuesday, August 15, 2023

Unicode Consortium Board Votes to Elevate ICU4X to Technical Committee

Across the globe, people are using alternative ways to get online, such as smartphones, smart watches, and other compact devices. Formed as a Subcommittee of the ICU Technical Committee in 2020, ICU4X is a modular, lightweight, and secure library that brings internationalization to client-side and resource-constrained environments, written in Rust with bindings into many programming languages.

“We are all very excited about the ICU4X project. It dramatically expands the number of apps and systems that can easily deploy internationalization: with a smaller, modular footprint and the advantages of security and performance from Rust.” — Mark Davis, Co-founder and Board Chair

At its July meeting, the Board agreed that the ICU4X Subcommittee needed to have the authority to make technical decisions relating to the ICU4X architecture, structure, and coding, and voted that the ICU4X Subcommittee be elevated to the level of Technical Committee with the authority to make such decisions, effective immediately.

The ICU4X Technical Committee will be responsible for the design and implementation of the library, with the goal to ensure that mobile devices and other low-resource devices can access scalable internationalization services. It is particularly applicable for devices that cannot run full ICU (C/C++ and Java) — and for those in emerging markets with more limited resources and digitally disadvantaged languages.

Shane Carr (Google) serves as Chair of the ICU4X Technical Committee, with Zibi Braniecki (Amazon) and Nebojša Ćirić (Google) as Vice Chairs.

Congratulations to the ICU4X team!

Tuesday, August 8, 2023

Unicode Technology Workshop — Call for Submissions and Registration Open!

Save the Date! November 7-8, 2023. Bay Area (Hosted at Google)

About the Workshop

Join us in person for two days of community building around the Unicode technology that makes software work for billions of people. Expect two days of workshops, seminars, free-form discussions, and lightning talks centered around i18n libraries, locale data frameworks, globalization tooling, localization pipelines, input methods, and text rendering. Network with the developers and users to help shape the future of Unicode technology.

This is a new type of event for Unicode, with a focus on building more connections within the internationalization community. Expect to come away with deeper knowledge on how to solve tough problems in the i18n and l10n space and how to engineer products that work better for global users. GILT professionals, especially those who build or use Unicode technologies, are encouraged to attend and to host sessions. To encourage maximum collaboration amongst the attendees, this is an in-person-only event.

Call for Submissions

For those interested in participating in and contributing to the event, the call for submissions is now open. If you work on Unicode internationalization technologies or use Unicode internationalization technologies in your work, we want to hear from you. You can register your interest in contributing using the following link.

Call for Submissions

About the Unicode Consortium

The Unicode Consortium is the premier non-profit open source, open standards body for the internationalization of all software and services.

For more than 30 years, the Unicode Consortium has coordinated the efforts of a world-wide team of volunteer programmers and linguists to standardize, evolve, and maintain a global software foundation that allows virtually every computer system and service to help people connect using their native language.

For additional information about Unicode, visit home.unicode.org.
