Thursday, February 23, 2023

The Unicode CLDR v43 Alpha is now available for integration testing

[image] CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The Alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Data may change if release-blocking bugs are found. The planned schedule is:
  • 2023 Mar 15, Wed — public Beta (data)
  • 2023 Mar 29, Wed — public Beta2 (data & spec)
  • 2023 Apr 12, Wed — Release
CLDR 43 is a limited-submission release, focusing on just a few areas:
  1. Formatting Person Names
    • Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
  2. Adding substantially to the LikelySubtags data
    • This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance.
    • The data has been contributed by SIL.
  3. Other data updates
    • Alternate names for Turkey / Türkiye
    • Name for the new timezone Ciudad Juárez
  4. Structure
    • Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
    • Cleanup of the inheritance structure in CLDR
  5. Collation & Searching
    • Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.

To find out more about these and other changes, see the draft CLDR v43 release page, which has information on accessing the date, reviewing charts of the changes, and — importantly — Migration issues.


Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, February 7, 2023

Unicode 15.1 Alpha Review Opens for Feedback

[image] The repertoire for Unicode 15.1 is now open for early review and comment. As a reminder, during alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.

This early review is provided so that reviewers may consider the character repertoire and data file issues prior to the start of beta review (currently scheduled to start in May 2023). Once beta review begins, the repertoire, code points, and character names will all be locked down, and no longer be subject to changes.

Notable Changes

Unicode 15.1 adds exactly five characters, for a total of 149,191 characters. The five new characters are Ideographic Description Characters that are used in Ideographic Description Sequences, which represent a mechanism to visually describe the structure of ideographs.

In addition, the code charts for the CJK Unified Ideographs, CJK Unified Ideographs Extension A, and CJK Unified Ideographs Extension B blocks now include representative glyphs and source references for nearly 24,000 KP-source ideographs. Furthermore, the format of the code charts for the CJK Unified Ideographs block was updated to accommodate KP-source ideographs through the addition of a seventh column.

Version 15.1 does not add new emoji characters, however, 118 new RGI emoji ZWJ sequences will be defined.

Feedback for the alpha review should be reported under PRI #473 using the Unicode contact form by April 4, 2023.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Monday, February 6, 2023

Announcing New Unicode Adopt-a-Character Site

[image]
The Adopt-a-Character program was launched in 2015. Since that time, AAC funds have supported Unicode's mission to ensure everyone can communicate in their own language. This includes preserving historical scripts such as Egyptian hieroglyphics and providing better language support for digitally disadvantaged and under-resourced languages such as Hanifi Rohingya used in Myanmar and Bangladesh.

Now you can more easily adopt a character and show off your hobby or business, favorite sport, or love – while also supporting a good cause. You can also give the gift of a letter to someone in your life. The possibilities are endless – and each adoption helps Unicode’s goal to support the world’s languages.

All character adoptions are permanent. Adoption of a specific character at the limited gold and silver levels is on a first-come-first-served basis. All sponsors receive a digital badge and are recognized on Unicode’s website, Twitter feed, and Friends of Unicode Facebook page.

To start your adoption, visit our new page!

Unicode, Inc. is a non-profit, 501(c)3 organization and contributions may be eligible for a tax deduction. Please consult with a tax expert for details.



[badge]

Monday, January 23, 2023

New Unicode Consortium CEO

— Mark Davis, President & Unicode Cofounder


In January 1991 I became the first president of the Unicode Consortium, and in that position have presided over the board of directors since then. I’ve had the honor of occupying those roles for just over a gigasecond now, and it's time for a change.

Over time, it became apparent to me, the Consortium’s other officers, and the Board of Directors that our management model was no longer sufficient for what the organization had become over time, and what it needed to be in the future. So, we began to explore a new, more sustainable governance and management model. And an important part of that was succession planning

Among the first major steps in implementing this model was the hiring of Toral Cowieson as our first Executive Director and COO in 2021. Since then, Toral has helped professionalize the management of the Consortium. Working with the Board and the other Officers, Toral has also contributed to strengthening the Consortium’s governance.

The Board and I have also recognized that, as President, I have effectively occupied two distinct roles — CEO and CTO — and that these two different roles require the full attention of two different people. Accordingly, the Board has decided to split these two roles, formally creating the positions of CEO and CTO, while retiring the title of President.

And as its next step — I am delighted to announce — the Board has elected Toral Cowieson as CEO to replace me.

Toral has brought a wealth of experience in leadership across non-profits, corporations, and board service to Unicode. As executive director, she has connected with the people in the organization, provided thoughtful leadership, and instituted and guided changes in our operations and governance.

I’m not stepping off the stage completely. The Board has re-elected me as Chair of the Board, and elected me to the new position of CTO. I’ll also be continuing as chair of the CLDR technical committee as well as contributing to ICU and the UTC in focused areas.

The Unicode Consortium is the forum for companies, countries and other groups to work together on interoperable standards, code, and data — to support internationalizing software around the world. As a simple example, whenever you glance at the date on your cell phone, the text you see is Unicode characters, is formatted for your language according to CLDR language data (including for English), and uses ICU code libraries to make that all work.

As CTO, my main goal this year will be to work with the board, technical groups, and invited experts to continue maintaining and extending that foundation for so much of the world’s software, while formulating a strategy for meeting upcoming requirements and taking advantage of new technologies.

In addition, I am also pleased to announce some additional changes. I’ve worked extensively with each of these people, and have the fullest confidence that they will do great work in these new roles.

  • Peter Constable is a Technical Vice President and the Chair of the UTC. Since 2003, Peter has worked for Microsoft on various projects related to Unicode, internationalization, text display and fonts. He became a Unicode technical director in 2008 and later served as Treasurer.
  • Addison Phillips is the new Chair of the Message Formatting Working Group. Addison is also the chair of the W3C Internationalization Working Group and an active participant in the creation of internationalization standards such as Unicode. He and I are co-authors of IETF BCP 47, which is the standard for language and locale identifiers.
  • Elango Cheran is the Vice-Chair of the recently formed Community Engagement team and an internationalization engineer at Google. He actively contributes to the ICU and ICU4X projects, and to the MessageFormat Working Group.
Additional information available here:
Unicode Executive Officers
Unicode Fellows, Staff and Support
Unicode Technical Committee Chairs
Unicode Organization Chart

Photo by Michael Dziedzic on Unsplash


Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, January 17, 2023

What’s New in Emoji 15.1?

Doing more, with less

By: Jennifer Daniel, Chair of the Emoji Subcommittee

[image phoenix]

This past Fall, the Unicode Technical Committee announced the delay of Unicode 16.0. This wasn’t without precedent — COVID slowed down the release of Unicode 14.0 in 2020 and the world seemed to survive 😉. Subcommittees were well prepared and adjusted accordingly, discussing what this meant for their respective areas of expertise.

For the Emoji Subcommittee (ESC) — the group responsible for defining the rules, algorithms, and properties necessary to achieve interoperability between different platforms for those smiley faces that appear on your keyboard (Shout out 😁🥰🥹🤔🫣🫡😵‍💫!) — this delay presented an opportunity. Sure, we were so close to exhaling a sigh of relief (the intake period for Emoji 16.0 proposals had just completed). But upon learning we couldn’t ship any new codepoints until 2024 we turned our energy towards recommending new emoji based on existing ones. (These are called emoji ZWJ sequences. That's when a combination of multiple emoji display as a single emoji … like 👩 🏽 +🏭 = 🧑🏽‍🏭).

When Less is More

An incredibly powerful aspect of written language is that it consists of a finite number of characters that can "do it all". And yet, as the emoji ecosystem has matured over time our keyboards have ballooned and emoji categories are about to hit or have hit a level of saturation. Upon reflecting on how emoji are used, the ESC has entered a new era where the primary way for emoji to move forward is not merely to add more of them to the Unicode Standard. Instead, the ESC approves fewer and fewer emoji proposals every year.

But our work is not done. Not by a longshot. Language is fluid and doesn’t stand still. There is more to do! This “off-cycle” gives us a chance to address some long-standing major pain points using emoji. The first one that came to mind: skin-tone.

What is a family?

The encoding of multi-person multi-tone support has matured over the years; however, the implementation can seem random to the average person: While it’s true, all people emoji have toned options (with the exception of characters where you can’t see skin like 🤺) there are … misfits. Some two people emoji offer tone support ( 🧑🏻‍❤️‍🧑🏿) others do not ( 👯). A few non RGI emoji render with tone but with no affordance to change one of the two characters (For example, 🤼🏾‍♂).

And then … There is the suite of family emoji (👨‍👦👨‍👦‍👦👨‍👧👨‍👧‍👦👨‍👧‍👧👩‍👦👩‍👦‍👦👩‍👧👩‍👧‍👦👩‍👧‍👧 👨‍👨‍👦👨‍👨‍👦‍👦👨‍👨‍👧👨‍👨‍👧‍👦👨‍👨‍👧‍👧👩‍👩‍👦👩‍👩‍👦‍👦👩‍👩‍👧👩‍👩‍👧‍👦👩‍👩‍👧‍👧👨‍👩‍👦👨‍👩‍👦‍👦👨‍👩‍👧👨‍👩‍👧‍👦👨‍👩‍👧‍👧👪). These characters include two people, three people, sometimes four and none of them have any tone support (!). We seem to have a lot of family emoji and yet simultaneously not enough.

The 26 “family” emoji can be broken down into four groups:

[image families]

Despite the Unicode Standard containing 26 “family” emoji, each one of these glyphs is overly prescriptive with regard to delivering on a visual representation of a family. The inclusion of many permutations of families was well intentioned. But we can’t list them all, and by listing some of the combinations, it calls attention to the ones that are excluded.

What even is a family? For some, family is the people you were raised with. Others have embraced friends as their chosen family. Some families have children, other families have pets. There are multi-generational families, mutli-racial families and of course many families are any combination of all of these characteristics and more.

Fortunately, we don’t need to add 7000 variants to your keyboards (even this would fall short of capturing the breadth of "family" as a concept). Instead we can juxtapose individual emoji together to capture a concept with some reasonable level of specificity — not too unlike arranging letters together to create words to convey concepts 😉

[image toned families]

For emoji keyboards to advance in creating more intuitive and personalized experiences the Emoji Subcommittee is recommending a visual deprecation of the family emoji. This small set of emoji will be redesigned as part of a multi-phase effort to “complete the set” of toned variants for the remaining multi-person emoji. This of course begs the question: when there are as many families as there are people in the world, is there an effective way at conveying the concept of “family” without being overly prescriptive in defining what is and is not a family? Well, thankfully icons can do a lot of heavy lifting without requiring very much detail.
[image before-after]

When is an emoji running for the police or getting chased by them?

Another area the ESC is actively exploring is how the semantics of emoji sequences can differ when writing directionality changes. Some emoji characters have semantics that encode implicit directionality but when the string is mirrored and their meaning may be unintentionally lost or changed.

[image rightwards]
Left to Right Emoji Sequence
Quickly running towards an “exciting” police chase


[image leftwards]
Right to Left Emoji Sequence
Running away from the coppers


What, if anything, can we do to aid in ensuring that messages are meaningfully translated be them tiny pictures or tiny letters? As part of 15.1 we’re proposing a small set of emoji with strong directionality — with an initial focus on people — to face the opposite direction. Soon you too can run towards or away from … excitement.

Emoji 15.1

Given that the intake cycle of emoji proposals for Unicode 16.0 ended last July, the Emoji Subcommittee has also decided to temporarily delay the intake of Unicode Version 17.0 proposals until April 2024. Fortunately, you won’t have to wait until then to get new emoji. Among the list of recommendations includes 578 characters (most of them the candidates described above to support directionality). The list also includes a few humble additions including a broken chain, a lime, a non-poisonous mushroom, a nodding and shaking face, and a phoenix bird. Each one of these leverages a unique valid ZWJ sequence of emoji so while they look like atomic characters made of a single codepoint they are composed of two or more codepoints.

[image candidates]

Broken chain is the result of 🔗💥, with a variety of meanings, such as freedom, breaking a cycle, or perhaps a broken url ;-). Like the bi-directional emoji touched on above, nodding face and shaking face are the result of 🙂↔️and 🙂↕️ respectively. Oh, and of course there is a phoenix rising from the ashes (🐦🔥), a perfect metaphor to capture where we are today.

The Unicode Technical Committee (UTC) will review the required documents at its first meeting of 2023 in January – and if these candidates move forward, you can expect an update from the UTC later this Spring and Summer.


Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Wednesday, December 21, 2022

Unicode in 2022

2022 Image

Hello Everyone!

As we go into the New Year, the Unicode team thought we’d share some highlights from this past year. From source-code spoofing to preserving indigenous languages, the Unicode team has had another full year, including expanding the number of characters that appear on billions of devices around the world.


Nearly 150,000 characters!

On the character side, we reached a total of just shy of 150,000 characters (149,186 to be exact). Of the 4,489 characters added in the 15.0 release, the biggest set was 4,192 ideographs for use in Chinese, Japanese, and Korean. There are also two new scripts, Nag Mundari and Kawi. Nag Mundari is a script used to write the Mundari language of India, a language with 1.1 million speakers. Kawi is an important historic script of insular Southeast Asia, found in inscriptions and on artifacts in several languages dating from the 8th to the 16th centuries — and is undergoing a revival today amongst enthusiasts.

And we can’t forget the 20 new emoji characters — we’re looking forward to seeing which are the most popular: shaking face? Goose? Maracas? Pink heart? If you’re involved in implementing emoji, you’ll also want to look at latest changes in UTS #51 Unicode Emoji.

See the Unicode15.0.0 page for more details. We’re also changing how we do releases — for more, see 2023 Release Planning.

The Launch of ICU4X

ICU is used in every major device and operating system; it’s how you see a date or number on your phone, for example. This new project, ICU4X, was created to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments and across many programming languages. After 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution. For details, see Announcing ICU4X 1.0.

When does i ≠ і?

Can you tell the difference between i and і? Yeah, most people can’t. The first set of changes to help counter source-code spoofing were included in the 15.0 versions of the UAX #9 Unicode Bidirectional Algorithm, UAX #31 Unicode Identifier and Pattern Syntax, and UTS #39 Unicode Security Mechanisms.

For 2023, there is a new draft UTS #55 Unicode Source Code Handling, providing guidance for programming language designers and tooling developers, and specifying mechanisms to avoid usability and security issues arising from improper handling of Unicode. More changes are on their way for UAX #9, UAX #31, and UTS #39 as well.

Åge Møller, Πέτρος Νικόλαος Καρατζής, à®°ாஜேந்திà®° சோழன்

We’re making great progress on internationalized formatting of people’s names. What does that mean? Software needs to be able to format people's names, such as John Smith or 宮崎駿. The formatting can be surprisingly complicated: for example, people may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

There are many more complexities — for more details, see Formatting people’s names.

You have 2 unread messages.

Or, you have 3 items in your cart. Whenever a computer needs to construct a sentence using “placeholders” such as 3, it is formatting a message. The current industry standard is ICU’s message formatting; a project started about 3 years ago, with the goal of improving on that to build a more robust and extensible mechanism. There is now a Tech Preview in ICU — we’d urge developers to try it out!

See message-format-wg for details on the syntax and message2/package-summary.html for the API (note that the ICU’s convention for tech previews is to mark as Deprecated), and the test code in MessageFormat2Test.java for examples of usage.

(There are of course other fixes, upgrades and new features in ICU: see ICU 72 and ICU 71 for more details.)

Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ

In CLDR, we now have 95 languages at the Modern level (suitable for full UI internationalization), 6 at the Moderate level (suitable for “document content” internationalization), and 29 at the Basic level (suitable for locale selection). We added a tech preview of formatting for person names, plus additions for Unicode 15.0 (emoji names and search keywords), names for new scripts, new CJK collation, and so on. For more information, see CLDR v42.

Revitalization and Preservation of Indigenous Languages

The Nattilik language community was unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging. The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada. By collaborating with Nattilik language keepers and elders in the community, key issues the Nattilik community of Western Nunavut faced were identified, and it was discovered that there were 12 missing syllabic characters from the Unicode Standard. The Consortium worked with the Typotheque Syllabics Project to add 16 characters to the script to support Nattilik and other languages in Unicode version 14.0, and improved the glyphs in Unicode version 15.0. See this blog post from June.

The Past and Future of Flag Emoji

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. The addition of other flags and thousands of valid sequences into the Unicode Standard has not resulted in wider adoption. They don’t stand still, are constantly evolving, and due to the open-ended nature of flags, the addition of one creates exclusivity at the expense of others. Curious to learn more? Read more about the Past and Future of Flag Emoji.

Available Now! New YouTube Playlist and Technical Quick Start Guide

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. Unicode technical leadership and other experts shared background on our core projects with participants from more than 30 countries. If you missed the webinar, no worries! The recorded sessions are available on this YouTube playlist. And if you are new to Unicode and internationalization or simply want a refresh, you can also check out our Technical Quick Start Guide. This handy guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There are also useful links to more detailed information and how you can get involved. Read more here.

Support Unicode 💞💕💌💯✨🌟🤠🛟🎁

Finally, if you are already a contributor to — or member of Unicode (or your company or organization is!), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்à®±ி, and gracias! What we have accomplished is only possible because of supporters like you.

And if you want to support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode is a US-based non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Monday, November 14, 2022

The Unicode® Standard – 2023 Release Planning

By Peter Constable, Chair of the Unicode Technical Committee

[image] At the Q4 Unicode Technical Committee (UTC) meeting held from November 1-3, our member representatives unanimously agreed to a release plan for 2023 and tentative plan for 2024. Along with some tooling updates, our plans aim to ensure that we are more agile to meet the evolving internationalization landscape and better able to meet the needs of Unicode members and other consumers of the Standard.

More information can be found in the Release Management Group’s Recommendations for 2023-2024.

BACKGROUND

For several years now, the UTC has worked on an annual cycle for new versions of The Unicode Standard and related specifications. New versions used to be released in March of each year, but in 2021, due to COVID-19, the release was delayed until September. 

MOVING FORWARD

Going forward, our plan is to continue with a new release each year in September. That annual, predictable cycle works well for Unicode's other major projects—CLDR and ICU—and helps implementers in their planning. 

In 2023, we will keep up that cadence with a September release, but we also need to take some time to evaluate and update our processes for developing each new version of the Standard.

Therefore, the 2023 release will be a “dot” release: Unicode 15.1. It will include important updates to Unicode Standard Annexes and to the Unicode Character Database, and have a limited set of new characters — but new scripts and most other character additions will be held until the 2024 release. A major new area is the planned release of a Unicode Technical Standard for avoiding source-code spoofing, along with associated changes in other specifications.

Regarding emoji, if there are any new emoji in the 15.1 release, they would leverage existing code points, as was done for the 13.1 release, rather than the addition of entirely new characters.

2024 AND BEYOND

For 2024, we anticipate returning to our regular cadence, with a major release in September 2024. Unicode 16.0 will include additional new scripts, emoji and other characters, as well as other updates.



Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!
[badge]

Tuesday, November 8, 2022

Available Now! New YouTube Playlist and Technical Quick Start Guide


Youtube Image

By Elango Cheran

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. More than 180 people across 30 countries joined us for this online event.

The Consortium is pleased to now make available the videos from this event. If you are new to Unicode and internationalization or want an overview of the most recent projects, check out our new YouTube playlist and Technical Quick Start Guide.

Our Technical Leadership and other experts provide a handy overview on such topics as:
Also included in the playlist is the audience Q&A with Elango Charan, the webinar’s emcee,  and Mark Davis, Cofounder and President of Unicode.

The Unicode Technical Quick Start Guide is also now available. The guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.”  There is also an overview of the technical committees and useful links to more detailed information and how you can get involved.



Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!
[badge]