The Unicode Blog: UTC

Tuesday, August 27, 2024

Highlights from Unicode Technical Meeting #180

by Peter Constable, UTC Chair

Unicode Technical Committee (UTC) meeting #180 was held July 23 – 25 in Redmond, Washington, hosted by Microsoft. Here are some highlights.

Finalizing Unicode 16.0

One priority was to finalize technical decisions for Unicode 16.0 in preparation for a September 10 release. Beta feedback and a small number of new proposals were considered and various decisions affecting 16.0 were taken. Regarding the set of encoded characters and emoji sequences for Unicode 16.0, no changes were made from the Beta.

Unicode 16.0 will include major additions and improvements for Egyptian Hieroglyphs, most of which were already included in the Beta. One aspect of the improvements is a refinement in the encoding model for rotational variants using variation sequences. Since the Beta, it was recognized that ten of the Egyptian Hieroglyphs encoded as characters in Unicode 5.2 would be better represented using rotational variation sequences. This led to some new UTC decisions affecting the 16.0 release:

Ten standardized variation sequences for Egyptian Hieroglyph rotational variants were added, while one standardized variation sequence that had been added in Unicode 15.0 was rescinded.
In the Unikemet.txt data file with Egyptian Hieroglyph properties, the kEH_Core property has been changed from a binary property to having an enumeration of values, one of which is “L(egacy)” indicating characters encoded in Unicode 5.2 that are not part of the core set and are not expected to be supported in fonts.
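For implementers, the practical consequence of the kEH_Core change is that the property can no longer be read as a yes/no flag; its value has to be compared. The sketch below is only illustrative: it assumes Unikemet.txt uses a Unihan-style layout of three tab-separated fields (code point, property name, value), and the sample lines, code points, and the "C" value are hypothetical placeholders rather than real data. Only the "L" value is taken from the description above.

```python
# Hypothetical sample in an assumed Unihan-style layout; not real Unikemet.txt content.
SAMPLE = (
    "# code point <TAB> property <TAB> value\n"
    "U+13000\tkEH_Core\tC\n"
    "U+13001\tkEH_Desc\tplaceholder description\n"
    "U+1300A\tkEH_Core\tL\n"
)


def read_property(text, prop):
    """Return {code point: value} for one property in Unihan-style data."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, name, value = line.split("\t", 2)
        if name == prop:
            values[int(code[2:], 16)] = value  # drop the "U+" prefix
    return values


core = read_property(SAMPLE, "kEH_Core")
# With an enumerated property, test the value rather than mere presence.
legacy = sorted(cp for cp, value in core.items() if value == "L")
print([f"U+{cp:04X}" for cp in legacy])
```

Code written against the old binary property would treat every listed character the same way, which is exactly what the new "L" value is meant to prevent.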

Another significant change affecting the 16.0 release is a glyph change for U+0620 ARABIC LETTER KASHMIRI YEH, and a change to its joining group in ArabicShaping.txt (180-C23, 180-C24). This affects not only the glyph shown in the code chart, but also the positional forms shown in the Arabic section of the core spec. The need for this arose from incorrect information in the core spec resulting in fonts that don’t provide a final form that matches users’ expectations. See L2/24-152 for background details.

While no further changes were made to the set of emoji in Unicode 16.0, a change will be made in how emoji characters are displayed in the code charts. The technology used to produce the chart pages is not able to display full-color emoji, and up to now the code charts have not made it clear when pictographic symbols have the Emoji property. In Unicode 16.0, characters with the Emoji property will be indicated in the code charts with a small triangular badge in the top left corner of the cell. A white triangle will indicate an emoji character that should have default emoji (full color) presentation:

A black triangle will indicate an emoji character that should have default text (monochrome) presentation:

The script descriptions in the core spec are used to provide background information on each script as well as information to guide implementations. For many scripts, it has been a challenge to provide comprehensive guidance for implementations, particularly when there are complex rendering requirements. However, some implementers have written Unicode Technical Notes providing guidance for implementation of a particular script. Although these are not normative specifications approved by UTC, they can still be valuable information conducive to interoperable implementations. For Unicode 16.0, UTC decided to have two existing UTNs referenced within the core spec sections for the respective scripts:

As mentioned for the Beta, the core spec for Unicode 16.0 will be published as per-chapter HTML pages.

Characters for future versions

At UTC #180, code points were provisionally assigned for 1,063 new characters, including 38 Arabic characters, 45 characters for phonetic transcription, and 965 ideographs and radicals for Jurchen script. With these characters in the pipeline, work can get started on property data, charts, and other content that will be needed for them to be encoded in a future version of the standard.

Some initial decisions were also taken on the character additions for Unicode 17.0: as IRG had finalized its recommendations for CJK Unified Ideographs Extension J, that block of 4,300 new ideographs has been approved for encoding in Unicode 17.0.

Also, a proposal was approved to disunify one existing CJK unified ideograph character, U+5CC0, in Unicode 17.0. When U+5CC0 was encoded in Unicode 1.1, it was deemed that two similar ideographs should be unified. The proposal demonstrated that this unification should not have been made, and that was confirmed earlier this year by IRG. The changes for 17.0 will include encoding of a new character, U+2B73A, and revision of the source references for U+5CC0, U+2B73A and U+2F879. A complication in this case is that ideographic variation sequences for the two distinct glyphs have been registered for use in Japan. No changes in “J” source references will be made, and it is not expected that implementations for Japanese will be affected. For additional information, see section 7 of L2/24-165.

Variation sequences and historic scripts

People working with historic scripts often deal with glyph variations. Variation sequences seem like an appropriate encoding mechanism to use in such cases, though asking UTC to standardize variation sequences for many historic variations could seem like a challenge. With that in mind, a proposal was presented to encode a block of additional "user-defined" variation selector characters. These would be additional PUA characters with a constraint that they would only be used as variation selectors.

That proposed solution is problematic: existing stability policies and commitments prevent assigning more PUA code points and also prevent constraining existing PUA for certain uses. At the same time, there was opinion within UTC that the need expressed was reasonable, and there was openness to considering alternative solutions. One potential alternative that gained some interest was to establish a registration process, similar to what is defined in UTS #37 for ideographic variation sequences but intended for use with historic scripts.

For complete details on outcomes from UTC #180, see the draft minutes.

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Thursday, May 2, 2024

Unicode Technical Committee (UTC) Updates from Meeting #179

by Peter Constable, UTC Chair

The Unicode Technical Committee (UTC) met last week (April 23 to 25) in San Jose, California. Thanks to Unicode member company Adobe for hosting. Here are some highlights from the large number of items that were covered.

Preparing Unicode 16.0 Beta

An important objective was to cover all technical decisions that would be needed for the Unicode 16.0 Beta preview. The Beta will be available for public review and comment on May 21, 2024, and will include all charts, data and annexes for The Unicode Standard as well as other synchronized standards, including UTS 10, Unicode Collation Algorithm, and UTS 51, Unicode Emoji. Also, for the first time, the Beta release will include a complete draft of the core text of the standard.

The character repertoire for Unicode 16.0 was slightly adjusted, with the removal of two characters: U+0CDC KANNADA ARCHAIC SHRII and U+0C5C TELUGU ARCHAIC SHRII. These characters were first approved in January 2022 (UTC #170) and assigned for addition in Unicode 16.0 in April 2023 (UTC #175). However, in the ISO process for Amendment 2 of ISO/IEC 10646:2022 (which is to be synchronized with Unicode 16.0), the India national body requested more time for review by experts in India. To avoid a risk of Unicode 16.0 and Amendment 2 of 10646 not being in sync, UTC decided to delay these two characters for a later version.

Various character property (UCD) and algorithm changes were made based on issues reported during the Alpha review or found while the UTC Properties and Algorithms Working Group prepared data files for 16.0. Two notable areas for changes are grapheme cluster segmentation (UAX #29) and line breaking (UAX #14):

For grapheme clusters, some changes will be made to extended grapheme cluster segmentation for improved handling of orthographic syllables in Indic scripts.

For line breaking, several changes will be made to data and rules to fix various edge cases, and to incorporate behaviour for hyphens that has already been implemented in CLDR and ICU for several years.

Also related to properties, the organization of the ScriptExtensions.txt file will be changing. Previously, lines of data were grouped by characters that had the same script extension property values. Going forward, lines will be ordered by code point. (This is only a change in the order the data is listed; the parsing of lines is unchanged.) This will make it much easier to compare changes in property values between different Unicode versions.
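To illustrate why the reordering does not affect parsers, here is a minimal sketch that reads ScriptExtensions.txt-style lines (a code point or range, a semicolon, a space-separated list of script codes, and an optional # comment) into a mapping keyed by code point. The function names and the sample lines are illustrative assumptions, not real UCD data or official tooling.

```python
def parse_script_extensions(text):
    """Map each code point to the set of script codes listed for it."""
    result = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        cp_field, scripts_field = (part.strip() for part in line.split(";", 1))
        if ".." in cp_field:
            first, last = (int(x, 16) for x in cp_field.split(".."))
        else:
            first = last = int(cp_field, 16)
        scripts = frozenset(scripts_field.split())
        for cp in range(first, last + 1):
            result[cp] = scripts
    return result


def diff_versions(old, new):
    """List (code point, old value, new value) for every difference."""
    return [
        (cp, old.get(cp), new.get(cp))
        for cp in sorted(old.keys() | new.keys())
        if old.get(cp) != new.get(cp)
    ]


# Illustrative lines only: the same data grouped by value, then by code point.
GROUPED = "1CD1 ; Deva # one group\n0951 ; Beng Deva\n1CD0 ; Beng Deva\n"
ORDERED = "0951 ; Beng Deva\n1CD0 ; Beng Deva\n1CD1 ; Deva\n"

same = parse_script_extensions(GROUPED) == parse_script_extensions(ORDERED)
print(same)
```

Because the parsed result is keyed by code point, both orderings produce the same mapping. The benefit of the new ordering is for people reading a plain text diff of the file, where a changed value now stays on the same line instead of moving between groups.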

In relation to emoji, the set of new emoji for version 16.0 is unchanged. During the Beta review, the draft update for UTS #51, Unicode Emoji, will include some proposed revisions related to recommendations for display of emoji family combinations. These revisions have not yet been reviewed and approved by UTC, so will require careful review and will be subject to confirmation or change at the next UTC meeting, after the Beta review period is over.

UTC action item backlog

UTC has had a growing backlog of open action items, some over ten years old. For this meeting, the various UTC working groups triaged their action items that were five or more years old, and outcomes were discussed by the UTC. Some action items were completed; some were closed as no longer relevant. Many that required more research were closed as UTC action items and replaced by issues in the relevant working group’s GitHub repo. Note that tracking them in this other way doesn’t necessarily mean they will get higher priority. However, since the working groups are using GitHub issues to organize their regular work, this should bring more attention to these issues. UTC will repeat this process at UTC #181, six months from now.

As a side effect of this review of old action items, a document was submitted to UTC (L2/24-123) proposing that UTC transition from the way it has handled action items in the past to tracking issues in a public GitHub repo to allow contributions from a broader set of volunteers. That document identifies some problems and limitations of the existing processes, and suggests that a new process could provide improvements. UTC spent some time discussing this document. It was noted that the idea was valuable, though such a change in processes would not be a small change and would involve some not-so-obvious challenges. It would also be something that affects the Unicode Consortium as a whole, not just UTC. For that reason, this proposal will need to be considered as part of a broader discussion of Consortium processes, resources and infrastructure.

New investigation: automatic space handling at inter-script boundaries

East Asian text often combines different scripts, and a common typographic practice is to insert space between script runs. UTC briefly discussed a new document, L2/24-057, which proposes development of an algorithm for automatic spacing between script runs. The Properties and Algorithms Working Group has assembled experts to discuss this topic. Interested experts are invited to participate in discussion via issues (with "auto-spacing" label) in the public unicodetools repo in GitHub.

Monday, November 13, 2023

UTC #177 Highlights

by Peter Constable, UTC Chair

Unicode Technical Committee (UTC) meeting #177 was held November 1 to 3 in Cupertino, California, hosted by Apple. Here are some highlights from the meeting.

Starting the Unicode 16.0 cycle

UTC approved a plan and timeline for the Unicode 16.0 release. Here’s a summary of the timeline:

January 2024: UTC #178 will finalize content for the alpha release
February – March: alpha release for public review
April: UTC #179 will finalize content for the beta release
May – June: beta release for public review
July: UTC #180 will finalize 16.0 content
September: Unicode 16.0 release

UTC is still adjusting to changes in how work for each release is managed. So, while this will be a “full” release, UTC will be conservative about taking on too many changes, particularly to algorithm specifications (UAXes, UTSes). Also, a new format for the core text will be used in this release: instead of PDF, it will be published using Web technologies (HTML, etc.). To get early validation on format changes, the alpha release will include a sampling of content from the core text.

Unicode 16.0 character and emoji repertoire

UTC had previously approved 1,179 characters for encoding in Unicode 16.0. At this UTC meeting, 15 additional characters were approved for version 16.0, including seven emoji characters. UTC has been planning to include nearly 4,000 additional Egyptian Hieroglyphs in Unicode 16.0. The proposal was discussed, and a small revision was requested. It’s expected these will be approved for Unicode 16.0 at the next UTC meeting. Apart from the additional hieroglyphs, we expect no further characters will be added to the Unicode 16.0 repertoire.

Besides the characters approved for Unicode 16.0, code points were provisionally assigned for 184 new characters that are candidates for encoding in a future Unicode version.

See the Pipeline page for all characters currently approved for Unicode 16.0, along with code points provisionally assigned for future encoding.

Future of UAX #42, UCD in XML

UAX #42, Unicode Character Database in XML (UCDXML), was originally developed by Eric Muller. He and Laurentiu Iancu maintained UCDXML through many versions, and we’re very grateful for this contribution. Eric and Laurentiu are no longer available to maintain this, however, and no others have volunteered to take over maintenance. After discussion over several months in UTC and in the Properties and Algorithms working group, UTC has concluded the best option for the future of UAX #42 is to stabilize it, with data frozen at Unicode 15.1. A Public Review Issue will be posted to get feedback on this plan.

Future maintenance of UCS repertoire

UTC discussed a proposal for ISO/IEC JTC 1/SC 2 to adopt a different process for future maintenance of the repertoire of ISO/IEC 10646, using a maintenance agency rather than the process that is used for developing entirely new standards, as done in the past. It was felt that this would be more agile and would align better with how expert input has guided actual encoding decisions for several years now. This proposal will be formally submitted to JTC 1/SC 2 as a proposal from the US national standards body.

Full details on these and other outcomes are provided in the minutes—see L2/23-231.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Tuesday, May 2, 2023

UTC #175 Highlights

by Peter Constable, UTC Chair

We had another productive Unicode Technical Committee (UTC) meeting last week, hosted at Adobe headquarters in downtown San Jose, California. Here are some highlights from the meeting.

Unicode 15.1 Beta

UTC has authorized the Beta release for Unicode 15.1. There were various relatively minor technical changes to be made based on feedback during the Alpha review period, plus one major change that I’ll describe below. The Beta is scheduled for release on May 23, for a six-week public review period ending July 4. That closing date will provide time for working groups to review feedback and provide recommendations for the next UTC meeting, July 25 – 27.

CJK Extension I & GB 18030

A major change for Unicode 15.1 that was decided on was to encode 603 characters in a new CJK Unified Ideographs Extension I block. (See L2/23-106.) This was part of long discussions about GB 18030-2022 and Amendment 1 of that standard which China is currently developing. China has an urgent need for these characters, and the draft of their amendment has them allocated in reserved code positions of Unicode and ISO/IEC 10646, which is not viable from the perspective of the international standards. So, UTC has taken initiative to have China's need accommodated in a standards-conforming manner.

There was discussion as to whether the new characters should be added to Unicode 15.1 or to Unicode 16.0: it was generally preferred to wait for 16.0, but 15.1 was tentatively chosen in case that makes a significant difference for China’s process.

UTC recommended the addition of CJK Extension I to the INCITS/CS&I committee (the mirror committee for JTC 1/SC 2, which also met last week), which agreed to recommend to SC 2 the addition of that block in Amendment 2 of ISO/IEC 10646. See L2/23-114 and L2/23-115 for more information.

Orthographic syllable support in UAX #14

Another significant addition for Unicode 15.1 is that UTC approved extending UAX #14 Unicode Line Breaking Algorithm to support breaking of various South and Southeast Asian scripts at orthographic syllable boundaries. The algorithm for this is based on a proposal from Norbert Lindenberg and others (see L2/22-086), with details for incorporation into UAX #14 provided by Robin Leroy (see L2/23-072). A prototype implementation had been created as a public review issue (see PRI #472), and feedback had been positive. This will be a very significant enhancement in Unicode 15.1 providing important improvements in support for several South and Southeast Asian scripts.

Unicode display in text terminals

A new UTC project was initiated at this meeting to develop specifications for supporting display of scripts that require complex shaping in text terminals. This was introduced with a presentation by Renzhi Li and Dustin Howett of Microsoft (see L2/23-107). Even though the majority of computing device usage today is via GUIs, text terminals are still used in many scenarios. Thus, there was considerable interest among UTC participants in this proposal. An ad-hoc working group, chaired by Dustin Howett, will be formed to develop specifications. If interested in participating, let me know and I’ll connect you with Dustin.

Full details on these and other outcomes will be provided in the draft minutes that will be available soon (as L2/23-076 in the document registry).

Tuesday, January 17, 2023

What’s New in Emoji 15.1?

Doing more, with less

By: Jennifer Daniel, Chair of the Emoji Subcommittee

[image phoenix]

This past Fall, the Unicode Technical Committee announced the delay of Unicode 16.0. This wasn’t without precedent — COVID slowed down the release of Unicode 14.0 in 2020 and the world seemed to survive 😉. Subcommittees were well prepared and adjusted accordingly, discussing what this meant for their respective areas of expertise.

For the Emoji Subcommittee (ESC) — the group responsible for defining the rules, algorithms, and properties necessary to achieve interoperability between different platforms for those smiley faces that appear on your keyboard (Shout out 😁🥰🥹🤔🫣🫡😵‍💫!) — this delay presented an opportunity. Sure, we were so close to exhaling a sigh of relief (the intake period for Emoji 16.0 proposals had just completed). But upon learning we couldn’t ship any new codepoints until 2024 we turned our energy towards recommending new emoji based on existing ones. (These are called emoji ZWJ sequences. That's when a combination of multiple emoji display as a single emoji … like 👩 🏽 +🏭 = 🧑🏽‍🏭).
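As a rough illustration of how such a sequence is put together, the sketch below builds the factory worker from the example above out of its parts: a person, a skin tone modifier, U+200D ZERO WIDTH JOINER, and a factory. The variable names are just for illustration, and whether the result displays as a single glyph depends on the font and platform.

```python
import unicodedata

ZWJ = "\u200D"                    # ZERO WIDTH JOINER
person = "\U0001F9D1"             # 🧑
medium_skin_tone = "\U0001F3FD"   # 🏽 (emoji modifier)
factory = "\U0001F3ED"            # 🏭

# The modifier follows the person directly; ZWJ then joins the two emoji.
factory_worker = person + medium_skin_tone + ZWJ + factory

print(factory_worker)       # one glyph where the sequence is supported
print(len(factory_worker))  # number of code points in the string
for ch in factory_worker:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<name unknown to this Python>"))
```

On a system that does not support the sequence, the same string typically falls back to showing the person and the factory as separate emoji, so the message still degrades to something readable.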

When Less is More

An incredibly powerful aspect of written language is that it consists of a finite number of characters that can "do it all". And yet, as the emoji ecosystem has matured over time our keyboards have ballooned and emoji categories are about to hit or have hit a level of saturation. Upon reflecting on how emoji are used, the ESC has entered a new era where the primary way for emoji to move forward is not merely to add more of them to the Unicode Standard. Instead, the ESC approves fewer and fewer emoji proposals every year.

But our work is not done. Not by a long shot. Language is fluid and doesn’t stand still. There is more to do! This “off-cycle” gives us a chance to address some long-standing major pain points using emoji. The first one that came to mind: skin-tone.

What is a family?

The encoding of multi-person multi-tone support has matured over the years; however, the implementation can seem random to the average person. While it’s true that all people emoji have toned options (with the exception of characters where you can’t see skin, like 🤺), there are … misfits. Some two-person emoji offer tone support ( 🧑🏻‍❤️‍🧑🏿); others do not ( 👯). A few non-RGI emoji render with tone but with no affordance to change one of the two characters (for example, 🤼🏾‍♂).

And then … There is the suite of family emoji (👨‍👦👨‍👦‍👦👨‍👧👨‍👧‍👦👨‍👧‍👧👩‍👦👩‍👦‍👦👩‍👧👩‍👧‍👦👩‍👧‍👧 👨‍👨‍👦👨‍👨‍👦‍👦👨‍👨‍👧👨‍👨‍👧‍👦👨‍👨‍👧‍👧👩‍👩‍👦👩‍👩‍👦‍👦👩‍👩‍👧👩‍👩‍👧‍👦👩‍👩‍👧‍👧👨‍👩‍👦👨‍👩‍👦‍👦👨‍👩‍👧👨‍👩‍👧‍👦👨‍👩‍👧‍👧👪). These characters include two people, three people, sometimes four and none of them have any tone support (!). We seem to have a lot of family emoji and yet simultaneously not enough.

The 26 “family” emoji can be broken down into four groups:

[image families]

Despite the Unicode Standard containing 26 “family” emoji, each one of these glyphs is overly prescriptive with regard to delivering on a visual representation of a family. The inclusion of many permutations of families was well intentioned. But we can’t list them all, and by listing some of the combinations, it calls attention to the ones that are excluded.

What even is a family? For some, family is the people you were raised with. Others have embraced friends as their chosen family. Some families have children, other families have pets. There are multi-generational families, multi-racial families and of course many families are any combination of all of these characteristics and more.

Fortunately, we don’t need to add 7000 variants to your keyboards (even this would fall short of capturing the breadth of "family" as a concept). Instead we can juxtapose individual emoji together to capture a concept with some reasonable level of specificity — not too unlike arranging letters together to create words to convey concepts 😉

[image toned families]

For emoji keyboards to advance in creating more intuitive and personalized experiences, the Emoji Subcommittee is recommending a visual deprecation of the family emoji. This small set of emoji will be redesigned as part of a multi-phase effort to “complete the set” of toned variants for the remaining multi-person emoji. This of course raises the question: when there are as many families as there are people in the world, is there an effective way of conveying the concept of “family” without being overly prescriptive in defining what is and is not a family? Well, thankfully icons can do a lot of heavy lifting without requiring very much detail.
[image before-after]

When is an emoji running for the police or getting chased by them?

Another area the ESC is actively exploring is how the semantics of emoji sequences can differ when writing directionality changes. Some emoji characters have semantics that encode implicit directionality, but when the string is mirrored their meaning may be unintentionally lost or changed.

Left to Right Emoji Sequence
Quickly running towards an “exciting” police chase

[image leftwards]

Right to Left Emoji Sequence
Running away from the coppers

What, if anything, can we do to help ensure that messages are meaningfully translated, be they tiny pictures or tiny letters? As part of 15.1 we’re proposing a small set of emoji with strong directionality (with an initial focus on people) to face the opposite direction. Soon you too can run towards or away from … excitement.

Emoji 15.1

Given that the intake cycle of emoji proposals for Unicode 16.0 ended last July, the Emoji Subcommittee has also decided to temporarily delay the intake of Unicode Version 17.0 proposals until April 2024. Fortunately, you won’t have to wait until then to get new emoji. The list of recommendations includes 578 characters (most of them the candidates described above to support directionality). The list also includes a few humble additions: a broken chain, a lime, a non-poisonous mushroom, a nodding and shaking face, and a phoenix bird. Each one of these is a unique valid ZWJ sequence of emoji, so while they look like atomic characters made of a single codepoint, they are composed of two or more codepoints.

[image candidates]

Broken chain is the result of 🔗💥, with a variety of meanings, such as freedom, breaking a cycle, or perhaps a broken url ;-). Like the bi-directional emoji touched on above, shaking face and nodding face are the result of 🙂↔️ and 🙂↕️ respectively. Oh, and of course there is a phoenix rising from the ashes (🐦🔥), a perfect metaphor to capture where we are today.
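Going the other way, a ZWJ sequence can be split back into the existing emoji it is built from. The sketch below does this for two of the candidates, assuming their components are exactly the emoji shown in the text joined by U+200D; the variable and function names are illustrative, and the code points that finally ship could differ.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Assumed sequences, built from the component emoji named in the text.
phoenix = "\U0001F426" + ZWJ + "\U0001F525"                       # bird + fire
face_with_horizontal_arrow = "\U0001F642" + ZWJ + "\u2194\uFE0F"  # face + ↔️


def components(sequence):
    """Split a ZWJ sequence into the emoji on either side of each joiner."""
    return sequence.split(ZWJ)


def names(sequence):
    """Name every code point of every component, where Python knows the name."""
    return [
        [unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in part]
        for part in components(sequence)
    ]


print(names(phoenix))
print(names(face_with_horizontal_arrow))
```

The second component of the face sequence carries a trailing variation selector, which requests emoji presentation for the arrow. It belongs to the arrow rather than acting as a separator, which is why the split is done on the joiner alone.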

The Unicode Technical Committee (UTC) will review the required documents at its first meeting of 2023 in January – and if these candidates move forward, you can expect an update from the UTC later this Spring and Summer.

Tuesday, January 5, 2016

Feedback on Draft additional repertoire for ISO/IEC 10646:2016 (5th edition) CD2

The Unicode Technical Committee is soliciting feedback on pending additions to the draft repertoire of characters, to help discover any errors in character names, incorrect glyphs, or other problems. There is a short window of opportunity to review and comment on the repertoire additions noted below.

The following additional repertoire from ISO/IEC 10646:2016 (5th Edition), which is in committee ballot, is under review. See the associated repertoire in: Draft additional repertoire for ISO/IEC 10646:2016 (5th edition) CD2.

The Unicode Standard is developed in synchrony with ISO/IEC 10646. After ISO balloting is completed on any repertoire additions, no further changes or corrections will be possible. (See the FAQ Standards Developing Organizations for additional information on the stages in ISO standards development.) Advance feedback on these repertoire additions will help inform the UTC discussions about its own contribution to the ISO balloting process.

Documents referenced in the draft repertoire with numbers such as L2/15-088 are available in the UTC Document Registry.

For information about how to discuss this Public Review Issue and how to supply formal feedback, please see the feedback and discussion instructions.

Tuesday, January 22, 2013

Making UTC Document Register Public

The Unicode Technical Committee (UTC) is making its document register freely available for public access, starting on April 15, 2013. This decision has been taken in the interest of increasing public involvement in the ongoing deliberations of the UTC regarding the development of the Unicode Standard and the other standards and reports that it maintains. Open access to the document register will also make it easier to search the documents, both current and historical, for topics of interest, using widely available search engines. The UTC document register contains online documents dating back to 1997 and online registers for paper document distributions dating back to 1991.

The date for opening up access has been set to April 15 to provide sufficient time for anyone who might have issues concerning this change to raise their concerns to the Unicode Consortium. In particular, any author of a document which was submitted to the UTC under the old rules, with the assumption that the document would be available only to current members of the Consortium for review, who has concerns about that document being made publicly accessible, is encouraged to contact the Unicode Consortium. Please identify precisely the document of concern and the reasons why you might not wish for it to be included in the publicly accessible set. Please note that the change to make the document register publicly accessible does not change anything with regard to copyright status of existing documents – these documents are not being put in the public domain; rather, the UTC is simply removing the requirement for password access to view them.
