By Peter Constable, Chair of UTC
Unicode Technical Committee (UTC) meeting #183 was held April 22 – 24. Thanks to member company Microsoft for hosting at its Mountain View, CA campus. Here are some highlights.
Unicode 17.0 Beta
Unicode 17.0 is scheduled for release in September of this year. At UTC #183, technical decisions were taken for updates to be reflected in the Beta release, which will be available for public review later this month.
The most significant changes affecting Unicode 17.0 are encoding of 14 additional characters:
- A new currency symbol, SAUDI RIYAL SIGN, was proposed by the Saudi Central Bank and will be added to Unicode 17.0. This has been assigned to code point U+20C1.
  - Note: We know that many vendors will want to implement support for this quickly. Keep in mind that, while it's unlikely that the code point will change, this isn't completely guaranteed until Unicode 17.0 is finalized at the next UTC meeting, in July.
  - For more background, see a recent Unicode Blog article, Support for the New Saudi Riyal Currency Symbol.
- Thirteen new CJK unified ideographs will be added, twelve of which are needed for use in China. These were reviewed by experts in the Ideographic Research Group (IRG—a working group within ISO/IEC JTC 1/SC2), who recommended immediate encoding. For more information, see Sections 25 and 27 of the CJK & Unihan Working Group recommendations (L2/25-090).
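For implementers who want to experiment with the new currency sign before Unicode 17.0 is final, here is a minimal Python sketch. It shows the code point and its UTF-8 bytes, and checks whether the local runtime's character database already knows the character. The helper name `describe` is illustrative only, and U+20C1 is the assignment described above, so treat it as provisional until the release is finalized.

```python
import unicodedata

# Assignment described for the Unicode 17.0 Beta; provisional until 17.0 is final.
SAUDI_RIYAL_SIGN = "\u20C1"

def describe(ch):
    """Return basic facts about a single character (illustrative helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": ch.encode("utf-8").hex(" ").upper(),
        # None when this runtime's Unicode data predates the character.
        "name": unicodedata.name(ch, None),
    }

info = describe(SAUDI_RIYAL_SIGN)
print(info["code_point"], info["utf8"], info["name"])
```

On a runtime whose Unicode data predates 17.0, the name comes back as `None`, which is a simple way to detect that fonts and property data for the character may also be missing.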
Three characters that were to be newly added have been removed. The Unicode 17.0 Alpha included the addition of the Sidetic script, with 29 characters. (Sidetic is a historic script used in ancient Anatolia.) Based on expert feedback during the Alpha review, three of the characters were deemed not ready for encoding, and so will be removed from Unicode 17.0. Hence, the Beta will include only 26 Sidetic characters.
With these repertoire changes, Unicode 17.0 Beta will include 4,847 new characters.
There were other notable changes related to CJK Unified Ideographs. Thanks to ongoing research by IRG experts, a number of corrections will be made affecting already-encoded ideographs, including changes to the region-specific glyphs shown in the code charts and to source references (the details that map CJK Unified Ideographs to the specific ideograph forms used in different regions). One significant change being made is the horizontal extension of 2,145 existing CJK Unified Ideographs with the addition of glyphs and source data for those characters reflecting use in China. For details, see Section 28 of L2/25-090.
Operational criteria for security-related classification of characters
One Unicode specification, UTS #39, Unicode Security Mechanisms, provides guidance on which Unicode characters should or should not be used in identifier systems where security is an issue, such as Internet domain names. It defines a General Security Profile for identifiers, which gives every Unicode character a status of allowed or restricted. This status is based on a classification of characters by a character property, Identifier_Type.
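As a rough illustration of how a status can follow from Identifier_Type, the Python sketch below treats a character as allowed only when all of its types are Recommended or Inclusion, which is my reading of the mapping in the published UTS #39. The function names and the small hand-written sample table are assumptions for illustration. A real implementation would load the Identifier_Type data file published with UTS #39 rather than hard-code values.

```python
# Types that map to the "Allowed" status (assumption based on published UTS #39).
ALLOWED_TYPES = {"Recommended", "Inclusion"}

def identifier_status(types):
    """Derive a status from the set of Identifier_Type values of one character."""
    # An empty set (unknown character) is treated as Restricted in this sketch.
    return "Allowed" if types and types <= ALLOWED_TYPES else "Restricted"

# Hand-picked sample values for illustration only; not loaded from Unicode data.
SAMPLE_TYPES = {
    "a": {"Recommended"},
    "'": {"Inclusion"},
    "\u2161": {"Not_NFKC"},   # ROMAN NUMERAL TWO
    "\u16A0": {"Exclusion"},  # RUNIC LETTER FEHU FEOH FE F
}

def is_allowed_identifier(text):
    """True when every character in text has the Allowed status."""
    return all(
        identifier_status(SAMPLE_TYPES.get(ch, set())) == "Allowed"
        for ch in text
    )

print(is_allowed_identifier("a'a"), is_allowed_identifier("a\u2161"))
```

Treating unknown characters as restricted is a deliberately conservative choice in this sketch, not something the profile itself is claimed to mandate.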
Up to now, there has been a basic description of the different Identifier_Type values, but not detailed operational criteria for assigning characters to the various types. UTC reviewed a proposal for such operational criteria (see L2/25-069, Factors used in determining the Identifier_Type of characters). These criteria were informed by work done in ICANN in defining rules used for determining permitted DNS and second-level domain name labels. UTC approved these criteria to be incorporated into UTS #39 and used for this purpose going forward.
Related to this, the Identifier_Type classifications of over 1,000 characters will be revised in Unicode 17.0, in line with these criteria. (Similar changes were made during UTC #182 for a large number of CJK Unified Ideographs.)
New Unicode Technical Standards in development
In my email on highlights from UTC #182, I mentioned two technical documents in early stages of development that were available for public review:
- PRI #509, Proposed Draft UTS #58, Unicode Link Detection and Serialization
- PRI #510, Proposed Draft UTR #59, East Asian Spacing
UTC #183 advanced both of these from Proposed Draft to Draft status.
Also, the specification for East Asian spacing will be changed from a Unicode Technical Report (UTR) to a Unicode Technical Standard (UTS). Technical reports are used to provide technical information, which may include algorithms that could be useful for implementations. But they are not used as a basis for specifying data or algorithms where interoperability between implementations is required. As pointed out in document L2/25-138, this new Unicode technical document will be referenced by CSS specifications for the text-autospace property, which is in development and being implemented in browsers. Hence, it is appropriate for this Unicode document to be designated as a UTS.
In addition, UTC reviewed a proposal for another UTS and authorized its development: Proposed Draft UTS #61, Unicode Set Notation. Unicode specs for properties and algorithms often need to refer to sets of code points or strings using property assignments. Certain conventions have been used in UTC specs as well as in certain Unicode-provided tools and implementations, including the Unicode Utilities and ICU, and in the Unicode CLDR LDML spec. However, the conventions used in these various contexts have not been mutually consistent and interoperable. The proposed new UTS is a first step toward convergence of the conventions across these contexts. The proposed draft UTS has been posted for public review, and UTC invites feedback on it:
- PRI #523, Proposed Draft UTS #61, Unicode Set Notation
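To make the idea of a property-based set concrete, here is a small Python sketch that works out by hand what an expression in the style used by ICU, such as `[\p{Lu}&[\u0370-\u03FF]]`, denotes: the uppercase letters within one code point range. The expression is shown only as an example of an existing convention, not as the syntax proposed in UTS #61. The helper names are hypothetical, and the sketch limits itself to a small universe of code points to stay fast.

```python
import unicodedata

# Small universe of code points for this sketch (assumption; a real set
# would range over all of Unicode).
UNIVERSE = range(0x0000, 0x0500)

def prop_set(general_category):
    """Code points in UNIVERSE whose General_Category matches exactly."""
    return {
        cp for cp in UNIVERSE
        if unicodedata.category(chr(cp)) == general_category
    }

def range_set(first, last):
    """Code points from first to last, inclusive."""
    return set(range(first, last + 1))

greek_range = range_set(0x0370, 0x03FF)

# Intersection: uppercase letters that fall in the range.
upper_in_range = prop_set("Lu") & greek_range

# Difference: everything in the range that is not an uppercase letter.
non_upper_in_range = greek_range - prop_set("Lu")

print(chr(0x0391) in {chr(cp) for cp in upper_in_range})
```

Python's built-in set operators stand in here for the intersection and difference operators of a set notation. Differences in how such operators are spelled and combined are the kind of detail on which existing conventions can diverge.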
Note: some working group reports are referred to for background details, but be sure to check the minutes for definitive outcomes, which sometimes differ from what working groups recommended. For complete details, see the draft UTC #183 minutes.