By Peter Constable
A huge thank you to Google for hosting Unicode Technical Committee (UTC) meeting #182 last week, January 22–24, in Sunnyvale, CA!
For complete details, see the draft UTC #182 minutes.
Unicode 17.0 alpha repertoire
UTC took technical decisions for the Unicode 17.0 alpha, which will be released for public review next week. No changes were made to the new character repertoire approved for Unicode 17.0 at UTC #181, but changes were made to some details for certain characters.
- Name changes were approved for some new characters (three Arabic honorific characters and Tolong Siki letters).
- For Tangut, the glyph and stroke count will be changed for one character, and the default UCA ordering for Tangut components will be changed.
- Four variation sequences will be added for “Sibe” forms of quotation marks (U+2018, U+2019, U+201C, U+201D).
- For CJK, the representative glyphs for 11 characters will be changed (one “G” source and ten “V” source). In addition, 1,685 “G” source references will be updated. Various Unihan property value changes were also approved.
Data files
A significant change was approved for data files for The Unicode Standard and the other version-synchronized standards, UTS #10, UTS #39, UTS #46 and UTS #51. Up through Version 16.0, data files for The Unicode Standard have been published in version folders on the Website in the /Public/ folder (e.g., https://www.unicode.org/
For example, instead of UCD and UTS #51 data files being organized as follows,
https://www.unicode.org/
https://www.unicode.org/
They will instead be organized like this:
https://www.unicode.org/
https://www.unicode.org/
This is close to what has been done in the Public/draft folder for pre-release data files. The organization in that folder will be adjusted for the Unicode 17.0 alpha to match what will be used for the release.
Property/data changes
Some significant property changes were approved for Unicode 17.0, including the following:
- The Identifier_Type property defined in UTS #39 is used by some identifier systems to limit the set of valid identifiers. Through Version 16.0, all CJK ideographs have had a property value that makes them valid in such identifier systems. UTC #182 approved a change to the Identifier_Type value for a large number of CJK ideographs to make them invalid, matching what ICANN has done for IDNA root zone labels.
- The Extended_Pictographic code point property was created to make the segmentation behaviours defined in UAX #14 and UAX #29 forward compatible with future emoji characters. When it was created in Unicode 11.0, all unassigned code points in the range 1F000..1FFFD were given this property. When non-emoji characters are assigned in that range, they should not have the property, but UTC has not consistently removed it from those code points. This will be corrected in Unicode 17.0.
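To illustrate the Extended_Pictographic issue, here is a minimal Python sketch of the kind of consistency check involved. It assumes property data in the usual UCD data-file syntax (`code point or range ; property # comment`). The sample lines, the function names, and the list of "non-emoji" code points are all hypothetical and are not real Unicode data.

```python
import re

# Illustrative lines in UCD data-file syntax. These are NOT real Unicode data.
SAMPLE = """\
# code point(s) ; property # comment
1F000..1F0FF ; Extended_Pictographic # hypothetical range
1F300        ; Extended_Pictographic # hypothetical single code point
"""

# One code point, or a start..end range, followed by a property name.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text, prop):
    """Return (start, end) code point pairs listed for the given property."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        m = LINE.match(line)
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            ranges.append((start, end))
    return ranges


def has_property(cp, ranges):
    """True if the code point falls inside any listed range."""
    return any(start <= cp <= end for start, end in ranges)


def flag_inconsistent(non_emoji_cps, ranges):
    """Code points the caller considers non-emoji that still carry the property."""
    return [cp for cp in non_emoji_cps if has_property(cp, ranges)]


ranges = parse_ranges(SAMPLE, "Extended_Pictographic")
# Hypothetical input: code points assumed to be assigned to non-emoji characters.
leftovers = flag_inconsistent([0x1F0A0, 0x1F200], ranges)
```

With real data, any code point returned by `flag_inconsistent` would be a candidate for the kind of correction described above.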
UTC #182 also authorized a proposed draft, based on L2/25-052, for a possible new UAX #60 to document data for non-CJK ideographs; a public review issue for this will be posted soon. This would be analogous to UAX #38 but would apply to ideographic scripts such as Nüshu and Tangut.
Please review!
UTC invites feedback on the following proposed specs:
- PRI #509, Proposed Draft UTS #58, Unicode Link Detection and Serialization
- PRI #510, Proposed Draft UTR #59, East Asian Spacing
As mentioned above, Identifier_Type property values for CJK characters are being changed based on analysis provided by ICANN. Three other documents submitted to UTC propose other Identifier_Type changes based on similar analysis. UTC invites review and feedback on these documents:
- L2/25-032 Alphabetic Characters not recommended in UTS#39 but part of the DNS Root Zone or Second-Level Reference LGR
- L2/25-033 Characters excluded from both MSR and Reference LGR but allowed in UTS#39
- L2/25-034 Characters recommended in both UTS#39 and MSR but excluded from the Root Zone or Reference LGR
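For reviewers who want to see how an Identifier_Type change affects identifier validity, the following Python sketch may help. It relies on the UTS #39 rule that a code point is allowed in identifiers only when its Identifier_Type is Recommended or Inclusion. The lookup table is a hypothetical stand-in with made-up values, and the function name is an assumption; a real implementation would load the UTS #39 data files.

```python
# UTS #39: Identifier_Status is Allowed only for these Identifier_Type values.
ALLOWED_TYPES = {"Recommended", "Inclusion"}

# Hypothetical table with made-up values, for illustration only.
# Identifier_Type can be multi-valued, so each entry is a set.
IDENTIFIER_TYPE = {
    0x0061: {"Recommended"},   # "a"
    0x4E00: {"Recommended"},   # a CJK ideograph left valid in this sketch
    0x3400: {"Uncommon_Use"},  # a CJK ideograph made invalid in this sketch
}


def is_valid_identifier(text, table):
    """True if every code point has an allowed Identifier_Type.

    Code points missing from the table are treated as not allowed,
    and the empty string is rejected.
    """
    return bool(text) and all(
        table.get(ord(ch), set()) & ALLOWED_TYPES for ch in text
    )


still_valid = is_valid_identifier("a\u4e00", IDENTIFIER_TYPE)
now_invalid = is_valid_identifier("a\u3400", IDENTIFIER_TYPE)
```

Changing a single table entry from `Recommended` to another value is enough to make every identifier containing that code point invalid, which is why these proposals merit careful review.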