Friday, September 22, 2023

Unicode Version 15.1 – Tips for Implementers

The Unicode Version 15.1 release includes the UCD (Unicode Character Database), Code Charts, and Annexes, but the Core Specification is unchanged from Unicode Version 15.0. In addition to new characters, a small number of errata were fixed, along with improved representative glyphs. 

Implementers should also take careful note of important changes that were made to the following UAXes:
  • For UAX #9 (Unicode Bidirectional Algorithm), the text for BD16, the interaction of control flow between W4 through W6, the use of sos, and the treatment of AN/EN with brackets in N0 were clarified, and a reference to UTS #55 was added.
  • For UAX #14 (Unicode Line Breaking Algorithm), line breaking at orthographic syllable boundaries was added, the handling of French-style quotation marks was improved, and allowed tailorings were more clearly characterized.
  • For UAX #29 (Unicode Text Segmentation), explicit conformance rules were added, support for ConjunctLinker clusters was added, the definition of “crlf” was updated, and multiple changes were made to the table of Word_Break Property Values.
  • For UAX #31 (Unicode Identifiers and Syntax), multiple changes were made to Section 2, Section 4 was completely rewritten, Section 7 was added, limited contexts for joining controls was moved to UTS #39, and a reference to UTS #55 was added.
  • For UAX #38 (Unicode Han Database), 6 new provisional properties were added, 7 provisional properties were removed, the syntax of several properties was updated, and the description of several properties was improved.
  • For UAX #45 (U-Source Ideographs), records for 39 new ideographs were added to its data file, Section 3 was added, “ExtI” was added as a new status, two obsolete status values were removed, and four status values were improved.



🌻🌻🌻🌻🌻  SUPPORT UNICODE  🌻🌻🌻🌻🌻 

Finally, if you are already a contributor — or member of Unicode (or your company or organization is), thank you, Danke, Děkuju,  धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we accomplish is only possible because of supporters like you. 
 
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider 
adopting a charactermaking a gift of stock, or making a donation. 

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. 

Please consult with a tax advisor for details.

Make your adoption today!

Thursday, September 14, 2023

Unicode CLDR v44 Alpha available for testing

[image] The Unicode CLDR v44 Alpha is now available for integration testing.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.) For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data and on Migration issues. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:
  • Sep 27 — Beta (data)
  • Oct 04 — Beta2 (spec)
  • Nov 01 — Release
In CLDR 44, the focus is on:
  1. Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
  2. Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
  3. Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
  4. Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
    1. Modern: Cherokee, Lower Sorbian, Upper Sorbian
    2. Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
    3. Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang
There are many other changes: to find out more, see the draft CLDR v44 release page, which has information on accessing the date, reviewing charts of the changes, and — importantly — Migration issues.

In version 44, the following levels were reached:

v44 Level
Langs
Usage
Modern
95
Suitable for full UI internationalization
čeština, ‎Deutsch, ‎français, Kiswahili‎, Magyar‎, O‘zbek‎, Română‎‎, Tiếng Việt‎, Ελληνικά‎, Беларуская‎, ‎ᏣᎳᎩ‎, Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, አማርኛ‎, ‎नेपाली‎, অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, 中文, 日本語‎, … ‎
Moderate
13
Suitable for “document content” internationalization, eg. in spreadsheet
brezhoneg, ‎føroyskt, IsiXhosa, ‎sardu, чӑваш, …
Basic
50
Suitable for locale selection, eg. choice of language on mobile phone
asturianu, ‎Rumantsch, Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, …

We are currently planning for CLDR version 45 to be a closed release with no submission period. The focus will be on improving the Survey Tool used for data submission, making necessary infrastructure changes, and some high priority data quality fixes.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

header image
By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.

Background

Encompassing letters and symbols (over 149,000 in Unicode 15.1) across the world’s writing systems, it was inevitable that many of them would look similar — and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin alphabet ‘a’. 😵‍💫

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that would look to human reviewers as if it was secure, but actually contain hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.

Examples

  • Line-break spoofs can cause what appears to be a line of code to be actually commented out, as far as the compiler is concerned. This can happen with C11, for example:
    precondition image
    To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler will interpret this as one line consisting only of a comment!

  • The “pаypal.com” above is an example of a confusable spoof.

  • As for a bidirectional spoof, take pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
Such code might not even be malicious — it is too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs — and just be very hard to understand; here’s an example:

The text “Error: {0} {1}", message” becomes RTL in translation.

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.

Process

In response to this problem, the Consortium started a project in early 2022 to put together a cross-functional group of experts in Unicode processing, programming languages, and software development tooling to address these problems. That project resulted in the Source Code Working Group (SCWG), which brought together a set of experts to work through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of use of the important higher-level protocol HL4, and emphasized the use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks; UAX #31 provided important guidance on profiles for default identifiers and clarified that requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages, and is relevant to issues of bidirectional ordering and potential spoofing attacks.

Impact

The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one new useful API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, September 12, 2023

Announcing The Unicode® Standard, Version 15.1


Version 15.1 of the Unicode Standard is now available. This minor version update includes updated code charts, data files and annexes. The core specification is unchanged from Unicode Version 15.0.

This version adds 627 characters, bringing the total number of characters to 149,813. The additions include 622 CJK unified ideographs in a new block, CJK Unified Ideographs Extension I. These new ideographs are urgently needed in China for use in public service databases, and are expected to be included in a forthcoming amendment to China’s GB 18030-2022 standard. The other new characters are five ideographic description characters that enhance the ability to describe rare or not-yet-encoded CJK ideographs.

There are six completely new emoji, such as for phoenix and lime and (finally) an edible mushroom. For 108 people emoji, you can now switch the direction that they are facing (for example, person walking facing right versus facing left).

Security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms. These updates complement the release of a new Unicode Technical Standard, UTS #55, Unicode Source Code Handling.

The new characters are limited to three blocks, and the code charts for several other blocks have changed. The most significant change to charts is for the CJK Unified Ideographs, CJK Unified Ideographs Extension A and CJK Unified Ideographs Extension B blocks with the addition of representative glyphs and source references for over 24,000 KP-source (North Korea) ideographs. There are also many other glyph corrections and improvements—see the 15.1 delta code charts for details.

Significant updates have been made to UAX #14, Unicode Line Breaking Algorithm and UAX #29, Unicode Text Segmentation adding better support for scripts of South and Southeast Asia, including grapheme cluster support for aksaras and consonant conjuncts, and line breaking at orthographic syllable boundaries.

For complete details on Unicode Version 15.1, see https://www.unicode.org/versions/Unicode15.1.0/.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Friday, September 1, 2023

NEW Virtual Event - Open House on Script and Character Encoding

Registration is Now Open!

The Unicode Standard aims to make the scripts used to write the languages of the world accessible on computers and devices. However, the process of getting characters and scripts into the Unicode Standard has often been puzzling. How does one successfully propose a script or a handful of characters? How are decisions made? 

Join us for a virtual Open House event, where you will be able to ask these (and other) script and character encoding questions to seasoned Unicode experts.

WhenTuesday, Oct 17, 2023 at 11am-12pm Pacific Time (California)

Register now. Please note this session will be recorded and available via the Unicode YouTube channel.

Supporting Resources

Documenting and Preserving Languages: A Talk on Character Encoding, Keyboards, and Fonts  by Deborah Anderson and Andrew Glass

Scripts and Character Encoding by Deborah Anderson, Script Ad Hoc Group Chair

Other Script and Character Encoding-related talks on the Unicode YouTube Channel


Monday, August 21, 2023

Volunteer Spotlight!

Lorna Evans, SIL International

Lorna first became involved with Unicode in 2000 as a conference participant. Her enthusiasm led her to volunteer as a lecturer at a Unicode conference, and for the past several years as an active proposal contributor and committee member. Lorna’s heart and passion are to assist digitally disadvantaged communities by bringing their language fonts, characters, and sounds to the Unicode standard. 

Lorna began her language journey typesetting Bibles in Ethiopia in 1990 and was fascinated by the multiple fonts and characters needing representation. When she heard that Unicode was moving to support Ethiopic characters, she had to get involved. 

When asked what she is most proud of, Lorna said, “Anytime I do a proposal to Unicode, it feels like the most important thing (for that language community).” She thrives on research and feels with each proposal, she is bringing digital access to people who need it most. Lorna is completely self-taught and currently focused on documenting Arabic script. She describes SIL International, an Associate member of the Unicode Consortium and her current employer, as the former kings of creating custom encoded fonts and she is working diligently to help SIL transition to Unicode.

As for her time involved with other Unicode staff and volunteers, she has enjoyed the camaraderie and attending technical committee and editorial meetings. Lorna is an active member of the Script Ad Hoc Subcommittee, as well as the primary representative to Unicode for SIL International.

Lorna shared that she grew up in Bolivia and says that salteñas, a savory pastry filled with beef stew, is still her favorite food. 

Editor’s Note: We appreciate and thank Lorna for taking time to tell us a little about herself as well as her years of contributions 



🌻🌻🌻🌻🌻  SUPPORT UNICODE  🌻🌻🌻🌻🌻 

Finally, if you are already a contributor — or member of Unicode (or your company or organization is), thank you, Danke, Děkuju,  धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we accomplish is only possible because of supporters like you. 
 
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider 
adopting a charactermaking a gift of stock, or making a donation. 

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. 

Please consult with a tax advisor for details.

Make your adoption today!



Tuesday, August 15, 2023

Unicode Consortium Board Votes to Elevate ICU4X to Technical Committee

Across the globe, people are using alternative ways to get online, such as smartphones, smart watches, and other compact devices. Formed as a Subcommittee of the ICU Technical Committee in 2020, ICU4X is a modular, lightweight, and secure library that brings internationalization to client-side and resource-constrained environments, written in Rust with bindings into many programming languages. 

We are all very excited about the ICU4X project. It dramatically expands the number of apps and systems that can easily deploy internationalization: with a smaller, modular footprint and the advantages of security and performance from Rust.” — Mark Davis, Co-founder and Board Chair

At its July meeting, the Board agreed that the ICU4X Subcommittee needed to have the authority to make technical decisions relating to the ICU4X architecture, structure, and coding, and voted that the ICU4X Subcommittee be elevated to the level of Technical Committee with the authority to make such decisions, effective immediately.

The ICU4X Technical Committee will be responsible for the design and implementation of the library, with the goal to ensure that mobile devices and other low-resource devices can access scalable internationalization services. It is particularly applicable for devices that cannot run full ICU (C/C++ and Java) — and for those in emerging markets with more limited resources and digitally disadvantaged languages.

Chair Shane Carr (Google) and Vice Chairs Zibi Braniecki (Amazon) and Nebojša Ćirić (Google) are the ICU4X Technical Committee Chair and Vice Chairs.

Congratulations to the ICU4X team! 




Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a charactermaking a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, August 8, 2023

Unicode Technology Workshop — Call for Submissions and Registration Open!

Save the Date! November 7-8, 2023. Bay Area (Hosted at Google)

About the Workshop

Join us in person for two days of community building around the Unicode technology that makes software work for billions of people. Expect two days of workshops, seminars, free-form discussions, and lightning talks centered around i18n libraries, locale data frameworks, globalization tooling, localization pipelines, input methods, and text rendering. Network with the developers and users to help shape the future of Unicode technology.


This is a new type of event for Unicode, with a focus on building more connections within the internationalization community. Expect to come away with deeper knowledge on how to solve tough problems in the i18n and l10n space and how to engineer products that work better for global users. GILT professionals, especially those who build or use Unicode technologies, are encouraged to attend and to host sessions. To encourage maximum collaboration amongst the attendees, this is an in-person-only event.


Register Now


Call for Submissions

For those interested in participating in and contributing to the event, the call for submissions is now open. If you work on Unicode internationalization technologies or use Unicode internationalization technologies in your work, we want to hear from you. You can register your interest in contributing using the following link.


Call for Submissions


About the Unicode Consortium

The Unicode Consortium is the premier non-profit open source, open standards body for the internationalization of all software and services. 


For more than 30 years, the Unicode Consortium has coordinated the efforts of a world-wide team of volunteer programmers and linguists to standardize, evolve, and maintain a global software foundation that allows virtually every computer system and service to help people connect using their native language. 


For additional information about Unicode, visit home.unicode.org.




Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a charactermaking a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]


Wednesday, August 2, 2023

622 New CJK Ideographs to be Available in Unicode Version 15.1

The Unicode Standard will include 622 new CJK characters in Version 15.1, which will be released on September 12, 2023. The characters are in a new block, CJK Unified Ideographs Extension I, with code point assignments as reflected in the proposal document.

The characters in the Extension I block have been deemed to be very urgently needed for use in China. The Extension I proposal was based on characters that appeared in a draft amendment of China’s mandatory GB 18030 standard. For this reason, the Unicode Technical Committee (UTC) considered it imperative to arrive at a stable encoding for these characters as quickly as possible.

With the Unicode 15.1 beta review period completed, and with endorsements from liaison partners, China Electronics Standardization Institute (CESI) and ISO/IEC JTC 1/SC 2, UTC at its recent meeting was able to commit to including these characters for this next release of the Unicode Standard. The code point assignments are now stable, and vendors can begin working on implementations with confidence.

The Unicode Consortium would like to thank experts in UTC’s CJK & Unihan Group and ISO’s Ideographic Research Group (IRG) for their expedited work in preparing a proposal for encoding CJK Extension I, and would also like to thank Mr. Chen Zhuang and partners in CESI for their cooperation in this process.




Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a charactermaking a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Thursday, June 15, 2023

ICU 73.2 & CLDR 43.1 released: GB18030 compliance updates & compatibility fixes

ICU 73.2 & CLDR 43.1 released: GB18030 compliance updates & compatibility fixes ICU LogoUnicode® ICU 73.2 and CLDR 43.1 have just been released.
There are significant changes for GB18030-2022 compliance support:
  • CLDR extends the support for “short” Chinese sort orders to cover some additional, required characters for Level 2. This is carried over into ICU collation.
  • ICU has a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005.
There are also changes for compatibility:
  • There are optional variants of time formats with AM/PM (only for English) using ASCII spaces in CLDR that can also be used in ICU via custom data generation. This is intended to help certain implementers transition to the improved patterns, which have used a narrow no-break space between the time and AM/PM since CLDR 42.
  • The changes to the word segmentation behavior of @ sign that were in CLDR 42 (ICU 72) have been reverted. These caused problems for certain parsers that did not expect @ to join to letters.
ICU 73.2 updates to CLDR 43.1 locale data. These are maintenance releases for ICU 73 and CLDR 43, with limited sets of bug fixes and no API or structural changes. ICU 73.2 and CLDR 43.1 include several other bug fixes, including person name formatting, and Cyrillic transforms.

For details, please see:


Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Thursday, June 1, 2023

Unlocking the Power of CLDR Person Name Formatting: A Solution for Formatting Names in a Globalized World

By Mike McKenna, Chair of CLDR Person Names Subcommittee

[image]

CLDR Person Names has moved from “tech preview” to “draft” status and is available for initial testing by implementors through ICU4J.

How a person’s name is displayed and used can convey respect, familiarity, or even be interpreted as rude if used improperly. That’s why it’s important to format names correctly, especially because naming practices vary across the globe. In many cultures, names can indicate gender, status, birthplace, nationality, ethnicity, religion, and more.

Until now, there have been no good standards for how to format people’s names in various contexts. A number of Unicode members wanted to address this problem and provide a mechanism that anyone could use to format people’s names in a wide variety of applications, such as contact lists, air travel, billing applications, CRMs, social media, and any other application that asks for user information and presents it back to the user or others.

The Unicode® Person Name Formats defines patterns used to take a person’s name and format it correctly in a given language or locale depending on a chosen context. With the Unicode Common Locale Data Repository (CLDR), locale codes and name sequences can be selected to create a specific pattern for formatting a person’s name — including preferences for formal, informal, or abbreviated versions. As a result, designers and developers can correctly display names according to the user’s native locale and culture, especially important when integrating names in different character scripts, such as Japanese, Chinese, or Russian.

The Unicode Consortium added Person Name formatting to CLDR in v42 and has been refined and enhanced for v43, which just released in April. In CLDR v43, with the help of linguists from around the world, we completed data for formatting people’s names for CLDR locales at modern coverage. Its formal name is "Unicode Technical Standard #35 Unicode Locale Data Markup Language (LDML); Part 8: Person Names". ICU has added the PersonNameFormatter class and is available in ICU 73.

To learn more, and get an idea of the implications for user experience and application design, see the following paper, which provides an illustration of the many contexts in which names can be formatted through CLDR Person Names.

LDML (UTS#35) Part 8: Person Names - a story teller’s case study



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, May 23, 2023

Unicode 15.1 Beta Review Open

[image] The beta review period for Unicode 15.1 has started, and is open until July 4, 2023. The beta is intended primarily for review of character property data and changes to algorithm specifications (Unicode Standard Annexes).

Normally at this phase of a release, the character repertoire is considered stable and very unlikely to change. Also, the plan for Unicode 15.1 had been for a minor release with only a very limited set of new characters.

Recent developments have led to a tentative change in those plans, however.

China has a very urgent need for encoding of certain CJK ideographs used in public services databases. To accommodate this urgent need, the Unicode Technical Committee (UTC) decided at its April 2023 meeting to encode 603 new characters in Unicode 15.1 as CJK Unified Ideographs Extension I. This new block is included in the delta charts for the Unicode 15.1 beta. However, inclusion of these characters in Unicode 15.1 is contingent on support for this addition from China, and on support for this addition in the corresponding ISO/IEC 10646 standard from ISO/IEC JTC 1/SC 2 at their upcoming meeting in June. While support for the new block is anticipated, there is a small chance that minor changes to this repertoire will be made after the beta, or that UTC will pull this block entirely from the 15.1 release.

Several of the Unicode Standard Annexes have significant modifications and associated data changes for version 15.1. For example, UAX #14, Unicode Line Breaking Algorithm has significant enhancements to support line breaking at orthographic syllable boundaries in several South and Southeast Asian scripts. Also, in conjunction with the parallel development of a new standard, UTS #55, Unicode Source Code Handling (see Public Review Issue #474), there are significant revisions to UAX #31, Unicode Identifiers and Syntax that will provide better specifications and guidance related to security, and also improved guidance for applications that define identifier systems using Unicode.

While draft content for the beta has been published as of May 23rd, the work groups preparing updates to the content could continue to make changes to data or specs during the Beta review period. Any substantive changes for the beta will be frozen by June 5th.

Please review the documentation, adjust your code, test the data files, and report errors and other issues to the Unicode Consortium by July 4, 2023. The review period will only be for six weeks, so prompt feedback is appreciated. Feedback instructions are on the beta page.

See https://www.unicode.org/versions/beta-15.1.0.html for more information about testing and providing feedback on the 15.1.0 beta.

See https://www.unicode.org/versions/Unicode15.1.0/ for the current draft summary of Unicode Version 15.1.0.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, May 16, 2023

LDML (UTS#35) Part 7: Keyboards

[image] CLDR-TC has authorized a new Public Review Issue, #476, for a major revision of LDML (UTS#35) Part 7: Keyboards. CLDR-TC and CLDR Keyboard-SC would appreciate feedback on whether there are specific changes or enhancements that should be made in the proposed specification.

Today, every platform must independently evaluate, prioritize, and implement all new or updated keyboard layouts, leading to major inconsistencies and delays especially where digitally disadvantaged languages are concerned. Consequently, language communities and other keyboard authors must see their designs developed independently for every platform/operating system, resulting in unnecessary duplication of technical and organizational effort.

“Keyboard 3.0” is designed from the ground up to be usable as a solution to support both hardware and on-screen (touch) layouts for all platforms in a single source file for each language.

With Keyboard 3.0, leading members of the language communities will be able to submit their layout once to CLDR, and it will be available to all platforms as part of the latest version of CLDR, making adoption much easier for platforms. Platform vendors will not need to develop and maintain their own keyboard layout data, especially for languages that they don’t yet support.

This work contributes to the goals of the United Nations International Decade of Indigenous Languages by improving the path for Digitally Disadvantaged Language communities to develop platform support for their languages. Users should see improvements in consistency between platforms, as layouts can be shared.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Tuesday, May 2, 2023

UTC #175 Highlights

by Peter Constable, UTC Chair

We had another productive Unicode Technical Committee (UTC) meeting last week,hosted at Adobe headquarters in downtown San Jose, California. Here are some highlights from the meeting.

Unicode 15.1 Beta

UTC has authorized the Beta release for Unicode 15.1. There were various, relatively minor technical changes to be made based on feedback during the Alpha review period, plus one major change that I’ll describe below. The Beta is scheduled for release on May 23, for a six week public review period to end July 4. That closing date will provide time for working groups to review feedback and provide recommendations for the next UTC meeting July 25 – 27.

CJK Extension I & GB 18030

A major change for Unicode 15.1 that was decided on was to encode 603 characters in a new CJK Unified Ideographs Extension I block. (See L2/23-106.) This was part of long discussions about GB 18030-2022 and Amendment 1 of that standard which China is currently developing. China has an urgent need for these characters, and the draft of their amendment has them allocated in reserved code positions of Unicode and ISO/IEC 10646, which is not viable from the perspective of the international standards. So, UTC has taken initiative to have China's need accommodated in a standards-conforming manner.

There was discussion as to whether the new characters should be added to Unicode 15.1 or to Unicode 16.0: it was generally preferred to wait for 16.0, but 15.1 was tentatively chosen in case that makes a significant difference for China’s process.

UTC recommended the addition of CJK Extension I to the INCITS/CS&I committee (mirror for JTC 1/SC 2—also met last week) who agreed to recommend to SC 2 the addition of that block in Amendment 2 of ISO/IEC 10646. See L2/23-114 and L2/23-115 for more information.

Orthographic syllable support in UAX #14

Another significant addition for Unicode 15.1 is that UTC approved extending UAX #14 Unicode Line Breaking Algorithm to support breaking of various South and Southeast Asian scripts at orthographic syllable boundaries. The algorithm for this is based on a proposal from Norbert Lindenberg and others (see L2/22-086), with details for incorporation into UAX #14 provided by Robin Leroy (see L2/23-072). A prototype implementation had been created as a public review issue (see PRI #472), and feedback had been positive. This will be a very significant enhancement in Unicode 15.1 providing important improvements in support for several South and Southeast Asian scripts.

Unicode display in text terminals

A new UTC project was initiated at this meeting to develop specifications for supporting display of scripts that require complex shaping in text terminals. This was introduced with a presentation by Renzhi Li and Dustin Howett of Microsoft (see L2/23-107). Even though the majority of computing device usage today is via GUIs, text terminals are still used in many scenarios. Thus, there was considerable interest among UTC participants in this proposal. An ad-hoc working group, chaired by Dustin Howett, will be formed to develop specifications. If interested in participating, let me know and I’ll connect you with Dustin.

Full details on these and other outcomes will be provided in the draft minutes that will be available soon (as L2/23-076 in the document registry).



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Monday, April 17, 2023

ICU4X 1.2: Now With Text Segmentation and More on Low-Resource Devices

By Shane Carr, Chair of the ICUX Subcommittee

Across the globe, people are coming online with smaller and more varied devices including smartphones, smart watches, and gadgets. An offshoot of the International Components for Unicode (ICU) Committee, the ICU4X Committee is responsible for enabling these next-generation devices to communicate with each other in thousands of languages. Written in Rust, ICU4X brings lightweight, modular, and secure internationalization libraries to low-resource devices and many programming languages.

Since our first big release in September 2022, the ICU4X team has been busy building additional features and infrastructure. Today, the team is excited to announce ICU4X 1.2, featuring the first stable release of the Segmenter component, more Unicode properties, property names, a technology preview of language and script display names, HarfBuzz bindings, CLDR 43, full compliance with the Unicode Bidirectional Algorithm (UAX #9), and many smaller features and improvements to the ICU4X components.

Text segmentation is the process of dividing strings into meaningful units, such as words, sentences, or grapheme clusters (characters). It is a fundamental task in a wide range of applications, including cursor movement, highlighting spans of text, evaluating text for spelling and grammatical correctness, information retrieval, and text layout.

ICU4X 1.2 supports the two standards Unicode Text Segmentation (UAX #29) for word, sentence, and grapheme cluster segmentation and Unicode Line Breaking Algorithm (UAX #14) for line segmentation.

Given ICU4X's focus on being lightweight for deployment in resource-constrained environments, the team focused on ways to reduce data size versus ICU4C. The highest-impact differences come from the use of runtime tailoring (reducing the number of rule tables) and machine learning models (eliminating the need for Southeast Asian word dictionaries). Overall, ICU4X data for segmentation is 20.1% smaller than the equivalent data in ICU4C, and 60.7% smaller for line break segmentation.

In addition to being smaller in size, ICU4X's line and word segmenters are 19.1% and 52.2% faster in non-complex scripts and 46.9% and 32.1% faster in Chinese than the equivalents in ICU4C, respectively.

The machine learning models in ICU4X are used for word and line breaking in Southeast Asian languages including Thai, Lao, Khmer, and Myanmar. The models use an LSTM, are trained on large datasets, and achieve high accuracy while retaining small model size. By leveraging modern computer architecture features such as SIMD, the team optimized the performance of the LSTM inference to be about 3× faster than the naive implementation. However, the dictionary model remains the fastest, about two orders of magnitude faster than the LSTM. ICU4X offers both types of models for clients to choose.

Another focus of ICU4X 1.2 has been to support your text layout stack. A text layout engine requires more than the scope of either ICU4C and ICU4X, but any layout engine requires at least two ICU features: line break segmentation and the ability to correctly order bidirectional text. ICU4X 1.2 supports the segmentation and bidirectional text needs of Skia’s SkParagraph and HarfBuzz.

Finally, ICU4X 1.2 brings a number of smaller features to other components. The experimental Display Names component now supports language and script display names, in addition to region display names; the Properties component supports converting UCD property and value enum discriminants to their long and short names, and vice-versa; and all components have been upgraded to support CLDR 43.

Read the full ICU4X 1.2 release notes and then the ICU4X tutorial to start using ICU4X in your project.

To learn more about the latest release, be sure to attend our ICU4X Virtual Open House this Wednesday, April 19th at 9am PT.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Thursday, April 13, 2023

ICU 73 Released

ICU LogoUnicode® ICU 73 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 73 updates to CLDR 43 locale data with various additions and corrections.

ICU 73 improves Japanese and Korean short-text line breaking, reduces C++ memory use in date formatting, and promotes the Java person name formatter from tech preview to draft.

ICU 73 and CLDR 43 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15.1 which is planned for September.)

ICU 73 updates to the time zone data version 2023c (March 2023). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/73.



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]

Wednesday, April 12, 2023

Unicode CLDR v43 released

[image] CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

It is important to review the Migration section for changes that might require action by implementations using CLDR directly or indirectly (eg, via ICU).

CLDR 43 is a limited-submission release, focusing on just a few areas:
  1. Formatting Person Names
    • Completing the data for formatting person names, allowing it to advance out of “tech preview”. For more information on the benefits of this feature, see Background.
  2. Locales
    • Adding substantially to the LikelySubtags data: This is used to find the likely writing system and country for a given language, used in normalizing locale identifiers and inheritance. The data has been contributed by SIL.
    • Inheritance: Adding components to parentLocales, and documenting the different inheritance for rgScope data, which inherits primarily by region.
  3. Other data updates
    • In English, Türkiye is now the primary country name for the country code TR, and Turkey is available as an alternate. Other locales have been reviewed to see whether similar changes would be appropriate.
    • Name for the new timezone Ciudad Juárez.
  4. Structure
    • Adding some structure and data needed for ICU4X & JavaScript, for calendar eras and parentLocales.
  5. Collation & Searching
    • Treat various quote marks as equivalent at a Primary strength, also including Geresh and Gershayim.
To find out more about these and other changes, see the CLDR v43 release page.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

[badge]