Friday, February 26, 2021

Unicode 14.0 Alpha Review

The repertoire for Unicode 14.0 is now open for early review and comment. During alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.

This early review is provided so that reviewers may consider character repertoire issues prior to the start of beta review (currently scheduled to start in June 2021). Once beta review begins, the repertoire, code points, and character names will all be locked down and no longer subject to change.

Feedback for the alpha review should be reported under PRI #428 using the Unicode contact form by April 12, 2021.


Over 140,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Wednesday, February 24, 2021

Enhancements to Unicode Regular Expressions

A Proposed Update of UTS #18, Unicode Regular Expressions, is now available for review and feedback.

Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some have also adopted level 2 features (full character properties, grapheme clusters, ...).

A major enhancement to UTS #18 in 2020 focused on the addition of Character Classes with strings. The initial impetus for this was to handle emoji effectively in browsers, as most emoji consist of more than one code point. Supporting strings directly in character classes frees programs from having to download large amounts of data or handle complicated syntax. Using a property like RGI_Emoji allows a regular expression to match both single code points such as "😁" and multi-code-point strings such as "🇫🇷". This extension to strings is also important for internationalization. For example, the alphabets used by many languages contain multi-code-point strings, so this extension allows them to be handled easily.
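As a rough illustration of the idea, a character class that contains strings can be approximated in an engine without native support by expanding it into an alternation that tries longer strings first. The sketch below uses Python's standard `re` module, which has no RGI_Emoji property; the helper names `class_with_strings` and `split_letters` and the sample alphabet are hypothetical, chosen only for this example, and are not taken from UTS #18.

```python
import re

def class_with_strings(items):
    """Compile a pattern matching any one of `items`, longest strings first.

    This only emulates a character class with strings; it is an
    illustrative sketch, not the mechanism UTS #18 prescribes.
    """
    # Longer strings must come first so that "ch" is tried before "c".
    ordered = sorted(set(items), key=lambda s: (-len(s), s))
    return re.compile("|".join(re.escape(s) for s in ordered))

def split_letters(text, pattern):
    """Return the consecutive class members found in `text`."""
    return pattern.findall(text)

# Hypothetical sample: a few Slovak letters, including the digraphs "ch" and "dz".
SAMPLE_ALPHABET = ["a", "b", "c", "ch", "d", "dz", "e", "h", "i", "l", "z"]

# Emoji sample: one single-code-point emoji and one two-code-point flag.
SAMPLE_EMOJI = ["😁", "🇫🇷"]

if __name__ == "__main__":
    letters = class_with_strings(SAMPLE_ALPHABET)
    print(split_letters("chlieb", letters))  # "ch" is kept together as one letter

    emoji = class_with_strings(SAMPLE_EMOJI)
    print(len("🇫🇷"))                 # the flag is two code points
    print(emoji.findall("😁🇫🇷"))     # each emoji is matched as one unit
```

An engine that supports strings in character classes directly makes this kind of hand-built alternation, and the data needed to build it, unnecessary.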

Additional enhancements are in progress this year, based on working with members of the ECMAScript committee, including more clarifications, better guidance on implementation, and addressing some tricky issues dealing with complementing (inverting) Character Classes. The end goal of all of these enhancements in 2020 and 2021 is to significantly raise the level of Unicode support in programming languages and libraries.

For more information, see https://www.unicode.org/review/pri427/.



Tuesday, February 2, 2021

Unicode Consortium looking to hire an Executive Director

Since its founding, the Unicode Consortium has grown and expanded its charter and scope. We’re embarking on a new chapter in the evolution of the Consortium by initiating the search for a leader with proven executive talents to fill the newly-created position of Executive Director. Learn more: https://www.unicode.org/consortium/edappinfo.html

