Friday, May 31, 2024

New Event on June 25 - Webinar on Bidirectional Text (Part 1): The Basics of Bidi

Registration is Now Open!

A number of scripts, such as Hebrew, Arabic and Urdu, write their letters horizontally on a page or screen, running right to left. A complication for these scripts is that other characters, such as digits, flow left to right and can occur on the same line, sometimes alongside other left-to-right text, such as Latin. Text that mixes right-to-left and left-to-right runs is called “bidirectional” text (“bidi” for short).
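As a rough illustration of the mix described above (not part of the webinar material), Python's standard `unicodedata` module reports the Bidi_Class property that bidi processing starts from. The helper name `bidi_classes` and the sample string are assumptions chosen for this sketch.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, Bidi_Class) pairs for each code point in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hypothetical sample: Hebrew alef, Arabic beh, Latin "a",
# European digit "1", Arabic-Indic digit one.
sample = "\u05D0\u0628a1\u0661"

for ch, cls in bidi_classes(sample):
    # R and AL are right-to-left letters, L is left-to-right,
    # EN and AN are the two number classes.
    print(f"U+{ord(ch):04X} {cls}")
```

This only inspects per-character properties. Resolving the display order of a whole line is the job of the Unicode Bidirectional Algorithm, which this sketch does not implement.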

Handling bidi text in browsers and other software is challenging for both general users and implementers. This webinar will describe the basics with examples. It will be followed by a live question-and-answer period. A more in-depth question-and-answer session will take place on August 13, 2024.

Who? If you are a translator or localizer, localization tooling maker, I18n infrastructure developer, linguist or language researcher, application developer, or content author, you will want to join us for this webinar. Bring your questions to the people involved for the live Q&A.

When? Tuesday, 25 June 2024 starting at 8am (San Francisco), 11am (New York), and 5pm (Berlin).

Registration is Open Now! Please note this session will also be recorded and made available via the Unicode YouTube channel.


Getting Started with Bidirectional Text (Part 1): The Basics of Bidi

Frequently Asked Questions: https://unicode.org/faq/bidi.html

Articles:
Additional Articles from W3C:

About the Unicode Consortium

The Unicode Consortium is the premier non-profit open source, open standards body for the internationalization of all software and services.

For more than 30 years, the Unicode Consortium has coordinated the efforts of a worldwide team of volunteer programmers and linguists to standardize, evolve, and maintain a global software foundation that allows virtually every computer system and service to help people connect using their native language.

For additional information about Unicode, visit home.unicode.org.

Unicode Resources


Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐚🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock


As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Tuesday, May 21, 2024

Unicode 16.0 Beta Review Open

The beta review period for Unicode® 16.0 has started and is open until July 2, 2024.

The beta is intended primarily for review of character property data and changes to algorithm specifications (Unicode Standard Annexes). Also, for the first time, a complete draft of the core specification text is available for review during the beta period.

At this phase of a release, the character repertoire is considered stable. For this release, 5,185 new characters will be added, bringing the total number of encoded characters in Unicode 16.0 to 154,998. The new additions include seven new scripts:
  • Garay is a modern-use script from West Africa
  • Gurung Khema, Kirat Rai, Ol Onal, and Sunuwar are four modern-use scripts from Northeast India and Nepal
  • Todhri is an historic script used for Albanian
  • Tulu-Tigalari is an historic script from Southwest India
Other character additions include seven new emoji characters plus 3,995 additional Egyptian Hieroglyphs and over 700 symbols from legacy computing environments. See the delta code charts for details on all the new scripts and characters.

In addition to new characters, new “Moji Jōhō Kiban” (文字情報基盤) Japanese source references will be added for over 36,000 CJK unified ideographs. This will be reflected in the code charts for virtually all CJK unified ideograph blocks by additional representative glyphs in the “J” column. Note that these glyph additions are not reflected in the delta charts mentioned above, but can be seen in the main (“single-block”) charts for the Unicode 16.0 Beta.

Various changes to properties, algorithms, and Unicode Standard Annexes will be made for Unicode 16.0. This version will add two new Unicode Standard Annexes:
  • UAX #53, Unicode Arabic Mark Rendering, provides a specification for interoperable font and shaping implementations for Arabic script. (This was previously published separately from the Unicode Standard as a technical report.)
  • UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet), provides data essential for understanding the identity of over 5,100 Egyptian Hieroglyph characters encoded in Unicode 16.0. (This is similar to data for CJK unified ideographs provided in UAX #38.)
A new UCD file, DoNotEmit.txt, will provide data in machine readable form that can be useful for software implementations but that previously was provided only as tables within the core specification text. See the Unicode 16.0 Beta landing page for other noteworthy property and algorithm changes.

For full details regarding the Beta, see Public Review Issue #502. Feedback should be reported under PRI #502 using the Unicode contact form by July 2, 2024.



Monday, May 20, 2024

Unicode CLDR Version 46 Submission Open

The Unicode CLDR Survey Tool is open for submission for version 46. CLDR provides key building blocks for software to support the world’s languages (dates, times, numbers, sort-order, etc.). All major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

Version 46 is focusing on:
  • Unicode 16 additions: new emoji, script names, collation data (Chinese & Japanese), …
  • Emoji search keywords: Expanding keyword coverage to make it easier for users to find the right emoji
  • New Languages targeting Basic:
    • Ewe (ee)
    • Ga (gaa)
    • Kinyarwanda (rw)
    • Northern Sotho (nso)
    • Oromo (om)
    • Sesotho (st)
    • Setswana (tn)
  • Up-leveling: Akan (ak)
Submission of new data opened recently, and is slated to finish on June 11. The new data then enters a vetting phase, where contributors work out which of the supplied data for each field is best. That vetting phase is slated to finish on July 1. A public alpha makes the draft data available around August 28, and the final release targets October 16.

Each new locale starts with a small set of Core data, such as a list of characters used in the language. Submitters of those locales need to bring the coverage up to Basic level (very basic dates, times, numbers, and endonyms) during the next submission cycle.

Once a language reaches Basic coverage, it has the minimum support for use in language selection, such as on mobile devices. In the next submission cycle, the name for that language is also added for translation for all languages at Modern coverage.

If you would like to contribute missing data for your language, see Survey Tool Accounts. For more information on contributing to CLDR, see the CLDR Information Hub.



Thursday, May 2, 2024

Unicode Technical Committee (UTC) Updates from Meeting #179

by Peter Constable, UTC Chair

The Unicode Technical Committee (UTC) met last week (April 23 to 25) in San Jose, California. Thanks to Unicode member company Adobe for hosting. Here are some highlights from the large number of items that were covered.

Preparing Unicode 16.0 Beta

An important objective was to cover all technical decisions that would be needed for the Unicode 16.0 Beta preview. The Beta will be available for public review and comment on May 21, 2024, and will include all charts, data and annexes for The Unicode Standard as well as other synchronized standards, including UTS #10, Unicode Collation Algorithm, and UTS #51, Unicode Emoji. Also, for the first time, the Beta release will include a complete draft of the core text of the standard.

The character repertoire for Unicode 16.0 was slightly adjusted, with the removal of two characters: U+0CDC KANNADA ARCHAIC SHRII and U+0C5C TELUGU ARCHAIC SHRII. These characters were first approved in January 2022 (UTC #170) and assigned for addition in Unicode 16.0 in April 2023 (UTC #175). However, in the ISO process for Amendment 2 of ISO/IEC 10646:2022 (which is to be synchronized with Unicode 16.0), the India national body requested more time for review by experts in India. To avoid a risk of Unicode 16.0 and Amendment 2 of 10646 not being in sync, UTC decided to delay these two characters for a later version.

Various character property (UCD) and algorithm changes were made based on issues reported during the Alpha review or found while the UTC Properties and Algorithms Working Group prepared data files for 16.0. Two notable areas for changes are grapheme cluster segmentation (UAX #29) and line breaking (UAX #14):
  • For grapheme clusters, some changes will be made to extended grapheme cluster segmentation for improved handling of orthographic syllables in Indic scripts.
  • For line breaking, several changes will be made to data and rules to fix various edge cases, and to incorporate behaviour for hyphens that has already been implemented in CLDR and ICU for several years.
Also related to properties, the organization of the ScriptExtensions.txt file will be changing. Previously, lines of data were grouped by characters that had the same script extension property values. Going forward, lines will be ordered by code point. (This is only a change in the order the data is listed; the parsing of lines is unchanged.) This will make it much easier to compare changes in property values between different Unicode versions.
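To see why a reordering does not affect parsers, here is a minimal sketch of reading lines in the `code point or range ; script codes # comment` layout used by UCD files such as ScriptExtensions.txt. The function name `parse_script_extensions` and the sample lines are hypothetical. The sample values are made up for illustration and are not copied from the actual data file.

```python
def parse_script_extensions(lines):
    """Parse 'XXXX[..YYYY] ; Scr1 Scr2 # comment' lines.

    Returns (first, last, scripts) tuples sorted by first code point,
    so the result does not depend on the order of the input lines.
    """
    entries = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank line or comment-only line
        cp_field, scripts_field = (f.strip() for f in data.split(";", 1))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        entries.append((start, end, frozenset(scripts_field.split())))
    return sorted(entries, key=lambda e: e[0])

# Hypothetical sample data: the same three records in two orders.
grouped_by_value = [
    "# illustrative values only",
    "1CD0..1CD2    ; Beng Deva # Mn",
    "0951          ; Beng Deva # Mn",
    "0640          ; Arab Syrc # Lm",
]
ordered_by_code_point = [
    "0640          ; Arab Syrc # Lm",
    "0951          ; Beng Deva # Mn",
    "1CD0..1CD2    ; Beng Deva # Mn",
]

same = (parse_script_extensions(grouped_by_value)
        == parse_script_extensions(ordered_by_code_point))
print(same)
```

Because each line is parsed independently, a consumer written this way sees identical data under either ordering. The code-point ordering mainly helps people and line-based diff tools that compare versions of the file.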

In relation to emoji, the set of new emoji for version 16.0 is unchanged. During the Beta review, the draft update for UTS #51, Unicode Emoji, will include some proposed revisions related to recommendations for display of emoji family combinations. These revisions have not yet been reviewed and approved by UTC, so will require careful review and will be subject to confirmation or change at the next UTC meeting, after the Beta review period is over.

UTC action item backlog

UTC has had a growing backlog of open action items, some over ten years old. For this meeting, the various UTC working groups triaged their action items that were five or more years old, and outcomes were discussed by the UTC. Some action items were completed; some were closed as no longer relevant. Many that required more research were closed as UTC action items and replaced by issues in the relevant working group’s GitHub repo. Note that tracking them in this other way doesn’t necessarily mean they will get higher priority. However, since the working groups are using GitHub issues to organize their regular work, this should bring more attention to these issues. UTC will repeat this process at UTC #181, six months from now.

As a side effect of this review of old action items, a document was submitted to UTC (L2/24-123) proposing that UTC transition from the way it has handled action items in the past to tracking issues in a public GitHub repo to allow contributions from a broader set of volunteers. That document identifies some problems and limitations of the existing processes, and suggests that a new process could provide improvements. UTC spent some time discussing this document. It was noted that the idea was valuable, though such a change in processes would not be a small change and would involve some not-so-obvious challenges. It would also be something that affects the Unicode Consortium as a whole, not just UTC. For that reason, this proposal will need to be considered as part of a broader discussion of Consortium processes, resources and infrastructure.

New investigation: automatic space handling at inter-script boundaries

East Asian text often combines different scripts, and a common typographic practice is to insert space between script runs. UTC briefly discussed a new document, L2/24-057, which proposes development of an algorithm for automatic spacing between script runs. The Properties and Algorithms Working Group has assembled experts to discuss this topic. Interested experts are invited to participate in discussion via issues (with "auto-spacing" label) in the public unicodetools repo in GitHub.



SILICON Joins as Supporting Member of the Unicode Consortium

The Unicode Consortium is pleased to announce that SILICON has joined as a Supporting Member.

The Stanford Initiative on Language Inclusion & Conservation in Old & New Media (SILICON) is a humanities-led tech initiative at Stanford University aiming to promote and sustain Digitally Disadvantaged Languages and, more broadly, address digital inequalities. Bridging gaps between Engineering, the Humanities, Computer Science, and the Social Sciences, the initiative seeks to help build tomorrow’s digital tools: improved OCR algorithms and AI generative text models; more globally inclusive text corpora, interfaces, keyboards, and digital fonts.

SILICON is interested in accelerating the timeline for digitally disadvantaged languages to be fully usable by their communities, by facilitating ongoing conversation between people involved in Unicode’s encoding work, designers of the fonts and keyboards, script and language communities, and technical experts, linguists, and technologists. We will also be working towards usable OCR for newly-encoded languages, with an eye towards developing corpora for LLM training.

“In the 21st century, the intertwining fate of language death and digital exclusion underscores a critical challenge: the marginalization and potential extinction of diverse linguistic heritage. With over 98% of the world’s ~7000 languages categorized as ‘Digitally Disadvantaged Languages’ by the Unicode Consortium, the urgency to bridge this digital divide is unmistakable. SILICON is delighted to support the pivotal role played by Unicode, long at the forefront of advancing the cause of Digitally Disadvantaged Languages globally.” - Tom Mullaney, Professor of History at Stanford University and Co-Director of SILICON 

“We are excited to welcome SILICON as a Supporting member of the Unicode Consortium. By integrating SILICON’s interdisciplinary expertise, we look forward to working together to advance digital inclusiveness.” - Toral Cowieson, CEO of Unicode

Supporting members of the Consortium have a half vote as well as representation on up to two technical committees. A list of Consortium members can be found here.

