The Unicode Blog: 2020

Friday, December 18, 2020

Unicode 2020 Bulldog Award

The 2020 Unicode Bulldog Award recognizes Kristi Lee for her significant contributions to the work of the Unicode Consortium’s CLDR Technical Committee. Upon joining the CLDR committee as Microsoft’s representative, Kristi quickly focused on improving CLDR development and release processes, enabling the CLDR team to work far more efficiently and effectively. This has improved the functionality of the CLDR Survey Tool, and thus better serves the users of CLDR releases. Among many other improvements, she instituted and organized periodic CLDR face-to-face meetings where the team can focus on strategic planning. Through all these efforts, Kristi has brought strong leadership to enable more streamlined development and a better focus on future directions.

In 2020, Kristi was formally made Vice Chair of the CLDR Technical Committee, a role she had effectively filled for some time!

Thursday, October 29, 2020

ICU 68 Released

Unicode® ICU 68 has just been released. ICU 68 updates to CLDR 38 locale data with many additions and corrections. ICU 68 brings support for locale-dependent smart unit preferences (road distance, temperature, etc.), implements locale ID canonicalization conformant with CLDR, and includes many other bug fixes and enhancements.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see http://site.icu-project.org/download/68.

Over 140,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Wednesday, October 28, 2020

Unicode CLDR Language Data v38 released

The final release of Unicode CLDR version 38 is now available. Unicode CLDR provides an update to the key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR v38 focused on enhancing support for existing locales: it adds support for units of measurement in inflected languages (phase 1), and adds annotations (names and search keywords) for about 400 more non-emoji symbols as well as for Emoji v13.1. In this version, there is also substantially higher coverage for (in order of completeness): Norwegian Nynorsk, Hausa, Igbo, Breton, Quechua, Yoruba, Fulah (Adlam script), Chakma, Asturian, Sanskrit, and Dogri.

The Survey Tool has improvements in performance, and introduced structured forum requests to improve coordination among translators. We would like to thank the 393 language experts who contributed to this release.

There are some changes that affect existing specifications and data: for example, the plural rules for French changed to add a new category; the specification for using aliases is more rigorous, and some alias data has changed — along with the specification for handling locale identifier canonicalization. For more information, see Migration.
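
As a rough illustration of what alias-based canonicalization of locale identifiers involves, here is a minimal Python sketch. It is not the LDML algorithm: the alias table is a tiny hand-picked assumption (three well-known deprecated language codes), the function name canonicalize is hypothetical, and the real specification also covers script, region, and variant aliases, multi-subtag rules, and extension handling.

```python
# Tiny illustrative alias table (an assumption, not the CLDR alias data).
LANGUAGE_ALIASES = {"iw": "he", "in": "id", "ji": "yi"}

def canonicalize(tag: str) -> str:
    """Replace a deprecated language subtag and normalize subtag casing."""
    subtags = tag.replace("_", "-").split("-")
    language = subtags[0].lower()
    result = [LANGUAGE_ALIASES.get(language, language)]
    in_extension = False
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            # A singleton such as "u" starts an extension; keep the rest lowercase.
            in_extension = True
        if not in_extension and len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())   # script: Title case
        elif not in_extension and len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())   # region: UPPER case
        else:
            result.append(subtag.lower())
    return "-".join(result)

print(canonicalize("iw_IL"))
print(canonicalize("ZH-hant-tw"))
```

Production code should rely on an implementation such as ICU rather than a hand-written table like this one.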

The overall changes to the data items were:

Added	Deleted	Changed
155,131	33,805	45,895

See additional details in the CLDR v38 Release note.


Friday, October 23, 2020

Announcing ICU4X 0.1

We are thrilled to announce the first pre-release version of the ICU4X internationalization components. ICU4X aims to provide high quality internationalization components with a focus on:

Modularity
Flexible data management
Performance, memory, safety and size
Universal access from programming languages and ecosystems (FFI)

ICU4X draws from the experience of projects such as ICU4C, ICU4J, ECMA-402, CLDR, and Unicode.

Target

ICU4X is initially focusing on a subset of internationalization APIs standardized in ECMA-402 in order to cover the needs of client-side ecosystems and thin clients.

ICU4X targets a wide range of programming languages and environments, aiming to expose its APIs to languages such as JavaScript, WebAssembly, Dart, C++, Python, PHP, and others.

Given our focus on client-side ecosystems, a lot of effort will go into minimizing size, memory, and CPU utilization, and into allowing for asynchronous data management.

More information on the design can be found in the project’s Announcement article.

Status

This first pre-release 0.1 version is written in Rust and introduces a small subset of APIs and scaffolding for flexible data management.

We would like to invite everyone to try it out. Take a look at the documentation and provide feedback on the API design. We’re also looking for feedback on the algorithms and data structures we use, especially from contributors with experience in Rust and ICU algorithms.

More information on the release can be found in the Release Notes.

Roadmap

The next version, 0.2, will focus on validating the ability to expose ICU4X APIs to other programming environments and extending the data management system to be asynchronous.

The project is fully open source and invites all interested parties to join the effort of designing and developing a modular internationalization components system in Rust.

To learn more on how to contribute to the project, visit the CONTRIBUTE document in the project’s repository.


Friday, October 9, 2020

Unicode CLDR Locale Data v38 beta available for testing

The beta version of Unicode CLDR version 38 is now available. The data will not be changed except for showstoppers, but the LDML v38 spec can still be changed. The final release of v38 is planned for October 28, 2020. If you find any problems, please file a ticket.

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR v38 includes:

Enhancements to existing locale data: adding support for units of measurement in inflected languages (phase 1), adding annotations (names and search keywords) for Unicode symbols that are non-emoji (~400), and annotations for Emoji v13.1.
Survey Tool upgrades: substantial performance improvements, plus structured forum entries to improve coordination among translators.

LDML v38 includes:

To make the canonicalization of locale identifiers clear and unambiguous, the specification for it was substantially restructured. (This was done in concert with fixes to the alias data so that the data works better with the specification.)
To support inflected units of measurement:
- minimalPairs adds new elements caseMinimalPairs and genderMinimalPairs
- unit adds a new element gender
- grammaticalData adds new elements grammaticalDerivations, deriveCompound, and deriveComponent
- unitPattern adds a new attribute case
- grammaticalCase, grammaticalGender, grammaticalDefiniteness add a new attribute scope
- compoundUnitPattern1 adds new attributes case and gender
- compoundUnitPattern adds a new attribute case
To allow for overriding dictionary-based segmentation breaks, added the Unicode Dictionary Break Exclusion Identifier, with the new key “dx”.
For picking the correct units of measurement for locales, defined the userPreferences skeleton more precisely.
For accurate plural categories in compact numbers, added the 'c' operand to plural rules to provide formatting for languages such as French.
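
To make the role of the compact-exponent operand more concrete, here is a small Python sketch of how a French plural category might be selected. The rule in the code is a paraphrase written for this example and is an assumption, not a copy of the CLDR data; the function name and parameters (i for the integer value, v for the number of visible fraction digits, c for the compact exponent) are likewise illustrative. Check the plural rules shipped with CLDR v38 for the authoritative form.

```python
def french_plural_category(i: int, v: int = 0, c: int = 0) -> str:
    """Pick a plural category from the integer value i, the count of visible
    fraction digits v, and the compact decimal exponent c (0 if not compact).

    Assumed rule, paraphrased for illustration only:
      one:  i is 0 or 1
      many: (c == 0 and i != 0 and i % 1000000 == 0 and v == 0) or c not in 0..5
      other: everything else
    """
    if i in (0, 1):
        return "one"
    if (c == 0 and i != 0 and i % 1_000_000 == 0 and v == 0) or c not in range(0, 6):
        return "many"
    return "other"

print(french_plural_category(2))
print(french_plural_category(1_000_000))
print(french_plural_category(1_200_000, c=6))
```

The third call models a number shown in compact form with an exponent of 6, which is the case the new operand is meant to distinguish.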

See additional details in the draft CLDR v38 Release note.

The overall changes to the data items were:

Added	Deleted	Changed	Total
155,131	33,805	45,895	2,175,821


Friday, September 18, 2020

Emoji 13.1 — Now final, to be widely available in 2021

Emoji 13.1 is now final with 217 new emoji sequences! Of these, 210 are skin tone variants; the other seven are new emoji.

Most of the skin tone variants are for the multi-person emoji groupings couples with heart and couples kissing.

This minor release was created to add new emoji before 2022. The Unicode Consortium is a volunteer organization and we would be completely without new emoji in 2021 if it weren’t for the dedication of many volunteers who make this possible. Thank you! ✨

The new emoji are listed in Emoji Recently Added v13.1. The images provided on that page are just samples: vendors for mobile phones, PCs, and web platforms create their own images.

New emoji in this release should begin appearing on devices in the coming months. These new emoji will also be available for adoption. Donations for adoptions help the Unicode Consortium’s work on digitally disadvantaged languages.

For implementers:

There are no new atomic characters. Instead, each emoji is a sequence of existing characters.
UTS #51 and associated data files have been updated for Emoji 13.1.
CLDR v38 alpha has also been updated for Emoji 13.1. This includes names, search keywords, and sort orderings for the new emoji, available for over 80 languages. It is scheduled for release at the end of October.


Tuesday, September 15, 2020

Unicode CLDR Locale Data v38 alpha available for testing

The alpha version of Unicode CLDR version 38 is now available for data testing. The final release of v38 is planned for October 22, 2020. If you find any problems with the data, please file a ticket.

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR v38 includes:

Enhancements to existing locale data: adding support for units of measurement in inflected languages (phase 1), adding annotations (names and search keywords) for Unicode symbols that are non-emoji (~400), and annotations for Emoji v13.1.
New locales added: Dogri and Sanskrit.
Survey Tool upgrades: substantial performance improvements, plus structured forum entries to improve coordination among translators.

See additional details in the draft CLDR v38 Release note.

The overall changes to the data items were:

Added	Deleted	Changed
155,131	33,805	45,895


Tuesday, September 1, 2020

Emoji 15.0 Submissions Re-Open April 15, 2021

The Unicode Consortium is postponing the submissions of new emoji for Unicode version 15.0 until April 15, 2021. This delay follows on the postponement of the release of the upcoming Unicode 14.0 version from March to September 2021.

This delay impacts related specifications and data, such as new emoji characters. As a consequence, the deadline for submission of new emoji character proposals for Emoji 14.0 was extended until September 1, 2020.

Pausing Processing of New Emoji Proposals ⏸️

The Emoji Subcommittee is in the process of revising the submission form. Until the new submission form is ready on April 15, 2021, proposals will be returned to sender. During this period the committee will also be prioritizing Emoji 15.0 initiatives as described in document L2/20-197.

Submissions for Emoji 15.0 Open April 2021 ▶️

The Emoji Subcommittee will be accepting new emoji character proposals for Emoji 15.0 from April 15, 2021 onward. Any new emoji characters incorporated into Emoji 15.0 can be expected to appear on devices such as computers, phones, and tablets in 2023.

Edited 2021-03-31 to reflect modification of the opening date from April 2 to April 15.


Thursday, August 20, 2020

Tableaux des caractères Unicode 13.0 désormais disponibles en langue française

Les tableaux des caractères Unicode 13.0 en langue française sont désormais disponibles sur le site web d’Unicode. Après un long travail de traduction réalisé par des experts francophones (du Canada, de France et de Belgique), une grande partie du système proposé aux locuteurs anglophones pour l’accès en ligne aux tableaux de caractères (https://www.unicode.org/charts/) a été reproduite en langue française pour les utilisateurs francophones et est disponible sous ce lien : https://www.unicode.org/charts/fr/. Cette page du site propose un accès aux différents blocs définis dans les tableaux des caractères Unicode 13.0, rangés par catégorie (écritures, symboles, ponctuation, etc.). La recherche par code hexadécimal d’un caractère est également proposée sur cette page. Et une recherche par nom de caractère est possible sur cette autre page : https://www.unicode.org/charts/fr/charindex.html (un clic sur le lien intitulé « Index des noms » vous y conduira directement).

Les tableaux des caractères Unicode 13.0 en langue française sont également disponibles sous la forme d’un fichier unique à cette adresse : https://www.unicode.org/Public/13.0.0/charts/fr/ ; il n’est toutefois pas prévu de fournir des tableaux en langue française mettant en lumière les caractères ajoutés au répertoire de la version actuelle (c’est-à-dire des fichiers équivalents à ceux que l’on trouve sous ce lien : https://www.unicode.org/charts/PDF/Unicode-13.0/).

Ces tableaux sont également accessibles depuis : https://www.unicode.org/versions/Unicode13.0.0/#Code_Charts.

Marc Lodewijck a été le principal contributeur à la réalisation des tableaux de caractères en langue française pour la version 13.0 d’Unicode, un travail auquel ont largement participé, en particulier, Patrick Andries, Alain LaBonté, Michel Suignard et François Yergeau, ainsi que quelques autres personnes.

Avertissement : la fourniture des tableaux des caractères Unicode 13.0 en langue française n’implique nullement que le Consortium Unicode créera de tels tableaux (en français ou dans d’autres langues que l’anglais) pour les versions à venir du standard Unicode. Contrairement aux noms des caractères Unicode en langue anglaise, leurs équivalents en langue française ne constituent pas un élément normatif du standard Unicode.

Unicode 13.0 code charts now available in French

The Unicode 13.0 code charts are now also available in French on the Unicode web site. Following extensive translation work by French-speaking experts (from Canada, France, and Belgium), a large part of the online code chart mechanism available to English speakers at https://www.unicode.org/charts/ has been duplicated in French at https://www.unicode.org/charts/fr/. That page gives access to the various blocks defined in the Unicode 13.0 code charts, grouped by category (scripts, symbols, punctuation, etc.). Search by hex code is also available on the same page, and an index of character names is available on the following page: https://www.unicode.org/charts/fr/charindex.html (clicking on the link labeled “Index des noms” will take you straight to it).

Access to the Unicode 13.0 version of the French-language archival code charts (single file) is also available at https://www.unicode.org/Public/13.0.0/charts/fr/; however there is no plan to provide a French version of the delta code charts (equivalent to https://www.unicode.org/charts/PDF/Unicode-13.0/).

These code charts are also accessible from: https://www.unicode.org/versions/Unicode13.0.0/#Code_Charts.

Marc Lodewijck was the main contributor to the creation of the French-language Unicode 13.0 code charts, with substantial help from Patrick Andries, Alain LaBonté, Michel Suignard, François Yergeau, and a few other people.

Disclaimer: Providing these French-language code charts for Unicode 13.0 does not imply that the Unicode Consortium will create such code charts (in French or in any other language besides English) for future versions of the Unicode Standard. Unlike Unicode character names in English, their French-language equivalents are not a normative part of the Unicode Standard.


Thursday, June 18, 2020

Unicode Regular Expressions v21 Released

Regular expressions are a powerful tool for using patterns to search and modify text, and are vital in many programs, programming languages, databases, and spreadsheets.

Starting in 1999, UTS #18: Unicode Regular Expressions has supplied guidelines and conformance levels for supporting Unicode in regular expressions. The new version 21 broadens the scope of properties for regular expressions (regex) to allow for properties of strings (such as for emoji sequences). For example, the following matches all emoji flags except the French flag:

/[\p{RGI_Emoji_Flag_Sequence}--\q{🇫🇷}]/
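
Many regex engines do not yet support properties of strings or the \q{...} syntax. As a hedged approximation of the same set difference, the Python sketch below treats any pair of regional indicator symbols as a flag, which is an assumption: the real RGI_Emoji_Flag_Sequence property is a specific list of sequences, not every possible pair. The helper names flag and flags_except are hypothetical.

```python
import re

# Any two consecutive regional indicator symbols (U+1F1E6..U+1F1FF).
# Assumption: this over-approximates RGI_Emoji_Flag_Sequence.
FLAG_PAIR = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

def flag(region: str) -> str:
    """Build a flag sequence from a two-letter region code such as 'FR'."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region.upper())

def flags_except(text: str, excluded: set) -> list:
    """Return the flag pairs found in text, minus the excluded strings."""
    return [m for m in FLAG_PAIR.findall(text) if m not in excluded]

sample = flag("FR") + " " + flag("DE") + flag("JP")
print(flags_except(sample, {flag("FR")}))
```

Pairs are extracted first and filtered afterwards so that a match never starts in the middle of a flag, which a character-level lookahead could allow.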

Among the improvements are:

Provides a new Annex D: Resolving Character Classes with Strings for handling negations of sets of strings.
Updates the full property list to include the latest UCD properties, plus Emoji properties and UTS #39 properties.
Removes obsolete text passages, and makes editorial changes for clarity.


Unicode Consortium Announces New Additions to Leadership Team

We are pleased to announce the following leadership additions at the Unicode Consortium. “Each of these individuals brings deep expertise in their field,” said Mark Davis, president of the Consortium. “They have already made significant improvements in their new roles.”

Unicode Emoji Subcommittee

Chair: Jennifer Daniel

Jennifer Daniel’s first contribution to Unicode was standardizing gender inclusive representations in emoji. As a designer, author and former graphics editor at the New York Times, she now explores communication and messaging through verbal, written, auditory and visual expression at a small ad company called Google. Jennifer is a co-author and illustrator of a number of graphics books including How to Be Human, Space!, and the Origins of Almost Everything. Her work has been recognized by the Walker Art Museum, Society of Illustrators and published in the New Yorker, The Washington Post, and Time Magazine to name a few. She has had the honor to serve as a judge for the Society of News Design, Online News Association, Society of Illustrators, American Illustration, Data is Beautiful and the Art Director's Club. She lives in Berkeley, California but also in cyberspace.

Vice Chair: Ned Holbrook

Ned Holbrook is a typographic engineer at Apple, specializing in text layout and fonts. He was one of the participants in the industry-wide effort to standardize variable font technology in OpenType. He previously worked on wireless networking, virtualization, digital audio, embedded graphics, and remote filesystems.

Unicode CLDR Committee

Vice Chair: Kristi Lee

Kristi Lee is the CLDR technical committee vice-chair, and she represents Microsoft in the CLDR technical committee. She joined Microsoft in 1997 and has worked in a number of different divisions and product development groups. Her focus has been delivering solutions to international customers in localization and internationalization. She holds a mathematics degree from University of Washington. Currently, she is in the Corporate division in Microsoft and works with engineering groups across Microsoft including Windows, .NET, Office, and others on topics relating to CLDR and i18n.

Executive Officer

General Counsel: Anne Gundelfinger

Anne is an experienced legal executive with 30 years in private practice and in-house legal roles. From 2013-2019 she served as vice president for global intellectual property for Swarovski, a global fashion jewelry brand based in central Europe. Before that she held various positions over a decade in the Intel legal department including vice president for global public policy, vice president for global sales & marketing legal affairs, and director of trademarks & brands. Early in her career she was an associate at Fenwick & West and director of trademarks at Sun Microsystems. Since retiring from Swarovski, Anne has been a consultant and has served as a World Intellectual Property Organization domain name panelist under the Uniform Dispute Resolution Policy of ICANN. Anne has long been a leader in the global IP bar. She served on the Board of Directors of the International Trademark Association for nearly a decade and served as the Association’s president in 2005.

Mark Davis, the former chair of the emoji subcommittee, will continue to contribute to the emoji subcommittee and serve as president of the Unicode Consortium. “I’d also like to thank John Emmons for his many years of service as chair and vice chair of the CLDR technical committee,” said Davis. “Especially for his work in promoting support for digitally disadvantaged languages.”


Friday, June 12, 2020

Unicode 13.0 Paperback Available

The Unicode 13.0 core specification is now available in paperback book form with a new, original cover design by Huijun Shan. This edition consists of a pair of modestly priced print-on-demand volumes containing the complete text of the core specification of Version 13.0 of the Unicode Standard.

Each of the two volumes is a compact 6×9 inch US trade paperback size. The two volumes may be purchased separately or together, although they are intended as a set. Please visit the separate description pages for Volume 1 and Volume 2 to order each volume in the set. The cost for the pair is US $29.58, plus shipping and taxes (if applicable).

Note that these volumes do not include the Version 13.0 code charts, nor do they include the Version 13.0 Standard Annexes and Unicode Character Database, which are all freely available on the Unicode website.

Purchase The Unicode Standard, Version 13.0 - Core Specification Volume 1 and Volume 2


Wednesday, June 10, 2020

PRI #418: Registration of additional sequences in the MSARG collection

The Unicode Consortium has posted a new issue for public review and comment.

Public Review Issue #418: A submission for the “Registration of additional sequences in the MSARG collection” has been received by the IVD registrar.

This submission is currently under review according to the procedures of UTS #37, Unicode Ideographic Variation Database, with an expected close date of 2020-09-11. Please see the submission page for details and instructions on how to review this issue and provide comments:

https://www.unicode.org/ivd/pri/pri418/

The IVD (Ideographic Variation Database) establishes a registry for collections of unique, and sometimes shared, variation sequences for ideographs, which enables standardized interchange in plain text, in accordance with UTS #37.

Friday, April 24, 2020

ICU 67 Released

Unicode® ICU 67 has just been released. ICU 67 updates to CLDR 37 locale data with many additions and corrections. This release also includes the updates to Unicode 13, subsuming the special CLDR 36.1 and ICU 66 releases. ICU 67 includes many bug fixes for date and number formatting, including enhanced support for user preferences in the locale identifier. The LocaleMatcher code and data are improved, and number skeletons have a new “concise” form that can be used in MessageFormat strings.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details, please see http://site.icu-project.org/download/67.

Thursday, April 23, 2020

Unicode Locale Data v37 released!

The final version of Unicode CLDR version 37 is now available. It focuses on adding new locales, enhancing support for units of measurement, adding annotations (names and search keywords) for symbols, and adding annotations for Emoji v13.

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

Expanded locale preferences for units of measurement. The new unit preference and conversion data allows formatting functions to pick the right measurement units for the locale and usage, and accurately convert input measurement into those units.
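
The idea of picking a unit by region and usage and then converting into it can be sketched in a few lines of Python. Everything in the tables below is an illustrative assumption rather than CLDR data (the real data is richer and is keyed differently), and the function names are hypothetical; only the conversion factors are standard definitions.

```python
# Standard definitions: 1 km = 1000 m, 1 mile = 1609.344 m.
METERS_PER_UNIT = {"kilometer": 1000.0, "mile": 1609.344}

# Illustrative preference table (an assumption, not CLDR data).
# "001" stands for the world and acts as the fallback.
PREFERENCES = {
    ("road", "US"): "mile",
    ("road", "GB"): "mile",
    ("road", "001"): "kilometer",
}

def preferred_unit(usage: str, region: str) -> str:
    """Look up the unit for a usage and region, falling back to the world default."""
    return PREFERENCES.get((usage, region), PREFERENCES[(usage, "001")])

def convert_for_region(meters: float, usage: str, region: str):
    """Convert a length in meters into the preferred unit for the region."""
    unit = preferred_unit(usage, region)
    return meters / METERS_PER_UNIT[unit], unit

print(convert_for_region(5000, "road", "FR"))
print(convert_for_region(1609.344, "road", "US"))
```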

Emoji 13.0. The emoji annotations (names and search keywords) for the new Unicode 13.0 emoji are added. The collation sequences are updated for new Unicode 13.0, and for emoji.

Annotations (names and keywords) expanded to cover more than emoji. This release includes a small set of Unicode symbols (arrow, math, punctuation, currency, alphanum, and geometric) with more to be added in future releases. For example, see v37/annotations/romance.html.

New locales. New languages at Basic coverage: Fulah (Adlam), Maithili, Manipuri, Santali, Sindhi (Devanagari), Sundanese. New languages at Modern coverage: Nigerian Pidgin. See Locale Coverage Data for the coverage per locale, for both new and old locales.

Grammatical features added. Grammatical features are added for many languages, a first step toward allowing programmers to format units according to grammatical context (e.g., the dative form of "3 kilometers").

Updates to code sets. In particular, the EU is updated (removing GB).

For more details, access to the data and charts, and important notes for smoothly migrating implementations, see Unicode CLDR Version 37.

Friday, April 10, 2020

Technical Alert: Unicode Technical Website Down

TECHNICAL ALERT: the Unicode Consortium's technical website is hosted in a data center that has experienced a catastrophic failure. We are working to get back online, but this may take a couple of weeks. We apologize for the inconvenience. BTW: this failure occurred after we announced we are delaying the release of Unicode 14.0.

Wednesday, April 8, 2020

Unicode 14.0 Delayed for 6 Months

Due to COVID-19, the Unicode Consortium has decided to postpone the release of version 14.0 of the Unicode Standard by 6 months, from March to September of 2021. This delay will also impact related specifications and data, such as new emoji characters.

The Unicode Consortium relies heavily on the efforts of volunteers. “Under the current circumstances we’ve heard that our contributors have a lot on their plates at the moment and decided it was in the best interests of our volunteers and the organizations that depend on the standard to push out our release date,” said Mark Davis, President of the Consortium. “This year we simply can’t commit to the same schedule we’ve adhered to in the past.”

ICU and CLDR to stay on schedule

The two other main Unicode projects, ICU and CLDR, are maintaining their 6-month cycles for releases in the spring and fall, although the feature sets this year may be lighter. The CLDR project supplies language- and locale-specific data and specifications, while the ICU project supplies internationalization code libraries that allow operating systems and applications to use Unicode and CLDR data and specifications. These projects are impacted less by current conditions since they have always operated via virtual meetings and are more compartmentalized, meaning that it is easier to withhold a particular feature if it falls behind schedule without jeopardizing the whole release. Sub-projects of CLDR and ICU, such as the CLDR Message Formatting project, will also be little affected.

Emoji

This announcement does not affect the new emoji included in Unicode Standard version 13.0 announced on March 10, 2020.

Because of the lead time for developers to incorporate emoji into mobile phones, emoji that are finalized in January don’t appear on phones until the following September or so. For example, the emoji that were included in Release 13.0 in March 2020 won’t generally be on phones until the fall of 2020. With the delay of the release of Unicode 14.0, the deadline for submission of new emoji character proposals for Emoji 14.0 is also being postponed until September 2020.

The Consortium is considering whether it is feasible to release emoji sequences in an Emoji 13.1 release. These sequences make use of existing characters. An example from Emoji 13.0 is the black cat, which is internally a combination of the cat emoji and black large square emoji. Since sequences rely only on combinations of existing characters in the Unicode Standard, they can be implemented on a separate schedule, and don’t require a new version of Unicode or the encoding of new characters. Such an Emoji 13.1 release would be in time for release on mobile phones in 2021.
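
The black cat example can be inspected directly. The short Python sketch below lists the code points that make up the sequence; the two visible characters are joined by U+200D ZERO WIDTH JOINER. The helper name describe is hypothetical, and the character names come from Python's bundled unicodedata module.

```python
import unicodedata

# Black cat: CAT + ZERO WIDTH JOINER + BLACK LARGE SQUARE.
BLACK_CAT = "\U0001F408\u200D\u2B1B"

def describe(sequence: str) -> list:
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in sequence]

for code_point, name in describe(BLACK_CAT):
    print(code_point, name)
```

All three characters predate Unicode 13.0, which is why such a sequence can be supported without encoding anything new.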

The Emoji Subcommittee will be accepting new emoji character proposals for Emoji 14.0 from June 15, 2020 until September 1, 2020. Any new emoji characters incorporated into Emoji 14.0 would appear on phones and other devices in 2022.


Wednesday, March 11, 2020

ICU 66 Released

Unicode® ICU 66 has been released. It updates to Unicode 13, including new characters, scripts, emoji, and corresponding API constants. It also updates to CLDR 36.1 with Unicode 13 updates and bug fixes.

These new, extra Q1 releases are for integration by vendors that could not otherwise release their products with the newest version of Unicode. These are low-impact releases with no other significant feature additions or implementation changes. The next feature releases will be CLDR 37 and ICU 67, scheduled for 2020 April.

For details please see http://site.icu-project.org/download/66.

Tuesday, March 10, 2020

Announcing The Unicode® Standard, Version 13.0

Version 13.0 of the Unicode Standard is now available, including the core specification, annexes, and data files. This version adds 5,930 characters, for a total of 143,859 characters. These additions include four new scripts, for a total of 154 scripts, as well as 55 new emoji characters.

The new scripts and characters in Version 13.0 add support for modern language groups in Africa, Pakistan, South Asia, and China:

Arabic script additions to write Hausa, Wolof, and other languages in Africa, and other additions to write Hindko and Punjabi in Pakistan
A character for Syloti Nagri in South Asia
Bopomofo additions for Cantonese

Support for scholarly work was extended worldwide, including:

Yezidi, historically used in Iraq and Georgia for liturgical purposes, with some modern revival of usage
Chorasmian, historically used in Central Asia across Uzbekistan, Kazakhstan, and Turkmenistan to write an extinct Eastern Iranian language
Dives Akuru, historically used in the Maldives until the 20th century
Khitan Small Script, historically used in northern China

Popular symbol additions include:

55 emoji characters, including several new emoji for smileys, gender neutral people, animals, and the potted plant. For the full list of new emoji characters, see emoji additions for Unicode 13.0, and Emoji Counts. For a detailed description of support for emoji characters by the Unicode Standard, see UTS #51, Unicode Emoji.
Six Creative Commons license symbols that are used to describe functions, permissions, and concepts related to intellectual property that have widespread use on the web
Two Vietnamese reading marks that mark ideographs as having a distinct, colloquial reading
214 graphic characters that provide compatibility with various home computers from the mid-1970s to the mid-1980s and with early teletext broadcasting standards

Support for Chinese, Japanese, and Korean (CJK) unified ideographs was enhanced in Version 13.0 by the addition of 4,939 characters in Extension G, which is the first block to be encoded in Plane 3, as well as by significant corrections and improvements to the Unihan database. Changes to Unihan include updated regular expressions for many properties, the addition of several new properties, and the removal of three obsolete provisional properties. See UAX #38, Unicode Han Database (Unihan) for more information on the updates.
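As a small illustration of what "encoded in Plane 3" means: a code point's plane is its value divided by 0x10000. The sketch below assumes the assigned Extension G range in Version 13.0 is U+30000 through U+3134A, and the helper name plane_of is made up for this example.

```python
# Assumed assigned range of CJK Unified Ideographs Extension G in Unicode 13.0.
EXT_G_FIRST = 0x30000
EXT_G_LAST = 0x3134A


def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a code point."""
    return code_point >> 16  # each plane holds 0x10000 code points


# Number of code points in the inclusive range.
ext_g_count = EXT_G_LAST - EXT_G_FIRST + 1

print(plane_of(EXT_G_FIRST), ext_g_count)  # prints: 3 4939
```

The count derived from the range matches the 4,939 characters quoted above.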

Important chart font updates, including:

An update to the code charts for the Adlam script, now using the Ebrima font. That font has an improved design and has gained widespread acceptance in the user community.
A completely updated font for the CJK Radicals Supplement and the Kangxi Radicals blocks. This font is also used to show the radicals in the CJK unified ideographs code charts, as well as in the radical-stroke indexes.

Additional support for lesser-used languages and scholarly work was extended, including:

A character used in Sinhala to write Sanskrit

Unicode properties and specifications determine the behavior of text on computers and phones. Changes in Version 13.0 include the following Unicode Standard Annexes and Technical Standards that have notable modifications:

Five important Unicode annexes updated for Version 13.0:

Three important Unicode specifications updated for Version 13.0:

UTS #10, Unicode Collation Algorithm — sorting Unicode text
UTS #39, Unicode Security Mechanisms — reducing Unicode spoofing
UTS #46, Unicode IDNA Compatibility Processing — compatible processing of non-ASCII URLs

The Unicode Standard is the foundation for all modern software and communications around the world, including operating systems, browsers, laptops, and smart phones—plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.). The Unicode Standard, its associated standards, and data form the foundation for CLDR and ICU releases.

Friday, March 6, 2020

Unicode Locale Data v37α available for testing

The alpha version of Unicode CLDR version 37 is now available for testing. The beta v37 will contain updates to the LDML spec and is planned for March 25, and the release of v37 is planned for April 22.

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

v37 is an update release with focus on units and annotations (emoji and symbol names and search keywords).

Expanded locale preferences for units of measurement. The new unit preference and conversion data allows formatting functions to pick the right measurement units for the locale and usage, and convert input measurement into those units. See additional details in Specification Changes.

Emoji 13.0. The emoji annotations (names and search keywords) for the new Unicode 13.0 emoji are added. The collation sequences are updated for new Unicode 13.0, and for emoji.

Annotations (names and keywords) expanded to cover more than emoji. This release includes a small set of Unicode symbols (arrow, math, punctuation, currency, alphanum, and geometric) with more to be added in future releases. For example, see v37/annotations/romance.html.

9 new locales added. Caddo [cad], Hindi in Latin script [hi_Latn], Kashmiri in Devanagari script [ks_Deva], Maithili [mai], Manipuri (Meitei Mayek) [mni_Mtei], Nigerian Pidgin [pcm], Santali [sat], Santali (Devanagari) [sat_Deva], and Sindhi (Devanagari) [sd_Deva]. See Locale Coverage Data for the coverage per locale, for both new and old locales.

Grammatical features added. Grammatical features are added for many languages, a first step toward allowing programmers to format units according to grammatical context (e.g., the dative version of "3 kilometers").

Updates to code sets. In particular, the EU is updated (removing GB).
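To make the unit preference item above more concrete, here is a minimal sketch of how a formatting function might pick a unit for a region and usage and then convert the input. The table contents, the region fallback "001", and the function name convert_for_usage are illustrative assumptions, not actual CLDR data or an ICU API.

```python
# Hypothetical preference table: (usage, region) -> preferred unit.
# "001" is used here as the world-wide fallback region.
PREFERRED_UNIT = {
    ("road", "US"): "mile",
    ("road", "GB"): "mile",
    ("road", "001"): "kilometer",
}

# Conversion factors to the base unit (meters).
METERS_PER_UNIT = {"meter": 1.0, "kilometer": 1000.0, "mile": 1609.344}


def convert_for_usage(value, unit, usage, region):
    """Convert a length to the unit preferred for this usage and region."""
    target = PREFERRED_UNIT.get((usage, region), PREFERRED_UNIT[(usage, "001")])
    meters = value * METERS_PER_UNIT[unit]
    return meters / METERS_PER_UNIT[target], target


print(convert_for_usage(5000, "meter", "road", "DE"))  # prints: (5.0, 'kilometer')
```

A real implementation would read both tables from the CLDR supplemental data instead of hard-coding them.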

For more details and important notes for smoothly migrating implementations, see the draft release note Unicode CLDR Version 37. For access to the data, see the GitHub tag: release-37-alpha2.

Tuesday, February 18, 2020

Unicode Consortium Announces Version 13.0 Cover Design

The Unicode Consortium is pleased to announce the new design selected for the cover of the forthcoming print-on-demand publication of The Unicode Standard, Version 13.0. The Unicode Consortium issued an open call for artists and designers to submit cover design proposals. All submitted designs were reviewed by an independent panel.

Unicode 13.0 Book Cover Concept

The selected cover artwork is an original design by Huijun Shan, an award-winning senior UI designer, who has a B.S. in Communication Engineering from Nanjing University. The design was inspired by building blocks for children using letters put together with a scientific color scheme.

Two runner-up designs by Du Lilyu and Saagar Setu were also selected. Lilyu's design cleverly incorporates the Unicode logo into the version number, while Setu's design signifies the endless running for the next Unicode release. Du Lilyu is a graphic designer in China, while Saagar Setu is a Unicode enthusiast based in Ahmedabad, India.

Du Lilyu:

Saagar Setu:

Wednesday, January 29, 2020

Unicode Emoji 13.0 — Now final for 2020

Emoji 13.0 is now final, with 62 new emoji such as:

Smiling face with tear
Polar bear
Bubble tea
Pickup truck
Fondue
Teapot
Piñata
Transgender flag

There are also 55 gender and skin-tone variants, including new gender-inclusive emoji. See the seven cases in boxes below:

The new emoji are listed in Emoji Recently Added v13.0, with sample images. These images are just samples: vendors for mobile phones, PCs, and web platforms will typically use different images. In particular, the Emoji Ordering v13.0 chart shows how the new emoji sort compared to the others, with new emoji marked with rounded rectangles. The other Emoji Charts for Version 13.0 have been updated to show the emoji.

The new emoji typically start showing up on mobile phones in September/October — some platforms may release them earlier. The new emoji will soon be available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages.

For implementers:

The Emoji 13.0 test file (emoji-test.txt) provides data for vendors to begin working on their emoji fonts and code ahead of the release of Unicode 13.0, scheduled for March 10.
The emoji specification (UTS #51) will have additional guidelines on gender and skin tone, and other clarifications. The definitions in UTS #51 and data files have been enhanced to be more consistent and useful. The final text will be available on March 10.
The CLDR names and search keywords for the new emoji in over 80 languages, and the sort order for emoji, will be finalized by mid-April with the release of CLDR v37.
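For implementers who want to start reading the test file, the following is a minimal sketch of a line parser. It assumes the general layout of emoji-test.txt (space-separated hexadecimal code points, a semicolon, a status field, then a "#" comment). The sample line is written in that style for illustration, not copied from the file, and the function name is hypothetical.

```python
def parse_emoji_test_line(line):
    """Parse one data line; return None for blank or comment-only lines."""
    data, _, comment = line.partition("#")
    if ";" not in data:
        return None  # blank line or a pure comment such as "# group: ..."
    code_points, _, status = data.partition(";")
    # Build the emoji string from its hexadecimal code points.
    sequence = "".join(chr(int(cp, 16)) for cp in code_points.split())
    return sequence, status.strip(), comment.strip()


# Illustrative line: bear + ZERO WIDTH JOINER + snowflake + VARIATION SELECTOR-16.
sample = "1F43B 200D 2744 FE0F ; fully-qualified # polar bear"
sequence, status, comment = parse_emoji_test_line(sample)
print(len(sequence), status)  # prints: 4 fully-qualified
```

Keeping the comment as raw text avoids depending on its exact layout, which may differ between versions of the file.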

Thursday, January 23, 2020

Unicode Maya Hieroglyph Project

The National Endowment for the Humanities (NEH) recently announced its grants for 2020 to support 188 humanities projects across the United States. The Unicode Consortium's project to make Maya texts accessible to both expert and non-expert user communities through creating an annotated digital archive is one of the funded projects. This project will be led by principal investigator Gabrielle Vail.

The NEH announcement included mention of the grant to the Unicode Consortium: "Another grant will augment the international Unicode computer text encoding standards to digitally render additional historical and modern scripts, including Mayan ... hieroglyphs."

Thousands of texts written in a hieroglyphic script by prehispanic Maya cultures have been preserved throughout the Maya lowlands, in museums, and in private collections. Various media were used, including painting, carving, and incising. Texts can be found on large-scale stone monuments, the walls of buildings, painted polychrome vessels, codices made from fig-bark paper, and small objects made from stone, bone, and wood. The project will focus on building a digital archive to include texts from Classic period monumental sites. These Classic period texts are most often of a dynastic or political nature.

The Unicode Maya project will advance research on Classic period sites circa 250-900 CE (Common Era) to determine the sign repertoire or character list, a list of quadrats (specific configurations that can be arranged and combined to form a glyphic cartouche or block), a glossary of attested terms from the Classic period, a lexicon mapping of Classic period terms to the Colonial period and modern Mayan dictionaries, and finally, the creation of OpenType fonts.

About the Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards.

The membership of the consortium represents a broad spectrum of corporations and organizations, many in the computer and information processing industry. Members include: Adobe, Apple, Emojipedia, Facebook, Google, Government of Bangladesh, Government of Tamil Nadu, Huawei, IBM, Microsoft, Monotype Imaging, Netflix, Sultanate of Oman MARA, Oracle, SAP, Tamil Virtual University, The University of California (Berkeley), plus well over a hundred Associate, Liaison, and Individual members. For a complete member list go to https://home.unicode.org/membership/members/.

Wednesday, January 22, 2020

Call for Participation Announced for IUC 44

The IUC 44 Program Committee invites you to submit your session, tutorial, or panel abstracts for the 44th Internationalization & Unicode® Conference (IUC 44) in Santa Clara, California, October 14-16, 2020.

Join other industry leaders as they map the future of internationalization, ignite new ideas, and showcase the latest technologies and best practices for creating, managing, and testing global, web, and multilingual software solutions. Be a leader. Direct the future of multilingual text and software internationalization!

Submission types could include case studies, best practices, innovative technology, or evolving standards, to name a few. In addition, understanding the design of a development platform is often critical to implementing best practices in applications. The Internationalization and Unicode Conference seeks to offer technical tutorials on the internationalization capabilities and architecture of development platforms, including Mobile, Desktop, Cloud, and Virtual Operating Systems, Social Network Platforms, and Machine Translation and Machine Learning Systems.

Please submit your proposals for presentations or tutorials by Friday, March 6, 2020.

The Program Committee will notify authors by Friday, April 3, 2020. Speaker agreements and materials such as photos, bios and final presentation abstracts will be required from selected presenters by Friday, April 17, 2020.

Tutorial Presenters receive complimentary conference registration and two nights' lodging, while Session Presenters receive a fifty percent conference discount and two nights' lodging.

Please visit our website to view examples of content from past conferences.

About The Unicode Consortium

The Unicode® Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards.

The membership of the consortium represents a broad spectrum of corporations and organizations, many in the computer and information processing industry. Members include: Adobe, Apple, Emojipedia, Facebook, Google, Government of Bangladesh, Government of India, Huawei, IBM, Microsoft, Monotype Imaging, Netflix, Sultanate of Oman MARA, Oracle, SAP, Tamil Virtual University, The University of California (Berkeley), plus well over a hundred Associate, Liaison, and Individual members.

For more information, please contact the Unicode Consortium.

About the Event Producer

OMG® is the Event Producer for the Internationalization & Unicode Conferences. OMG is an international, open membership, not-for-profit computer industry standards consortium. OMG Task Forces develop enterprise integration standards for a wide range of technologies and an even wider range of industries. OMG's modeling standards, including the Unified Modeling Language™ (UML®) and Model Driven Architecture® (MDA®), enable powerful visual design, execution and maintenance of software and other processes, including IT Systems Modeling and Business Process Management. OMG's middleware standards and profiles are based on the Common Object Request Broker Architecture (CORBA®) and support a wide variety of industries. OMG has offices at 109 Highland Avenue, Needham, MA 02494 USA.

For more information about OMG, visit us online at https://www.omg.org.

Friday, January 10, 2020

New Unicode Working Group: Message Formatting

One of the challenges in adapting programs to work with different languages is message formatting. This is the process of formatting and inserting data values into messages in the user’s language. For example, “The package will arrive at {time} on {date}” could be translated into German as “Das Paket wird am {date} um {time} geliefert”, and the particular {time} and {date} variables would be automatically formatted for German, and inserted in the right places.
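The placeholder substitution described above can be sketched in a few lines. This is only an illustration using Python's built-in str.format, not the ICU MessageFormat API. The MESSAGES table and format_message function are hypothetical, and the time and date values are passed in already formatted, whereas a real library would format them for the locale.

```python
# Hypothetical message catalog keyed by language code.
MESSAGES = {
    "en": "The package will arrive at {time} on {date}",
    "de": "Das Paket wird am {date} um {time} geliefert",
}


def format_message(language, **values):
    """Insert named values into the message pattern for a language."""
    return MESSAGES[language].format(**values)


# The named placeholders let each translation choose its own word order.
print(format_message("de", time="14:30", date="10.01.2020"))
# prints: Das Paket wird am 10.01.2020 um 14:30 geliefert
```

Because the placeholders are named rather than positional, translators can reorder them freely without changes to the calling code.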

The Unicode Consortium has provided message formatting for some time via the ICU programming libraries and CLDR locale data repository. But until now we have not had a syntax for localizable message strings standardized by Unicode. Furthermore, the current ICU MessageFormat is relatively complex for existing operations, such as plural forms, and it does not scale well to other language properties, such as gender and inflections.

The Unicode CLDR Technical Committee is formalizing a new working group to develop a technical specification for message format that addresses these issues. That working group is called the Message Format Working Group and is chaired by Romulo Cintra from CaixaBank. Other participants currently represented are Amazon, Dropbox, Facebook, Google, IBM, Mozilla, OpenJSF, and PayPal.

For information on how to get involved, visit the working group’s GitHub page: https://github.com/unicode-org/message-format-wg

Open discussions will take place on GitHub, and written notes will be posted after every meeting.
