Wednesday, May 18, 2016

ICU joins the Unicode Consortium

Today we are welcoming the ICU project into the Unicode Consortium.

Every smartphone and laptop uses the Unicode encoding and Unicode CLDR data for language support: from Arabic to Japanese to Zulu — and even plain English. The Unicode Consortium provides the data, but has not provided software to directly use that data, until now.

The ICU (International Components for Unicode) project has long provided software that implements the Unicode data and algorithms. ICU is a mature, very widely deployed set of C/C++ and Java software libraries, open-sourced since 1999 under the stewardship of IBM. When you see a date or number written in your language on your smartphone, for example, or a list of sorted names, the formatting and sorting are done with ICU.

There has long been a close working relationship between the various Unicode Consortium committees and the ICU team, with many people working on Unicode projects as well as ICU. That has ensured that Unicode data and algorithms can be effectively and quickly implemented.

IBM made the decision to transfer ICU to the Unicode Consortium so that ICU could benefit from the formal and open governance that the Unicode Consortium offers. “IBM has a long history in our commitment to open standards as a driver of innovation for our customers worldwide,” said Helena Chapman, IBM Globalization Executive. By moving ICU under the Unicode Consortium, it provides a cross-industry, open source collaboration that will drive greater consistency and interoperability across computing platforms to the benefit of global technology users world-wide. IBM has been an active member of the Unicode Consortium since its inception, and is pleased to see this further consolidation of foundational open source globalization standards.

The ICU team has become a new Consortium technical committee, along with the other Unicode committees. ICU will be released under the Unicode open-source license (similar to the previous license), just like the Unicode Character Database and the CLDR data. For users of ICU, we’ll try to make this transition as smooth as possible.

The Unicode Consortium and the ICU team would like to thank IBM for many years of project stewardship, as well as for major past and ongoing contributions to the project.

For more information, see

Monday, May 16, 2016

PRI #326: Combined registration of the MSARG collection sequences

The Unicode Consortium has posted a new issue for public review and comment.

Public Review Issue #326: A submission for the “Combined registration of the MSARG collection and of sequences in that collection” has been received by the IVD registrar.

This submission is currently under review according to the procedures of UTS #37, Unicode Ideographic Variation Database, with an expected close date of 2016-08-12. Please see the submission page for details and instructions on how to review this issue and provide comments:

The IVD (Ideographic Variation Database) establishes a registry for collections of unique, and sometimes shared, variation sequences for CJK Unified Ideographs, which enables standardized interchange in plain text, in accordance with UTS #37, Unicode Ideographic Variation Database.
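As a rough illustration of what a variation sequence looks like in plain text, the sketch below pairs each character with a following variation selector from the Variation Selectors Supplement (U+E0100 to U+E01EF), the range used for ideographic variation sequences. The function names and the sample string are illustrative assumptions. The code does not consult the IVD, so it cannot tell whether a sequence is actually registered in any collection, and it does not check that the base character is a CJK Unified Ideograph.

```python
from typing import List, Optional, Tuple

# Variation Selectors Supplement (VS17..VS256), the selectors used in
# ideographic variation sequences.
IVS_SELECTOR_FIRST = 0xE0100
IVS_SELECTOR_LAST = 0xE01EF


def is_ivs_selector(ch: str) -> bool:
    """Return True if ch is a variation selector in U+E0100..U+E01EF."""
    return IVS_SELECTOR_FIRST <= ord(ch) <= IVS_SELECTOR_LAST


def split_variation_sequences(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each character with the selector that immediately follows it, if any.

    Hypothetical helper for illustration only: it neither validates the base
    character nor looks anything up in the IVD.
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    for ch in text:
        if is_ivs_selector(ch) and pairs and pairs[-1][1] is None:
            base, _ = pairs[-1]
            pairs[-1] = (base, ch)  # attach the selector to the preceding character
        else:
            pairs.append((ch, None))
    return pairs


# Illustrative input: U+845B followed by VS17 (U+E0100), then a plain letter.
sample = "\u845B\U000E0100a"
for base, selector in split_variation_sequences(sample):
    label = f"U+{ord(selector):04X}" if selector else "none"
    print(f"U+{ord(base):04X} selector={label}")
```

Software that does not recognize a given sequence can ignore the selector and display the base ideograph, which is what makes interchange in plain text workable.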

Wednesday, May 4, 2016

Not Just Emoji

Every programmer knows about Unicode. Most other people have no idea what it is, even though they use Unicode every day. Every character you type on your smartphone or laptop — and every character you read — is defined by the Unicode Consortium.

Awareness of the Unicode Consortium has grown recently with the spread of emoji. But from the news articles, it’s easy to get the impression that emoji are the only thing we do. In reality, there are over 120,000 characters defined, and as you see below, only a small fraction of them are emoji.

[Figure: Emoji and Non-Emoji]

For example, this June we’ll be adding 7,500 characters, and fewer than 1% of those new characters are emoji. The majority of the characters are from six new scripts, some in modern use and some historic.

CLDR is the other main project for the Unicode Consortium. It provides the building blocks for supporting a variety of different languages. We’ve just released CLDR v29, and are about to start data submission for v30. Especially if you are a native speaker of a “digitally disadvantaged” language, we encourage you to join the other contributors to CLDR to help with this effort.

The Unicode Consortium is a volunteer-driven 501(c)(3) non-profit organization. Some people may work on emoji, while others work on ancient scripts, or Chinese ideographs. Others work on the language support in CLDR, or other projects.

You can help fund the work of the consortium — even if you don’t contribute technically — by adopting your favorite character through the Adopt A Character program.

— Mark Davis, President