Saturday, October 30, 2010

Unicode 6.0 Sorting

Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #10, Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0, adding support for 2,088 characters in sorting, searching, and matching. Also in this release are new data files supporting the Unicode Common Locale Data Repository (CLDR), which provides customization for different languages.

Reorderable Categories. The data files for CLDR order characters strictly by certain major categories. This allows programmers to parametrically reorder these groups of characters to put them in the desired order for different languages. For example, numbers can be ordered after letters, or Cyrillic before Latin. The reorderable categories are:

whitespace, punctuation, general symbols, currency symbols, and numbers, then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally, CJK.
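As a rough illustration of parametric reordering, the Python sketch below assigns each character to a group and sorts by a caller-supplied group order. It is a toy model, not the UCA or the CLDR data: the names group_of, make_key, and DEFAULT_ORDER are invented for this example, only three scripts are recognized, and groups are guessed from unicodedata categories and character names.

```python
import unicodedata

# Hypothetical group order loosely following the list above (toy subset).
DEFAULT_ORDER = ("whitespace", "punctuation", "symbol", "currency",
                 "number", "Latin", "Greek", "Cyrillic", "other")


def group_of(ch):
    """Guess a reorderable group for one character (an approximation)."""
    cat = unicodedata.category(ch)
    if ch.isspace():
        return "whitespace"
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat.startswith("N"):
        return "number"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"


def make_key(order=DEFAULT_ORDER):
    """Build a sort key that ranks characters by group first."""
    rank = {group: i for i, group in enumerate(order)}

    def key(s):
        # Group rank decides first; casefolded and raw characters break ties.
        return [(rank[group_of(ch)], ch.casefold(), ch) for ch in s]

    return key


words = ["abc", "123", "эюя"]
default_sorted = sorted(words, key=make_key())

# Numbers after letters: move "number" behind the scripts.
numbers_last = ("whitespace", "punctuation", "symbol", "currency",
                "Latin", "Greek", "Cyrillic", "number", "other")
numbers_last_sorted = sorted(words, key=make_key(numbers_last))

# Cyrillic before Latin: swap the two script groups.
cyrillic_first = ("whitespace", "punctuation", "symbol", "currency",
                  "number", "Cyrillic", "Latin", "Greek", "other")
cyrillic_first_sorted = sorted(words, key=make_key(cyrillic_first))
```

The point of the sketch is that the per-character data stays fixed while only the group order changes, which is what makes the reordering parametric.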

Distinguishing Symbols from Punctuation. UCA provides an option for ignoring certain characters when comparing strings. By default, these are whitespace, punctuation, and general symbols. The data files for CLDR modify that default so that symbols are compared significantly, while still ignoring whitespace and punctuation. Thus, for example, "I♥NY" is not sorted the same as "I☠NY".
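The difference between the two defaults can be sketched in Python as follows. This is a simplified model under stated assumptions: the function name primary_key and the ignore_symbols flag are invented here, character classes are approximated with unicodedata general categories, and only a single comparison level is modeled (the real algorithm has further levels and options).

```python
import unicodedata


def primary_key(s, ignore_symbols):
    """Toy comparison key: drop the characters that are to be ignored.

    ignore_symbols=True approximates the UCA default described above;
    ignore_symbols=False approximates the CLDR-modified default.
    """
    kept = []
    for ch in s:
        cat = unicodedata.category(ch)
        if ch.isspace() or cat.startswith("P"):
            continue  # whitespace and punctuation are ignored in both modes
        if ignore_symbols and cat.startswith("S") and cat != "Sc":
            continue  # general symbols (not currency) ignored in UCA mode
        kept.append(ch.casefold())
    return "".join(kept)


# With symbols ignored, the two strings compare as equal.
same_under_uca = primary_key("I♥NY", True) == primary_key("I☠NY", True)

# With symbols significant, they are distinguished.
same_under_cldr = primary_key("I♥NY", False) == primary_key("I☠NY", False)

# Punctuation and whitespace are still ignored in the CLDR-style mode.
hyphen_vs_space = primary_key("I-NY", False) == primary_key("I NY", False)
```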

Special Database Values. The data files for CLDR provide special weights for two noncharacters:

1. A special noncharacter <HIGH> (U+FFFF) for specification of a range in a database, allowing "Sch" ≤ X ≤ "Sch<HIGH>" to pick all strings starting with "sch" plus those that sort equivalently.

2. A special noncharacter <LOW> (U+FFFE) for merged database fields, allowing "Disílva<LOW>John" to sort next to "Disilva<LOW>John".
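Both uses can be sketched with a toy two-level comparison in Python. Everything here is an assumption made for illustration: the helper names (primary, sort_key, starts_with_range, merged), the use of code points as stand-ins for collation weights, and the choice to do the range test at the primary level only. It is not the UCA or the CLDR data.

```python
import unicodedata

LOW = "\ufffe"   # treated as lower than every other character
HIGH = "\uffff"  # treated as higher than every other character


def primary(s):
    """Toy primary weights: casefolded base letters, accents dropped."""
    weights = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            continue  # accents do not count at the primary level
        if ch == LOW:
            weights.append(-1)
        elif ch == HIGH:
            weights.append(0x110000)
        else:
            weights.extend(ord(c) for c in ch.casefold())
    return weights


def sort_key(s):
    """Primary weights first; the decomposed string breaks ties."""
    return (primary(s), unicodedata.normalize("NFD", s))


def starts_with_range(prefix, candidates):
    """Select candidates X with prefix <= X <= prefix + HIGH (primary only)."""
    lo, hi = primary(prefix), primary(prefix + HIGH)
    return [c for c in candidates if lo <= primary(c) <= hi]


def merged(last, first):
    """Merge two database fields into one sortable string."""
    return last + LOW + first


names = ["Sad", "Sch", "Schmidt", "schön", "SCHULZ", "Sci"]
in_range = starts_with_range("Sch", names)

people = [("Disilva", "Zack"), ("Disílva", "John"), ("Disilva", "John")]
# Merged field: accent differences only matter after both names compare equal.
by_merged = sorted(people, key=lambda p: sort_key(merged(*p)))
# Field by field: the accent on the last name outranks the first name.
by_fields = sorted(people, key=lambda p: (sort_key(p[0]), sort_key(p[1])))
```

In this sketch, by_merged places the two John records side by side, while by_fields separates them with the Zack record, which is the behavior the <LOW> weight is meant to avoid.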

The version of CLDR using these new data files is planned for release at the start of December, 2010.

The text of the UCA standard has been clarified in different areas. Implementers should pay special attention to the changes regarding ill-formed sequences, noncharacters, and unassigned code points in CJK blocks.

For more information, see:

* The UCA Standard 6.0.0: http://www.unicode.org/reports/tr10/
* The UCA charts: http://unicode.org/charts/collation/
* The UCA data: http://unicode.org/Public/UCA/6.0.0/
* Merged database fields: http://unicode.org/reports/tr10/#Interleaved_Levels

About The Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry.

Members are: Adobe, Apple, Google, Government of Bangladesh, Government of India, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural Language Technology Research, SAP, The University of California (Berkeley), The University of California (Santa Cruz), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members.

For more information, please contact the Unicode Consortium. http://www.unicode.org/contacts.html

Unicode 6.0 Internationalized Domain Names

Mountain View, CA, USA – October 29, 2010 – The new version of Unicode Technical Standard #46, Unicode IDNA Compatibility Processing, has been updated for Unicode Version 6.0, adding support for 2,088 characters in internationalized domain names (IDN).

The specification provides two main features for use with the new specification for internationalized domain names released in August 2010 (IDNA2008):

1. A comprehensive mapping to reflect user expectations for casing and other variants of domain names. This mapping is allowed by IDNA2008, and follows the same principles as in the previous version of that specification (IDNA2003, in force from 2003 until August 2010). It thus provides users consistency between old and new versions.

2. A compatibility mechanism that supports internationalized domain names valid under the IDNA2003 specification and the IDNA2008 specification. This second feature allows browsers, search engines, and other clients to handle both old and new domain names during the transitional period until registries update their rules to follow IDNA2008.
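To give a feel for the kind of mapping involved, the sketch below uses Python's built-in "idna" codec. Note the assumption: that codec implements the older IDNA2003 rules, not UTS #46 itself, so it only illustrates the IDNA2003-style behavior that the mapping is said to follow. The helper name to_ascii is invented for this example.

```python
def to_ascii(domain):
    """Convert a domain name to its ASCII form using Python's IDNA2003 codec.

    This is not a UTS #46 implementation; it only shows IDNA2003-style mapping.
    """
    return domain.encode("idna").decode("ascii")


# Case variants of a non-ASCII label map to the same ASCII form.
upper = to_ascii("BÜCHER.example")
lower = to_ascii("bücher.example")

# Under IDNA2003 rules the sharp s is mapped to "ss" before lookup.
sharp_s = to_ascii("faß.example")
```

A client following the transitional approach described above would apply a mapping of this kind so that names registered under the older rules keep resolving.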

UTS #46 supplies normative data tables that are synchronized with the latest version of Unicode, allowing implementations to update without recalculation.

This new release of UTS #46 also provides a custom option to recognize legacy international domain names containing special ASCII characters such as "_".

Tuesday, October 12, 2010

Unicode Version 6.0: Support for Popular Symbols in Asia

The newly finalized Unicode Version 6.0 adds 2,088 characters, with over 1,000 new symbols.

A long-awaited feature of Unicode 6.0 is the encoding of hundreds of symbols for mobile phones. These emoji characters are in widespread use, especially in Japan, and have become an essential part of text messages there and elsewhere. Unicode 6.0 now provides for data interchange between different mobile vendors and across the internet. The symbols cover many domains: maps and transport, phases of the moon, UI symbols (such as fast-forward), and many others.

A late-breaking addition is the newly created official symbol for the Indian rupee. With the help of the Indian government and our colleagues in ISO, the consortium was able to accelerate the encoding process. Once computers and mobile phones update to the new version of Unicode, people will be able to use the rupee sign like they use $ or € now.

This October 2010 release includes the Unicode Character Database (UCD), Unicode Standard Annexes (UAXes), and code charts. With the release of these components, implementers are able to update their software to Unicode 6.0 without delay. The final text of the core specification will be available in early 2011.

To access Unicode 6.0, see http://www.unicode.org/versions/Unicode6.0.0.

For more information on emoji, see http://unicode.org/faq/emoji_dingbats.html

For a formatted version of this message with images, see http://unicode.org/press/pr-6.0.html.