Saturday, October 30, 2010

Unicode 6.0 Sorting

Mountain View, CA, USA – October 29, 2010 – Unicode Technical Standard #10, Unicode Collation Algorithm (UCA), has been updated for Unicode Version 6.0, adding support for 2,088 characters in sorting, searching, and matching. This release also includes new data files supporting the Unicode Common Locale Data Repository (CLDR), which provides customization for different languages.

Reorderable Categories. The data files for CLDR order characters strictly by certain major categories. This allows programmers to parametrically reorder these groups of characters to put them in the desired order for different languages. For example, numbers can be ordered after letters, or Cyrillic before Latin. The reorderable categories are:

whitespace, punctuation, general symbols, currency symbols, and numbers, then Latin, Greek, Coptic, Cyrillic, ..., Egyptian Hieroglyphs, and finally, CJK.
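The idea of parametric reordering can be illustrated with a small Python sketch. This is a toy model, not the UCA or the CLDR data: the names `DEFAULT_ORDER`, `group_of`, and `sort_with` are hypothetical, characters are assigned to groups from their Unicode general category and character name, only a few of the groups are modeled, and ties within a group fall back to code point order.

```python
import unicodedata

# Toy subset of the reorderable groups, in their default order.
# "other" is a catch-all that is not part of the real list.
DEFAULT_ORDER = ["whitespace", "punctuation", "symbol", "currency", "digit",
                 "Latin", "Greek", "Cyrillic", "other"]

def group_of(ch):
    """Assign a character to one of the toy groups (a rough approximation)."""
    cat = unicodedata.category(ch)
    if cat == "Zs":
        return "whitespace"
    if cat.startswith("P"):
        return "punctuation"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    if cat == "Nd":
        return "digit"
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script.capitalize()
    return "other"

def sort_key(text, order):
    """Key each character by (rank of its group, code point)."""
    rank = {group: i for i, group in enumerate(order)}
    return [(rank[group_of(ch)], ord(ch)) for ch in text]

def sort_with(words, order):
    return sorted(words, key=lambda w: sort_key(w, order))

# Two reorderings mentioned in the text: numbers after letters,
# and Cyrillic before Latin.
DIGITS_LAST = [g for g in DEFAULT_ORDER if g != "digit"] + ["digit"]
CYRILLIC_FIRST = ["whitespace", "punctuation", "symbol", "currency", "digit",
                  "Cyrillic", "Latin", "Greek", "other"]

words = ["abc", "1st", "яблоко"]
print(sort_with(words, DEFAULT_ORDER))
print(sort_with(words, DIGITS_LAST))
print(sort_with(words, CYRILLIC_FIRST))
```

The point of the sketch is that the per-character data stays fixed and only the permutation of groups changes, which is what makes the reordering a parameter rather than a new table.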

Distinguishing Symbols from Punctuation. UCA provides an option for ignoring certain characters when comparing strings. By default, these are whitespace, punctuation, and general symbols. The data files for CLDR modify that default so that symbols are compared significantly, while still ignoring whitespace and punctuation. Thus, for example, "I♥NY" is not sorted the same as "I☠NY".
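The difference between the two defaults can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the UCA itself: the function `comparison_key` is hypothetical, "ignored" characters are simply dropped from the key, general symbols are approximated by the general categories Sm, Sk, and So, and the remaining characters are compared by code point.

```python
import unicodedata

def comparison_key(text, ignore_symbols):
    """Build a toy comparison key that skips ignorable characters.

    Whitespace and punctuation are always skipped. General symbols are
    skipped only when ignore_symbols is True (modeling the UCA default);
    passing False models the behavior described for the CLDR data files.
    """
    key = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("Z") or cat.startswith("P"):
            continue
        if ignore_symbols and cat in ("Sm", "Sk", "So"):
            continue
        key.append(ord(ch))
    return key

# With symbols ignored, the two strings compare as equal.
print(comparison_key("I♥NY", True) == comparison_key("I☠NY", True))
# With symbols significant, they are distinguished.
print(comparison_key("I♥NY", False) == comparison_key("I☠NY", False))
```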

Special Database Values. The data files for CLDR provide special weights for two noncharacters:

1. A special noncharacter <HIGH> (U+FFFF) for specification of a range in a database, allowing "Sch" ≤ X ≤ "Sch<HIGH>" to pick all strings starting with "sch" plus those that sort equivalently.

2. A special noncharacter <LOW> (U+FFFE) for merged database fields, allowing "Disílva<LOW>John" to sort next to "Disilva<LOW>John".
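Both special values can be illustrated with a toy two-level key in Python. This is only a sketch under assumptions: `toy_key`, `in_prefix_range`, and `MAX_WEIGHT` are hypothetical names, base letters are compared case-insensitively by code point as a stand-in for primary weights, and combining marks are collected separately as a stand-in for secondary weights. In this model <LOW> gets a primary weight below every character and <HIGH> one above every character.

```python
import unicodedata

LOW = "\ufffe"         # noncharacter U+FFFE
HIGH = "\uffff"        # noncharacter U+FFFF
MAX_WEIGHT = 0x110000  # larger than any code point

def toy_key(text):
    """Return (primary, secondary) weights for a string (toy model)."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch == LOW:
            primary.append(0)
        elif ch == HIGH:
            primary.append(MAX_WEIGHT)
        elif unicodedata.combining(ch):
            secondary.append(ord(ch))   # accents matter only as tie-breakers
        else:
            primary.extend(ord(c) for c in ch.casefold())
    return (primary, secondary)

def in_prefix_range(prefix, candidates):
    """Select candidates X with prefix <= X <= prefix + HIGH at the primary level."""
    low = toy_key(prefix)[0]
    high = toy_key(prefix + HIGH)[0]
    return [c for c in candidates if low <= toy_key(c)[0] <= high]

names = ["Saul", "SCH", "Schmidt", "schön", "Scott"]
matches = in_prefix_range("Sch", names)
print(matches)

# Merged fields: family name and given name joined with LOW.
records = [("Disilva", "Tom"), ("Disílva", "John"), ("Disilva", "John")]
merged = sorted(records, key=lambda r: toy_key(r[0] + LOW + r[1]))
per_field = sorted(records, key=lambda r: (toy_key(r[0]), toy_key(r[1])))
print(merged)
print(per_field)
```

In this model, sorting on the merged field keeps the two "John" records adjacent, because the accent difference is only consulted after all base letters of both fields have been compared. Sorting field by field instead lets the accent in the family name decide first, which places "Disilva, Tom" between them.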

The version of CLDR that uses these new data files is planned for release at the start of December 2010.

The text of the UCA standard has been clarified in several areas. Implementers should pay special attention to the changes regarding ill-formed sequences, noncharacters, and unassigned code points in CJK blocks.

For more information, see:

* The UCA Standard 6.0.0: http://www.unicode.org/reports/tr10/
* The UCA charts: http://unicode.org/charts/collation/
* The UCA data: http://unicode.org/Public/UCA/6.0.0/
* Merged database fields: http://unicode.org/reports/tr10/#Interleaved_Levels

About The Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry.

Members are: Adobe, Apple, Google, Government of Bangladesh, Government of India, IBM, Microsoft, Monotype Imaging, Oracle, The Society for Natural Language Technology Research, SAP, The University of California (Berkeley), The University of California (Santa Cruz), Yahoo!, plus well over a hundred Associate, Liaison, and Individual members.

For more information, please contact the Unicode Consortium. http://www.unicode.org/contacts.html