Wednesday, June 8, 2022

Unicode CLDR Version 42 Submission Open

The Unicode CLDR Survey Tool is open for submission for version 42. CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

Version 42 is focusing on:
  • Additional Coverage
    • Unicode 15.0 additions: new emoji, script names, collation data (Chinese & Japanese), …
    • New Languages: Adding Haryanvi, Bhojpuri, and Rajasthani at the Basic level.
    • Up-leveling: Xhosa, Hinglish (Hindi-Latin), Nigerian Pidgin, Hausa, Igbo, Yoruba, and Norwegian Nynorsk.
  • Person Name Formatting: for handling the wide variety in the way that people’s names work in different languages.
    • People may have a different number of names, depending on their culture: they might have only one name (“Zendaya”), two (“Albert Einstein”), or three or more.
    • People may have multiple words in a particular name field, e.g., “Mary Beth” as a given name, or “van Berg” as a surname.
    • In some languages, such as Spanish, people have two surnames (where each can be composed of multiple words).
    • The ordering of name fields can be different across languages, as well as the spacing (or lack thereof) and punctuation.
    • Name formatting needs to be adapted to different circumstances: a shorter or longer presentation; a formal or informal context; talking about someone versus talking to someone; or a monogram (JFK).
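The considerations in the list above can be illustrated with a minimal Python sketch. This is not the CLDR person name data format or any library's API: the `PersonName` class, the `format_name` function, and the `order`, `length`, `formality`, and `sep` parameters are hypothetical names chosen for illustration, and the rules are deliberately simplified.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Hypothetical container; every field except `given` is optional."""
    given: str
    surname: str = ""
    surname2: str = ""  # second surname, as used in Spanish
    title: str = ""


def format_name(name, order="given_first", length="long",
                formality="formal", sep=" "):
    """Format a name under simplified, illustrative rules (not CLDR data)."""
    if length == "monogram":
        # Initial of each non-empty field, in the requested order.
        fields = [name.given, name.surname]
        if order == "surname_first":
            fields.reverse()
        return "".join(f[0].upper() for f in fields if f)

    if formality == "informal":
        # Assumption: informal contexts use just the given name.
        return name.given

    surnames = " ".join(s for s in (name.surname, name.surname2) if s)
    if length == "short":
        # Short formal: title plus surname, falling back to the given
        # name for people who have only one name.
        parts = [name.title, surnames or name.given]
    elif order == "surname_first":
        parts = [name.title, surnames, name.given]
    else:
        parts = [name.title, name.given, surnames]
    # Skip empty fields so that single-name people format cleanly.
    return sep.join(p for p in parts if p)


print(format_name(PersonName("Zendaya")))
print(format_name(PersonName("Mary Beth", "van Berg")))
print(format_name(PersonName("María", "García", "López")))
print(format_name(PersonName("太郎", "山田"), order="surname_first", sep=""))
print(format_name(PersonName("Albert", "Einstein", title="Dr."), length="short"))
print(format_name(PersonName("Albert", "Einstein"), length="monogram"))
```

The sketch treats each name field as an opaque string, so multi-word fields such as “Mary Beth” or “van Berg” pass through intact, and empty fields are skipped rather than leaving stray separators. Real locale data would supply the ordering, spacing, and punctuation per language instead of taking them as arguments.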
Submission of new data opened recently, and is slated to finish on June 22. The new data then enters a vetting phase, where contributors work out which of the supplied data for each field is best. That vetting phase is slated to finish on July 6. A public alpha makes the draft data available around August 17, and the final release targets October 19.

Each new locale starts with a small set of Core data, such as a list of characters used in the language. Submitters of those locales need to bring the coverage up to Basic level (very basic dates, times, numbers, and endonyms) during the next submission cycle. In version 41, the following levels were reached:

Level      Languages  Locales*  Notes
Modern     89         361       Suitable for full UI internationalization
    Afrikaans, … Čeština, … Dansk, … Eesti, … Filipino, … Gaeilge, … Hrvatski, Indonesia, … Jawa, Kiswahili, Latviešu, … Magyar, … Nederlands, … O‘zbek, Polski, … Română, Slovenčina, … Tiếng Việt, … Ελληνικά, Беларуская, … ᏣᎳᎩ, ქართული, Հայերեն, עברית, اردو, … አማርኛ, नेपाली, … অসমীয়া, বাংলা, ਪੰਜਾਬੀ, ગુજરાતી, ଓଡ଼ିଆ, தமிழ், తెలుగు, ಕನ್ನಡ, മലയാളം, සිංහල, ไทย, ລາວ, မြန်မာ, ខ្មែរ, 한국어, … 日本語, …
Moderate   13         32        Suitable for full “document content” internationalization, such as formats in a spreadsheet.
    Binisaya, … Èdè Yorùbá, Føroyskt, Igbo, IsiZulu, Kanhgág, Nheẽgatu, Runasimi, Sardu, Shqip, سنڌي, …
Basic      22         21        Suitable for locale selection, such as choice of language in mobile phone settings.
    Asturianu, Basa Sunda, Interlingua, Kabuverdianu, Lea Fakatonga, Rumantsch, Te reo Māori, Wolof, Босански (Ћирилица), Татар, Тоҷикӣ, Ўзбекча (Кирил), کٲشُر, कॉशुर (देवनागरी), …, মৈতৈলোন্, ᱥᱟᱱᱛᱟᱲᱤ, 粤语 (简体)
* Locales are variants for different countries or scripts.

If you would like to contribute missing data for your language, see Survey Tool Accounts. For more information on contributing to CLDR, see the CLDR Information Hub.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages.