The Unicode Blog: June 2022

Thursday, June 30, 2022

Working with Local Communities to Revitalize and Preserve Indigenous Languages in Canada

By Kevin King, Typotheque

The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada to document and address both local typographic preferences and the technical barriers they faced.

This research contributed to two proposals to amend the Unicode Standard for the Syllabics, which is an important step in the preservation and revitalization of Indigenous languages.

[Map, Image provided by Typotheque https://www.typotheque.com/, used with permission.]

The local Indigenous communities were given a voice in reclaiming ownership over the use of their language, as well as the resources for self-determined expression in the writing system that they identify with. Working in collaboration with Nattilik language keepers Nilaulaaq, Janet Tamalik, Attima and Elisabeth Hadlari, and elders in the community, the project identified key issues faced by the Nattilik community of Western Nunavut and discovered that 12 syllabic characters were missing from the Unicode Standard. As a result, the Nattilik community had been unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging.

[Syllables Block, Image provided by Typotheque https://www.typotheque.com/, used with permission.]

The Nattilik Kutaiřřutit (Nattilik special characters), required for representing sounds unique to the Nattilingmiutut dialect of Inuktut.
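
The practical effect of missing or newly added characters can be checked on any given system. The following is a minimal sketch, not part of the Typotheque project, that uses Python's standard `unicodedata` module to list Syllabics code points that the running interpreter's Unicode data treats as unassigned. The function names are hypothetical, and the Extended-A range is an assumption that should be verified against the current code charts.

```python
import unicodedata

# Code point ranges of the Syllabics blocks. The first two are long-standing
# blocks; the Extended-A range is an assumption stated from memory and should
# be checked against the Unicode code charts.
UCAS_BLOCKS = [
    (0x1400, 0x167F),    # Unified Canadian Aboriginal Syllabics
    (0x18B0, 0x18FF),    # Unified Canadian Aboriginal Syllabics Extended
    (0x11AB0, 0x11ABF),  # Extended-A (assumed range)
]


def is_syllabics(ch: str) -> bool:
    """Return True if the character falls inside one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCAS_BLOCKS)


def unsupported_syllabics(text: str) -> list[str]:
    """List Syllabics code points this runtime reports as unassigned (Cn)."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if is_syllabics(ch) and unicodedata.category(ch) == "Cn"
    ]
```

Calling `unsupported_syllabics(message)` on a piece of text gives a rough signal of whether the runtime's Unicode data (see `unicodedata.unidata_version`) is recent enough to know every Syllabics character in it. It says nothing about whether an installed font can display those characters.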

It was also revealed that the glyphs of the Carrier (Dakelh) community of central British Columbia were incorrectly represented in the UCAS code charts. Additionally, 4 characters for a now-obsolete sp series were successfully proposed to Unicode for representing and digitally preserving historical texts in the Cree and Ojibway languages. These important alterations mean that all Syllabics typefaces that are fully Unicode-compliant, including system-level typefaces on common operating systems, will be capable of accurately and legibly representing text for the Carrier, Sayisi, and Ojibway Syllabics-using communities moving forward.

The comprehensive glyph set produced by the project not only provided a stable environment for the local Indigenous communities to use their languages on their devices, but also changed the standards for the development of all future Syllabics fonts and ensured that the writing systems of all communities will be accurately represented.

[Syllables, Image provided by Typotheque https://www.typotheque.com/, used with permission.]

Above, a representation of the missing characters for Nattilingmiutut, a dialect of Inuktut in Western Nunavut.

Where to learn more:

Acknowledgements

Special thanks to Liang Hai, Deborah Anderson, and Sarah Rivera for their contributions to this blog.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Wednesday, June 8, 2022

Unicode CLDR Version 42 Submission Open

The Unicode CLDR Survey Tool is open for submission for version 42. CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

Version 42 is focusing on:

Additional Coverage
- Unicode 15.0 additions: new emoji, script names, collation data (Chinese & Japanese), …
- New Languages: Adding Haryanvi, Bhojpuri, Rajasthani at a Basic level.
- Up-leveling: Xhosa, Hinglish (Hindi-Latin), Nigerian Pidgin, Hausa, Igbo, Yoruba, and Norwegian Nynorsk.
Person Name Formatting: for handling the wide variety in the way that people’s names work in different languages.
- People may have a different number of names, depending on their culture: they might have only one name (“Zendaya”), two (“Albert Einstein”), or three or more.
- People may have multiple words in a particular name field, e.g., “Mary Beth” as a given name, or “van Berg” as a surname.
- In some languages, such as Spanish, people have two surnames (where each can be composed of multiple words).
- The ordering of name fields can be different across languages, as well as the spacing (or lack thereof) and punctuation.
- Name formatting needs to be adapted to different circumstances, such as a need to be presented shorter or longer; a formal or informal context; talking about someone versus talking to someone; or use as a monogram (JFK).
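
The variation described above can be sketched in a few lines of code. This is an illustrative toy only, not CLDR's data model or API: the `PersonName` fields, the `format_name` options, and the `monogram` helper are all hypothetical names chosen for this example, and a real formatter would take its ordering, spacing, and punctuation from per-locale data rather than from arguments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonName:
    # Hypothetical fields for illustration; any of them may be absent.
    given: Optional[str] = None     # may hold several words, e.g. "Mary Beth"
    surname: Optional[str] = None   # may hold several words, e.g. "van Berg"
    surname2: Optional[str] = None  # second surname, as used in Spanish


def format_name(name, *, surname_first=False, short=False, separator=" "):
    """Join the fields that are present, in the requested order and length."""
    given = name.given
    if short and given:
        # Short form: reduce each word of the given name to an initial.
        given = " ".join(word[0] + "." for word in given.split())
    # Short form also drops the second surname.
    surname_parts = [name.surname] if short else [name.surname, name.surname2]
    surname = " ".join(part for part in surname_parts if part)
    fields = [surname, given] if surname_first else [given, surname]
    return separator.join(field for field in fields if field)


def monogram(name):
    """First letter of the given name and of the surname, when present."""
    return "".join(f[0].upper() for f in (name.given, name.surname) if f)


print(format_name(PersonName(given="Zendaya")))               # Zendaya
print(format_name(PersonName("Albert", "Einstein"),
                  surname_first=True, separator=", "))        # Einstein, Albert
print(format_name(PersonName("Mary Beth", "van Berg"),
                  short=True))                                # M. B. van Berg
```

Even this toy shows why a single fixed pattern is not enough: a missing field must not leave stray separators, and order, length, and punctuation all have to be selectable per language and context.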

Submission of new data opened recently, and is slated to finish on June 22. The new data then enters a vetting phase, where contributors work out which of the supplied data for each field is best. That vetting phase is slated to finish on July 6. A public alpha makes the draft data available around August 17, and the final release targets October 19.

Each new locale starts with a small set of Core data, such as a list of characters used in the language. Submitters of those locales need to bring the coverage up to Basic level (very basic dates, times, numbers, and endonyms) during the next submission cycle. In version 41, the following levels were reached:

Level	Languages	Locales*	Notes
Modern	89	361	Suitable for full UI internationalization
Afrikaans‎, ‎… Čeština‎, ‎… Dansk‎, ‎… Eesti‎, ‎… Filipino‎, ‎… Gaeilge‎, ‎… Hrvatski‎, ‎Indonesia‎, ‎… Jawa‎, ‎Kiswahili‎, ‎Latviešu‎, ‎… Magyar‎, ‎…Nederlands‎, ‎… O‘zbek‎, ‎Polski‎, ‎… Română‎, ‎Slovenčina‎, ‎… Tiếng Việt‎, ‎… Ελληνικά‎, ‎Беларуская‎, ‎… ‎ᏣᎳᎩ‎, ‎ Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, … አማርኛ‎, ‎नेपाली‎, … ‎অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, ‎தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, ‎မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, ‎… 日本語‎, ‎…
Moderate	13	32	Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Binisaya, … ‎Èdè Yorùbá, ‎Føroyskt, ‎Igbo, ‎IsiZulu, ‎Kanhgág, ‎Nheẽgatu, ‎Runasimi, ‎Sardu, ‎Shqip, ‎سنڌي, …
Basic	22	21	Suitable for locale selection, such as choice of language in mobile phone settings.
Asturianu, ‎Basa Sunda, ‎Interlingua, ‎Kabuverdianu, ‎Lea Fakatonga, ‎Rumantsch, ‎Te reo Māori, ‎Wolof, ‎Босански (Ћирилица), ‎Татар, ‎Тоҷикӣ, ‎Ўзбекча (Кирил), ‎کٲشُر, ‎कॉशुर (देवनागरी), ‎…, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, ‎粤语 (简体)‎

* Locales are variants for different countries or scripts.

If you would like to contribute missing data for your language, see Survey Tool Accounts. For more information on contributing to CLDR, see the CLDR Information Hub.
