Thursday, June 30, 2022

Working with Local Communities to Revitalize and Preserve Indigenous Languages in Canada

By Kevin King, Typotheque

The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada to document and address both the local typographic preferences and the technical barriers they faced.

This research contributed to two proposals to amend the Unicode Standard for the Syllabics, which is an important step in the preservation and revitalization of Indigenous languages.

[Map, Image provided by Typotheque https://www.typotheque.com/, used with permission.]

The local Indigenous communities were given a voice in reclaiming ownership over the use of their language, as well as the resources for self-determined expression in the writing system that they identify with. Working in collaboration with Nattilik language keepers Nilaulaaq, Janet Tamalik, Attima and Elisabeth Hadlari, and elders in the community, the project identified key issues faced by the Nattilik community of Western Nunavut and discovered that 12 syllabic characters were missing from the Unicode Standard. Without them, the Nattilik community was unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging.

[Syllables Block, Image provided by Typotheque https://www.typotheque.com/, used with permission.]
The Nattilik Kutaiřřutit (Nattilik special characters), required for representing sounds unique to the Nattilingmiutut dialect of Inuktut.


It was also revealed that the glyphs of the Carrier (Dakelh) community of central British Columbia were incorrectly represented in the Unified Canadian Aboriginal Syllabics (UCAS) code charts. Additionally, four characters for a now-obsolete sp series were successfully proposed to Unicode for representing and digitally preserving historical texts in the Cree and Ojibway languages. These important alterations meant that all Syllabics typefaces that are fully Unicode-compliant (including system-level typefaces on common operating systems) would be capable of accurately and legibly representing text for the Carrier, Sayisi, and Ojibway Syllabics-using communities moving forward.

When the project produced the comprehensive glyph set, the results not only provided a stable environment for the local Indigenous communities to use their languages on their devices, but also changed the standards for the development of all future Syllabics fonts and ensured that the writing systems of all communities will be accurately represented.

[Syllables, Image provided by Typotheque https://www.typotheque.com/, used with permission.]
Above, a representation of the missing characters for Nattilingmiutut, a dialect of Inuktut in Western Nunavut.


Where to learn more:

Acknowledgements

Special thanks to Liang Hai, Deborah Anderson, and Sarah Rivera for their contributions to this blog.


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Wednesday, June 8, 2022

Unicode CLDR Version 42 Submission Open

[ballot box image] The Unicode CLDR Survey Tool is open for submission for version 42. CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

Version 42 is focusing on:
  • Additional Coverage
    • Unicode 15.0 additions: new emoji, script names, collation data (Chinese & Japanese), …
    • New Languages: Adding Haryanvi, Bhojpuri, Rajasthani at a Basic level.
    • Up-leveling: Xhosa, Hinglish (Hindi-Latin), Nigerian Pidgin, Hausa, Igbo, Yoruba, and Norwegian Nynorsk.
  • Person Name Formatting: for handling the wide variety in the way that people’s names work in different languages.
    • People may have a different number of names, depending on their culture--they might have only one name (“Zendaya”), two (“Albert Einstein”), or three or more.
    • People may have multiple words in a particular name field, eg “Mary Beth” as a given name, or “van Berg” as a surname.
    • Some languages, such as Spanish, have two surnames (where each can be composed of multiple words).
    • The ordering of name fields can be different across languages, as well as the spacing (or lack thereof) and punctuation.
    • Name formatting needs to be adapted to different circumstances, such as a need to be presented shorter or longer; a formal or informal context; or whether the name is used when talking about someone, when talking to someone, or as a monogram (JFK).
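The variations listed above can be illustrated with a minimal sketch. It is not the CLDR person name data or any ICU API: the `PersonName` fields, the `format_name` function, and its `order` and `style` options are all hypothetical names chosen for illustration, and real formatting rules are locale-specific and considerably richer.

```python
from dataclasses import dataclass


@dataclass
class PersonName:
    """Hypothetical name record; each field may be empty or hold several words."""
    given: str = ""     # e.g. "Mary Beth"
    given2: str = ""    # additional given (middle) name
    surname: str = ""   # e.g. "van Berg"
    surname2: str = ""  # second surname, as used in Spanish


def format_name(name: PersonName, order: str = "given_first", style: str = "long") -> str:
    """Format a name by field order and length (illustrative rules only)."""
    if style == "monogram":
        # One upper-case initial per non-empty field.
        fields = (name.given, name.given2, name.surname)
        return "".join(f[0].upper() for f in fields if f)

    surnames = " ".join(s for s in (name.surname, name.surname2) if s)
    if style == "short":
        givens = name.given
    else:
        givens = " ".join(g for g in (name.given, name.given2) if g)

    parts = [surnames, givens] if order == "surname_first" else [givens, surnames]
    # Skip empty parts so that single-name people format cleanly.
    return " ".join(p for p in parts if p)


print(format_name(PersonName(given="Zendaya")))
print(format_name(PersonName(given="Mary Beth", surname="van Berg"), order="surname_first"))
print(format_name(PersonName("John", "Fitzgerald", "Kennedy"), style="monogram"))
```

Keeping each field as a free-form string, rather than splitting on spaces, is what lets multi-word given names and surnames survive reordering intact.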
Submission of new data opened recently, and is slated to finish on June 22. The new data then enters a vetting phase, where contributors work out which of the supplied data for each field is best. That vetting phase is slated to finish on July 6. A public alpha makes the draft data available around August 17, and the final release targets October 19.

Each new locale starts with a small set of Core data, such as a list of characters used in the language. Submitters of those locales need to bring the coverage up to Basic level (very basic dates, times, numbers, and endonyms) during the next submission cycle. In version 41, the following levels were reached:

Level Languages Locales* Notes
Modern 89 361 Suitable for full UI internationalization
Afrikaans‎, ‎… Čeština‎, ‎… Dansk‎, ‎… Eesti‎, ‎… Filipino‎, ‎… Gaeilge‎, ‎… Hrvatski‎, ‎Indonesia‎, ‎… Jawa‎, ‎Kiswahili‎, ‎Latviešu‎, ‎… Magyar‎, ‎…Nederlands‎, ‎… O‘zbek‎, ‎Polski‎, ‎… Română‎, ‎Slovenčina‎, ‎… Tiếng Việt‎, ‎… Ελληνικά‎, ‎Беларуская‎, ‎… ‎ᏣᎳᎩ‎, ‎ Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, … አማርኛ‎, ‎नेपाली‎, … ‎অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, ‎தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, ‎မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, ‎… 日本語‎, ‎…
Moderate 13 32 Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Binisaya, … ‎Èdè Yorùbá, ‎Føroyskt, ‎Igbo, ‎IsiZulu, ‎Kanhgág, ‎Nheẽgatu, ‎Runasimi, ‎Sardu, ‎Shqip, ‎سنڌي, …
Basic 22 21 Suitable for locale selection, such as choice of language in mobile phone settings.
Asturianu, ‎Basa Sunda, ‎Interlingua, ‎Kabuverdianu, ‎Lea Fakatonga, ‎Rumantsch, ‎Te reo Māori, ‎Wolof, ‎Босански (Ћирилица), ‎Татар, ‎Тоҷикӣ, ‎Ўзбекча (Кирил), ‎کٲشُر, ‎कॉशुर (देवनागरी), ‎…, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, ‎粤语 (简体)‎
* Locales are variants for different countries or scripts.

If you would like to contribute missing data for your language, see Survey Tool Accounts. For more information on contributing to CLDR, see the CLDR Information Hub.




Tuesday, May 31, 2022

Unicode 15.0 Beta Review

[Kawi beta chart image] The beta review period for Unicode 15.0 has started. The Unicode Standard is the foundation for all modern software and communications around the world, including all modern operating systems, browsers, laptops, and smart phones, plus the Internet and Web (URLs, HTML, XML, CSS, JSON, etc.). The Unicode Standard, its associated standards, and data form the foundation for CLDR and ICU releases. Thus it is important to ensure a smooth transition to each new version of the standard.

Unicode 15.0 includes a number of changes and 4,489 new characters, including another major extension of CJK unified ideographs. A number of the Unicode Standard Annexes have significant modifications for Unicode 15.0. Two new scripts have been added, and there are also 20 additional emoji characters in Unicode 15.0.

Please review the documentation, adjust your code, test the data files, and report errors and other issues to the Unicode Consortium by July 12, 2022. The review period will only be for six weeks, so prompt feedback is appreciated. Feedback instructions are on the beta page.

See https://www.unicode.org/versions/beta-15.0.0.html for more information about testing the 15.0.0 beta.

See https://www.unicode.org/versions/Unicode15.0.0/ for the current draft summary of Unicode 15.0.0.

About the Unicode Consortium

The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard and related globalization standards.

The membership of the consortium represents a broad spectrum of corporations and organizations, many in the computer and information processing industry. Members include: Adobe, Amazon, Apple, Emojipedia, Google, Government of Bangladesh, International Emerging Technology Company (ETCO), Meta, Microsoft, Netflix, Salesforce, SAP, Tamil Virtual Academy, The University of California (Berkeley), Yat Labs, plus well over a hundred Associate, Liaison, and Individual members. For a complete member list go to https://home.unicode.org/membership/members/.

For more information, please contact the Unicode Consortium https://home.unicode.org/connect/contact-unicode/.



Wednesday, May 4, 2022

Out of this World: New Astronomy Symbols Approved for the Unicode Standard

Five Trans-Neptunian Objects to Join Character Set

By Deborah Anderson, Chair of Unicode Script Ad Hoc Committee

In January 2022, the Unicode Technical Committee approved five new symbols to be published in Unicode 15.0, which has a projected release date of September 2022. These symbols represent newly discovered trans-Neptunian objects (TNOs) in the Solar System. The discoveries resulted from research efforts such as those led by astronomer and professor Dr. Michael Brown at the California Institute of Technology (CalTech).

These five objects orbit the Sun at a distance far greater than that of the major planets. They are currently believed to be large enough to be round, planetary worlds, in a category of objects called “dwarf planets” that also includes Ceres, Pluto, Eris and probably Sedna. The most famous trans-Neptunian object is Pluto, which historically had been considered to be the ninth planet from the Sun, but was reclassified as a dwarf planet in 2006 by the International Astronomical Union (IAU).[1]

[Pluto image]

How did this happen?

Individuals or organizations who want to propose new characters have to check existing characters to avoid duplicates, find out if there are equivalent forms already in existence, and most critically, determine the need for a digital interchange of them, such as symbols that have been encoded for use by NASA and other agencies. The proposal authors then must submit a proposal that articulates how their request meets the criteria.

Once a proposal is submitted, the Unicode Technical Committee determines whether to review the proposal and accept or decline it. This process can take a couple of years or more. In the case of these five characters, the proposers demonstrated the need, clearing the path for approval. 

Tell me more about these new characters. What are their names?

The International Astronomical Union (IAU) has standard conventions for naming objects both within and outside of the solar system. Objects orbiting the Sun outside the orbit of Neptune are named after mythological figures, particularly those associated with creation. But the subset that orbit in a two-to-three resonance with Neptune — the so-called “plutinos”, such as Pluto and Orcus — are named after figures associated with the underworld. In this case, the five TNOs, ordered by distance from the sun, are named:
  • Orcus: the Etruscan and Roman god of the underworld.
  • Haumea: the Hawaiian goddess of fertility; the telescope used to discover this object is located on Hawaiʻi.
  • Quaoar: an important mythological figure of the Tongva, the indigenous people who originally occupied the land where CalTech is located.
  • Makemake: the creator god of the Rapanui of Easter Island.
  • Gonggong: a destructive Chinese water god.
What information is there on the actual symbols that will be available?

All five symbols were designed by Denis Moskowitz, a software engineer in Massachusetts who had previously designed the Unicode symbol for Sedna. He drew inspiration from existing symbols and the “native name or culture” of the objects’ namesakes [2] to create the characters.

[TNO glyphs image]

Denis explains his inspiration for each symbol below:
  • Orcus: The symbol for Orcus is a combination of the Latin letters “O” and “R”, stylized to resemble a skull and an orca’s grin.
  • Haumea: The symbol created for Haumea was a combination and simplification of Hawaiian petroglyphs for “childbirth” and “woman”.
  • Quaoar: The symbol is the Latin letter “Q” with the tail fashioned into the shape of a canoe. The angular shape is intended to reflect Tongva rock art.
  • Makemake: The Makemake symbol is a traditional petroglyph of the face of the creator god Makemake, stylized to suggest an “M”. The design was a collaboration with John T. Whelan.
  • Gonggong: Gonggong’s symbol was based on the first Chinese character in the god’s name, 共 gòng, with a snaky tail replacing the lower section.
What else should we know?

The five symbols supplement a set of other characters for planetary objects that were published in 2018 (Unicode 11.0) and earlier. Two of the newly approved characters appear in a NASA poster. Other people have used the symbols in various media, including tattoos and art. Ultimately, these five new characters will join the 149,180 other characters in the Unicode Standard Version 15.0 and be accessible to anyone, anywhere in the world, who is using a computer or mobile device.

Where can I learn more?
Acknowledgments

Special thanks to Sarah Rivera and Kirk Miller for their contributions to this blog.



Friday, April 8, 2022

ICU 71 Released

[ICU Logo] Unicode® ICU 71 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 71 updates to CLDR 41 locale data with various additions and corrections.

ICU 71 adds phrase-based line breaking for Japanese. Existing line breaking methods follow standards and conventions for body text but do not work well for short Japanese text, such as in titles and headings. This new feature is optimized for these use cases.

ICU 71 adds support for Hindi written in Latin letters (hi_Latn). The CLDR data for this increasingly popular locale has been significantly revised and expanded. Note that based on user expectations, hi_Latn incorporates a large amount of English, and can also be referred to as “Hinglish”.

ICU 71 and CLDR 41 are minor releases, mostly focused on bug fixes and small enhancements. (The fall CLDR/ICU releases will update to Unicode 15 which is planned for September.) We are also working to re-establish continuous performance testing for ICU, and on development towards future versions.

ICU 71 updates to the time zone data version 2022a. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/71.



Wednesday, April 6, 2022

Unicode CLDR Version 41 Released!

[beta image] The Unicode CLDR Version 41 has been released, and has already been integrated into ICU.

CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%. The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level     Languages  Locales  Notes
Modern    89         361      Suitable for full UI internationalization
Moderate  13         32       Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic     22         21       Suitable for locale selection, such as choice of language in mobile phone settings.
Total     124        414      Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
For details, see the Unicode CLDR v41 Release Note.
The next version of CLDR, version 42, is slated to start General Submission on May 18, 2022.

Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.



Monday, April 4, 2022

Emoji Are Not Born, They Are Made

Unicode now accepting proposals for Emoji 16.0

It’s hard to believe that just as Emoji 14.0 begins to appear on your device of choice this year, the Unicode Emoji Subcommittee [ESC] has already begun to plan for Emoji 16.0. That’s right, as of today — April 4, 2022 — applications to submit ideas for new emoji are open through July 31, 2022! 👁️📝👁️

So, how do you ensure your proposal is the best it can be? Well, here are some tips for consideration as you prepare it.

Check whether the emoji already exists!

✅ First: See if it’s already been approved.

🤔 Second, is it being reviewed?

🧑🏾‍🏫 Tip: Don’t skip any of the fields in the form! Incomplete proposals won’t be processed and will be returned. The ESC team members get a lot of submissions and complete proposals help them evaluate the submissions.

Be sure your proposal meets the criteria for consideration.

We recommend being faithful to the criteria for inclusion as much as possible and consulting the Emoji Subcommittee’s priorities, guidelines, strategies, reports, and audits. Many of the new provisional candidates for Emoji 15.0 are the result of these documents: pink heart, shaking face, rightwards pushing hand. The following are just some of the many considerations for writing a compelling proposal:
  • Multiple Uses
    Does the candidate emoji have significant metaphorical references or symbolism and not merely represent itself?
  • Use in sequences
    How is the emoji used with other emoji to communicate something new?
  • Breaking new ground
    Does the emoji represent something that is not already representable?
  • Distinctiveness
    Explain how and why this emoji represents a distinct, visually iconic entity that is relevant to a global audience.
  • Compatibility
    Is it needed for compatibility with frequently used emoji in popular existing systems, such as WeChat, Twitter, etc.?
  • Frequency of Use
    Is there a high frequency of use? There should be empirical evidence of high usage in literature, movies, graphic novels, etc. worldwide.
Examples can be found on this page under “Selection Factors”

Well, let’s get going! How do I propose an emoji?

📝 Submit a proposal

My proposal wasn’t selected :(

We recognize that it will come as a disappointment if your proposal is not one of the few selected for inclusion. 💕 There are loads of reasons why this may have happened.
  • It can already be represented by a sequence
    (Ex. Garbage fire 🗑️🔥, Can of worms 🥫🪱)
  • 🔍 It’s too specific
    We can’t add every type of flower, every breed of dog, every color of drink
  • 💰 Very few are selected
    Roughly thirty emoji characters are added each year
  • 🐣 It’s a transient concept
    Think less “memes” and more “stable long-standing concepts”. Can you cite how this concept has existed in a communicative manner such as literature, movies, graphic novels, etc.?
  • ♾️ It’s open-ended
    There is no compelling evidence to add it over others of a similar type
  • Many other factors for exclusion

Why can’t we make EVERYTHING an emoji?

Any emoji additions have to take into consideration usage frequency, trade-offs with other choices, font file size, and the burden on developers (and users!) to make it easier to send and receive emoji. That’s why the Emoji Subcommittee set out to reduce the number of emoji we encode in any given year.

Reconciling the rapid, transient nature of modern communication with the formal, methodical process required by a standards body like the Unicode Consortium is the name of the game these days. Until the sending and receiving of images is standardized in some manner so you can send any image in the world alongside your text messages not just code points ... well, Unicode is here for the world’s emoji character needs. 🫂💖



Monday, March 28, 2022

The Past and Future of Flag Emoji

Emoji Flags are dead, long live Emoji Flags 🏁 🏁 🏁

By Jennifer Daniel, Unicode Emoji Subcommittee Chair

With Emoji 16.0 submissions open from April 4, 2022 through July 31, 2022, the Unicode Emoji Subcommittee members stand with open arms for your future hair pick, khanda, and pink heart emoji proposals (BTW, if you were planning to prepare proposals for those concepts, we have some good news for you: they are already Emoji 15.0 draft candidates!).

That being said, there is one particular type of emoji for which the Unicode Consortium will no longer accept proposals: flag emoji of any category.

Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. Today, nine of the ten flags in the original emoji set are among the twenty most frequently shared flags. (The only outlier is Russia.) The addition of other flags and thousands of valid sequences to the Unicode Standard has not resulted in wider adoption. Flags don’t stand still and are constantly evolving, and because of their open-ended nature, the addition of one creates exclusivity at the expense of others.

Why do flag emoji exist in the first place?

Well, the shorter, more technical answer is: The country flags use a generative mechanism, and were encoded early on for compatibility reasons.

The longer answer requires a flashback to the 1990s. KDDI and SoftBank, two Japanese mobile phone carriers, had early emoji sets which included 10 country flags: 🇨🇳 🇩🇪 🇪🇸 🇫🇷 🇬🇧 🇮🇹 🇯🇵 🇰🇷 🇷🇺 🇺🇸¹. A possibly apocryphal explanation is that they were used to denote what to grab for dinner: "American 🇺🇸 or Italian 🇮🇹?" (Such an innocent time in emoji history, pre-hamburger 🍔 emoji.) Alas, as Unicode stepped in to create meaningful interoperability between these carrier-specific encodings, it was presented with a problem: why should these 10 countries have flag emoji when others do not?

The original emoji set included ten flags (shown above).
¹ Interestingly, Windows has never supported flag emoji 🔮. So, if you are reading this on a Windows device and flags aren't displaying, simply refer to the image above of the ten original flag emoji.

Various ideas were considered. The Unicode Consortium isn’t in the business of determining what is a country and what isn’t, so the Consortium chose ISO 3166-1 alpha-2 as the source for valid country designations. ISO 3166 is a widely accepted standard, and this particular mechanism represents each country with two letters, such as “US” (for the United States), “FR” (France), or “CN” (China).
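A minimal sketch of how the two-letter mechanism works in practice: each letter of the code maps to one of the 26 Unicode regional indicator symbols (U+1F1E6 through U+1F1FF), and a pair of them is displayed as a flag by fonts that support it. The function name `flag_from_region` is an assumption made for illustration, and the sketch does not check that the code is actually assigned in ISO 3166-1.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_from_region(code: str) -> str:
    """Map a two-letter region code such as "US" to its flag emoji sequence."""
    code = code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {code!r}")
    # Shift each letter A..Z onto the regional indicator range.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


for region in ("US", "FR", "CN"):
    print(region, flag_from_region(region))
```

Because the mapping is purely mechanical, any two-letter pair produces a well-formed sequence; whether it renders as a flag depends on the font and platform.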

It wasn’t a perfect solution, but by allowing the 10 flag emoji, and the rest of the country flags, to be accurately interchanged between DoCoMo, KDDI, SoftBank, Google, Apple, and others, it worked just fine.

Why this flag emoji but not that one?

Today, the largest emoji category is flags. (Out of only ~3600 emoji, there are over 200 flags!) But did you know that there are over 5,000 geographically recognized regions that are also “valid”? These are known as subdivision regions and are based on ISO 3166-2. (These include states in the US, regions in Italy, provinces in Argentina, and so on.)

First, what does “valid” mean to the Unicode Standard? Well, think of it this way. Today, anyone could make a font of 5,000 emoji flags using these sequences. They are valid sequences. They are legit sequences. They won’t break. Any platform, application, or font can implement them. The significant difference here is that valid doesn’t mean they are recommended for implementation.

Back to ISO. ISO groups countries in a more formal way than, say, FIFA or the Olympics. For example, the four regions of the UK are regularly used in sport but not recognized in ISO 3166-1. In 2016, the Unicode Consortium started looking into solutions to support their inclusion (with the technical feasibility of adding more if needed in the future). This was the impetus for adding a general mechanism to make all ISO 3166-2 codes valid for flags. However, only three of the 5,000 ISO 3166-2 codes have widely adopted emoji: England, Scotland, and Wales. (Northern Ireland remains in limbo until an “official flag” is formalized.)

Flags for England, Scotland, and Wales were included in Emoji 5.0
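As a rough sketch of that general mechanism: a subdivision flag is encoded as U+1F3F4 WAVING BLACK FLAG, followed by the subdivision code spelled out in invisible tag characters (each ASCII character shifted into the U+E0000 block), and closed with U+E007F CANCEL TAG. The function name `flag_from_subdivision` is an assumption for illustration, and the sketch does not check that the code is a real ISO 3166-2 subdivision.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG, the base of every subdivision flag
CANCEL_TAG = "\U000E007F"  # terminates the tag sequence
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at this offset


def flag_from_subdivision(code: str) -> str:
    """Build the emoji tag sequence for a subdivision code such as "GB-ENG"."""
    # Unicode subdivision codes are lowercase with no hyphen: "GB-ENG" -> "gbeng".
    code = code.replace("-", "").lower()
    if not (code.isascii() and code.isalnum()):
        raise ValueError(f"expected ASCII letters and digits, got {code!r}")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


for subdivision in ("GB-ENG", "GB-SCT", "GB-WLS"):
    print(subdivision, flag_from_subdivision(subdivision))
```

On platforms without a glyph for a given sequence, the tag characters are simply ignored and only the black flag is shown, which is why a sequence can be valid without being widely supported.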

So, with so many “valid sequences” why hasn’t anyone taken advantage of this sweet sweet rich flag opportunity?

At the time, in 2016, adding a few flags seemed reasonable but in retrospect was short-sighted. If the Emoji Subcommittee recommends the addition of a Catalonia flag emoji, then it looks like favoritism unless all the other subdivisions of Spain are added. And if those are added, what about the subdivisions of Japan or Namibia, or the Cantons of Liechtenstein? The inclusion of new flags will always continue to emphasize the exclusion of others. And there isn’t much room for the fluid nature of politics — countries change but Unicode additions are forever — once a character is added it can never be removed. (That being said, font designers can always update the designs as regimes change).

How are flag emoji used?

Flags are very specific in what they mean, and they don’t represent concepts used multiple times a day or even multiple times a year. You could say flag emoji have transcended the messaging experience and are primarily found in more auto-biographical contexts. (Like your TikTok bio. Or, maybe you add a flag to your username on Twitter.) But, even then flags are not as commonly found in biographical spaces as you may expect. (The top five emoji found in Twitter bios? ❤️✨💙💜💛.)

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. (There are exceptions: usage of the rainbow flag is above median!) That begs the question, “So, why not encode more identity flags?” Well, we have seen the same results for flags as we have seen for other emoji: a very long tail of rarely used options. They also tend to change over time! In the six years since a Pride Flag was added to the Unicode Standard (2016), it has already been redesigned. Many times. Identities are fluid and unstoppable, which makes them incompatible with a formal, unchanging, universal character set.

Why does usage matter in selecting emoji?

Any emoji additions have to take into consideration usage frequency, trade-offs with other choices, font file size, and the burden on developers (and users!) to make it easier to send and receive emoji. That’s why the Emoji Subcommittee set out to reduce the number of emoji we encode in any given year. Flags are also super hard to discern at emoji sizes: it’s quite easy to send a different flag than you intended (and with each additional flag the problem gets worse). The simple truth is that if more people used flags then there would be more of an argument to encode them. The Unicode Standard subset is just not a viable solution here for either implementers or users. Fortunately, there are seemingly infinite other ways to exchange images of flags that are more flexible and decentralized, such as stickers, gifs, and image attachments.

What is Unicode doing about it?

We realize closing this door may come as a disappointment — after all, flags often serve as a rallying cry to be seen, heard, recognized, and understood.

The Internet is a different place now than it was in the 90s: the distribution of imagery online is unstoppable! Given how flags are commonly used, this is a reasonable path forward. If you care to denote your affiliation with a region, be it geographic, political, or identity (or all three), you can add a flag to your avatar image, share videos, or send a gif or sticker to razz your friend during a sports game (and of course there is always ⚽ ⚽ ⚽ ⚽ ⚽).


The more emoji can operate as building blocks, the more versatile, fluid, and useful they become! Rather than relying on Unicode to add new emoji for every concept under the Sun (this is simply not attainable) the citizens of the world have proven to be infinitely creative and fluid: often using existing emoji like the colored hearts (❤️️ 🧡 💛 💚 💙 💜 🤎 🖤 🤍) to express themselves. Hearts are among the most frequently used type of emoji and the nine colored hearts are often juxtaposed next to each other to denote markers of emotion (“I’m sorry 💙” or “love you ❤️”) and identity or affiliation that are not represented with atomic emoji in the Unicode Standard (ex. “Pan African pride ❤️️💚🖤”, “Hi I’m bi 💖💙💜”, and yes even sports teams “Go Mets! 💙🧡” ).

With this in mind, the Emoji Subcommittee has put forth a strategy to add a pink heart, a light blue heart, and a gray heart to the Unicode Standard. These are colors commonly found in gender flags (gender fluid pride flag), sexuality flags (bisexual pride flag), in sports team colors (Go Spurs!) and even some regional flags (Brussels). As of this year, these three heart emoji advanced as draft candidates, and you can expect them to land on your device of choice sometime next year.

In some ways we have returned to where we first started: Adding three new emoji to support a seemingly infinite number of concepts. This time if it fails, at least we’ll be left with lots of heart emoji that have multiple uses. ❤️🧡💛💚💙💜🤎🖤🤍



In light of this change, we’d like to clarify a few additional frequently asked questions with regard to emoji flags.

Wait, if a country gains independence and is recognised by ISO, does that mean no flag emoji for them?
Flags for countries with Unicode region codes are automatically recommended, with no proposals necessary! First their codes and translated names are added to Unicode’s Common Locale Data Repository [CLDR], and then the emoji become valid in the next version of Unicode. These emoji are also automatically recommended for general interchange and wide deployment.

What about flags that change designs for geopolitical reasons?
Unicode does not specify the appearance of flag emoji. It is the responsibility of font designers to update their fonts as politics change. For example, no Unicode changes were required for https://emojipedia.org/flag-mauritania/

My region was assigned a 3166-2 code. Do we have to submit a proposal?
No, the Emoji Subcommittee is no longer taking in any proposals for flags of any kind.

As a recent example, Kurdistan (a subdivision of Iraq) became an official subdivision in ISO 3166-2 (IQ-KR) on May 3, 2021. The corresponding Unicode subdivision code (iqkr) is slated for release in CLDR v41 on Apr 6, 2022. At that point the flag for Kurdistan will officially be valid — any platform, app, or font could support it. But that doesn’t mean it automatically gets in the queue for everyone’s phone. Only countries with ISO 3166-1 region codes are automatically recommended and require no proposal to move forward.
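To make the difference between region codes and subdivision codes concrete, here is a minimal Python sketch of how each kind of code is turned into an emoji flag sequence. The function names are hypothetical and chosen for illustration only. The encoding itself is the one used by existing flags: a pair of regional indicator symbols for a two-letter region code, and a black flag followed by tag characters and a cancel tag for a subdivision code (the mechanism behind the England, Scotland, and Wales flags). Building a valid sequence says nothing about whether a font or platform will actually draw it as a flag.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
BLACK_FLAG = "\U0001F3F4"       # WAVING BLACK FLAG, base of subdivision flags
TAG_BASE = 0xE0000              # tag characters mirror ASCII at this offset
CANCEL_TAG = "\U000E007F"       # terminates an emoji tag sequence


def region_flag(region_code):
    """Build the flag sequence for a two-letter region code such as "AQ"."""
    code = region_code.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        raise ValueError(f"expected a two-letter region code, got {region_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


def subdivision_flag(subdivision_code):
    """Build the tag sequence for a Unicode subdivision code such as "iqkr"."""
    code = subdivision_code.lower()
    if not code.isascii() or not code.isalnum():
        raise ValueError(f"expected an alphanumeric code, got {subdivision_code!r}")
    tags = "".join(chr(TAG_BASE + ord(c)) for c in code)
    return BLACK_FLAG + tags + CANCEL_TAG


if __name__ == "__main__":
    for label, seq in [("AQ", region_flag("AQ")), ("iqkr", subdivision_flag("iqkr"))]:
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in seq)
        print(label, seq, code_points)
```

On a system without a glyph for the subdivision sequence, the second line will typically show a plain black flag, which illustrates the point above: the sequence is valid, but support is up to each platform.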

So what warrants an ISO 3166-1 assignment vs ISO 3166-2?
ISO 3166-1 is for countries recognized by the United Nations and ISO 3166-2 is for parts of countries.

Why is Antarctica part of ISO 3166-1 but Africa isn’t? There seems to be no rational explanation with regard to why islands with no inhabitants have a flag while regions with millions of people have no emoji flag.
It’s true, there are "Exceptional reservations." Antarctica has an ISO 3166-1 alpha 2 code: AQ. But WHY does it have an ISO 3166-1 code? Because ISO 3166 decided to (ages ago) include it, probably since the whole continent is "shared."

For historical reasons, you may see other exceptions like 🇦🇨 AC Ascension Island, 🇨🇵 CP Clipperton Island, or 🇩🇬 DG Diego Garcia.

Why don’t we have asexual, bisexual, pansexual, and non-binary pride flags? And if 🏴󠁧󠁢󠁷󠁬󠁳󠁿 and 🏴󠁧󠁢󠁳󠁣󠁴󠁿 get Unicode flags, surely there’s room for the Aboriginal and Torres Strait Islander flags?
Before diving into the facts of why these flags are not part of the universal character set, we want to first take a moment to consider what people mean when they ask these questions and what Unicode means when it declines these flag proposals, because this question is not one we take lightly. In the course of world history, groups have used flags as a rallying cry to be seen, heard, recognized, and understood. In the Unicode Consortium’s mission to digitize the world’s languages, improve communication online, and achieve meaningful interoperability between platforms, the requests for flags have become a lightning rod for these rallying cries.

When people ask for a new flag emoji, we recognize that the underlying request is about more than simply a new emoji. And when we say, “We aren’t adding more flags,” we are only saying that changing the Unicode Standard is not an effective mechanism for this recognition.

What if I submit a proposal for a flag despite this policy?
Your proposal will not be processed.

Relevant docs/Further Reading
https://www.unicode.org/L2/L2021/21128-esc-recs.pdf
https://www.unicode.org/L2/L2021/21167.htm
https://www.unicode.org/L2/L2021/21172-esc-recs.pdf
https://www.unicode.org/emoji/proposals.html#Flags
http://www.unicode.org/L2/L2019/19084-trans-flag.pdf

Thursday, March 24, 2022

Unicode CLDR v41 Beta available for testing

[beta image] The Unicode CLDR v41 Beta is now available for testing. The beta has already been integrated into the development version of ICU.

The XML data, JSON data, charts, and specification are available for review. These may change if showstopper bugs are found. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.

The release is scheduled for April 06, 2022.

CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.

The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level     Languages  Locales  Notes
Modern    89         361      Suitable for full UI internationalization.
Moderate  13         32       Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic     22         21       Suitable for locale selection, such as choice of language in mobile phone settings.
Total     124        414      Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
For details, see the Unicode CLDR v41 Release Note.
The next version of CLDR, version 42, is slated to start General Submission on May 18, 2022.

Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Thursday, March 3, 2022

Update on the Internationalization & Unicode Conference



This is an update on the annual Internationalization & Unicode Conference. As some of you know, Object Management Group (OMG), our events and logistics partner for the annual Internationalization and Unicode Conference (IUC), is moving in a different strategic direction.

We decided to mutually end the partnership and are now in the process of transferring the various resources from OMG to the Unicode Consortium.

ACKNOWLEDGMENTS

We would like to take this opportunity to thank the OMG team, especially Mike Narducci and Carol David, for their support and dedication in making IUC such a mainstay for the global internationalization community.

Unicode would also like to thank the dedicated group of volunteers who worked with Rick McGowan on the program committee. Some of them have been on the committee from the early days even before we began working with OMG in 2006. This speaks to the strong commitment by the individuals as well as the organizations supporting their involvement over the years.

THE WAY FORWARD

While the ending of this partnership creates some challenges, it is also an opportunity to reshape how Unicode approaches community building and training. And given how the meeting and event landscape continues to evolve, it is a great time to explore best practices and apply lessons learned from other meetings and groups.

To that end, Unicode staff and a small group of volunteers convened late last year and will continue meeting in the coming 60-90 days to create the future IUC.

REQUEST

The Unicode Consortium is always looking to improve its conference. We recognize IUC as a key opportunity each year for knowledge-sharing, community building, and evangelization and want your help to shape the future IUC. Please give us your input and ideas by EOD on Friday, March 11th in one of these brief questionnaires.

(1) Survey for previous attendees
(2) Survey for those who have yet to attend

NEXT STEPS

Once we have additional community input and an update on our specific plan, we will share that information with the broader community via this blog and other channels, including on the meeting website at www.unicodeconference.org.

In the meantime, thanks for your time and ideas!



Wednesday, March 2, 2022

Avoiding Source Code Spoofing

Unicode has convened a group of experts in programming languages, tooling, and security to provide guidance and recommendations on how to better handle international text in source code, as well as code to help implementations.

Recent reports have highlighted problems in the review of source code containing non-ASCII Unicode characters (the so-called “Trojan Source exploit”). A person reviewing a submission of source code could be fooled into thinking that the code was okay, when it was actually malicious. The basic problem occurs when the actual text is different from what the reader perceives it to be, based on what is displayed. This can result either from the presence of characters used in right-to-left scripts (such as Arabic or Hebrew) that can change the visual ordering of text, or from the presence of characters that look like others (also known as “confusables”).
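As a rough illustration of the first half of the problem, the following Python sketch scans source text for explicit bidirectional formatting characters, which are the characters that can make displayed text differ from its stored order. This is only an assumption about what a simple review aid might look like, not the group's recommended approach; the function name `find_bidi_controls` is hypothetical, and the sketch covers only the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069). It does not attempt to detect confusable characters.

```python
import unicodedata

# Explicit bidirectional formatting characters: embeddings, overrides,
# their terminator, isolates, and the isolate terminator.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(source):
    """Return (line, column, code point, name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    # A made-up snippet whose string literal hides bidi controls.
    sample = 'ok = 1\nif access != "user\u202E \u2066// admin\u2069 \u2066":\n    pass\n'
    for line_no, col, code_point, name in find_bidi_controls(sample):
        print(f"line {line_no}, column {col}: {code_point} {name}")
```

A check like this only flags the characters; whether a given occurrence is legitimate (for example, in a string containing Arabic or Hebrew text) still needs a human or a more context-aware tool to decide.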

The problems here are not solely a security issue: text with different writing directions or confusable characters can be hard to work with. Finding a solution here is important from both security and usability points of view. Developers of source code editors or compilers should not be required to have a deep knowledge of Unicode to provide good user experience and robust security mitigations.

Unicode’s mission is to allow everyone to use their own languages on computers and mobile devices. The above issues are part and parcel of a character set that covers all the writing systems of the world – and have been documented in the Unicode Standard since its very first version in 1991. Unicode’s past efforts have focused on misleading URLs and identifiers, and correct visual ordering of plain text. And while much of this material is relevant to source code, this group of experts will now collect, curate, and supplement that early documentation with concrete recommendations to support source code editors and compilers.

While it may seem that it is easiest to simply go back to limiting source code to only ASCII characters, ASCII-only environments make it much harder to write and maintain software that can be used all over the world – a fundamental requirement for modern software. Moreover, this approach disadvantages software developers who use languages other than English.

More details on the source code spoofing issue, the proposed plan, and formation of this group are found in document L2/22-007R2.



Wednesday, February 23, 2022

Unicode CLDR v41 Alpha available for testing

[beta image] The Unicode CLDR v41 Alpha is now available for testing. The alpha has already been integrated into the development version of ICU. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.

Alpha means that the main data and charts are available for review, but the specification, JSON data, and other components are not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:
  • Mar 09 — Beta (data)
  • Mar 23 — Beta2 (spec)
  • Apr 06 — Release
CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.

The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level     Languages  Locales  Notes
Modern    89         361      Suitable for full UI internationalization.
Moderate  13         32       Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic     22         21       Suitable for locale selection, such as choice of language in mobile phone settings.
Total     124        414      Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.



Friday, February 11, 2022

Unicode 15.0 Alpha Review

[u15 alpha image] The repertoire for Unicode 15.0 is now open for early review and comment. During alpha review the repertoire is reasonably mature and stable, but is not yet completely locked down. Discussion regarding whether certain characters should be removed from the repertoire for publication is welcome. Character names and code point assignments are reasonably firm, but suggestions for improvement may still be entertained.

This early review is provided so that reviewers may consider the character repertoire issues prior to the start of beta review (currently scheduled to start in late May, 2022). Once beta review begins, the repertoire, code points, and character names will all be locked down, and no longer be subject to changes.

Feedback for the alpha review should be reported under PRI #442 using the Unicode contact form by April 5, 2022.



Wednesday, February 9, 2022

Enhancements to Unicode Regular Expressions

[Regex image] A new revision of UTS #18, Unicode Regular Expressions, is now available.

Regular expressions are a key tool in software development. Back in 2000, few regular expression engines supported Unicode, even at a basic level. UTS #18 set out to raise the bar, describing how regular expression engines could be adapted to deal with Unicode correctly and completely. Since that time, major programming languages and libraries have adopted level 1 features (supporting all Unicode literals, basic character properties, subtraction, intersection, ...), and some also adopted some level 2 features (full character properties, grapheme clusters, ...).

The main focus in this release is on handling the complement of properties of strings. The distinction is drawn between code point complement and full complement, followed by explicitly defining the complement operator [^...] to be code point complement, and providing the reasons for doing so in an annex. The important difference between [A--B] and [A&&[^B]] is outlined — setting out the reasons why the latter is insufficient to represent set difference.
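The gap between [A--B] and [A&&[^B]] can be seen in a toy model. The Python sketch below is an illustration under stated assumptions, not the UTS #18 algorithm: a property of strings is modeled as a Python set whose members may be single code points or longer strings, `UNIVERSE` is a tiny hypothetical stand-in for the full set of code points, and `code_point_complement` is a made-up helper that models [^...] as yielding only single code points.

```python
# Hypothetical miniature "code point space" standing in for all of Unicode.
UNIVERSE = set("abcdef")


def code_point_complement(s):
    """Model of [^...]: every single code point not in s.

    Multi-code-point strings never appear in the result, because the
    complement is taken over code points only.
    """
    return {cp for cp in UNIVERSE if cp not in s}


# A and B both contain multi-code-point strings ("ab", "cd").
A = {"a", "b", "ab", "cd"}
B = {"b", "cd"}

set_difference = A - B                          # models [A--B]
via_complement = A & code_point_complement(B)   # models [A&&[^B]]

print(sorted(set_difference))   # ['a', 'ab']
print(sorted(via_complement))   # ['a']
```

In this model the string "ab" belongs to A and not to B, so it survives the set difference, but it is lost when intersecting with the code point complement, since that complement contains no strings at all.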

For the EBNF in general, and for character classes with strings in particular, examples were added and the text clarified. A new annex provides examples for how character classes can be parsed.



Tuesday, January 4, 2022

Unicode 14.0 Paperback Available

[U14 paperback vol 1 image] The Unicode 14.0 core specification is now available in paperback book form with an original cover design by Sophia Tai. This edition consists of a pair of modestly priced print-on-demand volumes containing the complete text of the core specification of Version 14.0 of the Unicode Standard.

Each of the two volumes is a compact 6×9 inch US trade paperback size. The two volumes may be purchased separately or together, although they are intended as a set. Please visit the separate description pages for Volume 1 and Volume 2 to order each volume in the set. The cost for the pair is US $36.72, plus shipping and any applicable taxes.

These volumes do not include the Version 14.0 code charts, nor do they include the Version 14.0 Standard Annexes and Unicode Character Database, which are all freely available on the Unicode website.

Purchase The Unicode Standard, Version 14.0 - Core Specification Volume 1 and Volume 2.

