Monday, March 28, 2022

The Past and Future of Flag Emoji

Emoji Flags are dead, long live Emoji Flags 🏁 🏁 🏁

By Jennifer Daniel, Unicode Emoji Subcommittee Chair

With Emoji 16.0 submissions open from April 4, 2022 through July 31, 2022, the Unicode Emoji Subcommittee members stand with open arms for your future hair pick, khanda, and pink heart emoji proposals (BTW, if you were planning to prepare proposals for those concepts, we have some good news for you: they are already Emoji 15.0 draft candidates!).

That being said, there is one particular type of emoji for which the Unicode Consortium will no longer accept proposals: flag emoji of any category.

Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. Today, nine of the ten flags from the original carrier emoji sets are among the twenty most frequently shared flags. (The only outlier is Russia.) The addition of other flags and thousands of valid sequences to the Unicode Standard has not resulted in wider adoption. Flags don’t stand still: they are constantly evolving, and because flags are open-ended, the addition of one creates exclusivity at the expense of others.

Why do flag emoji exist in the first place?

Well, the shorter, more technical answer is: The country flags use a generative mechanism, and were encoded early on for compatibility reasons.

The longer answer requires a flashback to the 1990s. KDDI and SoftBank, two Japanese mobile phone carriers, had early emoji sets which included 10 country flags: 🇨🇳 🇩🇪 🇪🇸 🇫🇷 🇬🇧 🇮🇹 🇯🇵 🇰🇷 🇷🇺 🇺🇸¹. A possibly apocryphal explanation is that they were used to denote what to grab for dinner: "American 🇺🇸 or Italian 🇮🇹?" (Such an innocent time in emoji history, pre-hamburger 🍔 emoji). Alas, as Unicode stepped in to create meaningful interoperability between these carrier-specific encodings, they were presented with a problem: why should these 10 countries have flag emoji when others do not?

The original emoji set included ten flags (shown above).
¹ Interestingly, Windows has never supported flag emoji 🔮. So, if you are reading this on a Windows device and flags aren't displaying, simply refer to the image above of the ten original flag emoji.

Various ideas were considered. The Unicode Consortium isn’t in the business of determining what is a country and what isn’t. That’s when the Consortium chose ISO 3166-1 alpha-2 as the source for valid country designations. ISO 3166 is a widely accepted standard, and this particular mechanism represents each country with 2 letters, such as “US” (for the United States), “FR” (France), or “CN” (China).
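
To make the two-letter mechanism concrete, here is a minimal Python sketch. It assumes the standard encoding of country flags as a pair of regional indicator symbols (U+1F1E6 for A through U+1F1FF for Z); the function names are illustrative and not part of any Unicode API. The sketch does not check whether the two letters are actually assigned in ISO 3166-1, so an unassigned pair still yields a well-formed sequence that fonts simply will not draw as a flag.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def country_flag(alpha2: str) -> str:
    """Map a two-letter region code to its pair of regional indicator symbols."""
    if len(alpha2) != 2 or not (alpha2.isascii() and alpha2.isalpha()):
        raise ValueError(f"expected two ASCII letters, got {alpha2!r}")
    # Each letter is shifted by the same offset from "A" into the indicator block.
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in alpha2.upper()
    )


def region_code(flag: str) -> str:
    """Inverse mapping: recover the two letters from a flag sequence."""
    return "".join(chr(ord(ch) - REGIONAL_INDICATOR_A + ord("A")) for ch in flag)


if __name__ == "__main__":
    for code in ("US", "FR", "CN"):
        print(code, country_flag(code))
```

Because the mapping is purely arithmetic, no per-country code points were ever needed: 26 symbols cover every present and future two-letter code.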

It wasn’t a perfect solution, but by allowing the 10 flag emoji (and the rest of the country flags) to be accurately interchanged between DoCoMo, KDDI, SoftBank, Google, Apple, and others, it worked just fine.

Why this flag emoji but not that one?

Today, the largest emoji category is flags (Out of only ~3600 emoji, there are over 200 flags!). But, did you know that there are over 5,000 geographically-recognized regions that are also “valid”? These are known as subdivision regions and are based on ISO 3166-2. (These include states in the US, regions in Italy, provinces in Argentina, and so on.)

First, what does “valid” mean to the Unicode Standard? Well, think of it this way. Today, anyone could make a font of 5,000 emoji flags using these sequences. They are valid sequences. They are legit sequences. They won’t break. Any platform, application, or font can implement them. The significant difference here is that valid doesn’t mean they are recommended for implementation.

Back to ISO. ISO groups countries in a more formal way than, say, FIFA or the Olympics. For example, the four regions of the UK are regularly used in sport but not recognized in ISO 3166-1. In 2016, the Unicode Consortium started looking into solutions to support their inclusion (with the technical feasibility of adding more if needed in the future). This was the impetus for adding a general mechanism to make all ISO 3166-2 codes valid for flags. However, only three of the 5,000 ISO 3166-2 codes have widely adopted emoji: England, Scotland, and Wales. (Northern Ireland remains in limbo until an “official flag” is formalized).
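
The general mechanism for subdivision flags can be sketched in Python as follows. This assumes the emoji tag sequence layout: a WAVING BLACK FLAG base (U+1F3F4), one tag character per character of the lowercased, hyphen-free subdivision code, and a closing CANCEL TAG (U+E007F). The function name and the loose input validation are illustrative choices, not part of any standard API.

```python
BLACK_FLAG = "\U0001F3F4"  # WAVING BLACK FLAG, the base of the sequence
TAG_OFFSET = 0xE0000       # tag characters mirror ASCII at U+E0000 + code point
CANCEL_TAG = "\U000E007F"  # CANCEL TAG, terminates the sequence


def subdivision_flag(iso_3166_2: str) -> str:
    """Build the tag sequence for an ISO 3166-2 code such as 'GB-SCT'."""
    code = iso_3166_2.replace("-", "").lower()
    # Loose sanity check only (an assumption of this sketch); it does not
    # consult ISO or CLDR data to confirm the code is really assigned.
    if not code or not (code.isascii() and code.isalnum()):
        raise ValueError(f"unexpected subdivision code: {iso_3166_2!r}")
    tags = "".join(chr(TAG_OFFSET + ord(ch)) for ch in code)
    return BLACK_FLAG + tags + CANCEL_TAG


if __name__ == "__main__":
    for code in ("GB-ENG", "GB-SCT", "GB-WLS"):
        flag = subdivision_flag(code)
        print(code, flag, [f"U+{ord(ch):04X}" for ch in flag])
```

On a platform without a glyph for a given sequence, the result typically falls back to a plain black flag, since the tag characters themselves are invisible.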

Flags for England, Scotland, and Wales were included in Emoji 5.0

So, with so many “valid sequences” why hasn’t anyone taken advantage of this sweet sweet rich flag opportunity?

At the time, in 2016, adding a few flags seemed reasonable but in retrospect was short-sighted. If the Emoji Subcommittee recommends the addition of a Catalonia flag emoji, then it looks like favoritism unless all the other subdivisions of Spain are added. And if those are added, what about the subdivisions of Japan or Namibia, or the Cantons of Liechtenstein? The inclusion of new flags will always continue to emphasize the exclusion of others. And there isn’t much room for the fluid nature of politics — countries change but Unicode additions are forever — once a character is added it can never be removed. (That being said, font designers can always update the designs as regimes change).

How are flag emoji used?

Flags are very specific in what they mean, and they don’t represent concepts used multiple times a day or even multiple times a year. You could say flag emoji have transcended the messaging experience and are primarily found in more autobiographical contexts. (Like your TikTok bio. Or, maybe you add a flag to your username on Twitter.) But, even then flags are not as commonly found in biographical spaces as you may expect. (The top five emoji found in Twitter bios? ❤️✨💙💜💛.)

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. (There are exceptions: usage of the rainbow flag is above median!) That raises the question, “So, why not encode more identity flags?” Well, we have seen the same results for flags as we have seen for other emoji: a very long tail of rarely used options. They also tend to change over time! In the six years since a Pride Flag was added to the Unicode Standard (2016), it’s already been redesigned. Many times. Identities are fluid and unstoppable, which makes them incompatible with a formal, unchanging, universal character set.

Why does usage matter in selecting emoji?

Any emoji additions have to take into consideration usage frequency, trade-offs with other choices, font file size, and the burden on developers (and users!) to make it easier to send and receive emoji. That’s why the Emoji Subcommittee set out to reduce the number of emoji we encode in any given year. Flags are also super hard to discern at emoji sizes: it’s quite easy to send a different flag than you intended (and with each additional flag the problem gets worse). The simple truth is that if more people used flags then there would be more of an argument to encode them. The Unicode Standard subset is just not a viable solution here for implementers or users. Fortunately, there are seemingly infinite other ways to exchange images of flags that are more flexible and decentralized, such as stickers, gifs, and image attachments.

What is Unicode doing about it?

We realize closing this door may come as a disappointment — after all, flags often serve as a rallying cry to be seen, heard, recognized, and understood.

The Internet is a different place now than it was in the 90’s — the distribution of imagery online is unstoppable! Given how flags are commonly used this is a reasonable path forward: If you care to denote your affiliation with a region be it geographic, political, or identity (or all three) you can add a flag to your avatar image, share videos, or send a gif or sticker to razz your friend during a sports game (and of course there is always ⚽ ⚽ ⚽ ⚽ ⚽).


The more emoji can operate as building blocks, the more versatile, fluid, and useful they become! Rather than relying on Unicode to add new emoji for every concept under the Sun (this is simply not attainable) the citizens of the world have proven to be infinitely creative and fluid: often using existing emoji like the colored hearts (❤️ 🧡 💛 💚 💙 💜 🤎 🖤 🤍) to express themselves. Hearts are among the most frequently used type of emoji and the nine colored hearts are often juxtaposed next to each other to denote markers of emotion (“I’m sorry 💙” or “love you ❤️”) and identity or affiliation that are not represented with atomic emoji in the Unicode Standard (ex. “Pan African pride ❤️💚🖤”, “Hi I’m bi 💖💙💜”, and yes even sports teams “Go Mets! 💙🧡”).

With this in mind, the Emoji Subcommittee has put forth a strategy to add a pink heart, a light blue heart, and a gray heart to the Unicode Standard. These are colors commonly found in gender flags (gender fluid pride flag), sexuality flags (bisexual pride flag), in sports team colors (Go Spurs!) and even some regional flags (Brussels). As of this year, these three heart emoji advanced as draft candidates, and you can expect them to land on your device of choice sometime next year.

In some ways we have returned to where we first started: Adding three new emoji to support a seemingly infinite number of concepts. This time if it fails, at least we’ll be left with lots of heart emoji that have multiple uses. ❤️🧡💛💚💙💜🤎🖤🤍



In light of this change, we’d like to clarify a few additional frequently asked questions with regard to emoji flags:

Wait, if a country gains independence and is recognised by ISO, does that mean no flag emoji for them?
Flags for countries with Unicode region codes are automatically recommended, with no proposals necessary! First their codes and translated names are added to Unicode’s Common Locale Data Repository [CLDR], and then the emoji become valid in the next version of Unicode. These emoji are also automatically recommended for general interchange and wide deployment.

What about flags that change designs for geopolitical reasons?
Unicode does not specify the appearance of flag emoji. It is the responsibility of font designers to update their fonts as politics change. For example, no Unicode changes were required for the Mauritania flag (https://emojipedia.org/flag-mauritania/).

My region was assigned a 3166-2 code. Do we have to submit a proposal?
No, the Emoji Subcommittee is no longer taking in any proposals for flags of any kind.

As a recent example, Kurdistan (a subdivision of Iraq) became an official subdivision in ISO 3166-2 (IQ-KR) on May 3, 2021. The corresponding Unicode subdivision code (iqkr) is slated for release in CLDR v41 on Apr 6, 2022. At that point the flag for Kurdistan will officially be valid — any platform, app, or font could support it. But that doesn’t mean it automatically gets in the queue for everyone’s phone. Only countries with ISO 3166-1 region codes are automatically recommended and require no proposal to move forward.
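
The distinction between "valid" and "recommended" in the Kurdistan example can be sketched as a small Python lookup. Everything here is illustrative: `RECOMMENDED` holds only the three subdivisions this article names as widely adopted, and `SAMPLE_VALID` is a tiny hypothetical stand-in for the real CLDR subdivision list, which has thousands of entries.

```python
# Subdivisions the article names as having widely adopted flag emoji.
RECOMMENDED = frozenset({"gbeng", "gbsct", "gbwls"})

# Hypothetical stand-in for the CLDR subdivision list (assumption of this
# sketch); a real implementation would load the full list from CLDR data.
SAMPLE_VALID = RECOMMENDED | {"iqkr", "esct", "usca"}


def unicode_subdivision_code(iso_3166_2: str) -> str:
    """Convert 'IQ-KR' to 'iqkr': drop the hyphen and lowercase."""
    return iso_3166_2.replace("-", "").lower()


def flag_status(iso_3166_2: str, valid_codes=SAMPLE_VALID) -> str:
    """Classify a subdivision flag as recommended, merely valid, or not valid."""
    code = unicode_subdivision_code(iso_3166_2)
    if code in RECOMMENDED:
        return "recommended"
    if code in valid_codes:
        return "valid, not recommended"
    return "not valid"


if __name__ == "__main__":
    for iso in ("GB-SCT", "IQ-KR", "ZZ-ZZ"):
        print(iso, unicode_subdivision_code(iso), flag_status(iso))
```

Under these assumptions, IQ-KR lands in the middle tier: any platform may implement it, but nothing obliges vendors to ship it.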

So what warrants an ISO 3166-1 assignment vs ISO 3166-2?
ISO 3166-1 is for countries recognized by the United Nations and ISO 3166-2 is for parts of countries.

Why is Antarctica part of ISO 3166-1 but Africa isn’t? There seems to be no rational explanation with regard to why islands with no inhabitants have a flag while regions with millions of people have no emoji flag.
It’s true, there are "Exceptional reservations." Antarctica has an ISO 3166-1 alpha-2 code: AQ. But WHY does it have an ISO 3166-1 code? Because ISO 3166 decided to (ages ago) include it, probably since the whole continent is "shared."

For historical reasons, you may see other exceptions like 🇦🇨 AC Ascension Island, 🇨🇵 CP Clipperton Island, or 🇩🇬 DG Diego Garcia.

Why don’t we have asexual, bisexual, pansexual, and non-binary pride flags? And if 🏴󠁧󠁢󠁷󠁬󠁳󠁿 and 🏴󠁧󠁢󠁳󠁣󠁴󠁿 get Unicode flags, surely there’s room for the Aboriginal and Torres Strait Islander flags?
Before diving into the facts of why these flags are not part of the universal character set, we want to first take a moment to consider what people mean when they ask these questions and what Unicode means when they decline these flag proposals. Because this question is not one we take lightly. In the course of world history, groups have used flags as a rallying cry to be seen, heard, recognized, and understood. In the Unicode Consortium’s mission to digitize the world’s languages, improve communication online, and achieve meaningful interoperability between platforms, the requests for flags have become a lightning rod for these rallying cries.

When people ask for a new flag emoji, we recognize that the underlying request is about more than simply a new emoji. And when we say, “We aren’t adding more flags,” we are only saying changing the Unicode Standard is not an effective mechanism for this recognition.

What if I submit a proposal for a flag despite this policy?
Your proposal will not be processed.

Relevant docs/Further Reading
https://www.unicode.org/L2/L2021/21128-esc-recs.pdf
https://www.unicode.org/L2/L2021/21167.htm
https://www.unicode.org/L2/L2021/21172-esc-recs.pdf
https://www.unicode.org/emoji/proposals.html#Flags
http://www.unicode.org/L2/L2019/19084-trans-flag.pdf

Thursday, March 24, 2022

Unicode CLDR v41 Beta available for testing

The Unicode CLDR v41 Beta is now available for testing. The beta has already been integrated into the development version of ICU.

The XML data, JSON data, charts, and specification are available for review. These may change if showstopper bugs are found. We would especially appreciate feedback from non-ICU consumers of CLDR data. Feedback can be filed at CLDR Tickets.

The release is scheduled for April 06, 2022.

CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.
The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.

Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Level      Languages   Locales   Notes
Modern            89       361   Suitable for full UI internationalization
Moderate          13        32   Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Basic             22        21   Suitable for locale selection, such as choice of language in mobile phone settings.
Total            124       414   Total of all languages/locales with ≥ Basic coverage.

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
  • Modern: Cherokee, Cantonese, Scottish Gaelic, Sorbian (Lower), Sorbian (Upper)
  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
For details, see the Unicode CLDR v41 Release Note.
The next version of CLDR, version 42, is slated to start General Submission on May 18, 2022.

Unicode CLDR provides key building blocks for software supporting the world’s languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.


Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

[badge]

Thursday, March 3, 2022

Update on the Internationalization & Unicode Conference



This is an update on the annual Internationalization & Unicode Conference. As some of you know, Object Management Group (OMG), our events and logistics partner for the annual Internationalization and Unicode Conference (IUC), is moving in a different strategic direction.

We decided to mutually end the partnership and are now in the process of transferring the various resources from OMG to the Unicode Consortium.

ACKNOWLEDGMENTS

We would like to take this opportunity to thank the OMG team, especially Mike Narducci and Carol David, for their support and dedication in making IUC such a mainstay for the global internationalization community.

Unicode would also like to thank the dedicated group of volunteers who worked with Rick McGowan on the program committee. Some of them have been on the committee from the early days even before we began working with OMG in 2006. This speaks to the strong commitment by the individuals as well as the organizations supporting their involvement over the years.

THE WAY FORWARD

While the ending of this partnership creates some challenges, it is also an opportunity to reshape how Unicode approaches community building and training. And given how the meeting and event landscape continues to evolve, it is a great time to explore best practices and apply lessons learned from other meetings and groups.

To that end, Unicode staff and a small group of volunteers convened late last year and will continue meeting in the coming 60-90 days to create the future IUC.

REQUEST

The Unicode Consortium is always looking to improve its conference. We recognize IUC as a key opportunity each year for knowledge-sharing, community building, and evangelization and want your help to shape the future IUC. Please give us your input and ideas by EOD on Friday, March 11th in one of these brief questionnaires.

(1) Survey for previous attendees
(2) Survey for those who have yet to attend

NEXT STEPS

Once we have additional community input and an update on our specific plan, we will share that information with the broader community via this blog and other channels, including on the meeting website at www.unicodeconference.org.

In the meantime, thanks for your time and ideas!



Wednesday, March 2, 2022

Avoiding Source Code Spoofing

Unicode has convened a group of experts in programming languages, tooling, and security to provide guidance and recommendations on how to better handle international text in source code, as well as providing code to help implementations.

Recent reports have highlighted problems in the review of source code containing non-ASCII Unicode characters (the so-called “Trojan Source exploit”). A person reviewing a submission of source code could be fooled into thinking that the code was okay, when it was actually malicious. The basic problem occurs when the actual text is different from what the reader perceives it to be, based on what is displayed. This can result either from the presence of characters used in right-to-left scripts (such as Arabic or Hebrew) that can change the visual ordering of text, or from the presence of characters that look like others (also known as “confusables”).
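
As a rough illustration of the reordering half of the problem, the following Python sketch flags explicit bidirectional formatting controls in a piece of source text. It is only an assumption about what a simple check might look like, not the expert group's recommendation: it relies on the bidi classes reported by Python's standard `unicodedata` module, the function name is made up for this example, and it does nothing about confusable characters.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls
# (for example U+202E RIGHT-TO-LEFT OVERRIDE, U+2066 LEFT-TO-RIGHT ISOLATE).
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(source: str):
    """Return (line, column, character name) for each bidi control found."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                findings.append((line_no, col_no, unicodedata.name(ch)))
    return findings


if __name__ == "__main__":
    # Escapes keep the invisible characters visible in this listing.
    sample = 'access = "user\u202e \u2066// admin only\u2069 \u2066"\n'
    for line_no, col_no, name in find_bidi_controls(sample):
        print(f"line {line_no}, column {col_no}: {name}")
```

A check like this only reports where such characters occur; whether a given occurrence is legitimate (for example, inside an Arabic or Hebrew string literal) still needs human or tool-specific judgment.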

The problems here are not solely a security issue: text with different writing directions or confusable characters can be hard to work with. Finding a solution here is important from both security and usability points of view. Developers of source code editors or compilers should not be required to have a deep knowledge of Unicode to provide good user experience and robust security mitigations.

Unicode’s mission is to allow everyone to use their own languages on computers and mobile devices. The above issues are part and parcel of a character set that covers all the writing systems of the world – and have been documented in the Unicode Standard since its very first version in 1991. Unicode’s past efforts have focused on misleading URLs and identifiers, and correct visual ordering of plain text. And while much of this material is relevant to source code, this group of experts will now collect, curate, and supplement that early documentation with concrete recommendations to support source code editors and compilers.

While it may seem that it is easiest to simply go back to limiting source code to only ASCII characters, ASCII-only environments make it much harder to write and maintain software that can be used all over the world – a fundamental requirement for modern software. Moreover, this approach disadvantages software developers who use languages other than English.

More details on the source code spoofing issue, the proposed plan, and formation of this group are found in document L2/22-007R2.

