Wednesday, October 9, 2019

The Most Frequent Emoji

How does the Unicode Consortium choose which new emoji to add? One important factor is data about how frequently the current emoji are used. Patterns of usage help to inform decisions about future emoji. The Consortium has been working to assemble this information and make it available to the public.

And the two most frequently used emoji in the world are...

😂 and ❤️

The new Unicode Emoji Frequency page shows a list of the Unicode v12.0 emoji ranked in order of how frequently they are used.

“The forecasted frequency of use is a key factor in determining whether to encode new emoji, and for that it is important to know the frequency of use of existing emoji,” said Mark Davis, President of the Unicode Consortium. “Understanding how frequently emoji are used helps prioritize which categories to focus on and which emoji to add to the Standard.”


Over 136,000 characters are available for adoption, to help the Unicode Consortium’s work on digitally disadvantaged languages.


Friday, October 4, 2019

ICU 65 Released

Unicode® ICU 65 has just been released. It updates to CLDR 36 locale data with many additions and corrections, and some new measurement units. The Java LocaleMatcher API is improved, and ported to C++. For building ICU data, there are new filtering options, and new tracing support for data loading in ICU4C.

ICU is a software library widely used by products and other libraries to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

For details please see http://site.icu-project.org/download/65.

Unicode CLDR Version 36 Language/Locale Data Released

Unicode CLDR 36 provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR 36 included a full Survey Tool data collection phase, adding approximately 32K new translated fields, with significant increases in moderate and/or modern coverage for: ceb (Cebuano), ha (Hausa / Latin script), ig (Igbo), kok (Konkani), qu (Quechua), to (Tongan), yo (Yoruba). Seed data was added for several new languages: cic (Chickasaw), mus (Muscogee), osa (Osage, Osage script); an (Aragonese), su (Sundanese, Latin script), szl (Silesian).

Enhancements in v36 include:
  • New Emoji 13 draft candidates’ names and search keywords are included in this release to support smooth adoption of the upcoming Emoji release (scheduled for release in 2020Q1 as part of Unicode 13)
  • New measurement units and patterns: dot-per-centimeter, dot-per-inch, em, megapixel, pixel, pixel-per-centimeter, pixel-per-inch; decade; therm-us; bar, pascal; and a pattern for combining units in a multiplicative relationship, such as “newton-meter”.
  • Locale IDs:
    • Extended Language Matching to have fallbacks for many encompassed languages.
    • Added more languageAliases from the BCP47 language subtag registry, for deprecated languages.
  • A new test directory was added for localeIdentifiers, graphemeClusters (for currently supported Indic languages) and transliterations.
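
As a rough illustration of what language aliases for deprecated subtags do, the following Python sketch maps a few deprecated language codes to their preferred replacements. The table and the function name `canonicalize_language` are hypothetical and written only for this example; a real implementation would load the full alias data from CLDR rather than hard-code it, and would handle more than the language subtag.

```python
# Hypothetical, hand-written subset of deprecated language subtags and
# their preferred replacements (as listed in the BCP47 subtag registry).
# Real software would load the complete alias data from CLDR instead.
LANGUAGE_ALIASES = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
    "jw": "jv",  # Javanese
    "mo": "ro",  # Moldavian -> Romanian
}


def canonicalize_language(locale_id: str) -> str:
    """Replace a deprecated language subtag with its preferred form.

    Only the first (language) subtag is examined; the remaining
    subtags are passed through unchanged.
    """
    parts = locale_id.replace("_", "-").split("-")
    language = parts[0].lower()
    parts[0] = LANGUAGE_ALIASES.get(language, language)
    return "-".join(parts)


if __name__ == "__main__":
    for tag in ("iw-IL", "in_ID", "en-US"):
        print(tag, "->", canonicalize_language(tag))
```

The sketch deliberately ignores script, region, and variant aliases, which a complete canonicalization would also need to consider.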
There are some infrastructure changes to be aware of, including:
  • The cldr repository has moved from subversion to git, and queries using Trac no longer work. See CLDR Change Requests for new information.
  • The data in the cldr repository now preserves votes for inherited data, indicated with “↑↑↑”. In order to generate CLDR in the previous form without “↑↑↑” and with proper minimization, a new tool GenerateProductionData is available.
    Note: Release data that has been processed with GenerateProductionData is available in a parallel cldr-staging repository, with the same release tags.
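
To make the idea of the “↑↑↑” marker more concrete, here is a simplified Python sketch of resolving and minimizing inherited values between a parent and a child locale. It is not the algorithm used by GenerateProductionData; the function names, the dictionary layout, and the sample keys are assumptions made purely for illustration.

```python
# Marker that records an explicit vote for the inherited value.
INHERIT = "↑↑↑"


def resolve(child: dict, parent: dict) -> dict:
    """Return the child's data with each marker replaced by the parent's value.

    Keys marked as inherited but missing from the parent are dropped.
    """
    resolved = {}
    for key, value in child.items():
        if value == INHERIT:
            if key in parent:
                resolved[key] = parent[key]
        else:
            resolved[key] = value
    return resolved


def minimize(child: dict, parent: dict) -> dict:
    """Keep only entries that actually differ from what the parent provides."""
    return {
        key: value
        for key, value in child.items()
        if value != INHERIT and parent.get(key) != value
    }


if __name__ == "__main__":
    # Hypothetical sample data, not taken from CLDR.
    parent = {"yes": "yes", "no": "no", "colour": "color"}
    child = {"yes": INHERIT, "no": "no", "colour": "colour"}
    print(resolve(child, parent))   # marker replaced by the parent's value
    print(minimize(child, parent))  # only the genuinely different entry remains
```

In this simplified model, the minimized form omits both the markers and any value that merely repeats the parent, which mirrors the general goal of producing data without “↑↑↑”.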


The Common Locale Data Repository (CLDR) provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. This data is used by a wide spectrum of companies for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks as:

  • Locale-specific patterns for formatting and parsing: dates, times, time zones, numbers and currency values, measurement units,…
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, time zones, cities, and time units, emoji characters and sequences (and search keywords),…
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences; keyboard layouts;…
  • Country information: language usage, currency information, calendar preference, week conventions, …
  • Validity: Definitions, aliases, and validity information for Unicode locales, languages, scripts, regions, and extensions,…


Monday, September 30, 2019

Call for Unicode 13.0 Cover Design Art

The Unicode Consortium is inviting artists and designers to submit cover design proposals for Version 13.0 of The Unicode Standard, scheduled for publication in March 2020.

The selected cover design will appear on the Unicode Standard 13.0 web pages, in the print-on-demand publications, and in associated promotional literature on the Unicode website. The artist whose design is selected for the cover will receive full credit in the colophon of the publication for which the art is used, and wherever else the design appears, and will receive $700. Two selected runner-up artists will receive $150 apiece.

Please see the announcement web page for requirements and more details.

Thursday, September 19, 2019

New Public Review Issues for Unicode Technical Reports

The Unicode Consortium has recently opened several Public Review Issues for proposed updates to Unicode Standard Annexes and other technical reports. The closing date for comments on these open issues is September 30, 2019, for feedback to be reviewed at the UTC meeting.

Highlights include a major proposed update to UTS #51, Unicode Emoji, as well as significant updates to UAX #14, Unicode Line Breaking Algorithm; UTS #18, Unicode Regular Expressions; UAX #29, Unicode Text Segmentation; and UAX #38, Unicode Han Database.

Please see the Public Review Issues page for a full list of the items for review and links to the documents.

