Tuesday, March 3, 2026

UTS #18: More Unicode Properties in Regular Expressions

Regular expressions, or “regex”, are the invisible workhorses of the digital world. They let apps and computer systems find, validate, and change text based on patterns rather than specific words. Unicode properties play a vital role in this. An application that uses a fixed list of characters like a-z, A-Z fails badly for all but English. Unicode properties instead take on the burden of supplying meaningful sets of characters, like letters, Greek characters, or emoji. Properties can also be combined: for example, the expression [\p{script=greek}&&\p{letter}] matches Greek letters.
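To make the idea of combining character sets concrete, here is a minimal Python sketch. Python's built-in re module does not accept \p{...}, so the sketch approximates the two properties by hand. The helper names (is_letter, in_greek_blocks, greek_letters) are hypothetical, and the Greek check uses two block ranges as a rough stand-in for the Script property, which is an assumption made only for illustration.

```python
import re
import unicodedata


def is_letter(ch: str) -> bool:
    # General_Category=Letter is every category starting with "L" (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(ch).startswith("L")


def in_greek_blocks(ch: str) -> bool:
    # ASSUMPTION: two block ranges stand in for \p{script=greek}.
    # Blocks and scripts are not the same thing; this is only an illustration.
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF


def greek_letters(text: str) -> list[str]:
    # Intersection: keep only characters that satisfy both conditions.
    return [ch for ch in text if in_greek_blocks(ch) and is_letter(ch)]


# "x", then alpha, beta, gamma, then GREEK TONOS (a non-letter), then "5"
sample = "x = \u03b1\u03b2\u03b3 \u0384 5"
ascii_hits = re.findall(r"[a-zA-Z]", sample)  # the fixed list finds only "x"
greek_hits = greek_letters(sample)            # finds alpha, beta, gamma
print(ascii_hits, greek_hits)
```

A regex engine with native property support would replace both helpers with a single pattern. The sketch only shows why a fixed a-z list misses the Greek text and how intersecting two sets excludes Greek characters that are not letters.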

This update to the specification now covers over 100 different properties. The most important changes are listed below; the others can be found in the Modifications section of the specification.

  • Section 2.7 Full Properties lists the full set of properties recommended for support. This version adds: IDS_Unary_Operator, NFKC_Simple_Casefold, ID_Compat_Math_Start, ID_Compat_Math_Continue, Indic_Conjunct_Break, and RGI_Emoji_Qualification.

  • Special rules called “matching rules” are used when looking up properties and their values by name. This version recommends the matching rules from Section 5.9 Matching Rules of UAX #44.
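As an illustration of what name matching can look like, the following Python sketch implements one loose-matching approach: ignore case, whitespace, underscores, hyphens, and an initial "is" prefix. This is assumed to correspond to the loose rule for symbolic values in UAX #44 (UAX44-LM3); check the specification for the authoritative wording. The function names and the small lookup table are hypothetical, and the table holds only the six property names mentioned above.

```python
import re
from typing import Optional


def loose_key(name: str) -> str:
    # Drop whitespace, underscores, and hyphens, then lowercase.
    key = re.sub(r"[\s_-]", "", name).lower()
    # Drop an initial "is" prefix (applied to both sides of the comparison).
    if key.startswith("is"):
        key = key[2:]
    return key


# Hypothetical table: only the properties named in this announcement.
PROPERTY_NAMES = [
    "IDS_Unary_Operator",
    "NFKC_Simple_Casefold",
    "ID_Compat_Math_Start",
    "ID_Compat_Math_Continue",
    "Indic_Conjunct_Break",
    "RGI_Emoji_Qualification",
]
LOOKUP = {loose_key(name): name for name in PROPERTY_NAMES}


def resolve_property(name: str) -> Optional[str]:
    # Return the canonical spelling, or None when nothing matches.
    return LOOKUP.get(loose_key(name))


print(resolve_property("indic conjunct-break"))
print(resolve_property("isNFKC_Simple_Casefold"))
```

Normalizing both the stored names and the query with the same function keeps the comparison symmetric, so a user can write a property name in whichever spelling is most natural.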

By expanding and refining property support in UTS #18, this update strengthens the foundation for global text processing.

