Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.


Unicode encompasses letters and symbols (over 149,000 in Unicode 15.1) across the world’s writing systems, so it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pа”, where the ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫
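To make the difference visible, one can list the code points behind each string. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is made up for illustration and is not part of any Unicode specification.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


genuine = "pa"        # Latin p + Latin a
spoofed = "p\u0430"   # Latin p + Cyrillic a (U+0430)

# The two strings render alike in many fonts but are not equal.
print(genuine == spoofed)
for label, text in (("genuine", genuine), ("spoofed", spoofed)):
    print(label, describe(text))
```

A listing like this only reveals what is there; deciding whether two strings are confusable is the job of the mechanisms in UTS #39.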

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that looks secure to human reviewers but actually contains hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.


  • Line-break spoofs can make what appears to be a line of code actually be commented out, as far as the compiler is concerned. This can happen with C11, for example:
    [Image: precondition]
    To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler does not treat it as a line break, and will interpret this as one line consisting only of a comment!

  • The “pа” above is an example of a confusable spoof.

  • As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
Such code might not even be malicious: it is all too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs, or simply to produce code that is very hard to understand. Here’s an example:

The text “Error: {0} {1}", message” becomes RTL in translation.
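The line-break spoof described above can be reproduced in a few lines. This is a rough sketch rather than a model of any particular compiler: the source snippet and the name `assert_ok` are hypothetical, `str.splitlines()` stands in for a viewer that honors U+2028 as a line break, and splitting on `"\n"` alone stands in for a compiler that does not.

```python
# Hypothetical source text: a comment, then U+2028, then what looks
# like a separate line of code.
SOURCE = "// check precondition\u2028assert_ok(x);\nreturn x;\n"

# Characters that some tools display as line breaks but that a
# newline-only tokenizer would not treat as one.
SUSPICIOUS_BREAKS = {"\u0085", "\u2028", "\u2029"}


def lines_as_viewer(src: str) -> list[str]:
    """Split the way a Unicode-aware viewer might (U+2028 breaks a line)."""
    return src.splitlines()


def lines_newline_only(src: str) -> list[str]:
    """Split on LINE FEED only, dropping empty trailing pieces."""
    return [line for line in src.split("\n") if line]


def find_suspicious_breaks(src: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') for each suspicious line-break character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(src)
        if ch in SUSPICIOUS_BREAKS
    ]


print(len(lines_as_viewer(SOURCE)))      # three visible lines
print(len(lines_newline_only(SOURCE)))   # two lines; the first is all comment
print(find_suspicious_breaks(SOURCE))
```

A scan like `find_suspicious_breaks` is the kind of check a linter or code review tool could run before the text ever reaches a human reviewer.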

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.


In response to this problem, the Consortium started a project in early 2022 to address it. That project resulted in the Source Code Working Group (SCWG), a cross-functional group of experts in Unicode processing, programming languages, and software development tooling, which worked through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of the use of the important higher-level protocol HL4, and emphasized its use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks. UAX #31 provided important guidance on profiles for default identifiers, and clarified that the requirements on Pattern_White_Space and Pattern_Syntax characters apply to programming languages and are relevant to issues of bidirectional ordering and potential spoofing attacks.


The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is in all modern operating systems. However, one useful new API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.
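The sketch below is not bidiSkeleton and does not call ICU. It is a deliberately crude heuristic, built on Python's standard `unicodedata` module, that flags identifiers mixing strong right-to-left characters with left-to-right letters or digits, so that they can be given a closer look. The function names are invented for illustration, and the heuristic will also flag legitimate identifiers (for example, a Hebrew name followed by a digit).

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}               # strong right-to-left
LTR_OR_DIGIT_CLASSES = {"L", "EN", "AN"}  # strong left-to-right, or digits


def logical_order(identifier: str) -> list[str]:
    """List the code points in the order they are stored, not displayed."""
    return [f"U+{ord(ch):04X}" for ch in identifier]


def mixes_directions(identifier: str) -> bool:
    """Heuristic: True if RTL characters appear alongside LTR letters or digits."""
    classes = {unicodedata.bidirectional(ch) for ch in identifier}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_OR_DIGIT_CLASSES)


first = "A\u05D01"    # A, HEBREW LETTER ALEF, 1
second = "A1\u05D0"   # A, 1, HEBREW LETTER ALEF

for name in (first, second, "A1"):
    print(logical_order(name), mixes_directions(name))
```

Both spoofable identifiers are flagged while the plain `A1` is not; telling whether two flagged identifiers actually collide on screen is the part that needs the real bidiSkeleton.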

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm, and UAX #31, Unicode Identifiers and Syntax, along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!
