The way Unicode handles text in different languages could be misused to write malicious code that says one thing to humans and another to compilers, academics are warning.
“What if it were possible to trick compilers into emitting binaries that did not match the logic visible in source code?” ask Cambridge student Nicholas Boucher and Professor Ross Anderson in a paper published today.
They say it is possible, and outline a new threat that could be deployed by future supply chain attackers – making detection of something like the SolarWinds attack at code level even harder than it is already.
The duo’s research focused on so-called bidirectional (“bidi”) characters in Unicode, and the resulting vulnerability is tracked as CVE-2021-42574. These characters are used so words written in right-to-left languages (such as Arabic and Hebrew) can be inserted into sentences written in left-to-right languages (such as English). Boucher and Anderson discovered that they can be misused to misrepresent source code.
“Embedding multiple layers of LRI and RLI within each other enables the near-arbitrary reordering of strings,” says their paper. “Our key insight is that we can reorder source code characters in such a way that the resulting display order also represents syntactically valid source code.”
“In effect, we anagram program A into program B.”
Concerningly, the academics say that Microsoft’s VS Code and Apple’s Xcode text editors don’t highlight the use of bidi characters as prominently as they might – while praising Vim for showing them as “numerical code points.”
Professor Anderson told The Register: “Most programming languages let you put [bidi characters] in string literals and in comments, so you can use them in source code: code that appears innocuous to a human reviewer can actually do something nasty. That’s bad news for projects like Linux and Webkit that accept contributions from random people, subject them to manual review, then incorporate them into critical code.”
The problem is not merely academic: Rust’s maintainers patched rustc against the attack over the weekend after the researchers used it for a successful proof-of-concept, even though Rust acknowledged it has not seen the technique deployed in the wild.
Snippets of the technique exist on GitHub, although the Cambridge pair’s paper says that none of them seemed to be malicious.
Break comment, receive code
Boucher and Anderson’s paper included several examples of this novel attack technique. One, in Python, is presented below.
In figure 2, 'alice' is defined as being worth 100, followed by a function that subtracts funds from Alice. The final line calls that function with a value of 50, so when executed that little program should give us a result of 50.
However, figure 1 shows us how bidi characters can be used to frustrate the program’s intent: by inserting an RLI (right-to-left isolate) character we change the text direction from conventional left-to-right English to right-to-left. The output of figure 1 becomes 100 in spite of our subtract funds function.
“This is because the word return in the docstring is actually executed due to a bidi override, causing the function to return prematurely and the code which subtracts value from a user’s bank account to never run,” explains the paper.
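The early-return trick can be reproduced in a few lines. The sketch below is a reconstruction from the description above, not a verbatim copy of the paper's figure: the names `bank` and `subtract_funds`, and the use of `exec` on an in-memory string, are illustrative assumptions. The RLI character is written as the escape `\u2067` so that it stays visible here rather than reordering the text.

```python
# Sketch of the early-return attack, assuming the layout described above.
# RLI is spelled as an escape so it cannot hide in this listing.
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE

source = "\n".join([
    "bank = {'alice': 100}",
    "",
    "def subtract_funds(account, amount):",
    # The docstring closes right after the RLI; ';return' is live code,
    # even though a bidi-aware editor may draw it inside the docstring.
    "    ''' Subtract funds from bank account then " + RLI + "''' ;return",
    "    bank[account] -= amount",  # never reached
    "    return",
    "",
    "subtract_funds('alice', 50)",
    "",
])

namespace = {}
exec(source, namespace)            # run the crafted program in isolation
print(namespace["bank"]["alice"])  # prints 100, not 50
```

In logical order the interpreter sees a complete docstring followed by `;return`, so the subtraction is dead code however the line is rendered on screen.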
The same principle can be applied to other languages, including C, C#, C++ and JavaScript as well as Rust – though for the latter, yesterday’s update to version 1.56.1 sees Rust rejecting code containing bidi characters.
Surely highlighting solves this
Most text editors used by devs highlight various levels of nested code, so you’d imagine bidi attacks would be frustrated by changes immediately showing up. Unfortunately, this isn’t a reliable defence: the academics say their “experience was mixed” on this front.
“Some attacks provided strange highlighting in a subset of editors, which may suffice to alert developers that an encoding issue is present. However, all syntax highlighting nuances were editor-specific, and other attacks did not show abnormal highlighting in the same settings,” the paper says.
Defending against the attack technique could be as straightforward as rewriting software build pipelines to halt if they encounter a bidi character, suggest the academics.
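A build-pipeline check of that kind might look something like the sketch below. It assumes the characters of interest are the nine explicit bidi formatting characters (embeddings, overrides, isolates and their terminators); the function names `find_bidi_controls` and `check_source` are hypothetical, and a real pipeline would need its own policy for files that legitimately contain right-to-left text.

```python
import unicodedata

# Explicit bidi formatting characters: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(text):
    """Return (line, column, name) for every bidi control character in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

def check_source(text):
    """Raise ValueError if text contains any bidi control character."""
    hits = find_bidi_controls(text)
    if hits:
        details = ", ".join(f"line {l} col {c}: {n}" for l, c, n in hits)
        raise ValueError(f"bidi control characters found ({details})")
```

Calling `check_source` on each file's decoded contents and failing the build on `ValueError` is one way to wire this in; reporting line and column makes the offending character easy to find even when the editor hides it.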
The same technique could be used to insert homoglyphs – those irritating non-Latin characters used by fraudsters in domain names for years in order to phish the unwary.
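To make the homoglyph problem concrete, here is a small sketch in which two strings look the same in many fonts but are different to a computer. The example word and the helper name `non_ascii_report` are made up for illustration; the only claim is that Cyrillic "а" (U+0430) and Latin "a" are distinct code points.

```python
import unicodedata

latin = "paypal"
spoofed = "p\u0430ypal"  # second letter is U+0430, not ASCII "a"

def non_ascii_report(text):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

print(latin == spoofed)           # False: the strings differ
print(non_ascii_report(spoofed))  # [(1, 'U+0430', 'CYRILLIC SMALL LETTER A')]
```

Flagging any non-ASCII character is a blunt instrument, since plenty of legitimate code uses them in strings and comments, but it shows how a reviewer's eyes and a compiler's tokeniser can disagree about a name.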
Martin Lee, EMEA outreach manager for Cisco Talos, commented to The Register: “Managing security risk is all the more difficult when threat actors are able to compromise source code, or software update systems, in order to integrate malicious functionality within otherwise legitimate software. This research underlines the fact that threat actors may bypass even the most secure perimeter defences. Organisations need to be constantly vigilant for evidence of incursion using both endpoint and network based security systems.” ®
Bootnote
Boucher and Anderson’s paper observes: “When writing vulnerability disclosures, descriptions that personalise the potential impact can be needed to drive action. Neutral disclosures like those found in academic papers are less likely to evoke a response than disclosures stating that named products are immediately at risk”.
We reserve the right to arbitrarily rename the next security discovery FLAMINGHELLDEATHPWNAGE. Tenders will be issued in due course for design of a logo and procurement of a snappy domain name.