CVE-2021-42694: Unicode Homoglyph Security Vulnerability

CVE-2021-42694 Overview

CVE-2021-42694 is a vulnerability discovered in the character definitions of the Unicode Specification through version 14.0. The specification allows an adversary to produce source code identifiers such as function names using homoglyphs that render visually identical to a target identifier. Adversaries can leverage this to inject code via adversarial identifier definitions in upstream software dependencies that are invoked deceptively in downstream software.

This vulnerability, part of the broader "Trojan Source" class of attacks, exploits the fundamental nature of international text support to enable supply chain attacks where malicious code can be hidden in plain sight within source code repositories.

Critical Impact
Attackers can inject malicious code into software projects using visually identical but semantically different Unicode characters, enabling supply chain attacks that evade human code review.

Affected Products

Unicode Specification through version 14.0
All software implementations supporting Unicode identifiers in source code
Programming languages and compilers that allow Unicode in identifiers

Discovery Timeline

2021-11-01 - CVE-2021-42694 published to NVD
2024-11-21 - Last updated in NVD database

Technical Details for CVE-2021-42694

Vulnerability Analysis

The vulnerability exploits a fundamental aspect of Unicode support in modern programming languages and development tools. When software implements the Unicode Standard, it may fail to distinguish between visually identical but semantically different characters, known as homoglyphs. Attackers can use this to craft malicious source code that appears legitimate during human review but executes differently than the reviewer expects.

The attack leverages confusable characters—characters from different Unicode code points that render identically or nearly identically on screen. For example, the Latin letter "a" (U+0061) and the Cyrillic letter "а" (U+0430) appear identical in most fonts but are treated as distinct characters by compilers and interpreters.
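One way to see the difference is to print the code point and Unicode name of each character. The sketch below uses Python's standard `unicodedata` module; the variable and function names are illustrative only.

```python
import unicodedata

latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe(latin_a))
print(describe(cyrillic_a))
print(latin_a == cyrillic_a)  # the two strings compare as different
```

Writing the characters as escape sequences keeps the example unambiguous even in an editor that renders both glyphs identically.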

Root Cause

The root cause stems from the Unicode Specification's inclusion of more than 140,000 characters from many scripts, a number of which are visually similar or identical. When programming languages allow Unicode characters in identifiers (variable names, function names, class names), they create an attack surface where:

Two identifiers can appear identical to human reviewers
Compilers treat them as distinct identifiers
Malicious definitions can sit alongside legitimate ones and be invoked in their place
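These conditions can be observed in any language that accepts Unicode identifiers. The following is a minimal sketch in Python, which normalizes identifiers to NFKC; that normalization does not map Cyrillic letters onto Latin ones, so the two names stay distinct. The identifier names are hypothetical.

```python
import unicodedata

legit = "isAdmin"            # Latin capital A (U+0041)
lookalike = "is\u0410dmin"   # Cyrillic capital A (U+0410)

# Both strings are syntactically valid Python identifiers.
both_valid = legit.isidentifier() and lookalike.isidentifier()

# NFKC normalization leaves the two names different.
same_after_nfkc = (
    unicodedata.normalize("NFKC", legit)
    == unicodedata.normalize("NFKC", lookalike)
)

# Used as names, they occupy two separate slots in a namespace.
namespace = {legit: "original", lookalike: "impostor"}

print(both_valid, same_after_nfkc, len(namespace))
```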

This is classified under CWE-94 (Improper Control of Generation of Code) and relates to CWE-1007 (Insufficient Visual Distinction of Homoglyphs Renders Users Vulnerable to Spoofing).

Attack Vector

The attack is network-based and requires user interaction, typically occurring through supply chain compromise scenarios:

An attacker identifies a popular library or dependency
They create a malicious contribution containing function or variable names using homoglyphs
The malicious identifiers visually match legitimate ones but contain different Unicode code points
During code review, human reviewers cannot distinguish the malicious code from legitimate code
When the code is merged and compiled, the malicious definitions execute instead of or alongside legitimate ones

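The Trojan Source research demonstrates how these attacks can be weaponized in real-world scenarios. For example, an attacker could define a function named isАdmin using a Cyrillic "А" (U+0410) that is visually indistinguishable from the legitimate isAdmin function, which uses a Latin "A" (U+0041). Because the compiler treats the two as separate functions, any call site that references the lookalike name runs the attacker's code, causing security checks to be bypassed.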
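The scenario can be reproduced in a few lines. This is an illustrative sketch only: the upstream and downstream snippets, the `can_delete` helper, and the user record are all hypothetical, and the Cyrillic letter is written as `\u0410` so that it remains visible in the listing.

```python
# Hypothetical upstream code; the second definition uses Cyrillic U+0410.
upstream_source = (
    "def isAdmin(user):\n"
    "    return user.get('role') == 'admin'\n"
    "\n"
    "def is\u0410dmin(user):\n"
    "    return True  # attacker-controlled lookalike\n"
)

# Hypothetical downstream call site that references the lookalike name.
downstream_source = (
    "def can_delete(user):\n"
    "    return is\u0410dmin(user)\n"
)

namespace: dict = {}
exec(upstream_source, namespace)
exec(downstream_source, namespace)

guest = {"role": "guest"}
print(namespace["isAdmin"](guest))     # genuine check
print(namespace["can_delete"](guest))  # routed through the lookalike
```

On screen, the call inside `can_delete` looks like a call to the genuine check, yet it resolves to the attacker's definition.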

Detection Methods for CVE-2021-42694

Indicators of Compromise

Source code files containing mixed-script identifiers (e.g., Latin and Cyrillic characters in the same identifier)
Presence of zero-width characters or bidirectional text markers in source code
Multiple function or variable definitions that appear identical but have different byte sequences
Unusual Unicode normalization behavior in repository diffs
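As a rough illustration of the zero-width and bidirectional indicator above, the sketch below reports every character in Unicode general category Cf (format), which includes zero-width spaces and bidirectional controls. The function name and sample text are assumptions made for the example, and the type hints require Python 3.9 or later. A byte order mark (U+FEFF) is also category Cf, so a real scanner would probably allow one at the start of a file.

```python
import unicodedata

def find_format_chars(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for every format character (category Cf)."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                hits.append((line_no, col, name))
    return hits

# Sample with a zero-width space and a right-to-left override.
sample = "access = 'user'\u200b\nlevel = '\u202euser'\n"
for hit in find_format_chars(sample):
    print(hit)
```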

Detection Strategies

Implement static analysis tools that flag identifiers containing mixed Unicode scripts
Configure IDEs and code editors to highlight non-ASCII characters in identifiers
Use Git hooks to scan commits for suspicious Unicode patterns before acceptance
Deploy code review tools that can display Unicode code points alongside rendered text
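A static check along these lines might look like the following sketch. Python's standard library does not expose the Unicode Script property, so the script is approximated here by the first word of each character's Unicode name (for example LATIN or CYRILLIC). That shortcut is an assumption of this example, not a standard algorithm. The regular expression is also a rough stand-in for a real lexer and will match words inside strings and comments.

```python
import re
import unicodedata

# Runs of word characters that start with a letter or underscore.
IDENT_RE = re.compile(r"[^\W\d]\w*")

def scripts_of(identifier: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }

def flag_mixed_script(source: str) -> list[tuple[str, str]]:
    """Return (identifier, code points) for identifiers mixing several scripts."""
    flagged = []
    for ident in sorted(set(IDENT_RE.findall(source))):
        if len(scripts_of(ident)) > 1:
            points = " ".join(f"U+{ord(ch):04X}" for ch in ident)
            flagged.append((ident, points))
    return flagged

sample = "if is\u0410dmin(user) and isAdmin(user): grant()"
for ident, points in flag_mixed_script(sample):
    print(ident, "->", points)
```

Printing the code points next to the rendered identifier gives reviewers the same view a Unicode-aware review tool would provide.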

Monitoring Recommendations

Enable logging for code repository changes and review for suspicious Unicode patterns
Monitor build processes for warnings related to identifier conflicts or shadowing
Implement automated scanning of third-party dependencies for homoglyph attacks
Review Unicode Technical Report #36 (UTR #36) and Unicode Technical Standard #39 (UTS #39) for comprehensive security guidance

How to Mitigate CVE-2021-42694

Immediate Actions Required

Audit existing codebases for identifiers containing non-ASCII characters
Configure development environments to visually distinguish Unicode scripts
Implement pre-commit hooks that reject mixed-script identifiers
Review recent contributions to critical projects for potential homoglyph attacks

Patch Information

The Unicode Consortium has documented this class of security vulnerability in Unicode Technical Report #36, Unicode Security Considerations. Guidance on mitigations is provided in Unicode Technical Standard #39, Unicode Security Mechanisms.
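UTS #39 describes a "skeleton" transformation that maps confusable characters to a common prototype so that lookalike strings can be compared. The sketch below imitates that idea with a tiny hand-written mapping. `SAMPLE_CONFUSABLES` is purely illustrative and is not the official confusables data, and the function is a simplification of the procedure in the standard.

```python
import unicodedata
from collections import defaultdict

# Tiny illustrative mapping; NOT the official UTS #39 confusables data.
SAMPLE_CONFUSABLES = {
    "\u0410": "A",  # Cyrillic capital A
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
}

def skeleton(identifier: str) -> str:
    """Map each character to a prototype, in the spirit of the UTS #39 skeleton."""
    decomposed = unicodedata.normalize("NFD", identifier)
    mapped = "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable_groups(identifiers: list[str]) -> list[list[str]]:
    """Group distinct identifiers that share the same skeleton."""
    groups = defaultdict(set)
    for ident in identifiers:
        groups[skeleton(ident)].add(ident)
    return [sorted(names) for names in groups.values() if len(names) > 1]

print(confusable_groups(["isAdmin", "is\u0410dmin", "is_admin"]))
```

A production check would load the full confusables table published with UTS #39 rather than a hand-picked subset.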

Major compiler and language vendors have released updates to warn about or restrict confusable identifiers. For distribution-level and coordinated guidance:

Refer to the Gentoo GLSA 202210-09 for distribution-specific guidance
Consult the CERT Vulnerability Report #999008 for comprehensive mitigation strategies

Workarounds

Restrict source code to ASCII-only identifiers in security-sensitive projects
Enable compiler flags that warn about confusable or mixed-script identifiers where available
Use repository-level policies to block commits containing suspicious Unicode patterns
Implement mandatory tooling checks in CI/CD pipelines to detect homoglyph attacks before code is merged

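```bash
#!/bin/bash
# Example Git pre-commit hook: flag non-ASCII characters in staged files.
# Assumes GNU grep with PCRE support (-P) and an xargs that accepts -r.
# -z/-0 keep unusual filenames intact; -I skips binary files.
matches=$(git diff --cached --name-only --diff-filter=ACM -z |
  xargs -0 -r grep -PnIH '[^\x00-\x7F]')
if [ -n "$matches" ]; then
  printf '%s\n' "$matches"
  echo "Warning: Non-ASCII characters detected. Review for homoglyphs." >&2
  exit 1
fi
exit 0
```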
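The hook above checks every byte of a file, so it also fires on non-ASCII text in comments and string literals. For Python sources, a narrower ASCII-only identifier policy can be sketched with the standard `tokenize` module, which reports identifiers as NAME tokens. The function name and sample source are assumptions made for this example, and other languages would need their own lexer.

```python
import io
import tokenize

def non_ascii_identifiers(source: str) -> list[tuple[int, str]]:
    """Return (line number, identifier) for NAME tokens with non-ASCII characters."""
    findings = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type == tokenize.NAME and not tok.string.isascii():
            findings.append((tok.start[0], tok.string))
    return findings

sample = (
    "greeting = 'caf\u00e9'  # accented text in a string is allowed\n"
    "def is\u0410dmin(user):\n"
    "    return True\n"
)
print(non_ascii_identifiers(sample))
```

A CI job could run such a check over changed files and fail the build when the returned list is not empty.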