CVE-2021-42694 Overview
CVE-2021-42694 is a vulnerability discovered in the character definitions of the Unicode Specification through version 14.0. The specification allows an adversary to produce source code identifiers such as function names using homoglyphs that render visually identical to a target identifier. Adversaries can leverage this to inject code via adversarial identifier definitions in upstream software dependencies that are invoked deceptively in downstream software.
This vulnerability belongs to the broader "Trojan Source" class of attacks. It abuses legitimate international text support in compilers and interpreters to enable supply chain attacks in which malicious code sits in a source repository without standing out during review.
Critical Impact
Attackers can inject malicious code into software projects using visually identical but semantically different Unicode characters, enabling supply chain attacks that evade human code review.
Affected Products
- Unicode Specification through version 14.0
- All software implementations supporting Unicode identifiers in source code
- Programming languages and compilers that allow Unicode in identifiers
Discovery Timeline
- 2021-11-01 - CVE-2021-42694 published to NVD
- 2024-11-21 - Last updated in NVD database
Technical Details for CVE-2021-42694
Vulnerability Analysis
The vulnerability exploits a fundamental aspect of Unicode support in modern programming languages and development tools. Software that implements the Unicode Standard treats homoglyphs (visually identical characters with different code points) as distinct characters, while editors and review tools usually render them with no visual cue. Attackers can therefore craft malicious source code that appears legitimate during human review but executes differently than expected.
The attack leverages confusable characters: distinct Unicode code points whose glyphs render identically or nearly identically on screen. For example, the Latin letter "a" (U+0061) and the Cyrillic letter "а" (U+0430) appear identical in most fonts but are treated as distinct characters by compilers and interpreters.
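The difference between the two letters can be made visible by printing their code points rather than their glyphs. The following is a minimal Python sketch using only the standard library; the helper name describe is hypothetical and chosen for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

print(describe(latin_a))      # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))   # U+0430 CYRILLIC SMALL LETTER A
print(latin_a == cyrillic_a)  # False

# The UTF-8 encodings differ as well: one byte versus two bytes.
print(latin_a.encode("utf-8"), cyrillic_a.encode("utf-8"))  # b'a' b'\xd0\xb0'
```

Printing code points in this way is one option for review tooling that needs to show what a rendered glyph actually contains.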
Root Cause
The root cause stems from the Unicode Specification's inclusion of thousands of characters from various scripts, many of which are visually similar or identical. When programming languages allow Unicode characters in identifiers (variable names, function names, class names), they create an attack surface where:
- Two identifiers can appear identical to human reviewers
- Compilers treat them as distinct identifiers
- Malicious definitions can shadow or replace legitimate ones
This is classified under CWE-94 (Improper Control of Generation of Code, "Code Injection") and relates to CWE-1007 (Insufficient Visual Distinction of Homoglyphs Presented to User).
Attack Vector
The attack is network-based and requires user interaction, typically occurring through supply chain compromise scenarios:
- An attacker identifies a popular library or dependency
- They create a malicious contribution containing function or variable names using homoglyphs
- The malicious identifiers visually match legitimate ones but contain different Unicode code points
- During code review, human reviewers cannot distinguish the malicious code from legitimate code
- When the code is merged and compiled, the malicious definitions execute instead of or alongside legitimate ones
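The Trojan Source research demonstrates how these attacks can be weaponized in real-world scenarios. For example, an attacker could define a function named isАdmin, spelled with a Cyrillic "А" (U+0410), alongside the legitimate isAdmin, spelled with a Latin "A" (U+0041). Because the compiler treats the two names as different functions, a call site written with the lookalike name silently invokes the attacker's version, causing security checks to be bypassed.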
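The isAdmin scenario can be reproduced in a few lines. The sketch below is illustrative only: the module source and function bodies are hypothetical, and Python merely stands in for any language that accepts Unicode identifiers. The lookalike name is written with an escape sequence so that the difference stays visible in this listing.

```python
# Hypothetical module source. The second function name is "is" followed by
# U+0410 (Cyrillic capital A) and "dmin", so it renders like "isAdmin".
source = (
    "def isAdmin(user):\n"
    "    return user == 'root'\n"
    "\n"
    "def is\u0410dmin(user):\n"
    "    return True\n"
)

namespace = {}
exec(source, namespace)  # both definitions coexist; neither replaces the other

names = sorted(n for n in namespace if n.startswith("is"))
for name in names:
    print(ascii(name))  # 'isAdmin' then 'is\u0410dmin'

# A call site that uses the lookalike name runs the attacker's function.
print(namespace["isAdmin"]("guest"))       # False
print(namespace["is\u0410dmin"]("guest"))  # True
```

In rendered source code both calls look the same, which is why the check has to happen on code points rather than on what a reviewer sees.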
Detection Methods for CVE-2021-42694
Indicators of Compromise
- Source code files containing mixed-script identifiers (e.g., Latin and Cyrillic characters in the same identifier)
- Presence of zero-width characters or bidirectional text markers in source code
- Multiple function or variable definitions that appear identical but have different byte sequences
- Repository diffs in which an identifier looks unchanged on screen but its underlying code points or normalization form differ
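One way to surface definitions that look identical but differ in bytes is to map each identifier to a "skeleton" and group identifiers that share one. The sketch below is a simplification of the skeleton idea from UTS #39: the mapping table is a tiny, hand-picked assumption for illustration, whereas a real tool would load the full confusables data published with that standard. The function names are hypothetical.

```python
from collections import defaultdict

# Illustrative assumption: a very small confusable map. The complete data
# set is published by the Unicode Consortium alongside UTS #39.
CONFUSABLE_TO_LATIN = {
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u03BF": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(identifier: str) -> str:
    """Replace each known confusable with its Latin lookalike."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in identifier)


def find_collisions(identifiers):
    """Group distinct identifiers that share the same skeleton."""
    groups = defaultdict(set)
    for ident in identifiers:
        groups[skeleton(ident)].add(ident)
    return {s: sorted(m) for s, m in groups.items() if len(m) > 1}


collisions = find_collisions(["isAdmin", "is\u0410dmin", "login"])
for skel, members in collisions.items():
    print(skel, [ascii(m) for m in members])
```

Any group with more than one member is a candidate for manual review; identifiers that are unique after mapping are not reported.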
Detection Strategies
- Implement static analysis tools that flag identifiers containing mixed Unicode scripts
- Configure IDEs and code editors to highlight non-ASCII characters in identifiers
- Use Git hooks to scan commits for suspicious Unicode patterns before acceptance
- Deploy code review tools that can display Unicode code points alongside rendered text
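A static check for mixed-script identifiers might look like the following sketch. It rests on two stated assumptions: identifiers are found with a generic regular expression rather than a real parser, and a character's script is approximated by the first word of its Unicode name, because the Python standard library does not expose the Unicode Script property. The heuristic can therefore produce false positives, for example on Japanese identifiers that legitimately mix several scripts.

```python
import re
import unicodedata

# A letter or underscore followed by word characters (Unicode-aware in Python 3).
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def scripts_in(identifier: str) -> set:
    """Approximate the scripts used by the letters of an identifier."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_identifiers(source: str) -> list:
    """Return (line number, identifier, scripts) for mixed-script names."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in IDENTIFIER.finditer(line):
            scripts = scripts_in(match.group())
            if len(scripts) > 1:
                findings.append((lineno, match.group(), sorted(scripts)))
    return findings


sample = "def is\u0410dmin(user):\n    return True\nvalue = 1\n"
for lineno, name, scripts in mixed_script_identifiers(sample):
    print(lineno, ascii(name), scripts)  # 1 'is\u0410dmin' ['CYRILLIC', 'LATIN']
```

A production tool would more likely rely on the script data and restriction levels described in UTS #39 than on character names.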
Monitoring Recommendations
- Enable logging for code repository changes and review the logged changes for suspicious Unicode patterns
- Monitor build processes for warnings related to identifier conflicts or shadowing
- Implement automated scanning of third-party dependencies for homoglyph attacks
- Review Unicode Technical Report #36 (Unicode Security Considerations) and Unicode Technical Standard #39 (Unicode Security Mechanisms) for comprehensive security guidance
How to Mitigate CVE-2021-42694
Immediate Actions Required
- Audit existing codebases for identifiers containing non-ASCII characters
- Configure development environments to visually distinguish Unicode scripts
- Implement pre-commit hooks that reject mixed-script identifiers
- Review recent contributions to critical projects for potential homoglyph attacks
Patch Information
The Unicode Consortium has documented this class of security vulnerability in Unicode Technical Report #36, Unicode Security Considerations. Guidance on mitigations is provided in Unicode Technical Standard #39, Unicode Security Mechanisms.
Major compiler and language vendors have released updates to warn about or restrict confusable identifiers. For further guidance:
- Refer to Gentoo GLSA 202210-09 for distribution-specific guidance
- Consult CERT/CC Vulnerability Note VU#999008 for comprehensive mitigation strategies
Workarounds
- Restrict source code to ASCII-only identifiers in security-sensitive projects
- Enable compiler flags that warn about confusable or mixed-script identifiers where available
- Use repository-level policies to block commits containing suspicious Unicode patterns
- Implement mandatory tooling checks in CI/CD pipelines to detect homoglyph attacks before code is merged
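#!/bin/bash
# Example Git pre-commit hook: flag non-ASCII characters in staged files.
# Coarse check: it matches comments and strings too, not only identifiers.
# Assumes GNU grep (-P) and GNU xargs (-r); reads working-tree file copies.
matches=$(git diff --cached --name-only --diff-filter=ACM -z |
  xargs -0 -r grep -HnP -e '[^\x00-\x7F]' --)
if [ -n "$matches" ]; then
  printf '%s\n' "$matches"
  echo "Warning: Non-ASCII characters detected. Review for homoglyphs." >&2
  exit 1
fi
exit 0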
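The hook above flags every non-ASCII character, including those in comments and string literals. Where only identifiers should be restricted, a tokenizer-based check is narrower. The following sketch uses Python's standard tokenize module and therefore applies to Python source only; other languages would need their own lexer. The function name non_ascii_identifiers is hypothetical.

```python
import io
import tokenize


def non_ascii_identifiers(source: str) -> list:
    """Return (line, column, name) for each NAME token with non-ASCII characters."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            findings.append((tok.start[0], tok.start[1], tok.string))
    return findings


# The accented characters in the string literal and comment are ignored;
# only the identifier containing U+0410 is reported.
sample = (
    'label = "caf\u00e9"  # na\u00efve comment\n'
    "def is\u0410dmin(user):\n"
    "    return True\n"
)
for line, col, name in non_ascii_identifiers(sample):
    print(line, col, ascii(name))
```

A CI job could run such a check over changed files and fail the build when the returned list is not empty.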

