CVE-2026-0847 Overview
A path traversal vulnerability exists in NLTK (Natural Language Toolkit) versions up to and including 3.9.2 that allows arbitrary file read through multiple CorpusReader classes. The affected classes include WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader, which fail to properly sanitize or validate file paths. This enables attackers to traverse directories and access sensitive files on the server.
This vulnerability is particularly critical where user-controlled file inputs are processed, such as in machine learning APIs, chatbots, or NLP pipelines. Successful exploitation can lead to unauthorized access to sensitive files, including system files, SSH private keys, and API tokens, and may escalate to remote code execution when combined with other vulnerabilities.
Critical Impact
Attackers can exploit path traversal sequences to read arbitrary files from the server, potentially exposing sensitive system files, credentials, and API tokens. This vulnerability is especially dangerous in production NLP services that accept user-controlled inputs.
Affected Products
- NLTK versions up to and including 3.9.2
- WordListCorpusReader class
- TaggedCorpusReader class
- BracketParseCorpusReader class
Discovery Timeline
- 2026-03-04 - CVE-2026-0847 published to NVD
- 2026-03-05 - Last updated in NVD database
Technical Details for CVE-2026-0847
Vulnerability Analysis
This vulnerability is classified as CWE-22 (Improper Limitation of a Pathname to a Restricted Directory, commonly known as Path Traversal). The issue stems from insufficient input validation in multiple NLTK CorpusReader classes when processing file path parameters.
When applications built on NLTK accept user-provided file paths and pass them to the vulnerable CorpusReader classes, the library fails to sanitize directory traversal sequences such as ../ or ..\. This allows an attacker to escape the intended directory boundary and read arbitrary files from the file system, limited only by the permissions of the application process.
The network-based attack vector with low complexity and no authentication requirements makes this vulnerability particularly dangerous in exposed NLP services. The potential for reading sensitive configuration files, credentials, or private keys significantly elevates the risk profile of affected deployments.
Root Cause
The root cause lies in the failure of WordListCorpusReader, TaggedCorpusReader, and BracketParseCorpusReader to properly validate and sanitize user-supplied file path inputs. These classes do not implement adequate checks to prevent relative path traversal sequences from being processed, allowing attackers to specify paths that traverse outside the intended corpus directory structure.
The vulnerability manifests when these CorpusReader classes directly concatenate or process user-controlled path strings without first normalizing the path and verifying it remains within the expected directory boundaries. This architectural oversight enables classic directory traversal attacks.
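The pattern described above can be reproduced without NLTK. The following is a minimal sketch, not NLTK's actual code: `naive_read` is a hypothetical helper that joins a root directory with a caller-supplied name the way an unvalidated reader might, and the temporary directory layout is invented for the demonstration.

```python
import os
import tempfile


def naive_read(root, file_id):
    """Hypothetical reader: joins root and file_id with no validation."""
    with open(os.path.join(root, file_id), encoding="utf8") as handle:
        return handle.read()


# Invented layout: base/corpora/words.txt is public, base/secret.txt is not.
base = tempfile.mkdtemp()
corpus_root = os.path.join(base, "corpora")
os.mkdir(corpus_root)
with open(os.path.join(corpus_root, "words.txt"), "w", encoding="utf8") as f:
    f.write("alpha\nbeta\n")
with open(os.path.join(base, "secret.txt"), "w", encoding="utf8") as f:
    f.write("API_TOKEN=example\n")

intended = naive_read(corpus_root, "words.txt")
leaked = naive_read(corpus_root, "../secret.txt")  # escapes corpus_root

# An absolute path is just as dangerous: os.path.join discards the root
# when a later component is absolute.
absolute = naive_read(corpus_root, os.path.join(base, "secret.txt"))
```

Both `leaked` and `absolute` contain the contents of the file outside the corpus directory, which is why a check has to cover absolute paths as well as `..` segments.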
Attack Vector
The vulnerability is exploitable via network-accessible applications that leverage NLTK's CorpusReader functionality with user-controlled input. Typical attack scenarios include:
- Machine Learning APIs: NLP services that allow users to specify corpus files or training data sources
- Chatbots: Conversational AI systems that dynamically load language models or training corpora
- NLP Pipelines: Data processing systems that accept file path parameters from external sources
An attacker can craft malicious file path inputs containing directory traversal sequences (e.g., ../../../../etc/passwd) to read sensitive system files. The impact includes unauthorized access to configuration files, credentials, SSH keys, and other sensitive data on the host system.
For technical details on exploitation, see the Huntr Vulnerability Bounty report.
Detection Methods for CVE-2026-0847
Indicators of Compromise
- File access attempts to sensitive system files (e.g., /etc/passwd, /etc/shadow, .ssh/id_rsa) from NLTK-based application processes
- Log entries showing path traversal sequences (../, ..\) in corpus file path parameters
- Unexpected file read operations outside designated corpus directories
- Application error messages revealing file system paths or file contents
Detection Strategies
- Monitor application logs for path traversal sequences in file path parameters passed to NLTK functions
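- Enable file access auditing (for example, Linux auditd watch rules) on sensitive files and directories to detect unauthorized read attempts; file integrity monitoring typically reports modifications, not reads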
- Deploy web application firewalls (WAF) with rules to block common path traversal patterns in request parameters
- Use runtime application self-protection (RASP) to detect and block directory traversal attempts
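As one way to apply the log monitoring strategy above, the sketch below flags parameter values that contain a `..` path segment, including percent-encoded and double-encoded forms. The function name `looks_like_traversal` and the decode limit are assumptions made for illustration; real log formats and encodings will need their own handling.

```python
import re
from urllib.parse import unquote

# A ".." path segment delimited by "/" or "\" (or by the string boundaries).
_TRAVERSAL = re.compile(r"(?:^|[\\/])\.\.(?:[\\/]|$)")


def looks_like_traversal(value, max_decodes=3):
    """Return True if value contains a '..' segment, raw or percent-encoded."""
    candidate = value
    for _ in range(max_decodes + 1):
        if _TRAVERSAL.search(candidate):
            return True
        decoded = unquote(candidate)
        if decoded == candidate:
            break  # nothing left to decode
        candidate = decoded
    return False


suspicious = [
    value
    for value in ["brown/ca01.txt", "../../etc/passwd", "%2e%2e%2fetc%2fpasswd"]
    if looks_like_traversal(value)
]
```

Matching whole `..` segments rather than any occurrence of two dots avoids flagging ordinary names such as `notes..txt`. A pattern match like this is a detection aid, not a substitute for resolving the path and checking it against the corpus root.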
Monitoring Recommendations
- Enable verbose logging for NLTK corpus loading operations in production environments
- Set up alerts for file access attempts to sensitive system directories from application processes
- Monitor for anomalous file read patterns that deviate from normal corpus loading behavior
- Implement audit logging for all file path inputs processed by NLP pipeline components
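For in-process audit logging, Python 3.8 and later expose `sys.addaudithook`, and the built-in `open` raises an `open` audit event whose first argument is the path. The sketch below records every open that resolves outside a corpus directory. `CORPUS_ROOT`, `outside_opens`, and `audit_open` are illustrative names, and the temporary directory is a stand-in for a real corpus location.

```python
import os
import sys
import tempfile

# Stand-in for the real corpus directory (assumption for this sketch).
CORPUS_ROOT = os.path.realpath(tempfile.mkdtemp())
outside_opens = []  # a real service would forward these to its alerting


def _is_inside(root, path):
    try:
        return os.path.commonpath([root, path]) == root
    except ValueError:  # e.g. paths on different drives on Windows
        return False


def audit_open(event, args):
    """Record every file open that resolves outside CORPUS_ROOT."""
    if event != "open" or not args:
        return
    target = args[0]
    if not isinstance(target, (str, bytes, os.PathLike)):
        return  # open() called on an integer file descriptor
    resolved = os.path.realpath(os.fsdecode(target))
    if not _is_inside(CORPUS_ROOT, resolved):
        outside_opens.append(resolved)


sys.addaudithook(audit_open)
```

Audit hooks cannot be removed once installed, and they see every open in the process, including module imports, so a production version would likely need to filter on sensitive locations and must avoid opening files inside the hook itself.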
How to Mitigate CVE-2026-0847
Immediate Actions Required
- Audit all applications using NLTK CorpusReader classes to identify those accepting user-controlled file path inputs
- Implement strict input validation to reject file paths containing traversal sequences (../, ..\, and encoded variants)
- Apply the principle of least privilege to application processes to limit the potential impact of exploitation
- Isolate NLP processing components in sandboxed environments with restricted file system access
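The audit step above can be partly automated. The following sketch uses Python's `ast` module to list calls to the three affected classes in a source file; `find_reader_calls` is a hypothetical helper, and it only matches direct uses of the class names, so aliased imports and subclasses would need a manual review.

```python
import ast

VULNERABLE_READERS = {
    "WordListCorpusReader",
    "TaggedCorpusReader",
    "BracketParseCorpusReader",
}


def find_reader_calls(source, filename="<string>"):
    """Return sorted (line, class_name) pairs for calls to affected readers."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Plain name (Reader(...)) or attribute access (module.Reader(...)).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in VULNERABLE_READERS:
            hits.append((node.lineno, name))
    return sorted(hits)
```

Running it over each `.py` file in a repository (for example with `pathlib.Path.rglob("*.py")`) gives a list of call sites to inspect for user-controlled path arguments. The source is only parsed, never executed.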
Patch Information
Monitor the official NLTK repository and security advisories for patches addressing this vulnerability. Until an official fix is released, implement the workarounds described below. The vulnerability was reported through the Huntr vulnerability bounty platform.
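To identify environments that fall in the affected range, the installed NLTK version can be compared against 3.9.2. This is a rough sketch using only the standard library: the helper names are illustrative, and the comparison looks only at the leading numeric release segment, so pre-release or local version suffixes are ignored.

```python
import re
from importlib import metadata

LAST_AFFECTED = (3, 9, 2)  # per the advisory: up to and including 3.9.2


def parse_release(version):
    """Extract the leading numeric release, e.g. '3.9.2' -> (3, 9, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_affected(version):
    """Return True if version is at or below the last affected release."""
    release = parse_release(version)
    # Pad short versions so that "3.9" compares as 3.9.0.
    padded = release + (0,) * (len(LAST_AFFECTED) - len(release))
    return padded <= LAST_AFFECTED


def installed_nltk_version():
    """Return the installed NLTK version string, or None if not installed."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None
```

A deployment check could call `installed_nltk_version()` and pass any non-None result to `is_affected`. An unaffected version number does not by itself confirm that a fix is present, so the result should be read alongside the official NLTK advisories.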
Workarounds
- Implement application-level path validation using os.path.realpath() or pathlib.Path.resolve() to normalize paths and verify they remain within allowed directories
- Use allowlists for permitted corpus file paths rather than accepting arbitrary user input
- Deploy the application in a containerized environment with a read-only file system and minimal file access
- Restrict the application process to a chroot jail or similar sandboxed environment to limit file system exposure
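# Configuration example
# Example: validate a user-supplied path before passing it to a CorpusReader.
# Resolve the path and verify it stays within the allowed corpus directory.
import os

CORPUS_ROOT = os.path.realpath("/var/app/corpora")

def validate_corpus_path(user_input):
    full_path = os.path.realpath(os.path.join(CORPUS_ROOT, user_input.strip()))
    # commonpath avoids the prefix pitfall of startswith(), which would
    # accept a sibling directory such as /var/app/corpora_private.
    if os.path.commonpath([CORPUS_ROOT, full_path]) != CORPUS_ROOT:
        raise ValueError("Invalid corpus path")
    return full_path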
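The allowlist workaround listed above can be combined with `pathlib.Path.resolve()`. In the sketch below, callers supply an identifier rather than a path; the `ALLOWED_CORPORA` entries and the `resolve_corpus` helper are hypothetical, and `/var/app/corpora` is the same example root used in this section.

```python
from pathlib import Path

CORPUS_ROOT = Path("/var/app/corpora")

# Hypothetical allowlist: public identifier -> file under the corpus root.
ALLOWED_CORPORA = {
    "stopwords-en": "stopwords/english.txt",
    "names": "names/all.txt",
}


def resolve_corpus(corpus_id, root=CORPUS_ROOT):
    """Map an allowlisted identifier to a path verified to be inside root."""
    try:
        relative = ALLOWED_CORPORA[corpus_id]
    except KeyError:
        raise ValueError(f"Unknown corpus: {corpus_id!r}") from None
    root = root.resolve()
    candidate = (root / relative).resolve()
    # Defense in depth against symlinks or a bad allowlist entry:
    # relative_to raises ValueError if candidate is outside root.
    candidate.relative_to(root)
    return candidate
```

Because user input is only ever used as a dictionary key, a value such as `../../etc/passwd` is rejected before any path is built. The returned path can then be split into a root and file identifier for whichever CorpusReader the application uses.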


