CVE-2026-40682: Apache OpenNLP XXE Vulnerability

CVE-2026-40682 Overview

CVE-2026-40682 is an XML External Entity (XXE) vulnerability [CWE-611] in Apache OpenNLP. The flaw resides in the DictionaryEntryPersistor class, which initializes a static SAXParserFactory at class-load time without enabling FEATURE_SECURE_PROCESSING or disabling DTD processing. Attackers who can supply a crafted dictionary file can trigger local file disclosure or server-side request forgery (SSRF) during SAX parsing. The vulnerability affects Apache OpenNLP 2.x versions before 2.5.9 and the 3.x pre-releases 3.0.0-M1 and 3.0.0-M2, which precede the fixed 3.0.0-M3. The public Dictionary(InputStream) constructor is the documented API for loading user-supplied dictionaries, making untrusted input a realistic exposure path.

Critical Impact
Remote attackers can read files accessible to the OpenNLP process or pivot to internal network resources through SSRF by submitting a malicious XML dictionary file, with no authentication or user interaction required.

Affected Products

Apache OpenNLP versions before 2.5.9
Apache OpenNLP 3.0.0-M1
Apache OpenNLP 3.0.0-M2

Discovery Timeline

2026-05-04 - CVE-2026-40682 published to NVD
2026-05-06 - Last updated in NVD database

Technical Details for CVE-2026-40682

Vulnerability Analysis

The DictionaryEntryPersistor class parses XML dictionary files using a SAXParserFactory configured only with namespace support. External entity resolution and DOCTYPE declarations remain enabled. When create(InputStream, EntryInserter) is invoked, the underlying XMLReader resolves any entity references defined in a DOCTYPE block before the application processes a single dictionary entry.

This behavior is inconsistent with the project's own XmlUtil.createSaxParser() helper, which correctly sets FEATURE_SECURE_PROCESSING and disallow-doctype-decl. All other XML parsing paths in the codebase use the secure helper. The persistor path was missed during hardening.

Root Cause

The root cause is missing XML parser hardening. The static SAXParserFactory instance never receives the security features that block external entity expansion. Because the factory is initialized at class load and reused for all parse operations, every call to the persistor inherits the unsafe configuration. This maps to [CWE-611] Improper Restriction of XML External Entity Reference.
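The difference between the two configurations can be sketched with the standard JAXP API. This is an illustration only, not the OpenNLP source: the class and method names are hypothetical, and the sketch assumes the JDK's built-in parser, which recognizes the Xerces disallow-doctype-decl feature URI.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are illustrative, not taken from OpenNLP.
final class SaxHardeningSketch {

    // Namespace support only: DOCTYPE declarations are still accepted.
    static SAXParserFactory namespaceOnlyFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory;
    }

    // Adds the two settings the advisory names as missing.
    static SAXParserFactory hardenedFactory() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory;
        } catch (Exception e) {
            throw new IllegalStateException("Parser does not support hardening features", e);
        }
    }

    // Returns true if the document parses, false if the parser raises a SAXParseException.
    static boolean accepts(SAXParserFactory factory, String xml) {
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Harmless internal entity: enough to show whether DOCTYPE is processed at all.
        String doc = "<!DOCTYPE d [<!ENTITY e \"x\">]><d>&e;</d>";
        System.out.println("namespace-only accepts DOCTYPE: " + accepts(namespaceOnlyFactory(), doc));
        System.out.println("hardened accepts DOCTYPE: " + accepts(hardenedFactory(), doc));
    }
}
```

The test document uses only an internal entity, so running the sketch touches neither the filesystem nor the network. With the hardened factory the parser stops at the DOCTYPE declaration, before any entity could be resolved.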

Attack Vector

An attacker supplies a dictionary file (for example, a stop-word list or domain dictionary) that contains a malicious DOCTYPE declaration. When the application loads the file via Dictionary(InputStream), the SAX parser resolves entities defined in the DOCTYPE. A file:// reference triggers local file disclosure. An http:// reference triggers SSRF against internal services. Exploitation requires no authentication and no user interaction, and it is reachable over the network wherever the application accepts dictionary files from remote sources.

No public proof-of-concept code is available. See the Apache Thread Discussion and the Openwall OSS Security Update for technical details.

Detection Methods for CVE-2026-40682

Indicators of Compromise

Dictionary XML files containing <!DOCTYPE declarations or <!ENTITY definitions referencing external URIs.
Outbound network connections from the OpenNLP host to unexpected internal or external endpoints during dictionary loads.
Java process file reads against sensitive paths such as /etc/passwd, /etc/shadow, or application configuration files immediately after dictionary parsing.
Application logs showing SAX parser errors or unexpected entity resolution tied to user-supplied dictionary inputs.

Detection Strategies

Inspect XML dictionary inputs at ingest for DOCTYPE and ENTITY tokens before they reach DictionaryEntryPersistor.
Audit dependency manifests for Apache OpenNLP 2.x versions earlier than 2.5.9 and 3.x versions earlier than 3.0.0-M3.
Correlate JVM file-read and outbound HTTP events with calls into opennlp-tools classes via runtime tracing.
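The version audit above can be automated with a small check against the affected ranges listed in this advisory. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, version strings are assumed to follow the major.minor.patch[-qualifier] pattern, and qualifiers on 2.x versions (such as -SNAPSHOT) are ignored, so those builds should be reviewed manually.

```java
// Hypothetical helper; the affected ranges come from this advisory.
final class OpenNlpVersionCheck {

    // Returns true if the given opennlp-tools version falls in an affected range.
    static boolean isAffected(String version) {
        String v = version.trim();
        // Affected 3.x pre-releases are matched by exact name.
        if (v.equals("3.0.0-M1") || v.equals("3.0.0-M2")) {
            return true;
        }
        // Drop any qualifier, then compare the numeric core against 2.5.9.
        String[] core = v.split("-", 2)[0].split("\\.");
        int major = part(core, 0);
        int minor = part(core, 1);
        int patch = part(core, 2);
        if (major < 2) {
            return true;   // everything before 2.5.9 is listed as affected
        }
        if (major > 2) {
            return false;  // 3.0.0-M3 and later
        }
        if (minor != 5) {
            return minor < 5;
        }
        return patch < 9;
    }

    // Missing components count as 0, so "2.5" is read as 2.5.0.
    private static int part(String[] parts, int index) {
        return index < parts.length ? Integer.parseInt(parts[index]) : 0;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"2.5.8", "2.5.9", "3.0.0-M2", "3.0.0-M3"}) {
            System.out.println(v + " affected: " + isAffected(v));
        }
    }
}
```

A check like this could run over the resolved dependency list of a build, but it does not replace a software composition analysis tool that also sees shaded or repackaged copies of the library.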

Monitoring Recommendations

Alert on outbound connections initiated by Java services that host OpenNLP and that do not normally make egress requests.
Monitor read access to credential and configuration files by application service accounts.
Track the volume and origin of uploaded dictionary files in applications that expose this functionality to users.

How to Mitigate CVE-2026-40682

Immediate Actions Required

Upgrade Apache OpenNLP 2.x deployments to 2.5.9 and 3.x deployments to 3.0.0-M3.
Restrict dictionary file sources to trusted, internally controlled origins until the upgrade is complete.
Enumerate all applications and pipelines that call Dictionary(InputStream) with externally sourced data.
Block egress from OpenNLP hosts to internal management interfaces and metadata services.

Patch Information

Apache has released fixed versions 2.5.9 and 3.0.0-M3. Both versions route dictionary parsing through the hardened XmlUtil.createSaxParser() helper, which sets FEATURE_SECURE_PROCESSING and disallow-doctype-decl. Refer to the Apache Thread Discussion for the official advisory.

Workarounds

Wrap calls to Dictionary(InputStream) with input validation that rejects any XML containing a DOCTYPE declaration before it reaches the parser.
Reject dictionary uploads that are not byte-for-byte identical to vetted, signed reference files.
Run OpenNLP-hosting services with network egress filtering that limits HTTP destinations to an allowlist. Egress filtering does not stop file:// resolution, so also restrict the service account's read access to sensitive local files.

bash

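# Example: validate dictionary input before parsing
# Reject any XML payload containing a DOCTYPE or ENTITY declaration.
# Note: a plain grep only matches ASCII-compatible encodings such as UTF-8.
: "${DICT_FILE:?set DICT_FILE to the dictionary path}"
if grep -qiE '<!DOCTYPE|<!ENTITY' "$DICT_FILE"; then
  echo "Rejecting dictionary file: DOCTYPE/ENTITY detected" >&2
  exit 1
fi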
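The same DOCTYPE rejection can also be applied inside the JVM, immediately before the stream is handed to Dictionary(InputStream). The sketch below is one possible approach, not the upstream fix: DictionaryInputGuard and requireNoDoctype are hypothetical names, and the code assumes the JDK's built-in parser, which recognizes the disallow-doctype-decl feature URI. It buffers the whole input in memory, so callers would want to cap upload size separately.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical guard class; not part of OpenNLP.
final class DictionaryInputGuard {

    // Buffers the input, pre-parses it with a hardened SAX parser, and returns
    // a fresh stream over the same bytes only if no DOCTYPE was present.
    static InputStream requireNoDoctype(InputStream untrusted) {
        try {
            byte[] bytes = untrusted.readAllBytes();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            // The hardened parser fails on any DOCTYPE before resolving entities.
            factory.newSAXParser().parse(new ByteArrayInputStream(bytes), new DefaultHandler());
            return new ByteArrayInputStream(bytes);
        } catch (SAXException e) {
            // Fail closed: malformed XML and DOCTYPE declarations are both rejected.
            throw new IllegalArgumentException("Rejected dictionary XML: " + e.getMessage(), e);
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An application would pass the returned stream to the Dictionary constructor in place of the original one. Unlike a byte-level grep, the pre-parse follows the encoding declared by the document, so a DOCTYPE in a UTF-16 file is also caught. The cost is that every dictionary is parsed twice.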