CVE-2025-6985: LangChain Text Splitters XXE Vulnerability

CVE-2025-6985 Overview

The HTMLSectionSplitter class in langchain-text-splitters version 0.3.8 contains a critical XML External Entity (XXE) vulnerability resulting from unsafe XSLT parsing. This security flaw allows remote attackers to read arbitrary local files or perform outbound HTTP(S) requests without requiring authentication, special privileges, or user interaction.

The vulnerability arises because the HTMLSectionSplitter class permits the use of arbitrary XSLT stylesheets, which are parsed using lxml.etree.parse() and lxml.etree.XSLT() without any security hardening measures. In lxml versions up to 4.9.x, external entities are resolved by default, enabling attackers to exfiltrate sensitive data. Even in lxml versions 5.0 and above, where entity expansion is disabled, the XSLT document() function can still read any URI unless XSLTAccessControl restrictions are applied.

Critical Impact
Remote attackers can gain read-only access to any file the LangChain process can reach, including SSH keys, environment files, and source code, and can make the process query internal services such as cloud metadata endpoints, all with no authentication required.

Affected Products

langchain-text-splitters version 0.3.8
Applications using HTMLSectionSplitter with custom XSLT enabled
Deployments using lxml versions up to 4.9.x (full XXE) or 5.0+ (document() function abuse)

Discovery Timeline

2025-10-06 - CVE-2025-6985 published to NVD
2025-10-08 - Last updated in NVD database

Technical Details for CVE-2025-6985

Vulnerability Analysis

This XXE vulnerability in the HTMLSectionSplitter class stems from the unsafe handling of XSLT stylesheets during HTML document processing. The class is designed to split HTML documents into sections, but its implementation accepts arbitrary XSLT input without proper validation or security controls.

When processing XSLT stylesheets, the code calls lxml.etree.parse() to load the stylesheet and lxml.etree.XSLT() to compile it into a transform. Neither call is configured with security hardening, which leaves the parser open to external entity injection. An attacker can therefore craft malicious XSLT content that references external resources, leading to information disclosure.

The attack requires no special configuration beyond custom XSLT being in use: any deployment in which an attacker can influence the stylesheet passed to HTMLSectionSplitter is exposed, which represents a significant risk for organizations using LangChain for document processing workflows.

Root Cause

The root cause is the absence of security hardening when parsing XSLT stylesheets in the HTMLSectionSplitter class. The lxml library's default configuration resolves external entities in versions up to 4.9.x, and the document() XSLT function remains unrestricted even in newer versions unless XSLTAccessControl is explicitly configured.

The vulnerable code path processes user-supplied or externally-sourced XSLT without:

Disabling external entity resolution
Applying XSLTAccessControl restrictions
Validating or sanitizing XSLT input
Limiting accessible URIs or file paths
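
The first two missing controls can be sketched directly with lxml. This is a minimal illustration under stated assumptions, not the project's patch: the helper names build_restricted_transform and apply_restricted are hypothetical, and the guarded import only keeps the snippet loadable where lxml is not installed. The XMLParser options and XSLTAccessControl.DENY_ALL are existing lxml features.

```python
from typing import Optional

try:
    from lxml import etree  # third-party: pip install lxml
except ImportError:  # keeps the snippet importable where lxml is absent
    etree = None


def _hardened_parser():
    # No entity expansion, no DTD loading, no network fetches while parsing.
    return etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)


def build_restricted_transform(xslt_source: bytes):
    """Compile a stylesheet whose transform may not read or write any URI."""
    if etree is None:
        raise RuntimeError("lxml is not installed")
    stylesheet = etree.fromstring(xslt_source, _hardened_parser())
    # DENY_ALL blocks file and network access, including document() reads.
    return etree.XSLT(stylesheet, access_control=etree.XSLTAccessControl.DENY_ALL)


def apply_restricted(xslt_source: bytes, document: bytes) -> Optional[str]:
    """Run the transform; return None if it fails (e.g. on a denied access)."""
    transform = build_restricted_transform(xslt_source)
    try:
        return str(transform(etree.fromstring(document, _hardened_parser())))
    except etree.XSLTError:
        return None
```

DENY_ALL is the strictest setting. If a deployment legitimately needs some access, a narrower XSLTAccessControl(read_file=..., read_network=..., ...) instance can be passed instead, at the cost of reopening whichever channel is allowed.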

Attack Vector

The attack vector is network-based with low complexity. An attacker can exploit this vulnerability by supplying a malicious XSLT stylesheet to the HTMLSectionSplitter class. The stylesheet can contain external entity declarations or document() function calls that reference:

Local file paths (e.g., /etc/passwd, ~/.ssh/id_rsa, .env files)
Cloud metadata endpoints (e.g., http://169.254.169.254/latest/meta-data/)
Internal network resources via HTTP(S) requests

An exploit stylesheet either defines external entities that point to sensitive files or calls the document() function to fetch arbitrary URIs. When HTMLSectionSplitter processes this stylesheet, lxml resolves these references and includes the fetched content in the output, which discloses the data to the attacker.

For technical details on the exploitation mechanism, see the Huntr Bounty Report.

Detection Methods for CVE-2025-6985

Indicators of Compromise

Unexpected file access attempts by the LangChain process to sensitive files such as /etc/passwd, SSH keys, or environment files
Outbound HTTP(S) connections to cloud metadata endpoints (e.g., 169.254.169.254) from the application server
XSLT processing logs showing external entity declarations or document() function calls with unusual URIs
Application logs indicating access to files outside normal operational scope

Detection Strategies

Monitor file access patterns for the LangChain process and alert on access to sensitive system files
Implement network monitoring to detect connections to cloud metadata services from application containers
Deploy application-level logging to capture XSLT content being processed by HTMLSectionSplitter
Use SIEM rules to correlate file access anomalies with XSLT processing events
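
As a rough illustration of the logging-based strategy, the sketch below scans log lines for the indicators listed earlier. It assumes the application already logs the XSLT content it processes, one payload per line. The pattern set and the function name scan_log_lines are hypothetical and heuristic, so expect both false positives and misses.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Heuristic patterns derived from the indicators of compromise above.
INDICATORS = {
    "external_entity": re.compile(r"<!ENTITY\s+[^>]*\b(?:SYSTEM|PUBLIC)\b"),
    "document_call": re.compile(r"\bdocument\s*\("),
    "metadata_endpoint": re.compile(r"169\.254\.169\.254"),
    "sensitive_path": re.compile(r"/etc/passwd|\.ssh/|/\.env\b"),
}


def scan_log_lines(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line_number, indicator_names) for every line with a hit."""
    for number, line in enumerate(lines, start=1):
        hits = [name for name, pattern in INDICATORS.items() if pattern.search(line)]
        if hits:
            yield number, hits
```

The resulting line numbers and indicator names could be forwarded to a SIEM and correlated with file access or egress events from the same host.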

Monitoring Recommendations

Enable detailed audit logging for file system access on servers running LangChain applications
Configure network egress monitoring to identify and alert on suspicious outbound requests
Implement runtime application security monitoring to detect XXE attack patterns
Review application logs for error messages related to external entity resolution or document loading failures

How to Mitigate CVE-2025-6985

Immediate Actions Required

Disable custom XSLT functionality in HTMLSectionSplitter if not required for business operations
Upgrade lxml to version 5.0 or higher to disable default external entity expansion; this alone does not stop document() abuse, which also requires XSLTAccessControl restrictions
Implement input validation to reject XSLT stylesheets containing external entity declarations or document() function calls
Apply network segmentation to restrict outbound connections from LangChain application servers
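
The input validation step could look roughly like the following pre-check, which rejects a stylesheet before it ever reaches lxml. This is a sketch under assumptions: validate_xslt and UnsafeStylesheetError are hypothetical names, and a text-based filter can be bypassed by constructs it does not list, so it should complement parser and XSLT access restrictions rather than replace them.

```python
import re


class UnsafeStylesheetError(ValueError):
    """Raised when a stylesheet uses a construct this pre-check refuses."""


# Constructs named in the advisory: entity declarations and document() calls.
_FORBIDDEN = (
    ("DOCTYPE declaration", re.compile(r"<!DOCTYPE")),
    ("entity declaration", re.compile(r"<!ENTITY")),
    ("document() call", re.compile(r"\bdocument\s*\(")),
)


def validate_xslt(xslt_text: str) -> str:
    """Return the stylesheet unchanged, or raise if a forbidden construct appears."""
    problems = [label for label, pattern in _FORBIDDEN if pattern.search(xslt_text)]
    if problems:
        raise UnsafeStylesheetError("rejected stylesheet: " + ", ".join(problems))
    return xslt_text
```

A caller would run validate_xslt() on any externally supplied stylesheet and only write it to the path given to HTMLSectionSplitter if no exception is raised.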

Patch Information

Monitor the langchain-text-splitters project for security updates addressing this vulnerability. The Huntr Bounty Report provides additional details on the vulnerability disclosure and remediation status.

When a patch becomes available, update langchain-text-splitters to the patched version immediately. Review your deployment configuration to ensure custom XSLT processing is only enabled where necessary.

Workarounds

Configure XSLTAccessControl in lxml to restrict access to local files and network resources when XSLT processing is required
Implement a wrapper around HTMLSectionSplitter that sanitizes XSLT input before processing
Use application-level firewalls to block outbound requests to cloud metadata endpoints
Run the LangChain application with minimal file system permissions using a dedicated service account

```bash
# Configuration example: Restrict file access for LangChain service
# Create dedicated user with minimal permissions
useradd -r -s /bin/false langchain-service

# Set restrictive file permissions
chmod 700 /opt/langchain/app
chown -R langchain-service:langchain-service /opt/langchain/app

# Run application with restricted user
sudo -u langchain-service python app.py
```