CVE-2025-62164 Overview
CVE-2025-62164 is a memory corruption vulnerability in vLLM, a popular inference and serving engine for large language models (LLMs). The vulnerability exists in the Completions API endpoint and affects versions 0.10.2 through 0.11.0. When processing user-supplied prompt embeddings, the endpoint loads serialized tensors using torch.load() without sufficient validation. Due to a change introduced in PyTorch 2.8.0, sparse tensor integrity checks are disabled by default, allowing maliciously crafted tensors to bypass internal bounds checks and trigger an out-of-bounds memory write during the call to to_dense(). This memory corruption can crash vLLM and potentially lead to remote code execution on the server hosting vLLM.
Critical Impact
This vulnerability enables network-based attackers with low privileges to achieve denial-of-service (crash) and potentially execute arbitrary code on vLLM servers through maliciously crafted tensor payloads, compromising the confidentiality, integrity, and availability of AI inference infrastructure.
Affected Products
- vLLM versions 0.10.2 through 0.11.0 (patched in 0.11.1)
- vLLM version 0.11.1-rc0
- vLLM version 0.11.1-rc1
Discovery Timeline
- 2025-11-21 - CVE-2025-62164 published to NVD
- 2025-12-04 - Last updated in NVD database
Technical Details for CVE-2025-62164
Vulnerability Analysis
This vulnerability exploits a fundamental weakness in how vLLM handles untrusted serialized data through PyTorch's tensor deserialization mechanism. The Completions API endpoint accepts user-supplied prompt embeddings as serialized tensors, which are then loaded using torch.load(). The critical issue arises from PyTorch 2.8.0's default behavior change where sparse tensor integrity checks are no longer enforced automatically.
When a maliciously crafted sparse tensor is deserialized and subsequently converted to a dense representation via to_dense(), the lack of proper bounds validation allows an attacker to manipulate tensor dimensions and indices. This results in an out-of-bounds memory write that corrupts heap memory, potentially overwriting critical data structures. The memory corruption can be weaponized beyond simple crashes to achieve arbitrary code execution, particularly when combined with heap spraying or other memory manipulation techniques.
The attack requires only authenticated network access, making it particularly dangerous for vLLM deployments exposed to multiple users or services.
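Conceptually, the missing safeguard is a bounds check on the sparse tensor's indices against its declared dense shape: a densify step that trusts attacker-controlled indices writes wherever those indices point. A minimal pure-Python sketch of such a check (a hypothetical helper for illustration, not vLLM's or PyTorch's actual code):

```python
def validate_coo_indices(indices, shape):
    """Reject sparse COO indices that fall outside the declared dense shape.

    `indices` is a list of per-dimension index lists (COO layout) and `shape`
    is the declared dense shape. A conversion routine that skips this check
    and trusts the indices performs out-of-bounds writes on crafted input.
    """
    if len(indices) != len(shape):
        raise ValueError("index dimensionality does not match declared shape")
    for dim, (row, size) in enumerate(zip(indices, shape)):
        for i in row:
            if not (0 <= i < size):
                raise ValueError(
                    f"index {i} out of bounds for dimension {dim} (size {size})"
                )


# A well-formed 2x3 sparse structure passes silently...
validate_coo_indices([[0, 1], [2, 0]], (2, 3))

# ...while a crafted index past the end of the buffer is rejected.
try:
    validate_coo_indices([[0, 1], [2, 7]], (2, 3))
except ValueError as e:
    print("rejected:", e)
```

This is the class of validation that PyTorch 2.8.0 stopped enforcing by default during deserialization, and that the exploit relies on being absent.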
Root Cause
The root cause is improper input validation (CWE-20) in the tensor deserialization pathway. Specifically, vLLM trusts user-supplied serialized tensor data without validating the integrity of sparse tensor structures before processing them. The vulnerability is exacerbated by PyTorch 2.8.0's decision to disable sparse tensor bounds checking by default, creating a gap between the expected safety guarantees and actual runtime behavior.
Attack Vector
An attacker with network access to the vLLM Completions API can craft a malicious serialized sparse tensor with deliberately corrupted indices or dimensions. When submitted as prompt embeddings, the tensor passes through torch.load() deserialization. The subsequent call to to_dense() performs memory writes based on the corrupted tensor indices without bounds validation, resulting in out-of-bounds memory access. This can crash the vLLM service or, with precise memory layout knowledge, enable arbitrary code execution on the hosting server.
# Security patch requiring explicit flag for loading embeddings
# Source: https://github.com/vllm-project/vllm/commit/58fab50d82838d5014f4a14d991fdb9352c9c84b
# In examples/offline_inference/prithvi_geospatial_mae.py
dtype="float16",
enforce_eager=True,
model_impl="terratorch",
+ enable_mm_embeds=True,
)
# In examples/offline_inference/prithvi_geospatial_mae_io_processor.py
max_num_seqs=32,
io_processor_plugin="prithvi_to_tiff",
model_impl="terratorch",
+ enable_mm_embeds=True,
)
The patch introduces an explicit enable_mm_embeds=True flag that must be set to allow loading of multimodal embeddings, preventing automatic processing of untrusted tensor data.
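For server deployments, the same opt-in applies at startup. A sketch, assuming the engine argument is exposed on the serve CLI as `--enable-mm-embeds` (flag name inferred from the patch; the model name is a placeholder; confirm both against `vllm serve --help` on your build):

```shell
# Default after the patch: client-supplied embeddings are refused,
# so the vulnerable deserialization path is never reached.
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Opt in only for deployments that genuinely need client-supplied
# embeddings, and only behind trusted-network access controls.
vllm serve meta-llama/Llama-3.1-8B-Instruct --enable-mm-embeds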
Detection Methods for CVE-2025-62164
Indicators of Compromise
- Unexpected vLLM server crashes or service restarts, particularly when processing API requests
- Anomalous serialized tensor payloads in Completions API request logs containing unusual sparse tensor structures
- Memory corruption signatures or segmentation faults in vLLM process logs
- Unusual network traffic patterns to the Completions API endpoint with large or malformed payloads
Detection Strategies
- Monitor vLLM service logs for segmentation faults, memory access violations, or unexpected process terminations
- Implement API request payload inspection to detect abnormally structured tensor data in prompt embeddings
- Deploy endpoint detection and response (EDR) tooling to identify memory corruption exploitation attempts on vLLM host systems
- Set up process monitoring for vLLM to track crash frequency and restart patterns
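The log-based strategies above can be sketched as a small scanner for crash signatures. The patterns below are illustrative examples of common glibc and kernel crash messages, not an exhaustive or vLLM-specific list; tune them to your logging format:

```python
import re

# Illustrative crash signatures; extend for your environment.
CRASH_PATTERNS = [
    re.compile(r"Segmentation fault"),
    re.compile(r"SIGSEGV"),
    re.compile(r"free\(\): invalid pointer"),
    re.compile(r"corrupted (size vs\. prev_size|double-linked list)"),
]


def scan_log_lines(lines):
    """Return (line_number, line) pairs that match a known crash signature."""
    hits = []
    for n, line in enumerate(lines, start=1):
        if any(p.search(line) for p in CRASH_PATTERNS):
            hits.append((n, line))
    return hits


sample = [
    "INFO: request served in 120 ms",
    "Segmentation fault (core dumped)",
    "INFO: engine restarted",
]
print(scan_log_lines(sample))  # only the segfault line is flagged
```

A scanner like this is most useful when paired with restart tracking, since a single crafted request crashing the worker will show up as a crash signature immediately followed by a restart entry.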
Monitoring Recommendations
- Configure alerting on vLLM process crashes or abnormal memory consumption patterns
- Implement request logging with payload size and structure analysis for the Completions API
- Deploy runtime application self-protection (RASP) solutions to detect memory corruption attempts
- Monitor for unusual CPU or memory spikes during tensor deserialization operations
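Payload-level monitoring can start with coarse structural gates applied before any deserialization is attempted. A minimal sketch, assuming embeddings arrive base64-encoded in the request body (the wire format and the size ceiling are assumptions; adjust to your deployment):

```python
import base64
import binascii

MAX_PAYLOAD_BYTES = 8 * 1024 * 1024  # assumed ceiling for one embedding payload


def gate_embedding_payload(b64_payload: str) -> bytes:
    """Decode and size-check an embedding payload before torch.load() sees it."""
    try:
        raw = base64.b64decode(b64_payload, validate=True)
    except binascii.Error as e:
        raise ValueError(f"payload is not valid base64: {e}") from e
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload of {len(raw)} bytes exceeds limit")
    return raw
```

A gate like this does not detect a malicious sparse tensor by itself, but it bounds the attack surface, gives monitoring a single choke point to log payload sizes, and rejects obviously malformed submissions cheaply.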
How to Mitigate CVE-2025-62164
Immediate Actions Required
- Upgrade vLLM to version 0.11.1 or later immediately to apply the security patch
- Audit access controls on vLLM Completions API endpoints to limit exposure to authenticated and trusted users only
- Review and restrict which clients can submit custom prompt embeddings to the API
- Implement network segmentation to limit access to vLLM servers from untrusted networks
Patch Information
The vulnerability has been patched in vLLM version 0.11.1. The fix introduces an explicit enable_mm_embeds flag that must be enabled to allow loading of text and image embeddings, preventing automatic processing of potentially malicious serialized tensor data. For detailed patch information, see the GitHub Security Advisory and the Pull Request #27204.
Workarounds
- If upgrading is not immediately possible, restrict access to the Completions API endpoint to trusted internal clients only
- Implement an API gateway or web application firewall to filter and validate incoming tensor payloads before they reach vLLM
- Disable custom prompt embedding functionality if not required for your use case
- Consider running vLLM in isolated containers with limited privileges and resource constraints to minimize RCE impact
# Configuration example: Restrict vLLM API access via network firewall
# Allow only trusted internal network access to vLLM port 8000
iptables -A INPUT -p tcp --dport 8000 -s 10.0.0.0/8 -j ACCEPT
iptables -A INPUT -p tcp --dport 8000 -j DROP
# Run vLLM with resource limits in container
docker run --memory="16g" --cpus="4" \
--security-opt=no-new-privileges \
--read-only \
vllm/vllm-openai:v0.11.1