CVE-2025-4287 Overview
A denial of service vulnerability has been identified in PyTorch version 2.6.0+cu124. The vulnerability exists in the torch.cuda.nccl.reduce function within the torch/cuda/nccl.py file. An attacker with local access can manipulate inputs to this function to cause improper resource handling, resulting in a denial of service condition. The exploit has been publicly disclosed, and a patch has been made available.
Critical Impact
Local attackers can exploit improper input validation in the CUDA NCCL reduce function to cause denial of service, potentially disrupting GPU-accelerated machine learning workloads.
Affected Products
- PyTorch 2.6.0+cu124
Discovery Timeline
- 2025-05-05 - CVE-2025-4287 published to NVD
- 2025-05-05 - Last updated in NVD database
Technical Details for CVE-2025-4287
Vulnerability Analysis
This vulnerability is classified under CWE-404 (Improper Resource Shutdown or Release). The torch.cuda.nccl.reduce function in PyTorch's CUDA NCCL implementation lacks proper input validation, allowing malicious inputs to trigger improper resource handling. When exploited, this causes denial of service by disrupting normal GPU operations in distributed computing scenarios.
The vulnerability affects the NVIDIA Collective Communications Library (NCCL) wrapper used for multi-GPU and multi-node deep learning operations. The NCCL reduce operation is commonly used in distributed training to aggregate gradients across multiple GPUs, making this a significant concern for organizations running distributed machine learning workloads.
Root Cause
The root cause is insufficient validation of operation types passed to the NCCL reduce function. The original implementation did not properly verify that the operation parameter corresponds to valid NCCL reduction operations before processing, leading to improper resource handling when invalid operations are supplied.
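The validation the patch introduces can be illustrated with a minimal sketch. The `SUM` value mirrors the patched module's `ncclRedOp_t` constant; the `check_op` helper name is hypothetical and not part of the `torch.cuda.nccl` API:

```python
# Sketch of the pre-validation pattern: reject unknown ncclRedOp_t codes
# before any NCCL resources are touched. `check_op` is an illustrative
# helper name, not a real torch.cuda.nccl function.
SUM = 0  # ncclRedOp_t value for ncclSum, as in the patched module
VALID_OPS = {SUM}

def check_op(op: int) -> int:
    """Return op unchanged if it is a known reduction op, else raise."""
    if op not in VALID_OPS:
        raise ValueError(f"Unsupported reduction operation: {op}")
    return op
```

Rejecting the invalid op code up front means the error surfaces as an ordinary Python exception instead of leaving GPU or NCCL resources in an inconsistent state.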
Attack Vector
The attack requires local access to the system running PyTorch. An attacker can craft malicious inputs to the torch.cuda.nccl.reduce function to trigger the denial of service condition. Since PyTorch is commonly used in shared computing environments and machine learning pipelines, an attacker with access to submit jobs or scripts could potentially exploit this vulnerability to disrupt GPU compute resources.
```diff
# Security patch in torch/cuda/nccl.py - "Add more check for torch.cuda.nccl"
# Source: GitHub commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5
 from collections.abc import Sequence
 from typing import Optional, Union

-import torch.cuda
+import torch

 __all__ = ["all_reduce", "reduce", "broadcast", "all_gather", "reduce_scatter"]

-SUM = 0  # ncclRedOp_t
+
+# ncclRedOp_t
+SUM = 0  # ncclSum
+
+VALID_OPS = {SUM}

 def is_available(tensors):
```
Detection Methods for CVE-2025-4287
Indicators of Compromise
- Unexpected crashes or hangs in PyTorch CUDA operations during distributed training
- Abnormal GPU resource utilization patterns or resource exhaustion
- Error logs indicating failures in NCCL reduce operations with invalid operation types
- Unusual process behavior in Python scripts utilizing torch.cuda.nccl functions
Detection Strategies
- Monitor system logs for repeated failures in PyTorch CUDA NCCL operations
- Implement application-level logging around NCCL function calls to detect invalid inputs
- Use runtime monitoring tools to track GPU resource utilization anomalies
- Audit Python scripts and machine learning pipelines for suspicious calls to NCCL functions
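The application-level logging strategy above can be sketched as a thin audit wrapper. All names here (`audited_reduce`, the `nccl_audit` logger) are illustrative; `reduce_fn` stands in for the real reduce implementation on systems where it is available:

```python
import logging

logger = logging.getLogger("nccl_audit")

def audited_reduce(reduce_fn, tensors, op=0, valid_ops=frozenset({0})):
    """Log and reject reduce calls that pass an unrecognized op code.

    reduce_fn is the real reduce implementation (e.g. torch.cuda.nccl.reduce
    on a CUDA system); this wrapper and its names are illustrative only.
    """
    if op not in valid_ops:
        # Record the suspicious call so detection tooling can alert on it.
        logger.warning("Rejected NCCL reduce with invalid op=%r", op)
        raise ValueError(f"invalid NCCL reduction op: {op}")
    return reduce_fn(tensors, op=op)
```

Routing all reduce calls through such a wrapper gives security teams a single choke point to watch for the invalid-op pattern described in the indicators above.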
Monitoring Recommendations
- Enable verbose logging for PyTorch CUDA operations in production environments
- Set up alerts for GPU process crashes or abnormal terminations
- Monitor distributed training job health metrics for unexpected failures
- Review access logs to identify unauthorized users executing GPU workloads
How to Mitigate CVE-2025-4287
Immediate Actions Required
- Apply the security patch identified by commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5
- Review and restrict local access to systems running PyTorch with CUDA support
- Audit machine learning pipelines for untrusted input handling in NCCL operations
- Upgrade to a patched version of PyTorch when available from the official repository
Patch Information
A patch has been developed and is available via GitHub Pull Request. The fix introduces a VALID_OPS set that validates reduction operations before processing, preventing invalid operations from causing denial of service. The patch is identified by commit hash 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5.
For additional technical details, refer to the GitHub Issue Discussion.
Workarounds
- Restrict execution of untrusted code in environments with PyTorch CUDA enabled
- Implement input validation wrappers around NCCL function calls in custom code
- Isolate GPU compute workloads using containerization to limit blast radius
- Monitor and limit the ability of non-privileged users to execute GPU-accelerated code
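An input validation wrapper (second workaround above) can be written as a generic guard factory. This is a hedged stop-gap sketch for unpatched builds, not an official PyTorch mechanism; `make_guarded` is a hypothetical name:

```python
def make_guarded(fn, valid_ops=frozenset({0})):
    """Return a wrapper around fn that rejects unknown reduction ops.

    A stop-gap shim for unpatched builds, e.g. (illustrative only):
        torch.cuda.nccl.reduce = make_guarded(torch.cuda.nccl.reduce)
    Applying the official patch remains the proper fix.
    """
    def guarded(*args, op=0, **kwargs):
        if op not in valid_ops:
            raise ValueError(f"invalid NCCL reduction op: {op}")
        return fn(*args, op=op, **kwargs)
    return guarded
```

Because the guard rejects bad op codes before delegating, well-behaved callers are unaffected while the vulnerable code path is never reached.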
```shell
# Configuration example - verify the installed PyTorch version
python -c "import torch; print(f'PyTorch Version: {torch.__version__}')"

# Check whether CUDA NCCL is available (requires a CUDA-capable GPU)
python -c "import torch; print(f'NCCL Available: {torch.cuda.nccl.is_available([torch.zeros(1).cuda()])}')"

# Apply the fix by upgrading PyTorch (once a patched release is available)
pip install --upgrade torch
```

