CVE-2025-4287 Overview
A denial of service vulnerability has been identified in PyTorch version 2.6.0+cu124. The vulnerability exists in the torch.cuda.nccl.reduce function within the torch/cuda/nccl.py file. An attacker with local access can manipulate inputs to this function to cause improper resource handling, resulting in a denial of service condition. The exploit has been publicly disclosed, and a patch has been made available.
Critical Impact
Local attackers can exploit improper input validation in the CUDA NCCL reduce function to cause denial of service, potentially disrupting GPU-accelerated machine learning workloads.
Affected Products
- PyTorch 2.6.0+cu124
Discovery Timeline
- 2025-05-05 - CVE-2025-4287 published to NVD
- 2025-05-05 - Last updated in NVD database
Technical Details for CVE-2025-4287
Vulnerability Analysis
This vulnerability is classified under CWE-404 (Improper Resource Shutdown or Release). The torch.cuda.nccl.reduce function in PyTorch's CUDA NCCL implementation lacks proper input validation, allowing malicious inputs to trigger improper resource handling. When exploited, this causes denial of service by disrupting normal GPU operations in distributed computing scenarios.
The vulnerability affects the NVIDIA Collective Communications Library (NCCL) wrapper used for multi-GPU and multi-node deep learning operations. The NCCL reduce operation is commonly used in distributed training to aggregate gradients across multiple GPUs, making this a significant concern for organizations running distributed machine learning workloads.
Root Cause
The root cause is insufficient validation of operation types passed to the NCCL reduce function. The original implementation did not properly verify that the operation parameter corresponds to valid NCCL reduction operations before processing, leading to improper resource handling when invalid operations are supplied.
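The validation the patch introduces can be illustrated with a minimal sketch. The `SUM` value mirrors the patched module's `ncclRedOp_t` constant; the `check_op` helper name is hypothetical and not part of the `torch.cuda.nccl` API:

```python
# Sketch of the pre-validation pattern: reject unknown ncclRedOp_t codes
# before any NCCL resources are touched. `check_op` is an illustrative
# helper name, not a real torch.cuda.nccl function.
SUM = 0  # ncclRedOp_t value for ncclSum, as in the patched module
VALID_OPS = {SUM}

def check_op(op: int) -> int:
    """Return op unchanged if it is a known reduction op, else raise."""
    if op not in VALID_OPS:
        raise ValueError(f"Unsupported reduction operation: {op}")
    return op
```

Rejecting the invalid op code up front means the error surfaces as an ordinary Python exception instead of leaving GPU or NCCL resources in an inconsistent state.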
Attack Vector
The attack requires local access to the system running PyTorch. An attacker can craft malicious inputs to the torch.cuda.nccl.reduce function to trigger the denial of service condition. Since PyTorch is commonly used in shared computing environments and machine learning pipelines, an attacker with access to submit jobs or scripts could potentially exploit this vulnerability to disrupt GPU compute resources.
```diff
# Security patch in torch/cuda/nccl.py - "Add more check for torch.cuda.nccl"
# Source: GitHub commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5
 from collections.abc import Sequence
 from typing import Optional, Union

-import torch.cuda
+import torch

 __all__ = ["all_reduce", "reduce", "broadcast", "all_gather", "reduce_scatter"]

-SUM = 0  # ncclRedOp_t
+
+# ncclRedOp_t
+SUM = 0  # ncclSum
+
+VALID_OPS = {SUM}

 def is_available(tensors):
```
Detection Methods for CVE-2025-4287
Indicators of Compromise
- Unexpected crashes or hangs in PyTorch CUDA operations during distributed training
- Abnormal GPU resource utilization patterns or resource exhaustion
- Error logs indicating failures in NCCL reduce operations with invalid operation types
- Unusual process behavior in Python scripts utilizing torch.cuda.nccl functions
Detection Strategies
- Monitor system logs for repeated failures in PyTorch CUDA NCCL operations
- Implement application-level logging around NCCL function calls to detect invalid inputs
- Use runtime monitoring tools to track GPU resource utilization anomalies
- Audit Python scripts and machine learning pipelines for suspicious calls to NCCL functions
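The application-level logging strategy above can be sketched as a thin audit wrapper. All names here (`audited_reduce`, the `nccl_audit` logger) are illustrative; `reduce_fn` stands in for the real reduce implementation on systems where it is available:

```python
import logging

logger = logging.getLogger("nccl_audit")

def audited_reduce(reduce_fn, tensors, op=0, valid_ops=frozenset({0})):
    """Log and reject reduce calls that pass an unrecognized op code.

    reduce_fn is the real reduce implementation (e.g. torch.cuda.nccl.reduce
    on a CUDA system); this wrapper and its names are illustrative only.
    """
    if op not in valid_ops:
        # Record the suspicious call so detection tooling can alert on it.
        logger.warning("Rejected NCCL reduce with invalid op=%r", op)
        raise ValueError(f"invalid NCCL reduction op: {op}")
    return reduce_fn(tensors, op=op)
```

Routing all reduce calls through such a wrapper gives security teams a single choke point to watch for the invalid-op pattern described in the indicators above.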
Monitoring Recommendations
- Enable verbose logging for PyTorch CUDA operations in production environments
- Set up alerts for GPU process crashes or abnormal terminations
- Monitor distributed training job health metrics for unexpected failures
- Review access logs to identify unauthorized users executing GPU workloads
How to Mitigate CVE-2025-4287
Immediate Actions Required
- Apply the security patch identified by commit 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5
- Review and restrict local access to systems running PyTorch with CUDA support
- Audit machine learning pipelines for untrusted input handling in NCCL operations
- Upgrade to a patched version of PyTorch when available from the official repository
Patch Information
A patch has been developed and is available via GitHub Pull Request. The fix introduces a VALID_OPS set that validates reduction operations before processing, preventing invalid operations from causing denial of service. The patch is identified by commit hash 5827d2061dcb4acd05ac5f8e65d8693a481ba0f5.
For additional technical details, refer to the GitHub Issue Discussion.
Workarounds
- Restrict execution of untrusted code in environments with PyTorch CUDA enabled
- Implement input validation wrappers around NCCL function calls in custom code
- Isolate GPU compute workloads using containerization to limit blast radius
- Monitor and limit the ability of non-privileged users to execute GPU-accelerated code
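An input validation wrapper (second workaround above) can be written as a generic guard factory. This is a hedged stop-gap sketch for unpatched builds, not an official PyTorch mechanism; `make_guarded` is a hypothetical name:

```python
def make_guarded(fn, valid_ops=frozenset({0})):
    """Return a wrapper around fn that rejects unknown reduction ops.

    A stop-gap shim for unpatched builds, e.g. (illustrative only):
        torch.cuda.nccl.reduce = make_guarded(torch.cuda.nccl.reduce)
    Applying the official patch remains the proper fix.
    """
    def guarded(*args, op=0, **kwargs):
        if op not in valid_ops:
            raise ValueError(f"invalid NCCL reduction op: {op}")
        return fn(*args, op=op, **kwargs)
    return guarded
```

Because the guard rejects bad op codes before delegating, well-behaved callers are unaffected while the vulnerable code path is never reached.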
```shell
# Configuration example - verify the installed PyTorch version
python -c "import torch; print(f'PyTorch Version: {torch.__version__}')"

# Check whether CUDA NCCL is available (requires a CUDA-capable GPU)
python -c "import torch; print(f'NCCL Available: {torch.cuda.nccl.is_available([torch.zeros(1).cuda()])}')"

# Apply the fix by upgrading PyTorch (once a patched release is available)
pip install --upgrade torch
```

