CVE-2025-68756 Overview
A deadlock vulnerability has been identified in the Linux kernel's block multi-queue (blk-mq) subsystem. The flaw exists in the blk_mq_[un]quiesce_tagset() functions, which take the set->tag_list_lock mutex while quiescing queues. This creates a circular dependency that can hang the system when an NVMe command times out while a queue is being added to or removed from a tagset.
The vulnerability was introduced when commit 98d81f0df70c modified the NVMe driver to quiesce the entire tagset instead of individual queues. This change exposed a lock ordering issue between the timeout handling path and the queue removal path, causing two threads to deadlock while waiting for each other to release resources.
Critical Impact
Systems using NVMe storage devices may experience complete system hangs when device timeouts occur during queue management operations, requiring a hard reboot to recover.
Affected Products
- Linux Kernel (versions with blk-mq tagset sharing)
- NVMe storage subsystem components
- Systems using block multi-queue I/O scheduling
Discovery Timeline
- 2026-01-05 - CVE-2025-68756 published to NVD
- 2026-01-08 - Last updated in NVD database
Technical Details for CVE-2025-68756
Vulnerability Analysis
The deadlock condition arises from the interaction between two kernel code paths that both require access to shared resources but acquire locks in conflicting orders. The blk_mq_{add,del}_queue_tag_set() functions manage queue attachments to tagsets and must freeze queues before modifying the BLK_MQ_F_TAG_QUEUE_SHARED flag. These functions hold the set->tag_list_lock mutex during this operation.
Meanwhile, blk_mq_quiesce_tagset() takes this same lock to walk the queues in set->tag_list. When an NVMe command times out, the timeout handler calls nvme_dev_disable(), which invokes blk_mq_quiesce_tagset(). If another thread holds the lock while removing a queue and waiting for it to freeze, the timeout handler blocks on the mutex. The freeze, in turn, cannot complete until the timeout handler finishes, so neither thread makes progress.
The two conflicting stack traces demonstrate this circular dependency:
- Thread A (timeout handler): nvme_timeout() → nvme_dev_disable() → blk_mq_quiesce_tagset() - waiting for set->tag_list_lock
- Thread B (queue removal): nvme_ns_remove() → del_gendisk() → blk_mq_exit_queue() → blk_mq_update_tag_set_shared() → blk_mq_freeze_queue_wait() - holding set->tag_list_lock, waiting for queue freeze
Root Cause
The root cause is the improper synchronization mechanism used in blk_mq_[un]quiesce_tagset(). The functions use mutex-based locking (set->tag_list_lock) to protect the tag list traversal, but this creates a lock dependency that conflicts with the queue freeze operation. Since quiescing a queue does not require sleeping, the use of a mutex is unnecessarily restrictive and creates the conditions for deadlock.
The fix replaces the mutex-based synchronization with RCU (Read-Copy-Update), a synchronization mechanism whose readers do not block and which is well suited to read-mostly data structures. Because blk_mq_[un]quiesce_tagset() now traverses the queue list under RCU rather than under set->tag_list_lock, the timeout path no longer waits on a lock held by the queue removal path, and the circular dependency disappears.
Attack Vector
This is a denial of service vulnerability that manifests under specific operational conditions rather than through external exploitation. The deadlock can be triggered when:
- An NVMe device experiences a command timeout while queue management operations are in progress
- A system administrator unbinds an NVMe device via sysfs while I/O operations are pending
- Hot-removal of NVMe storage occurs during high I/O load with timeouts
While not remotely exploitable in the traditional sense, the vulnerability can cause complete system unavailability requiring manual intervention. In virtualized or containerized environments where NVMe devices are shared or passthrough is used, the impact could affect multiple workloads.
Detection Methods for CVE-2025-68756
Indicators of Compromise
- System hangs with no kernel panic or crash dump generated
- Processes stuck in uninterruptible sleep state (D state) related to NVMe or block I/O operations
- Kernel stack traces in logs showing blk_mq_quiesce_tagset and blk_mq_freeze_queue_wait on different threads
- NVMe device timeout messages followed by system unresponsiveness
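The D-state indicator above can be checked with standard tools. The following is a minimal sketch, assuming a procps-style ps that supports the pid, stat, wchan, and comm output columns. The function names filter_d_state and list_blocked_tasks are illustrative and not part of any existing tool.

```shell
# Keep only rows whose STAT column starts with D (uninterruptible sleep).
# Expects `ps`-style input with columns: PID STAT WCHAN COMMAND.
filter_d_state() {
    awk '$2 ~ /^D/'
}

# Feed live process data into the filter (assumes procps `ps`).
list_blocked_tasks() {
    ps -eo pid,stat,wchan:32,comm | filter_d_state
}
```

Running list_blocked_tasks prints one line per blocked task. A wait channel such as blk_mq_freeze_queue_wait on one task, combined with a timeout worker stuck in a mutex wait, would be consistent with the pattern described in this advisory, although D-state tasks alone are not proof of this specific deadlock. On some systems the WCHAN column shows only "-" for unprivileged users.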
Detection Strategies
- Monitor for hung task warnings in kernel logs referencing blk_mq_* or nvme_* functions
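- Use echo w > /proc/sysrq-trigger to dump blocked (D state) tasks, or echo t > /proc/sysrq-trigger to dump all tasks, and look for the deadlock pattern in the resulting stack traces; this requires SysRq to be enabled and a shell or console that still responds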
- Deploy kernel watchdog monitoring to detect and alert on system hangs
- Review dmesg output for NVMe timeout messages correlating with system stability issues
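The stack-trace check described above can be scripted. This is a rough sketch that treats a saved kernel log as matching when it contains both blk_mq_quiesce_tagset and blk_mq_freeze_queue_wait, the two symbols named in this advisory. The function name has_deadlock_signature is hypothetical, and a match is only a hint to inspect the traces by hand, since both symbols can appear in unrelated traces.

```shell
# Return 0 if the log file mentions both sides of the deadlock pattern:
# the timeout path (blk_mq_quiesce_tagset) and the queue removal path
# (blk_mq_freeze_queue_wait). Return non-zero otherwise.
has_deadlock_signature() {
    log=$1
    grep -q 'blk_mq_quiesce_tagset' "$log" &&
        grep -q 'blk_mq_freeze_queue_wait' "$log"
}
```

A possible use is to save the log with dmesg > /tmp/dmesg.txt and then run has_deadlock_signature /tmp/dmesg.txt && echo "possible blk-mq tagset deadlock".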
Monitoring Recommendations
- Implement automated kernel log analysis for deadlock signatures involving block layer functions
- Configure hardware watchdog timers to automatically recover from system hangs
- Monitor NVMe device health metrics and timeout rates through smartctl or nvme-cli tools
- Set up alerting for unusual patterns of NVMe command timeouts in storage-intensive workloads
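Timeout-rate alerting needs a count to alert on. The sketch below tallies timeout lines per controller from kernel log text on standard input. It assumes that each relevant line contains the word "timeout" and a controller name of the form nvme followed by digits; actual NVMe driver message formats vary between kernel versions, so the pattern may need adjusting. The function name count_nvme_timeouts is illustrative.

```shell
# Count log lines that mention a timeout, grouped by NVMe controller name.
# Reads kernel log text on stdin; prints "<controller> <count>" per line.
count_nvme_timeouts() {
    awk 'tolower($0) ~ /timeout/ && match($0, /nvme[0-9]+/) {
             # Count each line once, keyed by the first controller name in it.
             count[substr($0, RSTART, RLENGTH)]++
         }
         END { for (c in count) print c, count[c] }' | sort
}
```

For example, dmesg | count_nvme_timeouts gives a per-controller tally that a monitoring agent could compare against a threshold chosen for the workload.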
How to Mitigate CVE-2025-68756
Immediate Actions Required
- Update the Linux kernel to a patched version containing the RCU-based fix
- Avoid hot-removal or unbinding of NVMe devices during periods of high I/O activity until patched
- Consider disabling automatic NVMe device unbinding in production environments
- Implement hardware watchdog timers to enable automatic recovery if deadlock occurs
Patch Information
The vulnerability has been addressed through multiple kernel commits that replace the mutex-based synchronization with RCU in the affected functions. The fix updates blk_mq_[un]quiesce_tagset() to use RCU traversal and modifies blk_mq_{add,del}_queue_tag_set() to use RCU-safe list operations.
Relevant patch commits include:
- Linux Kernel Commit 3baeec23
- Linux Kernel Commit 59e25ef2
- Linux Kernel Commit 6e8d3637
- Linux Kernel Commit ef0cd7b6
Workarounds
- Avoid triggering NVMe device unbind operations during periods of active I/O or when timeouts may occur
- Configure longer NVMe command timeout values to reduce the likelihood of timeout-triggered deadlocks
- Use SCSI-based storage instead of NVMe where the risk is unacceptable and patching is not immediately possible
- Implement monitoring to detect early signs of deadlock and trigger graceful failover before complete system hang
# Check current kernel version for vulnerability status
uname -r
# Monitor for hung tasks in kernel logs
dmesg | grep -i "hung_task\|deadlock\|blk_mq"
# Report tasks blocked for more than 60 seconds (hung-task detector threshold; this does not configure a hardware watchdog)
echo 60 > /proc/sys/kernel/hung_task_timeout_secs
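Before changing NVMe timeout values as suggested in the workarounds, it helps to know the current settings. This is a minimal sketch, assuming the nvme_core module exposes io_timeout and admin_timeout under /sys/module/nvme_core/parameters; the function name show_nvme_timeouts is hypothetical, and the directory can be overridden for testing.

```shell
# Print the current NVMe I/O and admin command timeouts (in seconds).
# The optional argument overrides the sysfs parameters directory.
show_nvme_timeouts() {
    dir=${1:-/sys/module/nvme_core/parameters}
    for name in io_timeout admin_timeout; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}
```

Calling show_nvme_timeouts with no argument reads the live values. Longer timeouts only make the triggering condition less frequent; they do not remove the lock dependency, so patching remains the actual fix.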
Disclaimer: This content was generated using AI. While we strive for accuracy, please verify critical information with official sources.


