CVE-2026-44223: vLLM DoS Vulnerability

CVE-2026-44223 Overview

CVE-2026-44223 is a denial-of-service vulnerability in vLLM, an inference and serving engine for large language models (LLMs). The flaw resides in the extract_hidden_states speculative decoding proposer, which returns a tensor with an incorrect shape after the first decode step. This triggers a RuntimeError that crashes the EngineCore process whenever a request uses sampling penalty parameters. A single request specifying repetition_penalty, frequency_penalty, or presence_penalty is sufficient to take down the server. The issue is fixed in vLLM 0.20.0.

Critical Impact
A single authenticated request containing a sampling penalty parameter crashes the vLLM EngineCore process, disrupting inference service for all concurrent users.

Affected Products

vLLM versions prior to 0.20.0
Deployments using speculative decoding with the extract_hidden_states proposer
LLM inference services exposing sampling penalty parameters to clients

Discovery Timeline

2026-05-12 - CVE-2026-44223 published to NVD
2026-05-15 - Last updated in NVD database

Technical Details for CVE-2026-44223

Vulnerability Analysis

The vulnerability stems from incorrect tensor shape handling in vLLM's speculative decoding pipeline. The extract_hidden_states proposer produces a tensor whose dimensions do not match what downstream sampling logic expects after the first decode step. When sampling penalty parameters are applied, the runtime attempts to broadcast or index into the malformed tensor and raises a RuntimeError. This exception propagates up and terminates the EngineCore process responsible for batched inference. The issue is classified as [CWE-131] Incorrect Calculation of Buffer Size, reflecting the mismatch between expected and produced tensor dimensions.

Root Cause

The root cause is a shape calculation error in the speculative decoding proposer. After the initial decode step, extract_hidden_states returns a tensor sized incorrectly relative to the active batch. The sampling logic that applies repetition_penalty, frequency_penalty, or presence_penalty requires tensor shapes consistent with the batch's token history. The mismatch surfaces only when a penalty path is exercised, so a deployment is exposed only if it uses the extract_hidden_states proposer and accepts requests that set at least one of these parameters.

Attack Vector

The attack vector is network-based and requires only low privileges, typically API access to a vLLM endpoint. An attacker submits an inference request containing any non-default penalty value, for example "repetition_penalty": 1.1. The malformed tensor reaches the sampler on the second decode iteration and triggers the runtime exception. The EngineCore worker crashes, dropping all in-flight requests in the same batch and disrupting service until the process is restarted.

Detection Methods for CVE-2026-44223

Indicators of Compromise

Repeated RuntimeError exceptions in vLLM logs referencing tensor shape mismatches in extract_hidden_states or sampling penalty code paths
Unexpected restarts or termination of the EngineCore process during normal inference traffic
API request patterns containing repetition_penalty, frequency_penalty, or presence_penalty immediately preceding server crashes

Detection Strategies

Monitor vLLM stderr and structured logs for RuntimeError tracebacks originating in the speculative decoding proposer
Correlate inbound request payloads against EngineCore process lifecycle events to identify crash-inducing requests
Track inference service availability metrics for short-interval outages following requests with penalty parameters set
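
The log-monitoring strategy above can be sketched as a small scanner. This is a minimal illustration, not an official detection rule: the exact wording of vLLM's tracebacks varies by version and logging configuration, so the patterns, the function name find_suspect_tracebacks, and the 20-line window are all assumptions to adapt.

```python
import re
from typing import Iterable, List

# Assumed markers: adjust these to match the log lines your deployment emits.
ERROR_RE = re.compile(r"RuntimeError")
CONTEXT_RE = re.compile(r"extract_hidden_states|penalt", re.IGNORECASE)


def find_suspect_tracebacks(lines: Iterable[str], window: int = 20) -> List[int]:
    """Return 1-based line numbers of RuntimeError lines that mention, or are
    preceded within `window` lines by, a proposer- or penalty-related line."""
    recent: List[str] = []
    hits: List[int] = []
    for lineno, line in enumerate(lines, start=1):
        if ERROR_RE.search(line) and any(
            CONTEXT_RE.search(candidate) for candidate in recent + [line]
        ):
            hits.append(lineno)
        # Keep only the most recent `window` lines as traceback context.
        recent.append(line)
        if len(recent) > window:
            recent.pop(0)
    return hits
```

Feeding the function an open log file (for example, find_suspect_tracebacks(open(path))) yields the line numbers worth correlating with EngineCore restarts.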

Monitoring Recommendations

Instrument the vLLM serving layer with process supervision metrics and alert on abnormal worker restart rates
Log the full JSON body of failed inference requests to support post-incident triage
Set rate-based alerts on HTTP 5xx responses from /v1/completions and /v1/chat/completions endpoints
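
A rate-based 5xx alert of the kind recommended above could look roughly like the following sliding-window check. The class name ErrorRateAlert and its thresholds are hypothetical and illustrative; most teams would express the same rule in their existing monitoring stack rather than in application code.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlert:
    """Sliding-window check on the share of HTTP 5xx responses.

    The default thresholds are placeholders, not recommendations.
    """

    def __init__(
        self,
        window_seconds: float = 60.0,
        max_error_ratio: float = 0.2,
        min_requests: int = 5,
    ) -> None:
        self.window_seconds = window_seconds
        self.max_error_ratio = max_error_ratio
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, status: int) -> bool:
        """Record one response; return True if the alert should fire."""
        self._events.append((timestamp, 500 <= status <= 599))
        # Drop responses that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        errors = sum(1 for _, is_error in self._events if is_error)
        return total >= self.min_requests and errors / total > self.max_error_ratio
```

The min_requests floor keeps a single failed request on an idle server from firing the alert by itself.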

How to Mitigate CVE-2026-44223

Immediate Actions Required

Upgrade vLLM to version 0.20.0 or later, which contains the fix for the extract_hidden_states shape error
Restrict network access to vLLM inference endpoints to authenticated, trusted clients until patching is complete
Audit existing deployments for use of speculative decoding combined with sampling penalty parameters

Patch Information

The vulnerability is fixed in vLLM 0.20.0 via vLLM Pull Request #38610. Full details are available in the GitHub Security Advisory GHSA-83vm-p52w-f9pw. Operators should update container images, Python dependencies, and any pinned requirements files to reference the patched release.
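
To confirm that an environment actually runs a patched release, a version check along these lines may help. It is a sketch using only the Python standard library: the helper names are hypothetical, and the parser deliberately looks only at the leading numeric release segment, so a pre-release string such as 0.20.0rc1 would be treated as 0.20.0 and should be reviewed manually.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

FIXED_VERSION = (0, 20, 0)  # first patched release, per the advisory


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '0.19.1+cu121' -> (0, 19, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_patched(version: str) -> bool:
    """Return True if the release segment is at or above FIXED_VERSION."""
    release = parse_release(version)
    # Pad short versions so that '0.20' compares equal to (0, 20, 0).
    padded = release + (0,) * (len(FIXED_VERSION) - len(release))
    return padded >= FIXED_VERSION


def installed_vllm_version() -> Optional[str]:
    """Return the installed vllm distribution version, or None if absent."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None
```

Running is_patched(installed_vllm_version()) inside each container image or virtual environment, after checking for None, gives a quick per-deployment answer.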

Workarounds

Disable speculative decoding in the vLLM server configuration if upgrading is not immediately feasible
Reject or strip repetition_penalty, frequency_penalty, and presence_penalty fields at an API gateway in front of vLLM
Run multiple vLLM server instances behind a load balancer with automatic restart to reduce, though not eliminate, service interruption from crashes
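
The gateway workaround of stripping penalty fields can be sketched as a pure function applied to the parsed JSON body before it is forwarded. This assumes the penalty parameters arrive as top-level keys of the request body; the function name strip_penalties is hypothetical, and wiring it into a specific gateway or proxy is left out.

```python
from typing import Any, Dict, List, Tuple

PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalties(payload: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return a copy of `payload` without penalty fields, plus the names removed.

    Only top-level keys are inspected (an assumption about the request shape).
    The input dictionary is not modified.
    """
    removed = [name for name in PENALTY_FIELDS if name in payload]
    cleaned = {key: value for key, value in payload.items() if key not in PENALTY_FIELDS}
    return cleaned, removed
```

Returning the removed names lets the gateway log which requests were altered. A gateway that prefers to reject such requests can instead return an HTTP 400 whenever the list is non-empty.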