CVE-2026-44223 Overview
CVE-2026-44223 is a denial-of-service vulnerability in vLLM, an inference and serving engine for large language models (LLMs). The flaw resides in the extract_hidden_states speculative decoding proposer, which returns a tensor with an incorrect shape after the first decode step. This triggers a RuntimeError that crashes the EngineCore process whenever a request uses sampling penalty parameters. A single request specifying repetition_penalty, frequency_penalty, or presence_penalty is sufficient to take down the server. The issue is fixed in vLLM 0.20.0.
Critical Impact
A single authenticated request containing a sampling penalty parameter crashes the vLLM EngineCore process, disrupting inference service for all concurrent users.
Affected Products
- vLLM versions prior to 0.20.0
- Deployments using speculative decoding with the extract_hidden_states proposer
- LLM inference services exposing sampling penalty parameters to clients
Discovery Timeline
- 2026-05-12 - CVE-2026-44223 published to NVD
- 2026-05-15 - Last updated in NVD database
Technical Details for CVE-2026-44223
Vulnerability Analysis
The vulnerability stems from incorrect tensor shape handling in vLLM's speculative decoding pipeline. The extract_hidden_states proposer produces a tensor whose dimensions do not match what downstream sampling logic expects after the first decode step. When sampling penalty parameters are applied, the runtime attempts to broadcast or index into the malformed tensor and raises a RuntimeError. This exception propagates up and terminates the EngineCore process responsible for batched inference. The issue is classified as [CWE-131] Incorrect Calculation of Buffer Size, reflecting the mismatch between expected and produced tensor dimensions.
Root Cause
The root cause is a shape calculation error in the speculative decoding proposer. After the initial decode step, extract_hidden_states returns a tensor sized incorrectly relative to the active batch. The sampling logic that applies repetition_penalty, frequency_penalty, or presence_penalty requires tensor shapes consistent with the batch's token history. The mismatch surfaces only when a penalty code path is exercised, which is why deployments whose requests never set these parameters are not affected.
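The latent nature of the bug can be illustrated with a toy model in plain Python. This is not vLLM code: the functions broadcast_shape and sampler_step and the shapes used are hypothetical, chosen only to show how an inconsistent shape stays harmless until a penalty path combines it with the logits.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes the way NumPy and PyTorch do, or raise RuntimeError."""
    out = []
    # Compare dimensions from the trailing side; missing dimensions count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise RuntimeError(f"size mismatch at dim -{i}: {da} vs {db}")
        out.append(db if da == 1 else da)
    return tuple(reversed(out))


def sampler_step(logits_shape, history_shape, penalties_requested):
    """Toy sampler: shapes are only combined when a penalty is requested."""
    if penalties_requested:
        # Penalty path: per-request state must line up with the logits.
        return broadcast_shape(logits_shape, history_shape)
    # No penalty: the inconsistent shape is never touched.
    return logits_shape


# Hypothetical shapes (assumptions): 4 requests in the batch, a vocabulary of
# 8 tokens, and per-request state that covers only 3 rows.
LOGITS = (4, 8)
BAD_HISTORY = (3, 8)
```

Calling sampler_step(LOGITS, BAD_HISTORY, False) returns normally, while the same call with penalties_requested=True raises RuntimeError. In this toy, nothing catches the exception, which mirrors how the advisory describes the error propagating up to terminate EngineCore.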
Attack Vector
The attack vector is network-based and requires only low privileges, typically API access to a vLLM endpoint. An attacker submits an inference request containing any non-default penalty value, for example "repetition_penalty": 1.1. The malformed tensor reaches the sampler on the second decode iteration and triggers the runtime exception. The EngineCore worker crashes, dropping all in-flight requests in the same batch and disrupting service until the process is restarted.
Detection Methods for CVE-2026-44223
Indicators of Compromise
- Repeated RuntimeError exceptions in vLLM logs referencing tensor shape mismatches in extract_hidden_states or sampling penalty code paths
- Unexpected restarts or termination of the EngineCore process during normal inference traffic
- API request patterns containing repetition_penalty, frequency_penalty, or presence_penalty immediately preceding server crashes
Detection Strategies
- Monitor vLLM stderr and structured logs for RuntimeError tracebacks originating in the speculative decoding proposer
- Correlate inbound request payloads against EngineCore process lifecycle events to identify crash-inducing requests
- Track inference service availability metrics for short-interval outages following requests with penalty parameters set
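As a starting point for the log monitoring described above, the following sketch scans log lines for RuntimeError entries that have proposer or penalty context shortly before them. The advisory does not quote the actual traceback text, so the patterns and the find_suspect_tracebacks helper are assumptions that should be tuned against real vLLM logs.

```python
import re

# Assumed markers: the exact wording of the vLLM traceback is not given in the
# advisory, so these patterns are illustrative only.
ERROR_RE = re.compile(r"RuntimeError")
CONTEXT_RE = re.compile(r"extract_hidden_states|penalt", re.IGNORECASE)


def find_suspect_tracebacks(lines, window=30):
    """Return indices of RuntimeError lines with proposer/penalty context nearby.

    Python tracebacks list stack frames before the exception line, so the
    search looks back up to `window` lines from each RuntimeError.
    """
    hits = []
    for i, line in enumerate(lines):
        if not ERROR_RE.search(line):
            continue
        start = max(0, i - window)
        context = lines[start:i + 1]
        if any(CONTEXT_RE.search(c) for c in context):
            hits.append(i)
    return hits
```

The returned indices can then be joined against request logs by timestamp to find the payload that preceded each crash.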
Monitoring Recommendations
- Instrument the vLLM serving layer with process supervision metrics and alert on abnormal worker restart rates
- Log the full JSON body of failed inference requests to support post-incident triage
- Set rate-based alerts on HTTP 5xx responses from /v1/completions and /v1/chat/completions endpoints
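A rate-based alert on 5xx responses can be approximated with a sliding window, as in the minimal sketch below. The ErrorRateAlert class, its default thresholds, and the way status codes are fed in are all assumptions for illustration; a production setup would more likely express the same rule in an existing metrics and alerting system.

```python
from collections import deque


class ErrorRateAlert:
    """Sliding-window alert on the share of HTTP 5xx responses."""

    def __init__(self, window_seconds=60.0, threshold=0.2, min_requests=10):
        self.window_seconds = window_seconds
        self.threshold = threshold        # fraction of 5xx that triggers the alert
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events = deque()            # (timestamp, is_5xx) pairs

    def record(self, timestamp, status_code):
        """Record one response and return True if the alert should fire."""
        self._events.append((timestamp, 500 <= status_code <= 599))
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_err in self._events if is_err)
        return errors / total >= self.threshold
```

Feeding record() with the status of each response from /v1/completions and /v1/chat/completions gives a simple signal for the burst of failures that follows an EngineCore crash.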
How to Mitigate CVE-2026-44223
Immediate Actions Required
- Upgrade vLLM to version 0.20.0 or later, which contains the fix for the extract_hidden_states shape error
- Restrict network access to vLLM inference endpoints to authenticated, trusted clients until patching is complete
- Audit existing deployments for use of speculative decoding combined with sampling penalty parameters
Patch Information
The vulnerability is fixed in vLLM 0.20.0 via vLLM Pull Request #38610. Full details are available in the GitHub Security Advisory GHSA-83vm-p52w-f9pw. Operators should update container images, Python dependencies, and any pinned requirements files to reference the patched release.
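To confirm that an environment actually runs a patched build, the installed version can be compared against 0.20.0. The sketch below uses only the standard library; the is_patched helper and its conservative handling of pre-release suffixes are assumptions, not an official vLLM check.

```python
import re
from importlib import metadata

FIXED = (0, 20, 0)  # first patched release according to the advisory


def is_patched(version):
    """Return True if a version string is at or above the fixed release."""
    public = version.split("+", 1)[0]  # drop local tags such as "+cu121"
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", public)
    if match is None:
        return False  # unparseable: treat as unpatched
    numeric = tuple(int(g or 0) for g in match.groups())
    if numeric != FIXED:
        return numeric > FIXED
    suffix = public[match.end():]
    # Pre-releases of the fixed version (rc, dev, ...) are treated as unpatched.
    return suffix == "" or suffix.startswith(".post")


def installed_vllm_is_patched():
    """Check the vLLM distribution installed in the current environment."""
    try:
        return is_patched(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return None  # vLLM is not installed here
```

Comparing numeric tuples rather than raw strings matters here, because a plain string comparison would rank "0.9.2" above "0.20.0".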
Workarounds
- Disable speculative decoding in the vLLM server configuration if upgrading is not immediately feasible
- Reject or strip repetition_penalty, frequency_penalty, and presence_penalty fields at an API gateway in front of vLLM
- Run multiple vLLM server replicas behind a load balancer with automatic restart to limit the service interruption caused by a crash; this reduces the impact but does not prevent the crash itself
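The gateway workaround above can be sketched as a small filter that removes the three penalty fields before a request is forwarded. The strip_penalties helper is hypothetical, and it assumes the fields appear at the top level of the JSON body; how it is wired into a particular gateway or proxy is left open.

```python
import json

PENALTY_FIELDS = ("repetition_penalty", "frequency_penalty", "presence_penalty")


def strip_penalties(raw_body):
    """Remove penalty fields from a JSON request body.

    Returns (sanitized_body, removed_field_names). Bodies that are not JSON
    objects are returned unchanged.
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        return raw_body, []
    removed = [name for name in PENALTY_FIELDS if name in payload]
    for name in removed:
        del payload[name]
    return json.dumps(payload), removed
```

Stripping silently changes sampling behavior for clients that rely on penalties, so rejecting such requests with a clear error may be preferable in some deployments. The list of removed field names can also be logged to support the request correlation described under Detection Strategies.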