CVE-2023-36464: pypdf Denial-of-Service Vulnerability

CVE-2023-36464 Overview

CVE-2023-36464 is an infinite loop vulnerability in pypdf, an open source, pure-Python PDF library. In affected versions, an attacker can craft a malicious PDF that sends the __parse_content_stream function into an infinite loop. This happens, for example, when a user extracts text from such a file. The looping process consumes CPU indefinitely, resulting in a denial of service.

Critical Impact
An attacker can craft a malicious PDF file that causes applications using pypdf to hang indefinitely when processing text extraction, resulting in denial of service conditions.

Affected Products

pypdf_project pypdf
pypdf2_project pypdf2

Discovery Timeline

2023-06-27 - CVE-2023-36464 published to NVD
2024-11-21 - Last updated in NVD database

Technical Details for CVE-2023-36464

Vulnerability Analysis

This vulnerability is classified under CWE-835 (Loop with Unreachable Exit Condition), commonly referred to as an infinite loop. The flaw exists in the PDF content stream parsing functionality within pypdf. When parsing content streams, the library reads bytes looking for carriage return (\r) or newline (\n) characters. However, the parsing logic fails to properly handle the case where the stream reaches end-of-file (EOF) before encountering these expected characters.

The attack vector is rated local and requires user interaction. The attacker does not need access to the target system, but a victim (or a service acting on the victim's behalf) must open or process a maliciously crafted PDF file. The vulnerability does not compromise confidentiality or integrity, but it can take the application's availability away entirely: the affected process consumes CPU until it is terminated.

Root Cause

The root cause lies in the parsing loop within pypdf/generic/_data_structures.py. The problematic code iterates with the condition while peek not in (b"\r", b"\n"), which continues reading until a newline or carriage return is found. The issue is that this condition never accounts for an empty byte string (b""), which is returned when the end of the file is reached. Without proper EOF handling, a malformed PDF that lacks the expected line terminators causes the parser to loop indefinitely.
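The effect of the missing EOF check can be reproduced without pypdf. The sketch below is a simplified model of the loop condition described above, not the library's actual code; the helper name reads_until_line_end and the max_reads safety cap are hypothetical additions so that the demonstration terminates instead of hanging.

```python
import io


def reads_until_line_end(stream, terminators, max_reads=10_000):
    """Read one byte at a time until a byte in `terminators` is seen.

    Returns the number of reads performed, or None if `max_reads` is hit.
    The cap is an illustrative safeguard; the loop it models has no such limit.
    """
    peek = stream.read(1)
    reads = 1
    while peek not in terminators:
        if reads >= max_reads:
            return None  # without the cap, this loop would never exit
        peek = stream.read(1)  # at end of stream, read() returns b"" forever
        reads += 1
    return reads


# A stream that ends without any carriage return or newline.
data = b"% trailing text with no line ending"

vulnerable = reads_until_line_end(io.BytesIO(data), (b"\r", b"\n"))
patched = reads_until_line_end(io.BytesIO(data), (b"\r", b"\n", b""))

print("vulnerable condition:", vulnerable)
print("patched condition:", patched)
```

With the original terminator set, the call runs into the cap and returns None, because every read past the end of the stream yields b"" and b"" is never treated as a stop condition. With b"" added, the loop exits on the first empty read.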

Attack Vector

The attack vector is local and requires user interaction. An attacker must distribute a specially crafted PDF file and convince a user to process it with an application that uses the vulnerable pypdf library. Common scenarios include:

Email attachments containing malicious PDFs
PDF files uploaded to web applications that perform server-side text extraction
Document processing pipelines that automatically parse PDF content
Any application using pypdf's text extraction functionality on untrusted PDF files

The vulnerability is triggered when the __parse_content_stream function processes the malicious content, such as during text extraction operations.

Detection Methods for CVE-2023-36464

Indicators of Compromise

Processes using the pypdf or PyPDF2 libraries consuming excessive CPU resources for extended periods
Application hangs or freezes when opening or processing specific PDF files
Abnormally long processing times for PDF text extraction operations
System resource exhaustion symptoms coinciding with PDF processing activities

Detection Strategies

Monitor application processes for unusually long-running PDF parsing operations
Implement timeouts for PDF processing operations to detect potential infinite loop conditions
Audit Python dependencies to identify vulnerable pypdf versions in your environment
Use software composition analysis (SCA) tools to scan for affected library versions
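As a complement to version-based scanning, an installed copy can be checked directly for the loop condition named in the advisory. This is a minimal sketch under stated assumptions: the function names classify_source and check_installed are hypothetical, and the check is a plain substring match, so a release that reformats or rewrites the line will be reported as "unknown" rather than as patched.

```python
import importlib.util
from pathlib import Path

# Loop conditions as quoted in the advisory (raw strings keep the backslashes).
VULNERABLE_LINE = r'while peek not in (b"\r", b"\n")'
PATCHED_LINE = r'while peek not in (b"\r", b"\n", b"")'


def classify_source(text):
    """Classify the contents of _data_structures.py by substring match."""
    if PATCHED_LINE in text:
        return "patched"
    if VULNERABLE_LINE in text:
        return "vulnerable"
    return "unknown"


def check_installed(package="pypdf"):
    """Locate generic/_data_structures.py for `package` and classify it."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return "not installed"
    package_dir = Path(list(spec.submodule_search_locations)[0])
    source_file = package_dir / "generic" / "_data_structures.py"
    if not source_file.is_file():
        return "unknown"
    return classify_source(source_file.read_text(encoding="utf-8"))


if __name__ == "__main__":
    for name in ("pypdf", "PyPDF2"):
        print(name, "->", check_installed(name))
```

An "unknown" result is not evidence of safety; treat it as a prompt to compare the installed version against the affected range listed in the security advisory.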

Monitoring Recommendations

Configure application-level timeouts for all PDF processing operations
Monitor CPU utilization patterns for processes that handle PDF files
Set up alerts for processes that exceed expected execution times during document processing
Implement health checks for services that depend on pypdf for PDF parsing
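One way to enforce such a timeout is to run the extraction in a child process, since a Python thread stuck in a pure-Python loop cannot be interrupted from outside. The following is a sketch rather than a definitive implementation: run_python_with_timeout, extract_text_safely, and the 10-second default are hypothetical choices, while PdfReader and extract_text() are pypdf's public API.

```python
import subprocess
import sys

# Script executed in the child interpreter; the PDF path arrives as argv[1].
EXTRACT_SCRIPT = """
import sys
from pypdf import PdfReader

reader = PdfReader(sys.argv[1])
sys.stdout.write("\\n".join(page.extract_text() for page in reader.pages))
"""


def run_python_with_timeout(script, args=(), timeout_s=10.0):
    """Run `script` in a fresh interpreter and return (ok, stdout).

    If the child exceeds `timeout_s`, subprocess.run kills it and this
    function returns (False, "").
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return result.returncode == 0, result.stdout


def extract_text_safely(pdf_path, timeout_s=10.0):
    """Extract text from `pdf_path`, giving up after `timeout_s` seconds."""
    return run_python_with_timeout(EXTRACT_SCRIPT, [pdf_path], timeout_s)
```

A caller would treat a False result as a failed or suspicious document and log the file for review. The timeout bounds the damage from a hang but does not remove the underlying flaw, so it complements upgrading rather than replacing it.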

How to Mitigate CVE-2023-36464

Immediate Actions Required

Upgrade pypdf to a patched version that includes the fix from pull request #1828
Audit all applications and services that use the pypdf or PyPDF2 libraries
Implement input validation and processing timeouts for PDF file handling
Restrict PDF processing to trusted sources where possible

Patch Information

The vulnerability was introduced in pull request #969 and has been resolved in pull request #1828. Users should upgrade to the latest version of pypdf that includes this fix. The GitHub Security Advisory GHSA-4vvm-4w3v-6mr8 provides additional details about the affected versions and remediation.

Workarounds

For users unable to upgrade immediately, patch the installed library by hand: in pypdf/generic/_data_structures.py, change the line while peek not in (b"\r", b"\n") to while peek not in (b"\r", b"\n", b""), so that the empty byte string returned at end of file also ends the loop
Implement processing timeouts around pypdf text extraction calls to prevent indefinite hangs
Validate PDF files using alternative tools before processing with vulnerable pypdf versions
Consider using sandboxed environments for processing untrusted PDF files

```python
# Manual workaround: edit pypdf/generic/_data_structures.py
# Change the line:
#     while peek not in (b"\r", b"\n")
# so that it also stops at end of file (an empty read):
#     while peek not in (b"\r", b"\n", b"")
```