CVE-2025-4022: WebArena Code Injection RCE Vulnerability

CVE-2025-4022 Overview

CVE-2025-4022 is a code injection vulnerability [CWE-94] affecting web-arena-x WebArena versions up to and including 0.2.0. The flaw resides in the HTMLContentEvaluator function within webarena/evaluation_harness/evaluators.py. An attacker who can manipulate the target["url"] argument can inject and execute arbitrary code remotely. Exploit details have been publicly disclosed, which increases exposure for unpatched deployments.

WebArena is a benchmark environment used for evaluating autonomous web agents. Because evaluation harnesses commonly run in trusted research environments, code injection in this component can compromise host systems and downstream pipelines.

Critical Impact
Remote attackers with low privileges can inject code through the target["url"] parameter processed by HTMLContentEvaluator, enabling unauthorized code execution within the evaluation harness.

Affected Products

web-arena-x WebArena versions up to and including 0.2.0
Deployments using webarena/evaluation_harness/evaluators.py
Research and benchmarking environments running the WebArena evaluation harness

Discovery Timeline

2025-04-28 - CVE-2025-4022 published to NVD
2025-05-14 - Last updated in NVD database

Technical Details for CVE-2025-4022

Vulnerability Analysis

The vulnerability resides in the HTMLContentEvaluator function inside webarena/evaluation_harness/evaluators.py. This evaluator processes the target["url"] argument as part of grading agent task outcomes. Improper handling of this input allows attacker-controlled content to flow into a code execution context. The issue is classified as code injection [CWE-94] and as improper neutralization of special elements in output used by a downstream component [CWE-74].

WebArena is widely used to benchmark autonomous web agents, including LLM-driven agents. Test harnesses frequently execute with broad access to the host environment, including network egress, credentials for test sites, and orchestration tooling. Code injection inside the evaluator can therefore pivot from benchmark results into the underlying machine.

Root Cause

The root cause is insufficient validation and neutralization of the target["url"] argument before it is consumed by HTMLContentEvaluator. When the evaluator processes attacker-influenced URL data, the input is interpreted as code rather than data. This pattern aligns with CWE-94, where externally controlled input is incorporated into a code path without proper sanitization.
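The general pattern behind CWE-94 can be illustrated with a deliberately simplified sketch. This is not the WebArena source: the names resolve_url_unsafe and inspect_url_as_data and the "func:" prefix are hypothetical, and the sketch assumes only that some part of target["url"] ends up in Python's built-in eval. The payload shown is harmless and serves only to show that attacker-supplied text is executed.

```python
from urllib.parse import urlparse


def resolve_url_unsafe(target: dict) -> str:
    """Hypothetical vulnerable pattern: part of the URL field is evaluated."""
    url = target["url"]
    if url.startswith("func:"):
        # DANGEROUS: everything after the prefix runs as a Python expression.
        return str(eval(url[len("func:"):]))
    return url


def inspect_url_as_data(target: dict) -> str:
    """Parsing only splits the string into components; nothing is executed."""
    return urlparse(target["url"]).scheme


if __name__ == "__main__":
    # Harmless stand-in for a real payload.
    crafted = {"url": "func:__import__('platform').system()"}
    print("evaluated as code:", resolve_url_unsafe(crafted))
    print("parsed as data, scheme is:", inspect_url_as_data(crafted))
```

The contrast is the point of the sketch: the same string is inert when parsed and becomes executable only when it is handed to an evaluation primitive.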

Attack Vector

The attack is initiated over the network and requires low privileges with no user interaction. An attacker supplies a crafted target["url"] value that reaches the HTMLContentEvaluator function. Once parsed, the malicious payload executes in the context of the evaluation harness process. The exploit has been disclosed publicly through the project's GitHub Issue #194 and VulDB entry #306376.

No verified proof-of-concept code is published in this advisory. Technical discussion is available in the comments on GitHub Issue #194.

Detection Methods for CVE-2025-4022

Indicators of Compromise

Unexpected child processes spawned by the Python interpreter running the WebArena evaluation harness
Anomalous outbound network connections originating from evaluation hosts during benchmark runs
Modified or newly created files in the webarena/evaluation_harness/ directory tree
Evaluator log entries containing unusual characters, shell metacharacters, or encoded payloads in the target["url"] field

Detection Strategies

Inspect WebArena task definitions and benchmark configurations for target["url"] values containing executable syntax, template expressions, or special characters
Audit access logs of systems hosting the evaluation harness for unauthorized configuration changes or remote submissions
Monitor process trees for evaluator invocations that launch shells, package managers, or network utilities
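The first strategy above can be sketched as a small scanner over task definitions. This is a heuristic sketch, not a complete detector: it assumes task definitions are JSON and that the field of interest is any key named "url", and the token list and function names are illustrative choices that will need tuning for the task files in use.

```python
import json
from urllib.parse import urlparse

# Assumed heuristics: substrings that rarely belong in a plain URL.
SUSPICIOUS_TOKENS = ("__import__", "eval(", "exec(", "subprocess", ";", "`", "$(", "{{")


def iter_url_fields(node, path="$"):
    """Yield (json_path, value) for every string-valued key named "url"."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "url" and isinstance(value, str):
                yield child, value
            else:
                yield from iter_url_fields(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_url_fields(item, f"{path}[{index}]")


def classify_url(value: str) -> list:
    """Return a list of reasons the value looks unsafe (empty if none)."""
    reasons = []
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https"):
        reasons.append(f"unexpected scheme {parsed.scheme!r}")
    elif not parsed.netloc:
        reasons.append("missing host")
    for token in SUSPICIOUS_TOKENS:
        if token in value:
            reasons.append(f"contains {token!r}")
    return reasons


def scan_task_definitions(tasks) -> list:
    """Collect (json_path, value, reasons) for every flagged "url" field."""
    findings = []
    for path, value in iter_url_fields(tasks):
        reasons = classify_url(value)
        if reasons:
            findings.append((path, value, reasons))
    return findings


if __name__ == "__main__":
    # Illustrative input only; real task files may be structured differently.
    sample = json.loads('[{"url": "http://shop.example.test/cart"},'
                        ' {"url": "func:__import__(\'os\').getcwd()"}]')
    for path, value, reasons in scan_task_definitions(sample):
        print(path, value, reasons)
```

Flagged entries are candidates for manual review rather than confirmed attacks, since legitimate configurations may use placeholders or non-HTTP values in this field.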

Monitoring Recommendations

Enable Python audit hooks or runtime tracing on hosts running the WebArena harness to capture dynamic code execution
Forward harness logs and host telemetry to a centralized analytics platform for retrospective hunting
Alert on first-seen executables, scripts, or scheduled tasks created on benchmark hosts
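The audit hook recommendation can be sketched with the standard library's sys.addaudithook. The hook name, the in-memory record list, and the choice of events are assumptions made for illustration; a real deployment would forward records to its logging pipeline. The sketch relies on the "exec" audit event, which CPython raises for eval() and exec(), and on the "subprocess.Popen" and "os.system" events for process launches.

```python
import sys

# Events that indicate a process or shell command is being launched.
PROCESS_EVENTS = ("subprocess.Popen", "os.system")

# In-memory sink for this sketch; replace with real logging in practice.
audit_records = []


def harness_audit_hook(event, args):
    """Record dynamic code execution and process launches; never raise."""
    try:
        if event == "exec":
            filename = getattr(args[0], "co_filename", "")
            # Code compiled from a string carries the filename "<string>";
            # ordinary module imports carry a real path and are skipped.
            if filename == "<string>":
                audit_records.append((event, filename))
        elif event in PROCESS_EVENTS:
            audit_records.append((event, repr(args[0])))
    except Exception:
        pass  # an exception raised here would propagate into the audited call


def install_audit_hook():
    # Audit hooks cannot be removed once added, so install once at startup.
    sys.addaudithook(harness_audit_hook)


if __name__ == "__main__":
    install_audit_hook()
    eval("1 + 1")  # stands in for evaluator code that evaluates a string
    for record in audit_records:
        print(record)
```

Some standard library modules also execute generated source strings, so expect benign "exec" records and establish a baseline before alerting on them.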

How to Mitigate CVE-2025-4022

Immediate Actions Required

Restrict access to the WebArena evaluation harness so that only trusted operators can submit task definitions containing target["url"] values
Run the evaluation harness inside an isolated container or virtual machine with no access to production credentials or networks
Review existing benchmark configurations for untrusted or externally sourced target["url"] entries before re-running evaluations
Disable network paths that allow remote submission of task definitions to harness instances

Patch Information

No official patched release is referenced in the advisory at the time of publication. Track remediation progress through GitHub Issue #194 and the VulDB advisory. Organizations using WebArena should pin to the fixing commit once the maintainers publish one, and should verify that the change covers the HTMLContentEvaluator function in webarena/evaluation_harness/evaluators.py.

Workarounds

Sanitize and validate target["url"] inputs against an allowlist of well-formed URL schemes and hostnames before invoking the evaluator
Wrap HTMLContentEvaluator calls in a sandboxed subprocess with restricted filesystem and network access
Apply seccomp, AppArmor, or SELinux profiles to limit syscalls available to the harness process
Remove or rewrite local copies of HTMLContentEvaluator to avoid dynamic evaluation patterns until an upstream fix is merged
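The allowlist workaround can be sketched as a validation step that runs before any target["url"] value reaches the evaluator. The function name validate_target_url and the hostnames in ALLOWED_HOSTS are hypothetical; the allowlist should be replaced with the hosts of the benchmark sites actually in use.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
# Hypothetical hosts; replace with the benchmark sites actually in use.
ALLOWED_HOSTS = {"shop.example.test", "forum.example.test"}


def validate_target_url(url) -> str:
    """Return the URL unchanged if it passes the allowlist, else raise ValueError."""
    if not isinstance(url, str):
        raise ValueError("target url must be a string")
    # Reject whitespace and control characters before parsing.
    if any(ch.isspace() or ord(ch) < 32 for ch in url):
        raise ValueError("target url contains whitespace or control characters")
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if parts.username or parts.password:
        raise ValueError("embedded credentials are not allowed")
    if parts.hostname is None or parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {parts.hostname!r}")
    return url


if __name__ == "__main__":
    for candidate in ("https://shop.example.test/cart?id=1",
                      "func:__import__('os').getcwd()"):
        try:
            print("accepted:", validate_target_url(candidate))
        except ValueError as exc:
            print("rejected:", exc)
```

Validation of this kind narrows what reaches the evaluator but does not remove a dynamic evaluation path, so it complements the isolation measures rather than replacing them.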

bash

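# Configuration example: run the WebArena harness in an isolated container.
# The image tag, module entry point, and config path are placeholders; substitute the values for your deployment.
# --network=none removes all network access; evaluations that must reach test sites need a restricted network instead.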
docker run --rm \
  --network=none \
  --read-only \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  -v "$(pwd)/configs:/work/configs:ro" \
  webarena:pinned-commit \
  python -m webarena.evaluation_harness --config /work/configs/task.json