CVE-2025-4022 Overview
CVE-2025-4022 is a code injection vulnerability [CWE-94] affecting web-arena-x WebArena versions up to and including 0.2.0. The flaw resides in the HTMLContentEvaluator function within webarena/evaluation_harness/evaluators.py. An attacker who can manipulate the target["url"] argument can inject and remotely execute arbitrary code. Exploit details have been publicly disclosed, which increases exposure for unpatched deployments.
WebArena is a benchmark environment used for evaluating autonomous web agents. Because evaluation harnesses commonly run in trusted research environments, code injection in this component can compromise host systems and downstream pipelines.
Critical Impact
Remote attackers with low privileges can inject code through the target["url"] parameter processed by HTMLContentEvaluator, enabling unauthorized code execution within the evaluation harness.
Affected Products
- web-arena-x WebArena versions up to and including 0.2.0
- Deployments using webarena/evaluation_harness/evaluators.py
- Research and benchmarking environments running the WebArena evaluation harness
Discovery Timeline
- 2025-04-28 - CVE-2025-4022 published to NVD
- 2025-05-14 - Last updated in NVD database
Technical Details for CVE-2025-4022
Vulnerability Analysis
The HTMLContentEvaluator function in webarena/evaluation_harness/evaluators.py processes the target["url"] argument while grading agent task outcomes. Because this input is not properly neutralized, attacker-controlled content can flow into a code execution context. The issue is classified as code injection [CWE-94] and, more generally, as improper neutralization of special elements in output used by a downstream component [CWE-74].
WebArena is widely used to benchmark autonomous web agents, including LLM-driven agents. Test harnesses frequently execute with broad access to the host environment, including network egress, credentials for test sites, and orchestration tooling. Code injection inside the evaluator can therefore escalate from tampering with benchmark results to compromising the underlying machine.
Root Cause
The root cause is insufficient validation and neutralization of the target["url"] argument before it is consumed by HTMLContentEvaluator. When the evaluator processes attacker-influenced URL data, the input is interpreted as code rather than data. This pattern aligns with CWE-94, where externally controlled input is incorporated into a code path without proper sanitization.
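The data-versus-code distinction can be made concrete with a minimal Python sketch of the general CWE-94 pattern. This is an illustration only, not the WebArena source: the function names, the func: prefix, and the SAFE_TRANSFORMS table are hypothetical, chosen to contrast evaluating input with looking it up in a fixed table.

```python
from typing import Callable, Dict

PREFIX = "func:"  # hypothetical marker for "computed" URL values

# Hypothetical named transforms an evaluator might legitimately need.
SAFE_TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "strip_query": lambda url: url.split("?", 1)[0],
    "lowercase": str.lower,
}


def resolve_url_unsafe(raw: str, last_url: str) -> str:
    """Anti-pattern: the text after the prefix is executed as Python."""
    if raw.startswith(PREFIX):
        # Whoever controls `raw` controls the code that runs here.
        return str(eval(raw[len(PREFIX):], {"last_url": last_url}))
    return raw


def resolve_url_dispatch(raw: str, last_url: str) -> str:
    """Safer shape: the text after the prefix only selects a known transform."""
    if raw.startswith(PREFIX):
        name = raw[len(PREFIX):]
        if name not in SAFE_TRANSFORMS:
            raise ValueError(f"unknown transform: {name!r}")
        return SAFE_TRANSFORMS[name](last_url)
    return raw


if __name__ == "__main__":
    # Harmless stand-in for a payload: it only reads the process ID.
    payload = "func:__import__('os').getpid()"
    print("unsafe path ran the payload:", resolve_url_unsafe(payload, "http://localhost/"))
    try:
        resolve_url_dispatch(payload, "http://localhost/")
    except ValueError as exc:
        print("dispatch path rejected it:", exc)
```

In the dispatch variant the input can only choose among behaviors the maintainer wrote in advance, so it stays data no matter what characters it contains.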
Attack Vector
The attack is initiated over the network and requires low privileges with no user interaction. An attacker supplies a crafted target["url"] value that reaches the HTMLContentEvaluator function. Once parsed, the malicious payload executes in the context of the evaluation harness process. The exploit has been disclosed publicly through the project's GitHub Issue #194 and VulDB entry #306376.
No verified proof-of-concept code is published in this advisory. Technical discussion is available in the comments on GitHub Issue #194.
Detection Methods for CVE-2025-4022
Indicators of Compromise
- Unexpected child processes spawned by the Python interpreter running the WebArena evaluation harness
- Anomalous outbound network connections originating from evaluation hosts during benchmark runs
- Modified or newly created files in the webarena/evaluation_harness/ directory tree
- Evaluator log entries containing unusual characters, shell metacharacters, or encoded payloads in the target["url"] field
Detection Strategies
- Inspect WebArena task definitions and benchmark configurations for target["url"] values containing executable syntax, template expressions, or special characters
- Audit access logs of systems hosting the evaluation harness for unauthorized configuration changes or remote submissions
- Monitor process trees for evaluator invocations that launch shells, package managers, or network utilities
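The first strategy above, inspecting task definitions, could be automated roughly as follows. This is a triage sketch under stated assumptions: task definitions are JSON files, the value of interest is any string stored under a key named "url" at any nesting depth, and the SUSPICIOUS pattern is a hand-picked heuristic rather than a complete signature. Expect false positives and treat hits as leads for manual review.

```python
import json
import re
import sys
from pathlib import Path
from typing import Any, Iterator, List, Tuple

# Heuristic markers of executable syntax; tune for your own task files.
SUSPICIOUS = re.compile(r"__\w+__|\b(?:eval|exec|import|compile)\b|[;`$(){}]")


def iter_url_values(node: Any, path: str = "$") -> Iterator[Tuple[str, str]]:
    """Yield (json_path, value) for every string stored under a "url" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "url" and isinstance(value, str):
                yield child, value
            else:
                yield from iter_url_values(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_url_values(item, f"{path}[{index}]")


def find_suspicious_urls(document: Any) -> List[Tuple[str, str]]:
    """Return the "url" values that match the heuristic pattern."""
    return [(p, v) for p, v in iter_url_values(document) if SUSPICIOUS.search(v)]


if __name__ == "__main__":
    # Usage: python scan_tasks.py task1.json task2.json ...
    for file_name in sys.argv[1:]:
        data = json.loads(Path(file_name).read_text(encoding="utf-8"))
        for json_path, value in find_suspicious_urls(data):
            print(f"{file_name}: {json_path}: {value!r}")
```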
Monitoring Recommendations
- Enable Python audit hooks or runtime tracing on hosts running the WebArena harness to capture dynamic code execution
- Forward harness logs and host telemetry to a centralized analytics platform for retrospective hunting
- Alert on first-seen executables, scripts, or scheduled tasks created on benchmark hosts
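For the audit-hook recommendation, a minimal sketch using Python's built-in sys.addaudithook (Python 3.8+) might look like this. The set of watched events and the list-based sink are choices made for the example. A real deployment would forward records to its logging pipeline and would need to filter the noise from ordinary imports, which also raise compile and exec events.

```python
import sys
from typing import Any, List, Tuple

# Standard CPython audit event names related to dynamic code and process launch.
WATCHED_EVENTS = frozenset({"exec", "compile", "os.system", "subprocess.Popen"})


def install_audit_hook(sink: List[Tuple[str, str]]) -> None:
    """Record watched audit events into `sink`.

    Audit hooks cannot be removed once added, and an exception raised inside
    a hook aborts the audited operation, so the hook is kept deliberately small.
    """

    def hook(event: str, args: Tuple[Any, ...]) -> None:
        if event in WATCHED_EVENTS:
            # Truncate arguments so large source strings do not flood the sink.
            sink.append((event, repr(args)[:200]))

    sys.addaudithook(hook)


if __name__ == "__main__":
    events: List[Tuple[str, str]] = []
    install_audit_hook(events)
    eval("1 + 1")  # stand-in for dynamic evaluation inside the harness
    for name, detail in events:
        print(name, detail)
```

The hook must be installed before the harness code runs, for example at the top of the launch script, so that it observes every later evaluation.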
How to Mitigate CVE-2025-4022
Immediate Actions Required
- Restrict access to the WebArena evaluation harness so that only trusted operators can submit task definitions containing target["url"] values
- Run the evaluation harness inside an isolated container or virtual machine with no access to production credentials or networks
- Review existing benchmark configurations for untrusted or externally sourced target["url"] entries before re-running evaluations
- Disable network paths that allow remote submission of task definitions to harness instances
Patch Information
No official patched release is referenced in the advisory at the time of publication. Track remediation progress through GitHub Issue #194 and the VulDB advisory. Organizations using WebArena should pin to a fixed commit once the maintainers publish one, and should verify that the change actually addresses the handling of target["url"] in the HTMLContentEvaluator function in webarena/evaluation_harness/evaluators.py.
Workarounds
- Sanitize and validate target["url"] inputs against an allowlist of well-formed URL schemes and hostnames before invoking the evaluator
- Wrap HTMLContentEvaluator calls in a sandboxed subprocess with restricted filesystem and network access
- Apply seccomp, AppArmor, or SELinux profiles to limit syscalls available to the harness process
- Remove or rewrite local copies of HTMLContentEvaluator to avoid dynamic evaluation patterns until an upstream fix is merged
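The allowlist workaround above can be sketched as a small validator that runs before any target["url"] value reaches the evaluator. The function name validate_target_url and the hostnames in ALLOWED_HOSTS are hypothetical; replace the hostnames with the sites your benchmark actually uses. If your task files rely on non-URL sentinel values, handle those explicitly before calling the validator.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({"http", "https"})
# Hypothetical hostnames, for illustration only.
ALLOWED_HOSTS = frozenset({"localhost", "shop.benchmark.example"})


def validate_target_url(raw: object) -> str:
    """Return `raw` unchanged if it is an allowlisted http(s) URL, else raise ValueError."""
    if not isinstance(raw, str):
        raise ValueError("target url must be a string")
    if any(ch.isspace() or ord(ch) < 0x20 for ch in raw):
        raise ValueError("target url contains whitespace or control characters")
    parts = urlsplit(raw)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if parts.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not allowed: {parts.hostname!r}")
    if parts.username or parts.password:
        raise ValueError("embedded credentials are not allowed")
    return raw


if __name__ == "__main__":
    for candidate in ("http://localhost:7770/admin", "func:__import__('os').getpid()"):
        try:
            print("accepted:", validate_target_url(candidate))
        except ValueError as exc:
            print("rejected:", exc)
```

Validation of this kind reduces exposure but does not remove the underlying dynamic-evaluation pattern, so it is best combined with the isolation measures listed above.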
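# Configuration example: run the WebArena harness in an isolated container.
# The image tag and entry point below are placeholders; substitute your own
# pinned build and launch command. --network=none blocks all traffic, so relax
# it to a restricted network if the harness must reach the benchmark sites.
docker run --rm \
  --network=none \
  --read-only \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  -v "$(pwd)/configs:/work/configs:ro" \
  webarena:pinned-commit \
  python -m webarena.evaluation_harness --config /work/configs/task.json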
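Where a container is not available, the subprocess workaround can be approximated by running each evaluation in a short-lived child interpreter. The sketch below assumes a hypothetical worker that reads a JSON task on stdin and writes a JSON result on stdout; WORKER_SOURCE is a stand-in for code that would import and call the evaluator. This gives process separation, a scrubbed environment, and a timeout, but it is not a full sandbox on its own and should be layered with the OS-level controls listed above.

```python
import json
import os
import subprocess
import sys

# Hypothetical worker: a real one would import and call the evaluator here.
WORKER_SOURCE = """
import json, sys
task = json.load(sys.stdin)
json.dump({"url": task["url"], "score": 0.0}, sys.stdout)
"""


def evaluate_in_child(task: dict, timeout_s: float = 30.0) -> dict:
    """Run the worker in an isolated child interpreter and return its JSON result."""
    # Pass only the variables the interpreter needs to start; drop secrets and tokens.
    minimal_env = {k: os.environ[k] for k in ("PATH", "SYSTEMROOT") if k in os.environ}
    completed = subprocess.run(
        [sys.executable, "-I", "-c", WORKER_SOURCE],  # -I: isolated mode
        input=json.dumps(task),
        capture_output=True,
        text=True,
        env=minimal_env,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hung child
        check=True,         # raises CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    print(evaluate_in_child({"url": "http://localhost/"}))
```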