Executive Summary
- SentinelLABS’ analysis of LLM benchmarks for cybersecurity, including those published by major players such as Microsoft and Meta, found that none measure what actually matters for defenders.
- Most LLM benchmarks test narrow tasks, but these map poorly to security workflows, which are typically continuous, collaborative, and frequently disrupted by unexpected changes.
- Models that excel at coding and math provide minimal direct gains on security tasks, indicating that general LLM capabilities do not readily translate to analyst-level thinking.
- All of today’s benchmarks use LLMs to evaluate other LLMs, often using the same vendor’s models for both, creating a closed loop that is easy to game and hard to trust.
- As frontier labs push defenders to rely on models to automate security operations, benchmarks will become drastically more important as the main mechanism for evaluating whether model capabilities match vendor claims.
For security teams, AI promised to write secure code, identify and patch vulnerabilities, and replace monotonous security operations tasks. Its key value proposition was raising costs for adversaries while lowering them for defenders.
To evaluate whether Large Language Models were both performant and reliable enough to be deployed into the enterprise, a wave of new benchmarks was created. In 2023, these early benchmarks largely comprised multiple-choice exams over clean text, which produced clean and reproducible performance metrics. However, as the models improved they outgrew the early tests: scores across models began to converge at the top of the scale as the benchmarks became increasingly “saturated”, and the tests themselves ceased to tell us anything meaningful.
As the industry has boomed over the past few years, benchmarking has become a way to distinguish new models from older ones. Developing a benchmark that shows how a smaller model outperforms a larger one released by a frontier AI lab is a billion-dollar industry, and now every new model launches with a menagerie of charts making bold claims: +3.7 on SomeBench-v2, SOTA on ObscureQA-XL, or 99th percentile on an-exam-no-one-had-heard-of-last-week. The subtext here is simple: look at the bold numbers, be impressed, and please join our seed round!
Inside this swamp of scores and claims, security teams are somehow meant to conclude that a system is safe enough to trust with an organization’s business, its users, and maybe even its critical infrastructure. However, a careful read through the arXiv benchmark firehose reveals a hard-to-miss pattern: we have more benchmarks than ever, and somehow we are still not measuring what actually matters for defenders.
So what do security benchmarks actually measure? And how well does this approach map to real security work?
In this post, we review four popular LLM benchmarking evaluations: Microsoft’s ExCyTIn-Bench, Meta’s CyberSOCEval and CyberSecEval 3, and Rochester Institute of Technology’s CTIBench. We explore what we think these benchmarks get right and where we believe they fall short.
What Current Benchmarks Actually Measure
ExCyTIn-Bench | Realistic Logs in a Microsoft Snow Globe
ExCyTIn-Bench was the cleanest example of an “agentic” Security Operations benchmark that we reviewed. It drops LLM agents into a MySQL instance that mirrors a realistic Microsoft Azure tenant. The environment provides 57 Sentinel-style tables, 8 distinct multi-stage attacks, and a unified log stream spanning 44 days of activity.
Each question posed to the LLM agent is anchored to an incident graph path. This means the agent must discover the schema, issue SQL queries, pivot across entities, and eventually answer the question. Rewards for the agent are path-aware: full credit is assigned for the right answer, but the agent can also earn partial credit for each correct intermediate step it takes.
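To make the path-aware idea concrete, here is a minimal sketch of how that kind of reward might be computed. The function, weights, and table names are our own illustration, not Microsoft’s implementation, which builds on an incident graph and grades free-form answers with an LLM judge.

```python
# Illustrative sketch only: blend final-answer correctness with coverage of
# the ground-truth investigation path. Weights, matching logic, and table
# names are assumptions, not the actual ExCyTIn-Bench reward function.

def path_aware_reward(agent_steps: list[str],
                      gold_path: list[str],
                      final_answer: str,
                      gold_answer: str,
                      step_weight: float = 0.5) -> float:
    """Full credit for the right answer, partial credit for correct pivots."""
    covered = sum(1 for step in gold_path if step in agent_steps)
    step_score = covered / len(gold_path) if gold_path else 0.0
    # Real benchmarks typically use an LLM judge or fuzzier matching here.
    answer_score = 1.0 if final_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return (1 - step_weight) * answer_score + step_weight * step_score


# Example: the agent hit two of three pivot tables and named the right host.
print(round(path_aware_reward(
    agent_steps=["SigninLogs", "DeviceProcessEvents"],
    gold_path=["SigninLogs", "DeviceProcessEvents", "AzureActivity"],
    final_answer="contoso-vm-01",
    gold_answer="contoso-vm-01",
), 3))  # 0.833 under these illustrative weights
```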
The headline result is telling:
“Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368…” (arxiv)
Microsoft’s ExCyTIn benchmark demonstrates that LLMs struggle to plan multi-hop investigations over realistic, heterogeneous logs.
This is an important finding, especially for anyone concerned with how LLMs perform in real-world scenarios. Moreover, all of this takes place in a Microsoft snow globe: one fictional Azure tenant, eight well-studied, canned attacks, clean tables, and curated detection logic for the agent to work with. Although the realistic agent setup is a massive improvement over trivia-style Multiple Choice Question (MCQ) benchmarks, it is not the daily chaos of real security operations.
CyberSOCEval | Defender Tasks Turned into Exams
CyberSOCEval is part of Meta’s CyberSecEval 4 and deliberately picks two tasks defenders care about: malware analysis over real sandbox detonation logs and threat intelligence reasoning over 45 CTI reports. The authors open with a statement we very much agree with:
“This lack of informed evaluation has significant implications for both AI developers and those seeking to apply LLMs to SOC automation. Without a clear understanding of how LLMs perform in real-world security scenarios, AI system developers lack a north star to guide their development efforts, and users are left without a reliable way to select the most effective models.” (arxiv)
To evaluate these tasks, the benchmark frames them as multi-answer multiple-choice questions and incorporates analytically computed random baselines and confidence intervals. This setup gives clean, statistically grounded comparisons between models, but it also reduces complex workflows to simplified questions. The researchers found that models perform far above random, yet remain far from solving the tasks.
In the malware analysis trial, models score exact-match accuracy in the teens to high-20s percentage range against a random baseline of around 0.63%. For threat-intel reasoning, models land in the ~43 to 53% accuracy band against ~1.7% random.
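To illustrate where analytic baselines like these come from, here is a minimal sketch for exact-match scoring on multi-answer multiple-choice questions, assuming a guesser that picks a non-empty subset of the options uniformly at random. CyberSOCEval’s published baselines reflect its actual question formats, so its formula will differ; this only shows the flavor of the calculation.

```python
# Sketch of an analytic random baseline for exact-match scoring on
# multi-answer MCQs. Assumes uniform guessing over all non-empty option
# subsets; the real CyberSOCEval calculation reflects its own question
# structure and will not match these numbers exactly.

def random_exact_match_baseline(num_options: int) -> float:
    """Probability a uniform random subset guess exactly matches the answer key."""
    return 1.0 / (2 ** num_options - 1)  # one correct subset out of 2^n - 1

for n in (6, 8, 10):
    print(f"{n} options -> {random_exact_match_baseline(n):.2%} expected exact-match accuracy")
```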
In other words, the models are clearly extracting meaningful signals from real logs and CTI reports. However, they still fail to correctly answer most of the malware questions and roughly half of the threat intelligence questions.
These findings suggest that for any system aimed at automating SOC workflows, model performance should be evaluated as assistive rather than autonomous.
Crucially, they find that test-time “reasoning” models don’t get the same uplift they see in math/coding:
“We also find that reasoning models leveraging test time scaling do not achieve the boost they do in areas like coding and math, suggesting that these models have not been trained to reason about cybersecurity analysis…” (arxiv)
That’s a big deal, and it’s evidence that you don’t get generalized security reasoning for free just by cranking up “thinking steps”.
Meta’s CyberSOCEval falls short because it compresses two complex domains into MCQ exams. There is no notion of triaging multiple alerts, asking follow-up questions, or hunting down log sources. In real life, analysts need to decide when to stop and escalate, or when to switch paths.
In the end, while CyberSOCEval is a clean and statistically sound probe of model performance on a set of highly specific sub-tasks, it is far from a faithful representation of how SOC workflows actually play out.
CTIBench | CTI as a Certification Exam
CTIBench is a benchmark task suite introduced by researchers at Rochester Institute of Technology to evaluate how well LLMs operate in the field of Cyber Threat Intelligence. Unlike general-purpose benchmarks, which focus on high-level domain knowledge, CTIBench grounds its tasks in the practical workflows of information security analysts. Like the other benchmarks we examined, however, it frames this evaluation as an MCQ-style exam.
“While existing benchmarks provide general evaluations of LLMs, there are no benchmarks that address the practical and applied aspects of CTI-specific tasks.” (NeurIPS Papers)
CTIBench draws on well-known security standards and real-world threat reports, then turns them into five kinds of tasks:
- basic multiple-choice questions about threat-intelligence knowledge
- mapping software vulnerabilities to their underlying weaknesses
- estimating how serious a vulnerability is
- pulling out the specific attacker techniques described in a report
- guessing which threat group or malware family is responsible.
The data is mostly from 2024, so it’s newer than what most models were trained on, and each task is graded with a simple “how close is this to the expert answer?” style score that fits the kind of prediction being made.
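As one illustration of that “closeness to the expert answer” style of grading, here is a sketch of a set-overlap score for the technique-extraction task, comparing the ATT&CK technique IDs a model pulls from a report against an expert-labeled ground truth. This is our own example of the grading style, not necessarily CTIBench’s exact scoring code.

```python
# Illustrative grader for a technique-extraction task: set overlap (F1)
# between the ATT&CK technique IDs the model extracted and the expert
# ground truth. Technique IDs below are examples, not benchmark data.

def technique_extraction_f1(predicted: set[str], gold: set[str]) -> float:
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Model found T1566 and T1059, missed T1078, and hallucinated T1027.
print(round(technique_extraction_f1(
    predicted={"T1566", "T1059", "T1027"},
    gold={"T1566", "T1059", "T1078"},
), 2))  # 0.67
```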
On paper, this looks close to the work CTI teams care about: mapping vulnerabilities to weaknesses, assigning severity, mapping behaviors to techniques, and tying reports back to actors.
In practice, though, the way those tasks are operationalized keeps the benchmark in the frame of a certification exam. Each task is cast as a single-shot question with a fixed ground-truth label, answered in isolation with a zero-shot prompt. There is no notion of long-running cases, heterogeneous and conflicting evidence, evolving intelligence, or the need to cross-check and revise hypotheses over time.
CTIBench is yet another MCQ-style exam, and an excellent one if you want to know, “Can this model answer CTI exam questions and do basic mapping/annotation?” It says less about whether an LLM can do the messy work that actually creates value: normalizing overlapping feeds, enriching and de-duplicating entities in a shared knowledge graph, negotiating severity and investment decisions with stakeholders, or challenging threat attributions that don’t fit an organization’s historical data.
CyberSecEval 3 | Policy Framing Without Operational Closure
CyberSecEval 3, also from Meta, is not a SOC benchmark so much as a risk map. The authors carve the space into eight risks, grouped into two buckets: harms to third parties (i.e., offensive capabilities) and harms to application developers and end users (such as misuse, vulnerabilities, or data leakage). The frame of this eval is the current regulatory conversation between governments and standards bodies about unacceptable model risk, so the suite is understandably organized around “where could this go wrong?” rather than “how much better does this make my security operations?”
The benchmark’s coverage tracks almost perfectly with the concerns of policymakers and safety orgs. On the offensive side, CyberSecEval 3 looks at automated spear-phishing against LLM-simulated victims, uplift for human attackers solving Hack-The-Box style CTF challenges, fully autonomous offensive operations in a small cyber range, and synthetic exploit-generation tasks over toy programs and CTF snippets. On the application side, it probes prompt injection, insecure code generation in both autocomplete and instruction modes, abuse of attached code interpreters, and the model’s willingness to help with cyberattacks mapped to ATT&CK stages.
The findings across these areas are broad. Llama 3 is described as capable of “moderately persuasive” spear-phishing, roughly on par with other SOTA models when judged against simulated victims. In the CTF study, Llama 3 405B gives novice participants a noticeable bump in completed phases and slightly faster progress, but the authors stress that the effect is not statistically robust.
The fully autonomous agent can handle basic reconnaissance in the lab environment, but fails to achieve reliable exploitation or persistence. On the application-risk side, all tested models suggest insecure code at non-trivial rates, prompt injection succeeds a significant fraction of the time, and models will sometimes execute malicious code or provide help with cyberattacks. Meta stresses that its own guardrails reduce these risks on the benchmark distributions.
CyberSecEval 3 may have real value for those working in policy and governance, but none of the eight risks are defined in terms of operational metrics such as detection coverage, time to triage, containment, or vulnerability closure rates. The CTF experiment comes closest to demonstrating something about real-world value, but it is still an artificial one-hour lab on pre-selected targets. Moreover, this experiment is expensive and not reproducible at scale.
There are glimmers of operational relevance in the paper, and CyberSecEval 3 remains a strong contribution to AI security understanding and governance, but it is a weak instrument for deciding whether to deploy a model as a copilot for live operations.
Benchmarks are Measuring Tasks, not Workflows
All of these benchmarks share a common blind spot: they treat security as a collection of isolated questions rather than as an ongoing workflow.
Real teams work through queues of alerts, pivot between partially related incidents, and coordinate across levels of seniority. They make judgment calls under time pressure and incomplete telemetry. Closing out a single alert or scoring 90% on a multiple choice test is not the goal of a security team. The goal is reducing the underlying risk to the business, and this means knowing the right questions to ask in the first place.
ExCyTIn-Bench comes closest to acknowledging this reality. Agents interact with an environment over multiple turns and earn rewards for intermediate progress. Yet even here, the fundamental unit of evaluation is still a question: “What is the correct answer to this prompt?” The system is not asked to “run this incident to ground” or evaluate different environments or logging sources that may be included in an incident response. CyberSOCEval and CTIBench compress even richer workflows into single multiple-choice interactions.
Methodologically, this means none of these benchmarks are measuring outcomes that define security performance. Metrics such as time-to-detect, time-to-contain, and mean time to remediate are absent. We are measuring how models behave when the important context has already been carefully prepared and handed to them, not how they behave when dropped into a live incident where they must decide what to look at, what to ignore, and when to ask for help.
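For contrast, the workflow-level outcomes named above are straightforward to state once you have incident timelines. The sketch below uses a hypothetical incident record schema (the field names and values are ours, not from any of these benchmarks) to show the kind of metric a workflow-level evaluation could report.

```python
# Hypothetical workflow-level metrics: mean time to detect and mean time to
# contain, computed from made-up incident timelines. Field names and values
# are illustrative only.

from datetime import datetime
from statistics import mean

incidents = [
    {"first_activity": datetime(2024, 5, 1, 9, 0),
     "detected":       datetime(2024, 5, 1, 11, 30),
     "contained":      datetime(2024, 5, 1, 15, 30)},
    {"first_activity": datetime(2024, 5, 3, 2, 15),
     "detected":       datetime(2024, 5, 3, 2, 45),
     "contained":      datetime(2024, 5, 3, 8, 45)},
]

mttd_hours = mean((i["detected"] - i["first_activity"]).total_seconds() / 3600 for i in incidents)
mttc_hours = mean((i["contained"] - i["detected"]).total_seconds() / 3600 for i in incidents)

print(f"Mean time to detect:  {mttd_hours:.1f} h")   # 1.5 h
print(f"Mean time to contain: {mttc_hours:.1f} h")   # 5.0 h
```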
Until we are ready to benchmark at the workflow level, we should understand that high accuracy on multiple-choice security questions and smooth reward curves are not stand-ins for operational uplift. In information security, the bar must be higher than passing an exam.
MCQs and Static QA are Overused Crutches
Multiple-choice questions are attractive for understandable reasons. They are easy to score at scale. They support clean random baselines and confidence intervals and they fit nicely into leaderboards and slide decks.
The downside is that this format quietly bakes in assumptions that do not hold in practice. For any given scenario, the benchmark assumes someone has already asked the right question. There is no space for challenging the premise of that question, reframing the problem, or building and revising a plan. All of the relevant evidence has already been selected and pre-packaged for the analyst. In that setting, the model’s job is essentially to compress and restate context, not to decide what to investigate or how to prioritize effort. Wrong or partially correct answers carry no real cost.
This is the inverse of real SOC and CTI work, where the hardest part is deciding what questions to ask, what data to pull, and what to ignore. That judgment is usually earned over years of experience or deliberate training. If we want to know whether models will actually help in our workflows, we need evaluations where asking for more data has a cost, ignoring critical signals is penalized, and “I don’t know, let me check” is a legitimate and sometimes optimal response.
Statistical Hygiene is Still Uneven
To their credit, some of these efforts take statistics seriously. CyberSOCEval reports confidence intervals and uses bootstrap analysis to reason about power and minimum detectable effect sizes. CTIBench distinguishes between pre- and post-cutoff datasets and examines performance drift. CyberSecEval 3 uses survival analysis and appropriate hypothesis tests in its human-subject CTF study to show an unexpected lack of statistically significant uplift from an LLM copilot.
Across the board, however, there are still gaps. Many results come from single-seed, temperature-zero runs with no variance reported. ExCyTIn-Bench, for instance, reports an average reward of 0.249 and a best of 0.368, but provides no confidence intervals or sensitivity analysis. Contamination is rarely addressed systematically, even though all four benchmarks draw on well-known corpora that almost certainly overlap with model training data. Heavy dependence on a single LLM judge, often from the same vendor as the model being evaluated, compounds these issues.
The consequence is that headline numbers can look precise while being fragile under small changes in prompts, sampling parameters, or judge models. If we expect these benchmarks to inform real governance and deployment decisions, variance, contamination checks, and judge robustness should be baseline, check-box requirements.
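As a baseline example of the variance reporting we are arguing for, here is a minimal bootstrap confidence interval over per-question scores. The scores are invented; the point is that a headline accuracy should ship with an uncertainty estimate.

```python
# Sketch of baseline statistical hygiene: a bootstrap 95% confidence
# interval over per-question scores from a single benchmark run, so the
# headline number carries an uncertainty estimate. Scores are invented.

import random

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    return (means[int(alpha / 2 * n_resamples)],
            means[int((1 - alpha / 2) * n_resamples) - 1])

scores = [1.0] * 27 + [0.0] * 73            # 27% headline accuracy on 100 questions
low, high = bootstrap_ci(scores)
print(f"accuracy 0.27, 95% CI ~ [{low:.2f}, {high:.2f}]")
```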
Using LLMs to Evaluate LLMs Is Everywhere, and Rarely Questioned
Every benchmark we reviewed relies on LLMs somewhere in the evaluation loop, either to generate questions or to score answers.
ExCyTIn uses models to turn incident graphs into Q&A pairs and to grade free-form responses, falling back to deterministic checks only in constrained cases. CyberSOCEval uses Llama models in its question-generation pipeline before shifting to algorithmic scoring. CTIBench relies on GPT-4-class models to produce CTI multiple-choice questions. CyberSecEval 3 uses LLM judges to rate phishing persuasiveness and other behaviors.
CyberSecEval 3 is a standout here. It calibrates its phishing judge against human raters and reports a strong correlation, which is a step in the right direction. But overall, we are treating these judges as if they were neutral ground truth. In many cases, the judge is supplied by the same vendor whose models are being evaluated, and the judging prompts and criteria are public. That makes the benchmarks simple to overfit: once you know how the judge “thinks,” it is trivial to tune a model or prompting strategy to please it.
That being said, “LLM as a judge” remains incredibly popular across the field. It is cheap, fast, and feels objective. It’s not the worst setup, but if we do not actively interrogate and diversify these judges, comparing them against humans and against each other, then over time we risk baking the biases and blind spots of a few dominant models into the evaluation layer itself. That is a poor foundation for any serious claims about security performance.
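A judge-calibration check of the kind CyberSecEval 3 performs does not need to be elaborate. The sketch below correlates an LLM judge’s scores against human ratings of the same outputs; the ratings are invented, but this is roughly the shape of the check.

```python
# Sketch of a judge-robustness check: correlate an LLM judge's scores with
# human ratings of the same model outputs. The ratings below are invented
# for illustration; the point is the check, not the numbers.

from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2, 5]   # expert ratings on a 1-5 scale
judge_scores = [5, 2, 4, 3, 1, 4, 3, 5]   # LLM-judge ratings of the same outputs

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# A low or unstable correlation is a signal to recalibrate the judging
# prompt, swap in a different judge model, or fall back to human review.
```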
Technical Gaps
Even when the evaluation methodology is thoughtful, there are structural reasons today’s benchmarks diverge from real SOC environments.
Single-Tenant, Single-Vendor Worlds
ExCyTIn presents a well-designed Azure-style environment, but it is still a single fictional tenant with a curated set of attacks and detection rules. It tells us how models behave in a world with clean logging and eight known attack chains, but not in a hybrid AWS/Azure/on-prem estate where sensors are misconfigured and detection logic is uneven.
CyberSOCEval’s malware logs and CTI corpora are similarly narrow. They represent security artifacts cleanly without the messy mix of SIEM indices, ticketing systems, internal wikis, email threads, and chat logs that working defenders navigate daily. If the goal is to augment those people, current benchmarks barely capture their environment. If the goal is to replace them, the gap is even wider.
Static Text Instead of Living Tools and Data
CTIBench and CyberSOCEval are fundamentally static. PDFs are flattened into text, JSON logs are frozen into MCQ contexts, CVEs and CWEs are snapshots from public databases. That is reasonable for early-stage evaluation, but it omits the dynamics that matter most in real operations.
Analysts spend their time in a world of internal middleware consoles, vendor platforms, and collaboration tools. Threat actors shift infrastructure mid-campaign or opportunistically piggyback on others’ infrastructure. New intelligence arrives in the middle of triage, often from sources uncovered during the investigation. In that sense, a well-run tabletop or red–blue exercise is closer to reality than a static question bank. Benchmarks that do not encode time, change, and feedback will always understate the difficulty of the work.
Multimodality is Still Underdeveloped
CyberSOCEval does take an impressive run at multimodality, comparing text-only, image-only, and combined modes on CTI reports and malware artifacts. One uncomfortable takeaway is that text-only models often outperform image or text+image pipelines, and images matter primarily when they contain information not available in text at all. In practice, analysts rarely hinge a response on a single graph or screenshot.
At the same time, current “multimodal” models are still uneven at reasoning over screenshots, tables, and diagrams with the same fluency they show on clean prose. If we want to understand how much help an LLM will be at the console, we need benchmarks that isolate and stress those capabilities directly, rather than treating multimodality as a side note.
Modeling Limitations
Ironically, the very benchmarks that miss real-world workflows still reveal quite a bit about where today’s models fall short.
General Reasoning is Not Security Reasoning
CyberSOCEval’s abstract states outright that “reasoning” models with extended test-time thinking do not achieve their usual gains on malware and CTI tasks. ExCyTIn shows a similar pattern: models that shine on math and coding benchmarks stumble when asked to plan coherent sequences of SQL queries across dozens of tables and multi-stage attack graphs.
In other words, we mostly have capable general-purpose models that know a lot of security trivia. That is not the same as being able to reason like an analyst. On the plus side, the benchmarks are telling us what is needed next: security-specific fine-tuning and chain-of-thought traces, exposure to real log schemas and CTI artifacts during training, and objective functions that reward good investigative trajectories, not just correct final answers.
Poor Calibration on Scores and Severities
CTIBench’s CVSS task (CTI-VSP) is especially revealing in this regard. Models are asked to infer CVSS v3 base vectors from CVE descriptions, and performance is measured as mean absolute deviation from ground-truth scores. The results show systematic misjudgments of severity, not just random noise, and this is one of the benchmark’s most important findings.
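The metric itself is simple to state. Below is a minimal sketch of a mean-absolute-deviation calculation over CVSS base scores; the CVE identifiers and values are invented for illustration.

```python
# Mean absolute deviation between model-predicted and ground-truth CVSS v3
# base scores, in the style of CTI-VSP. The IDs and scores are invented.

def mean_absolute_deviation(predicted: dict[str, float],
                            ground_truth: dict[str, float]) -> float:
    shared = predicted.keys() & ground_truth.keys()
    return sum(abs(predicted[c] - ground_truth[c]) for c in shared) / len(shared)

predicted    = {"CVE-EXAMPLE-1": 9.8, "CVE-EXAMPLE-2": 5.4, "CVE-EXAMPLE-3": 7.5}
ground_truth = {"CVE-EXAMPLE-1": 7.5, "CVE-EXAMPLE-2": 8.8, "CVE-EXAMPLE-3": 7.5}

print(round(mean_absolute_deviation(predicted, ground_truth), 2))
# 1.9 -- an average severity error large enough to flip patch priorities
```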
Those errors are concerning for any organization that plans to use model-generated scores to drive patch prioritization or risk reporting. More broadly, they highlight a recurring theme: models often sound confident while being poorly calibrated on risk. Benchmarks that only track accuracy or top-1 match rates will fail to identify the danger of confident but incorrect recommendations, especially in environments where those recommendations can be gamed or exploited.
Conclusion
Today’s benchmarks present a clear step forward from generic NLP evaluations, but our findings reveal as much about what is missing as what is measured: LLMs struggle with multi-hop investigations even when given extended reasoning time, general LLM reasoning capabilities don’t transfer cleanly to security work, and evaluation methods that rely on vendor models to grade vendor models create obvious conflicts of interest.
More fundamentally, current benchmarks measure task performance in controlled settings, not the operational outcomes that matter to defenders: faster detection, reduced containment time, and better decisions under pressure. No current benchmarks can tell a security team whether deploying an LLM-driven SOC or CTI system will actually improve their posture or simply add another tool to manage.
In Part 2 of this series, we’ll examine what a better generation of benchmarks should look like, digging into the methodologies, environments, and metrics required to evaluate whether LLMs are ready for security operations, not just security exams.