Researchers at Simbian AI posted a benchmark to arXiv on April 21, 2026, that tests frontier LLMs on open-ended threat hunting rather than security knowledge questions: agents query SQLite databases with up to 135,000 Windows event log records and 106 real attack procedures, with no pre-staged hints about what to look for. In the initial five-model evaluation, Claude Opus 4.6 led the field but correctly flagged only 3.8% of malicious events on average, and no model cleared ≥50% recall across all 13 MITRE ATT&CK tactics tested. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv abstract))

What the Cyber Defense Benchmark Actually Tests

The paper, “Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps,” is authored by Alankrit Chona, Igor Kozlov, and Ambuj Kumar of Simbian AI (v1 submitted April 21, 2026; revised to v2 April 22). (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv abstract)) Its structure departs from the security MCQ and knowledge-retrieval benchmarks most vendors cite when pitching LLM-powered SOC products.

Each episode presents an agent with a SQLite database containing 75,000–135,000 Windows event log records drawn from the OTRF Security-Datasets collection, covering 106 real attack procedures across 26 campaigns and 86 MITRE ATT&CK sub-techniques. Agents have a budget of 50 SQL queries, each returning at most 10 rows. Success is scored CTF-style against Sigma-rule ground truth — the agent must independently formulate hypotheses, query the database, and flag specific malicious events with no scaffolding. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv abstract))

The benchmark is implemented as a Gymnasium RL harness built around a ThreatHuntEnv class, with LiteLLM handling model routing. The code is open-sourced in Simbian AI’s GitHub repository; a sample payload is included and the full dataset is distributed separately. (Simbian AI Cyber Defense Benchmark GitHub Repository)
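The paper's harness is not reproduced here, but the interaction contract it describes can be sketched in a few lines: a Gym-style reset/step loop over an in-memory SQLite table, a per-episode query budget, and a hard cap on rows returned per query. Everything below (class name, schema, column names) is illustrative rather than the benchmark's real API.

```python
import sqlite3

class MiniThreatHuntEnv:
    """Toy stand-in for a ThreatHuntEnv-style harness: a Gym-style loop
    over an in-memory SQLite log table with a query budget and row cap.
    All names and the schema are illustrative, not the benchmark's API."""

    def __init__(self, rows, query_budget=50, row_cap=10):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE events (EventID INTEGER, Image TEXT, malicious INTEGER)")
        self.db.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        self.query_budget, self.row_cap = query_budget, row_cap

    def reset(self):
        self.queries_left = self.query_budget
        self.flags = set()
        return {"queries_left": self.queries_left}

    def step(self, sql):
        """One agent turn: run a SELECT, return at most row_cap rows."""
        assert self.queries_left > 0, "query budget exhausted"
        self.queries_left -= 1
        rows = self.db.execute(sql).fetchmany(self.row_cap)
        done = self.queries_left == 0
        return rows, done

env = MiniThreatHuntEnv(
    rows=[(4624, "C:\\Windows\\explorer.exe", 0),
          (4104, "powershell.exe", 1),
          (1, "C:\\Temp\\mimikatz.exe", 1)])
env.reset()
rows, done = env.step("SELECT rowid, EventID FROM events WHERE EventID = 4104")
```

The row cap is the key constraint: no matter how good the SQL is, each turn yields at most ten rows, so the budget forces the agent to make every query count.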

The Numbers: No Model Clears 50% Recall on Every Tactic

In the v1 five-model evaluation, Claude Opus 4.6 submitted correct flags for 3.8% of malicious events on average, clearing the ≥50% recall threshold on 5 of 13 tactics. GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash each cleared zero tactics. No single run across any model found all flags in an episode. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv abstract))

The v2 full paper expanded to 11 models and reported Coverage Scores: (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF))

| Model             | Coverage Score | Tactics ≥50% Recall (of 13) |
|-------------------|----------------|-----------------------------|
| Claude Opus 4.6   | 0.55 ± 0.05    | 7                           |
| Claude Sonnet 4.6 | 0.44 ± 0.08    | 0                           |
| Claude Opus 4.7   | 0.36 ± 0.13    | 0                           |
| Gemini 3.1 Pro    | 0.22 ± 0.13    | 0                           |
| GPT-5             | 0.21 ± 0.08    | 0                           |
| Kimi K2.6         | 0.20 ± 0.14    | 0                           |
| Qwen3.6 Plus      | 0.19 ± 0.11    | 0                           |
| Gemini 3 Flash    | 0.18 ± 0.08    | 0                           |
| MiniMax M2.7      | 0.15 ± 0.10    | 0                           |
| Kimi K2.5         | 0.11 ± 0.13    | 0                           |
| DeepSeek V3.2     | 0.10 ± 0.07    | 0                           |

Claude Opus 4.6 cleared 7 of 13 tactics in v2; no other model cleared any. The v1 abstract reported the same model clearing 5 of 13 across the smaller evaluation set, with the discrepancy reflecting the expanded model set and revised scoring in v2. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF))

The Three Failure Modes: Search, Attribution, and Blind Spots

Search-space intractability. With 75k–135k event records and only 50 queries returning 10 rows each, agents can read at most 500 rows — a small fraction of the corpus. Claude Opus 4.6 defaulted to breadth-first keyword scanning: querying event counts and filtering on common malicious EventIDs such as 1, 4104, and 4624. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) That strategy surfaces the obvious events but misses multi-step sequences where individual events look unremarkable without cross-correlation.
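The breadth-first pattern described above can be sketched against a toy SQLite table. The schema and rows below are invented for illustration; the point is that the census query spends budget on volume, not on malice.

```python
import sqlite3

# Hypothetical single-table schema; the benchmark's real schema may differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (EventID INTEGER, Message TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(1, "process create"), (4104, "scriptblock"),
                (4624, "logon"), (5156, "fw allow"), (4624, "logon")])

# Turn 1: a budget-cheap census of EventIDs. This is the breadth-first
# opening move the paper observed; it tells you where the volume is,
# not where the attack is.
census = db.execute(
    "SELECT EventID, COUNT(*) FROM events GROUP BY EventID ORDER BY 2 DESC"
).fetchall()

# Turn 2: filter on commonly flagged EventIDs (Sysmon 1 process create,
# PowerShell 4104 script block, 4624 logon), capped at 10 rows per the
# benchmark's return limit.
hits = db.execute(
    "SELECT rowid, EventID, Message FROM events "
    "WHERE EventID IN (1, 4104, 4624) LIMIT 10"
).fetchall()
```

Two turns of a 50-query budget are gone, and every returned row is an event that would also be flagged by a static Sigma rule; the multi-step sequences remain untouched.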

Attribution gap. Claude Opus 4.6 observed an average of 159 flag-worthy events during a run but submitted only 113. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) The model identified suspicious artifacts but failed to close the inferential loop from observation to explicit flag — and this gap persisted even as it consumed more tokens per run than any other tested model.

Tactic blind spots. All models showed near-zero recall for Credential Access and near-zero or zero recall for Initial Access. Defense Evasion was the highest-recalled tactic across the field at 0.59. GPT-5 had two complete blind spots, Initial Access and Lateral Movement, submitting zero relevant flags for either tactic. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF))

Why Cost Does Not Predict Performance

The per-run cost spread is substantial. Claude Opus 4.6 costs $17.98 per run, consuming 3.5 million tokens across an average 51.7 turns. Gemini 3 Flash costs $0.19 per run — 676,000 tokens across 20.6 turns. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) That is a 94× cost difference, but the Coverage Score ratio between them is 3.1×.
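The ratios are worth checking against the raw figures. A quick sanity check, using the per-run numbers reported in the paper's v2 cost table:

```python
# Reported per-run figures for Claude Opus 4.6 vs Gemini 3 Flash.
opus_cost, flash_cost = 17.98, 0.19
opus_cov, flash_cov = 0.55, 0.18

cost_ratio = opus_cost / flash_cost    # ~94.6x the per-run cost
coverage_ratio = opus_cov / flash_cov  # ~3.1x the Coverage Score

# Cost per Coverage-Score point: the cheap model is far more efficient
# per dollar, even though the expensive model finds more.
opus_cpp = opus_cost / opus_cov        # ~$32.7 per point
flash_cpp = flash_cost / flash_cov     # ~$1.06 per point
```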

Gemini 3 Flash achieved comparable flags-submitted counts to GPT-5 at roughly one-sixth the cost, which the paper interprets as evidence that reasoning quality rather than token throughput is the primary bottleneck. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) The Claude model family makes the same point more sharply: Claude Opus 4.7 (Coverage Score 0.36) scores below Claude Opus 4.6 (0.55) while the paper reports it uses approximately 80% fewer tokens per run. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) More tokens spent does not guarantee better hunting; the question is whether the model can formulate discriminating queries when it does spend them.

What This Means for LangChain and CrewAI SecOps Stacks

The paper states directly that these results “do not reflect model quality on security knowledge tasks, where the same models score well; rather, they expose a specific deficiency in open-ended, agentic evidence gathering at scale.” (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF))

This distinction matters for teams building SOC pipelines on LangChain or CrewAI. Most existing orchestration-layer benchmarks for security agents evaluate tool-call correctness: did the agent invoke the right retrieval function, parse the alert correctly, route to the right tool? The Cyber Defense Benchmark tests a different capability: whether the agent can decide what to search for across a noisy corpus, formulate SQL that actually isolates malicious events, and connect observations to conclusions without scaffolded prompts.

An agent that achieves high tool-call accuracy in a curated alert-triage pipeline can score 0.10 on a free-form hunt if it cannot generate discriminating queries. The 50-query budget makes the failure mode concrete: an agent burning queries on generic EventID counts has fewer turns left to pursue specific hypotheses. The dominant failure is not orchestration correctness; it is epistemic — the model cannot reliably decide where to look.

What SOC Agent Builders Should Measure Instead

Three metric categories that current SOC agent evaluation frameworks rarely capture:

Trajectory quality over query budget. How discriminating are the agent’s queries? An agent spending 30 of 50 queries on SELECT COUNT(*) WHERE EventID IN (1, 4104, 4624) variants consumes budget without narrowing the hypothesis space. The ThreatHuntEnv harness tracks this at the turn level; most production evaluation pipelines do not. (Simbian AI Cyber Defense Benchmark GitHub Repository)
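One way to make trajectory quality concrete is a cheap heuristic over the agent's query log: what fraction of turns went to generic counts rather than row-returning hypotheses? This is an illustrative metric, not the benchmark's actual scoring, and the example queries are invented.

```python
import re

def generic_count_fraction(trajectory):
    """Rough trajectory-quality signal: the share of an agent's queries
    that are generic COUNT(*) scans instead of row-returning hypotheses.
    A heuristic for illustration, not the benchmark's scoring."""
    pattern = re.compile(r"COUNT\(\*\)", re.IGNORECASE)
    generic = sum(1 for q in trajectory if pattern.search(q))
    return generic / len(trajectory)

trajectory = [
    "SELECT COUNT(*) FROM events WHERE EventID IN (1, 4104, 4624)",
    "SELECT COUNT(*) FROM events",
    "SELECT rowid, Message FROM events WHERE EventID = 4104 "
    "AND Message LIKE '%EncodedCommand%' LIMIT 10",
]
share = generic_count_fraction(trajectory)  # 2 of 3 turns spent counting
```

A production evaluation pipeline could log this per run alongside recall, flagging agents whose budgets drain into census queries.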

Attribution rate vs. observation rate. The gap between events observed (159 on average for Opus 4.6) and flags submitted (113) quantifies how often a model sees evidence but fails to act on it. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) This is measurable at inference time and should be a first-class metric in any threat hunting agent evaluation — not an afterthought derived from final recall.
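The attribution gap is trivially computable at inference time. Using the paper's reported Opus 4.6 averages:

```python
def attribution_rate(observed, submitted):
    """Share of flag-worthy events a model saw that it actually flagged.
    observed/submitted are per-run counts of flag-worthy events."""
    return submitted / observed

# The paper's Opus 4.6 averages: 159 events observed, 113 flags submitted,
# so roughly 29% of seen evidence was never acted on.
rate = attribution_rate(159, 113)
```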

Tactic-level recall, not aggregate accuracy. GPT-5’s zero recall on Initial Access and Lateral Movement would be concealed by an adequate aggregate score if other tactics perform reasonably. (Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF)) Meaningful SOC coverage requires tactic-level breakdowns because complete blind spots in a single tactic category can be operationally catastrophic regardless of what F1 looks like in aggregate.
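A minimal sketch of why the tactic-level breakdown matters: aggregate recall can look passable while a whole tactic sits at zero. The tactic names echo the paper's findings; the event IDs and counts are invented.

```python
def tactic_recall(flags, ground_truth):
    """Per-tactic recall of flagged event IDs against Sigma-style ground
    truth. Inputs are dicts mapping tactic name -> set of event IDs."""
    recall = {}
    for tactic, truth in ground_truth.items():
        hit = len(truth & flags.get(tactic, set()))
        recall[tactic] = hit / len(truth)
    return recall

ground_truth = {
    "Defense Evasion": {"e1", "e2", "e3"},
    "Initial Access": {"e4", "e5"},
}
flags = {"Defense Evasion": {"e1", "e2"}}  # nothing flagged for Initial Access

per_tactic = tactic_recall(flags, ground_truth)
aggregate = (sum(len(v & flags.get(k, set())) for k, v in ground_truth.items())
             / sum(len(v) for v in ground_truth.values()))
```

Here aggregate recall is 0.4, which could pass a loose quality gate, while the per-tactic view exposes the Initial Access blind spot outright.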

Frequently Asked Questions

How do these failure modes compare to what frontier LLMs show on security MCQ benchmarks like CyberSecEval?

Security MCQ benchmarks pre-stage the evidence — the model sees a structured question with bounded answer choices and the relevant artifact already in context. The Cyber Defense Benchmark removes that scaffolding entirely: the agent must decide which EventIDs to query before it knows whether they will be relevant. A model can correctly answer ‘which EventID indicates PowerShell script-block logging?’ on a knowledge test while still failing to generate SQL that isolates EventID 4104 records within 135,000 unfiltered log entries — which is why strong security knowledge scores do not predict Coverage Scores above 0.20.

Would a RAG-over-logs architecture score differently than the SQL-query design this benchmark enforces?

Possibly — dense retrieval over embedded log chunks could surface semantically similar events without the discriminating-query problem that sinks SQL-based agents, since a retrieval index does not impose a row-return cap per query. The benchmark does not test RAG architectures; all agents interact exclusively through parameterized SQL returning at most 10 rows per call. Teams using vector search over SIEM data in production would need to adapt the open-source ThreatHuntEnv harness to evaluate that design. The current results are specific to the SQL-and-budget paradigm.

Does using a public dataset like OTRF Security-Datasets introduce benchmark contamination risk?

Yes — OTRF Security-Datasets is an openly maintained community repository, meaning a model fine-tuned on its logs and attack procedures could recognize flag patterns without developing genuine open-ended hunting reasoning. The paper does not report whether any of the 11 tested models had OTRF data in their pretraining or fine-tuning corpora, and contamination is not discussed as a limitation. This is a structural vulnerability the benchmark shares with any evaluation built on public data, and warrants caution before treating Coverage Scores as ground-truth capability measurements rather than lower bounds.

What log-analysis properties make Credential Access and Initial Access so much harder to recall than Defense Evasion?

Credential Access and Initial Access events typically appear benign in isolation — a successful logon (EventID 4624) or a new process launch is routine system activity. Correctly flagging them as malicious requires correlating multiple log entries across different event types, often in temporal sequence. Defense Evasion events, by contrast, include EventIDs rare enough to flag on presence alone, such as audit log clearing or injection into system processes. The breadth-first keyword strategy all tested models defaulted to is well-suited to rare-EventID detection but structurally incapable of surfacing multi-event sequences that only read as malicious in combination — which is exactly why Defense Evasion topped recall at 0.59 while Credential Access sat near zero across every model.
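The correlation problem can be made concrete with a toy self-join: a 4624 logon followed within a short window by a process creation on the same host, where neither row is suspicious alone. The schema, hostnames, and 60-second window are all illustrative.

```python
import sqlite3

# Toy correlation over an invented schema: a logon (4624) followed
# shortly by a Sysmon process-creation event (EventID 1) on the same
# host. Neither event reads as malicious in isolation.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (ts INTEGER, EventID INTEGER, Host TEXT, Detail TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
    (100, 4624, "WS01", "logon type 3"),
    (130, 1,    "WS01", "wmiprvse.exe spawned cmd.exe"),
    (500, 4624, "WS02", "logon type 2"),  # no follow-on process creation
])

# Self-join: logons paired with a process creation on the same host
# within 60 seconds. Only the WS01 sequence should match.
pairs = db.execute("""
    SELECT a.Host, a.ts AS logon_ts, b.ts AS proc_ts, b.Detail
    FROM events a JOIN events b
      ON a.Host = b.Host
     AND a.EventID = 4624 AND b.EventID = 1
     AND b.ts BETWEEN a.ts AND a.ts + 60
""").fetchall()
```

A keyword filter on EventID 4624 alone would have returned both logons and neither verdict; the join is what separates the benign WS02 logon from the WS01 sequence.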

Sources

  1. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv abstract). Primary. Accessed 2026-04-24.
  2. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps (arXiv PDF). Primary. Accessed 2026-04-24.
  3. Simbian AI Cyber Defense Benchmark GitHub Repository. Community. Accessed 2026-04-24.
