Topic: #llm-benchmarks
3 articles exploring LLM benchmarks, with expert insights and analysis from our editorial team.
Articles
Agents & Frameworks
Frontier LLMs Fail Agentic Threat Hunting: Best Model Catches 3.8% of Malicious Events in 11-Model Benchmark
Simbian AI's benchmark tests 11 LLMs on hunting threats in raw Windows event logs; Claude Opus 4.6 leads with a 0.55 coverage score, while every other model covers none of the 13 ATT&CK tactics.
Agents & Frameworks
FSE 2026: Chain-of-Thought Fails Per-Bias as Debiasing; Axiomatic Cues Cut Sensitivity 51%
FSE 2026: chain-of-thought debiasing fails on a per-bias basis across PROBE-SWE software-engineering tasks. Axiomatic cues cut bias sensitivity by 51%, exposing gaps in the defaults of CrewAI, LangChain, and Pydantic AI.
Models & Research
STaD's Scaffolded Tasks Isolate the Compositional Skill Gaps That Aggregate LLM Benchmarks Hide
IBM Research's STaD framework exposes compositional skill gaps that aggregate benchmarks miss: two models scoring 32% on ToT Arithmetic needed fundamentally different fixes.