#llm-benchmarks

3 articles exploring llm-benchmarks. Expert insights and analysis from our editorial team.

Articles

Agents & Frameworks

Frontier LLMs Fail Agentic Threat Hunting: Best Model Catches 3.8% of Malicious Events in 11-Model Benchmark

Simbian AI's benchmark tests 11 LLMs on raw Windows event log hunting; Claude Opus 4.6 leads with a 0.55 coverage score, while every other model covers zero of the 13 ATT&CK tactics.

Agents & Frameworks

FSE 2026: Chain-of-Thought Fails Per-Bias as Debiasing; Axiomatic Cues Cut Sensitivity 51%

FSE 2026: chain-of-thought fails as a per-bias debiasing strategy on PROBE-SWE software-engineering tasks. Axiomatic cues cut bias sensitivity by 51%, exposing gaps in the defaults of CrewAI, LangChain, and Pydantic AI.

Models & Research

STaD's Scaffolded Tasks Isolate the Compositional Skill Gaps That Aggregate LLM Benchmarks Hide

IBM Research's STaD framework exposes compositional skill gaps that aggregate benchmarks miss: two models both scoring 32% on ToT Arithmetic needed fundamentally different fixes.