models & research
Top in models & research
Tracing Why LLM Agent Memory Fails: A Method for Attributing Errors
MemTrace constructs provenance graphs across every memory operation in an LLM agent, tracing wrong answers to the exact operation that corrupted state across sessions.
models Persona Prompts Change Who an LLM Recommends as an Expert
A 43-model audit finds that geographic and role framing in LLM prompts systematically shifts which scholars get recommended as experts, with no neutral default.
Opus 4.8 Batch API: 1M Context, 300k Output, and Team Cost Controls
Opus 4.8 has a 1M token context window (200k on Foundry), 128k standard output, and 300k output via Batch API beta. January 2026 cutoff. Batch design and quota allocation.
models Opus 4.8 vs Opus 4.7: What Changed and What Did Not
Anthropic's Opus 4.8 raises SWE-Bench Pro from 64.3% to 69.2% and cuts code-flaw pass-through fourfold at unchanged $5/$25 pricing. A fast mode at $10/$50 runs 2.5x quicker.
models One Learning Rate Doesn't Fit All: Heavy-Tail Layerwise LR Schedules for LLM Pretraining
LLR assigns per-layer learning rates from spectral heavy-tail diagnostics during LLM pretraining, achieving 1.5x faster convergence and up to 2 pp higher zero-shot accuracy.
models Scale Vectors: Tiny Parameter Subsets That Disproportionately Steer LLM Behavior
Scale vectors are a negligible parameter class in LLM normalization layers whose outsized optimization role makes them high-value targets for quantization and safety editing.
models Embedding Compression at Training Time: DIVE's Gradient Trick vs Post-Hoc Quantization for Vector DBs
DIVE's gradient-limited adapter outperforms baselines for embedding compression, but training-time methods lock RAG pipelines to specific adapters and raise refresh costs.
models μP Hyperparameter Transfer Has an Embedding Layer Hole, New arXiv Paper Says
An arXiv paper shows the embedding learning rate accounts for most of μP's advantage over standard parameterization, and a single scaling fix recovers the bulk of the benefit.
- may 25 models Audio LLMs Break When the Codec Changes: A Robustness Vector Voice-AI Teams Haven't Tested
- may 23 models Project Glasswing One Month In: AI Bug Discovery Has Outpaced the Patch Pipeline
- may 25 models Do LLMs Know What Not to Say? Causal Evidence for Statistical Preemption
- may 22 models arXiv 2605.16428 Measures AI Search's Drag on Publisher Traffic Using Paired Google and Reddit Data
- may 22 models A Theory of Time-Sensitive Language Generation Says Sparse Hallucination Beats Mode Collapse
- may 18 models The Last Word Often Wins: A Format Confound Inflates Chain-of-Thought Corruption Robustness Scores
- may 17 models Learning, Fast and Slow: What arXiv 2605.12484 Proposes for LLMs That Adapt Continually
- apr 27 models There Will Be a Scientific Theory of Deep Learning: What arXiv 2604.21691 Argues and Where It Will Lose
- apr 22 models Qwen3.6-27B's Dense Architecture Challenges the MoE-Only Playbook for Flagship-Class Coding Models
- mar 23 models Chinese AI Models Compared: DeepSeek, Qwen, Kimi, Doubao, and Ernie
- mar 23 models Running DeepSeek R1 Locally: Hardware Requirements, Quantization, and Real Throughput
- mar 14 models Fish-Speech: The Open-Source TTS Model That's Threatening ElevenLabs
- feb 26 models Gemini 2.0 Pro's 2 Million Token Context: What Can You Actually Do With It?
- feb 26 models Google's TimesFM: A Foundation Model for Time Series
- feb 26 models Synthetic Data Is Eating AI Training
- feb 26 models Claude's Web Search Changes Everything for AI Research
- feb 26 models DeepSeek V3/R1: How Chinese Engineers Matched GPT-4 for $6 Million
- feb 26 models The Million-Token Context Window: What Can You Actually Do?
- feb 18 models Gemini 3.1 Pro: Google's New Reasoning Model Explained
- feb 14 models AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?
- feb 17 models Kimi Claw: Moonshot AI's Answer to Claude and ChatGPT
- feb 17 models WiFi DensePose: Full-Body Tracking Through Walls Using Your Router
- feb 10 models The Best AI Models for OpenClaw in 2026
Every new foundation model arrives wrapped in a benchmark chart and a press release. This beat exists because the chart almost never tells you what the model actually does, and the press release tells you even less. The interesting work is one layer down: the attention variant that changes the cost curve, the parameterization fix that makes scaling laws transfer, the eval protocol whose failure modes flip the standings once you change a decoding parameter or a codec. We cover that layer.
We treat open-weight and closed releases as the same story told from opposite ends of a distribution shift. A trillion-parameter mixture beating a dense competitor on one harness and losing on another is not a contradiction; it is evidence about what the harness measures. Training-efficiency claims, context-window claims, and reasoning-benchmark claims all get the same treatment: read the method section, find the load-bearing assumption, and report whether removing it changes the result. When a paper proposes a new safety mechanism, a new compression trick, or a new continual-learning split, we are interested in the part the authors did not want to highlight.
The throughline is comparative and skeptical without being contrarian. Foundation-model research is where the gap between published numbers and deployed behavior tends to be widest. Closing that gap, with reproduction notes, ablation reading, and honest accounting of what generalizes versus what was tuned into the eval, is the beat.