models & research
Top in models & research
Can RoboSSM's State-Space Backbone Replace Transformer Imitation Policies?
The RoboSSM preprint swaps the transformer backbone of in-context robot imitation for a Longhorn state-space model and claims LIBERO gains. The full paper is still pending.
Pruning Experts to Shrink MoE Models: Does Attribution-Guided Compression Beat Magnitude?
Routing architecture decides whether MoE experts can be pruned safely: hard routers keep calibration per expert, soft routers need an aggregate guard.
GLM-5.2 vs Kimi K2.7 Code: Two Open-Weight Bets on Agentic Coding
GLM-5.2 and Kimi K2.7 Code shipped a day apart as open-weight coding models. One leads every open-weight leaderboard; the other is cheaper and more token-efficient.
How Linear Is a Transformer Feed-Forward Block? A New Test Says It's Learned, Not Built In
Per-block linear recoverability in transformer FFNs swings from near-linear to strongly nonlinear between layers, so compression tools must probe each block per checkpoint.
GLM-5.2 Benchmarks: What 62.1% SWE-bench Pro and 99.2% AIME Actually Mean
Zhipu published a full benchmark suite for GLM-5.2 on June 19, 2026. Each score targets a different skill domain, and each carries distinct contamination or harness caveats.
GLM-5.2 on Terminal-Bench 2.1: Strengths, Gaps, and How to Route Real Coding Tasks
GLM-5.2 scores 81.0 on Terminal-Bench 2.1, trails Claude Opus 4.8 (85.0) on shell tasks, but wins on 1M-token monorepo context. Here is how to route tasks.
GLM-5.2 vs Claude Opus 4.8: Open-Weight Coding at Frontier Pricing
GLM-5.2 posts 62.1% on SWE-bench Pro and 81.0 on Terminal-Bench 2.1, four points behind Opus 4.8. MIT weights are self-hostable; flat plan starts at $18/month.
GLM-5.2's 753B MoE Costs More to Self-Host Than the MIT License Suggests
GLM-5.2 ships 753B parameters under MIT with strong coding benchmarks, but the full MoE weight load makes self-hosting far heavier than the license terms imply.
- jun 17 models STAR Replaces Scalar Reward in Text-to-Image RL with Attention-Derived Spatial Maps
- jun 15 models Can Editing One Neuron Fix LLM Repetition Loops?
- jun 10 models Claude Fable 5 Benchmarks: What FrontierCode, CursorBench, and ViBench Show
- jun 10 models Does Attribution Patching Lie? A Fix for a Common Interpretability Shortcut
- jun 11 models Can You Make a Multimodal Model Unlearn With Activation Steering?
- jun 11 models Why Pruning a Model Can Raise Its Out-of-Distribution Accuracy
- jun 09 models Do Unified Multimodal Models Actually Interleave Understanding and Generation?
- jun 09 models How LLMs Track Who Did What: The Entity Rebinding Circuit
- jun 09 models Claude Fable 5 vs Opus 4.8: When 2x Pricing Is Worth It
- jun 09 models Claude Mythos 5 Access Rules: Who Gets Project Glasswing and Why
- jun 09 models Fable 5 Distillation Protection: How Anthropic Blocks Model Copying
- jun 09 models Skip Fable 5 or Upgrade? When Opus 4.8 and Sonnet 4.6 Are Still Enough
- jun 08 models LLM Steganography: Can Defenders Detect Payloads Hidden in Model Output?
- jun 08 models Do Privacy Defenses Actually Protect Fine-Tuned LLMs? A New Benchmark
- jun 08 models Can You Reconstruct an LLM's System Prompt From Its Activations?
- jun 08 models Does Softmax Normalization Limit What Attention Can Represent?
- jun 07 models Can an Attacker Steal Your Model's Last Layer From Its Outputs?
- jun 06 models Can LLMs Leak Training Data? A New Test Splits Capacity From Intent
- jun 06 models When an AI Agent's Tools Break, Can It Recover? A New Benchmark
- jun 05 models MiniMax M3 Bets on Sparse Attention for 1M Context. Does the Math Hold?
- jun 05 models Can One Model Handle Every CAD Task? UniCAD Tests It
- jun 05 models Do Foundation Models Actually Learn Relational Structure In-Context?
- jun 05 models Can LLMs Write Better Research Paper Titles Than Authors?
- jun 05 models Does Information-Theoretic Example Selection Beat kNN for In-Context Learning?
- jun 05 models Do Concept Bottleneck Model Benchmarks Measure Interpretability or Dataset Bias?
- jun 05 models Continuous Bit-Width Quantization vs Fixed INT4: Does LiftQuant Beat Discrete?
- jun 04 models Federated Learning for Industrial IoT Anomaly Detection: The Data-Locality Tradeoff
- jun 04 models Reading Failed LLM Reasoning Traces Won't Tell You Which Ones RL Can Fix
- jun 04 models Can You Stitch Two Foundation Models Together Without Retraining?
- jun 04 models Do Reasoning LLMs Waste Tokens? OckBench Tries to Measure It
- jun 03 models Which Layer Detects LLM Hallucinations Best? The Case Against Fixed-Layer Probes
- jun 02 models Cross-Domain RL Training Degrades Capabilities. CARE-RL Reweights to Fix It
- jun 02 models LLM Watermarking Without Quality Loss: The Non-Distortionary Approach
- jun 01 models Treating LLM Agent Memory as a Database: The VikingMem Approach
- jun 01 models Can a Language Model Work Without a Neural Network? A New arXiv Paper Says Yes
- jun 01 models Can Code-Generating LLMs Do Engineering Math? FEM-Bench Tests Them
- jun 01 models Unlearning Isn't Deletion: arXiv 2505.16831 Shows Machine Unlearning in LLMs Is Reversible
- may 31 models Why LLMs Fail at Spatial Reasoning When Planning Navigation
- may 31 models Does Giving AI Agents More Skills Help? A Controlled SkillsBench Study
- may 30 models Can an LLM Peer-Review Your Paper? A New Behavior Benchmark
- may 30 models Anthropic Scaled Sparse Autoencoders to Claude 3 Sonnet. Interpretability Now Costs Compute
- may 28 models Tracing Why LLM Agent Memory Fails: A Method for Attributing Errors
- may 28 models Persona Prompts Change Who an LLM Recommends as an Expert
- may 27 models Opus 4.8 vs Opus 4.7: What Changed and What Did Not
- may 27 models Opus 4.8 Batch API: 1M Context, 300k Output, and Team Cost Controls
- may 26 models Scale Vectors: Tiny Parameter Subsets That Disproportionately Steer LLM Behavior
- may 26 models One Learning Rate Doesn't Fit All: Heavy-Tail Layerwise LR Schedules for LLM Pretraining
- may 25 models Audio LLMs Break When the Codec Changes: A Robustness Vector Voice-AI Teams Haven't Tested
- may 25 models Do LLMs Know What Not to Say? Causal Evidence for Statistical Preemption
- may 24 models Embedding Compression at Training Time: DIVE's Gradient Trick vs Post-Hoc Quantization for Vector DBs
Every new foundation model arrives wrapped in a benchmark chart and a press release. This beat exists because the chart almost never tells you what the model actually does — and the press release tells you even less. The interesting work is one layer down: the attention variant that changes the cost curve, the parameterization fix that makes scaling laws actually transfer, the eval protocol whose failure modes flip the standings once you change a decoding parameter or a codec. We cover that layer.
We treat open-weight and closed releases as the same story told from opposite ends of a distribution shift. A trillion-parameter mixture beating a dense competitor on one harness and losing on another is not a contradiction; it is evidence about what the harness measures. Training-efficiency claims, context-window claims, and reasoning-benchmark claims all get the same treatment — read the method section, find the assumption that was load-bearing, and report whether removing it changes the result. When a paper proposes a new safety mechanism, a new compression trick, or a new continual-learning split, we are interested in the part the authors did not want to highlight.
The throughline is comparative and skeptical without being contrarian. Foundation-model research is where the gap between published numbers and deployed behavior tends to be widest. Closing that gap, with reproduction notes, ablation reading, and honest accounting of what generalizes versus what was tuned into the eval, is the beat.