Models & Research

33 articles exploring Models & Research. Expert analysis and insights from our editorial team.

Showing 1–15 of 33 articles · Page 1 of 3

Foundation models are moving faster than anyone’s ability to track them. This cluster covers new model releases, architectural research, benchmark methodology, and the increasingly murky transparency picture across frontier labs. Groundy focuses on what the numbers actually mean—not the headline score, but the benchmark design, the eval contamination risk, the gap between synthetic performance and production behavior.

The open-weight ecosystem has matured into a genuine competitive front. DeepSeek’s V3 and R1 models demonstrated that training efficiency—Mixture of Experts routing, multi-head latent attention, reinforcement-learning reward shaping—can close the gap with proprietary models at a fraction of the compute cost. Alibaba’s Qwen series has done similar work on the multilingual and structured-data side, largely underreported in the English-language press. Meta’s Llama line and Mistral’s Devstral and Magistral models round out an open-weight landscape that now produces genuinely usable code and reasoning models.
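
To make the efficiency argument concrete, here is a minimal, illustrative top-k routing sketch in PyTorch. It is not DeepSeek's implementation (production systems add shared experts, load-balancing losses, and fused kernels); it only shows why a Mixture of Experts layer runs a small fraction of its feed-forward parameters for each token.

```python
# Illustrative top-k Mixture-of-Experts routing (a simplified sketch, not
# DeepSeek's actual layer). Each token is routed to k of n_experts expert
# FFNs, so only a fraction of the layer's parameters run per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)       # (num_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # pick k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                # Each token appears at most once per expert, so += is safe here.
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```

With k=2 of 8 experts active, each token touches roughly a quarter of the expert parameters, which is the core of the compute-efficiency claim.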

Context windows deserve skepticism. The race toward advertised million-token limits often outpaces actual usable capacity; Groundy has tested what models can reliably retrieve from middle-of-context versus end-of-context positions, and the degradation curves are steep. The same applies to reasoning model claims: Qwen3.6-Max-Preview, Gemini 3.1 Pro, and comparable models each require independent evaluation against your actual task distribution rather than aggregate leaderboard numbers.
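
The degradation claim is straightforward to probe against your own workloads. A minimal sketch of such a positional-retrieval test follows: plant a known fact at a controlled depth in filler text and measure recall by position. The call_model function is a placeholder for whatever completion client you use; the structure of the probe, not the names, is what matters.

```python
# Sketch of a positional-retrieval probe: plant a known fact ("needle") at a
# controlled depth inside filler text and check whether the model recovers it.
# `call_model` is a placeholder for whatever completion client you use.
import random

def build_prompt(needle: str, filler_sentences: list[str], depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    pos = int(len(filler_sentences) * depth)
    body = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    return " ".join(body) + "\n\nQuestion: What is the secret code mentioned above?"

def retrieval_curve(call_model, filler_sentences,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=20):
    results = {}
    for depth in depths:
        hits = 0
        for _ in range(trials):
            code = f"{random.randint(0, 999999):06d}"
            needle = f"The secret code is {code}."
            answer = call_model(build_prompt(needle, filler_sentences, depth))
            hits += code in answer
        results[depth] = hits / trials
    return results  # retrieval accuracy by position in the context
```

If the pattern described above holds for your model, accuracy at the 0.5 depth will lag the endpoints well before the advertised limit is reached.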

Benchmark integrity is its own coverage thread. SWE-bench Verified has become the de facto coding-agent evaluation, but its methodology has blind spots—test isolation assumptions, the gap between isolated bug fixes and real codebase navigation—that matter when you’re making model selection decisions. Groundy reads the methodology papers, not just the press releases.
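
For reference, the headline "resolved" percentage reduces to a simple per-instance check: the previously failing tests must now pass and the previously passing tests must not break. The sketch below uses illustrative field names rather than the benchmark's official harness, but it shows what the leaderboard number does and does not encode.

```python
# Illustrative SWE-bench-style scoring: an instance counts as resolved only if
# the fail-to-pass tests now pass AND the previously passing tests still pass.
# Field names are illustrative; the official harness differs in detail.
from dataclasses import dataclass

@dataclass
class InstanceResult:
    instance_id: str
    fail_to_pass_passed: bool   # did the target tests go from failing to passing?
    pass_to_pass_passed: bool   # did the regression tests keep passing?

def resolved_rate(results: list[InstanceResult]) -> float:
    resolved = [r for r in results if r.fail_to_pass_passed and r.pass_to_pass_passed]
    return len(resolved) / len(results) if results else 0.0
```

A "resolved 49%" headline is just this ratio over the benchmark's 500 instances; it says nothing about how the patch was found or whether the approach survives contact with a real, unfamiliar codebase.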

Transparency is collapsing at exactly the moment it matters most. The Stanford HAI 2026 AI Index found frontier model transparency scores dropped 31% in a single year—labs are disclosing less about training data sources, evaluation methodology, and safety testing even as they ask for regulatory deference. This cluster tracks both the technical capabilities and the institutional opacity that makes independent evaluation more important than it has ever been.

Featured in this cluster

Cornerstone

DeepSeek V3/R1: How Chinese Engineers Matched GPT-4 for $6 Million

DeepSeek's V3 and R1 models match GPT-4-class performance using a fraction of the compute through architectural innovations in Mixture of Experts, attention compression, and reinforcement learning—demonstrating that training efficiency may matter more than raw hardware scale.

· 10 min read
Cornerstone

The Million-Token Context Window: What Can You Actually Do?

Million-token context windows let you feed entire codebases, legal contracts, and hours of video to an LLM in one pass—but advertised limits routinely overstate practical capability. Here's what the benchmarks, failure modes, and real deployment patterns actually show.

· 9 min read
Cornerstone

Qwen 2.5 vs Llama 3.3: The Open-Weight Showdown Nobody Is Talking About

Alibaba's Qwen 2.5 beats Meta's Llama 3.3 on math, multilingual tasks, and structured data — yet gets a fraction of the Western press coverage.

· 8 min read
Cornerstone

Constitutional AI: Teaching Models to Self-Correct Before They Act

Anthropic's Constitutional AI trains language models to critique and revise their own outputs using principles rather than human labels, but questions remain about whether this represents genuine safety gains or sophisticated filtering mechanisms.

· 9 min read
Cornerstone

SWE-bench Verified Explained: What the Coding Agent Leaderboard Actually Measures (and What It Misses)

SWE-bench Verified tests AI agents on 500 real GitHub bug fixes. Learn what 'resolved 49%' means, how scoring works, and the benchmark's critical blind spots.

· 8 min read

Latest in Models & Research

Newest first
01

There Will Be a Scientific Theory of Deep Learning: What arXiv 2604.21691 Argues and Where It Will Lose

Fourteen theorists argue fragmented deep-learning theory is converging into 'learning mechanics,' but concede scaling exponents and nonlinear stability remain open.

02

STaD Exposes What HumanEval Hides: Compositional Skill Gaps in LLMs That Aggregate Benchmarks Miss

IBM Research's STaD shows models with identical benchmark scores can fail on different subskills, making leaderboard rank a poor proxy for compositional code generation.

03

STaD's Scaffolded Tasks Isolate the Compositional Skill Gaps That Aggregate LLM Benchmarks Hide

IBM Research's STaD framework exposes compositional skill gaps aggregate benchmarks miss: two models at 32% on ToT Arithmetic needed fundamentally different fixes.

04

DuQuant++ Makes FP4 Quantization Practical for LLM Inference: What Fine-Grained Rotation Means for Blackwell Deployments

DuQuant++ aligns rotation block size with MXFP4 microscaling groups, halving preprocessing cost and pushing W4A4 accuracy close to FP8 as Blackwell FP4 Tensor Cores ship.

05

DuQuant++ Brings Fine-Grained Rotation to FP4: What Microscaling Quantization Means for Running Larger Models on the Same GPU

DuQuant++ adapts outlier-aware rotation to MXFP4, halving online rotation cost on LLaMA 3 and shifting the FP4 deployment bottleneck from memory to calibration engineering.

06

Fixed Entropy Coefficients Break Down on Mixed-Difficulty Tasks: What AER Means for Teams Running LLM RL at Scale

Static entropy regularization in GRPO underperforms on mixed-difficulty tasks. Difficulty-aware allocation closes the gap by 7-10 points on pass@1 without extra compute.

07

JumpLoRA's Sparse Adapters Break the Assumption That Continual Fine-Tuning Requires Full-Rank LoRA Stacks

JumpLoRA adds learnable JumpReLU gates to LoRA blocks for 87-95% sparse adapters with near-zero cross-task overlap. The work exposes that PEFT has no router for continual learning.

08

MM-JudgeBias Exposes Compositional Bias in MLLM-as-a-Judge: What It Means for Teams Running Model-Based Eval Pipelines

MM-JudgeBias shows MLLM judges inherit the compositional biases they evaluate, so teams must audit judge selection rather than assume model-based eval removes labeling work.

09

Qwen3.6-27B's Dense Architecture Challenges the MoE-Only Playbook for Flagship-Class Coding Models

Alibaba's dense Qwen3.6-27B outperforms its MoE sibling on coding benchmarks, offering predictable inference latency at the cost of a larger memory footprint than sparse alternatives.

10

Sessa Breaks the Mamba-or-Transformer Binary: Distance-Invariant Retrieval Forces a Rethink of Long-Context Architecture Choices

Sessa embeds attention inside a recurrent loop, outperforming Transformer and Mamba on long-context tasks. The interaction topology matters more than the attention-SSM ratio.

11

Self-Correction Comes to Diffusion Models: What SOAR Means for Iterative Image Generation Pipelines

Tencent's SOAR replaces SFT post-training in diffusion models, yielding an 11% GenEval lift on SD3.5-M — no reward model, no preference labels required.

· 6 min read
12

NVIDIA Ising: Open-Source AI Models That Let Quantum Processors Self-Calibrate

NVIDIA's Ising family adds a 2.5× faster pre-decoder and a 35B calibration VLM to quantum pipelines — here's what the benchmarks actually mean and who should use which.

· 6 min read
13

Qwen3.6-Max-Preview: What Alibaba's Latest Model Means for Open-Weight Competitors

Alibaba's Qwen3.6-Max-Preview launches API-only and free today, while the genuinely open-weight 35B-A3B sibling offers a separate self-hosting path.

· 6 min read
14

Swarm AI for Prediction Markets: Collective Intelligence Gets an Algorithm

MiroFish uses swarm intelligence to simulate thousands of AI agents forecasting outcomes. What it actually does—and what the benchmarks don't yet show.

· 8 min read
15

Qwen 2.5 vs Llama 3.3: The Open-Weight Showdown Nobody Is Talking About

Alibaba's Qwen 2.5 beats Meta's Llama 3.3 on math, multilingual tasks, and structured data — yet gets a fraction of the Western press coverage.

· 8 min read
