Groundy — Models & Research

Where architecture, training tricks, and eval methodology meet the marketing layer, separating durable progress in foundation models from leaderboard theater that quietly falls apart under load.
https://groundy.com/en-us

Tracing Why LLM Agent Memory Fails: A Method for Attributing Errors
https://groundy.com/articles/tracing-why-llm-agent-memory-fails-a-method-for-attributing-errors/
MemTrace constructs provenance graphs across every memory operation in an LLM agent, tracing wrong answers to the exact operation that corrupted state across sessions.
Fri, 29 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
llm-memory, debugging, rag, agent-frameworks, error-attribution, provenance

Persona Prompts Change Who an LLM Recommends as an Expert
https://groundy.com/articles/persona-prompts-change-who-an-llm-recommends-as-an-expert/
A 43-model audit finds that geographic and role framing in LLM prompts systematically shifts which scholars get recommended as experts, with no neutral default.
Fri, 29 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
llm-bias, persona-prompts, expert-recommendation, scholar-discovery, ai-fairness, recommendation-systems

Opus 4.8 Batch API: 1M Context, 300k Output, and Team Cost Controls
https://groundy.com/articles/opus-4-8-batch-api-1m-context-300k-output-and-team-cost-controls/
Opus 4.8 has a 1M token context window (200k on Foundry), 128k standard output, and 300k output via Batch API beta. January 2026 cutoff. Batch design and quota allocation.
Thu, 28 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
claude-opus, batch-api, anthropic, rate-limits, context-window, model-release, team-infrastructure

Opus 4.8 vs Opus 4.7: What Changed and What Did Not
https://groundy.com/articles/opus-4-8-vs-opus-4-7-what-changed-and-what-did-not/
Anthropic's Opus 4.8 raises SWE-Bench Pro from 64.3% to 69.2% and cuts code-flaw pass-through fourfold at unchanged $5/$25 pricing. A fast mode at $10/$50 runs 2.5x quicker.
Thu, 28 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
claude, anthropic, opus-48, model-release, benchmarks, agentic-coding

One Learning Rate Doesn't Fit All: Heavy-Tail Layerwise LR Schedules for LLM Pretraining
https://groundy.com/articles/one-learning-rate-doesnt-fit-all-heavy-tail-layerwise-lr-schedules-for-llm/
LLR assigns per-layer learning rates from spectral heavy-tail diagnostics during LLM pretraining, achieving 1.5x faster convergence and up to 2 pp higher zero-shot accuracy.
Wed, 27 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-27T00:00:00.000Z
llm-pretraining, learning-rate, spectral-analysis, optimizer, transformer-training, icml-2026

Scale Vectors: Tiny Parameter Subsets That Disproportionately Steer LLM Behavior
https://groundy.com/articles/scale-vectors-tiny-parameter-subsets-that-disproportionately-steer-llm-behavior/
Scale vectors are a negligible parameter class in LLM normalization layers whose outsized optimization role makes them high-value targets for quantization and safety editing.
Wed, 27 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
scale-vectors, llm-quantization, mechanistic-interpretability, model-normalization, model-compression, llm-training

Embedding Compression at Training Time: DIVE's Gradient Trick vs Post-Hoc Quantization for Vector DBs
https://groundy.com/articles/embedding-compression-at-training-time-dives-gradient-trick-vs-post-hoc/
DIVE's gradient-limited adapter outperforms baselines for embedding compression, but training-time methods lock RAG pipelines to specific adapters and raise refresh costs.
Mon, 25 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
embedding-compression, rag, vector-databases, dive, adapter-methods

μP Hyperparameter Transfer Has an Embedding Layer Hole, New arXiv Paper Says
https://groundy.com/articles/p-hyperparameter-transfer-has-an-embedding-layer-hole-new-arxiv-paper-says/
An arXiv paper shows the embedding learning rate accounts for most of μP's advantage over standard parameterization, and a single scaling fix recovers the bulk of the benefit.
Mon, 25 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
mup, hyperparameter-transfer, embedding-layer, adamw, model-scaling, training-optimization

Audio LLMs Break When the Codec Changes: A Robustness Vector Voice-AI Teams Haven't Tested
https://groundy.com/articles/audio-llms-break-when-the-codec-changes-a-robustness-vector-voice-ai-teams/
CodecAttack achieves 85.5% attack success on audio LLMs by optimizing in codec latent space, with 100% zero-shot transfer to MP3, proving lossy compression fails as a defense.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
adversarial-audio, audio-llms, codec-robustness, voice-ai, adversarial-ml, audio-security

Project Glasswing One Month In: AI Bug Discovery Has Outpaced the Patch Pipeline
https://groundy.com/articles/project-glasswing-one-month-in-ai-bug-discovery-has-outpaced-the-patch-pipeline/
Anthropic's Glasswing found over 10,000 high-severity vulnerabilities in one month. Only 97 are patched. The bottleneck shifted from discovery to triage, and it is structural.
Sun, 24 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
ai-security, vulnerability-disclosure, anthropic, claude-mythos, cybersecurity, interpretability

Do LLMs Know What Not to Say? Causal Evidence for Statistical Preemption
https://groundy.com/articles/do-llms-know-what-not-to-say-causal-evidence-for-statistical-preemption/
New causal evidence shows LLMs suppress wrong continuations during pretraining via statistical preemption, suggesting output-layer safety fixes may target the wrong layer.
Tue, 26 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-26T00:00:00.000Z
statistical-preemption, llm-safety, model-interpretability, hallucination, causal-probing, pretraining

arXiv 2605.16428 Measures AI Search's Drag on Publisher Traffic Using Paired Google and Reddit Data
https://groundy.com/articles/arxiv-2605-16428-measures-ai-searchs-drag-on-publisher-traffic-using-paired/
An arXiv study finds AI Overviews boost Reddit engagement 12% for experience-based content, but Google AI Mode erases those gains, reshaping search-driven publishing economics.
Sat, 23 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-23T00:00:00.000Z
ai-overviews, google-search, publisher-traffic, content-strategy, reddit, search-ecology

A Theory of Time-Sensitive Language Generation Says Sparse Hallucination Beats Mode Collapse
https://groundy.com/articles/a-theory-of-time-sensitive-language-generation-says-sparse-hallucination-beats/
arXiv 2605.11302 proves timely generation requires sparse hallucination under formal bounds, reframing RLHF safety tuning as a tradeoff between two failure modes.
Sat, 23 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-23T00:00:00.000Z
hallucination, rlhf, language-generation, safety-tuning, mode-collapse, formal-methods, deep-learning-theory

The Last Word Often Wins: A Format Confound Inflates Chain-of-Thought Corruption Robustness Scores
https://groundy.com/articles/the-last-word-often-wins-a-format-confound-inflates-chain-of-thought-corruption/
A format confound in CoT corruption benchmarks, suffix sensitivity collapsed 19× when final-answer text was stripped, means published faithfulness scores are inflated.
Tue, 19 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-19T00:00:00.000Z
chain-of-thought, eval-methodology, process-reward-models, gsm8k, reasoning-faithfulness, format-confound, benchmarking

Learning, Fast and Slow: What arXiv 2605.12484 Proposes for LLMs That Adapt Continually
https://groundy.com/articles/learning-fast-and-slow-what-arxiv-2605-12484-proposes-for-llms-that-adapt/
Fast-Slow Training splits LLM updates into prompt fast weights and parametric slow weights, cutting KL drift by 70% and lifting sample efficiency by 3×, keeping plasticity.
Mon, 18 May 2026 00:00:00 GMT | Groundy Editorial | 2026-05-18T00:00:00.000Z
continual-learning, fine-tuning, llm-training, prompt-optimization, reinforcement-learning, qwen, sample-efficiency

There Will Be a Scientific Theory of Deep Learning: What arXiv 2604.21691 Argues and Where It Will Lose
https://groundy.com/articles/there-will-be-a-scientific-theory-of-deep-learning-what-arxiv-2604-21691-argues/
Fourteen theorists argue fragmented deep-learning theory is converging into 'learning mechanics,' but concede scaling exponents and nonlinear stability remain open.
Tue, 28 Apr 2026 00:00:00 GMT | Groundy Editorial | 2026-04-29T00:00:00.000Z
deep-learning, scaling-laws, training-dynamics, neural-tangent-kernel, edge-of-stability, generalization

Qwen3.6-27B's Dense Architecture Challenges the MoE-Only Playbook for Flagship-Class Coding Models
https://groundy.com/articles/qwen36-27bs-dense-architecture-challenges-the-moe-only-playbook-for-flagship/
Alibaba's dense Qwen3.6-27B outperforms its MoE sibling on coding benchmarks, trading predictable inference latency for a larger memory footprint than sparse alternatives.
Thu, 23 Apr 2026 00:00:00 GMT | Groundy Editorial | 2026-04-24T00:00:00.000Z
qwen, dense-models, moe, inference, coding-models, model-architecture, llm-deployment

Chinese AI Models Compared: DeepSeek, Qwen, Kimi, Doubao, and Ernie
https://groundy.com/articles/the-chinese-ai-model-ecosystem-deepseek-qwen-kimi-doubao-and-ernie-compared/
DeepSeek isn't China's only frontier AI. Compare DeepSeek, Qwen, Kimi, Doubao, and Ernie on benchmarks, licensing, API access, and use-case fit.
Tue, 24 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
deepseek, qwen, kimi, doubao, ernie, chinese-ai, alibaba, baidu, bytedance

Running DeepSeek R1 Locally: Hardware Requirements, Quantization, and Real Throughput
https://groundy.com/articles/running-deepseek-r1-locally-hardware-requirements-quantization-and-real-throughput/
What hardware actually runs DeepSeek R1 at useful speeds? Specific token/s benchmarks across GPU configs, quantization options, and the honest tradeoffs.
Tue, 24 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
deepseek, local-inference, quantization, hardware, ollama

Fish-Speech: The Open-Source TTS Model That's Threatening ElevenLabs
https://groundy.com/articles/fish-speech-open-source-tts-model-that-s-threatening/
Fish Audio's S2 model reached SOTA benchmarks in March 2026 with sub-100ms latency, 80+ languages, and open-sourced weights, directly challenging ElevenLabs' commercial dominance while exposing the real costs of 'free' voice AI.
Sun, 15 Mar 2026 00:00:00 GMT | Groundy Editorial | 2026-05-22T00:00:00.000Z
ai-models, open-source, audio-ai

Gemini 2.0 Pro's 2 Million Token Context: What Can You Actually Do With It?
https://groundy.com/articles/gemini-2-0-pro-s-2-million-token-context-what-can-you/
Google's Gemini 2.0 Pro Experimental offers a 2 million token context window. Here is what practitioners have found works, what fails, and where the hard limits are.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
ai-models, google, gemini, nlp, context-window, anthropic

Google's TimesFM: A Foundation Model for Time Series
https://groundy.com/articles/google-s-timesfm-foundation-model-time/
TimesFM is Google's pretrained, decoder-only transformer model for zero-shot time-series forecasting, trained on ~100 billion real-world time-points to deliver accurate predictions across domains without retraining.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-14T00:00:00.000Z
machine-learning, forecasting

Synthetic Data Is Eating AI Training
https://groundy.com/articles/synthetic-data-eating-ai/
The internet's supply of high-quality human-generated text is approaching exhaustion. Synthetic data, AI-generated training corpora, is filling the gap, but introduces new failure modes practitioners must understand, including model collapse and quality drift.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial
machine-learning, training-data

Claude's Web Search Changes Everything for AI Research
https://groundy.com/articles/claude-s-web-search-changes-everything-ai/
Claude Opus 4.8 integrates web search inside the reasoning loop with mandatory citations, domain filtering, and dynamic HTML filtering that cuts token use by 24%.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
ai-models, anthropic, search, research, opus-4-8

DeepSeek V3/R1: How Chinese Engineers Matched GPT-4 for $6 Million
https://groundy.com/articles/deepseek-v3-r1-how-chinese-engineers-matched-gpt-4-6/
DeepSeek's V3 and R1 models match GPT-4-class performance using a fraction of the compute through architectural innovations in Mixture of Experts, attention compression, and reinforcement learning, demonstrating that training efficiency may matter more than raw hardware scale.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-18T00:00:00.000Z
ai-models, deepseek, training, efficiency, china

The Million-Token Context Window: What Can You Actually Do?
https://groundy.com/articles/million-token-context-window-what-can-you-actually/
Million-token context windows let you feed entire codebases, legal contracts, and hours of video to an LLM in one pass, but advertised limits routinely overstate practical capability. Here's what the benchmarks, failure modes, and real deployment patterns actually show.
Fri, 27 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
llm, context-window

Gemini 3.1 Pro: Google's New Reasoning Model Explained
https://groundy.com/articles/gemini-3-1-pro-google-s-new-reasoning-model/
Gemini 3.1 Pro achieves 77.1% on ARC-AGI-2. See how it stacks up against Anthropic's Opus 4.8 (SWE-Bench Pro 69.2%, Terminal-Bench 74.6%) and GPT-5.5.
Thu, 19 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
ai-models, google, reasoning, benchmarks, anthropic

AI Code Generation Benchmarks 2026: Which Model Actually Writes Better Code?
https://groundy.com/articles/ai-code-generation-benchmarks-2026-which-model-actually/
Claude Opus 4.8 leads SWE-Bench Pro at 69.2% as of May 2026, while GPT-5.5 leads Verified at 88.7%. Benchmark scores and real-world coding utility continue to diverge sharply.
Sun, 15 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
ai-research, benchmarks, code-generation, comparison, claude

Kimi Claw: Moonshot AI's Answer to Claude and ChatGPT
https://groundy.com/articles/kimi-claw-moonshot-ai-s-answer-claude/
Moonshot AI's Kimi models offer trillion-parameter scale, open weights, and pricing 33x below Claude Opus 4.8, making it China's leading open-source challenger to Western AI.
Wed, 18 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
ai-models, competition, china, chatbots, benchmarks

WiFi DensePose: Full-Body Tracking Through Walls Using Your Router
https://groundy.com/articles/wifi-densepose-full-body-tracking-through-walls-using-your/
WiFi routers can perform full-body pose estimation through walls using Channel State Information, turning everyday network infrastructure into a covert tracking system.
Wed, 18 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-29T00:00:00.000Z
ai-research, privacy, surveillance, computer-vision, wifi, rf-sensing

The Best AI Models for OpenClaw in 2026
https://groundy.com/articles/best-ai-models-openclaw-2026/
Which LLM to pick for OpenClaw in 2026: Opus 4.8, Kimi K2.5, Gemini 3.1 Pro, GPT-5.4, and budget options ranked by use case and benchmark evidence.
Wed, 11 Feb 2026 00:00:00 GMT | Groundy Editorial | 2026-05-28T00:00:00.000Z
openclaw, llm, ai-models, coding, claude, gpt, gemini