<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Groundy — Models &amp; Research</title><description>Where architecture, training tricks, and eval methodology meet the marketing layer — separating durable progress in foundation models from leaderboard theater that quietly falls apart under load.</description><link>https://groundy.com/</link><language>en-us</language><atom:link href="https://groundy.com/category/models-research/rss.xml" rel="self" type="application/rss+xml"/><item><title>Claude Fable 5 Benchmarks: What FrontierCode, CursorBench, and ViBench Show</title><link>https://groundy.com/articles/claude-fable-5-benchmarks-what-frontiercode-cursorbench-and-vibench-show/</link><guid isPermaLink="true">https://groundy.com/articles/claude-fable-5-benchmarks-what-frontiercode-cursorbench-and-vibench-show/</guid><description>Claude Fable 5 claims top benchmark scores. Verified data shows every model below 14% on FrontierCode Diamond, and partner scores lack public methodology.</description><pubDate>Sat, 13 Jun 2026 03:20:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>claude-fable-5</category><category>frontiercode</category><category>ai-benchmarks</category><category>cursorbench</category><category>anthropic</category><category>ai-coding</category><author>Groundy Editorial</author></item><item><title>Does Attribution Patching Lie? 
A Fix for a Common Interpretability Shortcut</title><link>https://groundy.com/articles/does-attribution-patching-lie-a-fix-for-a-common-interpretability-shortcut/</link><guid isPermaLink="true">https://groundy.com/articles/does-attribution-patching-lie-a-fix-for-a-common-interpretability-shortcut/</guid><description>A June 2026 paper traces attribution patching&apos;s errors to downstream non-linearities and proposes a Hessian-vector-product correction that costs one extra backward pass.</description><pubDate>Sat, 13 Jun 2026 03:20:49 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>attribution-patching</category><category>mechanistic-interpretability</category><category>circuit-discovery</category><category>hessian-correction</category><category>llm-interpretability</category><category>activation-patching</category><author>Groundy Editorial</author></item><item><title>Can You Make a Multimodal Model Unlearn With Activation Steering?</title><link>https://groundy.com/articles/can-you-make-a-multimodal-model-unlearn-with-activation-steering/</link><guid isPermaLink="true">https://groundy.com/articles/can-you-make-a-multimodal-model-unlearn-with-activation-steering/</guid><description>Steering vectors suppress behavior at runtime without editing weights. 
Two 2026 papers show they transfer between models, so suppression alone is not unlearning.</description><pubDate>Sat, 13 Jun 2026 03:20:47 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>activation-steering</category><category>machine-unlearning</category><category>llm-safety</category><category>steering-vectors</category><category>model-evaluation</category><category>multimodal-models</category><author>Groundy Editorial</author></item><item><title>Why Pruning a Model Can Raise Its Out-of-Distribution Accuracy</title><link>https://groundy.com/articles/why-pruning-a-model-can-raise-its-out-of-distribution-accuracy/</link><guid isPermaLink="true">https://groundy.com/articles/why-pruning-a-model-can-raise-its-out-of-distribution-accuracy/</guid><description>Task-aware layer pruning removes distortion-amplifying layers and improves out-of-distribution accuracy, which means standard in-distribution benchmarks miss the real effect.</description><pubDate>Sat, 13 Jun 2026 03:20:45 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>model-pruning</category><category>out-of-distribution</category><category>model-evaluation</category><category>taopioca</category><category>generalization</category><category>representation-geometry</category><author>Groundy Editorial</author></item><item><title>Do Unified Multimodal Models Actually Interleave Understanding and Generation?</title><link>https://groundy.com/articles/do-unified-multimodal-models-actually-interleave-understanding-and-generation/</link><guid isPermaLink="true">https://groundy.com/articles/do-unified-multimodal-models-actually-interleave-understanding-and-generation/</guid><description>IMUG-Bench tests whether unified multimodal models can alternate between understanding and generation in one context, exposing gaps that separate benchmarks conceal.</description><pubDate>Sat, 13 Jun 
2026 03:20:35 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>multimodal-models</category><category>benchmarks</category><category>image-generation</category><category>image-understanding</category><category>imug-bench</category><category>model-evaluation</category><author>Groundy Editorial</author></item><item><title>How LLMs Track Who Did What: The Entity Rebinding Circuit</title><link>https://groundy.com/articles/how-llms-track-who-did-what-the-entity-rebinding-circuit/</link><guid isPermaLink="true">https://groundy.com/articles/how-llms-track-who-did-what-the-entity-rebinding-circuit/</guid><description>New research isolates a compact attention-head circuit for entity rebinding in Gemma and Llama, showing tracking failures stem from a binding step, not context length.</description><pubDate>Wed, 10 Jun 2026 12:05:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>mechanistic-interpretability</category><category>entity-tracking</category><category>attention-heads</category><category>long-context</category><category>llm-circuits</category><category>activation-patching</category><author>Groundy Editorial</author></item><item><title>Claude Fable 5 vs Opus 4.8: When 2x Pricing Is Worth It</title><link>https://groundy.com/articles/claude-fable-5-vs-opus-4-8-when-2x-pricing-is-worth/</link><guid isPermaLink="true">https://groundy.com/articles/claude-fable-5-vs-opus-4-8-when-2x-pricing-is-worth/</guid><description>Claude Fable 5 prices at $10/$50 per million tokens, 2x Opus 4.8. Frontier research, long-context agents, and molecule design clear the bar. 
Standard coding does not.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude</category><category>anthropic</category><category>ai-pricing</category><category>frontier-models</category><category>benchmarks</category><category>agentic-coding</category><author>Groundy Editorial</author></item><item><title>Claude Mythos 5 Access Rules: Who Gets Project Glasswing and Why</title><link>https://groundy.com/articles/claude-mythos-5-access-rules-who-gets-project-glasswing-and-why/</link><guid isPermaLink="true">https://groundy.com/articles/claude-mythos-5-access-rules-who-gets-project-glasswing-and-why/</guid><description>Claude Mythos 5 shares Fable 5&apos;s architecture but with safeguards lifted in select areas. Access requires Project Glasswing approval or a biology research designation.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude</category><category>anthropic</category><category>ai-safety</category><category>frontier-models</category><category>cybersecurity</category><category>biology-ai</category><category>ai-policy</category><author>Groundy Editorial</author></item><item><title>Fable 5 Distillation Protection: How Anthropic Blocks Model Copying</title><link>https://groundy.com/articles/fable-5-distillation-protection-how-anthropic-blocks-model-copying/</link><guid isPermaLink="true">https://groundy.com/articles/fable-5-distillation-protection-how-anthropic-blocks-model-copying/</guid><description>Claude Fable 5 ships with distillation protection to prevent capability extraction. 
A first-principles look at what it is, how it works, and why API consumers should care.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude-fable-5</category><category>anthropic</category><category>model-security</category><category>distillation</category><category>ai-safety</category><category>frontier-models</category><author>Groundy Editorial</author></item><item><title>Skip Fable 5 or Upgrade? When Opus 4.8 and Sonnet 4.6 Are Still Enough</title><link>https://groundy.com/articles/skip-fable-5-or-upgrade-when-opus-4-8-and-sonnet-4-6-are-still-enough/</link><guid isPermaLink="true">https://groundy.com/articles/skip-fable-5-or-upgrade-when-opus-4-8-and-sonnet-4-6-are-still-enough/</guid><description>Claude Fable 5 costs $10/$50 per MTok, exactly double Opus 4.8. Here is how to decide which tier your workload actually needs and when staying put saves real money.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude-fable-5</category><category>claude-opus-4-8</category><category>ai-pricing</category><category>model-selection</category><category>anthropic</category><category>llm-cost</category><author>Groundy Editorial</author></item><item><title>LLM Steganography: Can Defenders Detect Payloads Hidden in Model Output?</title><link>https://groundy.com/articles/llm-steganography-can-defenders-detect-payloads-hidden-in-model-output/</link><guid isPermaLink="true">https://groundy.com/articles/llm-steganography-can-defenders-detect-payloads-hidden-in-model-output/</guid><description>A 2026 proof shows data hidden in LLM output must inflate text complexity. 
A perplexity proxy catches naive encoders, but adaptive adversaries can evade detection.</description><pubDate>Tue, 09 Jun 2026 19:54:36 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>llm-steganography</category><category>steganalysis</category><category>kolmogorov-complexity</category><category>perplexity</category><category>llm-security</category><category>output-channel-security</category><author>Groundy Editorial</author></item><item><title>Do Privacy Defenses Actually Protect Fine-Tuned LLMs? A New Benchmark</title><link>https://groundy.com/articles/do-privacy-defenses-actually-protect-fine-tuned-llms-a-new-benchmark/</link><guid isPermaLink="true">https://groundy.com/articles/do-privacy-defenses-actually-protect-fine-tuned-llms-a-new-benchmark/</guid><description>A June 2026 benchmark shows passing privacy attack probes on fine-tuned LLMs is not a formal guarantee, exposing a compliance gap for teams deploying models on customer data.</description><pubDate>Tue, 09 Jun 2026 13:09:48 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>llm-privacy</category><category>fine-tuning</category><category>differential-privacy</category><category>membership-inference</category><category>model-security</category><category>compliance</category><author>Groundy Editorial</author></item><item><title>Can You Reconstruct an LLM&apos;s System Prompt From Its Activations?</title><link>https://groundy.com/articles/can-you-reconstruct-an-llms-system-prompt-from-its-activations/</link><guid isPermaLink="true">https://groundy.com/articles/can-you-reconstruct-an-llms-system-prompt-from-its-activations/</guid><description>PRISM recovers full instruction sets inside frozen LLMs from hidden states, enabling anyone with activation access to reconstruct system prompts without output probing.</description><pubDate>Tue, 09 Jun 2026 12:45:38 
GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>llm-interpretability</category><category>activation-probes</category><category>system-prompt-extraction</category><category>prism</category><category>model-security</category><category>ai-safety</category><author>Groundy Editorial</author></item><item><title>Does Softmax Normalization Limit What Attention Can Represent?</title><link>https://groundy.com/articles/does-softmax-normalization-limit-what-attention-can-represent/</link><guid isPermaLink="true">https://groundy.com/articles/does-softmax-normalization-limit-what-attention-can-represent/</guid><description>A new paper proves softmax normalization imposes geometric separation bounds on token vectors, constraining what attention can represent as context length grows.</description><pubDate>Tue, 09 Jun 2026 00:55:16 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>softmax-attention</category><category>attention-mechanism</category><category>transformer-architecture</category><category>normalization</category><category>neural-network-theory</category><category>model-architecture</category><author>Groundy Editorial</author></item><item><title>Can an Attacker Steal Your Model&apos;s Last Layer From Its Outputs?</title><link>https://groundy.com/articles/can-an-attacker-steal-your-models-last-layer-from-its-outputs/</link><guid isPermaLink="true">https://groundy.com/articles/can-an-attacker-steal-your-models-last-layer-from-its-outputs/</guid><description>A new geometric proof shows API outputs alone suffice to recover a transformer&apos;s final projection matrix up to a rotation, while deeper layers are provably irrecoverable.</description><pubDate>Mon, 08 Jun 2026 19:37:14 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-08T00:00:00.000Z</atom:updated><category>model-stealing</category><category>model-security</category><category>llm-apis</category><category>ai-safety</category><category>transformer-architecture</category><category>differential-privacy</category><author>Groundy Editorial</author></item><item><title>Can LLMs Leak Training Data? A New Test Splits Capacity From Intent</title><link>https://groundy.com/articles/can-llms-leak-training-data-a-new-test-splits-capacity-from-intent/</link><guid isPermaLink="true">https://groundy.com/articles/can-llms-leak-training-data-a-new-test-splits-capacity-from-intent/</guid><description>PropMe splits memorization audits into capability and propensity, showing that single-metric leakage reports understate what targeted prompts can extract from LLMs.</description><pubDate>Sun, 07 Jun 2026 12:55:22 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-07T00:00:00.000Z</atom:updated><category>llm-memorization</category><category>gdpr-compliance</category><category>model-evaluation</category><category>data-extraction</category><category>training-data</category><category>ai-safety</category><author>Groundy Editorial</author></item><item><title>When an AI Agent&apos;s Tools Break, Can It Recover? 
A New Benchmark</title><link>https://groundy.com/articles/when-an-ai-agents-tools-break-can-it-recover-a-new-benchmark/</link><guid isPermaLink="true">https://groundy.com/articles/when-an-ai-agents-tools-break-can-it-recover-a-new-benchmark/</guid><description>ToolMaze, a new arXiv benchmark, shows LLM agents&apos; recovery rates drop 37% when tools return corrupted data, exposing a gap in how agent reliability is measured.</description><pubDate>Sun, 07 Jun 2026 05:32:08 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-07T00:00:00.000Z</atom:updated><category>llm-agents</category><category>tool-failure</category><category>benchmark</category><category>agent-reliability</category><category>fault-tolerance</category><category>dynamic-replanning</category><author>Groundy Editorial</author></item><item><title>MiniMax M3 Bets on Sparse Attention for 1M Context. Does the Math Hold?</title><link>https://groundy.com/articles/minimax-m3-bets-on-sparse-attention-for-1m-context-does-the-math-hold/</link><guid isPermaLink="true">https://groundy.com/articles/minimax-m3-bets-on-sparse-attention-for-1m-context-does-the-math-hold/</guid><description>MiniMax claims M3 handles 1M tokens via sparse attention, but published no technical report or independent benchmarks. Retrieval quality at full context is unverified.</description><pubDate>Sat, 06 Jun 2026 08:46:35 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>sparse-attention</category><category>minimax-m3</category><category>long-context</category><category>llm-benchmarks</category><category>inference-cost</category><category>retrieval-quality</category><author>Groundy Editorial</author></item><item><title>Can One Model Handle Every CAD Task? 
UniCAD Tests It</title><link>https://groundy.com/articles/can-one-model-handle-every-cad-task-unicad-tests/</link><guid isPermaLink="true">https://groundy.com/articles/can-one-model-handle-every-cad-task-unicad-tests/</guid><description>UniCAD introduces a unified benchmark and single multi-modal model for CAD reconstruction, generation, and question answering, challenging the field&apos;s per-task silos.</description><pubDate>Sat, 06 Jun 2026 08:00:51 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>cad</category><category>multi-modal-models</category><category>deep-learning</category><category>benchmarks</category><category>3d-reconstruction</category><category>generative-design</category><author>Groundy Editorial</author></item><item><title>Do Foundation Models Actually Learn Relational Structure In-Context?</title><link>https://groundy.com/articles/do-foundation-models-actually-learn-relational-structure-in-context/</link><guid isPermaLink="true">https://groundy.com/articles/do-foundation-models-actually-learn-relational-structure-in-context/</guid><description>OpenRFM shows relational in-context learning collapses on sparse joins and introduces a dual-stage architecture that surpasses the commercial KumoRFMv1 baseline.</description><pubDate>Sat, 06 Jun 2026 06:39:42 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>relational-foundation-models</category><category>in-context-learning</category><category>tabular-models</category><category>openrfm</category><category>relational-learning</category><category>pre-training</category><author>Groundy Editorial</author></item><item><title>Can LLMs Write Better Research Paper Titles Than Authors?</title><link>https://groundy.com/articles/can-llms-write-better-research-paper-titles-than-authors/</link><guid 
isPermaLink="true">https://groundy.com/articles/can-llms-write-better-research-paper-titles-than-authors/</guid><description>A new study claims LLMs write &apos;appropriate&apos; research titles, but the evidence rests on similarity metrics that measure pattern matching, not whether titles actually serve.</description><pubDate>Sat, 06 Jun 2026 05:36:20 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>llm-evaluation</category><category>research-titles</category><category>academic-publishing</category><category>text-metrics</category><category>pegasus</category><category>arxiv</category><author>Groundy Editorial</author></item><item><title>Does Information-Theoretic Example Selection Beat kNN for In-Context Learning?</title><link>https://groundy.com/articles/does-information-theoretic-example-selection-beat-knn-for-in-context-learning/</link><guid isPermaLink="true">https://groundy.com/articles/does-information-theoretic-example-selection-beat-knn-for-in-context-learning/</guid><description>KITE swaps cosine-similarity kNN for a kernelized information-theoretic selector in few-shot prompting, reporting classification gains but adding inference compute overhead.</description><pubDate>Sat, 06 Jun 2026 05:21:48 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>in-context-learning</category><category>few-shot-learning</category><category>rag</category><category>example-selection</category><category>kernel-methods</category><category>retrieval-optimization</category><category>classification</category><author>Groundy Editorial</author></item><item><title>Do Concept Bottleneck Model Benchmarks Measure Interpretability or Dataset Bias?</title><link>https://groundy.com/articles/do-concept-bottleneck-model-benchmarks-measure-interpretability-or-dataset-bias/</link><guid 
isPermaLink="true">https://groundy.com/articles/do-concept-bottleneck-model-benchmarks-measure-interpretability-or-dataset-bias/</guid><description>Standard concept bottleneck model benchmarks confound genuine concept learning with dataset shortcuts. Synthetic benchmarks from Skirzynski et al. expose the gap.</description><pubDate>Sat, 06 Jun 2026 03:27:18 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>concept-bottleneck-models</category><category>interpretability</category><category>synthetic-benchmarks</category><category>information-leakage</category><category>model-evaluation</category><category>confounding</category><author>Groundy Editorial</author></item><item><title>Continuous Bit-Width Quantization vs Fixed INT4: Does LiftQuant Beat Discrete?</title><link>https://groundy.com/articles/continuous-bit-width-quantization-vs-fixed-int4-does-liftquant-beat-discrete/</link><guid isPermaLink="true">https://groundy.com/articles/continuous-bit-width-quantization-vs-fixed-int4-does-liftquant-beat-discrete/</guid><description>LiftQuant replaces 2/4/8-bit quantization with continuous bit-width via dimensional lifting. A 70B model at 2.4 bits fits 24GB. 
Kernel support is the bottleneck.</description><pubDate>Sat, 06 Jun 2026 01:26:39 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>quantization</category><category>llm-inference</category><category>liftquant</category><category>mixed-precision</category><category>model-compression</category><category>sub-4bit-quantization</category><author>Groundy Editorial</author></item><item><title>Federated Learning for Industrial IoT Anomaly Detection: The Data-Locality Tradeoff</title><link>https://groundy.com/articles/federated-learning-for-industrial-iot-anomaly-detection-the-data-locality/</link><guid isPermaLink="true">https://groundy.com/articles/federated-learning-for-industrial-iot-anomaly-detection-the-data-locality/</guid><description>A DEXA 2026 paper proposes a cyclic-dynamics benchmark for federated anomaly detection, exposing the gap between on-site compliance gains and unknown convergence costs.</description><pubDate>Fri, 05 Jun 2026 23:29:18 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>federated-learning</category><category>anomaly-detection</category><category>industrial-iot</category><category>time-series</category><category>data-locality</category><category>cyclic-dynamics</category><author>Groundy Editorial</author></item><item><title>Reading Failed LLM Reasoning Traces Won&apos;t Tell You Which Ones RL Can Fix</title><link>https://groundy.com/articles/reading-failed-llm-reasoning-traces-wont-tell-you-which-ones-rl-can-fix/</link><guid isPermaLink="true">https://groundy.com/articles/reading-failed-llm-reasoning-traces-wont-tell-you-which-ones-rl-can-fix/</guid><description>A new preprint finds the fixability of failed LLM reasoning rollouts under RL is predictable from distributional statistics, not from reading chain-of-thought text.</description><pubDate>Fri, 05 Jun 2026 18:08:14 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>reasoning-rl</category><category>chain-of-thought</category><category>reinforcement-learning</category><category>post-training</category><category>process-reward-models</category><category>llm-reasoning</category><category>test-time-compute</category><author>Groundy Editorial</author></item><item><title>Can You Stitch Two Foundation Models Together Without Retraining?</title><link>https://groundy.com/articles/can-you-stitch-two-foundation-models-together-without-retraining/</link><guid isPermaLink="true">https://groundy.com/articles/can-you-stitch-two-foundation-models-together-without-retraining/</guid><description>Splicing layers from independently trained foundation models fails without targeted training at the join point. A two-stage recipe called Final Feature Matching makes it work.</description><pubDate>Fri, 05 Jun 2026 16:44:35 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>model-stitching</category><category>foundation-models</category><category>vision-models</category><category>model-merging</category><category>transfer-learning</category><category>representation-learning</category><author>Groundy Editorial</author></item><item><title>Do Reasoning LLMs Waste Tokens? 
OckBench Tries to Measure It</title><link>https://groundy.com/articles/do-reasoning-llms-waste-tokens-ockbench-tries-to-measure/</link><guid isPermaLink="true">https://groundy.com/articles/do-reasoning-llms-waste-tokens-ockbench-tries-to-measure/</guid><description>OckBench scores 37 reasoning LLMs on token efficiency alongside accuracy, finding comparably accurate models differ by up to 26× in token cost under per-token billing.</description><pubDate>Fri, 05 Jun 2026 08:44:51 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>ockbench</category><category>llm-reasoning</category><category>token-efficiency</category><category>model-benchmarking</category><category>inference-cost</category><category>reasoning-models</category><author>Groundy Editorial</author></item><item><title>Which Layer Detects LLM Hallucinations Best? The Case Against Fixed-Layer Probes</title><link>https://groundy.com/articles/which-layer-detects-llm-hallucinations-best-the-case-against-fixed-layer-probes/</link><guid isPermaLink="true">https://groundy.com/articles/which-layer-detects-llm-hallucinations-best-the-case-against-fixed-layer-probes/</guid><description>An ICML 2026 paper finds that fixed-layer hallucination probes miss detection signal, and proposes FEPoID, a training-free method to calibrate layer choice per model.</description><pubDate>Thu, 04 Jun 2026 10:09:05 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-04T00:00:00.000Z</atom:updated><category>hallucination-detection</category><category>hidden-states</category><category>llm-probes</category><category>intrinsic-dimension</category><category>transformer-layers</category><category>icml-2026</category><author>Groundy Editorial</author></item><item><title>Cross-Domain RL Training Degrades Capabilities. 
CARE-RL Reweights to Fix It</title><link>https://groundy.com/articles/cross-domain-rl-training-degrades-capabilities-care-rl-reweights-to-fix/</link><guid isPermaLink="true">https://groundy.com/articles/cross-domain-rl-training-degrades-capabilities-care-rl-reweights-to-fix/</guid><description>CARE-RL shows that pooling math, code, and chat into one RL run causes silent capability erosion across domains, and proposes gradient subspace projection to reweight updates.</description><pubDate>Wed, 03 Jun 2026 18:17:35 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>reinforcement-learning</category><category>multi-domain-training</category><category>capability-interference</category><category>gradient-editing</category><category>llm-post-training</category><category>reward-signal-design</category><author>Groundy Editorial</author></item><item><title>LLM Watermarking Without Quality Loss: The Non-Distortionary Approach</title><link>https://groundy.com/articles/llm-watermarking-without-quality-loss-the-non-distortionary-approach/</link><guid isPermaLink="true">https://groundy.com/articles/llm-watermarking-without-quality-loss-the-non-distortionary-approach/</guid><description>LUNA&apos;s POS-adaptive watermark claims AUROC 0.9959 with 0.045 perplexity shift across six languages, but paraphrase robustness remains untested for all distortion-free schemes.</description><pubDate>Wed, 03 Jun 2026 14:51:29 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>llm-watermarking</category><category>ai-content-detection</category><category>provenance-tracking</category><category>text-generation</category><category>nlp-research</category><category>multilingual-nlp</category><author>Groundy Editorial</author></item><item><title>Treating LLM Agent Memory as a Database: The VikingMem 
Approach</title><link>https://groundy.com/articles/treating-llm-agent-memory-as-a-database-the-vikingmem-approach/</link><guid isPermaLink="true">https://groundy.com/articles/treating-llm-agent-memory-as-a-database-the-vikingmem-approach/</guid><description>VikingMem treats LLM agent memory as a database with events, entities, and temporal compression, reporting up to 30% better retrieval than current approaches.</description><pubDate>Tue, 02 Jun 2026 16:32:05 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>llm-agents</category><category>agent-memory</category><category>vikingmem</category><category>vector-databases</category><category>memory-management</category><category>vldb-2026</category><author>Groundy Editorial</author></item><item><title>Can a Language Model Work Without a Neural Network? A New arXiv Paper Says Yes</title><link>https://groundy.com/articles/can-a-language-model-work-without-a-neural-network-a-new-arxiv-paper-says-yes/</link><guid isPermaLink="true">https://groundy.com/articles/can-a-language-model-work-without-a-neural-network-a-new-arxiv-paper-says-yes/</guid><description>A single-author arXiv preprint claims an RBF-network variant can build a language model without backpropagation-trained deep nets, solving for global loss optimum in one step.</description><pubDate>Tue, 02 Jun 2026 15:03:29 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>rbf-networks</category><category>language-models</category><category>transformer-alternatives</category><category>arxiv</category><category>machine-learning-architecture</category><category>backpropagation</category><author>Groundy Editorial</author></item><item><title>Can Code-Generating LLMs Do Engineering Math? 
FEM-Bench Tests Them</title><link>https://groundy.com/articles/can-code-generating-llms-do-engineering-math-fem-bench-tests-them/</link><guid isPermaLink="true">https://groundy.com/articles/can-code-generating-llms-do-engineering-math-fem-bench-tests-them/</guid><description>FEM-Bench tests 33 finite element tasks and finds Gemini 3 Pro solved 30 in five attempts. The risk: LLM solvers that compile and run but return physically wrong results.</description><pubDate>Tue, 02 Jun 2026 14:22:58 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>fem-bench</category><category>code-generation</category><category>llm-benchmarks</category><category>finite-element-method</category><category>scientific-computing</category><category>llm-evaluation</category><author>Groundy Editorial</author></item><item><title>Unlearning Isn&apos;t Deletion: arXiv 2505.16831 Shows Machine Unlearning in LLMs Is Reversible</title><link>https://groundy.com/articles/unlearning-isnt-deletion-arxiv-2505-16831-shows-machine-unlearning-in-llms/</link><guid isPermaLink="true">https://groundy.com/articles/unlearning-isnt-deletion-arxiv-2505-16831-shows-machine-unlearning-in-llms/</guid><description>Two independent studies confirm machine unlearning methods suppress outputs without erasing internal representations, making GDPR compliance claims unverifiable.</description><pubDate>Tue, 02 Jun 2026 12:36:09 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>machine-unlearning</category><category>llm-security</category><category>gdpr-compliance</category><category>model-representation</category><category>ai-safety</category><category>data-privacy</category><author>Groundy Editorial</author></item><item><title>Why LLMs Fail at Spatial Reasoning When Planning 
Navigation</title><link>https://groundy.com/articles/why-llms-fail-at-spatial-reasoning-when-planning-navigation/</link><guid isPermaLink="true">https://groundy.com/articles/why-llms-fail-at-spatial-reasoning-when-planning-navigation/</guid><description>LLMs fail at spatial navigation because training text encodes geometry poorly. Two papers show explicit structural scaffolding, not prompt tweaks, is the fix teams need.</description><pubDate>Mon, 01 Jun 2026 21:05:13 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>spatial-reasoning</category><category>llm-navigation</category><category>inductive-bias</category><category>embodied-agents</category><category>search-trees</category><category>training-data-bias</category><author>Groundy Editorial</author></item><item><title>Does Giving AI Agents More Skills Help? A Controlled SkillsBench Study</title><link>https://groundy.com/articles/does-giving-ai-agents-more-skills-help-a-controlled-skillsbench-study/</link><guid isPermaLink="true">https://groundy.com/articles/does-giving-ai-agents-more-skills-help-a-controlled-skillsbench-study/</guid><description>SkillsBench study: curated skills lift agent pass rates 18 to 36 pp, while documentation granularity shifts outcomes under 1 pp. Curation, not polish, is the bottleneck.</description><pubDate>Mon, 01 Jun 2026 10:49:42 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>agent-skills</category><category>llm-benchmarks</category><category>skillsbench</category><category>ai-agents</category><category>model-evaluation</category><category>skill-catalogs</category><author>Groundy Editorial</author></item><item><title>Can an LLM Peer-Review Your Paper? 
A New Behavior Benchmark</title><link>https://groundy.com/articles/can-an-llm-peer-review-your-paper-a-new-behavior-benchmark/</link><guid isPermaLink="true">https://groundy.com/articles/can-an-llm-peer-review-your-paper-a-new-behavior-benchmark/</guid><description>PRAIB benchmarks LLM-generated peer reviews across 11,000 reviews on 1,000 papers, finding positive bias, compressed variance, and systematically overlooked weaknesses.</description><pubDate>Sun, 31 May 2026 13:26:57 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>peer-review</category><category>llm-benchmark</category><category>ai-peer-review</category><category>research-integrity</category><category>conference-review</category><category>llm-bias</category><author>Groundy Editorial</author></item><item><title>Anthropic Scaled Sparse Autoencoders to Claude 3 Sonnet. Interpretability Now Costs Compute</title><link>https://groundy.com/articles/anthropic-scaled-sparse-autoencoders-to-claude-3-sonnet-interpretability-now/</link><guid isPermaLink="true">https://groundy.com/articles/anthropic-scaled-sparse-autoencoders-to-claude-3-sonnet-interpretability-now/</guid><description>Anthropic extracted 34M interpretable features from Claude 3 Sonnet, proving sparse autoencoders work on production models. 
Interpretability now has its own compute budget.</description><pubDate>Sun, 31 May 2026 10:23:06 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>sparse-autoencoders</category><category>mechanistic-interpretability</category><category>claude-3-sonnet</category><category>ai-safety</category><category>dictionary-learning</category><category>anthropic</category><author>Groundy Editorial</author></item><item><title>Tracing Why LLM Agent Memory Fails: A Method for Attributing Errors</title><link>https://groundy.com/articles/tracing-why-llm-agent-memory-fails-a-method-for-attributing-errors/</link><guid isPermaLink="true">https://groundy.com/articles/tracing-why-llm-agent-memory-fails-a-method-for-attributing-errors/</guid><description>MemTrace constructs provenance graphs across every memory operation in an LLM agent, tracing wrong answers to the exact operation that corrupted state across sessions.</description><pubDate>Fri, 29 May 2026 17:30:46 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>llm-memory</category><category>debugging</category><category>rag</category><category>agent-frameworks</category><category>error-attribution</category><category>provenance</category><author>Groundy Editorial</author></item><item><title>Persona Prompts Change Who an LLM Recommends as an Expert</title><link>https://groundy.com/articles/persona-prompts-change-who-an-llm-recommends-as-an-expert/</link><guid isPermaLink="true">https://groundy.com/articles/persona-prompts-change-who-an-llm-recommends-as-an-expert/</guid><description>A 43-model audit finds that geographic and role framing in LLM prompts systematically shifts which scholars get recommended as experts, with no neutral default.</description><pubDate>Fri, 29 May 2026 15:41:56 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>llm-bias</category><category>persona-prompts</category><category>expert-recommendation</category><category>scholar-discovery</category><category>ai-fairness</category><category>recommendation-systems</category><author>Groundy Editorial</author></item><item><title>Opus 4.8 vs Opus 4.7: What Changed and What Did Not</title><link>https://groundy.com/articles/opus-4-8-vs-opus-4-7-what-changed-and-what-did-not/</link><guid isPermaLink="true">https://groundy.com/articles/opus-4-8-vs-opus-4-7-what-changed-and-what-did-not/</guid><description>Anthropic&apos;s Opus 4.8 raises SWE-Bench Pro from 64.3% to 69.2% and cuts code-flaw pass-through fourfold at unchanged $5/$25 pricing. A fast mode at $10/$50 runs 2.5x quicker.</description><pubDate>Thu, 28 May 2026 13:30:41 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude</category><category>anthropic</category><category>opus-48</category><category>model-release</category><category>benchmarks</category><category>agentic-coding</category><author>Groundy Editorial</author></item><item><title>Opus 4.8 Batch API: 1M Context, 300k Output, and Team Cost Controls</title><link>https://groundy.com/articles/opus-4-8-batch-api-1m-context-300k-output-and-team-cost-controls/</link><guid isPermaLink="true">https://groundy.com/articles/opus-4-8-batch-api-1m-context-300k-output-and-team-cost-controls/</guid><description>Opus 4.8 has a 1M token context window (200k on Foundry), 128k standard output, and 300k output via Batch API beta. January 2026 cutoff. 
Batch design and quota allocation.</description><pubDate>Thu, 28 May 2026 12:14:25 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude-opus</category><category>batch-api</category><category>anthropic</category><category>rate-limits</category><category>context-window</category><category>model-release</category><category>team-infrastructure</category><author>Groundy Editorial</author></item><item><title>Scale Vectors: Tiny Parameter Subsets That Disproportionately Steer LLM Behavior</title><link>https://groundy.com/articles/scale-vectors-tiny-parameter-subsets-that-disproportionately-steer-llm-behavior/</link><guid isPermaLink="true">https://groundy.com/articles/scale-vectors-tiny-parameter-subsets-that-disproportionately-steer-llm-behavior/</guid><description>Scale vectors are a negligible parameter class in LLM normalization layers whose outsized optimization role makes them high-value targets for quantization and safety editing.</description><pubDate>Wed, 27 May 2026 19:21:51 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-28T00:00:00.000Z</atom:updated><category>scale-vectors</category><category>llm-quantization</category><category>mechanistic-interpretability</category><category>model-normalization</category><category>model-compression</category><category>llm-training</category><author>Groundy Editorial</author></item><item><title>One Learning Rate Doesn&apos;t Fit All: Heavy-Tail Layerwise LR Schedules for LLM Pretraining</title><link>https://groundy.com/articles/one-learning-rate-doesnt-fit-all-heavy-tail-layerwise-lr-schedules-for-llm/</link><guid isPermaLink="true">https://groundy.com/articles/one-learning-rate-doesnt-fit-all-heavy-tail-layerwise-lr-schedules-for-llm/</guid><description>LLR assigns per-layer learning rates from spectral heavy-tail diagnostics during LLM pretraining, achieving 1.5x faster convergence and up to 2 pp higher zero-shot 
accuracy.</description><pubDate>Wed, 27 May 2026 14:01:42 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-27T00:00:00.000Z</atom:updated><category>llm-pretraining</category><category>learning-rate</category><category>spectral-analysis</category><category>optimizer</category><category>transformer-training</category><category>icml-2026</category><author>Groundy Editorial</author></item><item><title>Audio LLMs Break When the Codec Changes: A Robustness Vector Voice-AI Teams Haven&apos;t Tested</title><link>https://groundy.com/articles/audio-llms-break-when-the-codec-changes-a-robustness-vector-voice-ai-teams/</link><guid isPermaLink="true">https://groundy.com/articles/audio-llms-break-when-the-codec-changes-a-robustness-vector-voice-ai-teams/</guid><description>CodecAttack achieves 85.5% attack success on audio LLMs by optimizing in codec latent space, with 100% zero-shot transfer to MP3, proving lossy compression fails as a defense.</description><pubDate>Tue, 26 May 2026 11:55:22 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>adversarial-audio</category><category>audio-llms</category><category>codec-robustness</category><category>voice-ai</category><category>adversarial-ml</category><category>audio-security</category><author>Groundy Editorial</author></item><item><title>Do LLMs Know What Not to Say? 
Causal Evidence for Statistical Preemption</title><link>https://groundy.com/articles/do-llms-know-what-not-to-say-causal-evidence-for-statistical-preemption/</link><guid isPermaLink="true">https://groundy.com/articles/do-llms-know-what-not-to-say-causal-evidence-for-statistical-preemption/</guid><description>New causal evidence shows LLMs suppress wrong continuations during pretraining via statistical preemption, suggesting output-layer safety fixes may target the wrong layer.</description><pubDate>Tue, 26 May 2026 10:04:15 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>statistical-preemption</category><category>llm-safety</category><category>model-interpretability</category><category>hallucination</category><category>causal-probing</category><category>pretraining</category><author>Groundy Editorial</author></item><item><title>Embedding Compression at Training Time: DIVE&apos;s Gradient Trick vs Post-Hoc Quantization for Vector DBs</title><link>https://groundy.com/articles/embedding-compression-at-training-time-dives-gradient-trick-vs-post-hoc/</link><guid isPermaLink="true">https://groundy.com/articles/embedding-compression-at-training-time-dives-gradient-trick-vs-post-hoc/</guid><description>DIVE&apos;s gradient-limited adapter outperforms baselines for embedding compression, but training-time methods lock RAG pipelines to specific adapters and raise refresh costs.</description><pubDate>Mon, 25 May 2026 20:33:59 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>embedding-compression</category><category>rag</category><category>vector-databases</category><category>dive</category><category>adapter-methods</category><author>Groundy Editorial</author></item><item><title>μP Hyperparameter Transfer Has an Embedding Layer Hole, New arXiv Paper 
Says</title><link>https://groundy.com/articles/p-hyperparameter-transfer-has-an-embedding-layer-hole-new-arxiv-paper-says/</link><guid isPermaLink="true">https://groundy.com/articles/p-hyperparameter-transfer-has-an-embedding-layer-hole-new-arxiv-paper-says/</guid><description>An arXiv paper shows the embedding learning rate accounts for most of μP&apos;s advantage over standard parameterization, and a single scaling fix recovers the bulk of the benefit.</description><pubDate>Mon, 25 May 2026 20:07:07 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>mup</category><category>hyperparameter-transfer</category><category>embedding-layer</category><category>adamw</category><category>model-scaling</category><category>training-optimization</category><author>Groundy Editorial</author></item><item><title>Project Glasswing One Month In: AI Bug Discovery Has Outpaced the Patch Pipeline</title><link>https://groundy.com/articles/project-glasswing-one-month-in-ai-bug-discovery-has-outpaced-the-patch-pipeline/</link><guid isPermaLink="true">https://groundy.com/articles/project-glasswing-one-month-in-ai-bug-discovery-has-outpaced-the-patch-pipeline/</guid><description>Anthropic&apos;s Glasswing found over 10,000 high-severity vulnerabilities in one month. Only 97 are patched. The bottleneck shifted from discovery to triage, and it is structural.</description><pubDate>Sun, 24 May 2026 21:12:14 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>ai-security</category><category>vulnerability-disclosure</category><category>anthropic</category><category>claude-mythos</category><category>cybersecurity</category><category>interpretability</category><author>Groundy Editorial</author></item></channel></rss>