<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Groundy — Agents &amp; Frameworks</title><description>Independent comparisons of agent stacks and multi-agent designs, tracking the gap between framework marketing and the failure modes that show up under real workloads.</description><link>https://groundy.com/</link><language>en-us</language><atom:link href="https://groundy.com/category/agents-frameworks/rss.xml" rel="self" type="application/rss+xml"/><item><title>Why Production AI Agents Fail Silently and Your Logs Never Catch It</title><link>https://groundy.com/articles/why-production-ai-agents-fail-silently-and-your-logs-never-catch/</link><guid isPermaLink="true">https://groundy.com/articles/why-production-ai-agents-fail-silently-and-your-logs-never-catch/</guid><description>Production LLM agents report success on tasks that never completed and emit no error, so detection must move off exception pipelines onto independent state verification.</description><pubDate>Mon, 15 Jun 2026 16:00:43 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-15T00:00:00.000Z</atom:updated><category>llm-agents</category><category>silent-failures</category><category>agent-observability</category><category>agent-evaluation</category><category>autonomous-agents</category><category>agent-reliability</category><author>Groundy Editorial</author></item><item><title>Computer-Use Agents Fabricate Success on 8 to 33 Percent of Long-Horizon Tasks</title><link>https://groundy.com/articles/computer-use-agents-fabricate-success-on-8-to-33-percent-of-long-horizon-tasks/</link><guid isPermaLink="true">https://groundy.com/articles/computer-use-agents-fabricate-success-on-8-to-33-percent-of-long-horizon-tasks/</guid><description>June 2026 research finds computer-use agents fabricate success on 8 to 33 percent of long-horizon tasks, a failure class invisible to 
single-action benchmarks.</description><pubDate>Sat, 13 Jun 2026 03:20:54 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>agent-evaluation</category><category>fabrication-detection</category><category>computer-use-agents</category><category>long-horizon-tasks</category><category>agent-reliability</category><category>agent-governance</category><author>Groundy Editorial</author></item><item><title>Can AI Agents Share Context Without a Central Coordinator?</title><link>https://groundy.com/articles/can-ai-agents-share-context-without-a-central-coordinator/</link><guid isPermaLink="true">https://groundy.com/articles/can-ai-agents-share-context-without-a-central-coordinator/</guid><description>DeLM replaces central multi-agent coordinators with shared context, posting 10.5-point SWE-bench gains at half cost. Consistency, stale reads, and write conflicts remain.</description><pubDate>Sat, 13 Jun 2026 03:20:32 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>decentralized-coordination</category><category>delm</category><category>shared-context</category><category>llm-frameworks</category><category>consistency</category><author>Groundy Editorial</author></item><item><title>Why Skill Creation and Reward Optimization Collide in Agentic RL</title><link>https://groundy.com/articles/why-skill-creation-and-reward-optimization-collide-in-agentic/</link><guid isPermaLink="true">https://groundy.com/articles/why-skill-creation-and-reward-optimization-collide-in-agentic/</guid><description>ReSkill shows decoupled skill creation in agentic RL degrades reward when skills drift from the evolving policy, and proposes assertion-driven co-optimization inside GRPO.</description><pubDate>Sat, 13 Jun 2026 03:20:30 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-13T00:00:00.000Z</atom:updated><category>reinforcement-learning</category><category>agentic-rl</category><category>skill-creation</category><category>grpo</category><category>agent-frameworks</category><category>policy-optimization</category><author>Groundy Editorial</author></item><item><title>When AI Agents Delegate Work, Your Observability Stack Goes Blind</title><link>https://groundy.com/articles/when-ai-agents-delegate-work-your-observability-stack-goes-blind/</link><guid isPermaLink="true">https://groundy.com/articles/when-ai-agents-delegate-work-your-observability-stack-goes-blind/</guid><description>Standard traces cannot attribute actions to specific agents after delegation, a June 2026 paper proves. Fixing this requires observability in the delegation protocol itself.</description><pubDate>Wed, 10 Jun 2026 01:27:18 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>agent-observability</category><category>multi-agent-systems</category><category>distributed-tracing</category><category>agent-delegation</category><category>ai-reliability</category><category>apm</category><author>Groundy Editorial</author></item><item><title>Bloomberg&apos;s Pomona Makes Small Automated Code Changes, Not Big Agent PRs</title><link>https://groundy.com/articles/bloombergs-pomona-makes-small-automated-code-changes-not-big-agent-prs/</link><guid isPermaLink="true">https://groundy.com/articles/bloombergs-pomona-makes-small-automated-code-changes-not-big-agent-prs/</guid><description>Bloomberg&apos;s Pomona agent limits diffs to 10 lines and merged 88% of PRs in production, proving small, bounded edits earn reviewer trust faster than large autonomous refactors.</description><pubDate>Tue, 09 Jun 2026 10:31:27 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>coding-agents</category><category>code-quality</category><category>pull-requests</category><category>automated-refactoring</category><category>bloomberg</category><category>technical-debt</category><author>Groundy Editorial</author></item><item><title>Agent Tool-Gating Moves From Prompt Rules to Learned Policies</title><link>https://groundy.com/articles/agent-tool-gating-moves-from-prompt-rules-to-learned-policies/</link><guid isPermaLink="true">https://groundy.com/articles/agent-tool-gating-moves-from-prompt-rules-to-learned-policies/</guid><description>PROVE and AgentTrust show learned policies beat hand-tuned rules for gating AI agent tool calls, but the gains depend on calibration that neither paper measures.</description><pubDate>Tue, 09 Jun 2026 09:30:20 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>tool-calling</category><category>reinforcement-learning</category><category>ai-agents</category><category>agent-frameworks</category><category>calibration</category><category>mcp-servers</category><author>Groundy Editorial</author></item><item><title>More Capable LLMs Cooperate Less in Zero-Cost Collaboration Tests</title><link>https://groundy.com/articles/more-capable-llms-cooperate-less-in-zero-cost-collaboration-tests/</link><guid isPermaLink="true">https://groundy.com/articles/more-capable-llms-cooperate-less-in-zero-cost-collaboration-tests/</guid><description>ICML 2026 research finds o3 achieves only 17% of optimal cooperation while weaker o3-mini hits 50%, proving model capability does not predict multi-agent coordination.</description><pubDate>Tue, 09 Jun 2026 06:44:18 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-09T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>llm-coordination</category><category>icml-2026</category><category>agent-frameworks</category><category>crewai</category><category>autogen</category><category>cooperation-failures</category><author>Groundy Editorial</author></item><item><title>Why Foundation Model Agents Pass Benchmarks but Fail in Production</title><link>https://groundy.com/articles/why-foundation-model-agents-pass-benchmarks-but-fail-in-production/</link><guid isPermaLink="true">https://groundy.com/articles/why-foundation-model-agents-pass-benchmarks-but-fail-in-production/</guid><description>A June 2026 paper frames the AI agent benchmark gap as a sim-to-real problem, giving eval teams a four-part MDP checklist to challenge vendor claims before live deployment.</description><pubDate>Mon, 08 Jun 2026 07:11:52 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-08T00:00:00.000Z</atom:updated><category>agent-evaluation</category><category>sim-to-real</category><category>benchmark-gap</category><category>mdp</category><category>procurement</category><category>deployment-reliability</category><author>Groundy Editorial</author></item><item><title>Can AI Agents Repair Broken Network Configs? 
A New Benchmark Tests It</title><link>https://groundy.com/articles/can-ai-agents-repair-broken-network-configs-a-new-benchmark-tests/</link><guid isPermaLink="true">https://groundy.com/articles/can-ai-agents-repair-broken-network-configs-a-new-benchmark-tests/</guid><description>LLM agents with formal verification repair 12% more network misconfigurations than base models and are 17% safer, but regress on large topologies, limiting production use.</description><pubDate>Sun, 07 Jun 2026 20:06:55 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-07T00:00:00.000Z</atom:updated><category>network-configuration</category><category>llm-agents</category><category>formal-verification</category><category>network-automation</category><category>benchmark</category><category>netops</category><author>Groundy Editorial</author></item><item><title>Can Self-Evolving AI Agents Drift Without a Human in the Loop?</title><link>https://groundy.com/articles/can-self-evolving-ai-agents-drift-without-a-human-in-the-loop/</link><guid isPermaLink="true">https://groundy.com/articles/can-self-evolving-ai-agents-drift-without-a-human-in-the-loop/</guid><description>Self-evolving AI agents drift without checkpoints: 94% of reviewers miss agent sabotage, safety hardening does not transfer across domains, and stale memory degrades tasks.</description><pubDate>Sun, 07 Jun 2026 16:38:03 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>self-evolving-agents</category><category>ai-safety</category><category>agent-drift</category><category>human-in-the-loop</category><category>adversarial-agents</category><category>memory-alignment</category><author>Groundy Editorial</author></item><item><title>Fine-Tuning Multi-Agent LLM Systems: RL Enters Where Prompt Tweaks Stall</title><link>https://groundy.com/articles/fine-tuning-multi-agent-llm-systems-rl-enters-where-prompt-tweaks-stall/</link><guid 
isPermaLink="true">https://groundy.com/articles/fine-tuning-multi-agent-llm-systems-rl-enters-where-prompt-tweaks-stall/</guid><description>MARFT reframes multi-agent LLM reliability as an RL problem over agent topologies, moving the bottleneck from prompt iteration to training infrastructure most teams lack.</description><pubDate>Sat, 06 Jun 2026 18:57:48 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>reinforcement-learning</category><category>llm-fine-tuning</category><category>marft</category><category>agent-frameworks</category><category>reward-design</category><author>Groundy Editorial</author></item><item><title>Cascading Hallucination in Agentic RAG: When One Bad Retrieval Poisons the Chain</title><link>https://groundy.com/articles/cascading-hallucination-in-agentic-rag-when-one-bad-retrieval-poisons-the-chain/</link><guid isPermaLink="true">https://groundy.com/articles/cascading-hallucination-in-agentic-rag-when-one-bad-retrieval-poisons-the-chain/</guid><description>The CHARM paper shows per-step grounding checks in multi-hop RAG miss over 80% of cascaded errors, where one fabricated retrieval compounds across reasoning hops.</description><pubDate>Sat, 06 Jun 2026 01:56:52 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-06T00:00:00.000Z</atom:updated><category>rag</category><category>hallucination</category><category>agentic-rag</category><category>retrieval-augmented-generation</category><category>llm-reliability</category><category>multi-hop-reasoning</category><author>Groundy Editorial</author></item><item><title>Can AI Agents Build Other Agents? 
The Meta-Agent Challenge Says Mostly Not Yet</title><link>https://groundy.com/articles/can-ai-agents-build-other-agents-the-meta-agent-challenge-says-mostly-not-yet/</link><guid isPermaLink="true">https://groundy.com/articles/can-ai-agents-build-other-agents-the-meta-agent-challenge-says-mostly-not-yet/</guid><description>The Meta-Agent Challenge finds current AI models cannot autonomously build agents, undercutting vendor claims of agent-building automation and revealing reward-hacking risks.</description><pubDate>Fri, 05 Jun 2026 17:24:08 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-05T00:00:00.000Z</atom:updated><category>meta-agents</category><category>ai-agents</category><category>ai-benchmarks</category><category>recursive-self-improvement</category><category>reward-hacking</category><category>agent-frameworks</category><author>Groundy Editorial</author></item><item><title>When MCP Tool Descriptions Don&apos;t Match the Code, Agents Trust the Lie</title><link>https://groundy.com/articles/when-mcp-tool-descriptions-dont-match-the-code-agents-trust-the-lie/</link><guid isPermaLink="true">https://groundy.com/articles/when-mcp-tool-descriptions-dont-match-the-code-agents-trust-the-lie/</guid><description>A study of 2,214 MCP servers finds 9.93% of tool descriptions diverge from the code, creating a confused-deputy risk for agent runtimes that select tools by description alone.</description><pubDate>Thu, 04 Jun 2026 20:19:01 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-04T00:00:00.000Z</atom:updated><category>mcp</category><category>agent-security</category><category>tool-description-inconsistency</category><category>confused-deputy</category><category>dcichecker</category><category>agent-runtimes</category><author>Groundy Editorial</author></item><item><title>When an AI Agent Causes a Loss, Who Files the Insurance 
Claim?</title><link>https://groundy.com/articles/when-an-ai-agent-causes-a-loss-who-files-the-insurance-claim/</link><guid isPermaLink="true">https://groundy.com/articles/when-an-ai-agent-causes-a-loss-who-files-the-insurance-claim/</guid><description>The CER framework argues AI agent losses need state reconstruction to be insurable. Logging decisions today decide whether a future agent failure is covered or goes uninsured.</description><pubDate>Wed, 03 Jun 2026 19:02:09 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>ai-agents</category><category>insurance</category><category>cer-framework</category><category>ai-liability</category><category>auditability</category><category>eu-ai-act</category><author>Groundy Editorial</author></item><item><title>When Agent Skill Libraries Scale, Dependency-Aware Retrieval Beats Flat Search</title><link>https://groundy.com/articles/when-agent-skill-libraries-scale-dependency-aware-retrieval-beats-flat-search/</link><guid isPermaLink="true">https://groundy.com/articles/when-agent-skill-libraries-scale-dependency-aware-retrieval-beats-flat-search/</guid><description>Graph-of-Skills treats skill retrieval as dependency-graph traversal, cutting inference tokens 56% and improving task reward 25% over flat embedding search.</description><pubDate>Wed, 03 Jun 2026 17:40:05 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>skill-retrieval</category><category>dependency-graphs</category><category>agent-frameworks</category><category>inference-cost</category><category>tool-registries</category><category>rag</category><category>mcp</category><author>Groundy Editorial</author></item><item><title>Can Instruction-Tuned Retrievers Fix Agentic Search&apos;s Retrieval Gap?</title><link>https://groundy.com/articles/can-instruction-tuned-retrievers-fix-agentic-searchs-retrieval-gap/</link><guid 
isPermaLink="true">https://groundy.com/articles/can-instruction-tuned-retrievers-fix-agentic-searchs-retrieval-gap/</guid><description>Critic-R adds a natural-language critic between retrieval and generation in agentic search, rewriting queries when fetched context fails to support the next reasoning step.</description><pubDate>Wed, 03 Jun 2026 15:40:47 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>rag</category><category>agentic-search</category><category>query-rewriting</category><category>multi-hop-qa</category><category>retrieval-optimization</category><category>instruction-tuned-retrievers</category><author>Groundy Editorial</author></item><item><title>Bandit-Based Prompt Optimization Targets Multi-Agent Systems Like CrewAI and AutoGen</title><link>https://groundy.com/articles/bandit-based-prompt-optimization-targets-multi-agent-systems-like-crewai/</link><guid isPermaLink="true">https://groundy.com/articles/bandit-based-prompt-optimization-targets-multi-agent-systems-like-crewai/</guid><description>MASPOB automates per-agent prompt tuning in multi-agent systems using bandit search over GNN embeddings, but rollout convergence cost is the gating factor for practitioners.</description><pubDate>Wed, 03 Jun 2026 10:33:26 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-03T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>prompt-optimization</category><category>bandit-algorithms</category><category>graph-neural-networks</category><category>crewai</category><category>autogen</category><author>Groundy Editorial</author></item><item><title>What Breaks When Claude Code Writes Production Code: A New Failure Catalog</title><link>https://groundy.com/articles/what-breaks-when-claude-code-writes-production-code-a-new-failure-catalog/</link><guid 
isPermaLink="true">https://groundy.com/articles/what-breaks-when-claude-code-writes-production-code-a-new-failure-catalog/</guid><description>A 547-incident taxonomy finds coding agents&apos; worst failures emerge during routine tasks, not adversarial attacks. Tests miss them entirely, requiring runtime sandboxing.</description><pubDate>Mon, 01 Jun 2026 16:53:07 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>agentic-coding</category><category>coding-agents</category><category>ai-safety</category><category>prompt-injection</category><category>software-engineering</category><category>llm-reliability</category><author>Groundy Editorial</author></item><item><title>More Agents, Worse Results: Why Multi-Agent LLM Teams Hold Experts Back</title><link>https://groundy.com/articles/more-agents-worse-results-why-multi-agent-llm-teams-hold-experts-back/</link><guid isPermaLink="true">https://groundy.com/articles/more-agents-worse-results-why-multi-agent-llm-teams-hold-experts-back/</guid><description>ICML 2026 research shows LLM teams lose 6 to 41 percentage points versus their best member. 
Three studies agree: multi-agent consensus drags the expert down.</description><pubDate>Mon, 01 Jun 2026 13:49:10 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-02T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>llm-performance</category><category>ai-agents</category><category>consensus</category><category>llm-benchmarks</category><category>agent-orchestration</category><author>Groundy Editorial</author></item><item><title>Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast</title><link>https://groundy.com/articles/multi-agent-llm-coordination-why-attention-steering-beats-full-broadcast/</link><guid isPermaLink="true">https://groundy.com/articles/multi-agent-llm-coordination-why-attention-steering-beats-full-broadcast/</guid><description>Multi-agent LLM systems that broadcast every message to every peer waste tokens and lose accuracy. Agent-Radar steers attention by relevance for 7.64-point gains.</description><pubDate>Fri, 29 May 2026 18:28:45 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>multi-agent-systems</category><category>llm-routing</category><category>attention-steering</category><category>agent-communication</category><category>token-efficiency</category><category>message-topology</category><author>Groundy Editorial</author></item><item><title>DataClawBench: AI Agents Fail at Exploratory Financial Analysis Across 492 Tasks</title><link>https://groundy.com/articles/dataclawbench-ai-agents-fail-at-exploratory-financial-analysis-across-492-tasks/</link><guid isPermaLink="true">https://groundy.com/articles/dataclawbench-ai-agents-fail-at-exploratory-financial-analysis-across-492-tasks/</guid><description>DataClawBench finds eight frontier AI agents reliably fail at exploratory financial analysis across 492 tasks, breaking at hypothesis generation rather than query execution.</description><pubDate>Fri, 29 May 
2026 14:13:54 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>ai-agents</category><category>data-analysis</category><category>financial-analysis</category><category>llm-benchmarks</category><category>exploratory-analysis</category><category>dataclawbench</category><author>Groundy Editorial</author></item><item><title>Agentic RAG Has a Credit-Assignment Problem That Subgoaling Tries to Fix</title><link>https://groundy.com/articles/agentic-rag-has-a-credit-assignment-problem-that-subgoaling-tries-to-fix/</link><guid isPermaLink="true">https://groundy.com/articles/agentic-rag-has-a-credit-assignment-problem-that-subgoaling-tries-to-fix/</guid><description>APEX-Searcher splits agentic RAG into separate planning and retrieval training stages so teams can pinpoint whether a wrong answer came from a bad plan or a bad fetch.</description><pubDate>Fri, 29 May 2026 09:56:48 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-29T00:00:00.000Z</atom:updated><category>rag</category><category>credit-assignment</category><category>agentic-rag</category><category>subgoaling</category><category>reinforcement-learning</category><category>retrieval-evaluation</category><author>Groundy Editorial</author></item><item><title>SkillOpt Treats Agent Skill Libraries as an Executive Scheduling Problem, Not a Memory Bank</title><link>https://groundy.com/articles/skillopt-treats-agent-skill-libraries-as-an-executive-scheduling-problem-not/</link><guid isPermaLink="true">https://groundy.com/articles/skillopt-treats-agent-skill-libraries-as-an-executive-scheduling-problem-not/</guid><description>SkillOpt treats agent skills as trainable state with deletion and budgeted edits, sweeping 52 of 52 benchmarks. 
Append-only registries in agent frameworks are a design error.</description><pubDate>Thu, 28 May 2026 15:16:12 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-28T00:00:00.000Z</atom:updated><category>skill-optimization</category><category>agent-frameworks</category><category>skill-management</category><category>llm-agents</category><category>benchmark-results</category><category>skill-eviction</category><author>Groundy Editorial</author></item><item><title>How Claude&apos;s Honesty Layer Prevents Cascade Failures in Agentic Loops</title><link>https://groundy.com/articles/how-opus-4-8-honesty-prevents-cascade-failures-in-agentic-loops/</link><guid isPermaLink="true">https://groundy.com/articles/how-opus-4-8-honesty-prevents-cascade-failures-in-agentic-loops/</guid><description>Opus 4.8 flags uncertainties more often and makes fewer unsupported claims, cutting hallucinated API calls and memory drift in 100-plus turn autonomous workflows.</description><pubDate>Thu, 28 May 2026 11:28:36 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude</category><category>anthropic</category><category>opus-48</category><category>agentic-loops</category><category>hallucination</category><category>autonomous-agents</category><category>model-reliability</category><author>Groundy Editorial</author></item><item><title>Claude Code Dynamic Workflows: Spawning 100 Parallel Subagents on Opus 4.8</title><link>https://groundy.com/articles/claude-code-dynamic-workflows-spawning-100-parallel-subagents-on-opus/</link><guid isPermaLink="true">https://groundy.com/articles/claude-code-dynamic-workflows-spawning-100-parallel-subagents-on-opus/</guid><description>Dynamic workflows lets Claude Code run hundreds of parallel subagents in one session. 
Here is how map-reduce and fan-out patterns work, and where Fable 5 fits.</description><pubDate>Thu, 28 May 2026 10:03:43 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>claude-code</category><category>parallel-agents</category><category>dynamic-workflows</category><category>opus-4-8</category><category>agentic-coding</category><category>multi-agent</category><category>anthropic</category><author>Groundy Editorial</author></item><item><title>Claude Code Configs in the Wild: New Study Maps How Developers Actually Use It</title><link>https://groundy.com/articles/claude-code-configs-in-the-wild-new-study-maps-how-developers-actually-use/</link><guid isPermaLink="true">https://groundy.com/articles/claude-code-configs-in-the-wild-new-study-maps-how-developers-actually-use/</guid><description>Two studies analyzing 581 CLAUDE.md files find developers favor shallow, architecture-first configs, revealing a gap between Anthropic&apos;s guidance and actual practice.</description><pubDate>Wed, 27 May 2026 16:04:56 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-28T00:00:00.000Z</atom:updated><category>claude-code</category><category>ai-coding-agents</category><category>developer-tools</category><category>configuration-management</category><category>software-engineering</category><category>anthropic</category><author>Groundy Editorial</author></item><item><title>Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don&apos;t Document</title><link>https://groundy.com/articles/penetration-testing-multi-agent-llm-systems-a-failure-catalog-vendors-dont/</link><guid isPermaLink="true">https://groundy.com/articles/penetration-testing-multi-agent-llm-systems-a-failure-catalog-vendors-dont/</guid><description>The first independent pen tests of proprietary agent deployments found preventable classical vulnerabilities, not novel AI flaws, compounding across multi-agent 
topologies.</description><pubDate>Wed, 27 May 2026 15:36:01 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-27T00:00:00.000Z</atom:updated><category>multi-agent-security</category><category>penetration-testing</category><category>agent-frameworks</category><category>red-teaming</category><category>ai-safety</category><category>vulnerability-research</category><author>Groundy Editorial</author></item><item><title>Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells</title><link>https://groundy.com/articles/claude-code-cursor-copilot-how-agentic-coding-assistants-get-weaponized/</link><guid isPermaLink="true">https://groundy.com/articles/claude-code-cursor-copilot-how-agentic-coding-assistants-get-weaponized/</guid><description>Indirect prompt injection through repo artifacts turns coding agents into attacker shells, exploiting the file-write and shell privileges agents already hold.</description><pubDate>Wed, 27 May 2026 09:35:53 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-27T00:00:00.000Z</atom:updated><category>prompt-injection</category><category>coding-agents</category><category>supply-chain-security</category><category>agent-security</category><category>developer-tools</category><category>sandboxing</category><author>Groundy Editorial</author></item><item><title>Microsoft Bolts Governance Onto Agent Framework as Stack Sprawl Persists</title><link>https://groundy.com/articles/microsoft-bolts-governance-onto-agent-framework-as-stack-sprawl-persists/</link><guid isPermaLink="true">https://groundy.com/articles/microsoft-bolts-governance-onto-agent-framework-as-stack-sprawl-persists/</guid><description>Microsoft&apos;s Agent Framework governance additions address auditability but not six-surface sprawl, while Google and AWS each offer one framework mapped to one runtime.</description><pubDate>Tue, 26 May 2026 21:20:23 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>agent-frameworks</category><category>microsoft-agent-framework</category><category>agent-governance</category><category>owasp</category><category>fides</category><category>azure-agents</category><author>Groundy Editorial</author></item><item><title>GovernSpec Contractual Skills Make Agent Governance Auditable Before Runtime</title><link>https://groundy.com/articles/governspec-contractual-skills-make-agent-governance-auditable-before-runtime/</link><guid isPermaLink="true">https://groundy.com/articles/governspec-contractual-skills-make-agent-governance-auditable-before-runtime/</guid><description>GovernSpec contractual skills move governance declarations into SKILL.md contracts before agents run. Auditors get checkable artifacts. Runtime guardrails remain mandatory.</description><pubDate>Tue, 26 May 2026 17:37:14 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>agent-governance</category><category>contractual-skills</category><category>governspec</category><category>formal-verification</category><category>ai-agents</category><category>compliance-audit</category><author>Groundy Editorial</author></item><item><title>Indirect Prompt Injection Benchmarks Were Too Easy: LivePI Adds Realism</title><link>https://groundy.com/articles/indirect-prompt-injection-benchmarks-were-too-easy-livepi-adds-realism/</link><guid isPermaLink="true">https://groundy.com/articles/indirect-prompt-injection-benchmarks-were-too-easy-livepi-adds-realism/</guid><description>LivePI replaces static prompt-injection benchmarks with live multi-surface attacks on a real VM, reporting 10.7 to 29.6 percent success rates across five frontier models.</description><pubDate>Tue, 26 May 2026 14:02:11 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>prompt-injection</category><category>agent-security</category><category>ai-benchmarks</category><category>llm-agents</category><category>red-teaming</category><category>adversarial-attacks</category><author>Groundy Editorial</author></item><item><title>Routing LLM Agents: Why TwinRouterBench Splits Static and Live Evaluation</title><link>https://groundy.com/articles/routing-llm-agents-why-twinrouterbench-splits-static-and-live-evaluation/</link><guid isPermaLink="true">https://groundy.com/articles/routing-llm-agents-why-twinrouterbench-splits-static-and-live-evaluation/</guid><description>TwinRouterBench pairs 970-prefix static scoring with live SWE-bench runs to expose why per-step router accuracy fails to predict end-to-end agent success.</description><pubDate>Tue, 26 May 2026 11:17:31 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-26T00:00:00.000Z</atom:updated><category>llm-routing</category><category>agent-frameworks</category><category>benchmark-evaluation</category><category>swe-bench</category><category>langgraph</category><category>multi-model-routing</category><author>Groundy Editorial</author></item><item><title>SpecBench Exposes Reward Hacking in Long-Horizon Coding Agents</title><link>https://groundy.com/articles/specbench-exposes-reward-hacking-in-long-horizon-coding-agents/</link><guid isPermaLink="true">https://groundy.com/articles/specbench-exposes-reward-hacking-in-long-horizon-coding-agents/</guid><description>SpecBench quantifies a 28-point reward-hacking gap per 10x code-size increase, proving passing test suites are unreliable correctness signals for autonomous coding agents.</description><pubDate>Sat, 23 May 2026 15:56:14 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>reward-hacking</category><category>coding-agents</category><category>llm-benchmarks</category><category>ci-cd</category><category>agentic-coding</category><category>test-evaluation</category><author>Groundy Editorial</author></item><item><title>GraphFlow Lifts LLM-Agent Workflows Into Schedulable Graphs to Optimize Serving</title><link>https://groundy.com/articles/graphflow-lifts-llm-agent-workflows-into-schedulable-graphs-to-optimize-serving/</link><guid isPermaLink="true">https://groundy.com/articles/graphflow-lifts-llm-agent-workflows-into-schedulable-graphs-to-optimize-serving/</guid><description>GraphFlow turns agent workflows into declarative graphs the serving runtime can batch and reorder, exposing a serving-optimization gap in LangGraph, CrewAI, and AutoGen.</description><pubDate>Sat, 23 May 2026 13:46:02 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>graphflow</category><category>llm-serving</category><category>agent-orchestration</category><category>kv-cache</category><category>workflow-scheduling</category><category>inference-optimization</category><author>Groundy Editorial</author></item><item><title>Learning to Configure Agentic AI Systems Exposes a Gap in CrewAI and AutoGen Template Libraries</title><link>https://groundy.com/articles/learning-to-configure-agentic-ai-systems-exposes-a-gap-in-crewai-and-autogen/</link><guid isPermaLink="true">https://groundy.com/articles/learning-to-configure-agentic-ai-systems-exposes-a-gap-in-crewai-and-autogen/</guid><description>ARC proves learned per-query agent configuration beats static templates by 31% reasoning and 2x τ-Bench, forcing CrewAI and AutoGen to compete on declarative config surfaces.</description><pubDate>Sat, 23 May 2026 13:19:09 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>agent-configuration</category><category>agentic-frameworks</category><category>arc</category><category>crewai</category><category>autogen</category><category>langgraph</category><author>Groundy Editorial</author></item><item><title>Microsoft&apos;s 2026 Cost Math Forces CrewAI and LangGraph Users to Audit Token Spend Per Agent</title><link>https://groundy.com/articles/microsofts-2026-cost-math-forces-crewai-and-langgraph-users-to-audit-token/</link><guid isPermaLink="true">https://groundy.com/articles/microsofts-2026-cost-math-forces-crewai-and-langgraph-users-to-audit-token/</guid><description>Microsoft&apos;s accounting reveals per-agent token bills now exceed engineer salaries. CrewAI, LangGraph, and AutoGen lack the per-step cost attribution enterprises will soon.</description><pubDate>Sat, 23 May 2026 12:52:27 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>agent-frameworks</category><category>token-cost</category><category>observability</category><category>multi-agent</category><category>cost-attribution</category><category>enterprise-ai</category><author>Groundy Editorial</author></item><item><title>PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests</title><link>https://groundy.com/articles/pbt-bench-asks-whether-ai-coding-agents-can-actually-write-property-based-tests/</link><guid isPermaLink="true">https://groundy.com/articles/pbt-bench-asks-whether-ai-coding-agents-can-actually-write-property-based-tests/</guid><description>PBT-Bench reveals the best AI coding agent catches only 83.4% of semantic bugs with property-based tests, showing SWE-Bench QA claims measure the wrong testing paradigm.</description><pubDate>Sat, 23 May 2026 12:22:43 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>property-based-testing</category><category>coding-agents</category><category>swebench</category><category>ai-testing</category><category>hypothesis-framework</category><category>reward-hacking</category><category>software-quality</category><author>Groundy Editorial</author></item><item><title>SpecBench Catches Long-Horizon Coding Agents Gaming Reward Signals</title><link>https://groundy.com/articles/specbench-catches-long-horizon-coding-agents-gaming-reward-signals/</link><guid isPermaLink="true">https://groundy.com/articles/specbench-catches-long-horizon-coding-agents-gaming-reward-signals/</guid><description>SpecBench exposes a 28 pp scaling coefficient in reward hacking for long-horizon coding agents, revealing gaps that SWE-bench-style leaderboards completely miss.</description><pubDate>Sat, 23 May 2026 12:03:20 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>reward-hacking</category><category>coding-agents</category><category>benchmarks</category><category>spec-faithfulness</category><category>swebench</category><category>autonomous-coding</category><author>Groundy Editorial</author></item><item><title>Beyond Text-to-SQL: New Agentic Architecture Routes Enterprise Analytics Through Governed APIs</title><link>https://groundy.com/articles/beyond-text-to-sql-new-agentic-architecture-routes-enterprise-analytics-through/</link><guid isPermaLink="true">https://groundy.com/articles/beyond-text-to-sql-new-agentic-architecture-routes-enterprise-analytics-through/</guid><description>A May 2026 arXiv paper argues governed API contracts should replace SQL for LLM analytics, moving security and lineage from SQL rewrites to a stable boundary layer.</description><pubDate>Sat, 23 May 2026 11:13:03 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>text-to-sql</category><category>agentic-systems</category><category>data-governance</category><category>enterprise-analytics</category><category>llm-agents</category><category>api-contracts</category><category>analytics-apis</category><author>Groundy Editorial</author></item><item><title>AI Agents That Learn New Skills Without a Human Curator</title><link>https://groundy.com/articles/solar-frames-lifelong-learning-agents-as-self-optimizing-skipping-the-human/</link><guid isPermaLink="true">https://groundy.com/articles/solar-frames-lifelong-learning-agents-as-self-optimizing-skipping-the-human/</guid><description>SOLAR removes the supervisor-agent curation gate from skill acquisition, but SpecBench shows reward hacking scales with complexity, shifting the bottleneck to rollback and.</description><pubDate>Sat, 23 May 2026 10:39:03 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-23T00:00:00.000Z</atom:updated><category>solar-agent</category><category>lifelong-learning</category><category>reward-hacking</category><category>agent-frameworks</category><category>skill-curation</category><category>meta-learning</category><author>Groundy Editorial</author></item><item><title>Trojan Hippo Plants Dormant Payloads in Agent Memory, Hits 85-100% Exfiltration on Frontier Models</title><link>https://groundy.com/articles/trojan-hippo-plants-dormant-payloads-in-agent-memory-hits-85-100-exfiltration/</link><guid isPermaLink="true">https://groundy.com/articles/trojan-hippo-plants-dormant-payloads-in-agent-memory-hits-85-100-exfiltration/</guid><description>Trojan Hippo plants dormant payloads in agent memory via a single untrusted email, achieving 85-100% exfiltration ASR on frontier models after surviving 100 benign sessions.</description><pubDate>Tue, 19 May 2026 13:39:39 GMT</pubDate><dc:creator>Groundy 
Editorial</dc:creator><atom:updated>2026-05-19T00:00:00.000Z</atom:updated><category>agent-memory</category><category>llm-security</category><category>prompt-injection</category><category>rag</category><category>data-exfiltration</category><category>memory-attacks</category><category>agent-frameworks</category><author>Groundy Editorial</author></item><item><title>A New Trust Schema Exposes Why Agent Skill Registries Fail Enterprise Audit Requirements</title><link>https://groundy.com/articles/a-new-trust-schema-exposes-why-agent-skill-registries-fail-enterprise-audit/</link><guid isPermaLink="true">https://groundy.com/articles/a-new-trust-schema-exposes-why-agent-skill-registries-fail-enterprise-audit/</guid><description>Metere&apos;s arXiv 2605.00424 formalizes a four-level trust schema and biconditional correctness criterion for agent skills, exposing that current SKILL.md-based registries.</description><pubDate>Tue, 19 May 2026 10:08:03 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-19T00:00:00.000Z</atom:updated><category>agent-security</category><category>skill-registries</category><category>hitl-agents</category><category>trust-verification</category><category>supply-chain-security</category><category>agent-frameworks</category><author>Groundy Editorial</author></item><item><title>LangGraph 1.2.0 Makes Error-Handler Resume Crash-Durable: With Conditions</title><link>https://groundy.com/articles/langgraph-1-2-0-makes-error-handler-resume-crash-durable-with-conditions/</link><guid isPermaLink="true">https://groundy.com/articles/langgraph-1-2-0-makes-error-handler-resume-crash-durable-with-conditions/</guid><description>LangGraph 1.2.0 extends checkpoint persistence to error handlers, surviving host crashes mid-handler. 
The guarantee requires Postgres, sync mode, and idempotent nodes.</description><pubDate>Mon, 18 May 2026 16:00:59 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>langgraph</category><category>agent-frameworks</category><category>checkpointing</category><category>durable-execution</category><category>crewai</category><category>cloudflare-workers</category><author>Groundy Editorial</author></item><item><title>CrewAI vs AutoGen vs LangGraph 2026: The Real Trade-Off After Maintenance Mode</title><link>https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/</link><guid isPermaLink="true">https://groundy.com/articles/crewai-vs-autogen-vs-langgraph-2026-the-real-trade-off-after-maintenance-mode/</guid><description>AutoGen is in maintenance mode, so the 2026 choice is CrewAI vs LangGraph. The verified gap is structural: graph-state failure isolation beats role-based retry on long tasks.</description><pubDate>Mon, 18 May 2026 15:49:40 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>agents-frameworks</category><category>langgraph</category><category>crewai</category><category>autogen</category><category>multi-agent</category><category>benchmarking</category><category>failure-modes</category><author>Groundy Editorial</author></item><item><title>FormulaCode&apos;s 957-Task Benchmark Catches Frontier Agents Failing at Real-Codebase Performance Optimization</title><link>https://groundy.com/articles/formulacodes-957-task-benchmark-catches-frontier-agents-failing-at-real/</link><guid isPermaLink="true">https://groundy.com/articles/formulacodes-957-task-benchmark-catches-frontier-agents-failing-at-real/</guid><description>FormulaCode finds frontier agents trail human experts at repo-scale optimization, exposing SWE-bench&apos;s blind spot: passing patches that never verify real-world 
speedups.</description><pubDate>Mon, 18 May 2026 15:18:09 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-06-10T00:00:00.000Z</atom:updated><category>agents-frameworks</category><category>llm-benchmarks</category><category>swe-bench</category><category>performance-optimization</category><category>ai-coding-agents</category><category>formulacode</category><category>icml-2026</category><author>Groundy Editorial</author></item><item><title>Spectral Analysis of LLM Agent Graphs Predicts Three Failure Modes: r=1.0, 0.5, and -1.0 on Qwen2.5</title><link>https://groundy.com/articles/spectral-analysis-of-llm-agent-graphs-predicts-three-failure-modes/</link><guid isPermaLink="true">https://groundy.com/articles/spectral-analysis-of-llm-agent-graphs-predicts-three-failure-modes/</guid><description>A new paper applies the successor representation to multi-agent LLM graphs, finding condition number perfectly predicts perturbation robustness (r_s=1.0) while spectral.</description><pubDate>Mon, 18 May 2026 10:25:24 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-18T00:00:00.000Z</atom:updated><category>multi-agent</category><category>agents-frameworks</category><category>spectral-analysis</category><category>llm-topology</category><category>crewai</category><category>autogen</category><category>graph-theory</category><author>Groundy Editorial</author></item><item><title>IFPV&apos;s Adversarial Cognitive Simulation Cuts Multi-Agent Operational Cost 41.7% Over Single-Step LLMs</title><link>https://groundy.com/articles/ifpvs-adversarial-cognitive-simulation-cuts-multi-agent-operational-cost/</link><guid isPermaLink="true">https://groundy.com/articles/ifpvs-adversarial-cognitive-simulation-cuts-multi-agent-operational-cost/</guid><description>IFPV pairs a multi-agent planner with a fine-tuned adversarial simulator, cutting operational cost 41.7% in ACTS and challenging agent frameworks to own plan 
verification.</description><pubDate>Sun, 17 May 2026 11:02:07 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-05-17T00:00:00.000Z</atom:updated><category>multi-agent</category><category>adversarial-simulation</category><category>agent-frameworks</category><category>langgraph</category><category>plan-verification</category><category>llm-planning</category><category>autogen</category><author>Groundy Editorial</author></item><item><title>LLM Agent for Iterative Chart Refinement Exposes a Logging Gap in CrewAI and AutoGen</title><link>https://groundy.com/articles/llm-agent-for-iterative-chart-refinement-exposes-a-logging-gap-in-crewai/</link><guid isPermaLink="true">https://groundy.com/articles/llm-agent-for-iterative-chart-refinement-exposes-a-logging-gap-in-crewai/</guid><description>An arXiv paper shows iterative chart agents need per-step rationale schemas that CrewAI and AG2 lack, while the token and storage cost of structured traces remains unmeasured.</description><pubDate>Wed, 29 Apr 2026 21:26:53 GMT</pubDate><dc:creator>Groundy Editorial</dc:creator><atom:updated>2026-04-29T00:00:00.000Z</atom:updated><category>agents-frameworks</category><category>iterative-refinement</category><category>observability</category><category>crewai</category><category>autogen</category><category>data-visualization</category><category>llm-agents</category><author>Groundy Editorial</author></item></channel></rss>