groundy

agents & frameworks

70 articles

Top in agents & frameworks


  1. jun 09 agents Why Skill Creation and Reward Optimization Collide in Agentic RL
  2. jun 09 agents When AI Agents Delegate Work, Your Observability Stack Goes Blind
  3. jun 08 agents Bloomberg's Pomona Makes Small Automated Code Changes, Not Big Agent PRs
  4. jun 08 agents Agent Tool-Gating Moves From Prompt Rules to Learned Policies
  5. jun 08 agents More Capable LLMs Cooperate Less in Zero-Cost Collaboration Tests
  6. jun 07 agents Why Foundation Model Agents Pass Benchmarks but Fail in Production
  7. jun 06 agents Can AI Agents Repair Broken Network Configs? A New Benchmark Tests It
  8. jun 06 agents Can Self-Evolving AI Agents Drift Without a Human in the Loop?
  9. jun 05 agents Fine-Tuning Multi-Agent LLM Systems: RL Enters Where Prompt Tweaks Stall
  10. jun 05 agents Cascading Hallucination in Agentic RAG: When One Bad Retrieval Poisons the Chain
  11. jun 04 agents Can AI Agents Build Other Agents? The Meta-Agent Challenge Says Mostly Not Yet
  12. jun 03 agents When MCP Tool Descriptions Don't Match the Code, Agents Trust the Lie
  13. jun 02 agents When an AI Agent Causes a Loss, Who Files the Insurance Claim?
  14. jun 02 agents When Agent Skill Libraries Scale, Dependency-Aware Retrieval Beats Flat Search
  15. jun 02 agents Can Instruction-Tuned Retrievers Fix Agentic Search's Retrieval Gap?
  16. jun 02 agents Bandit-Based Prompt Optimization Targets Multi-Agent Systems Like CrewAI and AutoGen
  17. may 31 agents What Breaks When Claude Code Writes Production Code: A New Failure Catalog
  18. may 31 agents More Agents, Worse Results: Why Multi-Agent LLM Teams Hold Experts Back
  19. may 28 agents Multi-Agent LLM Coordination: Why Attention Steering Beats Full Broadcast
  20. may 28 agents DataClawBench: AI Agents Fail at Exploratory Financial Analysis Across 492 Tasks
  21. may 28 agents Agentic RAG Has a Credit-Assignment Problem That Subgoaling Tries to Fix
  22. may 27 agents SkillOpt Treats Agent Skill Libraries as an Executive Scheduling Problem, Not a Memory Bank
  23. may 27 agents How Claude's Honesty Layer Prevents Cascade Failures in Agentic Loops
  24. may 27 agents Claude Code Dynamic Workflows: Spawning 100 Parallel Subagents on Opus 4.8
  25. may 26 agents Claude Code Configs in the Wild: New Study Maps How Developers Actually Use It
  26. may 26 agents Penetration Testing Multi-Agent LLM Systems: A Failure Catalog Vendors Don't Document
  27. may 26 agents Claude Code, Cursor, Copilot: How Agentic Coding Assistants Get Weaponized as Attacker Shells
  28. may 25 agents Microsoft Bolts Governance Onto Agent Framework as Stack Sprawl Persists
  29. may 25 agents GovernSpec Contractual Skills Make Agent Governance Auditable Before Runtime
  30. may 25 agents Indirect Prompt Injection Benchmarks Were Too Easy: LivePI Adds Realism
  31. may 25 agents Routing LLM Agents: Why TwinRouterBench Splits Static and Live Evaluation
  32. may 22 agents SpecBench Exposes Reward Hacking in Long-Horizon Coding Agents
  33. may 22 agents GraphFlow Lifts LLM-Agent Workflows Into Schedulable Graphs to Optimize Serving
  34. may 22 agents Learning to Configure Agentic AI Systems Exposes a Gap in CrewAI and AutoGen Template Libraries
  35. may 22 agents Microsoft's 2026 Cost Math Forces CrewAI and LangGraph Users to Audit Token Spend Per Agent
  36. may 22 agents PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests
  37. may 22 agents SpecBench Catches Long-Horizon Coding Agents Gaming Reward Signals
  38. may 22 agents Beyond Text-to-SQL: New Agentic Architecture Routes Enterprise Analytics Through Governed APIs
  39. may 22 agents AI Agents That Learn New Skills Without a Human Curator
  40. may 18 agents Trojan Hippo Plants Dormant Payloads in Agent Memory, Hits 85-100% Exfiltration on Frontier Models
  41. may 18 agents A New Trust Schema Exposes Why Agent Skill Registries Fail Enterprise Audit Requirements
  42. may 17 agents LangGraph 1.2.0 Makes Error-Handler Resume Crash-Durable: With Conditions
  43. may 17 agents CrewAI vs AutoGen vs LangGraph 2026: The Real Trade-Off After Maintenance Mode
  44. may 17 agents FormulaCode's 957-Task Benchmark Catches Frontier Agents Failing at Real-Codebase Performance Optimization
  45. may 17 agents Spectral Analysis of LLM Agent Graphs Predicts Three Failure Modes: r=1.0, 0.5, and -1.0 on Qwen2.5
  46. may 16 agents IFPV's Adversarial Cognitive Simulation Cuts Multi-Agent Operational Cost 41.7% Over Single-Step LLMs
  47. apr 28 agents LLM Agent for Iterative Chart Refinement Exposes a Logging Gap in CrewAI and AutoGen
  48. apr 28 agents CrewAI 1.14.2 Lands Checkpoint TUI with Tree View, Fork Support, and Lineage Tracking
  49. apr 28 agents Council Mode Cuts Multi-Agent LLM Hallucination 35.9% at 4.2x Token Cost on HaluEval
  50. apr 28 agents Salesforce TDX 2026: Headless 360 Ships 60+ MCP Tools and Agentforce Vibes 2.0 With Claude Sonnet 4.5

Agent frameworks ship faster than the rigor operators need to run them. Vendor docs promise orchestration, memory, and tool use; academic benchmarks and production post-mortems keep exposing the same structural gaps: diversity collapse in multi-agent ideation, hallucination amplification across consensus topologies, missing per-step rationale traces, role-based retry losing to graph-state failure isolation on long tasks, and configuration surfaces that punish static templates. This beat covers that delta.

The second through-line is governance and trust. Skill registries, tool-use protocols, and capability manifests are accumulating faster than auditable contracts for them. Trust schemas, contractual skill specs, and information-flow controls are arriving as bolt-ons rather than primitives, while the infrastructure layer (sandbox execution, private networking, agent memory) keeps absorbing functionality the framework layer used to own. The question of where the agent stack actually lives, and who is liable when it misbehaves, stays unresolved.

Coverage is comparative and opinionated. When a benchmark or paper exposes a gap that a major framework cannot close without redesign, that gets named. When a vendor ships governance theater rather than enforcement, that gets named too. The goal is to help readers pick stacks that survive contact with production, not to produce a taxonomy of every framework release.