When AI Agents Delegate Work, Your Observability Stack Goes Blind

Your traces look fine. They tell you nothing.

When an AI agent delegates a subtask to another agent or an external tool, every individual action gets logged. The spans fire and the traces render cleanly in your dashboard. But according to arXiv:2606.09692, published June 8, 2026 by Mishra and Sharad, those logs cannot reconstruct which agent caused which outcome. Standard observability treats each call as an independent event. Delegation turns those events into interleaved, fragmented sequences that no amount of span-wiring can untangle after the fact.

The paper’s central claim is precise and deliberately narrow: delegation-scoped execution is structurally underdetermined. That term has a specific meaning here. Given a complete set of audit logs and execution traces, multiple incompatible delegation assignments can produce identical observable outputs. You cannot work backwards from the logs to determine which actions belonged to which delegated subtask.
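A toy illustration may help make the definition concrete. The sketch below is not taken from the paper: the log format, the delegation IDs D1 and D2, and the tool arguments are all hypothetical. It shows two incompatible delegation assignments that flatten to the same observable log.

```python
from typing import Dict, List, Tuple

# A log entry as a conventional tracer might record it: (timestamp, tool, argument).
# No field says which delegated subtask the call served. (Hypothetical format.)
LogEntry = Tuple[int, str, str]


def observable_log(assignment: Dict[str, List[LogEntry]]) -> List[LogEntry]:
    """Flatten per-delegation actions into the time-ordered log an auditor sees."""
    merged = [entry for actions in assignment.values() for entry in actions]
    return sorted(merged)


# Hypothetical ground truth 1: the second search served delegation D1.
assignment_one = {
    "D1": [(1, "search", "q=pricing"), (2, "search", "q=competitors")],
    "D2": [(3, "summarize", "doc=42")],
}

# Hypothetical ground truth 2: the same search served delegation D2 instead.
assignment_two = {
    "D1": [(1, "search", "q=pricing")],
    "D2": [(2, "search", "q=competitors"), (3, "summarize", "doc=42")],
}

print(observable_log(assignment_one) == observable_log(assignment_two))  # True
print(assignment_one == assignment_two)  # False
```

Both assignments are consistent with the same log, so the log alone cannot say which one actually happened.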

This is not a tooling gap. It is a structural property of how delegation works in LLM-based agentic systems. Agents dynamically select tools. Execution sequences vary across runs for the same instruction. Sub-agents spawn and cooperate in patterns that differ from one invocation to the next. These dynamics fragment and interleave traces at the causal-structure level, meaning the information needed to reconstruct delegation scope was never captured in the first place.

What “structurally underdetermined” means for incident response

For teams accustomed to single-agent debugging, the mental model is straightforward: trace the call chain, find the failing span, read the input/output, diagnose the bug. OpenTelemetry and similar APM systems were designed around this model. A trace is a tree of spans. Each span has a parent. The tree reconstructs causality.

Delegation breaks this in a specific way. When Agent A delegates a research subtask to Agent B, and Agent B calls a search tool, a code interpreter, and a summarizer, those three tool calls appear in the trace as children of Agent B’s execution span. But if Agent B then delegates part of its work to Agent C, and Agent C calls the same search tool with different parameters, the trace now contains interleaved calls to the same tool from different delegation scopes. The trace records that the calls happened. It does not record which delegation context each call belonged to, because the delegation protocol does not bind that context to the execution.

Post-incident, you are left with a flat list of tool invocations and no reliable way to attribute them to the agent that initiated them. The paper argues this is not a failure of correlation heuristics or time-window matching. The information is simply not present in the data.
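The time-window point can be sketched with a small, hypothetical scenario. The span times, agent names, and the heuristic itself are assumptions made for illustration, and the `true_agent` field exists only so the example can check itself; a real log would not contain it. Agent B keeps working while its delegate, Agent C, is active, and both call the same search tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AgentSpan:
    agent: str
    start: int
    end: int


@dataclass(frozen=True)
class ToolCall:
    tool: str
    timestamp: int
    true_agent: str  # ground truth, known to this example only, absent from real logs


def attribute_by_time_window(call: ToolCall, spans: List[AgentSpan]) -> Optional[str]:
    """Heuristic: credit the call to the most recently started span active at call time."""
    active = [s for s in spans if s.start <= call.timestamp <= s.end]
    if not active:
        return None
    return max(active, key=lambda s: s.start).agent


# Agent C's span is nested inside Agent B's, and both agents call "search".
spans = [AgentSpan("agent_b", 0, 10), AgentSpan("agent_c", 3, 8)]
calls = [
    ToolCall("search", 4, true_agent="agent_b"),
    ToolCall("search", 5, true_agent="agent_c"),
]

guesses = [attribute_by_time_window(c, spans) for c in calls]
misattributed = [c for c, g in zip(calls, guesses) if g != c.true_agent]
print(guesses)             # ['agent_c', 'agent_c']
print(len(misattributed))  # 1
```

Any rule that picks one active window will be wrong for some interleaving, because the timestamps are compatible with more than one attribution.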

The scale of the coordination problem

This observability gap is not theoretical. The Alem benchmark, also published in June 2026, tested 13 modern LLMs on open-ended multi-agent coordination tasks. The average normalized return was approximately 6%. Individual task competence did not translate into coordination competence. Agents that performed well in isolation failed when they had to delegate, synchronize, and recover from each other’s errors.

That ~6% figure matters because it shows how poorly current models perform in exactly the regime where observability matters most. A normalized return is not a failure rate, but a score that low indicates that coordination breakdowns are common rather than exceptional. When they happen, the teams responsible need to know which agent’s action caused the failure, and the observability paper shows that existing tooling cannot make that attribution after the fact.

The gateway-and-context-model approach

Mishra and Sharad propose a two-part fix. First, an agent-aware observability substrate: a lightweight gateway that sits between agents and the tools or sub-agents they invoke. Second, a common information model that binds delegation context to each action at execution time.

The gateway approach is architecturally straightforward. Every tool call, every sub-agent invocation, passes through a layer that tags the request with its delegation scope: which agent initiated the call, which parent delegation it belongs to, and what the delegated objective was. This context travels with the execution, so the resulting logs carry enough information to reconstruct delegation boundaries without heuristic correlation.
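A minimal sketch of such a gateway, assuming an in-process wrapper around plain Python callables. The class names, field names, and the `open_scope` and `call` methods are hypothetical and are not the interface described in the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class DelegationScope:
    scope_id: str
    agent: str                      # agent acting under this delegation
    objective: str                  # what the delegating agent asked for
    parent_scope_id: Optional[str] = None


@dataclass
class Gateway:
    tools: Dict[str, Callable[..., Any]]
    log: List[Dict[str, Any]] = field(default_factory=list)
    _counter: int = 0

    def open_scope(self, agent: str, objective: str,
                   parent: Optional[DelegationScope] = None) -> DelegationScope:
        """Create a delegation scope, optionally nested under a parent scope."""
        self._counter += 1
        parent_id = parent.scope_id if parent else None
        return DelegationScope(f"D{self._counter}", agent, objective, parent_id)

    def call(self, scope: DelegationScope, tool: str, **kwargs: Any) -> Any:
        """Record the delegation context first, then invoke the tool."""
        self.log.append({
            "scope_id": scope.scope_id,
            "parent_scope_id": scope.parent_scope_id,
            "agent": scope.agent,
            "objective": scope.objective,
            "tool": tool,
            "args": kwargs,
        })
        return self.tools[tool](**kwargs)


gateway = Gateway(tools={"search": lambda query: f"results for {query}"})
d1 = gateway.open_scope("agent_a", "research the market")
d2 = gateway.open_scope("agent_b", "find competitor pricing", parent=d1)
gateway.call(d1, "search", query="market size")
gateway.call(d2, "search", query="competitor pricing")
print([(r["scope_id"], r["parent_scope_id"]) for r in gateway.log])
# [('D1', None), ('D2', 'D1')]
```

Logging before the tool runs is a design choice in this sketch: a call that raises still leaves a record of which delegation attempted it.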

The common information model provides the schema for those tags. It is designed to work across heterogeneous systems, meaning it can bind context from a LangGraph agent calling a CrewAI tool, or a custom runtime invoking an external API. The model enables cross-tool, delegation-scoped reconstruction and direct forensic queries. You ask “which actions were taken under delegation scope D3?” and get an answer, rather than reconstructing D3 from timestamps and hope.
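As a rough sketch of what a delegation-scoped query could look like once every action carries its scope, consider the following. The `ActionRecord` fields and the `actions_under` helper are assumptions for illustration, not the schema defined in the paper, and the runtime names are plain labels rather than calls into those frameworks.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ActionRecord:
    scope_id: str
    parent_scope_id: Optional[str]
    agent: str
    runtime: str     # free-form label such as "langgraph" or "custom"
    tool: str
    timestamp: int


def actions_under(records: List[ActionRecord], scope_id: str) -> List[ActionRecord]:
    """Return the actions taken directly under one delegation scope, in time order."""
    matching = [r for r in records if r.scope_id == scope_id]
    return sorted(matching, key=lambda r: r.timestamp)


# Hypothetical records from three different runtimes sharing one schema.
records = [
    ActionRecord("D1", None, "agent_a", "langgraph", "search", 1),
    ActionRecord("D3", "D1", "agent_c", "crewai", "search", 2),
    ActionRecord("D2", "D1", "agent_b", "custom", "code_interpreter", 3),
    ActionRecord("D3", "D1", "agent_c", "crewai", "summarize", 4),
]

print([r.tool for r in actions_under(records, "D3")])  # ['search', 'summarize']
```

The query is a plain filter because the scope was bound at execution time; nothing has to be inferred from timestamps.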

The paper describes this at the architecture level. Implementation details, including performance overhead and deployment patterns, are not covered in the abstract and would require reading the full paper.

Implicit Execution Tracing: provenance in the output itself

A complementary approach comes from arXiv:2603.17445, published March 2026, which proposes Implicit Execution Tracing (IET). Instead of building infrastructure around the agents, IET embeds agent-specific, key-conditioned statistical signals into the token generation process itself.

The output text becomes a self-verifying execution record. Each agent’s contribution to the final text carries a statistical fingerprint that can be extracted after the fact, even when execution metadata is unavailable. The paper reports that IET achieves accurate segment-level attribution and reliable transition recovery under three degraded conditions: identity removal (metadata stripped), boundary corruption (text segments concatenated without markers), and privacy-preserving redaction (sensitive content removed).

The tradeoff is that IET operates at the text level. It tells you which agent generated which segment of output. It does not reconstruct the full action chain the way a gateway approach can. The two approaches are complementary: the gateway captures execution provenance at call time, while IET captures output provenance as a fallback when execution metadata is lost.
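To give a feel for what a key-conditioned statistical signal can look like, here is a deliberately simplified toy. It is not the IET algorithm from arXiv:2603.17445. It assumes a green-list style scheme in which each agent only emits tokens that pass a keyed hash test, and the vocabulary, keys, and function names are all hypothetical.

```python
import hashlib
import hmac
import random
from typing import Dict, List

VOCAB = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
         "india", "juliet", "kilo", "lima", "mike", "november", "oscar", "papa"]


def is_green(key: bytes, prev_token: str, token: str) -> bool:
    """Keyed pseudo-random test: about half the vocabulary is 'green' per context."""
    digest = hmac.new(key, f"{prev_token}|{token}".encode(), hashlib.sha256).digest()
    return digest[0] % 2 == 0


def generate(key: bytes, length: int, seed: int) -> List[str]:
    """Toy 'agent': samples tokens only from the green set for its own key."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(length):
        green = [t for t in VOCAB if is_green(key, tokens[-1], t)]
        tokens.append(rng.choice(green or VOCAB))  # fall back if nothing is green
    return tokens[1:]


def green_fraction(key: bytes, tokens: List[str]) -> float:
    """Share of tokens that pass the keyed test given their predecessor."""
    if not tokens:
        return 0.0
    previous = ["<s>"] + tokens[:-1]
    hits = sum(is_green(key, p, t) for p, t in zip(previous, tokens))
    return hits / len(tokens)


def attribute(tokens: List[str], keys: Dict[str, bytes]) -> str:
    """Attribute a segment to the agent whose key explains it best."""
    return max(keys, key=lambda name: green_fraction(keys[name], tokens))


keys = {"agent_a": b"key-a", "agent_b": b"key-b"}
segment = generate(keys["agent_b"], length=60, seed=0)  # no metadata attached
print(attribute(segment, keys))
```

The segment carries no agent identifier, yet the keyed test scores far higher under the generating agent’s key than under any other, which is the general idea behind recovering attribution from the text alone.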

Why existing APM breaks under delegation

OpenTelemetry, the de facto standard for distributed tracing, models execution as a directed acyclic graph of spans. Each span represents a unit of work. Spans have parent-child relationships that encode causality. This model works well for microservices: Service A calls Service B, which calls Service C. The trace is a tree. The tree tells you what happened.

Agent delegation does not map cleanly onto this tree. The delegation relationship is not a parent-child call. It is a transfer of authority: Agent A entrusts Agent B with a goal, and Agent B decides how to achieve it, including calling tools that Agent A also calls. The resulting execution graph is not a tree. It is a set of interleaved, potentially overlapping subgraphs that share tools and state.

Existing audit, tracing, and security schemas lack the semantics to express delegation scope as a first-class concept. You can tag spans with agent identifiers. You can add custom attributes. But the delegation boundary, the point at which one agent hands authority to another, is not a span boundary. It is a semantic boundary that current schemas do not model.

Bolting on OpenTelemetry after the fact, which is how most teams currently approach agent observability, cannot solve this. The instrumentation captures events. It does not capture the delegation semantics that give those events causal structure across agent boundaries.

What this means for teams running multi-agent systems

For teams deploying LangGraph, CrewAI, AutoGen, or custom agent runtimes, the practical implications are immediate.

Post-incident root-cause analysis is harder than it looks. If your multi-agent system produces an incorrect result and you need to trace which agent’s action caused the error, your existing tracing infrastructure may not support that query. The traces exist. The attribution does not.

The debugging cost scales with delegation depth. A two-agent system with shallow delegation is tractable. A deeper system with nested delegation, where Agent A delegates to B, which delegates to C, which calls a tool also used by A, creates interleaved traces that require manual forensic work to disentangle. That manual work does not scale.

Observability must be designed into the delegation protocol. The gateway-and-context-model approach from arXiv:2606.09692 works because it binds delegation context at execution time, before the trace fragments. Retrofitting this after deployment requires re-architecting the delegation layer. Teams building multi-agent systems now should treat delegation-scoped observability as a protocol requirement, not an operational afterthought.

IET provides a fallback for uncontrolled environments. When you cannot instrument the execution path, Implicit Execution Tracing offers attribution at the output level. It is not a replacement for execution-time instrumentation, but it provides provenance guarantees in environments where metadata is stripped, corrupted, or simply unavailable.

The coordination failure rates from Alem suggest that multi-agent systems in production will fail frequently enough that post-incident debugging is a routine activity, not an exceptional one. If your observability stack cannot attribute those failures to specific agents, you are operating a system you cannot debug. That is the state of multi-agent deployment today.

Frequently Asked Questions

Can LangSmith, LangFuse, or similar AI observability tools solve this?

Most AI observability platforms model traces as spans with agent-ID tags, which works for single-agent debugging. The paper identifies that delegation scope is a semantic boundary, not a span boundary. Current vendor tools can show that Agent B called a search tool, but cannot distinguish whether that call was made under a delegation from Agent A with objective X versus an independent task. Teams evaluating these platforms should test whether they support delegation-scoped forensic queries, not just per-agent trace visualization.

What happens when delegation crosses to a third-party agent you don’t control?

The gateway approach requires instrumentation on both sides of the delegation boundary. When your agent delegates to a third-party API or external service, you cannot install your gateway there. IET provides a fallback, but only at the text-output level. You can attribute which agent produced which output segment, but you lose visibility into what tools the third-party agent called, what data it accessed, or how it reached its conclusion. This creates an observability asymmetry: full provenance inside your perimeter, partial provenance at the boundary.

How does this affect compliance audits that need to track data access?

The paper’s common information model supports access and share footprint reconstruction: a record of which tools, APIs, and data sources each delegation scope touched. This matters in regulated industries, where a delegated subtask may access a data store that the delegating agent was not authorized to touch. Existing audit logs record the access event but not the delegation context that triggered it, making it difficult to prove during a compliance review that a restricted dataset was never accessed under a specific delegation scope.
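A sketch of how such a footprint check might work, assuming access events are already tagged with a delegation scope and the parent of each scope is known. The event shape, scope IDs, resource names, and the `footprint` helper are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# (scope_id, resource): a hypothetical access event tagged at execution time.
AccessEvent = Tuple[str, str]


def footprint(events: Iterable[AccessEvent],
              parents: Dict[str, Optional[str]],
              root_scope: str) -> Set[str]:
    """Resources touched under root_scope or any scope delegated from it."""

    def descends_from_root(scope_id: Optional[str]) -> bool:
        seen: Set[str] = set()
        # Walk up the delegation chain; `seen` guards against malformed cycles.
        while scope_id is not None and scope_id not in seen:
            if scope_id == root_scope:
                return True
            seen.add(scope_id)
            scope_id = parents.get(scope_id)
        return False

    return {resource for scope_id, resource in events if descends_from_root(scope_id)}


# D1 delegates to D2, which delegates to D3. D4 is an unrelated task.
parents = {"D1": None, "D2": "D1", "D3": "D2", "D4": None}
events = [("D1", "crm_api"), ("D3", "billing_db"), ("D4", "restricted_hr_db")]

print(sorted(footprint(events, parents, "D1")))                # ['billing_db', 'crm_api']
print("restricted_hr_db" in footprint(events, parents, "D1"))  # False
```

With scope-tagged events, the compliance question becomes a set-membership check over one delegation subtree instead of an argument from timestamps.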

If single-agent benchmarks look good, does that predict multi-agent reliability?

The Alem benchmark tested 13 LLMs and found that strong individual performance did not carry over to coordination. Models that completed solo tasks reliably still failed when they had to delegate, synchronize, and recover from peer errors, averaging roughly 6% normalized return across open-ended coordination tasks. The observability gap identified by Mishra and Sharad therefore hits hardest in the regime teams are least prepared for: agents that look competent in isolation, then produce coordination failures in production with no tooling to diagnose which agent caused the breakdown.