Casserini, Facchini, and Ferrario’s position paper introduces “agentic entropy” as the systemic drift between an agent’s incremental actions and the architectural intent it was asked to preserve. The problem is not that any single step fails review—it is that current tools evaluate diffs locally while the cumulative trajectory quietly diverges. Each green test and approved patch can mask a widening gap between what the agent did and what the system actually needs.
What “Agentic Entropy” Actually Means
The term does not refer to output randomness or thermodynamic analogies. In “Beyond the Diff,” agentic entropy denotes structural drift: the progressive misalignment between an agentic system’s behavior and the architectural intent that originally governed it. A coding agent may generate a diff that compiles, passes tests, and satisfies a code reviewer, yet still move the codebase away from its intended design constraints. The paper argues that this drift is systemic, not accidental—an emergent property of long-horizon agentic workflows rather than a bug in any single model call.
The camera-ready revision was accepted to the Human-Centered Explainable AI (HCXAI) Workshop at CHI 2026, themed “Re-examining XAI in the Era of Agentic AI.” That timing matters because the workshop frames the problem as one of explainability: if we cannot trace how an agent’s trajectory departs from intent, we cannot correct it before the departure becomes expensive.
Why Every Diff Can Pass While the System Drifts
Current benchmarks reward local correctness. SWE-bench evaluates whether a generated patch resolves the described issue end-to-end, requiring coordination across functions, classes, and files, but it does not assess whether intermediate steps remained coherent with the original architectural goal. A trajectory that eventually produces a passing patch may have taken detours that introduced technical debt, violated modularity constraints, or quietly rewrote interfaces that other components depend on.
GitClear’s January 2026 research on AI coding tools found 9x higher code churn and 4-10x output differences alongside increases in test coverage. Those numbers suggest that agents are producing more code, and more of it is being rewritten shortly after—consistent with the entropy pattern the paper describes, where locally plausible steps accumulate into globally unstable structures.
The Three-Pillar Framework: Seeding, Monitoring, and Causal Graphs
The authors propose a process-oriented explainability framework rather than a new benchmark or scoring metric. It rests on three pillars: conformity seeding, reasoning monitoring, and a causal graph interface.
Conformity seeding embeds intent constraints at the start of an agent’s reasoning process, not just in the prompt but in the structural assumptions the agent carries forward. Reasoning monitoring tracks how the agent’s internal reasoning evolves across steps, flagging when later justifications begin to drift from earlier commitments. The causal graph interface surfaces these relationships visually, showing which decisions led to which architectural consequences rather than presenting each diff in isolation.
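The paper proposes these pillars conceptually, not as an implementation, but the combination can be sketched as a small data structure: seeded intent constraints fixed up front, each step recording which earlier steps it builds on (the causal edges), and a monitor flagging steps whose justification no longer references the seeded intent. Everything below (the `Step` and `Trajectory` names, the keyword-matching heuristic) is a hypothetical illustration, far simpler than what real reasoning monitoring would require.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action: its stated justification and which earlier steps it builds on."""
    step_id: int
    justification: str
    parents: list[int] = field(default_factory=list)  # causal-graph edges

@dataclass
class Trajectory:
    # Conformity seeding: intent constraints recorded before the session starts.
    seeded_constraints: list[str]
    steps: list[Step] = field(default_factory=list)

    def add(self, step: Step) -> None:
        self.steps.append(step)

    def drifting_steps(self) -> list[int]:
        """Reasoning monitoring (toy version): flag steps whose justification
        mentions none of the seeded intent constraints."""
        return [
            s.step_id
            for s in self.steps
            if not any(c.lower() in s.justification.lower()
                       for c in self.seeded_constraints)
        ]

traj = Trajectory(seeded_constraints=["payment module stays isolated"])
traj.add(Step(1, "Refactor so the payment module stays isolated", []))
traj.add(Step(2, "Inline payment helpers into the checkout view", parents=[1]))
print(traj.drifting_steps())  # → [2]: step 2 never cites the seeded constraint
```

The point of the sketch is the shape of the data, not the matching heuristic: because each step carries its parent edges, a reviewer can walk backward from a flagged step to the decision that introduced the drift, which is what the causal graph interface is meant to surface visually.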
The Gap in Current Tools: SWE-Bench, Claude Code Skills, and PR Bots
The paper does not claim SWE-bench is broken. It argues the benchmark is incomplete for measuring long-horizon coherence: a passing score confirms the patch solved the issue, not that the agent reached the solution without compromising architectural integrity. For teams running agentic sessions overnight or across dozens of files, that distinction is the difference between a clean merge and a refactor request three weeks later.
Claude Code’s documentation illustrates the current best-practice ceiling. It emphasizes per-step verification through tests, screenshots, and expected outputs, and recommends aggressive context management via /clear, /compact, and subagents. These are effective tactics for keeping individual steps bounded and reviewable. What they do not address is whether the tenth step still serves the intent established in the first. There is no cumulative-coherence metric, no trajectory-level dashboard, no equivalent of a code-review tool that operates across the entire agent session rather than per diff.
The Cost Shift: From Compilation Checks to Intent Verification
The practical implication is a relocation of the bottleneck. When agents were confined to single-file edits or short scripts, the dominant cost was “does this compile” and “do these tests pass.” As long-horizon runs become standard, the expensive failures happen higher up the stack: an agent that faithfully implements a feature request while quietly breaking the domain model, or that patches a bug by duplicating logic the architecture deliberately centralized.
Without intent-level telemetry, the only way to catch this drift is human review of the entire session—precisely the labor agents were supposed to reduce. The cost does not disappear; it shifts from writing code to reconstructing what the agent was thinking across forty steps. The paper’s framework suggests that the next generation of agent tools will need to make that reasoning as inspectable as a git diff, or teams will spend their time undoing what the agent did while no one was watching.
What Teams Should Do Before a Cumulative-Coherence Metric Exists
Until conformity seeding and reasoning monitoring are available as product features, teams are left with adaptations of existing workflows. The Claude Code best-practices guidance offers one partial defense: break long tasks into smaller subagent sessions, verify state explicitly at each boundary, and compact context before it accumulates irrelevant history. These tactics reduce the window in which entropy can compound, even if they do not measure it directly.
Teams can also borrow from the paper’s conceptual model by writing down architectural intent as explicit, checkable constraints before starting an agent session, then reviewing the full session trajectory against those constraints rather than reviewing diffs in isolation. It is manual, it is slower than trusting the green checkmarks, and it is currently the only available counterweight to a failure mode that diff-based review is not designed to detect.
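One concrete way to write intent down as a checkable constraint is a layering rule evaluated over every file the session touched, rather than per diff. The rule, paths, and module names below are hypothetical examples, not anything the paper or any tool prescribes; a minimal sketch assuming the constraint "domain code must not import infrastructure directly":

```python
import re

# Hypothetical intent constraint, written down before the session starts:
# nothing under app/domain/ may import from app.infra directly.
FORBIDDEN = re.compile(r"^\s*(from|import)\s+app\.infra", re.MULTILINE)

def violations(touched: dict[str, str]) -> list[str]:
    """touched maps each file path the agent session changed to its final
    contents. Returns the paths that break the seeded constraint, checked
    over the whole session rather than diff by diff."""
    return [
        path for path, text in touched.items()
        if path.startswith("app/domain/") and FORBIDDEN.search(text)
    ]

session = {
    "app/domain/order.py": "from app.infra.db import connect\n",  # violation
    "app/domain/price.py": "from decimal import Decimal\n",
    "app/api/routes.py":   "from app.infra.db import connect\n",  # allowed layer
}
print(violations(session))  # → ['app/domain/order.py']
```

Each diff in `session` could pass review on its own; only checking the touched set against the written-down constraint reveals that the trajectory crossed a boundary the architecture was supposed to preserve. Real constraints would need a proper import parser rather than a regex, but the workflow (constraints first, trajectory review against them after) is the part worth copying.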
Frequently Asked Questions
How is agentic entropy different from a single bad diff or buggy patch?
Agentic entropy is cumulative structural drift across a sequence of steps, not a flaw in any individual diff. Each step can compile, pass tests, and satisfy review while the overall trajectory still diverges from architectural intent.
What can teams do now to limit agentic entropy without new tools?
Teams can break long tasks into shorter subagent sessions, verify state at each boundary, and compact context regularly. They can also write down architectural intent as explicit constraints before a session and review the full trajectory against them rather than checking diffs in isolation.
Does the paper claim SWE-bench is broken?
No. The paper argues SWE-bench is incomplete for measuring long-horizon coherence because a passing score confirms the patch solved the issue, not that the agent preserved architectural integrity throughout the session.
Is the three-pillar framework available as a downloadable product?
No, it is a position paper describing what intent-level telemetry should look like. The framework has not been shipped as an implementation teams can use today.
What would a cumulative-coherence metric change about current coding tools?
It would shift verification from per-step compilation and test passing to trajectory-level intent alignment, making an agent’s reasoning across an entire session as inspectable as a single git diff.