groundy

How Much Repo Structure Does a Coding Agent Actually Need?

ISSTA 2026 ablation: lightweight call topology halves code agent run variance, and forward edges in hub-heavy repos degrade results. More structure stops paying off fast.

A paper accepted to ISSTA 2026 (arXiv:2606.26979) runs the first systematic ablation of exactly this question. The answer: lightweight call/inheritance topology, injected as plain-text comments, reduces run-to-run variance and shortens agent trajectories. Beyond that threshold, adding more structure stops paying off, and in hub-heavy repositories, adding forward edges actively degrades results. The practical line is lower than most teams are drawing it.

Why code agent navigation is stochastic by default

Without static context, a coding agent navigating a repository runs a probabilistic path search at every step. It decides which file to open, which function to read, and which cross-reference to follow, and those decisions vary across runs even with identical inputs. The variance compounds: one different early choice changes which files are read, which in turn changes what context is available for the next decision. Two identical requests against the same codebase can produce trajectories that diverge significantly by step three.

This is a navigation problem, not strictly a retrieval problem. Even agents with access to semantic search must still decide what to search for, in what order, and when to stop. Without a structural anchor, that decision space stays open on every step, and the randomness in model sampling means it shifts run to run.

The cost surfaces in two distinct places. First, lower Pass@1 rates: the agent picks a wrong early path and doesn’t recover. Second, run-to-run inconsistency that makes evaluation expensive. A team trying to measure agent performance against a benchmark needs multiple passes to distinguish signal from variance, and each extra pass costs tokens, latency, and money. If your agent is navigating a 500-file codebase across 10-20 sequential model invocations per task, that instability isn’t an eval inconvenience; it’s a line item.

What is deterministic anchoring, and what isn’t it?

Deterministic anchoring, as defined in the ISSTA 2026 paper, is the injection of lightweight static topology (call graphs, inheritance relationships) as plain-text comment annotations directly in source files. The mechanism is minimal: an agent reading a function sees, as a comment, a list of its callers and callees. This constrains the navigation decision without expanding the retrieval system or introducing a separate preprocessing pipeline.
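As a rough illustration of the mechanism, the sketch below uses Python's standard `ast` module to derive caller and callee lists for the top-level functions of a single module and writes them as comments above each `def`. The `# callers:` / `# callees:` comment format, the function names, and the single-module scope are assumptions made for this example; the paper's own annotation format and tooling are not reproduced here.

```python
import ast


def build_callees(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the module-level functions it calls."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    callees: dict[str, set[str]] = {name: set() for name in defined}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        for sub in ast.walk(node):
            # Only direct calls by bare name are resolved; attribute calls
            # and dynamic dispatch are ignored in this sketch.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                if sub.func.id in defined:
                    callees[node.name].add(sub.func.id)
    return callees


def invert(callees: dict[str, set[str]]) -> dict[str, set[str]]:
    """Turn forward edges (who-do-I-call) into inverse edges (who-calls-me)."""
    callers: dict[str, set[str]] = {name: set() for name in callees}
    for caller, targets in callees.items():
        for target in targets:
            callers[target].add(caller)
    return callers


def annotate(source: str) -> str:
    """Insert caller/callee comments directly above each top-level def."""
    callees = build_callees(source)
    callers = invert(callees)
    tree = ast.parse(source)
    def_lines = {n.lineno: n.name for n in tree.body
                 if isinstance(n, ast.FunctionDef)}
    out = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if lineno in def_lines:
            name = def_lines[lineno]
            out.append(f"# callers: {', '.join(sorted(callers[name])) or '-'}")
            out.append(f"# callees: {', '.join(sorted(callees[name])) or '-'}")
        out.append(line)
    return "\n".join(out)


EXAMPLE = '''\
def load(path):
    return parse(read(path))

def read(path):
    return open(path).read()

def parse(text):
    return text.split()
'''

ANNOTATED = annotate(EXAMPLE)
```

In `ANNOTATED`, `load` is preceded by `# callers: -` and `# callees: parse, read`, while `read` and `parse` each list `load` as their only caller. A real pipeline would need cross-file resolution and inheritance edges, which this sketch leaves out.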

This is distinct from what Aider calls a “repo map” (a higher-level file-and-function outline generated and injected once at context start) and from semantic embedding retrieval. The authors are making a specific claim about topology-as-inline-comments: the structure lives in the files the agent is already reading, not in a separate context document. Don’t conflate the two when evaluating whether this applies to your stack.

The core finding is also specific in a way that matters for practitioners: injecting topology didn’t substantially change what the agent could figure out. It changed how consistently the agent navigated to the right place. The authors call this the “deterministic anchoring effect”: structure’s value is reproducibility, not raw capability.

Does it actually work, and by how much?

Yes, with an important caveat about what “works” means here.

Using Codex as the baseline agent, lightweight call/inheritance topology improved function-level localization by +2.2 percentage points on Func@5 and shortened agent trajectories by 1.6 interaction rounds. Structural tags raised link-following rates from 0.15-0.18 to 0.21-0.24 and improved Pass@1 by +3.4pp on medium-scale repositories. Input token overhead: approximately 10% more than baseline.

The +2.2pp improvement on Func@5 sounds modest. The more practically significant result is that run-to-run spread roughly halved. For teams running evaluations or CI pipelines where each invocation costs real money, halved variance means you need far fewer passes to produce a stable success-rate estimate. If you’re currently running five passes to average out stochasticity, you might get away with three.
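The five-to-three estimate can be checked with a small calculation. The sketch below assumes runs are independent, so the standard error of a mean success rate is sqrt(variance / n) and the number of passes needed for a fixed standard error scales linearly with per-run variance. Whether "spread" means variance or standard deviation is not settled by the text above, so the hypothetical helper `passes_after_anchoring` takes that as a parameter.

```python
import math


def passes_after_anchoring(baseline_passes: int, spread_ratio: float,
                           spread_is_std: bool) -> int:
    """Passes needed to keep the same standard error on the mean success rate.

    spread_ratio is anchored spread divided by baseline spread (0.5 = halved).
    If the spread is a standard deviation, the variance ratio is its square.
    """
    variance_ratio = spread_ratio ** 2 if spread_is_std else spread_ratio
    return math.ceil(baseline_passes * variance_ratio)


# Halved variance: 5 passes shrink to 3.
halved_variance = passes_after_anchoring(5, 0.5, spread_is_std=False)

# Halved standard deviation: 5 passes shrink to 2.
halved_std = passes_after_anchoring(5, 0.5, spread_is_std=True)
```

Under the halved-variance reading the result matches the three passes suggested above; if the halving refers to standard deviation, the saving would be larger.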

A January 2026 paper studying a complementary approach, the Repository Intelligence Graph (arXiv:2601.10112), tested deterministic architectural maps across Claude Code, Cursor, and Codex and found mean accuracy improvements of 12.2% with a 53.9% reduction in completion time. Gains were roughly 2.5x larger in multilingual repositories than single-language ones, which the authors attributed to the higher structural complexity that cross-language dependency tracking creates. The RIG paper also noted something directionally consistent with the ISSTA findings: structured context shifted agent failure modes from structural misunderstandings toward reasoning mistakes over a correct structure. The errors got more interesting, not just less frequent, which is arguably what you want, since reasoning errors are easier to address through prompting and few-shot examples than navigation errors are.

The two studies use different agent baselines and different context representations. Their numbers can’t be compared directly. But both reinforce the same directional claim: deterministic structural context helps agents navigate.

When do forward edges backfire?

In hub-heavy repositories, they actively hurt.

The ISSTA paper found that denser semantic annotations show diminishing returns as repository size increases, and for codebases where a small number of functions are called by many others, the inverse-only configuration (who-calls-me) outperformed the bidirectional one. Adding forward edges (who-do-I-call) on top of inverse edges degraded results rather than extending the benefit.

The mechanism is straightforward: in a hub-heavy codebase, the forward-edge list for a central utility function is extremely long. Injecting it as a comment floods the local context with potentially irrelevant callees, and the agent’s navigation gets noisier. The inverse direction (callers only) remains consistently useful because it constrains impact scope (which call sites would be affected by a change), and that’s the information relevant to most code-modification tasks regardless of repository size.

This is the finding that most directly complicates the “more structure = better” intuition from the earlier RIG work. The RIG paper found substantial gains from structured context across multilingual repositories but didn’t ablate which edges to include. The new study does, and for large repos the answer is: fewer.

Does the 10% token overhead matter?

Against the baseline it’s replacing, not much.

A 2026 token-cost analysis for coding agents estimated naïve grep-and-read navigation at approximately 95,000 tokens per query on a 500-file codebase, versus roughly 1,900 tokens for a retrieval-first approach using static embedding and BM25 fusion. That’s a 50x difference in baseline. Against it, a 10% overhead for anchoring annotations is negligible: you spend a little more on each file read in exchange for reading fewer files and completing in fewer rounds.

The math improves when you account for trajectory compression across the full task. Enterprise inference budgets in early 2026 were dominated by token volume rather than per-token price, and a single agentic task routinely triggered 10-20 sequential model invocations. Cutting the trajectory by 1.6 rounds compresses total token spend across the entire chain, not just one step. The per-file annotation overhead gets paid back through trajectory reduction.
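A back-of-the-envelope version of that trade-off is sketched below. It rests on a deliberately crude assumption that is not from the paper: every round costs the same number of input tokens, and the 10% overhead applies uniformly to each round. Output tokens and latency are not modeled, and the function names are hypothetical.

```python
def relative_task_cost(baseline_rounds: float, rounds_saved: float = 1.6,
                       overhead: float = 0.10) -> float:
    """Anchored input-token cost divided by baseline cost for one task.

    Assumes equal input tokens per round (a simplification).
    """
    return (1 + overhead) * (baseline_rounds - rounds_saved) / baseline_rounds


def break_even_rounds(rounds_saved: float = 1.6,
                      overhead: float = 0.10) -> float:
    """Baseline trajectory length at which overhead and savings cancel.

    Solves (1 + overhead) * (R - rounds_saved) = R for R.
    """
    return rounds_saved * (1 + overhead) / overhead


cost_at_10 = relative_task_cost(10)   # about 0.92: anchoring is cheaper
cost_at_20 = relative_task_cost(20)   # about 1.01: roughly a wash
crossover = break_even_rounds()       # about 17.6 rounds
```

Under this model the input-token payback is clearest for shorter trajectories and thins out near the long end of the 10-20 range, where the unmodeled savings (output tokens, latency, fewer retries) would have to carry more of the benefit.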

The caveat: this arithmetic only holds on the lightweight side of the granularity curve. Dense semantic tags that balloon per-file input by 50-100% shift the calculation quickly in the other direction.

Where should you draw the anchoring line?

The ISSTA study gives three concrete decision rules organized by repository type.

Medium-scale repositories: Default to lightweight call/inheritance topology. Inject as plain-text comments, include both forward and inverse edges, and expect roughly halved run-to-run variance and trajectories about 1.6 rounds shorter. This is the regime where anchoring’s full benefit materializes without the noise that scale introduces.

Large repositories, especially hub-heavy ones: Prune forward edges. Inverse-only (who-calls-me) annotations preserve the structural benefit without the noise overhead that long call-out lists create for high-fan-out functions. For functions with unusually large fan-in, consider omitting them from the annotation pass entirely; their caller lists would flood context without adding navigation value.

Implicit-dependency codebases: Reserve dense semantic tags for code where call topology systematically misses the real coupling: configuration-driven initialization, reflection-heavy frameworks, and dependency injection containers. For these systems, plain call graphs are incomplete by construction: the DI container wires things that have no direct call edge, and framework annotations drive behavior that never appears in a function call. In these contexts, the semantic layer earns its token cost; for everything else, it doesn’t.
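The three rules above can be written down as a small policy function. This is one possible reading, not the paper's procedure: the `RepoProfile` fields, the policy keys, and especially the cutoffs `LARGE_REPO_FILES` and `HUB_FAN_IN` are hypothetical placeholders that would need calibrating on your own repository.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoffs for illustration only; the study gives no universal values.
LARGE_REPO_FILES = 2000
HUB_FAN_IN = 50


@dataclass(frozen=True)
class RepoProfile:
    file_count: int
    max_fan_in: int              # most callers recorded for any one function
    implicit_dependencies: bool  # DI containers, reflection, config-driven wiring


@dataclass(frozen=True)
class AnchoringPolicy:
    inverse_edges: bool               # who-calls-me comments
    forward_edges: bool               # who-do-I-call comments
    dense_semantic_tags: bool         # reserved for implicit coupling
    skip_fan_in_above: Optional[int]  # leave extreme hubs unannotated


def choose_policy(profile: RepoProfile) -> AnchoringPolicy:
    is_large = profile.file_count >= LARGE_REPO_FILES
    hub_heavy = profile.max_fan_in >= HUB_FAN_IN
    return AnchoringPolicy(
        inverse_edges=True,  # kept in every regime
        forward_edges=not (is_large or hub_heavy),
        dense_semantic_tags=profile.implicit_dependencies,
        skip_fan_in_above=HUB_FAN_IN if hub_heavy else None,
    )


medium = choose_policy(RepoProfile(400, 12, implicit_dependencies=False))
hubby = choose_policy(RepoProfile(6000, 300, implicit_dependencies=False))
di_app = choose_policy(RepoProfile(400, 12, implicit_dependencies=True))
```

Here `medium` keeps both edge directions, `hubby` drops forward edges and skips functions above the fan-in cutoff, and `di_app` adds dense semantic tags on top of the lightweight topology.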

What the paper doesn’t give you is a universal cutoff: a file count or edge density at which to switch from medium to large repo strategy. Those numbers are model- and codebase-specific, and calibrating them requires at least one ablation pass on your own repository.

The second-order effect worth tracking: if you’re running agents in CI or evaluation pipelines where multiple passes are expected, variance reduction may be more valuable than the raw accuracy lift. A consistent agent is a measurable agent. Teams running three or five agentic passes to average out stochastic behavior are paying for noise that lightweight static topology largely eliminates at a 10% annotation overhead. The answer to how much repo structure a coding agent needs is smaller than the “pipe everything into context” instinct suggests, and what it costs you to find that answer is one ablation run.

Frequently Asked Questions

Does lightweight topology anchoring still apply if the codebase is primarily dynamically typed?

Static call graphs only capture edges resolvable at parse time. For Python or Ruby codebases heavy with metaclasses, monkey-patching, or dynamic dispatch, a large fraction of actual runtime call edges will be absent from annotations. The ISSTA study does not report results on dynamically typed repositories. Its own guidelines treat implicit-dependency systems as a separate case requiring dense semantic tags, which may cover the majority of a heavily dynamic codebase rather than a narrow corner of it. In those codebases, lightweight topology alone is an insufficient primary annotation strategy.

How does per-file inline anchoring compare to a session-level global repo map like Aider’s for cross-cutting tasks?

A global repo map covers all files but inserts topology once at context start and at coarse granularity. Inline anchoring provides precise call-edge detail only for files the agent actually opens. For cross-cutting tasks touching dozens of unrelated files, a global map gives structural overview before any file is read; for localized tasks contained within a small call cluster, inline anchoring’s local precision tends to outperform. Neither the ISSTA paper nor the RIG study compares the two approaches directly on cross-cutting versus localized task distributions.

What metric limitation should teams account for when interpreting the Func@5 improvement numbers?

Func@5 measures whether the correct function appears in the top-five navigation choices, which is a localization benchmark, not a code-correctness benchmark. An agent that navigates to the right function still needs to produce the right edit. The +3.4pp Pass@1 result suggests some correctness benefit transfers, but Pass@1 conflates navigation success with edit quality. Teams running agents on complex refactoring tasks, where the hard part is reasoning about the change rather than locating the target function, should expect smaller gains than the Func@5 numbers imply.
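To make the distinction concrete, the sketch below computes both metrics over a toy set of tasks. It follows the definition given above (a hit if any correct function appears in the top k candidates); the paper's exact scoring rule may differ, and the `Task` record and sample data are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    ranked: list[str]   # functions the agent localized, best first
    gold: set[str]      # functions that actually needed editing
    passed: bool        # did the single attempt pass the task's tests?


def func_at_k(task: Task, k: int = 5) -> bool:
    """True if any gold function appears among the top-k candidates."""
    return any(name in task.gold for name in task.ranked[:k])


def func_at_k_rate(tasks: list[Task], k: int = 5) -> float:
    return sum(func_at_k(t, k) for t in tasks) / len(tasks)


def pass_at_1(tasks: list[Task]) -> float:
    return sum(t.passed for t in tasks) / len(tasks)


TASKS = [
    Task(["a", "b", "c"], {"b"}, passed=True),
    # Correct function is ranked sixth, so it misses the top five.
    Task(["x", "y", "z", "w", "v", "target"], {"target"}, passed=False),
    # Localized correctly, but the edit itself was wrong.
    Task(["m"], {"m"}, passed=False),
    Task(["p", "q"], {"q"}, passed=True),
]

LOCALIZATION = func_at_k_rate(TASKS)  # 0.75
CORRECTNESS = pass_at_1(TASKS)        # 0.5
```

The third task is the case the answer above warns about: it counts toward Func@5 but not toward Pass@1.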

Would agents with richer tool access, such as live language server calls, make pre-annotated static topology unnecessary?

Potentially. The anchoring approach works specifically because topology is injected without extra tool invocations during the task. If an agent could issue a cheap get-callers or get-callees call against a live language server, it would replicate the inverse-edge benefit dynamically without any annotation maintenance overhead. Cursor’s codebase indexing and GitHub’s code navigation tools are moving in this direction. The 10% per-file token overhead from static annotations would then represent a maintenance liability with no net benefit over on-demand tool calls, making the annotation approach an interim solution rather than a durable design.
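For a sense of what such an on-demand call looks like, the Language Server Protocol already defines a call-hierarchy pair: `textDocument/prepareCallHierarchy` followed by `callHierarchy/incomingCalls` (or `callHierarchy/outgoingCalls`). The sketch below only builds the framed JSON-RPC messages; it does not start or talk to a server. The file URI, position, and helper names are hypothetical, and whether a given agent exposes this as a tool is an assumption.

```python
import json


def frame(method: str, params: dict, request_id: int) -> bytes:
    """Encode one JSON-RPC request with the LSP Content-Length header."""
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def prepare_call_hierarchy(uri: str, line: int, character: int,
                           request_id: int = 1) -> bytes:
    """Step 1: resolve the symbol at a position into a CallHierarchyItem."""
    return frame("textDocument/prepareCallHierarchy", {
        "textDocument": {"uri": uri},
        "position": {"line": line, "character": character},
    }, request_id)


def incoming_calls(item: dict, request_id: int = 2) -> bytes:
    """Step 2: ask who calls the item returned by step 1 (inverse edges)."""
    return frame("callHierarchy/incomingCalls", {"item": item}, request_id)


# Hypothetical file and cursor position, for illustration only.
REQUEST = prepare_call_hierarchy("file:///repo/src/utils.py", 41, 4)
```

The two-step shape matters for cost: each caller lookup is at least two round trips to the server, which is the overhead that pre-injected comments avoid.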

sources · 4 cited

  1. “Agent Token Cost Optimization in 2026: Cut AI Inference Spend by 60-80%.” agentmarketcap.ai, analysis. Accessed 2026-06-26.