Do AGENTS.md Files Actually Help Coding Agents? A New Benchmark Tests It

No, not on task success. A recent benchmark finds that providing AGENTS.md context files does not generally improve coding-agent success rates, while increasing inference cost by over 20% on average (arXiv:2602.11988). The assumption baked into the 60,000-plus repositories now shipping these files, that curated context is free value, does not survive measurement.

What did AGENTbench actually measure?

AGENTbench tests whether repository-level context files change how often coding agents resolve real GitHub issues, holding the agent and model fixed while swapping the context file in and out. Posted to arXiv in February 2026 by Thibaud Gloaguen and co-authors, it evaluates issues drawn from Python repositories that ship developer-committed context files (arXiv:2602.11988).

Each instance runs under three conditions: no context file at all, an LLM-generated file produced by the agent’s own /init command, and the developer-written file already in the repo. Multiple frontier coding agents across different underlying LLMs were evaluated. To stress-test the LLM-generated case on popular codebases, the authors also ran agents on SWE-bench Lite, an established benchmark sourced from well-known Python repositories, none of which contain a developer-written context file. The headline metric is success rate: the share of instances where the agent’s patch makes the test suite pass.
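The three-condition comparison reduces to one success rate per condition. A minimal Python sketch of that bookkeeping follows; the condition labels and the run records are made up for illustration and are not data from the paper.

```python
from collections import defaultdict

# Hypothetical run records: (instance_id, condition, tests_passed).
# Labels and outcomes are illustrative only, not results from AGENTbench.
RUNS = [
    ("issue-1", "none", True),
    ("issue-1", "llm_generated", True),
    ("issue-1", "developer_written", True),
    ("issue-2", "none", False),
    ("issue-2", "llm_generated", False),
    ("issue-2", "developer_written", True),
    ("issue-3", "none", True),
    ("issue-3", "llm_generated", False),
    ("issue-3", "developer_written", True),
    ("issue-4", "none", False),
    ("issue-4", "llm_generated", False),
    ("issue-4", "developer_written", False),
]


def success_rates(runs):
    """Return {condition: share of instances whose patch made the tests pass}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for _instance_id, condition, tests_passed in runs:
        total[condition] += 1
        passed[condition] += int(tests_passed)
    return {condition: passed[condition] / total[condition] for condition in total}


if __name__ == "__main__":
    for condition, rate in success_rates(RUNS).items():
        print(f"{condition}: {rate:.0%}")
```

Because every instance appears once under each condition, the rates are directly comparable: the only thing that varies between columns is the context file.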

The methodology matters because it sidesteps two failure modes that had left the question open. Context files did not exist when SWE-bench was built, and popular repos are unrepresentative, since agents already know them cold. AGENTbench deliberately mines niche repositories where a context file plausibly has something to add, which is the friendliest possible test for the format.

How big is the effect on task success?

Small and inconsistent, and that is the finding. Context files do not generally improve task success rates, and the result holds across different underlying LLMs, different generation prompts, and for both LLM-generated and developer-committed files (arXiv:2602.11988).

The cost, by contrast, is real. Inference cost rises by over 20% on average, whether the file was written by a developer or generated by the agent itself. More steps, more tokens, more dollars, for a success-rate change that does not consistently materialize.

Condition	Effect on task success	Inference cost	Applies to
No context file	baseline	baseline	any repo
LLM-generated (`/init`)	no consistent improvement	over 20% higher on average	AGENTbench and SWE-bench Lite repos
Developer-written	modest, inconsistent	over 20% higher on average	repos that already ship one

Why do LLM-generated AGENTS.md files make agents worse?

Because they duplicate documentation the model already has, and because the act of reading and obeying them burns inference budget on content that was already available.

The paper confirms that instructions in context files are well followed by agents (arXiv:2602.11988). The model is not ignoring the file. It is reading it, taking it seriously, and spending compute to comply. When the file largely restates what existing documentation already says, the spend is pure overhead.

That is the tension the format creates: the agent obeys, the agent pays an inference bill that is over 20% higher on average, and the improvement does not arrive.

Do AGENTS.md files help agents navigate the repository?

No, and this is the result that most directly undercuts the format’s own guidance. A common recommendation, embedded in agent generation prompts and present in many of the developer-written files studied, is to include a repository overview that enumerates directories and subdirectories. The authors measured whether such overviews help agents find the relevant files faster, specifically the number of steps before the agent first touches a file modified by the golden patch.

Context files did not reduce that metric, on either benchmark (arXiv:2602.11988). The overview section, in other words, is cargo-cult content. Agents locate the code they need by grepping and reading, the same way they would without the file. One trace inspection found an agent burning extra steps on commands that located and re-read the context file already sitting in its context window, a behavior that appeared only when a context file was present.
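The navigation metric, steps before the agent first touches a file modified by the golden patch, is simple to state in code. This is a sketch under assumed data shapes: the trace format (one list of touched paths per step) and the example paths are hypothetical, not the paper's logging format.

```python
def steps_before_first_golden_file(trace, golden_files):
    """Count the steps taken before the agent first touches a golden-patch file.

    trace: one entry per agent step, each an iterable of file paths touched
           (read or edited) in that step. Assumed format, for illustration.
    golden_files: paths modified by the reference ("golden") patch.
    Returns the number of preceding steps, or None if no golden file is touched.
    """
    golden = set(golden_files)
    for step_index, touched in enumerate(trace):
        if golden & set(touched):
            return step_index
    return None


# Hypothetical trace: the agent reads docs, greps, then opens the right file.
TRACE = [
    ["README.md"],
    ["AGENTS.md"],
    ["src/app.py", "src/util.py"],
    ["src/core.py"],
]
GOLDEN = ["src/core.py"]

if __name__ == "__main__":
    print(steps_before_first_golden_file(TRACE, GOLDEN))
```

A context file that helped navigation would push this number down relative to the no-file condition; the reported result is that it did not.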

The useful thing a context file does is name tools. Repository-specific tooling was used 2.5 times per instance when mentioned in the file, versus fewer than 0.05 times when not mentioned; uv was used 1.6 times per instance when mentioned versus under 0.01 when not (arXiv:2602.11988). That is a real, large effect. It just is not about navigation. It is about telling the agent which non-obvious command to run.
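A per-instance usage figure like the ones above can be computed by splitting instances on whether the context file mentions the tool, then averaging how often the agent invoked it. The sketch below assumes a simplified record shape (context file text plus a list of shell commands) and counts a command as a tool call only when the tool is its first token; both choices are illustrative assumptions, not the paper's exact procedure.

```python
import re


def mean_tool_calls(instances, tool):
    """Average calls to `tool` per instance, split by context-file mention.

    instances: dicts with "context_file" (str) and "commands" (list of str).
    Returns {"mentioned": mean or None, "not_mentioned": mean or None}.
    """
    mention = re.compile(rf"\b{re.escape(tool)}\b")
    counts = {"mentioned": [], "not_mentioned": []}
    for instance in instances:
        group = "mentioned" if mention.search(instance["context_file"]) else "not_mentioned"
        # A command counts as a call when the tool is the first token.
        calls = sum(1 for cmd in instance["commands"] if cmd.split()[:1] == [tool])
        counts[group].append(calls)
    return {group: (sum(v) / len(v) if v else None) for group, v in counts.items()}


# Hypothetical instances, for illustration only.
INSTANCES = [
    {"context_file": "Run tests with uv run pytest",
     "commands": ["uv run pytest", "uv sync", "grep -r foo ."]},
    {"context_file": "",
     "commands": ["pytest", "pip install -e ."]},
    {"context_file": "Use uv for dependencies",
     "commands": ["uv run pytest"]},
]

if __name__ == "__main__":
    print(mean_tool_calls(INSTANCES, "uv"))
```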

Doesn’t another study show AGENTS.md saves time and tokens?

A second paper does, and reconciling the two is the part of this story most existing coverage flattens. An efficiency-focused study evaluated Codex across 10 repositories and 124 pull requests, and found that the presence of an AGENTS.md file was associated with a 28.64% lower median runtime and 16.58% lower output-token consumption, while maintaining comparable task completion (arXiv:2601.20404).

Those numbers look like they contradict the AGENTbench result. They do not, because they measure a different thing. The efficiency study’s dependent variable is wall-clock time and token cost; the AGENTbench study’s is test-suite success rate. An AGENTS.md file can simultaneously save the agent time on the easy part of a task, by pointing it at the right build command up front, and cost it success on the hard part, by injecting redundant requirements that misdirect reasoning. “It runs faster” and “it succeeds more often” are different claims, and the two papers are each internally consistent.

There is a second caveat the authors flag explicitly: AGENTbench is Python-only, and Python is heavily represented in training data, so models may already know its tooling well enough to nullify the context file’s value. The case where AGENTS.md arguably matters most, niche languages, proprietary codebases, and unusual toolchains, is out of scope. Treat the negative result as “no measured benefit on open-source Python,” not as “no benefit anywhere.”

What should you actually put in an AGENTS.md file?

Only the constraints the agent cannot discover by reading the repo, paired with deterministic enforcement so it cannot ignore them even if it tries. The paper’s prescription is blunt: omit LLM-generated files, and in human-written files state only minimal, non-discoverable requirements, then evaluate any performance claim before trusting it (arXiv:2602.11988).

Concretely, that means: the environment variable the test suite needs, the non-standard test invocation that is not pytest, the linter rule that overrides the default, the internal tool the agent will not find in node_modules. It does not mean a prose tour of the directory structure, and it especially does not mean the output of /init, which the data shows is a net negative on popular repos that already have documentation. The format went from an OpenAI proposal in August 2025 to a Linux Foundation Agentic AI Foundation donation alongside MCP in December 2025 to native support across Codex, Cursor, Copilot, Gemini CLI, Aider, Windsurf, and Zed (codersera). Adoption is not the same as efficacy.

The deeper lesson is about where authority should live in an agentic codebase. A context file is a suggestion the model reads and may follow at token cost. A linter, a type checker, and a pre-commit hook are constraints the model cannot route around. The data shows the suggestion is, on average, not worth its cost. The deterministic layer has no such measured downside, and it works whether or not the agent bothers to read your AGENTS.md at all.
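To make the contrast concrete, here is a minimal sketch of a deterministic gate of the kind a pre-commit hook or CI step would run. The check names and commands are stand-ins so the example runs anywhere; a real project would list its own linter, type checker, and test invocations, and nothing here is prescribed by the paper.

```python
import subprocess
import sys


def run_checks(checks):
    """Run each (name, argv) check and return the names that exited non-zero."""
    failed = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


def exit_code(checks):
    """Return 1 if any check fails, else 0, suitable for use as a hook's exit status."""
    return 1 if run_checks(checks) else 0


# Placeholder commands (hypothetical): each just exits with a fixed status.
# Swap in the project's real lint, type-check, and test commands.
CHECKS = [
    ("always-passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("always-fails", [sys.executable, "-c", "raise SystemExit(1)"]),
]
```

The gate never consults the agent or the context file. A change either passes every check or is rejected, which is what makes the constraint impossible to route around.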

Frequently Asked Questions

What are the specific success-rate deltas for LLM-generated versus developer-written AGENTS.md files?

LLM-generated context files decreased agent success rates by 2-3% on average in the AGENTbench results, while developer-written files raised them by roughly 4% on average. Both came with the same 20%+ inference cost increase, so even the human-curated case trades an average gain of about 4 percentage points for an average cost penalty of over 20%.
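Whether that trade pays off depends on the baseline success rate. A back-of-the-envelope sketch: if cost per solved issue is cost per attempt divided by success rate, then a file that multiplies cost by m and adds g to the success rate lowers cost per solved issue only when the baseline rate p satisfies m / (p + g) < 1 / p, which rearranges to p < g / (m - 1). The figures below take the 4-point gain and a 1.20 multiplier as assumptions; the source says "over 20%", so 1.20 is the optimistic end.

```python
def cost_per_success(success_rate, cost_per_attempt=1.0):
    """Expected cost per solved issue, in units of one baseline attempt."""
    if success_rate <= 0:
        raise ValueError("success_rate must be positive")
    return cost_per_attempt / success_rate


def break_even_baseline(gain, cost_multiplier):
    """Baseline success rate below which the context file lowers cost per success.

    Solves cost_multiplier / (p + gain) = 1 / p for p.
    """
    return gain / (cost_multiplier - 1.0)


GAIN = 0.04             # assumed: +4 percentage points (developer-written case)
COST_MULTIPLIER = 1.20  # assumed: exactly +20%, the low end of "over 20%"

if __name__ == "__main__":
    print(f"break-even baseline: {break_even_baseline(GAIN, COST_MULTIPLIER):.2f}")
    for baseline in (0.10, 0.40):  # hypothetical baseline success rates
        without_file = cost_per_success(baseline)
        with_file = cost_per_success(baseline + GAIN, COST_MULTIPLIER)
        print(f"baseline {baseline:.0%}: {without_file:.2f} without, {with_file:.2f} with")
```

Under these assumptions the break-even baseline is roughly 20%: an agent that already solves more than about one issue in five pays more per solved issue with the file than without it, and any cost increase above 20% pushes the threshold lower still.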

Is there any condition where LLM-generated AGENTS.md files actually improve agent performance?

Yes, in one condition: when the repository lacks a README or other existing documentation. Removing those docs allowed LLM-generated AGENTS.md files to improve agent performance by 2.7% in the AGENTbench analysis. The typical case is a project that already has documentation, where the generated file duplicates it and adds only overhead.

How much of the 20%+ inference cost increase comes from reasoning rather than just carrying extra tokens in context?

The mechanism analysis attributes 14-22% more reasoning tokens specifically to parsing the context file, distinct from the baseline overhead of holding the file in context. The agent spends additional compute deciding how to act on the instructions, and this overhead appears regardless of whether the file contains genuinely new information.

Does AGENTS.md have any practical advantage over tool-specific formats like CLAUDE.md or Cursor rules?

The main advantage is portability. AGENTS.md is read natively by at least seven tools (Codex, Cursor, Copilot, Gemini CLI, Aider, Windsurf, Zed) without per-tool configuration, while CLAUDE.md activates only in Claude Code and Cursor rules only in Cursor. For codebases where multiple agents operate, a single AGENTS.md avoids duplicating instructions across formats. The AGENTbench cost-to-success finding applies to all these tools equally; portability does not improve the tradeoff, it only reduces format maintenance overhead.