The paper that launched a thousand procurement decks
A dose-response analysis of GitHub Copilot usage intensity, submitted to arXiv in late May 2026, correlates how heavily developers use Copilot with measurable output. The framing is familiar: more Copilot, more productivity. But the study design is observational, and the paper’s own title signals as much. It measures the correlation between dose and response without randomizing who receives the dose. That distinction undercuts most of the causal claims likely to be extracted from its abstract over the next procurement cycle.
Observational Copilot studies share one structural blind spot
The confound is selection bias, and it is not subtle. Developers who use Copilot most intensely are not a random sample drawn from the engineering population. They are self-selected along axes the study cannot observe: comfort with AI tooling, type of work assigned, seniority, task complexity, team norms around code review, and whether their manager explicitly encouraged adoption. Any of these variables could independently predict higher output. An observational design cannot separate the tool’s contribution from the characteristics of the people who chose to use it heavily.
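A small simulation makes the confound concrete. The sketch below is illustrative only and is not drawn from the paper: the `skill` confounder, the function names, and every coefficient are assumptions chosen for the example. The tool's true effect on output is set to zero by construction, yet self-selected usage still correlates with output.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0, randomized=False):
    """Correlation between usage and output when the tool has NO causal effect."""
    rng = random.Random(seed)
    usage, output = [], []
    for _ in range(n):
        skill = rng.gauss(0, 1)  # unobserved confounder (assumed)
        if randomized:
            dose = rng.gauss(0, 1)  # usage assigned independently of skill
        else:
            dose = 0.8 * skill + rng.gauss(0, 0.6)  # skilled developers opt in more
        # Output depends on skill only; usage does not appear in this line.
        output.append(skill + rng.gauss(0, 1))
        usage.append(dose)
    return pearson(usage, output)


if __name__ == "__main__":
    print("self-selected usage:", round(simulate(randomized=False), 2))
    print("randomized usage:   ", round(simulate(randomized=True), 2))
```

Under these assumptions the self-selected correlation is clearly positive (about 0.57 in expectation), while the randomized correlation sits near zero. The only thing that differs between the two runs is who chose the dose.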
This is not a novel critique. It is introductory epidemiology applied to developer-tool evaluation. But it keeps getting ignored in procurement conversations because the alternative, a randomized controlled trial, is expensive and administratively difficult inside enterprises that have already committed to seat licenses. The observational study gives buyers a number. The RCT gives them an answer. One is cheaper to produce.
What Microsoft’s 5.5M-session M365 study actually measures
A Microsoft-authored study of M365 Copilot Chat, analyzing approximately 5.5 million anonymized sessions, provides the closest analogue. The study found that writing tasks dominate Copilot Chat usage, with a time-trend shift away from “chat as search” toward content and communication-related work. Usage was “broad but uneven” across occupational groupings: some patterns cut across job categories while others were occupation-specific.
Notably, the study maps how people use the tool, not whether the tool made them more effective at what they were doing. The usage-intensity data is real and well-sampled. The productivity claim requires a causal inference step that the study’s observational design does not support. Microsoft’s own researchers are careful with their language on this point; the procurement-deck version of the findings typically is not.
The study also identifies areas of relative underrepresentation in Copilot usage across labor-market segments, suggesting the next adoption frontier for enterprise AI. This is a useful signal for product strategy. It is not evidence that the underrepresented segments would benefit from the tool, only that they have not adopted it yet.
GitHub’s repo-maintenance agent: velocity gains with a human gate
GitHub Next reports that an AI repository-maintenance agent deployed across 13 open source repositories closed 578 issues, achieving a median 8x increase in issue-closure velocity and 10x in PR-merge velocity. The numbers are striking. The caveat is in the mechanism: the single most important factor was the rate at which human maintainers decided to act on the agent’s output, not the agent’s raw production volume.
This is a cleaner attribution story than most Copilot-productivity claims because the agent operates in a bounded domain (issue triage, patch generation) with an explicit human-in-the-loop checkpoint. The velocity gain is real, but it is a property of the human-agent system, not of the agent alone. Remove the maintainers willing to review and merge, and the 8x figure collapses. The study design makes this visible. Most enterprise Copilot rollouts do not.
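The dependence on maintainer action can be pictured as a simple bottleneck. The toy model below is an assumption for illustration, not GitHub Next's methodology: `system_throughput` and its parameters are hypothetical, and it treats merged work as capped by whichever is smaller, agent output or maintainer review capacity.

```python
def system_throughput(agent_patches_per_week: float,
                      maintainer_reviews_per_week: float,
                      accept_rate: float = 1.0) -> float:
    """Patches merged per week in a toy human-gated pipeline (hypothetical model)."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    # Nothing merges without review, so the slower stage sets the pace.
    reviewed = min(agent_patches_per_week, maintainer_reviews_per_week)
    return reviewed * accept_rate
```

In this sketch, raising agent output from 80 to 800 patches per week changes nothing if maintainers review 10 per week: both calls return 10. With zero review capacity, throughput is zero regardless of how much the agent produces.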
The procurement checklist: what methodology to demand
For teams evaluating Copilot seat-license renewals or new deployments, the relevant question is not “does the study show improvement” but “does the study design allow attribution.” A short checklist:
- Was assignment randomized? If not, the study measures correlation, not causation. Say so in the budget memo.
- Is the control group credible? Matched-pair or difference-in-differences designs can partially compensate for non-randomization, but they require transparent reporting of how controls were selected.
- What is the outcome metric? Lines of code, commit frequency, PR merge time, and issue-closure rate measure different things. A study that optimizes for one may degrade another. Verify the metric aligns with what the team actually values.
- Who funded it? Microsoft’s M365 study is employer-authored. GitHub Next’s agent numbers are vendor-reported. Disclosure is not the same as independence.
- Does the study measure cost? Velocity gains that require 10x more human review time are not free. If review latency or rejection rate is not reported, the productivity claim is incomplete.
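The checklist above can be encoded so that every cited study gets the same scrutiny. This is a minimal sketch: the `StudyDesign` fields and the caveat wording are hypothetical names chosen for illustration, not an established rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyDesign:
    """One flag per checklist question (field names are illustrative)."""
    randomized: bool
    credible_control: bool
    metric_matches_team_goal: bool
    independent_of_vendor: bool
    reports_review_cost: bool


def attribution_caveats(study: StudyDesign) -> list[str]:
    """Return the caveats a budget memo should state for this study."""
    checks = [
        (study.randomized,
         "Assignment was not randomized: correlation, not causation."),
        (study.credible_control,
         "Control selection is missing or not transparently reported."),
        (study.metric_matches_team_goal,
         "Outcome metric does not match what the team values."),
        (study.independent_of_vendor,
         "Study is vendor- or employer-authored."),
        (study.reports_review_cost,
         "Review cost is not reported: the productivity claim is incomplete."),
    ]
    return [message for passed, message in checks if not passed]


if __name__ == "__main__":
    # A hypothetical vendor-authored observational study.
    vendor_study = StudyDesign(
        randomized=False,
        credible_control=False,
        metric_matches_team_goal=True,
        independent_of_vendor=False,
        reports_review_cost=False,
    )
    for caveat in attribution_caveats(vendor_study):
        print("-", caveat)
```

An empty list does not prove the study is right. It only means none of the five structural objections applies.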
The broader pattern: causal inference is hard, even for AI researchers
The difficulty is not unique to Copilot studies. ORCA, a copilot designed to guide end-to-end causal analysis, exists specifically because the AI research community recognizes that causal inference is methodologically complex and frequently misapplied by domain experts. The tool’s stated purpose is bridging the gap between causal methods’ complexity and practitioners’ ability to use them correctly. When the field building the productivity tools also has to build tools to help researchers avoid misapplying the statistics that measure those tools’ impact, the methodological stakes are visible.
MOOSE-Copilot, accepted to ACL 2026, reinforces the point from a different angle: injecting structured human expert signals significantly outperformed purely autonomous baselines in scientific hypothesis discovery. The finding implies that the collaboration pattern between human and AI, not just the volume of AI usage, drives outcomes. A dose-response model that treats “Copilot usage intensity” as a single continuous variable cannot capture this distinction. Heavier use is not the same as better use.
The security dimension observational studies ignore
One variable that never appears in productivity curves: risk. PromptArmor demonstrated that Copilot Cowork can be manipulated via indirect prompt injection to exfiltrate files from M365, with a 5/5 success rate against Claude Opus 4.7. Agentic copilot features with broad permissions amplify both capability and attack surface. Observational productivity studies measure the numerator. They do not measure the denominator of organizational risk that expanded agent permissions introduce.
Remote control for GitHub Copilot sessions reached general availability on github.com and GitHub Mobile by May 2026, extending Copilot’s reach across devices and workflows. Each new surface area adds measurement opportunities and selection-bias confounds in equal measure. The more contexts in which Copilot is used, the harder it becomes to isolate a clean treatment effect from the characteristics of users who adopt early and across multiple devices.
What the dose-response framing costs
The dose-response paper’s contribution, even setting aside its specific findings, is that it makes the observational design explicit in its title. Most Copilot-productivity coverage does not. The headline number gets extracted, stripped of its methodology section, and pasted into a budget justification. The procurement team does not see the selection-bias confound. The CFO does not ask whether the control group was randomized. The seat licenses get renewed.
The fix is not to ignore observational studies. It is to read them for what they are: descriptions of how a self-selected group of users behaved, not measurements of what the tool caused. The dose-response paper, by naming its design in the title, inadvertently raises the bar for every Copilot-productivity claim that follows it. That is the finding worth reporting.
Frequently Asked Questions
What evaluation methods work when full randomization isn’t feasible?
Three quasi-experimental designs offer partial remedies: difference-in-differences (compare outcome trends before and after adoption across teams that adopted at different times), regression discontinuity (exploit natural cutoffs like license-availability dates), and instrumental-variable approaches (use factors that predict adoption but affect output only through adoption, such as regional rollout schedules). The causal-analysis copilot ORCA was built because researchers routinely misapply these methods, which suggests teams should involve a statistician when designing an internal evaluation rather than relying on vendor-provided benchmarks.
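As a worked illustration of the first design, a difference-in-differences estimate subtracts the comparison group's change from the adopting group's change. The sketch below uses made-up weekly merged-PR counts. The function name and the data are assumptions, and the estimate is only meaningful if both groups would have followed parallel trends without the tool.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means (assumes parallel trends)."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


if __name__ == "__main__":
    # Hypothetical merged PRs per week, per team.
    adopters_before = [10, 12, 11]  # mean 11
    adopters_after = [15, 16, 14]   # mean 15
    others_before = [9, 10, 11]     # mean 10
    others_after = [12, 11, 13]     # mean 12

    naive_gap = mean(adopters_after) - mean(others_after)
    did = diff_in_diff(adopters_before, adopters_after,
                       others_before, others_after)
    print("naive post-period gap:", naive_gap)   # 3
    print("difference-in-differences:", did)     # 2
```

With these numbers the naive post-period comparison credits the tool with 3 PRs per week, while the difference-in-differences estimate is 2, because the adopting teams were already ahead by 1 before adoption and both groups improved over time.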
How do selection-bias profiles differ between GitHub Copilot and M365 Copilot Chat?
GitHub Copilot adoption correlates with developer-specific factors: editor preference, language ecosystem, and team code-review norms. M365 Copilot Chat adoption spreads across non-engineering roles with a documented shift over time from search-substitution tasks toward content creation and communication. The selection function draws from different pools in each case, so a dose-response coefficient measured on developers writing code cannot justify seat licenses for procurement analysts writing reports. The two products require separate evaluation designs.
What security costs should be weighed alongside velocity gains in license decisions?
GenAI-driven threat detection with Microsoft Security Copilot (arXiv 2605.20896) is itself an active research topic, which signals that the attack surface introduced by AI assistants is large enough to warrant its own detection tooling. Indirect prompt injection against Copilot Cowork exfiltrated files in every test run PromptArmor conducted. Each new integration point, including the general-availability remote-control feature on GitHub Mobile, adds both measurement surface and attack surface. A procurement model that accounts for security-response overhead alongside velocity gains produces a more defensible ROI estimate than one that ignores the risk denominator.
What would a credible Copilot RCT need to measure beyond code output volume?
A credible design would randomize seat-license allocation across similar teams, pre-register the outcome metric (throughput, quality, review burden, or developer satisfaction), and run long enough for the novelty effect to wear off, typically 8 to 12 weeks based on internal-tool adoption curves. It would also track review-rejection rates and downstream defect rates, not just commit volume. MOOSE-Copilot’s ACL 2026 finding that structured human-AI collaboration outperforms raw usage volume suggests the trial should vary the collaboration protocol alongside tool access, to isolate which mechanism actually drives outcomes.
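The randomization step itself is small. A minimal sketch, assuming teams are identified by unique names and that the seed is recorded in the pre-registration so the allocation can be audited later. `assign_seats` is a hypothetical helper, not part of any Copilot tooling.

```python
import random


def assign_seats(teams, seed):
    """Randomly split teams into treatment (gets seats) and control groups."""
    if len(set(teams)) != len(teams):
        raise ValueError("team names must be unique")
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(teams)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "treatment": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


if __name__ == "__main__":
    teams = ["payments", "search", "infra", "mobile", "data", "web"]
    allocation = assign_seats(teams, seed=2026)
    print("treatment:", allocation["treatment"])
    print("control:  ", allocation["control"])
```

Fixing the seed before any outcome data exists is what makes the split defensible: nobody, including the manager who most wants the licenses, chose which teams got the tool. A real trial would likely also stratify by team size or baseline throughput, which this sketch omits.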