
AI code review tools cannot explain their reasoning—and developers are making consequential decisions based on outputs they neither understand nor can verify. Adoption of AI coding tools has reached 84% of developers as of 2025, according to Stack Overflow’s annual survey of 49,000+ respondents.1 Trust in those tools has simultaneously declined: only 33% of developers trust AI accuracy, and just 3% express high trust.1 The tools are everywhere. The confidence to act on them is not.

The Adoption-Trust Paradox

The statistical picture is unusual. Technology typically builds trust as users gain familiarity. AI coding tools are doing the opposite.

Stack Overflow’s 2025 Developer Survey found that 46% of developers actively distrust AI tool accuracy—up from 31% in 2024.1 Senior developers are the most skeptical: those with 10 or more years of experience show a “highly trust” rate of only 2.6% and a “highly distrust” rate of 20%—the highest of any experience cohort.1

Sonar’s January 2026 State of Code survey of 1,100 developers surfaced the sharpest gap: 96% of developers do not fully trust that AI-generated code is functionally correct, yet only 48% say they always verify AI-assisted code before committing it.2 Roughly half of developers are shipping code they’ve declared themselves unconfident about.

The Qodo State of AI Code Quality report (609 developers surveyed, 2025) documents a specific inversion that should alarm engineering teams: junior developers with under two years of experience report being confident shipping AI code without review at a rate of 60.2%. Senior developers with 10+ years of experience: 25.8%.3 The people with the least capacity to catch AI mistakes are the most likely to let them through.

What AI Code Review Actually Does

Understanding the trust problem requires understanding the architecture of these tools—and where that architecture breaks down.

The dominant AI code review systems operate in one of two modes. The first is diff-aware generation: tools like GitHub Copilot Code Review, generally available since late 2024, analyze the changed lines in a pull request and generate natural-language comments about what they detect. The second is static analysis integration: tools like CodeRabbit combine abstract syntax tree (AST) evaluation and traditional SAST scanning with LLM-generated feedback; Qodo integrates symbolic execution alongside generative output.

What none of these tools do is reason about the full codebase. GitHub Copilot Code Review sees only the diff—not the system the diff is part of. CodeRabbit stays within PR boundaries. PRs exceeding 1,500 lines require manual chunking due to token limits. A change that adds a required field to a shared request schema can silently break dozens of downstream services; a diff-aware reviewer has no way to know those services exist.
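The failure mode is easy to reproduce. A minimal sketch, with hypothetical module and field names collapsed into one file for illustration: the pull request adds one required field to a shared schema, and the break happens in a caller the diff never shows.

```python
from dataclasses import dataclass

# --- schema.py: the ONLY file in the pull request ---
@dataclass
class CreateUserRequest:
    username: str
    email: str
    tenant_id: str  # the entire diff: one new required field

# --- billing_service.py: downstream caller, NOT in the pull request ---
def make_signup_request():
    # Still builds the request the old way; this now raises a TypeError at
    # runtime, but a diff-aware reviewer never sees this call site.
    return CreateUserRequest(username="ada", email="ada@example.com")
```

A reviewer with whole-repository reach (human or static analyzer with a call graph) catches this in seconds; a reviewer bounded to the changed lines structurally cannot.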

This structural limit is not a product shortcoming that future iterations will solve easily. It is an architectural consequence of how LLMs work: token-level pattern matching over a bounded context window, without ground-truth access to runtime behavior, dependency graphs, or cross-module data flow.

The Security Test That Should Alarm Everyone

In September 2025, researchers Amena Amro and Manar H. Alalfi at Toronto Metropolitan University published a direct test of GitHub Copilot Code Review’s security capabilities against intentionally vulnerable codebases (arXiv 2509.13650).4

The results are specific and damning:

  • WebGoat (a Java application designed to demonstrate OWASP Top 10 vulnerabilities): Copilot reviewed 1,011 of 1,019 files. It generated one comment—about a typographical error.
  • SARD XSS test suite: Copilot reviewed 6 of 9 files. Zero security comments.
  • SARD SQL injection cases: Copilot reviewed 4 of 9 files. It flagged spelling mistakes.
  • Wireshark test suite: Copilot reviewed 878 of 898 files. Zero comments of any kind.

The paper’s conclusion: Copilot’s review model “is not security-aware in any practical sense.” The researchers attribute this to its reliance on shallow, token-based reasoning that cannot track data flow across functions or files.4

This is not a test of an AI code generation tool. This is a test of the AI code review tool—the component sold specifically as a safety net. Against a codebase engineered to contain known critical vulnerabilities, it detected none.

Explainability: The Reasoning Problem

Beyond what AI code review detects or misses is the deeper problem of what it communicates when it does flag something.

When a human reviewer comments “this SQL query is vulnerable to injection because user input is concatenated without parameterization,” a developer receives a reasoning chain they can evaluate, extend, and learn from. When an AI tool flags the same issue—or flags a false positive—it produces a verdict without an auditable chain of logic.
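The reasoning chain in that human comment maps directly onto code. A minimal sqlite3 sketch (table and function names are illustrative) showing the vulnerable pattern and its parameterized fix side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

def find_user_unsafe(name: str):
    # Vulnerable: user input is concatenated into the query string,
    # so the input can rewrite the query's logic.
    return conn.execute(
        "SELECT * FROM users WHERE name = '" + name + "'"
    ).fetchall()

def find_user_safe(name: str):
    # Parameterized: the driver binds the value; input is treated as
    # data and never parsed as SQL.
    return conn.execute(
        "SELECT * FROM users WHERE name = ?", (name,)
    ).fetchall()

payload = "' OR '1'='1"
# The unsafe version matches every row; the safe version matches none.
```

The explanation is what lets a developer verify the fix is correct, not just that a flag went away.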

This is not unique to code review. Dario Amodei, CEO of Anthropic, wrote in April 2025: “People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.”5

Applied to code review, the explainability gap creates two failure modes. False positives without explanation get dismissed—developers learn to ignore AI flags without understanding why those flags might have been correct. False negatives without explanation provide false assurance—the absence of a flag is interpreted as a clean bill of health, even when the system had no ability to detect the relevant vulnerability class.

A November 2025 academic review published in the Academic Journal of Science and Technology explicitly identifies “insufficient explainability of the results of automated performance analysis” as an unresolved challenge in the field.6

The False Confidence Effect

The explainability gap has a documented downstream consequence: developers become more confident their code is secure when AI helps write or review it—even when it is not.

Stanford’s 2023 user study (arXiv 2211.03622, Perry et al., published at ACM CCS 2023) remains the foundational empirical evidence.7 Researchers tested 47 participants—undergraduates, graduate students, and industry professionals—using an AI coding assistant based on OpenAI’s Codex model. Participants with AI assistance wrote significantly less secure code than those without it, and were significantly more likely to believe they had written secure code. The false confidence effect was not limited to inexperienced users; graduate students and professionals showed the same pattern.

The mechanism is plausible: AI assistance reduces visible surface errors—syntax problems, obvious bugs—which signals to the developer that the code is clean. The deeper, harder-to-see vulnerabilities remain, but the developer’s confidence has been raised by the absence of the obvious ones.

Veracode’s 2025 GenAI Code Security Report—analyzing 80 coding tasks across 100+ LLMs and four programming languages—found that 45% of AI-generated code contained security flaws.8 For XSS vulnerabilities (CWE-80), models failed to produce secure code 86% of the time. For log injection (CWE-117), 88% failure rate. Security performance remained flat regardless of model size or training sophistication—larger, newer models were no better at writing secure code than smaller ones.8
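Log injection (CWE-117) is worth a concrete look, since it was among the classes models failed most often. A minimal sketch (logger name and payload are illustrative): an unescaped newline in user input forges a second, fake log entry, and encoding the control characters prevents it.

```python
import io
import logging

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log = logging.getLogger("auth-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

user = "mallory\nINFO login succeeded for user=admin"  # attacker-controlled

# Vulnerable: the raw newline splits the message into two log lines,
# the second of which is a forged "login succeeded" entry.
log.info("login failed for user=%s", user)

# Mitigated: encode CR/LF so the input stays on one line.
safe = user.replace("\r", "\\r").replace("\n", "\\n")
log.info("login failed for user=%s", safe)

lines = buf.getvalue().splitlines()
```

Spotting this requires tracking where `user` came from, which is exactly the data-flow reasoning token-level review lacks.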

The Scale Amplifier

The trust problem compounds at enterprise scale. Apiiro’s September 2025 analysis of Fortune 50 repositories found that AI-assisted developers produced 3 to 4 times more commits than their non-AI peers.9 They also shipped 10 times more security vulnerabilities. Security findings across analyzed repositories jumped 10x in six months—over 10,000 new security issues per month by June 2025. Privilege escalation vulnerabilities specifically surged 322%.9

The volume increase overruns review capacity. Apiiro also found that PR volume fell 30% over the same period while individual PR scope grew—each review now carries more code, more surface area, and more potential issues, while reviewers have the same amount of time.

Benchmarking AI Code Review Tools (Early 2026)

| Tool | Primary Approach | Security Accuracy | False Positive Rate | Key Limitation |
| --- | --- | --- | --- | --- |
| CodeQL | Semantic analysis | ~88% | ~5% | Requires manual rule configuration10 |
| Snyk Code | Semantic + AI | ~85% | ~8% | Limited to diff context10 |
| Semgrep | Pattern matching + AI | ~82% | ~12% | Rule-dependent; misses novel patterns10 |
| SonarQube | SAST + AI augmentation | ~97% (self-reported) | 3.2% (137M issues) | Self-reported data11 |
| CodeRabbit | AST + SAST + LLM | 46% (runtime bugs) | Not published | Bounded to PR context12 |
| GitHub Copilot Review | LLM diff analysis | Near zero for OWASP | Not published | Token-based; no data flow4 |
| Qodo | Symbolic + LLM | 71.2% (SWE-bench) | Not published | Context window limits3 |

Accuracy figures reflect security vulnerability detection on independent benchmarks or researcher testing, not vendor marketing claims. Self-reported figures should be weighted accordingly.

What the Explainability Gap Means for Practitioners

The practical implication is not “stop using AI code review.” The tools provide genuine signal on style issues, documentation gaps, common anti-patterns, and low-complexity bugs. The problem is miscalibration—using tools that are effective for pattern-matching tasks as if they were effective for security reasoning.

The Sonar survey found developers perceive AI tools as strong “explainers” and “prototypers” but significantly weaker at modification and optimization of existing, mission-critical code.2 That perception matches the empirical data. AI tools generate confidently and explain poorly; they review shallowly and flag selectively.

For engineering teams, the calibration questions are concrete:

  • Is human review covering data flow and cross-module dependencies that diff-aware AI tools structurally cannot see?
  • Are AI review verdicts being treated as conclusions or as additional signals to be weighed alongside other evidence?
  • Is there explicit training that teaches junior developers why AI code review has specific blind spots—and that the absence of an AI flag is not a clean bill of health?
  • Are security-focused tools (CodeQL, Semgrep, Snyk Code) running in addition to, not instead of, AI review?
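One way to make that calibration explicit is to encode it in review routing. A hypothetical sketch—the path prefixes, layer names, and thresholds are invented for illustration—in which AI review is always one signal and never the sole gate:

```python
# Illustrative review-routing policy: AI review runs on everything, but
# SAST and human review are added based on what a change touches, not on
# whether the AI flagged it. The absence of an AI flag is not evidence.
SECURITY_SENSITIVE = ("auth/", "crypto/", "payments/", "migrations/")

def required_review(changed_paths: list[str], ai_flags: int) -> set[str]:
    """Return the set of review layers a change must pass."""
    layers = {"ai_review"}
    if any(p.startswith(SECURITY_SENSITIVE) for p in changed_paths):
        # Security-relevant paths always get SAST plus a human, regardless
        # of what the AI reviewer did or did not say.
        layers |= {"sast", "human_security_review"}
    if ai_flags > 0:
        layers.add("human_triage")  # flags are signals to weigh, not verdicts
    return layers
```

The design choice is that the human-review requirement keys off the blast radius of the change, which is knowable, rather than the AI verdict, which the research above shows is unreliable for security.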

The 25% of developers in the Qodo survey who estimate that one in five AI suggestions contain factual errors or misleading code have calibrated their trust appropriately.3 The 59% of developers who use AI code they do not fully understand have not.14

Trust should track evidence. The evidence, as of early 2026, does not support trusting AI code review for security-critical paths without layered human verification. The tools are useful. They are not reliable enough to be the last line of defense.


Frequently Asked Questions

Q: Can AI code review tools detect OWASP Top 10 vulnerabilities? A: Inconsistently, and often poorly. A September 2025 study (arXiv 2509.13650) tested GitHub Copilot Code Review against WebGoat—a codebase engineered to contain OWASP Top 10 vulnerabilities—and found it detected zero critical security issues across 1,011 reviewed files. Dedicated SAST tools like CodeQL perform significantly better on known vulnerability classes.

Q: Why do developers keep trusting AI code review if the accuracy is low? A: The Stanford ACM CCS 2023 study documents the core mechanism: AI assistance reduces visible surface errors, which raises developer confidence even when deeper vulnerabilities remain. Developers perceive cleaner-looking code as more correct code. The absence of AI flags is interpreted as a clean bill of health rather than as the absence of detection capability.

Q: How should AI code review fit into a mature code review process? A: As one signal among several, not as a gate. Use AI review for style, documentation, and common anti-patterns—areas where it is demonstrably useful. Use dedicated SAST tools for security vulnerability scanning. Reserve human review for architecture decisions, data flow analysis, and any change with security implications. Do not allow AI review verdicts to substitute for human judgment on security-critical code paths.

Q: Do newer or larger AI models improve code review security accuracy? A: Not meaningfully, based on available research. Veracode’s 2025 analysis of 100+ LLMs across four programming languages found security performance remained flat regardless of model size or training sophistication. The architectural limitation—bounded context window, token-level pattern matching without data flow tracking—is not overcome by scaling the model.

Q: What is “verification debt” and how serious is it? A: Verification debt is the accumulating burden of reviewing AI-generated code you didn’t author and cannot ask questions of. AWS CTO Werner Vogels coined the term. The Sonar 2026 survey quantifies it: 38% of developers say reviewing AI code requires more effort than reviewing human code, and 42% of committed code now contains significant AI contribution. As AI-assisted development increases, the time cost of proper verification scales with it unless teams build explicit processes to manage it.


Footnotes

  1. Stack Overflow. “2025 Developer Survey.” December 2025. https://survey.stackoverflow.co/2025/ai

  2. SonarSource. “Sonar Data Reveals Critical Verification Gap in AI Coding.” Press release, January 2026. https://www.sonarsource.com/company/press-releases/sonar-data-reveals-critical-verification-gap-in-ai-coding/ ; The Register. “Devs Doubt AI Code, Don’t Check It.” January 9, 2026. https://www.theregister.com/2026/01/09/devs_ai_code/

  3. Qodo. “State of AI Code Quality 2025.” 2025. https://www.qodo.ai/reports/state-of-ai-code-quality/

  4. Amro, Amena and Manar H. Alalfi. “GitHub’s Copilot Code Review: Can AI Spot Security Flaws Before You Commit?” arXiv 2509.13650. September 17, 2025. https://arxiv.org/abs/2509.13650

  5. Amodei, Dario. “The Urgency of Interpretability.” Anthropic, April 2025. Referenced in Sahota, Neil. “The AI Black Box Problem.” neilsahota.com, 2025. https://www.neilsahota.com/ai-black-box-problem-can-we-break-the-code-on-ais-logic/

  6. “A Review of Research on AI-Assisted Code Generation and AI-Driven Code Review.” Academic Journal of Science and Technology, November 2025. https://drpress.org/ojs/index.php/ajst/article/view/32600

  7. Perry, Neil, Megha Srivastava, Deepak Kumar, and Dan Boneh. “Do Users Write More Insecure Code with AI Assistants?” ACM CCS 2023. arXiv 2211.03622. https://arxiv.org/abs/2211.03622

  8. Veracode. “2025 GenAI Code Security Report.” July 2025. https://www.veracode.com/blog/genai-code-security-report/ ; BusinessWire. “AI-Generated Code Poses Major Security Risks in Nearly Half of All Development Tasks.” July 30, 2025. https://www.businesswire.com/news/home/20250730694951/en/AI-Generated-Code-Poses-Major-Security-Risks-in-Nearly-Half-of-All-Development-Tasks-Veracode-Research-Reveals

  9. Apiiro. “4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks.” September 2025. https://apiiro.com/blog/4x-velocity-10x-vulnerabilities-ai-coding-assistants-are-shipping-more-risks/

  10. sanj.dev. “2025 AI Code Security Benchmark: Snyk vs Semgrep vs CodeQL.” 2025. https://sanj.dev/post/ai-code-security-tools-comparison

  11. SonarSource. “How SonarQube Minimizes False Positives Below 5%.” 2025. https://www.sonarsource.com/blog/how-sonarqube-minimizes-false-positives/

  12. AIMultiple Research. “AI Code Review Tools Benchmark.” 2025. https://research.aimultiple.com/ai-code-review-tools/

  13. “A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era.” arXiv 2602.13377. February 2026. https://arxiv.org/html/2602.13377v1

  14. Clutch. “Software Developers Use AI-Generated Code They Don’t Understand.” June 2025. https://clutch.co/resources/devs-use-ai-generated-code-they-dont-understand
