Can an AI Agent Catch Cryptographic Misuse Before It Ships? Chai Tests the Claim

Can an AI agent catch cryptographic misuse before it ships? On a narrow class of bug, the Chai preprint says yes. Posted June 25, 2026 by Corban Villa et al., it claims an LLM-driven agent discovered a previously unknown critical vulnerability in an SSL library running on billions of devices, plus over 100 cryptographic-misuse bugs across X.509, JWT, and SAML code (abstract). The headline is three days old and author-graded. The structural claim underneath it is the part worth taking seriously: that catching crypto misuse is a recognition problem, not a pattern-matching problem, and that reframing exposes the real gap between static analysis and agent reasoning.

What does Chai actually do?

Chai catalogs cryptographic flaws at the library level, then propagates each flaw across a dependency graph of downstream consumers instead of auditing one application at a time. That inversion is the whole pitch. The dominant agentic-vuln-discovery pattern scans a single codebase for many bug types and stops there. Chai starts from the library, records where a primitive is used incorrectly, and follows the dependency edges to find the applications that inherited the misuse, which the authors describe as “compounding efficiency gains” over re-auditing each consumer independently.
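The propagation step can be pictured as a walk over reverse dependency edges. The sketch below is a minimal illustration, not Chai's implementation: the package names, the `DEPENDS_ON` map, and both helper functions are hypothetical, and the abstract does not say how the authors build or traverse their graph.

```python
from collections import deque

# Hypothetical edges: each consumer maps to the packages it depends on directly.
DEPENDS_ON = {
    "auth-service": ["jwt-lib"],
    "api-gateway": ["auth-service"],
    "billing": ["api-gateway", "xml-lib"],
    "report-tool": ["xml-lib"],
}


def reverse_edges(depends_on):
    """Invert consumer -> dependency edges into dependency -> consumers."""
    dependents = {}
    for consumer, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(consumer)
    return dependents


def affected_consumers(flawed_library, depends_on):
    """Breadth-first walk over reverse edges, starting from one flawed library."""
    dependents = reverse_edges(depends_on)
    seen = set()
    queue = deque([flawed_library])
    while queue:
        node = queue.popleft()
        for consumer in dependents.get(node, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


print(sorted(affected_consumers("jwt-lib", DEPENDS_ON)))
# prints ['api-gateway', 'auth-service', 'billing']
```

One library-level record yields three downstream candidates here. They are candidates only: a consumer that never reaches the flawed code path depends on the library without inheriting the bug.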

The detection mechanism is a reworking of differential testing. Classical differential testing compares two or more library implementations of the same spec against shared inputs and treats divergences as bug candidates. Chai uses an LLM agent for two jobs: improving precision on real security issues inside libraries, and repurposing the discrepancies differential testing normally discards as noise, treating them instead as leads for tangible vulnerabilities in applications downstream of the library. The agent decides which discrepancies are security-relevant in protocol context, then chases them through the graph.
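The classical half of that pipeline is small enough to sketch. The two hostname matchers below are toy stand-ins invented for this illustration, not code from any real X.509 library, and the agent's triage step is not modeled at all; the sketch only shows how shared inputs turn into a list of discrepancies.

```python
def match_strict(pattern, host):
    """Toy matcher: '*' stands for exactly one left-most label."""
    p_labels = pattern.lower().split(".")
    h_labels = host.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def match_lenient(pattern, host):
    """Toy matcher: '*' is allowed to swallow any number of labels."""
    pattern, host = pattern.lower(), host.lower()
    if pattern.startswith("*."):
        return host.endswith(pattern[1:])
    return pattern == host


def differential(implementations, inputs):
    """Run every implementation on every input and keep the disagreements."""
    discrepancies = []
    for args in inputs:
        results = {name: fn(*args) for name, fn in implementations.items()}
        if len(set(results.values())) > 1:
            discrepancies.append((args, results))
    return discrepancies


IMPLEMENTATIONS = {"strict": match_strict, "lenient": match_lenient}
INPUTS = [
    ("*.example.com", "a.example.com"),
    ("*.example.com", "a.b.example.com"),
    ("*.example.com", "example.com"),
    ("example.com", "EXAMPLE.com"),
]

for args, results in differential(IMPLEMENTATIONS, INPUTS):
    print(args, results)
# prints ('*.example.com', 'a.b.example.com') {'strict': False, 'lenient': True}
```

The harness reports that the two matchers disagree, not which one is wrong or whether the disagreement matters. That judgment is the step the article describes Chai handing to the agent.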

The three protocol surfaces evaluated are X.509, JWT, and SAML. These are not arbitrary: all three are cryptographic protocols where a primitive can be syntactically correct, type-check clean, and still be wrong because of how it is composed into a handshake, a token validation path, or an assertion flow. That composition is exactly what a presence-based detector cannot see.
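A well-known JWT failure mode shows what "correct primitive, wrong composition" looks like in a token validation path. This is a textbook illustration built on the Python standard library, not a finding attributed to Chai; every function name and the demo key are hypothetical. Both verifiers call the same correct HMAC-SHA256 check, and only one of them lets the attacker-controlled header decide whether that check runs.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as compact JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def encode_part(obj: dict) -> str:
    return b64url(json.dumps(obj).encode())


def sign_hs256(payload: dict, key: bytes) -> str:
    signing_input = encode_part({"alg": "HS256", "typ": "JWT"}) + "." + encode_part(payload)
    sig = hmac.new(key, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)


def hs256_matches(head: str, body: str, sig: str, key: bytes) -> bool:
    """The primitive itself: a correct, constant-time HMAC-SHA256 comparison."""
    expected = hmac.new(key, f"{head}.{body}".encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, b64url_decode(sig))


def verify_trusting_header(token: str, key: bytes) -> bool:
    """Flawed composition: the token's own header chooses the algorithm."""
    head, body, sig = token.split(".")
    if json.loads(b64url_decode(head)).get("alg") == "none":
        return True  # the HMAC check is never reached
    return hs256_matches(head, body, sig, key)


def verify_pinned(token: str, key: bytes) -> bool:
    """Safer composition: the verifier pins the one algorithm it accepts."""
    head, body, sig = token.split(".")
    if json.loads(b64url_decode(head)).get("alg") != "HS256":
        return False
    return hs256_matches(head, body, sig, key)


KEY = b"demo-key-not-for-production"  # hypothetical key, for illustration only
genuine = sign_hs256({"sub": "alice"}, KEY)
forged = encode_part({"alg": "none"}) + "." + encode_part({"sub": "admin"}) + "."

print(verify_trusting_header(forged, KEY), verify_pinned(forged, KEY))
# prints True False
```

A detector that looks for the presence of a weak primitive has nothing to flag in either verifier: the only hash in play is SHA-256 and the comparison is constant-time. The defect is in the control flow around the primitive.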

Why does cryptographic misuse break standard SAST tools?

Crypto misuse sits in a gap that API-pattern detection was never built to close. A SAST tool can flag the presence of MD5 or a hardcoded IV, and the better ones do. But misuse is frequently a question of how a primitive is composed in protocol context, not whether the primitive appears at all. An RSA operation with no padding is the textbook example: the call site is legal, the algorithm is correct, and the result is exploitable. Recognizing that requires reasoning about what the surrounding handshake is supposed to guarantee, which is a different task from matching a tainted-data flow to a sink.
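The unpadded-RSA case can be worked through with toy numbers. The sketch below uses a deliberately tiny key (p = 61, q = 53) so the arithmetic is easy to follow; the parameters are illustrative and far too small for real use. It shows the multiplicative property of textbook RSA: anyone holding a ciphertext can multiply it by the encryption of a chosen factor, and the result decrypts to the original message times that factor.

```python
# Toy RSA key; real keys use primes that are hundreds of digits long.
p, q = 61, 53
n = p * q                   # 3233
phi = (p - 1) * (q - 1)     # 3120
e = 17
d = pow(e, -1, phi)         # modular inverse (Python 3.8+), here 2753


def encrypt(m: int) -> int:
    """Textbook RSA: no padding, just modular exponentiation."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    return pow(c, d, n)


message = 42
ciphertext = encrypt(message)

# The attacker never sees `message` or `d`, only the public key and ciphertext.
factor = 2
tampered = (ciphertext * encrypt(factor)) % n

print(decrypt(ciphertext), decrypt(tampered))
# prints 42 84
```

Every call here is legal and the algorithm is computed correctly, which is why the call site gives a pattern matcher so little to object to. Padding schemes such as OAEP exist to break exactly this algebraic structure.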

The two dominant SAST engines occupy distinct niches, and neither was architected for protocol-context reasoning. Semgrep is pattern-based, runs YAML rules, and completes scans in 10 to 30 seconds; CodeQL builds a semantic database for deep taint tracking but takes 10 to 60 minutes or more, per the 2026 Semgrep-versus-CodeQL comparison. Neither design prioritizes the step Chai is built around: deciding that a discrepancy between two library implementations is a security defect rather than a spec-permissible divergence.

The honest version of this argument does not strawman SAST as “just grepping for dangerous APIs.” That is false. Existing engines ship crypto-specific taint queries and catch a real class of bug that Chai does not claim to replace. The fair comparison is between two different capabilities: API-pattern detection, which is mature and cheap, and protocol-context reasoning, which is what Chai is trying to demonstrate an agent can do. The question is not whether SAST finds crypto bugs today. It does. The question is whether an agent can surface the bugs that require understanding the protocol semantics, and whether that capability lands inside SAST suites before it lands as a competing product.

What did Chai find, and what is actually verified?

The abstract-level results are these, verbatim from arXiv:2606.26933: Chai discovered a previously unknown critical vulnerability in an SSL library that powers billions of devices, security bugs in one library behind a major web browser, and bugs in libraries used by major Linux distributions. Across X.509, JWT, and SAML libraries and their downstream consumers, the techniques surfaced over 100 vulnerabilities total (abstract).

That is where verification stops. The paper is a preprint under cs.CR, not yet peer-reviewed, submitted June 25, 2026. The abstract does not name the SSL library, the browser, or the Linux distributions. It does not publish a CVSS score for the critical finding. It does not publish a head-to-head precision or recall number against CodeQL or Semgrep on the same corpora. Any such figure would have to come from the full PDF; the abstract alone does not support one. The over-100 total is author-graded, not independently confirmed as of 2026-06-28.

What is structurally checkable is the dependency-graph argument. Propagating a library-level flaw to downstream consumers is a real efficiency claim independent of which specific bug was found. If one JWT library validates signatures incorrectly, every service that transitively depends on it inherits the defect, and a graph traversal catches all of them at the cost of one library-level discovery. That is the part of Chai that survives even if the critical-finding headline softens under review.

How good are agents at security reasoning today?

This is the counterweight the Chai headline needs. Agent reasoning is itself a liability surface: the planner that confirms a crypto flaw inherits every weakness of the underlying model.

Agent Security Bench (ASB), accepted at ICLR 2025, covered 10 scenarios and 13 LLM backbones; per its abstract, the highest average attack success rate against LLM-based agents was 84.30%. That figure is about the agent layer being exploited, not about agents failing to find bugs, but it establishes that the agent itself is a soft target. A system that depends on an LLM planner to distinguish a real vulnerability from a benign divergence inherits whatever susceptibility that planner has to malformed inputs, adversarial code in the corpus, or simply confident hallucination on a borderline case.

Chai’s design limits that exposure by changing the unit of work. Instead of asking the agent to find a needle in one application, it asks the agent to confirm a library-level discrepancy and then propagate the confirmed flaw mechanically across the graph. The graph traversal does not require agent reasoning; only the library-level confirmation does. That division of labor is the plausible explanation for why Chai can claim over 100 vulnerabilities from a tractable amount of agent work (abstract): the agent earns its keep on the hard recognition step, and the dependency graph does the cheap, reliable propagation. Whether the agent’s library-level confirmation is reliable enough to anchor that propagation is not something the abstract proves with independent numbers.

What does this mean for SAST vendors and appsec buyers?

The actionable question is narrow. If an agent can reliably confirm that a cryptographic primitive is wrong in protocol context, that capability either lands inside existing SAST suites as a new query class or it gets priced away from them as a separate product. There is no stable third option, because crypto misuse is the category SAST vendors currently own, and a planner-style verifier that finds bugs their engines miss is a direct competitive threat to that ownership.

The case for absorbing the capability is stronger than it looks at first. A 2025 study of 1,080 LLM-generated code samples found CodeQL and Semgrep agreed with human reviewers in aggregate but disagreed badly per-sample: only 61% of CodeQL reports and 65% of Semgrep reports matched the human-validated ground truth (study). Per-sample disagreement at that scale is not a rounding error. It means a SAST buyer cannot treat either engine as the sole evaluator of whether a reported crypto finding is real, and it means the same engines are not well positioned to act as the ground-truth oracle for an agent’s output either. If the agent needs human confirmation anyway, the SAST vendor that bundles a planner-style verifier with its existing recall gets the upside; the standalone agent product has to clear a higher bar to justify adoption.

The case against absorption is the cost structure. CodeQL already takes 10 to 60 minutes or more per build for semantic analysis; layering an LLM agent’s per-primitive protocol reasoning on top brings a different compute profile and different unit economics. A SAST vendor that bolts agent verification onto every scan pays per-inference for a capability whose real-world precision on protocol-context bugs is not yet measured. The likely settlement is tiering: pattern-based crypto detection stays in the always-on scan, and agent-verified protocol-context reasoning runs as a higher-cost tier or a targeted pass on the flagged findings rather than the whole corpus.
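The tiering idea can be sketched as a two-stage pipeline. Everything below is hypothetical: the two regex rules stand in for a real rule pack, `stub_verifier` stands in for an agent call, and no vendor is known to ship this design. The point is only the shape of the cost: the expensive callable runs once per flagged finding, not once per file.

```python
import re

# Hypothetical always-on rules; a real rule pack would be far larger.
CHEAP_PATTERNS = {
    "weak-hash": re.compile(r"\bmd5\b", re.IGNORECASE),
    "no-padding": re.compile(r"NoPadding"),
}


def cheap_scan(files):
    """Tier 1: run every pattern over every line of every file."""
    findings = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for rule, pattern in CHEAP_PATTERNS.items():
                if pattern.search(line):
                    findings.append({"path": path, "line": lineno, "rule": rule})
    return findings


def tiered_scan(files, expensive_verifier):
    """Tier 2: call the costly verifier only on what tier 1 flagged."""
    flagged = cheap_scan(files)
    confirmed = [f for f in flagged if expensive_verifier(f, files[f["path"]])]
    return flagged, confirmed


# Stand-in for an agent call; it records how often it is invoked.
verifier_calls = []


def stub_verifier(finding, source_text):
    verifier_calls.append(finding["path"])
    return finding["rule"] == "no-padding"


FILES = {
    "a.py": "import hashlib\nh = hashlib.md5(data)\n",
    "b.java": 'Cipher.getInstance("RSA/ECB/NoPadding");\n',
    "c.py": "print('hello')\n",
}

flagged, confirmed = tiered_scan(FILES, stub_verifier)
print(len(FILES), len(flagged), len(verifier_calls), len(confirmed))
# prints 3 2 2 1
```

The trade-off is visible in the structure: a targeted pass inherits the recall of the cheap tier, so a protocol-context bug that no pattern flags never reaches the verifier at all.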

For an appsec buyer evaluating Chai-class systems today, the discipline is the same one the 2025 study imposes: treat every quantitative claim as needing its own source anchor, treat every agent finding as needing human confirmation, and treat the dependency-graph propagation as the durable idea regardless of how the headline count holds up under review. The preprint is a strong structural argument wearing a strong-sounding result. The argument is what will outlive the next round of corrections. The result is what will be checked first.

Frequently Asked Questions

How does Chai’s finding rate compare to agentic results on other security benchmarks?

The closest published reference point is CyberChainBench, a smart-contract benchmark, where the best agent-model configuration scored 37.5% on detection, 43.7% on exploitation, and 23.4% on patching. Those numbers anchor where agentic security reasoning currently tops out on a comparable vuln-discovery task, and they explain why Chai confines the agent to library-level confirmation and offloads consumer-level propagation to a deterministic graph traversal.

What is the blast radius if the agent misclassifies a library-level discrepancy?

A false positive at the library level does not stay local. Chai propagates each confirmed flaw across the dependency graph, so a single misclassified divergence becomes a finding in every downstream consumer that transitively imports the library. The abstract publishes no per-library precision figure, which means the spread of one agent error is unmeasured in the preprint.

Why are X.509, JWT, and SAML the only protocols in the evaluation?

All three are protocols with multiple independent implementations against a shared specification, which is the precondition for differential testing to produce a discrepancy in the first place. A protocol with a single canonical implementation gives Chai nothing to compare against, and a protocol where implementation divergences are spec-permissible rather than defective would generate leads the agent has no grounds to confirm.

What would force SAST vendors to absorb this capability rather than price it as a separate product?

The forcing function is whether protocol-context reasoning can be expressed as a query. If the composition checks Chai’s agent performs collapse into a CodeQL or Semgrep rule pack, the capability folds into the existing scan and the standalone product loses its wedge. If the reasoning stays irreducibly per-primitive and requires an LLM call per finding, the standalone product keeps its margin and SAST vendors tier it rather than own it outright.