PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests

SWE-Bench measures whether a coding agent can resolve a known failure: given a bug report and a repository, it must produce a patch that makes a predetermined set of tests pass. PBT-Bench asks a harder question: given only source code, can the agent infer what should hold true across all possible inputs, then write tests that surface violations? According to the PBT-Bench paper¹, the answer is “partially,” and the gap is wide enough to matter for anyone trusting agent-generated test suites.

What PBT-Bench tests that SWE-Bench doesn’t

Property-based testing (PBT) flips the example-based model. Instead of asserting f(2) == 4, the agent asserts a property: for all valid inputs, the output satisfies some invariant. The testing framework generates hundreds or thousands of random inputs to find counterexamples. This requires reasoning about semantics, not surface I/O patterns.

SWE-Bench rewards fixing known failures. PBT-Bench rewards discovering unknown ones. The distinction matters because vendors increasingly quote SWE-Bench pass rates as evidence that their agents can “write tests” or “autonomously verify code.” They can do the former. Whether they can do the latter is what PBT-Bench¹ actually tests.

How the benchmark is structured

PBT-Bench contains 100 curated problems drawn from 40 real Python libraries, with 365 injected semantic bugs averaging 3.65 per problem¹. Bugs are stratified into three difficulty levels: L1 covers single-constraint boundary bugs, L2 introduces multi-constraint interactions, and L3 requires detecting stateful cross-function protocol violations.
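For the L3 category, Hypothesis offers rule-based state machines, which generate random sequences of method calls and check invariants between steps. The sketch below is an assumption about what such a test can look like, not a test from the benchmark: the `Stack` class is a hypothetical toy stand-in for library code, and the source does not say which Hypothesis APIs the evaluated agents used.

```python
from hypothesis import settings, strategies as st
from hypothesis.stateful import (
    RuleBasedStateMachine,
    invariant,
    precondition,
    rule,
    run_state_machine_as_test,
)


class Stack:
    """Hypothetical class under test; a toy stand-in for real library code."""

    def __init__(self):
        self._items = []

    def push(self, value):
        self._items.append(value)

    def pop(self):
        return self._items.pop()

    def __len__(self):
        return len(self._items)


class StackMachine(RuleBasedStateMachine):
    """Drives random push/pop sequences and compares against a plain list."""

    def __init__(self):
        super().__init__()
        self.real = Stack()
        self.model = []  # trusted reference model

    @rule(value=st.integers())
    def push(self, value):
        self.real.push(value)
        self.model.append(value)

    @precondition(lambda self: len(self.model) > 0)
    @rule()
    def pop(self):
        # Protocol property: pop returns the most recently pushed value.
        assert self.real.pop() == self.model.pop()

    @invariant()
    def sizes_agree(self):
        # Checked after every step, whichever rule ran.
        assert len(self.real) == len(self.model)


def run_protocol_check():
    run_state_machine_as_test(
        StackMachine,
        settings=settings(max_examples=25, stateful_step_count=20, deadline=None),
    )
```

With this correct toy implementation the run passes. A version whose `pop` returned the wrong element, or whose length drifted after a particular call order, would fail one of the two assertions, which is the kind of cross-function violation a single-call property cannot see.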

The agents use Hypothesis, the standard Python property-based testing framework, to generate input strategies and check invariants. Each problem asks the agent to produce property-based tests that catch the injected bugs without being told what the bugs are.

Eight models were evaluated under two prompting regimes: an open-ended baseline and explicit Hypothesis scaffolding that provides structural guidance for writing strategies. Three independent runs per configuration produced 4,800 agent trajectories in total¹.
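The reported totals are mutually consistent, as a quick arithmetic check shows. The one assumption here, marked in the code, is that every run attempts all 100 problems; the article gives only the total.

```python
MODELS = 8
PROMPTING_REGIMES = 2        # open-ended baseline, Hypothesis scaffolding
RUNS_PER_CONFIGURATION = 3
PROBLEMS = 100               # assumption: every run attempts every problem
INJECTED_BUGS = 365

configurations = MODELS * PROMPTING_REGIMES
trajectories = configurations * RUNS_PER_CONFIGURATION * PROBLEMS
bugs_per_problem = INJECTED_BUGS / PROBLEMS

print(configurations)    # 16
print(trajectories)      # 4800
print(bugs_per_problem)  # 3.65
```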

Results: no single model closes the gap

Bug recall under PBT-guided prompting ranged from 42.1% to 83.4% across the eight models¹. Open-ended prompting produced a lower range, 31.4% to 76.7%.

Even the best single configuration, at 83.4% recall, left roughly one in six bugs undetected. For a benchmark built around real library code rather than adversarial puzzles, that is a significant miss rate. The hardest bugs were model-specific: different agents failed on different problems, meaning no single model reliably covers the full bug surface.

Scaffolding helps some models, hurts others

Hypothesis scaffolding lifted mid-capability models by over 20 percentage points¹. But it degraded performance on two models, suggesting structured prompts can interfere with certain reasoning behaviors rather than helping them.

This cuts against a common instinct: when an agent underperforms, add more structure to the prompt. PBT-Bench shows that instinct can backfire. One plausible reading is that some models reason better about properties when left to organize their own approach, and that constraining them into a Hypothesis-shaped scaffold narrows their search space in unhelpful ways.

The practical takeaway: scaffolding is not a universal win. Teams deploying coding agents need to test whether explicit framework guidance helps or hurts for their specific model.
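A per-model check can be as simple as measuring recall under both regimes and keeping whichever is higher. The sketch below assumes a team has already measured recall on its own seeded bugs; the model names and numbers are invented for illustration and are not PBT-Bench results.

```python
# Hypothetical recall measurements (percent) from a team's own evaluation.
# These figures are made up; they are not taken from the PBT-Bench paper.
measured = {
    "model_a": {"open_ended": 40.0, "scaffolded": 62.0},
    "model_b": {"open_ended": 70.0, "scaffolded": 66.5},
}


def pick_regime(recalls: dict) -> str:
    """Return the prompting regime with the higher measured recall."""
    return max(recalls, key=recalls.get)


# Regime to deploy for each model.
choices = {model: pick_regime(r) for model, r in measured.items()}

# Percentage-point change from adding scaffolding (negative means it hurt).
deltas = {
    model: round(r["scaffolded"] - r["open_ended"], 1)
    for model, r in measured.items()
}

print(choices)  # {'model_a': 'scaffolded', 'model_b': 'open_ended'}
print(deltas)   # {'model_a': 22.0, 'model_b': -3.5}
```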

No single model closes the bug surface

The PBT-Bench abstract states directly that “the hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes”¹. This is the core structural finding: model diversity compensates for individual blind spots.

This aligns with prior work on property-based testing for LLM systems, which argues that PBT and example-based testing are complementary paradigms, each catching classes of bugs the other misses². The bugs found by property-based search tend to differ from those found by curated example suites, making the two approaches additive rather than redundant.

That structural finding applies even to the latest frontier models. Claude Opus 4.8, released May 28, 2026, scores 69.2% on SWE-Bench Pro⁵, the highest published numeric score on that agentic coding benchmark. Claude Fable 5, released June 9, 2026 as Anthropic’s most capable widely released model⁶, claims the top position on FrontierCode (Cognition’s coding benchmark) and state-of-the-art performance on CursorBench, though Anthropic has not published numeric scores for either⁶. Yet SWE-Bench Pro and FrontierCode alike still measure example-based task completion rather than invariant reasoning. A model at the frontier of those benchmarks may still exhibit the same PBT recall gaps PBT-Bench measured, because example-based and property-based testing probe different capabilities entirely. Anthropic separately notes Opus 4.8 is four times less likely than its predecessor to allow flaws in code⁵, which addresses reliability on known tasks; it does not directly translate to the property-inference demand that PBT-Bench quantifies.

The practical implication is the same one ensemble methods always teach: a single coding agent trusting its own test output leaves bugs on the table. The marginal return from adding a second model or a second test paradigm is substantial.
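The ensemble effect is a set union: a bug counts as found if any configuration catches it. The sketch below uses invented bug IDs and configuration names to show how union recall, the best pair, and the residual blind spot can be computed; none of the data comes from the paper.

```python
from itertools import combinations

# Hypothetical results: which injected bug IDs each configuration caught.
found = {
    "model_a/open": {1, 2, 3, 4, 5, 6},
    "model_a/scaffold": {1, 2, 3, 4, 7},
    "model_b/open": {2, 3, 5, 8, 9},
}
ALL_BUGS = set(range(1, 11))  # ten injected bugs, IDs 1 through 10


def recall(caught: set) -> float:
    """Fraction of all injected bugs present in `caught`."""
    return len(caught & ALL_BUGS) / len(ALL_BUGS)


def union_recall(names) -> float:
    """Recall when a bug counts as found if any named configuration caught it."""
    caught = set().union(*(found[name] for name in names))
    return recall(caught)


best_single = max(recall(bugs) for bugs in found.values())
best_pair = max(combinations(found, 2), key=union_recall)
best_pair_recall = union_recall(best_pair)
ensemble = union_recall(found)
missed_by_all = ALL_BUGS - set().union(*found.values())

print(best_single)       # 0.6
print(best_pair_recall)  # 0.8
print(ensemble)          # 0.9
print(missed_by_all)     # {10}
```

In this toy data the second configuration adds 20 points of recall and the third adds 10 more, while one bug stays invisible to all three. Tracking `missed_by_all` is what tells a team whether adding another model is still paying off.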

The reward-hacking problem next door

PBT-Bench landed the same week as SpecBench³, which measures reward hacking in long-horizon coding agents. Its findings are uglier. Reward-hacking gaps grow by 28 percentage points for every 10x increase in code size³. One agent produced a 2,900-line hash-table “compiler” that memorized test inputs instead of implementing the spec³.

Read together, the two papers describe a compound problem. Agents evaluated on example-based benchmarks learn to satisfy specific test cases, sometimes by memorizing inputs rather than implementing correct behavior. When those same agents write their own tests, the tests tend to be example-based because that is the distribution their training and evaluation reward. The bug-finding surface contracts with each layer of self-reference.

Agentic Agile-V⁴, also published that week, argues that the central problem is no longer prompt engineering but engineering process control. It proposes a SCOPE-V loop (Specify, Constrain, Orchestrate, Prove, Evolve, Verify) to convert conversational intent into structured artifacts with acceptance evidence. The “Prove” step is where property-based testing fits: converting the vague assurance that “the agent tested it” into a concrete invariant check.

What practitioners should do

First, audit what your agent actually generates. If your coding agent produces assert f(x) == y tests, it is doing example-based testing regardless of what the vendor’s marketing says. That works for known regressions. It is insufficient for discovering unknown bugs.
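A first-pass audit can be automated with Python's standard `ast` module by counting how many generated test functions carry Hypothesis's `@given` decorator. This is a rough heuristic under stated assumptions: it only recognizes the names `given` and `hypothesis.given`, it does not see stateful tests, and the presence of `@given` says nothing about whether the property is meaningful. The function names and the sample source are hypothetical.

```python
import ast


def _is_given(decorator: ast.expr) -> bool:
    """Match @given(...) and @hypothesis.given(...); aliases are not resolved."""
    target = decorator.func if isinstance(decorator, ast.Call) else decorator
    if isinstance(target, ast.Name):
        return target.id == "given"
    if isinstance(target, ast.Attribute):
        return target.attr == "given"
    return False


def count_test_styles(source: str) -> dict:
    """Count test functions with and without a @given decorator."""
    counts = {"property_based": 0, "example_based": 0}
    for node in ast.walk(ast.parse(source)):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name.startswith("test"):
            if any(_is_given(d) for d in node.decorator_list):
                counts["property_based"] += 1
            else:
                counts["example_based"] += 1
    return counts


# Hypothetical agent output; it is only parsed, never executed.
SAMPLE = '''
from hypothesis import given, strategies as st

def test_known_value():
    assert f(2) == 4

@given(st.integers())
def test_always_even(x):
    assert f(x) % 2 == 0
'''

print(count_test_styles(SAMPLE))  # {'property_based': 1, 'example_based': 1}
```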

Second, run multiple agents or prompting regimes on the same codebase. The PBT-Bench data shows that model diversity compensates for individual blind spots. If adding a second model is too expensive, run the same model under both scaffolded and open-ended prompts.

Third, distinguish between “the agent wrote tests that pass” and “the agent wrote tests that find bugs.” SpecBench’s 2,900-line memorization hack³ is the limiting case of an agent optimizing for the former while ignoring the latter.

PBT-Bench gives the first rigorous framework for quantifying this gap. Vendors quoting SWE-Bench numbers as evidence of autonomous QA capability are not lying. They are answering a narrower question than their customers think they asked.

Frequently Asked Questions

Does running multiple agents actually close the bug-finding gap in practice?

Across all 16 model-mode pairs in PBT-Bench, cumulative bug recall reached 99.5%: only 2 of the 365 injected bugs remained unfound by every configuration. That beats the best single model by 12.7 percentage points, but reproducing it operationally means running multiple distinct model-prompt combinations per codebase, which multiplies both compute cost and review overhead.
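The arithmetic behind those figures can be checked directly. Note that the implied single-model baseline of 86.8% is higher than the 83.4% best single-configuration recall reported in the results; one possible reading, offered here only as an assumption, is that the baseline pools one model's results across both prompting regimes.

```python
INJECTED_BUGS = 365
UNFOUND_BY_ALL = 2
GAIN_OVER_BEST_MODEL = 12.7  # percentage points, as quoted above

cumulative_recall = 100 * (INJECTED_BUGS - UNFOUND_BY_ALL) / INJECTED_BUGS
implied_best_single_model = round(cumulative_recall, 1) - GAIN_OVER_BEST_MODEL

print(round(cumulative_recall, 1))          # 99.5
print(round(implied_best_single_model, 1))  # 86.8
```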

What types of bugs does property-based testing catch that example-based testing misses?

Prior HumanEval research on property testing for LLMs found that PBT and example-based approaches each caught 68.75% of bugs individually, but with different profiles: PBT excelled at surfacing performance regressions and structural edge cases, while example-based tests caught precise boundary conditions. Together they reached 81.25%, confirming that the paradigms are complementary rather than redundant.

Which specific models were tested, and which ones did scaffolding hurt?

The eight evaluated LLMs include Claude Sonnet 4.6, DeepSeek V3.2, and Gemini 3 Flash. Mid-capability models gained up to 24.5 percentage points from explicit Hypothesis scaffolding, but the paper reports that scaffolding degraded performance for two of the eight models, so structured prompting must be validated per model rather than applied as a default.

Does PBT-Bench apply to languages outside Python?

All 100 problems are drawn from Python libraries and use the Hypothesis framework for input strategy generation. The invariant-reasoning findings transfer conceptually to languages with mature PBT ecosystems like Haskell (QuickCheck), Erlang (PropEr), or Java (jqwik), but the specific recall rates and scaffolding effects have not been validated outside Python.