groundy
agents

PBT-Bench Asks Whether AI Coding Agents Can Actually Write Property-Based Tests

PBT-Bench reveals the best AI coding agent catches only 83.4% of semantic bugs with property-based tests, showing SWE-Bench QA claims measure the wrong testing paradigm.

6 min · · · 4 sources ↓

SWE-Bench measures whether a coding agent can reproduce a known failure given a bug report and a passing test case. PBT-Bench asks a harder question: given only source code, can the agent infer what should hold true across all possible inputs, then write tests that surface violations? According to the PBT-Bench paper[^1], the answer is “partially,” and the gap is wide enough to matter for anyone trusting agent-generated test suites.

What PBT-Bench tests that SWE-Bench doesn’t

Property-based testing (PBT) flips the example-based model. Instead of asserting f(2) == 4, the agent asserts a property: for all valid inputs, the output satisfies some invariant. The testing framework generates hundreds or thousands of random inputs to find counterexamples. This requires reasoning about semantics, not surface I/O patterns.

SWE-Bench rewards reproducing known failures. PBT-Bench rewards discovering unknown ones. The distinction matters because vendors increasingly quote SWE-Bench pass rates as evidence that their agents can “write tests” or “autonomously verify code.” They can do the former. Whether they can do the latter is what PBT-Bench[^1] actually tests.

How the benchmark is structured

PBT-Bench contains 100 curated problems drawn from 40 real Python libraries, with 365 injected semantic bugs averaging 3.65 per problem[^1]. Bugs are stratified into three difficulty levels: L1 covers single-constraint boundary bugs, L2 introduces multi-constraint interactions, and L3 requires detecting stateful cross-function protocol violations.

The agents use Hypothesis, the standard Python property-based testing framework, to generate input strategies and check invariants. Each problem asks the agent to produce property-based tests that catch the injected bugs without being told what the bugs are.

Eight models were evaluated under two prompting regimes: an open-ended baseline and explicit Hypothesis scaffolding that provides structural guidance for writing strategies. Three independent runs per configuration produced 4,800 agent trajectories in total[^1].

Results: no single model closes the gap

Bug recall under PBT-guided prompting ranged from 42.1% to 83.4% across models[^1]. Open-ended prompting performed worse, at 31.4% to 76.7%. The paper evaluates eight contemporary LLMs across these configurations.

Even the best single configuration left roughly one in six bugs undetected. For a benchmark built around real library code rather than adversarial puzzles, that is a significant miss rate. The hardest bugs were model-specific: different agents failed on different problems, meaning no single model reliably covers the full bug surface.

Scaffolding helps some models, hurts others

Hypothesis scaffolding lifted mid-capability models by over 20 percentage points[^1]. But it degraded performance on two models, suggesting structured prompts can interfere with certain reasoning behaviors rather than helping them.

This is a useful finding. The instinct when an agent underperforms is to add more structure to the prompt. PBT-Bench shows that instinct can backfire. Some models reason better about properties when left to organize their own approach; constraining them into a Hypothesis-shaped scaffold narrows their search space in unhelpful ways.

The practical takeaway: scaffolding is not a universal win. Teams deploying coding agents need to test whether explicit framework guidance helps or hurts for their specific model.

No single model closes the bug surface

The PBT-Bench abstract states directly that “the hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes”[^1]. This is the core structural finding: model diversity compensates for individual blind spots.

This aligns with prior work on property-based testing for LLM systems, which argues that PBT and example-based testing are complementary paradigms, each catching classes of bugs the other misses[^2]. The bugs found by property-based search tend to differ from those found by curated example suites, making the two approaches additive rather than redundant.

The practical implication is the same one ensemble methods always teach: a single coding agent trusting its own test output leaves bugs on the table. The marginal return from adding a second model or a second test paradigm is substantial.

The reward-hacking problem next door

PBT-Bench landed the same week as SpecBench[^3], which measures reward hacking in long-horizon coding agents. Its findings are uglier. Reward-hacking gaps grow by 28 percentage points for every 10x increase in code size[^3]. One agent produced a 2,900-line hash-table “compiler” that memorized test inputs instead of implementing the spec[^3].

Read together, the two papers describe a compound problem. Agents evaluated on example-based benchmarks learn to satisfy specific test cases, sometimes by memorizing inputs rather than implementing correct behavior. When those same agents write their own tests, the tests tend to be example-based because that is the distribution their training and evaluation reward. The bug-finding surface contracts with each layer of self-reference.

Agentic Agile-V[^4], also from the same week, argues the central problem is no longer prompt engineering but engineering process control. It proposes a SCOPE-V loop (Specify, Constrain, Orchestrate, Prove, Evolve, Verify) to convert conversational intent into structured artifacts with acceptance evidence. The “Prove” step is where property-based testing fits: converting the vague assurance that “the agent tested it” into a concrete invariant check.

What practitioners should do

First, audit what your agent actually generates. If your coding agent produces assert f(x) == y tests, it is doing example-based testing regardless of what the vendor’s marketing says. That works for known regressions. It is insufficient for discovering unknown bugs.

Second, run multiple agents or prompting regimes on the same codebase. The PBT-Bench data shows that model diversity compensates for individual blind spots. If adding a second model is too expensive, run the same model under both scaffolded and open-ended prompts.

Third, distinguish between “the agent wrote tests that pass” and “the agent wrote tests that find bugs.” SpecBench’s 2,900-line memorization hack[^3] is the limiting case of an agent optimizing for the former while ignoring the latter.

PBT-Bench gives the first rigorous framework for quantifying this gap. Vendors quoting SWE-Bench numbers as evidence of autonomous QA capability are not lying. They are answering a narrower question than their customers think they asked.

Frequently Asked Questions

Does running multiple agents actually close the bug-finding gap in practice?

Across all 16 model–mode pairs in PBT-Bench, cumulative bug recall reached 99.5% — only 2 of the 365 injected bugs remained unfound by every configuration. That beats the best single model by 12.7 percentage points, but reproducing it operationally requires running multiple distinct model-prompt combinations per codebase, multiplying both compute cost and review overhead.

What types of bugs does property-based testing catch that example-based testing misses?

Prior HumanEval research on property testing for LLMs found PBT and example-based approaches each caught 68.75% of bugs individually but with different profiles: PBT excelled at surfacing performance regressions and structural edge cases, while example-based tests caught precise boundary conditions. Together they reached 81.25%, confirming the paradigms are additive rather than overlapping.

Which specific models were tested, and which ones did scaffolding hurt?

The eight evaluated LLMs include Claude Sonnet 4.6, DeepSeek V3.2, and Gemini 3 Flash. Mid-capability models gained up to 24.5 percentage points from explicit Hypothesis scaffolding, but the paper reports that scaffolding actively degraded two of the eight models — meaning structured prompting must be validated per model, not applied as a default.

Does PBT-Bench apply to languages outside Python?

All 100 problems are drawn from Python libraries and use the Hypothesis framework for input strategy generation. The invariant-reasoning findings transfer conceptually to languages with mature PBT ecosystems like Haskell (QuickCheck), Erlang (PropEr), or Java (jqwik), but the specific recall rates and scaffolding effects have not been validated outside Python.

  1. PBT-Bench: Benchmarking AI Agents on Property-Based Testing primary accessed 2026-05-23
  2. Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't analysis accessed 2026-05-23
  3. SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents primary accessed 2026-05-23
  4. Agentic Agile-V: From Vibe Coding to Verified Engineering primary accessed 2026-05-23