DPrivBench: LLMs Score 99.5% on Textbook DP but Collapse on Advanced Reasoning

Wang et al.’s DPrivBench¹ benchmarks 11 LLMs on 713 differential-privacy verification instances and surfaces a gap that matters: GPT-5-High scores 0.995 on textbook mechanism verification, while GPT-5.5-High, the best model on advanced DP reasoning, reaches an F1 of only 0.829. On one parallel-composition question, every model but one fails all five trials, and the exception succeeds only once.

The benchmark: what DPrivBench actually tests

Category 1 covers 588 instances² across 6 mechanism types: Laplace, Gaussian-GDP, Gaussian-zCDP, Exponential Mechanism, Report-Noisy-Max Laplace, and Permute-and-Flip. Each instance pairs a function with parameters and poses a binary question: does this satisfy ε-DP? The task is mechanistic pattern-matching on known structures.
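
To make the Category 1 task concrete, here is a minimal sketch of the check for the Laplace case only. It relies on the standard result that adding Laplace noise of scale b to a query with L1 sensitivity Δ gives (Δ/b)-DP. The function name and arguments are illustrative assumptions and are not taken from the benchmark's code.

```python
import math


def laplace_satisfies_eps_dp(l1_sensitivity: float, scale: float, epsilon: float) -> bool:
    """Check a claimed epsilon against the standard Laplace-mechanism bound.

    Laplace noise of scale `scale` on a query with L1 sensitivity
    `l1_sensitivity` gives (l1_sensitivity / scale)-DP, so the claim is
    accepted when that achieved value is at or below `epsilon`.
    """
    if l1_sensitivity < 0 or scale <= 0 or epsilon <= 0:
        raise ValueError("sensitivity must be >= 0; scale and epsilon must be > 0")
    achieved = l1_sensitivity / scale
    # isclose guards against floating-point noise at the exact boundary.
    return achieved <= epsilon or math.isclose(achieved, epsilon)


# Hypothetical instances: a counting query (sensitivity 1) with two noise scales.
print(laplace_satisfies_eps_dp(1.0, scale=2.0, epsilon=0.5))
print(laplace_satisfies_eps_dp(1.0, scale=1.0, epsilon=0.5))
```

The other five mechanism types need their own bounds; this sketch says nothing about them.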

Category 2 spans 125 instances² across 16 topics grouped into DP accounting, DP statistics, DP-ML, and data-adaptive mechanisms. Composition, sensitivity analysis, and algorithmic subtlety are all in scope. It is also where the benchmark’s discriminating signal lives.

Labeling every instance as DP-satisfying yields F1 ≈ 0.503 on Category 2, per the paper. Any model scoring below that on a topic is less useful than a prior that says everything is fine.
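
The always-yes baseline can be worked out by hand. The sketch below assumes the reported figure is the ordinary positive-class F1, which this article does not confirm. Under that assumption an always-yes classifier has recall 1 and precision equal to the fraction p of instances that truly satisfy DP, so F1 = 2p / (1 + p).

```python
def always_yes_f1(positive_fraction: float) -> float:
    """F1 of a classifier that labels every instance as DP-satisfying.

    Precision equals the positive fraction p and recall is 1,
    so F1 = 2p / (1 + p).
    """
    if not 0 < positive_fraction <= 1:
        raise ValueError("positive_fraction must be in (0, 1]")
    p = positive_fraction
    return 2 * p / (1 + p)


def implied_positive_fraction(f1: float) -> float:
    """Invert F1 = 2p / (1 + p) to recover p = F1 / (2 - F1)."""
    if not 0 < f1 <= 1:
        raise ValueError("f1 must be in (0, 1]")
    return f1 / (2 - f1)


print(round(always_yes_f1(0.5), 3))               # 0.667 on a balanced set
print(round(implied_positive_fraction(0.503), 3))  # 0.336
```

Read this way, a baseline of 0.503 would imply that about a third of Category 2 instances are true positives. That is an inference from the formula, not a figure stated in the paper.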

The headline numbers and what they hide

GPT-5-High at 0.995 on Category 1 is real, but Category 1 is deterministic verification with known parameters. The gap opens in Category 2. GPT-5.5-High, the best performer on advanced reasoning, sits at an F1 of 0.829. F1 is not an error rate, but read loosely it means that roughly 1 in 5 advanced DP claims is misclassified.

The per-topic breakdown in the paper shows the variance clearly. Four topics (Quantile, DP-Adam, Accounting, Private Selection) reach at least 0.95 accuracy across models. The bottom three (Smooth Sensitivity, PATE, Output Perturbation) stay below 0.6 accuracy. On those three, the distance between the best frontier model and always-yes is within noise.

Advanced DP problems appear wherever systems compose mechanisms: federated learning, private aggregation pipelines, PATE-style training. An F1 of 0.829 on an audit tool, in a system where false positives produce uncorrected privacy violations, is not a capability gap to iterate on. It is a disqualifier for unsupervised use.

Three failure modes that break production pipelines

The paper documents three reproducible failure modes.

Sensitivity-bound hallucination. Models assert incorrect sensitivity bounds, particularly when deriving the bound requires reasoning about a function’s output range rather than reading a standard value. The models are confident. They are wrong.
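
One cheap guard against an asserted bound is to brute-force the sensitivity on a tiny finite domain and reject any claim that falls below it. The sketch below assumes replace-one neighbours and a scalar-valued query; the helper names are hypothetical. It can only refute a claimed bound. It cannot prove one over the full input domain.

```python
from itertools import product


def brute_force_sensitivity(query, domain, n):
    """Largest |query(D) - query(D')| over all replace-one neighbours.

    D ranges over every length-n tuple drawn from the finite `domain`;
    D' differs from D in exactly one position.
    """
    worst = 0.0
    for data in product(domain, repeat=n):
        for i in range(n):
            for value in domain:
                if value == data[i]:
                    continue
                neighbour = data[:i] + (value,) + data[i + 1:]
                worst = max(worst, abs(query(data) - query(neighbour)))
    return worst


def claim_is_refuted(claimed_bound, query, domain, n, tol=1e-12):
    """True if some neighbouring pair exceeds the claimed sensitivity."""
    return brute_force_sensitivity(query, domain, n) > claimed_bound + tol


def mean(data):
    return sum(data) / len(data)


# A claimed sensitivity of 0.1 for the mean of three values in {0, 1}
# is refuted: the true value on this domain is 1/3.
print(claim_is_refuted(0.1, mean, domain=[0, 1], n=3))
```

The search is exponential in n, so it is only practical as a sanity check on toy inputs.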

Parallel-composition confusion. Parallel composition holds when mechanisms operate on genuinely disjoint data subsets, not sequentially filtered ones or overlapping projections. On Q106, targeting exactly this distinction, only Gemini-3.1-Pro succeeds, and only in 1 out of 5 trials. Every other model fails every trial.
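
The accounting rule at stake can be sketched in a few lines: the composed budget is the maximum ε when the mechanisms touch disjoint records and the sum otherwise. The function names and the representation of subsets as sets of record IDs are assumptions made for illustration. The check only inspects overlap on one concrete dataset; it does not establish that the partitioning rule itself is data-independent, which is the harder part of the argument.

```python
from itertools import combinations


def partitions_are_disjoint(partitions):
    """True if no record ID appears in more than one subset."""
    return all(a.isdisjoint(b) for a, b in combinations(partitions, 2))


def composed_epsilon(partitions, epsilons):
    """Total pure-DP budget for mechanisms run on the given subsets.

    Disjoint subsets: parallel composition, total = max(epsilons).
    Any overlap: fall back to sequential composition, total = sum(epsilons).
    """
    if len(partitions) != len(epsilons):
        raise ValueError("need exactly one epsilon per partition")
    if not epsilons:
        return 0.0
    if partitions_are_disjoint(partitions):
        return max(epsilons)
    return sum(epsilons)


# Hypothetical record IDs: the second call shares record 2 between subsets.
print(composed_epsilon([{1, 2}, {3}], [0.5, 1.0]))     # 1.0
print(composed_epsilon([{1, 2}, {2, 3}], [0.5, 1.0]))  # 1.5
```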

The n→n-1 summation range. Some DP mechanisms require summing over n-1 items rather than n, a shift that doubles the required noise budget. Models miss this systematically. Missing it produces a stated privacy guarantee the actual mechanism doesn’t satisfy.

On a specific output perturbation question², Gemini-3-Pro achieves 0% accuracy across all five trials, GPT-5.4-High reaches 20%, and GPT-5.5-High reaches 40%. Even the best performers on this question, Gemini-3.1-Pro and GPT-5-High, manage only 60%. The per-question variance on a single mechanism type is exactly what aggregate scores obscure.

Report-Noisy-Max: the mechanism nobody can verify

Within Category 1, which frontier models generally handle well, Report-Noisy-Max Laplace is an outlier. Every model except GPT-5-High scores near or below 0.5 accuracy on it, per the benchmark data. This is a foundational mechanism type: textbook DP, fixed parameters, deterministic verification.

An accuracy of 0.5 on a binary classification task with known structure is what guessing, or answering yes every time on a balanced set, would achieve. It is not a near-miss.

The practical implication: a Category 1 accuracy in the 0.8–0.9 range, which most models achieve, is an aggregate over 6 mechanism types. The per-type variance means that score conceals at least one complete failure. An operator who deploys LLM-assisted verification on a Laplace mechanism today and Report-Noisy-Max tomorrow is relying on a capability that isn’t there.
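
The arithmetic behind that concealment is easy to reproduce. The per-type accuracies below are made-up illustrative values, not the paper's numbers, and the sketch assumes every mechanism type is weighted equally, which the article does not state.

```python
from statistics import mean

# HYPOTHETICAL per-type accuracies for one model; not taken from DPrivBench.
per_type_accuracy = {
    "Laplace": 0.95,
    "Gaussian-GDP": 0.95,
    "Gaussian-zCDP": 0.95,
    "Exponential Mechanism": 0.95,
    "Report-Noisy-Max Laplace": 0.50,
    "Permute-and-Flip": 0.95,
}


def summarize(accuracies):
    """Return (equal-weight aggregate, worst type, worst accuracy)."""
    worst_type = min(accuracies, key=accuracies.get)
    return mean(accuracies.values()), worst_type, accuracies[worst_type]


aggregate, worst_type, worst_accuracy = summarize(per_type_accuracy)
print(f"aggregate={aggregate:.3f} worst={worst_type} ({worst_accuracy:.2f})")
```

With these invented values the aggregate lands at 0.875 while one type sits at chance. Reporting the minimum alongside the mean is one way to keep that visible.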

The RAG and theorem-augmentation gap

Retrieval-augmented generation and theorem augmentation (providing relevant formal definitions alongside the query) both improve model performance on DPrivBench. Neither closes the gap on the hardest instances, per the paper.

The three documented failure modes persist under augmentation. Providing a model with the formal definition of parallel composition does not fix its inability to reason about whether a specific data partition is genuinely disjoint. Retrieval gets the model to the relevant theorem; the reasoning step that applies it correctly remains broken.

The natural response to a benchmark like this is to propose retrieval over a curated DP theorem library. That response has already been tested, and the ceiling it produces is visible in the results.

Conflict of interest and what it means for the results

Two of the seven authors, Om Thakkar and Ruihan Wu, are OpenAI employees, and the work received an OpenAI security research grant, per the paper’s disclosures.

The benchmark dataset is public on HuggingFace and the evaluation code on GitHub³, making independent replication feasible. The results are verifiable.

What the disclosure affects is the framing incentive. GPT-5-High’s 0.995 on Category 1 is the number most likely to circulate in isolation. It is also the result on a category designed for deterministic pattern-matching, produced by a different model version from the one achieving the best Category 2 result. GPT-5-High and GPT-5.5-High are not the same model. Category 1 and Category 2 are not the same task. The difference between them is the difference between “LLMs can verify DP” and “the best LLM misclassifies 1 in 5 advanced DP claims.”

Frequently Asked Questions

How does DPrivBench differ from formal DP verification tools like PinQ or DiffPrivLib?

DPrivBench evaluates how well an LLM reasons about DP claims; it does not itself certify any mechanism. PINQ and diffprivlib sit on the other side: they are frameworks whose built-in mechanisms carry proven guarantees, not judges of arbitrary claims. DPrivBench exposes where LLM reasoning diverges from guarantees of that kind. No directly competing LLM DP benchmark exists at this scale (713 instances across 11 models).

Is anyone using DPrivBench in production CI/CD pipelines today?

No practitioner tooling has integrated DPrivBench-style LLM verification into CI/CD or data pipeline audit workflows. The HuggingFace dataset shows roughly 199 monthly downloads, indicating early-stage researcher adoption only. Production use would require per-mechanism confidence scoring that the benchmark does not currently surface.

How quickly will these model-specific results become outdated?

The v2 revision adding GPT-5.5-High and Gemini-3.1-Pro dropped 2026-05-15, three days before this analysis. The benchmark structure (two categories, 713 instances) is evergreen, but specific model scores have a relevance window of roughly 2–4 weeks; they age with each new model generation.

What do the high-performing Category 2 topics have in common?

The four topics above 0.95 accuracy (Quantile, DP-Adam, Accounting, Private Selection) all involve applying well-documented compositional formulas with bounded parameter spaces. The failures cluster where reasoning requires judging whether a specific data partition is genuinely disjoint or deriving sensitivity from a function’s output range, tasks that demand situational reasoning rather than formula application.