OpenAI's New Safety Bug Bounty Pays Researchers for Jailbreaks and Policy Bypasses

OpenAI now pays researchers to jailbreak its models. In March 2026, the company launched a Safety Bug Bounty on Bugcrowd, separate from its existing security program, that treats prompt injection, policy circumvention, and harmful-output elicitation as reportable vulnerabilities. Six weeks later, a second, narrower bounty offered $25,000 to anyone who could produce a universal jailbreak prompt against GPT-5.5’s biosafety guardrails. Together, the two programs define a new category of paid disclosure: vendor-controlled, NDA-gated, and priced at a fraction of what comparable research earns elsewhere.

What the Safety Bug Bounty covers

The general Safety Bug Bounty, hosted on Bugcrowb, targets three categories of AI-specific vulnerability:

Agentic Risks Including MCP: prompt injection and data exfiltration attacks against tool-using agents. Submissions require a 50% reproducibility threshold.
OpenAI Proprietary Information leaks: extraction of internal training data, system prompts, or other non-public model internals.
Account and Platform Integrity bypasses: circumvention of usage controls, rate limits, or account-level safety features.

Generic jailbreaks that produce rude language or regurgitate publicly available information are explicitly out of scope. The program does not care whether you can make the model swear. It cares whether you can make it do something structurally dangerous while operating as an agent, or leak something it was trained to conceal.

The distinction matters. The bounty treats the model as a component in a system that takes actions on behalf of users, not as a chatbot with content filters. The attack surface is agentic: tool invocation, data routing, chain-of-composition failures.

The GPT-5.5 Bio Bug Bounty: narrow scope, specific terms

In April 2026, OpenAI launched a second bounty targeting GPT-5.5’s biological safety guardrails. The terms are narrow. A single researcher or team that produces a universal jailbreak prompt defeating five undisclosed biosafety questions earns $25,000. Only GPT-5.5 running in Codex Desktop is in scope. Applications close June 22, 2026; testing runs through July 27.

The GPT-5.5 System Card rates the model “High” in biological and chemical capability. Key safety benchmarks: 97.9% refusal rate on violent illicit behavior, 82.2% on harassment, 0.90 on destructive action avoidance, and 0.963 on prompt injection resistance. That last figure is a regression from GPT-5.4’s 0.998, which is a curious data point to publish alongside a bounty: the model already got worse at resisting prompt injection, and now OpenAI is paying external researchers to try harder.

How it compares to safety bounties at other labs

The general Safety Bug Bounty caps individual rewards at up to $7,500, a separate and much lower ceiling than OpenAI’s traditional Security Bug Bounty, which raised its top payout to $100,000 in 2025. The in-scope agentic products named at launch include ChatGPT Agent, the Atlas browser, Codex, Operator, and Connectors. Submissions are triaged jointly by OpenAI’s safety and security teams, and a report can be rerouted between the two programs depending on whether the underlying issue is behavioral or a conventional security flaw.

Place that next to what other labs pay, and a pattern appears: behavioral jailbreaks and safety-guardrail bypasses are paid at only two of the major players, and everyone else routes them to a feedback form.

Program	Platform	Pays for jailbreaks / safety bypasses?	Top reward	Notable terms
OpenAI Safety Bug Bounty	Bugcrowd	Yes, agentic and policy-bypass classes	up to $7,500	50% reproducibility threshold; generic jailbreaks out of scope
OpenAI GPT-5.5 Bio Bounty	Application portal	Yes, biosafety only	$25,000	Single winner; NDA; five undisclosed questions; closes June 2026
Anthropic Model Safety Bug Bounty	HackerOne	Yes, core focus	up to $35,000	Universal jailbreak of Constitutional Classifiers; public since ~May 2026
Google AI VRP	Bug Hunters	No, explicitly excluded	up to $30,000	Pays only for AI bugs with conventional security impact
Microsoft AI Bounty	MSRC	No, excludes non-security prompt injection	up to $30,000	Covers Copilot and AI products; $250 floor
Mozilla 0Din	0Din	Yes	up to $15,000	Tiered by category; weights disclosure is the top tier

The amounts are less interesting than the scope decisions. Google and Microsoft both built AI bounties in 2025 and 2026 and both decided, in writing, that a jailbreak with no downstream security consequence is not a bug they will pay for. Google’s program rules say “prompt injections, jailbreaks and alignment issues” go to in-product feedback, not the bounty. Microsoft’s AI bounty similarly excludes prompt injection that lacks a security impact. That is a defensible engineering position: a model that says something rude is not a breach of a system boundary. It also means that the entire behavioral-safety category, the part that worries biosecurity regulators most, has exactly two paying buyers, and one of them just locked the highest-stakes question behind an NDA.

Why the payout structure drew fire

The security community’s response to the Bio Bounty was pointed. $25,000 is one-twentieth of OpenAI’s prior spend on red-teaming: the company ran a $500,000 Kaggle competition (ten winners at $50,000 each for red-teaming gpt-oss-20b) for model safety testing before this. The scale gap only widened after launch. [Updated June 2026] OpenAI reported $13.07 billion in 2025 revenue against a net loss in the tens of billions, and by early 2026 was running at roughly a $25 billion annualized rate. The company was valued at $500 billion in an October 2025 share sale and at about $852 billion after a funding round that closed at the end of March 2026. Against a current revenue run-rate near $65 million a day, the full Bio Bounty prize is about 33 seconds of intake.

The terms compound the sticker shock. Only the first researcher to submit a universal jailbreak collects the full amount. Runners-up get nothing. Participants sign an NDA that silences them regardless of whether their submission is accepted or rejected. A researcher could identify a genuine vulnerability, have it dismissed, and still be legally barred from disclosing it elsewhere.

The single-winner structure is worth dwelling on, because it inverts how most bounty economics work. A standard program pays per valid finding, so two researchers who independently discover the same class of bug both get paid, and the vendor gets two reproductions confirming severity. The Bio Bounty pays the first valid universal jailbreak and nothing after, which means the expected value for any individual entrant is the prize divided by the number of serious competitors, minus the opportunity cost of the work and the option value of publishing instead. For a researcher capable of producing a universal bio jailbreak, the same skill set commands far more on a fine-tuning consulting contract or a conference paper that builds reputation. The structure selects for hobbyists and the already-curious, not for the people most likely to find the hardest breaks.

The PR-theory reading

Commenters on Hacker News speculated that the bounty’s terms are designed to produce a specific outcome: low participation yielding a “nobody broke it” narrative, which OpenAI can cite as evidence of model safety to regulators. Restrictive terms and modest payouts discourage serious red-teamers from entering. The NDA prevents anyone who does participate from contradicting the official line.

This is speculative. The structural incentives align, though, and the timing got sharper after publication. OpenAI filed confidentially for an IPO on June 8, 2026, with Goldman Sachs and Morgan Stanley leading; the Wall Street Journal had reported the company was preparing the filing weeks earlier. [Updated June 2026] A public, well-documented safety testing program with named benchmarks is an asset in a regulatory environment where legislators are actively debating AI safety requirements, and a safety story is exactly the kind of forward-looking risk-factor material that gets refined into S-1 language. A bounty that few people enter, where results are vendor-controlled, produces cleaner press than one where a researcher demonstrates a catastrophic failure and publishes the paper. (Groundy has made the same read of OpenAI’s biology-risk post, which maps onto SEC disclosure conventions almost line for line.)

The counterargument is straightforward: any bounty is better than none, and OpenAI is under no obligation to pay market rates. Both points are true. The gap between what OpenAI spends on safety marketing and what it spends on safety bounties is still a data point worth tracking.

What the attack-surface research shows

A paper submitted to arxiv on May 21, 2026 (arXiv:2605.22984) demonstrates Test-Time Training (TTT) achieving 93-95% attack success rates in bypassing safety filters across multiple model families. The technique transfers to production fine-tuning APIs, meaning the vulnerability is not theoretical. It works against the same APIs developers use to customize model behavior.

This is the technical context the bounty programs operate in. Safety guardrails are not robust in the current model generation. A 93-95% bypass rate, if independently confirmed, means any competent attacker with fine-tuning API access can defeat content filters with high reliability. The question is not whether the guardrails break. The research says they do. The question is who finds the breaks first and what they do with the knowledge.

That question is exactly what OpenAI’s bounties are trying to influence. If the first disclosure route a researcher considers is an NDA-gated vendor program rather than a published paper, the vendor controls the timeline, the narrative, and the fix.

There is a deeper structural problem the bounty cannot fix, and it predates the program. A “universal jailbreak” prize rewards the single most general attack a researcher can construct, which sounds like it covers the worst case. Recent red-teaming research suggests the opposite incentive. Automated red-teamers tend to mode-collapse onto a narrow band of high-yield jailbreaks and leave wide regions of the attack space unexplored, which is the pattern Groundy covered in the Stable-GFlowNet work on jailbreak diversity. A bounty that pays once, for the first universal break, optimizes for exactly this failure mode: the first entrant to find a working attack collects, the search stops, and the long tail of distinct but lower-effort breaks never gets probed. A five-question test set narrows the target further. The model can pass a curated handful of bio prompts while remaining trivially breakable on the thousands of phrasings nobody was paid to try. The headline outcome (nobody collected the prize) and the security reality (the guardrails are porous, per the TTT result) can both be true at once, which is precisely the gap a regulator reading the press release would miss.

What this means for developers building on OpenAI’s API

The safety guarantees in the system card are a snapshot of the model’s behavior against OpenAI’s internal test suite. They are not a guarantee against determined adversarial input, and they are not a substitute for your own input validation and output filtering.

The bounty programs create a formal channel for reporting jailbreaks, but the terms favor OpenAI’s interests: NDAs, single-winner payouts, vendor-controlled disclosure timelines. Researchers who want to publish findings openly, or who believe the public interest requires disclosure regardless of vendor preference, have no lane in this program. That tension between coordinated disclosure and open publication is not new in security. It is new in AI safety, where the “vulnerabilities” are behavioral rather than code-level.

The competitive dynamic compounds it. If OpenAI is the only major lab paying for jailbreak disclosure, researchers with findings that apply across model families will route them to the highest bidder or the most permissive publication venue. Labs that do not offer comparable bounties will learn about their own vulnerabilities secondhand, if at all.

Frequently Asked Questions

What if a jailbreak works across multiple model families, not just GPT-5.5?

The Bio Bounty’s NDA would prevent the researcher from disclosing the finding to Anthropic, Google, or any other affected vendor. A cross-family jailbreak submitted under OpenAI’s terms becomes a single-vendor patch while the same vulnerability remains exploitable on every other platform the researcher is legally barred from warning.

Do Anthropic or Google run comparable safety bounties?

[Updated June 2026] Anthropic does, and the gap with OpenAI is narrower than it looked at launch. Anthropic runs a Model Safety Bug Bounty on HackerOne that pays up to $35,000 for a novel universal jailbreak of its Constitutional Classifiers in the CBRN and cyber domains. The program ran invite-only through 2025 (its 2025 constitutional-classifiers challenge drew 339 participants and paid out $55,000 across four teams) and opened publicly around May 2026. Like OpenAI’s Bio Bounty, it targets universal jailbreaks rather than one-off content slips, and it routes infrastructure bugs to a separate product-security track. Anthropic’s Project Glasswing, gated behind the June 2026 Mythos 5 release, is a separate thing: a capability-access program for vetted cyber defenders and biology researchers, not a disclosure bounty.

Google goes the other way. Its AI Vulnerability Reward Program, launched in October 2025, pays up to $20,000 (and as much as $30,000 with report-quality multipliers) but explicitly puts “prompt injections, jailbreaks and alignment issues” out of scope, routing them to in-product feedback instead. The bounty pays only for AI bugs with conventional security impact: data exfiltration, model theft, prompt injection that actually breaches a boundary. So OpenAI is not the only lab paying for jailbreaks. It shares that lane with Anthropic and, at smaller dollar amounts, Mozilla’s 0Din ($500 to $15,000 by category). What OpenAI is doing alone is paying specifically for a biosafety universal jailbreak under single-winner, NDA-locked terms.

What does the prompt injection regression from 0.998 to 0.963 mean in agentic deployments?

That 3.5-point drop means roughly 1 in 28 injection attempts succeeds against GPT-5.5 where GPT-5.4 would have blocked it. For an agent chain making 100 tool calls per session, the compound probability of at least one compromised step rises to about 2.5%. The regression looks small in a benchmark table but is material when the model is orchestrating multi-step actions.

Can standard fine-tuning accidentally weaken safety behavior?

The TTT paper shows that safety bypasses transfer through production fine-tuning APIs without requiring adversarial training data. A developer fine-tuning GPT-5.5 for a legitimate domain task can inadvertently produce a model variant with degraded guardrails. The vulnerability is in the fine-tuning mechanism itself, not in the developer’s intent.

Why does the Bio Bounty test only five questions?

OpenAI has not publicly justified the narrow scope. A five-question test set risks certifying safety against a handful of known-dangerous prompts while leaving the long tail of biosafety queries untested. Because the questions are undisclosed, external researchers also cannot verify whether they represent the hardest cases or a curated set the model already handles.