OpenAI now pays researchers to jailbreak its models. In March 2026, the company launched a Safety Bug Bounty on Bugcrowd, separate from its existing security program, that treats prompt injection, policy circumvention, and harmful-output elicitation as reportable vulnerabilities. Six weeks later, a second, narrower bounty offered $25,000 to anyone who could produce a universal jailbreak prompt against GPT-5.5’s biosafety guardrails. Together, the two programs define a new category of paid disclosure: vendor-controlled, NDA-gated, and priced at a fraction of what comparable research earns elsewhere.
What the Safety Bug Bounty covers
The general Safety Bug Bounty, hosted on Bugcrowd, targets three categories of AI-specific vulnerability:
- Agentic Risks Including MCP: prompt injection and data exfiltration attacks against tool-using agents. Submissions must meet a 50% reproducibility threshold.
- OpenAI Proprietary Information leaks: extraction of internal training data, system prompts, or other non-public model internals.
- Account and Platform Integrity bypasses: circumvention of usage controls, rate limits, or account-level safety features.
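The article's description of the terms does not say how the 50% reproducibility threshold is measured. A minimal sketch, assuming it means the fraction of repeated trials in which the attack succeeds, and assuming a rate of exactly 50% qualifies; `reproducibility_rate`, `meets_threshold`, and the sample outcomes are hypothetical and not part of the program:

```python
from typing import Callable

# The 50% figure comes from the program terms as reported; how OpenAI
# or Bugcrowd actually measures it is an assumption here.
REPRO_THRESHOLD = 0.5


def reproducibility_rate(attempt: Callable[[], bool], trials: int) -> float:
    """Run `attempt` `trials` times and return the fraction that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    successes = sum(1 for _ in range(trials) if attempt())
    return successes / trials


def meets_threshold(rate: float, threshold: float = REPRO_THRESHOLD) -> bool:
    """Assumed rule: a rate at or above the threshold qualifies."""
    return rate >= threshold


# Stand-in for ten recorded runs of the same attack (made-up outcomes).
recorded = iter([True, False, True, True, False, True, False, True, True, False])
rate = reproducibility_rate(lambda: next(recorded), trials=10)
print(f"Reproduced in {rate:.0%} of trials; qualifies: {meets_threshold(rate)}")
```

Logging each trial's outcome this way gives a researcher a concrete number to attach to a submission instead of a claim that the attack "usually works".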
Generic jailbreaks that produce rude language or regurgitate publicly available information are explicitly out of scope. The program does not care whether you can make the model swear. It cares whether you can make it do something structurally dangerous while operating as an agent, or leak something it was trained to conceal.
The distinction matters. The bounty treats the model as a component in a system that takes actions on behalf of users, not as a chatbot with content filters. The attack surface is agentic: tool invocation, data routing, chain-of-composition failures.
The GPT-5.5 Bio Bug Bounty: narrow scope, specific terms
In April 2026, OpenAI launched a second bounty targeting GPT-5.5’s biological safety guardrails. The terms are narrow. A single researcher or team that produces a universal jailbreak prompt, one that defeats the guardrails on five undisclosed biosafety questions, earns $25,000. Only GPT-5.5 running in Codex Desktop is in scope. Applications close June 22, 2026; testing runs through July 27.
The GPT-5.5 System Card rates the model “High” in biological and chemical capability. Key safety benchmarks: 97.9% refusal rate on violent illicit behavior, 82.2% on harassment, 0.90 on destructive action avoidance, and 0.963 on prompt injection resistance. That last figure is a regression from GPT-5.4’s 0.998, which is a curious data point to publish alongside a bounty: the model already got worse at resisting prompt injection, and now OpenAI is paying external researchers to try harder.
Why the payout structure drew fire
The security community’s response to the Bio Bounty was pointed. $25,000 is one-twentieth of what OpenAI put up for an earlier red-teaming exercise, a $500,000 Kaggle competition for model safety testing. By one estimate, $25,000 covers roughly 33 seconds of OpenAI’s revenue at an estimated $65 million per day. For context, OpenAI reported $13.1 billion in 2025 revenue, about $36 million per day averaged over that year, and was valued at $500 billion in an October 2025 share sale.
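The two ratios are easy to check. A short sketch using only the figures quoted above; the $65 million daily revenue figure is itself an estimate, and the function and constant names are illustrative:

```python
# Figures as quoted in the article; the daily revenue number is an estimate.
BOUNTY_USD = 25_000
KAGGLE_PRIZE_USD = 500_000
EST_DAILY_REVENUE_USD = 65_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_of_revenue(amount_usd: float, daily_revenue_usd: float) -> float:
    """Return how many seconds of revenue a dollar amount represents."""
    revenue_per_second = daily_revenue_usd / SECONDS_PER_DAY
    return amount_usd / revenue_per_second


bounty_share = BOUNTY_USD / KAGGLE_PRIZE_USD
bounty_seconds = seconds_of_revenue(BOUNTY_USD, EST_DAILY_REVENUE_USD)

print(f"Bounty as a share of the Kaggle prize: {bounty_share:.0%}")
print(f"Bounty in seconds of estimated revenue: {bounty_seconds:.1f}")
```

Swapping in a different daily revenue estimate changes the seconds figure proportionally, so the comparison is only as good as the revenue estimate behind it.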
The terms compound the sticker shock. Only the first researcher to submit a universal jailbreak collects the full amount. Runners-up get nothing. Participants sign an NDA that silences them regardless of whether their submission is accepted or rejected. A researcher could identify a genuine vulnerability, have it dismissed, and still be legally barred from disclosing it elsewhere.
The PR-theory reading
Commenters on Hacker News speculated that the bounty’s terms are designed to produce a specific outcome: low participation yielding a “nobody broke it” narrative, which OpenAI can cite as evidence of model safety to regulators. Restrictive terms and modest payouts discourage serious red-teamers from entering. The NDA prevents anyone who does participate from contradicting the official line.
This is speculative. The structural incentives align, though. OpenAI is reportedly preparing to file confidentially for an IPO, according to Bloomberg. A public, well-documented safety testing program with published outcomes is an asset in a regulatory environment where legislators are actively debating AI safety requirements. A bounty that few people enter, where results are vendor-controlled, produces cleaner press than one where a researcher demonstrates a catastrophic failure and publishes the paper.
The counterargument is straightforward: any bounty is better than none, and OpenAI is under no obligation to pay market rates. Both points are true. The gap between what OpenAI spends on safety marketing and what it spends on safety bounties is still a data point worth tracking.
What the attack-surface research shows
A paper submitted to arXiv on May 21, 2026 (arXiv:2605.22984) demonstrates Test-Time Training (TTT) achieving 93-95% attack success rates in bypassing safety filters across multiple model families. The technique transfers to production fine-tuning APIs, meaning the vulnerability is not theoretical. It works against the same APIs developers use to customize model behavior.
This is the technical context the bounty programs operate in. Safety guardrails are not robust in the current model generation. A 93-95% bypass rate, if independently confirmed, means any competent attacker with fine-tuning API access can defeat content filters with high reliability. The question is not whether the guardrails break. The research says they do. The question is who finds the breaks first and what they do with the knowledge.
That question is exactly what OpenAI’s bounties are trying to influence. If the first disclosure route a researcher considers is an NDA-gated vendor program rather than a published paper, the vendor controls the timeline, the narrative, and the fix.
What this means for developers building on OpenAI’s API
The safety guarantees in the system card are a snapshot of the model’s behavior against OpenAI’s internal test suite. They are not a guarantee against determined adversarial input, and they are not a substitute for your own input validation and output filtering.
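In its simplest form, application-side validation and filtering can be sketched as an allowlist check on the tool calls a model requests plus a redaction pass on its output. Everything named here (`ALLOWED_TOOLS`, `ToolCall`, `validate_tool_call`, `redact_secrets`, and the secret patterns) is hypothetical and not part of any OpenAI SDK; a real deployment would need far broader checks:

```python
import re
from dataclasses import dataclass, field

# Hypothetical allowlist: the only tools this illustrative agent may invoke.
ALLOWED_TOOLS = {"search_docs", "read_file"}

# Hypothetical patterns for secret-shaped strings that should never leave the app.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                  # API-key-shaped tokens
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),   # PEM private key headers
]


@dataclass
class ToolCall:
    """A tool invocation requested by the model (illustrative shape only)."""
    name: str
    arguments: dict = field(default_factory=dict)


def validate_tool_call(call: ToolCall) -> bool:
    """Input-side check: allow only tools the application explicitly lists."""
    return call.name in ALLOWED_TOOLS


def redact_secrets(text: str) -> str:
    """Output-side check: mask secret-shaped substrings before text is released."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


# An injected instruction asks the agent to email data out; the allowlist refuses it.
suspicious = ToolCall(name="send_email", arguments={"to": "attacker@example.com"})
print(validate_tool_call(suspicious))
print(redact_secrets("debug dump: sk-" + "a" * 24))
```

The point of the design is that both checks run in the application, outside the model, so they still hold when a prompt injection has already changed what the model is trying to do.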
The bounty programs create a formal channel for reporting jailbreaks, but the terms favor OpenAI’s interests: NDAs, single-winner payouts, vendor-controlled disclosure timelines. Researchers who want to publish findings openly, or who believe the public interest requires disclosure regardless of vendor preference, have no lane in this program. That tension between coordinated disclosure and open publication is not new in security. It is new in AI safety, where the “vulnerabilities” are behavioral rather than code-level.
The competitive dynamic compounds it. If OpenAI is the only major lab paying for jailbreak disclosure, researchers with findings that apply across model families will route them to the highest bidder or the most permissive publication venue. Labs that do not offer comparable bounties will learn about their own vulnerabilities secondhand, if at all.
Frequently Asked Questions
What if a jailbreak works across multiple model families, not just GPT-5.5?
The Bio Bounty’s NDA would prevent the researcher from disclosing the finding to Anthropic, Google, or any other affected vendor. A cross-family jailbreak submitted under OpenAI’s terms becomes a single-vendor patch, while the same vulnerability remains exploitable on every other platform, whose vendors the researcher is legally barred from warning.
Do Anthropic or Google run comparable safety bounties?
Neither Anthropic nor Google has a public paid bounty program for safety guardrail bypasses as of May 2026. Anthropic accepts responsible disclosure and publishes safety research but offers no formal payout. Google’s Vulnerability Reward Program covers traditional security bugs, not behavioral model jailbreaks. OpenAI is effectively setting the market terms for an entire disclosure category.
What does the prompt injection regression from 0.998 to 0.963 mean in agentic deployments?
That 3.5-point drop means roughly 1 in 29 injection attempts succeeds against GPT-5.5 where GPT-5.4 would have blocked it. The gap compounds quickly. If the score is read as a per-attempt block rate and attempts are independent, an agent chain making 100 tool calls per session, each exposed to an injection attempt, has about an 18% chance of at least one compromised step at 0.998 and about a 98% chance at 0.963. The regression looks small in a benchmark table but is material when the model is orchestrating multi-step actions.
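The compounding can be checked directly. This sketch assumes the benchmark score is a per-attempt block rate, that every tool call faces exactly one injection attempt, and that attempts are independent; none of those assumptions is confirmed by the system card, and the function name is illustrative:

```python
def p_at_least_one_success(resistance: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent injections gets through.

    `resistance` is the assumed per-attempt probability that an injection is blocked.
    """
    if not 0.0 <= resistance <= 1.0:
        raise ValueError("resistance must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts are blocked with probability resistance ** attempts.
    return 1.0 - resistance ** attempts


GPT_5_4_RESISTANCE = 0.998  # score quoted for GPT-5.4
GPT_5_5_RESISTANCE = 0.963  # score quoted for GPT-5.5
ATTEMPTS = 100              # one injection attempt per tool call, by assumption

p_old = p_at_least_one_success(GPT_5_4_RESISTANCE, ATTEMPTS)
p_new = p_at_least_one_success(GPT_5_5_RESISTANCE, ATTEMPTS)

print(f"GPT-5.4, {ATTEMPTS} attempts: {p_old:.1%} chance of at least one success")
print(f"GPT-5.5, {ATTEMPTS} attempts: {p_new:.1%} chance of at least one success")
```

Lowering `ATTEMPTS` models a session where only some tool calls touch attacker-controlled content, which is the more realistic case for most deployments.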
Can standard fine-tuning accidentally weaken safety behavior?
The TTT paper shows that safety bypasses transfer through production fine-tuning APIs without requiring adversarial training data. A developer fine-tuning GPT-5.5 for a legitimate domain task can inadvertently produce a model variant with degraded guardrails. The vulnerability is in the fine-tuning mechanism itself, not in the developer’s intent.
Why does the Bio Bounty test only five questions?
OpenAI has not publicly justified the narrow scope. A five-question test set risks certifying safety against a handful of known-dangerous prompts while leaving the long tail of biosafety queries untested. Because the questions are undisclosed, external researchers also cannot verify whether they represent the hardest cases or a curated set the model already handles.