Frontier AI Has Broken the Open CTF Format: What the Scoreboard Collapse Means for Security Training

Open CTF scoreboards have stopped measuring human skill. In a May 1 post that went viral on Hacker News, Kabir Acharya argued that Claude Opus 4.5 now solves almost every medium-difficulty challenge and some hard ones autonomously, while GPT-5.5 can one-shot Insane-difficulty heap exploits on HackTheBox.¹ Since that post, Anthropic has released Opus 4.8 and then Claude Fable 5, its most capable widely released model, which sits in a new tier above Opus.⁶⁷ Fable 5 ships with cybersecurity classifiers that block offensive cyber tasks, and it showed zero compliance across all 30 jailbreak techniques Anthropic tested.⁷ The problem is not that competitions need harder puzzles. It is that the cheapest on-ramp into offensive security has become a contest between API budgets, and the talent pipeline that recruiters and red teams have relied on for two decades is losing its calibration signal.

The May 1 Post That Broke the Scene

Kabir Acharya is not a pundit. He competes with TheHackersCrew and Blitzkrieg, teams that place at major CTFs. When he published “The CTF Scene is Dead” on May 1, the post carried the weight of someone who had watched the floor drop out from under the format in real time. His core claim was stark: almost every medium-difficulty challenge, and some hard ones, are now “agent-solvable” with Claude Opus 4.5 (the then-current flagship). GPT-5.5, he wrote, can one-shot Insane-difficulty active leakless heap pwn challenges on HackTheBox. These are not tutorial challenges. Heap exploitation with no leaks is the kind of task that used to separate a competent undergraduate from a hireable junior reverse engineer.

The post went viral on Hacker News for a reason. It named the specific models and the specific difficulty tiers, and it came from a competitor rather than a vendor. Vendor claims that models are “CTF-tested” have been circulating for months, but the benchmarks behind those claims tend to saturate quickly; Acharya’s argument was about the live, open scoreboards where humans still expected to compete against humans.

What the 2026 Scoreboard Actually Looks Like

The numbers from BSidesSF 2026 are hard to spin.² Sixteen teams fully solved every challenge, up from one team in 2025. The top ten were fully automated. A former fifth-place solo competitor estimated they would have placed seventy-fifth without LLM assistance. This is not incremental improvement. It is a discontinuity.

A live onsite CTF study with forty-one participants put harder numbers on the trend.³ The strongest autonomous agent scored 4,900 points, enough to rank second when set against the top ten human teams, at a cost of $96.32 in API usage.³ Sixty-four percent of participants with low CTF expertise but high AI expertise achieved intermediate-to-high scores. The correlation between security skill and scoreboard position is weakening.
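To put the agent's result in unit terms, the short Python sketch below divides the two figures reported in the study. It uses only those two numbers; the variable names are illustrative, and nothing here accounts for engineering time or failed runs, which the article does not report.

```python
# Figures reported in the onsite study cited above.
agent_points = 4_900
agent_cost_usd = 96.32

# Unit economics of the strongest autonomous agent.
cost_per_point = agent_cost_usd / agent_points
points_per_dollar = agent_points / agent_cost_usd

print(f"cost per point:    ${cost_per_point:.4f}")
print(f"points per dollar: {points_per_dollar:.1f}")
```

That works out to roughly two cents per scoreboard point, or about fifty points per dollar of API spend.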

Why Medium Challenges Matter More Than DEF CON Finals

It is tempting to treat this as a solved-game problem: make the challenges harder. But the CTF ecosystem is not a single ladder. It is a pyramid. DEF CON finals and elite private competitions sit at the top, but they are irrelevant to the vast majority of people who will ever touch a debugger. Working offensive-security engineers typically cut their teeth on medium-difficulty web, pwn, and reverse-engineering challenges in open competitions. That is where you learn buffer overflows, race conditions, and cryptographic oracle attacks for the first time. It is also where recruiters look for signal.

If an autonomous agent can solve those medium challenges for under a hundred dollars, the incentive structure inverts. The beginner no longer needs to internalize the mechanics of a heap layout to place on a scoreboard. They need prompt engineering and an API key. The skill being tested has changed, but the credential being issued, a scoreboard rank, has not. Employers who relied on CTF performance as a proxy for security intuition are now reading API-budget rankings.

The Organizer’s Trilemma: Ban, Redesign, or Surrender

Organizers face three options, and all of them extract a cost.

Option one is an AI ban. The problem is detection. In a remote competition, you cannot distinguish a competitor who is silently querying GPT-5.5 from one who is just fast. Even live onsite events struggle; the study cited above ran onsite and still saw AI-assisted participants blend into the field.³ Bans are unenforceable theater.

Option two is redesign. Challenges could shift toward the tasks frontier models still fail: long-horizon multi-step attacks requiring human stealth, bespoke hardware, or creative cryptographic constructions. UK AISI’s eighteen-month evaluation found that model performance on multi-step cyber attacks scales log-linearly with inference-time compute, with no observed plateau.⁴ But even the then-current Opus 4.6 completed only twenty-two of thirty-two steps on a corporate-network attack that takes a human expert roughly fourteen hours.⁴ The gap exists. The trouble is that writing challenges at this tier requires authors who can themselves outthink frontier models, and few such authors exist. Authoring cost rises, the pool of qualified setters shrinks, and the barrier to entry for new organizers climbs.
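For readers unfamiliar with the term, "log-linear" means that each tenfold increase in compute adds a roughly constant number of completed steps. The Python sketch below shows only that shape. Both coefficients are hypothetical: the intercept is chosen so the curve passes through the one data point this article reports (22 steps at roughly 100 million tokens), and the slope is invented for illustration. AISI's fitted values are not given here, so the other outputs are not predictions.

```python
import math


def steps_completed(compute_tokens: float, intercept: float, slope: float) -> float:
    """Log-linear model: steps = intercept + slope * log10(compute)."""
    return intercept + slope * math.log10(compute_tokens)


# HYPOTHETICAL coefficients, chosen only to show the shape of the curve.
# SLOPE is the number of extra steps gained per tenfold increase in compute.
INTERCEPT = -18.0
SLOPE = 5.0

for tokens in (1e6, 1e7, 1e8, 1e9):
    steps = steps_completed(tokens, INTERCEPT, SLOPE)
    print(f"{tokens:.0e} tokens -> {steps:.1f} steps (illustrative)")
```

The practical point for challenge authors is that, under a curve like this, a challenge that is out of reach today can be brought within reach simply by spending more on inference.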

Opus 4.8, released May 28, 2026, narrows that gap.⁶ On SWE-Bench Pro, an agentic coding benchmark that tests multi-step software engineering, it scores 69.2%, up from 64.3% for Opus 4.7, and Anthropic states it is four times less likely than its predecessor to allow flaws in code.⁶ On Terminal-Bench 2.1, which scores autonomous terminal-based coding agents, it reaches 74.6%, though GPT-5.5 leads that benchmark at 78.2%.⁶ Claude Fable 5, released June 9, 2026 as the first model in Anthropic’s new Mythos-class tier above Opus, extends the trajectory: it achieved the highest score among frontier models on FrontierCode at medium effort and the highest score for senior-level reasoning on Hebbia’s finance benchmark, though Anthropic has published no numeric figures for either.⁷ The direction is consistent. Each generation shrinks the space of challenges that require human-level judgment, which shortens the window in which a newly authored hard challenge stays out of model reach. The companion Claude Mythos 5 adds a further wrinkle. Restricted to Project Glasswing partners and select biology researchers, it is explicitly aimed at defensive cybersecurity research, so the same capability that is raising CTF solve rates is also being used to find vulnerabilities across the exposed attack surface.⁷

Option three is surrender: accept that open scoreboards now measure human-plus-AI teams, or pure AI agents, and treat the old format as obsolete. This is honest, but it abandons the twenty-year calibration signal that the industry has used to identify raw talent.

What Happens to the Talent Pipeline

The second-order effects matter more than the competitions themselves. Corporate red teams, government agencies, and security consultancies have used CTF rankings as a coarse filter for decades. It is not a perfect signal, but it is cheap and globally available. When that signal decays, the filter becomes expensive. Recruiters must find other proxies, probably interviews and captive assessments run in controlled environments, which reintroduces the gatekeeping that open CTFs were supposed to eliminate.

DEF CON and similar conferences also lose. The open CTF is a funnel. It surfaces unknown competitors from outside the usual hiring networks. If the funnel stops distinguishing human skill, the conferences become less relevant as talent marketplaces, and the industry loses one of its few meritocratic on-ramps.

The timing is poor. Ransomware groups, state actors, and supply-chain attackers are not waiting for the talent pipeline to stabilize. The moment defenders most need a calibrated influx of new offensive-security engineers is the moment the cheapest calibration tool has stopped working.

What Trainers and Recruiters Should Do Now

For trainers, the implication is that rote challenge sets need to be retired faster. Any published medium-difficulty pwn or web challenge with a known solution pattern is now training data. Curriculum designers should move toward ephemeral, bespoke challenges and emphasize the skills that still resist automation: operational stealth, adversarial tool development, and the judgment calls in multi-day red-team engagements.

For recruiters, CTFtime rankings should be treated as a noisy signal at best. A top-ten finish in a 2026 open competition no longer distinguishes a skilled reverse engineer from a competent prompt engineer. The better filter may be write-ups: detailed, technical post-mortems that demonstrate understanding rather than just output. If a candidate can explain why a heap layout was shaped a particular way, rather than just that they solved it, they are likely still ahead of the model.

Security training will probably split the way chess did after Deep Blue, into AI-assisted and AI-prohibited divisions, with the latter becoming smaller, more expensive, and more insular. The open scoreboard, the great equalizer of the last two decades, has become a leaderboard for API spend. That is a loss for the field, and no harder cipher will fix it.

The CTF collapse is one node in a broader shift in the offense-defense balance. The same agentic capabilities are already being turned toward bulk automated vulnerability discovery and toward weaponizing coding assistants against their own users, and the shift is also visible in DARPA’s autonomous cyber reasoning postmortem. Notably, Fable 5 ships with cybersecurity classifiers that actively block offensive cyber tasks, a signal that frontier-model vendors are now building hard limits into their most capable systems rather than leaving policy enforcement to operators alone.⁷

Frequently Asked Questions

Which attack categories still resist frontier models at the highest difficulty?

Industrial control system (ICS) targets remain a notable gap. Claude Mythos Preview solved AISI’s 7-step “Cooling Tower” ICS attack in only 3 of 10 attempts, compared with 6 of 10 on the 32-step corporate-network scenario. Physical-layer and operational-technology attacks involve sensor interaction and timing constraints that current agents handle poorly.

How do top models compare head-to-head on multi-step attack scenarios?

On AISI’s 32-step “The Last Ones” corporate-network attack, Claude Mythos Preview succeeded in 6 of 10 autonomous attempts while GPT-5.5 solved it in 3 of 10. The gap between frontier models on multi-step operations is still wide enough that model choice materially affects outcome, unlike single-step CTF challenges where multiple models now converge on the same solutions.

Is the capability doubling rate itself accelerating or steady?

Accelerating. AISI’s estimate of the doubling time for the 80%-reliability cyber time horizon shortened from 8 months to 4.7 months over just three months (November 2025 to February 2026). If the doubling time is itself shrinking, any extrapolation that holds it fixed at today’s 4.7 months will systematically underestimate model capability at every future checkpoint.
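The difference between the two doubling times compounds quickly. The Python sketch below compares growth over a fixed look-ahead under each figure. The twelve-month window is an assumption made for illustration, and each line holds its doubling time constant, which is precisely the simplification the answer above warns against.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months` at a fixed doubling time."""
    return 2 ** (months / doubling_months)


HORIZON_MONTHS = 12  # assumption: a one-year look-ahead, not from the source

# The two doubling times AISI reported, three months apart.
for doubling in (8.0, 4.7):
    factor = growth_factor(HORIZON_MONTHS, doubling)
    print(f"doubling every {doubling} months -> x{factor:.1f} after {HORIZON_MONTHS} months")
```

Over one year, the 8-month figure implies growth of roughly 2.8 times, while the 4.7-month figure implies roughly 5.9 times. If the doubling time keeps shrinking, the true figure would be higher still.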

What does a serious autonomous attack run cost in raw compute terms?

Opus 4.6 consumed roughly 100 million tokens, approximately $80, to complete 22 of 32 steps on AISI’s corporate-network scenario in a single continuous run. Opus 4.8, at $5/$25 per million tokens (input/output), leaves that cost floor unchanged while delivering higher capability per dollar.⁶ Claude Fable 5, Anthropic’s most capable widely released model, is priced at $10/$50 per million tokens, exactly double Opus 4.8.⁷ The token budget for a near-complete autonomous corporate breach is already cheaper than one hour of professional penetration-testing time; the capability ceiling rises with each generation while the cost floor shifts only modestly.
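The pricing arithmetic can be sketched in a few lines of Python. The list prices are the ones quoted above; the input/output token split is hypothetical, because no breakdown is given for any run. The sketch also computes the blended rate implied by the $80 figure, which comes to about $0.80 per million tokens. That is below the list input price quoted for Opus 4.8, so the $80 total cannot be reproduced from list prices alone, and the sketch does not try to explain the difference.

```python
def run_cost_usd(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of a run, given token counts and USD prices per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# List prices quoted above (USD per million tokens, input / output).
OPUS_4_8 = (5.0, 25.0)
FABLE_5 = (10.0, 50.0)

# HYPOTHETICAL workload: the input/output split is an assumption.
input_tokens, output_tokens = 9_000_000, 1_000_000

opus_cost = run_cost_usd(input_tokens, output_tokens, *OPUS_4_8)
fable_cost = run_cost_usd(input_tokens, output_tokens, *FABLE_5)
print(f"Opus 4.8: ${opus_cost:.2f}  Fable 5: ${fable_cost:.2f}  "
      f"ratio: {fable_cost / opus_cost:.1f}")

# Blended rate implied by the reported Opus 4.6 run: $80 over ~100M tokens.
blended_per_million = 80 / 100
print(f"implied blended rate: ${blended_per_million:.2f} per million tokens")
```

Because both Fable 5 prices are exactly twice the Opus 4.8 prices, the cost ratio is 2.0 for any input/output split, not just the one assumed here.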