Open CTF scoreboards have stopped measuring human skill. In a May 1 post that went viral on Hacker News, Kabir Acharya argued that Claude Opus 4.5 now solves almost every medium-difficulty challenge and some hard ones autonomously, while GPT-5.5 can one-shot Insane-difficulty heap exploits on HackTheBox.[^1] The problem is not that competitions need harder puzzles. It is that the cheapest on-ramp into offensive security has become a contest between API budgets, and the talent pipeline that recruiters and red teams have relied on for two decades is losing its calibration signal.
The May 1 Post That Broke the Scene
Kabir Acharya is not a pundit. He competes with TheHackersCrew and Blitzkrieg, teams that place at major CTFs. When he published “The CTF Scene is Dead” on May 1, the post carried the weight of someone who had watched the floor drop out from under the format in real time. His core claim was stark: with Claude Opus 4.5, almost every medium-difficulty challenge and some hard ones are now “agent-solvable.” GPT-5.5, he wrote, can one-shot Insane-difficulty active leakless heap pwn challenges on HackTheBox. These are not tutorial challenges. Heap exploitation with no leaks is the kind of task that used to separate a competent undergraduate from a hireable junior reverse engineer.
The post went viral on Hacker News for a reason. It named the specific models and the specific difficulty tiers, and it came from a competitor rather than a vendor. Vendor claims that models are “CTF-tested” have been circulating for months, but those benchmarks tend to saturate quickly; Acharya’s argument was about the live, open scoreboards where humans still expected to compete against humans.
What the 2026 Scoreboard Actually Looks Like
The numbers from BSidesSF 2026 are hard to spin.[^2] Sixteen teams fully solved every challenge, up from one team in 2025. The top ten were fully automated. A former fifth-place solo competitor estimated they would have placed seventy-fifth without LLM assistance. This is not incremental improvement. It is a discontinuity.
A live onsite CTF study with forty-one participants put harder figures to the trend.[^3] The strongest autonomous agent scored 4,900 points, placing second among the top-ten human teams, at a cost of $96.32 in API usage.[^3] Sixty-four percent of participants with low CTF expertise but high AI expertise achieved intermediate-to-high scores. The correlation between security skill and scoreboard position is weakening in real time.
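The cost efficiency implied by those figures is easy to work out. A minimal sketch, using only the two numbers reported above; the function name `cost_per_point` is hypothetical:

```python
def cost_per_point(api_cost_usd: float, points: int) -> float:
    """Dollars of API spend per scoreboard point."""
    return api_cost_usd / points

# Figures reported for the strongest autonomous agent in the study.
AGENT_COST_USD = 96.32
AGENT_POINTS = 4900

usd_per_point = cost_per_point(AGENT_COST_USD, AGENT_POINTS)
points_per_usd = AGENT_POINTS / AGENT_COST_USD

print(f"{usd_per_point * 100:.2f} cents per point")
print(f"{points_per_usd:.1f} points per dollar")
```

That works out to roughly two cents of API spend per scoreboard point.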
Why Medium Challenges Matter More Than DEF CON Finals
It is tempting to treat this as a solved-game problem: make the challenges harder. But the CTF ecosystem is not a single ladder. It is a pyramid. DEF CON finals and elite private competitions sit at the top, but they are irrelevant to the vast majority of people who will ever touch a debugger. Working offensive-security engineers typically cut their teeth on medium-difficulty web, pwn, and reverse-engineering challenges in open competitions. That is where you learn buffer overflows, race conditions, and cryptographic oracle attacks for the first time. It is also where recruiters look for signal.
If an autonomous agent can solve those medium challenges for under a hundred dollars, the incentive structure inverts. The beginner no longer needs to internalize the mechanics of a heap layout to place on a scoreboard. They need prompt engineering and an API key. The skill being tested has changed, but the credential being issued, a scoreboard rank, has not. Employers who relied on CTF performance as a proxy for security intuition are now reading API-budget rankings.
The Organizer’s Trilemma: Ban, Redesign, or Surrender
Organizers face three options, and all of them extract a cost.
Option one is an AI ban. The problem is detection. In a remote competition, you cannot distinguish a competitor who is silently querying GPT-5.5 from one who is just fast. Even live onsite events struggle; the study cited above ran onsite and still saw AI-assisted participants blend into the field.[^3] Bans are unenforceable theater.
Option two is redesign. Challenges could shift toward the tasks frontier models still fail: long-horizon multi-step attacks requiring human stealth, bespoke hardware, or creative cryptographic constructions. UK AISI’s eighteen-month evaluation found that model performance on multi-step cyber attacks scales log-linearly with inference-time compute, with no observed plateau.[^4] But Opus 4.6 still only completed twenty-two of thirty-two steps on a corporate-network attack that takes a human expert roughly fourteen hours.[^4] The gap exists. The trouble is that writing challenges at this tier requires authors who can themselves outthink frontier models, and there are not many of them. Authoring cost rises, the pool of qualified setters shrinks, and the barrier to entry for new organizers climbs.
Option three is surrender: accept that open scoreboards now measure human-plus-AI teams, or pure AI agents, and treat the old format as obsolete. This is honest, but it abandons the twenty-year calibration signal that the industry has used to identify raw talent.
What Happens to the Talent Pipeline
The second-order effects matter more than the competitions themselves. Corporate red teams, government agencies, and security consultancies have used CTF rankings as a coarse filter for decades. It is not a perfect signal, but it is cheap and globally available. When that signal decays, the filter becomes expensive. Recruiters must find other proxies, probably interviews and captive assessments run in controlled environments, which reintroduces the gatekeeping that open CTFs were supposed to eliminate.
DEF CON and similar conferences also lose. The open CTF is a funnel. It surfaces unknown competitors from outside the usual hiring networks. If the funnel stops distinguishing human skill, the conferences become less relevant as talent marketplaces, and the industry loses one of its few meritocratic on-ramps.
The timing is poor. Ransomware groups, state actors, and supply-chain attackers are not waiting for the talent pipeline to stabilize. The moment defenders most need a calibrated influx of new offensive-security engineers is the moment the cheapest calibration tool has stopped working.
What Trainers and Recruiters Should Do Now
For trainers, the implication is that rote challenge sets need to be retired faster. Any published medium-difficulty pwn or web challenge with a known solution pattern is now training data. Curriculum designers should move toward ephemeral, bespoke challenges and emphasize the skills that still resist automation: operational stealth, adversarial tool development, and the judgment calls in multi-day red-team engagements.
For recruiters, CTFtime rankings should be treated as a noisy signal at best. A top-ten finish in a 2026 open competition no longer distinguishes a skilled reverse engineer from a competent prompt engineer. The better filter may be write-ups: detailed, technical post-mortems that demonstrate understanding rather than just output. If a candidate can explain why a heap layout was shaped a particular way, not merely report that they solved the challenge, they are likely still ahead of the model.
Security training will probably split the way chess did after Deep Blue: a divide between AI-assisted and AI-prohibited divisions, with the latter becoming smaller, more expensive, and more insular. The open scoreboard, the great equalizer of the last two decades, has become a leaderboard for API spend. That is a loss for the field, and no harder cipher will fix it.
Frequently Asked Questions
Which attack categories still resist frontier models at the highest difficulty?
Industrial control system (ICS) targets remain a meaningful gap. Claude Mythos Preview solved AISI’s 7-step “Cooling Tower” ICS attack in only 3 of 10 attempts, compared to 6 of 10 on the 32-step corporate-network scenario. Physical-layer and operational-technology attacks require sensor interaction and timing constraints that current agents handle poorly.
How do top models compare head-to-head on multi-step attack scenarios?
On AISI’s 32-step “The Last Ones” corporate-network attack, Claude Mythos Preview succeeded in 6 of 10 autonomous attempts while GPT-5.5 solved it in 3 of 10. The gap between frontier models on multi-step operations is still wide enough that model choice materially affects the outcome, unlike single-step CTF challenges, where multiple models now converge on the same solutions.
Is the capability doubling rate itself accelerating or steady?
Accelerating. AISI’s estimate of the doubling time for the 80%-reliability cyber time horizon compressed from 8 months to 4.7 months in just three months (November 2025 to February 2026). If the doubling time is itself shrinking, extrapolations that hold today’s 4.7-month figure fixed will systematically underestimate model capability at every future checkpoint.
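To see why holding the doubling time fixed understates growth, the sketch below compares two extrapolations from the figures in that answer. The function names, the 12-month window, and above all the assumption that the doubling time keeps shrinking geometrically at the November-to-February pace are illustrative choices, not AISI projections.

```python
import math

def growth_fixed(months, doubling_months):
    """Multiplier on the time horizon if the doubling time stays constant."""
    return 2 ** (months / doubling_months)

def growth_shrinking(months, doubling_months, shrink_per_month):
    """Multiplier if the doubling time decays geometrically.

    Doubling time at month t is doubling_months * shrink_per_month ** t, so
    the doublings accumulated are the integral of 1 / d(t) from 0 to `months`.
    """
    if shrink_per_month == 1.0:
        return growth_fixed(months, doubling_months)
    k = -math.log(shrink_per_month)  # decay constant, positive when shrinking
    doublings = (math.exp(k * months) - 1) / (k * doubling_months)
    return 2 ** doublings

# From the text: the doubling time went from 8 to 4.7 months over three months.
# ASSUMPTION: that pace continues unchanged (hypothetical, illustration only).
SHRINK_PER_MONTH = (4.7 / 8) ** (1 / 3)

fixed = growth_fixed(12, 4.7)
shrinking = growth_shrinking(12, 4.7, SHRINK_PER_MONTH)
print(f"fixed doubling time:     x{fixed:.1f} in 12 months")
print(f"shrinking doubling time: x{shrinking:.1f} in 12 months")
```

The gap between the two multipliers is the underestimate the answer describes. Its size depends entirely on the assumed shrink rate, so the output is a direction of error rather than a forecast.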
What does a serious autonomous attack run cost in raw compute terms?
Opus 4.6 consumed roughly 100 million tokens (approximately $80) to complete 22 of 32 steps on AISI’s corporate-network scenario in a single continuous run. The token budget for a near-complete autonomous corporate breach is already cheaper than one hour of professional penetration-testing time and will drop with every model pricing revision.
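The unit costs implied by that run can be derived directly. A minimal sketch using only the rounded figures quoted above; the variable names are hypothetical, and the per-token rate is a blended average implied by those two numbers, not a published price:

```python
# Rounded figures from the answer above.
TOKENS_USED = 100_000_000   # "roughly 100 million tokens"
RUN_COST_USD = 80.0         # "approximately $80"
STEPS_COMPLETED = 22
STEPS_TOTAL = 32

# Blended cost per million tokens implied by the two figures.
usd_per_million_tokens = RUN_COST_USD / (TOKENS_USED / 1_000_000)

# Average spend for each attack step the agent actually completed.
usd_per_completed_step = RUN_COST_USD / STEPS_COMPLETED

# Share of the scenario finished in the single run.
completion_fraction = STEPS_COMPLETED / STEPS_TOTAL

print(f"${usd_per_million_tokens:.2f} per million tokens (blended)")
print(f"${usd_per_completed_step:.2f} per completed step")
print(f"{completion_fraction:.1%} of steps completed")
```

Because both inputs are approximations, the derived rates are order-of-magnitude estimates only.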