Frontier AI Has Broken Open CTFs: Why Claude Code Now One-Shots Medium Pwn Challenges

Kabir Acharya, a competitor with TheHackersCrew and Blitzkrieg, published “The CTF Scene is Dead” on May 1. His claim was narrow and testable: Claude Opus 4.5, driven through CTFd with MCP tool integration, now agent-solves nearly every medium-difficulty CTF challenge and a non-trivial fraction of hard ones. He also reported that GPT-5.5 one-shots Insane-difficulty leakless heap pwn on Hack The Box, a claim that has circulated widely but has not been independently verified as of this writing. The post went viral because the scoreboards had already caught up to the thesis.

What the 2026 scoreboards show

BSidesSF 2026 is the sharpest data point. Sixteen teams fully solved every challenge, up from one in 2025. The top ten were fully automated. A competitor who placed fifth solo in 2025 estimated they would have finished 75th this year without LLM assistance, according to Groundy’s reporting.

The winning entry has a name now. ctf-agent, built over a weekend by Veria Labs (founded by members of the .;,;. team, the top-ranked US squad on CTFtime in 2024 and 2025), took first by solving all 52 of 52 challenges [Updated June 2026]. The design is a coordinator LLM that fans work out to solver swarms, each in an isolated Docker container preloaded with pwntools, radare2, and GDB, racing Claude Opus 4.6 (medium and max reasoning), GPT-5.4, GPT-5.4-mini, and GPT-5.3-codex against the same target until one returns a flag. Include Security’s field report puts a tool name on the capability floor: apart from a few OSINT challenges, Claude Code and Codex solved everything, including the binary-exploitation pwn that would have gone unsolved a year earlier. That report is the original source for the fifth-to-seventy-fifth inversion above, and it names the setup: Claude Code on the $100-per-month Max plan, running Opus 4.6 at max effort. No challenge in 2026 drew fewer than 25 solves.

The architecture is the lesson, not the scoreline. ctf-agent does nothing a skilled competitor could not script; it just scripts it well and runs it wide. Parallel model racing hides the variance of any single model behind whichever one happens to crack a given challenge first. Container isolation lets dozens of attempts run without colliding. Pre-loaded tooling removes the setup friction that used to cost humans the first thirty minutes of every box. The scarce resource is no longer who knows how to chain ROP gadgets. It is who has wired the harness, paid for the tokens, and tuned the coordinator’s hand-off logic. That is an engineering moat, and engineering moats fall fast.
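The racing pattern is small enough to sketch. What follows is a minimal illustration under stated assumptions, not ctf-agent’s actual code: `run_solver`, `race`, and the `plans` dictionary are hypothetical names, and each solver is a stub that sleeps instead of driving a model inside a container.

```python
import asyncio
from typing import Optional

# Hypothetical plan format: model label -> (simulated seconds of work, finds the flag?).
Plans = dict[str, tuple[float, bool]]


async def run_solver(model: str, challenge: str, delay: float,
                     succeeds: bool) -> tuple[str, Optional[str]]:
    """Stand-in for one isolated solver; a real one would drive a model and tools."""
    await asyncio.sleep(delay)  # simulated work
    flag = f"flag{{{challenge}}}" if succeeds else None
    return model, flag


async def race(challenge: str, plans: Plans) -> Optional[tuple[str, str]]:
    """Start every solver at once and return the first (model, flag) that succeeds."""
    pending = {
        asyncio.create_task(run_solver(model, challenge, delay, ok))
        for model, (delay, ok) in plans.items()
    }
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                model, flag = task.result()
                if flag is not None:
                    return model, flag
        return None  # every solver gave up
    finally:
        # Stop paying for attempts that are still running once a flag is in hand.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)


if __name__ == "__main__":
    demo = {"model-a": (0.01, False), "model-b": (0.05, True), "model-c": (0.50, True)}
    print(asyncio.run(race("demo", demo)))
```

In the demo, `model-a` finishes first but fails, `model-b` returns the flag, and `model-c` is cancelled before it completes. That is how racing hides single-model variance: the caller only ever sees the first success.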

A separate onsite CTF study with 41 participants found that the strongest autonomous agent scored 4,900 points, placing second among the top-ten human teams, at an API cost of $96.32, per Groundy’s prior reporting. Sixty-four percent of participants with low CTF expertise but high AI expertise reached intermediate-to-high scores. At $96.32, outscoring most human teams at a live event costs less than the registration fee.

Why medium challenges are the keystone

The reflex response to LLM-solved CTFs is “make harder challenges.” That works for DEF CON Finals, which resist automation through multi-stage, multi-day chains requiring creative synthesis no current agent can sustain. But DEF CON Finals are the narrow apex of a pyramid whose base is medium-difficulty regional competitions, university CTFs, and platforms such as CTFd and Hack The Box.

The base of that pyramid is where the talent pipeline lives. It is where students first learn heap layout, first chain ROP gadgets, first realize that a JWT is just base64url-encoded JSON and that some servers never check its signature. Those challenges are solvable by design: they exist to teach. If they become trivially automatable, the pipeline stops filtering for skill and starts filtering for who can afford the better API key.
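The JWT lesson is a good example of how small these teaching challenges are. The sketch below uses only the Python standard library; `read_claims` and `forge_unsigned` are hypothetical helper names, and the forged token only works against a deliberately vulnerable training target that accepts `alg: none`.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, the form JWT segments use."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def read_claims(token: str) -> tuple[dict, dict]:
    """Return (header, payload) without verifying the signature at all."""
    header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(header_b64)), json.loads(b64url_decode(payload_b64))


def forge_unsigned(payload: dict) -> str:
    """Build an alg=none token: two encoded segments and an empty signature."""
    header = {"alg": "none", "typ": "JWT"}
    parts = [
        b64url_encode(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    ]
    return ".".join(parts) + "."


if __name__ == "__main__":
    token = forge_unsigned({"user": "admin"})
    print(token)
    print(read_claims(token))
```

Anyone can read or rewrite the claims; only a server-side signature check makes them trustworthy. A medium web challenge built on this idea is exactly the kind of exercise an agent with a standard toolkit clears in one pass.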

The ctf-skills repository catalogs how commoditized this knowledge has become: 20 web exploitation skill families (SQLi, XSS, SSRF, SSTI, JWT attacks, and so on), multiple pwn families (heap exploitation, ROP chains, kernel exploitation, FSOP), and 3 AI/ML attack families spanning prompt injection, LLM jailbreaking, model extraction, and data poisoning. Published challenge sets and write-ups are training data the moment they go public. Every solution posted to GitHub makes the next model marginally better at that category.

The organizer’s trilemma

CTF organizers face three options. All of them are bad.

Ban AI assistance. Detection is the immediate problem. Proctoring cannot distinguish between a contestant running a local model and one thinking quietly. Network-level detection (API calls to Claude, GPT) fails against self-hosted models, which the ATOM Report’s ecosystem survey shows are proliferating.

Redesign challenges. This raises authoring cost substantially. Challenges that resist agents require multi-step reasoning chains, novelty that exceeds the training window, or physical-layer interaction. The authoring expertise for this is scarce and expensive.

Accept it. Open scoreboards become leaderboards of agent budgets and prompt engineering. CTFtime rankings lose calibration as a talent signal.

None of these resolve cleanly. The honest path for most regional and training competitions is probably a combination of ephemeral challenges (never published, never reused) and explicit AI-permitted divisions that treat tool use as a first-class skill rather than a cheat.
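The detection problem behind the ban option can be made concrete. The sketch below is an assumption-laden illustration, not a real enforcement tool: `flags_hosted_llm` is a hypothetical helper, the host list covers only the two hosted providers named above, and the local endpoint URL is made up for the example.

```python
from urllib.parse import urlsplit

# Public API hostnames for the two hosted providers named above. A real
# blocklist would need far more entries and constant maintenance.
HOSTED_LLM_HOSTS = {"api.anthropic.com", "api.openai.com"}


def flags_hosted_llm(url: str) -> bool:
    """Return True when an outbound request targets a known hosted LLM API."""
    host = urlsplit(url).hostname or ""
    return host in HOSTED_LLM_HOSTS


if __name__ == "__main__":
    for url in (
        "https://api.anthropic.com/v1/messages",
        "http://127.0.0.1:8080/v1/chat/completions",  # hypothetical self-hosted model
    ):
        print(url, flags_hosted_llm(url))
```

The hosted call is flagged and the loopback call is not, because a self-hosted model never produces traffic that an egress filter can see.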

What still resists automation

AISI’s corporate-network scenario, “The Last Ones,” is a 32-step attack chain. AISI’s evaluation found that Claude Mythos Preview became the first model to solve it from start to finish, succeeding in 3 of 10 attempts and averaging 22 of 32 steps. Opus 4.6, the next-best performer, averaged 16 steps. A later Mythos Preview checkpoint pushed that to 6 of 10 on “The Last Ones” and became the first model to also clear the previously unsolved “Cooling Tower” OT range, in 3 of 10, making it the only system to complete both AISI cyber ranges [Updated June 2026].

But the gap between single-step CTF solving and multi-step real-world attack chains remains wide. DARPA’s own AIxCC systems found most synthetic bugs yet proved unusable outside their competition sandboxes, which is the same brittleness in a different setting. Physical-layer and OT attacks require sensor interaction and timing constraints that current agents handle poorly. AISI cautioned, however, that Mythos’s earlier Cooling Tower failures came from getting stuck in the IT phase, not from a genuine OT-layer wall, so the ICS resistance may be thinner than the raw pass rate suggests.

AISI’s doubling time for 80%-reliability cyber capability shrank from 8 months to 4.7 months between November 2025 and February 2026, which means the pace of improvement is itself accelerating. Any static snapshot understates capability by the time you read it.
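A short calculation shows what that compression means over a year. It assumes, purely for illustration, that each doubling time holds constant for 12 months; `growth_factor` is a hypothetical helper, and the result is a multiplier on whatever quantity AISI measures, not a forecast.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplier after `months` of growth at a fixed doubling time."""
    return 2 ** (months / doubling_time_months)


if __name__ == "__main__":
    for doubling_time in (8.0, 4.7):
        factor = growth_factor(12, doubling_time)
        print(f"doubling every {doubling_time} months -> x{factor:.2f} in 12 months")
```

Under that assumption, an 8-month doubling time gives about 2.8x in a year and a 4.7-month doubling time gives about 5.9x, so a snapshot that is a few months old is already well behind.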

The defensive side: Palo Alto’s Patch Wednesday

Palo Alto Networks’ May 2026 Patch Wednesday covered 26 CVEs (75 discrete issues) found by frontier AI models scanning over 130 products, up from a typical month of fewer than 5 CVEs. They tested Claude Mythos, Claude Opus 4.7, and GPT-5.5-Cyber through the Trusted Access for Cyber program and Project Glasswing. That testing predated Opus 4.8, which shipped May 28, 2026, and is four times less likely than Opus 4.7 to allow flaws in code, according to Anthropic.

Two findings deserve attention. First, achieving high-fidelity vulnerability results requires scanning harnesses with context, guardrails, and threat intelligence integrated into the prompt pipeline. Pointing a model at source code alone does not suffice. Second, a multimodel approach is necessary because different models, owing to training variance, find different and only partly overlapping sets of vulnerabilities. No single model catches everything; the coverage comes from the ensemble.
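The ensemble argument reduces to set arithmetic. The sketch below uses invented findings: the model labels and `VULN-n` identifiers are hypothetical and do not come from Palo Alto’s data.

```python
# Hypothetical findings per model; labels and IDs are illustrative only.
FINDINGS = {
    "model_a": {"VULN-1", "VULN-2", "VULN-3"},
    "model_b": {"VULN-2", "VULN-4"},
    "model_c": {"VULN-3", "VULN-5", "VULN-6"},
}


def coverage(findings: dict[str, set[str]]) -> dict[str, float]:
    """Share of the ensemble's total findings that each model found alone."""
    everything = set().union(*findings.values())
    return {name: len(found) / len(everything) for name, found in findings.items()}


def unique_to(findings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Findings that would be lost if a given model were dropped."""
    result = {}
    for name, found in findings.items():
        others = set().union(*(f for n, f in findings.items() if n != name))
        result[name] = found - others
    return result


if __name__ == "__main__":
    print(coverage(FINDINGS))
    print(unique_to(FINDINGS))
```

With these made-up numbers no model covers more than half of the union, and every model contributes at least one finding the others miss. That is the shape of result that makes dropping any single model costly.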

Palo Alto has warned of a narrow defensive window before advanced AI capabilities become widely available to adversaries. The direction is consistent with the AISI doubling time and the BSidesSF scoreboard data.

The talent pipeline downstream

CTFtime rankings have served as a proxy for security skill for more than a decade. Recruiters at defense contractors, consulting firms, and tech companies use them to filter candidates. That proxy is now noisy in a specific direction: it overestimates the raw technical skill of recent high performers who had access to capable agents, and it underestimates candidates who competed without them. The same erosion shows up on the platforms recruiters trust most: frontier models now place in the top 5% of public competitions, which is why Hack The Box and BearcatCTF 2026 results have pushed organizers toward bans, hybrid scoring, or AI-only divisions.

Write-ups are a better signal. A write-up demonstrates that the competitor understood the challenge, can articulate the exploitation chain, and can generalize the technique. An agent-generated solve produces none of these. Hiring managers who previously screened by CTFtime percentile should consider screening by write-up quality instead.

What trainers and recruiters should do now

The Deep Blue analogy is instructive. Chess did not end when Kasparov lost; it bifurcated. Human-only competition continued. Engine-assisted analysis became standard training. The rating system adjusted. Security training is about to undergo the same split.

Concrete steps:

Ephemeral challenges. Regional and training competitions should move to challenges that are never published and never reused. This raises authoring cost but preserves difficulty calibration against training-data absorption.
Explicit divisions. Run separate AI-permitted and AI-prohibited divisions, with the understanding that enforcement of the latter is imperfect and relies on honor systems and community norms.
Write-up-based evaluation. For recruitment, weight demonstrated understanding over scoreboard placement. A clear write-up of a medium challenge is more diagnostic than an unexplained full solve.
Retrain on what agents cannot do. Multi-step attack planning, physical-layer interaction, social engineering, and novel technique development remain human-domain skills. Training curricula should shift weight toward these areas.

Frequently Asked Questions

Does Claude Code actually one-shot medium pwn challenges, or is that overstated?

Both, with a caveat. At BSidesSF 2026 the winning harness, ctf-agent, raced Claude Opus 4.6 and several GPT-5.4-family models in parallel and solved 52 of 52 challenges, and Include Security reports that Claude Code and Codex cleared every category except a few OSINT tasks, binary-exploitation pwn included. The “one-shot” framing is shorthand for an agent loop with pwntools, radare2, and GDB already on hand, not a single prompt that returns a flag. The honest scope is medium and many hard challenges; DEF CON Finals-grade multi-day chains and physical-layer work still resist. The Insane-difficulty leakless-heap solve attributed to GPT-5.5 on Hack The Box has not been independently verified.

What does an autonomous agent actually cost to run through a corporate-grade attack chain?

Opus 4.6 consumed roughly 100 million tokens, about $80, to reach 22 of 32 steps on AISI’s corporate-network scenario. That token budget is already cheaper than one hour of professional penetration-testing time, and per-token pricing continues to drop.

How much did the latest model generation improve on autonomous pen-testing benchmarks?

XBOW reported that Claude Opus 4.7 scored 98.5% on their visual-acuity benchmark for autonomous penetration testing, versus 54.5% for Opus 4.6. A single-generation jump from barely passing to near-perfect is what made computer-use agents viable for tasks where the previous model could not be fielded. Opus 4.8, released May 28, 2026, extends that trajectory: it scores 74.6% on Terminal-Bench 2.1 (Opus 4.7 was 66.1%) and is four times less likely than its predecessor to allow flaws in code, which matters directly for the exploit-construction side of autonomous pen-testing. GPT-5.5 currently posts the top public Terminal-Bench 2.1 figure of the two, though the labs measure the benchmark differently enough that the gap should not be read as a clean ranking [Updated June 2026].

Why do industrial control system attacks remain harder for agents than corporate network chains?

The AISI Cooling Tower scenario is only 7 steps, compared to 32 for the corporate-network chain, yet Claude Mythos Preview solved it in just 3 of 10 attempts versus 6 of 10 on the longer corporate scenario. Step count is therefore not the constraint. The usual explanation is sensor interaction and timing requirements that operate on different feedback loops than software-only targets, though AISI attributed Mythos’s earlier Cooling Tower failures to the IT phase rather than to the OT layer.

Does a single model suffice for vulnerability discovery across a large product portfolio?

Palo Alto found that different models discover different, only partly overlapping sets of vulnerabilities due to training variance. Their scanning pipeline requires a multimodel ensemble to achieve high-fidelity results, and the models need integrated threat intelligence and guardrails in the prompt pipeline. Pointing a model at source code without that context produces low-signal output.

How narrow is the defensive window before AI-driven exploits become standard for adversaries?

Palo Alto estimates a 3 to 5 month window before AI-driven exploits become the norm, based on Project Glasswing testing with Claude Mythos, Opus 4.7, and GPT-5.5-Cyber. Their May Patch Wednesday alone saw output jump from fewer than 5 CVEs per month to 26, with the majority found by frontier models rather than human analysts. That testing was completed before Opus 4.8 shipped, so the current capability floor is higher than the estimate reflects.