Kabir Acharya, a competitor with TheHackersCrew and Blitzkrieg, published “The CTF Scene is Dead” on May 1. His claim was narrow and testable: Claude Opus 4.5, driven through CTFd with MCP tool integration, now solves nearly every medium-difficulty CTF challenge autonomously, along with a non-trivial fraction of hard ones. He also reported that GPT-5.5 one-shots Insane-difficulty leakless heap pwn on HackTheBox, a claim that has circulated widely but has not been independently verified as of this writing. The post went viral because the scoreboards had already caught up to the thesis.
What the 2026 scoreboards show
BSidesSF 2026 is the sharpest data point. Sixteen teams solved every challenge, up from one in 2025. The top ten teams were fully automated. A competitor who placed fifth solo in 2025 estimated they would have finished 75th this year without LLM assistance, according to Groundy’s reporting.
A separate onsite CTF study with 41 participants found that the strongest autonomous agent scored 4,900 points, a score that placed it second among the top ten human teams, at an API cost of $96.32, per Groundy’s prior reporting. Sixty-four percent of participants with low CTF expertise but high AI expertise reached intermediate-to-high scores. At under $100, the API bill for beating most human teams at a live event is lower than many event registration fees.
Why medium challenges are the keystone
The reflex response to LLM-solved CTFs is “make harder challenges.” That works for DEF CON Finals, which resist automation through multi-stage, multi-day chains requiring creative synthesis no current agent can sustain. But DEF CON Finals represent the narrow apex of a pyramid whose base is medium-difficulty regional competitions, university CTFs, and hosting and training platforms like CTFd and HackTheBox.
The base of that pyramid is where the talent pipeline lives. It is where students first learn heap layout, first chain ROP gadgets, first realize that a JWT is just base64url-encoded JSON and that a careless server may never check its signature. Those challenges are solvable by design: they exist to teach. If they become trivially automatable, the pipeline stops filtering for skill and starts filtering for who can afford the better API key.
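To make the JWT point concrete, here is a minimal Python sketch using only the standard library. It builds an unsigned token with `alg` set to `none` and shows that a verifier which only decodes the payload accepts a forged claim. The function names (`b64url`, `b64url_decode`, `forge_unsigned_token`, `naive_decode`) and the claim values are hypothetical, chosen for illustration; no real library's API is being shown.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the way JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the stripped padding first."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def forge_unsigned_token(claims: dict) -> str:
    """Build a token with alg=none and an empty signature segment."""
    header = {"alg": "none", "typ": "JWT"}
    return ".".join([
        b64url(json.dumps(header).encode()),
        b64url(json.dumps(claims).encode()),
        "",  # empty signature
    ])


def naive_decode(token: str) -> dict:
    """Deliberately vulnerable: decodes the payload and never checks the signature."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))


# Hypothetical forged claims; nothing stops the attacker from setting admin.
token = forge_unsigned_token({"user": "guest", "admin": True})
claims = naive_decode(token)
print(token)
print(claims)
```

A correct verifier would reject `alg: none` outright and check the signature against a server-side key before trusting any claim. The vulnerable version is the kind of mistake entry-level web challenges are built around.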
The ctf-skills repository catalogs just how commoditized this knowledge has become: 20 web exploitation skill families (SQLi, XSS, SSRF, SSTI, JWT attacks, and so on), multiple pwn families (heap exploitation, ROP chains, kernel exploitation, FSOP), and several AI/ML attack families (prompt injection, LLM jailbreaking, model extraction, data poisoning). Published challenge sets and write-ups are training data the moment they go public. Every solution posted to GitHub makes the next model marginally better at that category.
The organizer’s trilemma
CTF organizers face three options. All of them are bad.
Ban AI assistance. Detection is the immediate problem. Proctoring cannot distinguish between a contestant running a local model and one thinking quietly. Network-level detection (API calls to Claude, GPT) fails against self-hosted models, which the ATOM Report’s ecosystem survey shows are proliferating.
Redesign challenges. This raises authoring cost substantially. Challenges that resist agents require multi-step reasoning chains, novelty that exceeds the training window, or physical-layer interaction. The authoring expertise for this is scarce and expensive.
Accept it. Open scoreboards become leaderboards of agent budgets and prompt engineering. CTFtime rankings lose calibration as a talent signal.
None of these resolve cleanly. The honest path for most regional and training competitions is probably a combination of ephemeral challenges (never published, never reused) and explicit AI-permitted divisions that treat tool use as a first-class skill rather than a cheat.
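The detection gap in the “Ban AI assistance” option can be sketched in a few lines of Python. The example assumes an organizer who flags traffic by destination hostname; the blocklist contents, the `is_flagged` helper, and the sample requests are all hypothetical and stand in for whatever monitoring a real event might deploy.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of hosted-model API hostnames an organizer might watch.
HOSTED_MODEL_HOSTS = {"api.anthropic.com", "api.openai.com"}


def is_flagged(url: str) -> bool:
    """Return True when the request targets a host on the blocklist."""
    host = urlsplit(url).hostname or ""
    return host.lower() in HOSTED_MODEL_HOSTS


# Illustrative requests a contestant's machine might make.
observed_requests = [
    "https://api.anthropic.com/v1/messages",       # hosted model
    "https://api.openai.com/v1/chat/completions",  # hosted model
    "http://127.0.0.1:8080/v1/chat/completions",   # self-hosted model on loopback
]

flags = [is_flagged(url) for url in observed_requests]
for url, flagged in zip(observed_requests, flags):
    print(f"{'FLAG' if flagged else 'pass'}  {url}")
```

The loopback request is listed only for comparison. In practice it would never reach a network monitor at all, which is why hostname-based detection cannot see a locally run model.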
What still resists automation
AISI’s corporate-network scenario, “The Last Ones,” is a 32-step attack chain. AISI’s evaluation found that Claude Mythos Preview became the first model to solve it from start to finish, succeeding in 3 of 10 attempts and averaging 22 of 32 steps. Opus 4.6, the next-best performer, averaged 16 steps.
But the gap between single-step CTF solving and multi-step real-world attack chains remains wide. Physical-layer and OT attacks require sensor interaction and timing constraints that current agents handle poorly.
The AISI doubling time for 80%-reliability cyber capability shortened from 8 months to 4.7 months between November 2025 and February 2026, which means the pace of improvement is itself accelerating. Any static snapshot understates capability by the time you read it.
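As a worked illustration of why a shorter doubling period matters, the sketch below compares growth over a fixed horizon under the two figures quoted above. It assumes clean exponential growth, with capability multiplying by 2^(t / T) after t months at doubling period T. That model is a simplification introduced here, not AISI's methodology, and the 12-month horizon is an arbitrary choice.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months`, assuming pure exponential doubling."""
    return 2 ** (months / doubling_time_months)


HORIZON_MONTHS = 12  # arbitrary horizon chosen for illustration

slow = growth_factor(HORIZON_MONTHS, 8.0)  # earlier figure quoted in the text
fast = growth_factor(HORIZON_MONTHS, 4.7)  # later figure quoted in the text

print(f"8.0-month doubling: x{slow:.2f} over {HORIZON_MONTHS} months")
print(f"4.7-month doubling: x{fast:.2f} over {HORIZON_MONTHS} months")
```

Under these assumptions, one year of progress is roughly a 2.8x gain at the earlier pace and nearly 5.9x at the later one, so a benchmark result that is a few months old can be substantially out of date.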
The defensive side: Palo Alto’s Patch Wednesday
Palo Alto Networks’ May 2026 Patch Wednesday covered 26 CVEs (75 discrete issues) found by frontier AI models scanning more than 130 products, up from a typical month of fewer than 5 CVEs. They tested Claude Mythos, Claude Opus 4.7, and GPT-5.5-Cyber through the Trusted Access for Cyber program and Project Glasswing. That testing predated Opus 4.8, which shipped May 28, 2026, and is four times less likely than Opus 4.7 to allow flaws in code, according to Anthropic.
Two findings deserve attention. First, achieving high-fidelity vulnerability results requires scanning harnesses with context, guardrails, and threat intelligence integrated into the prompt pipeline. Pointing a model at source code alone does not suffice. Second, a multimodel approach is necessary because different models find different, only partly overlapping sets of vulnerabilities due to training variance. No single model catches everything; the coverage comes from the ensemble.
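The ensemble argument is easiest to see with sets. The sketch below uses invented finding identifiers and placeholder model names (`model_a`, `model_b`, `model_c`); none of it is Palo Alto's data or pipeline. It shows how the union of several partly overlapping result sets covers more than any single model's output, and which findings disappear if one model is dropped.

```python
# Hypothetical findings per model; identifiers are invented for illustration.
findings = {
    "model_a": {"VULN-01", "VULN-02", "VULN-03"},
    "model_b": {"VULN-02", "VULN-04"},
    "model_c": {"VULN-03", "VULN-05", "VULN-06"},
}

# Ensemble coverage is the union of every model's findings.
ensemble = set().union(*findings.values())

# Findings reported by exactly one model: these are lost if that model is dropped.
unique_to = {
    name: found - set().union(
        *(other_found for other, other_found in findings.items() if other != name)
    )
    for name, found in findings.items()
}

for name, found in findings.items():
    print(
        f"{name}: {len(found)}/{len(ensemble)} of ensemble coverage, "
        f"{len(unique_to[name])} unique"
    )
```

In this toy data no model covers more than half of the ensemble's findings, and every model contributes at least one result that the others miss.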
Palo Alto has warned of a narrow defensive window before advanced AI capabilities become widely available to adversaries. The direction is consistent with the AISI doubling rate and the BSidesSF scoreboard data.
The talent pipeline downstream
CTFtime rankings have served as a proxy for security skill for well over a decade. Recruiters at defense contractors, consulting firms, and tech companies use them to filter candidates. That proxy is now noisy in a specific direction: it overestimates the raw technical skill of recent high performers who had access to capable agents, and it underestimates candidates who competed without them.
Write-ups are a better signal. A write-up demonstrates that the competitor understood the challenge, can articulate the exploitation chain, and can generalize the technique. A bare agent-generated solve demonstrates none of these. Hiring managers who previously screened by CTFtime percentile should consider screening by write-up quality instead.
What trainers and recruiters should do now
The Deep Blue analogy is instructive. Chess did not end when Kasparov lost; it bifurcated. Human-only competition continued. Engine-assisted analysis became standard training. The rating system adjusted. Security training is about to undergo the same split.
Concrete steps:
- Ephemeral challenges. Regional and training competitions should move to challenges that are never published and never reused. This raises authoring cost but preserves difficulty calibration against training-data absorption.
- Explicit divisions. Run separate AI-permitted and AI-prohibited divisions, with the understanding that enforcement of the latter is imperfect and relies on honor systems and community norms.
- Write-up-based evaluation. For recruitment, weight demonstrated understanding over scoreboard placement. A clear write-up of a medium challenge is more diagnostic than an unexplained full solve.
- Retrain on what agents cannot do. Multi-step attack planning, physical-layer interaction, social engineering, and novel technique development remain human-domain skills. Training curricula should shift weight toward these areas.
Frequently Asked Questions
What does an autonomous agent actually cost to run through a corporate-grade attack chain?
Opus 4.6 consumed roughly 100 million tokens, about $80, to reach 22 of 32 steps on AISI’s corporate-network scenario. That token budget is already cheaper than one hour of professional penetration-testing time, and per-token pricing continues to drop.
How much did the latest model generation improve on autonomous pen-testing benchmarks?
XBOW reported that Claude Opus 4.7 scored 98.5% on their visual-acuity benchmark for autonomous penetration testing, versus 54.5% for Opus 4.6. A single-generation jump from barely passing to near-perfect is what made computer-use agents viable for tasks where the previous model could not be fielded. Opus 4.8, released May 28, 2026, extends that trajectory: it scores 74.6% on Terminal-Bench 2.1 (Opus 4.7 was 66.1%) and is four times less likely than its predecessor to allow flaws in code, which matters directly for the exploit-construction side of autonomous pen-testing.
Why do industrial control system attacks remain harder for agents than corporate network chains?
The AISI Cooling Tower scenario is only 7 steps, compared to 32 for the corporate-network chain, yet Claude Mythos Preview solved it in just 3 of 10 attempts versus 6 of 10 on the longer corporate scenario. The constraint is not step count but sensor interaction and timing requirements that operate on different feedback loops than software-only targets.
Does a single model suffice for vulnerability discovery across a large product portfolio?
Palo Alto found that different models discover different, only partly overlapping sets of vulnerabilities due to training variance. Their scanning pipeline requires a multimodel ensemble to achieve high-fidelity results, and the models need integrated threat intelligence and guardrails in the prompt pipeline. Pointing a model at source code without that context produces low-signal output.
How narrow is the defensive window before AI-driven exploits become standard for adversaries?
Palo Alto estimates a 3 to 5 month window before AI-driven exploits become the norm, based on Project Glasswing testing with Claude Mythos, Opus 4.7, and GPT-5.5-Cyber. Their May Patch Wednesday alone saw output jump from fewer than 5 CVEs per month to 26, with the majority found by frontier models rather than human analysts. That testing was completed before Opus 4.8 shipped, so the current capability floor is higher than the estimate reflects.