From 213 Vulnerabilities to Project Glasswing — What Three Years Did to AI-Powered Security


I read Chris Koch's article last week. The one from February 2023, where he pointed GPT-3 at a repository of 129 deliberately vulnerable code files and watched it find 213 security vulnerabilities. Snyk Code, a commercial tool built specifically for this job, found 99 in the same codebase. GPT-3, a general-purpose language model that couldn't even browse the internet, outperformed it by a factor of two.
When that article dropped, it felt like a glimpse of something. A proof of concept. Impressive, sure, but clearly limited. GPT-3's context window was 4,000 tokens. It couldn't hold more than a few hundred lines of code in memory at once. Koch had to scan each file in isolation, which meant the model couldn't trace vulnerabilities that emerged from interactions between modules. It caught format string attacks and insecure deserialisation. It missed the architectural sins.
I'm reading that same article in April 2026. And the distance between what GPT-3 could do then and what AI can do now doesn't feel like three years of progress. It feels like a geological epoch compressed into a long weekend.
The 2023 baseline
Koch used text-davinci-003, the best GPT-3 variant available at the time. The model's context window topped out at 4,000 tokens, roughly 3,000 English words per request. To scan a repository, he had to feed it one file at a time and ask: what security vulnerabilities do you see?
The results were good. Surprisingly good. GPT-3 caught a C format string vulnerability in five lines of code. It identified log injection in a C# ASP.NET controller. It spotted insecure deserialisation in Java code that Koch himself had initially read and found nothing wrong with. Out of 60 manually reviewed findings, only 4 were false positives. That's a 93% precision rate from a model that was never trained for security analysis.
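The log injection case is easy to picture. Here's a minimal sketch in Python (Koch's actual finding was in a C# ASP.NET controller; the function names and payload below are my invention for illustration): when raw user input reaches a log line, an attacker can embed newlines and forge entries that look legitimate.

```python
# Illustrative sketch of log injection, not Koch's actual C# example.
# An attacker-controlled username containing newlines can forge log entries.

def log_login_vulnerable(username: str) -> str:
    # BAD: raw user input goes straight into the log line.
    return f"INFO login attempt user={username}"

def log_login_safe(username: str) -> str:
    # Neutralise CR/LF so one request can only ever produce one log line.
    sanitized = username.replace("\r", "").replace("\n", "")
    return f"INFO login attempt user={sanitized}"

payload = "alice\nINFO login attempt user=admin success=true"

# The vulnerable version emits two lines; the second looks like a real
# entry claiming admin logged in successfully. The safe version keeps
# everything on one line.
print(log_login_vulnerable(payload).count("\n"))  # 1 embedded newline
print(log_login_safe(payload).count("\n"))        # 0
```

The fix is trivial once you see it, which is exactly the point: the bug class is about where data flows, not about any single exotic-looking line.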
But the limitations were just as telling. GPT-3 couldn't reason across files. It couldn't follow a function call from one module into another and understand that the lack of input validation in module A becomes a remote code execution path when module B passes that input to a shell command. It couldn't analyse compiled binaries. It couldn't test whether a vulnerability was actually exploitable. It found bugs in code snippets, and that's a very different thing from finding bugs in systems.
There's also a subtlety that a commenter on Koch's article raised, one worth taking seriously: many of those vulnerable code snippets had been published online as examples of vulnerable code. The snippets came from snoopysecurity's Vulnerable-Code-Snippets repository. GPT-3 may have partially memorised them. It might have been recognising patterns from its training data rather than performing real security reasoning. Koch was transparent about this. We can't know for certain.
The experiment was a compelling demo. It was not a threat model.
What happened next
Fast forward to today. Anthropic has announced Project Glasswing, and at its centre is a model called Claude Mythos Preview. Reading the announcement doesn't feel like reading about another incremental improvement. This isn't a researcher running a clever side project on the weekend. This is AWS, Apple, Google, Microsoft, CrowdStrike, NVIDIA, Cisco, and the Linux Foundation assembling around a single realisation: the models have gotten good enough at finding and exploiting vulnerabilities that the industry needs to move together. And it needs to move now.
What Mythos Preview has done in its first few weeks of deployment:
It found a 27-year-old vulnerability in OpenBSD, one of the most security-hardened operating systems ever built. The bug allowed a remote attacker to crash the machine just by connecting to it. Twenty-seven years of manual auditing, community review, and a culture that treats security as a religion. The model found what none of that caught.
It discovered a 16-year-old vulnerability in FFmpeg, a library so common that if you've ever watched a video on a computer, you've used it. The vulnerable line of code had been hit by automated testing tools five million times without triggering a detection. Five million.
And then the one that keeps me up: it autonomously found and chained together multiple Linux kernel vulnerabilities to escalate from ordinary user privileges to complete root control. It didn't find a bug. It found several bugs, worked out how they related to each other, and constructed a working exploit chain. No human guided it.
The numbers
The benchmark gap between Claude Opus 4.6 and Claude Mythos Preview is hard to wave away:
| Benchmark | Opus 4.6 | Mythos Preview |
|---|---|---|
| CyberGym (vulnerability reproduction) | 66.6% | 83.1% |
| SWE-bench Verified | 80.8% | 93.9% |
| SWE-bench Pro | 53.4% | 77.8% |
| Terminal-Bench 2.0 | 65.4% | 82.0% |
SWE-bench Pro, a benchmark for solving real-world software engineering problems, jumped from 53% to 78%. That's not a model getting slightly better at pattern matching. Something qualitative changed in the model's ability to reason about code as a system, not just as syntax.
> AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.
Anthropic isn't making Mythos Preview generally available. They've looked at what it can do and decided it's too capable to release broadly without new safeguards. They're committing $100 million in usage credits to Glasswing partners so that defenders get access first. That decision tells you more than any benchmark number can.
Thirty-eight months
I keep coming back to the timeline because the timeline is the point.
In February 2023, the state of the art was a model that scanned individual code files with a 4,000-token window and caught known vulnerability patterns with decent accuracy. Think of it as a junior security analyst doing a first pass. Useful, but you'd never trust it on its own.
In April 2026, the state of the art autonomously discovers zero-day vulnerabilities in hardened production systems, chains them into working exploits, and finds bugs that survived decades of expert review and millions of automated tests. Anthropic describes its capability as competitive with "all but the most skilled humans" in offensive cybersecurity.
That progression took thirty-eight months. Not a decade. Not a generation. Thirty-eight months.
The same trajectory is playing out in drug discovery, in mathematical proof, in code generation, in scientific research. The capabilities are compounding faster than most people's mental models can absorb. I include my own.
The asymmetry that worries me
I build AI systems for a living. What keeps nagging at me is the asymmetry between offence and defence.
Finding a vulnerability is a search problem. You're looking for the one flaw in millions of lines of code. AI is very good at search. Fixing a vulnerability requires understanding the system's architecture, its dependencies, its deployment context, the downstream effects of a patch. That's a design problem, and it's harder by nature.
CrowdStrike's CTO put it plainly in the Glasswing announcement: "The window between a vulnerability being discovered and being exploited by an adversary has collapsed. What once took months now happens in minutes with AI."
Koch's 2023 experiment was symmetrical in a way that feels almost quaint now. GPT-3 found bugs. A developer could fix them. The model was a tool in the developer's hands. In 2026, the model doesn't just find the bug; it writes the exploit. The defender still needs to understand the system well enough to patch it correctly. The attacker just needs to point the model at a target.
This is why Glasswing exists. Not because it's a nice initiative, but because a world where these capabilities spread without coordinated defence is a world with a lot more Colonial Pipeline incidents. A lot more hospital ransomware. A lot more infrastructure failures that we currently treat as unlikely.
Where I think this goes
I've been sitting with this for a while, and here's where I've landed.
We are probably three years from models that make Mythos Preview look the way GPT-3 looks to us now. That's not optimism or pessimism. It's the observed rate of capability gain, extrapolated forward with some caution. The jumps have been large enough and consistent enough that assuming a plateau feels more like hope than analysis.
Within five years, I believe AI systems will audit entire software ecosystems end-to-end. Not just individual codebases, but the interactions between systems, the supply chain dependencies, the runtime behaviours under adversarial conditions. The Linux Foundation's involvement in Glasswing points directly at this: open-source software is the majority of code running in production worldwide, and most of it has never had a serious security audit. That sentence should bother you.
But capability alone doesn't decide outcomes. What decides outcomes is who has access, how it's governed, and whether the institutions deploying these systems can keep pace with the models themselves. That last part is the one I'm least confident about.
Koch's 2023 article ends with a line that reads differently now: "GPT-4 doesn't currently have a release date, but I'm sure these large language models will continue to march forward as they gain more and more capabilities." He was right. He just had no way of knowing how right.
The part that stays with me
The Java deserialisation example from Koch's article is the one I can't stop thinking about. He showed GPT-3 a perfectly ordinary-looking SerializeToFile utility class. The kind of code you'd scroll past in a code review without a second glance. GPT-3 flagged it immediately: insecure deserialisation, potential remote code execution.
Koch wrote: "I didn't see anything wrong with this code when I first read it. To me, it looked completely innocuous and familiar."
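For readers who want to feel the shape of that problem, here's a Python analogue of the same pattern (Koch's example was a Java class using `ObjectInputStream`; the `save_state`/`load_state` names here are my invention). The utility looks exactly as innocuous as the class Koch describes, and the danger is just as invisible:

```python
# A Python analogue of the innocuous-looking SerializeToFile pattern.
# pickle.load(), like Java's ObjectInputStream.readObject(), will
# reconstruct *any* object the bytes describe, and crafted bytes can run
# arbitrary code during deserialisation (e.g. via __reduce__).
import os
import pickle
import tempfile

def save_state(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_state(path):
    # BAD if the file's contents are attacker-controlled: deserialising
    # untrusted data is the class of flaw GPT-3 flagged in Koch's Java code.
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "state.bin")
save_state({"user": "alice", "role": "viewer"}, path)
print(load_state(path))  # round-trips fine, which is why it looks harmless
```

The round trip works perfectly on trusted data, which is precisely why this code survives code review: nothing in it looks wrong until you ask who controls the bytes being read back.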
That's the thing about vulnerabilities. And it's the thing about AI progress too. The most dangerous patterns are the ones that look normal until you have the eyes to see them for what they are.
In 2023, we got a glimpse of those eyes. In 2026, they're wide open.
