
I used o3 to find a remote zeroday in the Linux SMB implementation

658 points| zielmicha | 9 months ago |sean.heelan.io

223 comments


nxobject|9 months ago

A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for system prompt, background information, and auxiliary instructions [1], and then running it through `llm`.

It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.

[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899

epolanski|9 months ago

I find your take amusing considering that's literally the only part of the post he admits to just vibing it:

> In fact my entire system prompt is speculative so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering

kweingar|9 months ago

How do we benchmark these different methodologies?

It all seems like vibes-based incantations. "You are an expert at finding vulnerabilities." "Please report only real vulnerabilities, not any false positives." Organizing things with made-up HTML tags because the models seem to like that for some reason. Where does engineering come into it?

threeseed|9 months ago

It's amusing to me how people keep trying to apply engineering principles to an inherently unstable and unpredictable system in order to get a feeling of control.

Those prompts should be renamed as hints. Because that's all they are. Every LLM today ignores prompts if they conflict with its sole overarching goal: to give you an answer no matter whether it's true or not.

stingraycharles|9 months ago

Fun fact: if you ask an LLM about best practices and how to organize your prompts, it will hint you towards this direction.

It’s surprisingly effective to ask LLMs to help you write prompts as well, i.e. all my prompt snippets were designed with help of an LLM.

I personally keep them all in an org-mode file and copy/paste them on demand in a ChatGPT chat as I prefer more “discussion”-style interactions, but the approach is the same.

Enginerrrd|9 months ago

Wrangling LLMs is remarkably like wrangling interns in my experience. Except that the LLM will surprise you by being both much smarter and much dumber.

The more you can frame the problem with your expertise, the better the results you will get.

Retr0id|9 months ago

The article cites a signal to noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating this part will be where the real wins are, so I'll be watching this closely.

Aurornis|9 months ago

I’ve developed a few take-home interview problems over the years that were designed to be short, easy for an experienced developer, but challenging for anyone who didn’t know the language. All were extracted from real problems we solved on the job, reduced into something minimal.

Every time a new frontier LLM is released (excluding LLMs that use input as training data) I run the interview questions through it. I’ve been surprised that my rate of working responses remains consistently around 1:10 for the first pass, and often takes upwards of 10 rounds of poking to get it to find its own mistakes.

So this level of signal to noise ratio makes sense for even more obscure topics.

ianbutler|9 months ago

We’ve been working on a system that dramatically increases signal to noise for finding bugs, and at the same time we’ve been thoroughly benchmarking the entire popular software agents space for this.

We’ve found a wide range of results, and we have a conference talk coming up soon where we’ll be releasing everything publicly, so stay tuned. It’ll be pretty illuminating on the state of the space.

Edit: confusing wording

tough|9 months ago

I was thinking about this the other day: wouldn't it be feasible to fine-tune a model (or something like that) on every git change, mailing list post, etc., the Linux kernel has ever had?

Wouldn't such an LLM be the closest -synth- version of a person who has worked on a codebase for years, learnt all its quirks etc.?

There's only so much you can fit in a large context window; some codebases are already 200k tokens just for the code as is, so idk

antirez|9 months ago

I bet automating this part will be simple. In general, LLMs that have a given semantic ability "X" to do some task have greater than X ability to check, among N replies about doing the same task, which reply is the best, especially if via binary tournament like RAInk did (it was posted here a few weeks ago). There is also the possibility of using agreement among different LLMs. I'm surprised Gemini 2.5 PRO was not used here; in my experience it is the most powerful LLM for that kind of stuff.
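To make the "binary tournament among N replies" idea concrete, here is a minimal sketch of a generic single-elimination tournament. It is not the algorithm of the tool mentioned above, only an illustration of the shape. The names `binary_tournament` and `longer_report` are hypothetical, and the judge here is a placeholder: a real one would ask an LLM which of two reports is more plausible.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def binary_tournament(
    candidates: List[T],
    better: Callable[[T, T], T],
    rng: Optional[random.Random] = None,
) -> T:
    """Return the winner of a single-elimination tournament.

    `better(a, b)` must return whichever of the two candidates it prefers.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    rng = rng or random.Random(0)
    pool = list(candidates)
    rng.shuffle(pool)  # randomise the bracket so input order does not matter
    while len(pool) > 1:
        next_round = []
        # Pair off neighbours; with an odd count the last one gets a bye.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(better(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def longer_report(a: str, b: str) -> str:
    # Placeholder judge (assumption): prefers the longer text.
    # A real judge would be an LLM call comparing two vulnerability reports.
    return a if len(a) >= len(b) else b
```

With N candidates this needs N - 1 pairwise comparisons, which is the main appeal over asking a judge to rank all N reports in a single prompt.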

andix|9 months ago

1:50 is a great detection ratio for finding a needle in a haystack.

manmal|9 months ago

If the LLM wrote a harness and proof of concept tests for its leads, then it might increase S/N dramatically. It’s just quite expensive to do all that right now.

nialv7|9 months ago

maybe we ask the AI to come up with an exploit, run it and see if it works? then you can RL on this.

iandanforth|9 months ago

The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!

seanheelan|9 months ago

I realised I didn't mention it in the article, so in case you're curious it cost about $116 to run the 100k token version 100 times.

JFingleton|9 months ago

Zero days can go for $$$, or you can go down the bug bounty route and also get $$. The cost of the LLM would be a drop in the bucket.

When the cost of inference gets near zero, I have no idea what the world of cyber security will look like, but it's going to be a very different space from today.

roncesvalles|9 months ago

A lot of money is all you need~

xyst|9 months ago

"100 times for each of the models" represents a significant amount of energy burned. The achievement of finding the most common vulnerability in C based codebases becomes less of an achievement. And more of a celebration of decadence and waste.

We are facing a global climate change event, yet continue to burn resources for trivial shit like it’s 1950.

geraneum|9 months ago

This has become a common recurrence recently.

Have a problem with clear definition and evaluation function. Let LLM reduce the size of solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to what was known before, it can work very well.

In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization; on a different scale.

Here’s an interesting read on “Mathematical discoveries from program search with large language models”, which I believe was also featured on HN in the past:

https://www.nature.com/articles/s41586-023-06924-6

One small note: concluding that the LLM is “reasoning” about code just _based on this experiment_ is a bit of a stretch IMHO.

meander_water|9 months ago

I'm not sure about the assertion that this is the first vulnerability found with an LLM. For example, OSS-Fuzz [0] has found a few using fuzzing, and Big Sleep has found one using an agent approach [1].

[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-...

[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-...

seanheelan|9 months ago

It's certainly not the first vulnerability found with an LLM =) Perhaps I should have been more clear though.

What the post says is "Understanding the vulnerability requires reasoning about concurrent connections to the server, and how they may share various objects in specific circumstances. o3 was able to comprehend this and spot a location where a particular object that is not referenced counted is freed while still being accessible by another thread. As far as I'm aware, this is the first public discussion of a vulnerability of that nature being found by a LLM."

The point I was trying to make is that, as far as I'm aware, this is the first public documentation of an LLM figuring out that sort of bug (non-trivial amount of code, bug results from concurrent access to shared resources). To me at least, this is an interesting marker of LLM progress.

empath75|9 months ago

Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI etc. are going to allow with any public API.

treebeard901|9 months ago

Yeah, the amount of engineering they have around controlling (censoring) the output, along with the terms of service, creates an incentive to still look for any possible bugs, but not allow it in the output.

Certainly for Govt agencies and others this will not be a factor. It is just for everyone else. This will cause people to use other models and agents without these restrictions.

It is safe to assume that a large number of vulnerabilities exist in important software all over the place. Now they can be found. This is going to set off arms race game theory applied to computer security and hacking. Probably sooner than expected...

stonepresto|9 months ago

I know there were at least a few kernel devs who "validated" this bug, but did anyone actually build a PoC and test it? It's such a critical piece of the process yet a proof of concept is completely omitted? If you don't have a PoC, you don't know what sort of hiccups would come along the way and therefore can't determine exploitability or impact. At least the author avoided calling it an RCE without validation.

But what if there's a missing piece of the puzzle that the author and devs missed or assumed o3 covered, but in fact was out of o3's context, that would invalidate this vulnerability?

I'm not saying there is, nor am I going to take the time to do the author's work for them, rather I am saying this report is not fully validated which feels like a dangerous precedent to set with what will likely be an influential blog post in the LLM VR space moving forward.

IMO the idea of PoC || GTFO should be applied more strictly than ever before to any vulnerability report generated by a model.

The underlying perspective that o3 is much better than previous or other current models still remains, and the methodology is still interesting. I understand the desire and need to get people to focus on something by wording it a specific way, it's the clickbait problem. But dammit, do better. Build a PoC and validate your claims, don't be lazy. If you're going to write a blog post that might influence how vulnerability researchers conduct their research, you should promote validation and not theoretical assumption. The alternative is the proliferation of ignorance through false-but-seemingly-true reporting, versus deepening the community's understanding of a system through vetted and provable reports.

seanheelan|9 months ago

Hi, author here. Yes, I built a PoC. Yes, it triggered a KASAN report/crash.

lyu07282|9 months ago

Are you saying you want PoCs that trigger a crash from the use-after-free or you would only be satisfied by full on RCE PoCs?

simonw|9 months ago

There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:

> I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I’d like it to help, so here we are. In fact my entire system prompt is speculative in that I haven’t ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I’ll let you know.

logifail|9 months ago

My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server...

Q1: Who is using ksmbd in production?

Q2: Why?

donnachangstein|9 months ago

1. People that were using the in-kernel SMB server in Solaris or Windows.

2. Samba performance sucks (by comparison) which is why people still regularly deploy Windows for file sharing in 2025.

Anybody know if this supports native Windows-style ACLs for file permissions? That is the last remaining reason to still run Solaris but I think it relies on ZFS to do so.

Samba's reliance on Unix UID/GID and the syncing as part of its security model is still stuck in the 1970s unfortunately.

The caveat is the in-kernel SMB server has been the source of at least one holy-shit-this-is-bad zero-day remote root hole in Windows (not sure about Solaris) so there are tradeoffs.

AshamedCaptain|9 months ago

Licensing. Samba is GPLv3, Linux is only GPLv2.

pixl97|9 months ago

I would assume for the reason of being lightweight and high performance?

noname120|9 months ago

The same reason people use kmod-trelay instead of relayd I guess

zielmicha|9 months ago

(To be clear, I'm not the author of the post, the title just starts with "How I")

eqvinox|9 months ago

Anyone else feel like this is a best case application for LLMs?

You could in theory automate the entire process, treat the LLM as a very advanced fuzzer. Run it against your target in one or more VMs. If the VM crashes or otherwise exhibits anomalous behavior, you've found something. (Most exploits like this will crash the machine initially, before you refine them.)

On one hand: great application for LLMs.

On the other hand: conversely implies that demonstrating this doesn't mean that much.
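The "run it in a VM and see whether it crashes" loop described above needs some automatic crash detector. A minimal sketch, assuming the target kernel is built with KASAN and that each candidate's kernel log has already been captured as text: KASAN reports start with a `BUG: KASAN: <kind> in <function>` line, so a log scan is enough for a first-pass triage. The function names and the sample log line in the usage note are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# KASAN reports in the kernel log begin with a line such as
# "BUG: KASAN: slab-use-after-free in <function>+<offset>".
KASAN_LINE = re.compile(r"BUG: KASAN: (\S+) in (\S+)")


def find_kasan_reports(kernel_log: str) -> List[Tuple[str, str]]:
    """Return (bug kind, location) pairs for every KASAN report in the log."""
    return KASAN_LINE.findall(kernel_log)


def triage(runs: Iterable[Tuple[str, str]]) -> List[str]:
    """Given (candidate_id, captured kernel log) pairs, return the IDs
    whose log contains at least one KASAN report."""
    return [cid for cid, log in runs if find_kasan_reports(log)]
```

For example, `triage([("poc-1", "BUG: KASAN: slab-use-after-free in example_fn+0x10/0x20"), ("poc-2", "all quiet")])` keeps only `"poc-1"`. Booting the VM, running each candidate, and collecting the log are deliberately left out, since they depend on the test setup.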

KTibow|9 months ago

> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.

This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get differing results.

gizmodo59|9 months ago

Having tried both, I’d say o3 is in a league of its own compared to 3.7 or even Gemini 2.5 Pro. The benchmarks may not show a lot of gain, but the difference matters a lot when the task is very complex. What’s surprising is that they announced it last November and it was only released a month ago? (I’m guessing lots of safety work took time, but no idea.) Can’t wait for o4!

iamdanieljohns|9 months ago

Could you provide some links to relevant work/research on using a "scratchpad" that you liked?

jp0001|9 months ago

We followed a very similar approach at work: we created a test harness and tested all the models available in AWS Bedrock and from OpenAI. We created our own code challenges, not available on the Internet for training, with vulnerable and non-vulnerable inline snippets and more contextual multi-file bugs. We also used 100 tests per challenge - I wanted to do 1000 tests per challenge but realized that these models are not even close to 2 Sigma in accuracy! Overall we found very similar results. But we were also able to increase accuracy using additional methods, which come at additional cost. The issue I see overall is that when dealing with large codebases you'll need to put blinders on the LLMs to shorten context windows so that hallucinated results are less likely to happen. The worst thing would be to follow red herrings - perhaps in 5 years we'll have models used for more engineering-specific tasks that can be rated with Six Sigma accuracy if posed with the same questions and problem sets.

bandrami|9 months ago

The blinders give you a problem in that a lot of security issues aren't at a single point in the code but at where two remote points in the code interact.

resiros|9 months ago

I think an approach like AlphaEvolve is very likely to work well for this space.

You've got all the elements for a successful optimization algorithm: 1) A fast and good enough sampling function + 2) a fairly good energy function.

For 1) this post shows that LLMs (even unoptimized) are quite good at sampling candidate vulnerabilities in large code bases. A 1% accuracy rate isn't bad at all, and they can be made quite fast (at least very parallelizable).

For 2) theoretically you can test any exploit easily and programmatically determine if it works. The main challenge is getting the energy function to provide gradient—some signal when you're close to finding a vulnerability/exploit.

I expect we'll see such a system within the next 12 months (or maybe not, since it's the kind of system that many lettered agencies would be very interested in).
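The two ingredients listed above (a sampling function and an energy function) can be wired together in a few lines. This is a deliberately simple sample-and-rank sketch, not AlphaEvolve: there is no mutation or evolutionary step, and both `sample` and `energy` are placeholders supplied by the caller. In the setting discussed here, `sample` would be an LLM proposing a candidate vulnerability and `energy` would score how close a candidate is to a working trigger (lower is better); both of those are assumptions.

```python
import random
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")


def search(
    sample: Callable[[random.Random], C],
    energy: Callable[[C], float],
    rounds: int,
    keep: int,
    seed: int = 0,
) -> List[Tuple[float, C]]:
    """Draw `rounds` candidates, score each one, and return the `keep`
    lowest-energy (score, candidate) pairs in ascending order of energy."""
    rng = random.Random(seed)
    scored = [(energy(c), c) for c in (sample(rng) for _ in range(rounds))]
    scored.sort(key=lambda pair: pair[0])  # lowest energy first
    return scored[:keep]


# Toy usage: "candidates" are integers and the energy is the distance to 42.
best = search(lambda rng: rng.randint(0, 100), lambda x: abs(x - 42), rounds=200, keep=3)
```

The gradient problem mentioned above shows up directly in `energy`: if it only returns "crashed" or "did not crash", every failing candidate scores the same and the ranking carries no information.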

martinald|9 months ago

I think this is the biggest alignment problem with LLMs in the short term imo. It is getting scarily good at this.

I recently found a pretty serious security vulnerability in an open source very niche server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there which wasn't worth finding vulnerabilities in for nefarious means manually but if it was automated could lead to really serious problems.

tekacs|9 months ago

The (obvious) flipside of this coin is that it allows us to run this adversarially against our own codebases, catching bugs that could otherwise have been found by a researcher, but that we can instead patch proactively.

I wouldn't (personally) call it an alignment issue, as such.

Legend2440|9 months ago

If attackers can automatically scan code for vulnerabilities, so can defenders. You could make it part of your commit approval process or scan every build or something.

roywiggins|9 months ago

Is it an alignment problem if it's doing what was asked of it? It's "aligned" with a human's wishes.

bongodongobob|9 months ago

It's a moot point unless attackers have better LLMs that defenders don't have access to.

dboreham|9 months ago

I feel like our jobs are reasonably secure for a while because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"

gerdesj|9 months ago

I'll have to get my facts straight but I'm pretty sure that ksmbd is ... not used much (by me).

https://lwn.net/Articles/871866/ This is also nothing to do with Samba which is a well trodden path.

So why not attack a codebase that is rather more heavily used and older? Why not go for vi?

usr1106|9 months ago

Good link. After reading this it's not a surprise that this code has security vulnerabilities. But of course, going from knowing that there must be vulnerabilities to actually finding one is still a big leap.

4 years after the article, does any relevant distro have that implementation enabled?

mezyt|9 months ago

Meanwhile, as a maintainer, I've been reviewing more than a dozen false-positive slop CVEs in my library and not a single one found an actual issue. This article is probably going to make my situation worse.

SamuelAdams|9 months ago

Maybe, but the author is an experienced vulnerability analyst. Obviously if you get a lot of people who have no experience with this you may get a lot of sloppy, false reports.

But this poster actually understands the AI output and is able to find real issues (in this case, use-after-free). From the article:

> Before I get into the technical details, the main takeaway from this post is this: with o3 LLMs have made a leap forward in their ability to reason about code, and if you work in vulnerability research you should start paying close attention. If you’re an expert-level vulnerability researcher or exploit developer the machines aren’t about to replace you. In fact, it is quite the opposite: they are now at a stage where they can make you significantly more efficient and effective.

baq|9 months ago

probably not. o3 is not free to use.

mehulashah|9 months ago

The scary part of this is that the bad guys are doing the same thing. They’re looking for zero day exploits, and their ability to find them just got better. More importantly, it’s now almost automated. While the arms race will always continue, I wonder if this change of speed hurts the good guys more than the bad guys. There are many of these, and they take time to fix.

akomtu|9 months ago

This made me think that the near future will be LLMs trained specifically on Linux or another large project. The source code is a small part of the dataset fed to LLMs. The more interesting is runtime data flow, similar to what we observe in a debugger. Looking at the codebase alone is like trying to understand a waterfall by looking at equations that describe the water flow.

baq|9 months ago

it needs to be trained on enough TLA+ traces, too.

davidgerard|9 months ago

This is just fuzzing with extra power consumption?

brokensegue|9 months ago

Can you reconstruct finding this bug with traditional fuzzing?

1oooqooq|9 months ago

i can ask offline o3 about that cve and get a reply, does that mean the author used a model that knew about the vulnerability?

theptip|9 months ago

This is a great case study. I wonder how hard o3 would find it to build a minimal repro for these vulns? This would of course make it easier to identify true positives and discard false positives.

This is I suppose an area where the engineer can apply their expertise to build a validation rig that the LLM may be able to utilize.

jobswithgptcom|9 months ago

Wow, interesting. I've been hacking on a tool called https://diffwithgpt.com with a similar angle, but indexing git changelogs with qwen to have it flag risks, including backward-compat and security issues, when upgrading k8s etc.

qoez|9 months ago

This is why AI safety is going to be impossible. This easily could have been a bad actor who would use this finding for nefarious acts. A person can just lie and there really isn't any safety finetuning that would let it separate the two intents.

mettamage|9 months ago

I wonder how often it will say there’s a vulnerability where there is none. Running it 100 times is a lot.

ape4|9 months ago

Seems we need something like kernel modules but with memory protection

Hilift|9 months ago

Does the vulnerability exist in other implementations of SMB?

p_ing|9 months ago

Implementations of SMB (Windows, Samba, macOS, ksmbd) are going to be different (macOS has a terrible implementation, even though AFP is being deprecated). At this level, it's doubtful that the code is shared among all implementations.

jokoon|9 months ago

Wow

I think the NSA already has this, without the need for an LLM.

fsckboy|9 months ago

>It is interesting by virtue of being part of the remote attack surface of the Linux kernel.

...if your linux kernel has ksmbd built into it; that's a much smaller interest group

tomalbrc|9 months ago

“I brute forced an ai to help me find potential zero day bugs”

fHr|9 months ago

meanwhile boomers out here still thinking they are better than AI when even local gemma3 models can write better code than them already

dehrmann|9 months ago

Are there better tools for finding this? It feels like the sort of thing static analysis should reliably find, but it's in the Linux kernel, so you'd think either coding standards or tooling around these sorts of C bugs would be mature.

grg0|9 months ago

Not an expert in the area, but "classic static analysis" (for lack of a better term) and concurrency bugs don't really go together. There are specific modeling tools for concurrency, and they are an entirely different beast from static analysis, requiring notation and language support to describe which threads access which data when. Finding concurrency bugs through static analysis probably requires a level of context and understanding that an LLM can easily churn through.

yellow_lead|9 months ago

Some static analysis tools can detect use after free or memory leaks. But since this one requires reasoning about multiple threads, I think it would've been unlikely to be found by static analysis.

mdaniel|9 months ago

Notable:

> o3 finds the kerberos authentication vulnerability in 8 of the 100 runs

And I'd guess this only became a blog post because the author already knew about the vuln and was just curious to see if the intern could spot it too, given a curated subset of the codebase

moyix|9 months ago

He did do exactly what you say – except right after that, while reviewing the outputs, he found that it had also discovered a different 0day.
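The "8 of the 100 runs" figure quoted above also explains why repeated runs matter. A small worked example, assuming each run is independent and hits with the same probability (a simplification, since the runs share a prompt and a model): the chance of at least one hit in n runs is 1 - (1 - p)^n. The function names are hypothetical.

```python
import math


def p_at_least_one_hit(per_run_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs finds the bug."""
    if not 0.0 <= per_run_rate <= 1.0:
        raise ValueError("per_run_rate must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return 1.0 - (1.0 - per_run_rate) ** runs


def runs_needed(per_run_rate: float, target: float) -> int:
    """Smallest number of runs whose combined hit probability reaches `target`."""
    if not 0.0 < per_run_rate < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("per_run_rate and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_run_rate))
```

Under those assumptions, a per-run rate of 0.08 gives roughly a 57% chance of at least one hit in 10 runs, and `runs_needed(0.08, 0.95)` comes out at 36 runs for a 95% chance.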