
Claude Code can debug low-level cryptography

473 points | Bogdanp | 4 months ago | words.filippo.io

208 comments


simonw|4 months ago

Using coding agents to track down the root cause of bugs like this works really well:

> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.

The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).

Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have it write code on your behalf.

rtpg|4 months ago

I understand the pitch here ("it finds bugs! it's basically all upside because worst case there's no output anyways"), but I'm finding some of these agents to be ... uhhh... kind of aggressive at trying to find the solution, and they end up missing the forest for the trees. And there's some "oh you should fix this" stuff which, while sometimes not _wrong_, is completely beside the point.

The end result being these robots doing bikeshedding. When paired with junior engineers looking at this output and deciding to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.

I'm trying to use these to some extent when I find myself in a canonical situation that should work and am not getting the value everyone else seems to get in many cases. Very much "trying to explain a thing to a junior engineer taking more time than doing it myself" thing, except at least the junior is a person.

majormajor|4 months ago

They're quite good at algorithm bugs, a lot less good at concurrency bugs, IME. Which is very valuable still, just that's where I've seen the limits so far.

They're also better at making tests for algorithmic things than for concurrency situations, but can get pretty close. They just usually don't have great out-of-the-box ideas for "how to ensure these two different things run in the desired order."

Everything that I dislike about generating non-greenfield code with LLMs isn't relevant to the "make tests" or "debug something" usage. (Weird/bad choices about when to duplicate code vs refactor things, lack of awareness around desired "shape" of codebase for long-term maintainability, limited depth of search for impact/related existing stuff sometimes, running off the rails and doing almost-but-not-quite stuff that ends up entirely the wrong thing.)

lxgr|4 months ago

I’ve been pretty impressed with LLMs at (to me) greenfield hobby projects, but not so much at work in a huge codebase.

After reading one of your blog posts recommending it, I decided to specifically give them a try as bug hunters/codebase explainers instead, and I’ve been blown away. Several hard-to-spot production bugs tracked down in two weeks or so, each of which would have taken me at least a few focused hours to find on my own.

mschulkind|4 months ago

One of my favorite ways to use LLM agents for coding is to have them write extensive documentation on whatever I'm about to dig in coding on. Pretty low stakes if the LLM makes a few mistakes. It's perhaps even a better place to start for skeptics.

teaearlgraycold|4 months ago

I’m a bit of an LLM hater because they’re overhyped. But in these situations they can be pretty nice if you can quickly evaluate correctness. If evaluating correctness is harder than searching on your own, then they’re a net negative. I’ve found with my debugging it’s really hard to know which will be the case. And since it’s my responsibility to build a “Do I give the LLM a shot?” heuristic, that’s very frustrating.

NoraCodes|4 months ago

> start exploring how these tools can help them without feeling like they're [...] ripping off the work of everyone who's code was used to train the model

But you literally still are. If you weren't, it should be trivially easy to create these models without using huge swathes of non-public-domain code. Right?

pron|4 months ago

I'm only an "AI sceptic" in the sense that I think that today's LLM models cannot regularly and substantially reduce my workload, not because they aren't able to perform interesting programming tasks (they are!), but because they don't do so reliably, and for a regular and substantial reduction in effort, I think a tool needs to be reliable and therefore trustworthy.

Now, this story is a perfect use case, because Filippo Valsorda put very little effort into communicating with the agent. If it worked - great; if it didn't - no harm done. And it worked!

The thing is that I already know that these tools are capable of truly amazing feats, and this is, no doubt, one of them. But it's been a while since I had a bug in a single-file library implementing a well-known algorithm, so it still doesn't amount to a regular and substantial increase in productivity for me, but "only" to yet another amazing feat by LLMs (something I'm not sceptical of).

Next time I have such a situation, I'll definitely use an LLM to debug it, because I enjoy seeing such results first-hand (plus, it would be real help). But I'm not sure that it supports the claim that these tools can today offer a regular and substantial productivity boost.

ulrikrasmussen|4 months ago

I know this is not an argument against LLM's being useful to increase productivity, but of all tasks in my job as software developer, hunting for and fixing obscure bugs is actually one of the most intellectually rewarding. I would miss that if it were to be taken over by a machine.

Also, hunting for bugs is often a very good way to get intimately familiar with the architecture of a system which you don't know well, and furthermore it improves your mental model of the cause of bugs, making you a better programmer in the future. I can spot a possible race condition or unsafe alien call at a glance. I can quickly identify a leaky abstraction, and spot mutable state that could be made immutable. All of this because I have spent time fixing bugs that were due to these mistakes. If you don't fix other people's bugs yourself, I fear you will also end up relying on an LLM to make judgements about your own code to make sure that it is bug-free.

jack_tripper|4 months ago

>Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have it write code on your behalf.

Why? Bug hunting is more challenging and cognitively intensive than writing code.

dns_snek|4 months ago

This is no different than when LLMs write code. In both scenarios they often turn into bullshit factories that are capable, willing, and happy to write pages and pages of intricate, convincing-sounding explanations for bugs that don't exist, wasting everyone's time and testing my patience.

qa34514324|4 months ago

I have tested the AI SAST tools that were hyped after a curl article on several C code bases and they found nothing.

Which low level code base have you tried this latest tool on? Official Anthropic commercials do not count.

XenophileJKO|4 months ago

Personally my biggest piece of advice is: AI First.

If you really want to understand what the limitations are of the current frontier models (and also really learn how to use them), ask the AI first.

By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said.. you also need to understand how they fail and build an intuition for why they fail.

Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people that have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.

imiric|4 months ago

> By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests.

Unfortunately, it doesn't quite work out that way.

Yes, you will get better at using these tools the more you use them, which is the case with any tool. But you will not learn what they can do as easily, or at all.

The main problem with them is the same one they've had since the beginning. If the user is a domain expert, then they will be able to quickly spot the inaccuracies and hallucinations in the seemingly accurate generated content, and, with some effort, coax the LLM into producing correct output.

Otherwise, the user can be easily misled by the confident and sycophantic tone, and waste potentially hours troubleshooting, without being able to tell if the error is on the LLM side or their own. In most of these situations, they would've probably been better off reading the human-written documentation and code, and doing the work manually. Perhaps with minor assistance from LLMs, but never relying on them entirely.

This is why these tools are most useful to people who are already experts in their field, such as Filippo. For everyone else who isn't, and actually cares about the quality of their work, the experience is very hit or miss.

> That being said.. you also need to understand how they fail and build an intuition for why they fail.

I've been using these tools for years now. The only intuition I have for how and why they fail is when I'm familiar with the domain. But I had that without LLMs as well, whenever someone is talking about a subject I know. It's impossible to build that intuition with domains you have little familiarity with. You can certainly do that by traditional learning, and LLMs can help with that, but most people use them for what you suggest: throwing things over the wall and running with it, which is a shame.

> I work with people that have elaborate context setups they crafted for less capable models, they largely are unnecessary with GPT-5-Codex and Sonnet 4.5.

I haven't used GPT-5-Codex, but have experience with Sonnet 4.5, and it's only marginally better than the previous versions IME. It still often wastes my time, no matter the quality or amount of context I feed it.

Razengan|4 months ago

I did ask the AI first, about some things that I already knew how to do.

It gave me horribly inefficient or long-winded ways of doing it. In the time it took for "prompt tuning" I could have just written the damn code myself. It decreased the confidence for anything else it suggested about things I didn't already know about.

Claude still sometimes insists that iOS 26 isn't out yet. sigh.. I suppose I just have to treat it as an occasional alternative to Google/StackOverflow/Reddit for now. No way would I trust it to write an entire class, let alone an app, and be able to sleep at night (not that I sleep at night, but that's beside the point)

I think I prefer Xcode's built-in local model approach, where it just offers sane autocompletions based on your existing code. e.g. if you already wrote a Dog class it can make a Cat class and change `bark()` to `meow()`

pton_xd|4 months ago

> Full disclosure: Anthropic gave me a few months of Claude Max for free. They reached out one day and told me they were giving it away to some open source maintainers.

Related, lately I've been getting tons of Anthropic Instagram ads; they must be near a quarter of all the sponsored content I see for the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude. Or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.

simonw|4 months ago

I think it might be that they've hit product-market fit.

Developers find Claude Code extremely useful (once they figure out how to use it). Many developers subscribe to their $200/month plan. Assuming that's profitable (and I expect it is, since even for that much money it cuts off at a certain point to avoid over-use) Anthropic would be wise to spend a lot of money on marketing to try and grow their paying subscriber base for it.

jamamp|4 months ago

Unfortunately that might also be due to how Instagram shows ads, and not necessarily Anthropic's marketing push. As soon as you click on or even linger on a particular ad, Instagram notices and doubles down on sending you more ads from the same provider as well as related ads. My experience is that Instagram shows me 2-3 groups of ad types, which slowly change over the months as their effectiveness wanes.

The fact that you are receiving multiple kinds of ads from Anthropic does signify more of a marketing push, though.

spacechild1|4 months ago

> Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.

Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.

danielbln|4 months ago

OP said "for me to reason about it", not for the LLM to reason about it.

I agree though, LLMs can be incredible debugging tools, but they are also incredibly gullible and love to jump to conclusions. The moment you turn your own fleshy brain off is when they go to la-la land.

qsort|4 months ago

This resonates with me a lot:

> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete

I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.

A "continuously running" mode that would ping me would be interesting to try.

imiric|4 months ago

On the one hand, I agree with this. The chat UI is very slow and inefficient.

But on the other, given what I know about these tools and how error-prone they are, I simply refuse to give them access to my system, to run commands, or do any action for me. Partly due to security concerns, partly due to privacy, but mostly distrust that they will do the right thing. When they screw up in a chat, I can clean up the context and try again. Reverting a removed file or messed up Git repo is much more difficult. This is how you get a dropped database during code freeze...

The idea of giving any of these corporations such privileges is unthinkable for me. It seems that most people either don't care about this, or are willing to accept it as the price of admission.

I experimented with Aider and a self-hosted model a few months ago, and wasn't impressed. I imagine the experience with SOTA hosted models is much better, but I'll probably use a sandbox next time I look into this.

cmrdporcupine|4 months ago

I absolutely agree with this sentiment as well and keep coming back to it. What I want is more of an actual copilot: something that works in a more paired way, forces me to interact with each of its changes, involves me more directly in them, teaches me about what it's doing along the way, and asks for more input.

A more socratic method, and more augmentic than "agentic".

Hell, if anybody has investment money and energy and shares this vision, I'd love to work on creating this tool with you. I think these models are being misused right now in an attempt to automate us out of work, when their real amazing latent power is the intuition that we're talking about in this thread.

Misused they have the power to worsen codebases by making developers illiterate about the very thing they're working on because it's all magic behind the scenes. Uncorked they could enhance understanding and help better realize the potential of computing technology.

Frannky|4 months ago

CLI terminals are incredibly powerful. They are also free if you use Gemini CLI or Qwen Code. Plus, you can access any OpenAI-compatible API (2k TPS via Cerebras at $2/M, or local models). And you can use them in IDEs like Zed with ACP mode.

All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. Also, they are great UIs—you can create smart programs just with CLI + MCP servers + MD files. Truly amazing tech.

idiotsecant|4 months ago

Claude Code's UX is really good, though.

BrokenCogs|4 months ago

How good is Gemini CLI compared to Claude Code and OpenAI Codex?

Thorrez|4 months ago

>so I checked out the old version of the change with the bugs (yay Jujutsu!) and kicked off a fresh Claude Code session

There's a risk there that the AI could find the solution by looking through your history, instead of discovering it directly in the checked-out code. AI has done that in the past:

https://news.ycombinator.com/item?id=45214670

delaminator|4 months ago

> For example, how nice would it be if every time tests fail, an LLM agent was kicked off with the task of figuring out why, and only notified us if it did before we fixed it?

You can use Git hooks to do that. If you have tests and one fails, spawn an instance of claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try to work out why'

    $ claude -p 'hello, just tell me a joke about databases'

    A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"

    $ 
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push.

> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.

I like this idea. I think I shall get Claude to work out the mechanism itself :)
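A minimal sketch of that mechanism, assuming the claude CLI from Claude Code is on PATH. The test-runner path and prompt wording below are placeholders, not real conventions:

```shell
#!/bin/sh
# Sketch of a pre-push hook body (e.g. .git/hooks/pre-push). TEST_CMD is
# whatever runs your suite; both defaults here are made-up names.
: "${TEST_CMD:=./tests/run_all.sh}"
: "${CLAUDE_CMD:=claude}"

run_hook() {
    if ! $TEST_CMD; then
        # Kick off the agent to explain the failure, but still block the push.
        $CLAUDE_CMD -p 'The test suite failed. Look in src/ and work out why.'
        return 1
    fi
    return 0
}
```

Whether to run the agent in the background so it doesn't delay the push, and whether to still block the push, are choices; this sketch blocks it.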

It is even a suggestion on this Claude cheat sheet

https://www.howtouselinux.com/post/the-complete-claude-code-...

jamesponddotco|4 months ago

This could probably be implemented as a simple Bash script, if the user wants to run everything manually. I might just do that to burn some time.

wrs|4 months ago

Yesterday I tried something similar with a website (Go backend) that was doing a complex query/filter and showing the wrong results. I just described the problem to Claude Code, told it how to run psql, and gave it an authenticated cookie to curl the backend with. In about three minutes of playing around, it fixed the problem. It only needed raw psql and curl access, no specialized tooling, to write a bunch of bash commands to poke around and compare the backend results with the test database.

gdevenyi|4 months ago

Coming soon, adversarial attacks on LLM training to ensure cryptographic mistakes.

nikanj|4 months ago

I'm surprised it didn't fix it by removing the code. In my experience, if you give Claude a failing test, it fixes it by hard-coding the code to return the value expected by the test or something similar.

Last week I asked it to look at why a certain device enumeration caused a sigsegv, and it quickly solved the issue by completely removing the enumeration. No functionality, no bugs!

pessimizer|4 months ago

I've got a paste-in prompt that reiterates multiple times not to remove features or debugging output without asking first, and not to blame the test file/data that the program failed on. Repeated multiple times, the last time in all caps. It still does it. I hope maybe half as often, but I may be fooling myself.

phendrenad2|4 months ago

A whole class of tedious problems have been eliminated by LLMs because they are able to look at code in a "fuzzy" way. But this can be a liability, too. I have a codebase that "looks kinda" like a nodejs project, so AI agents usually assume it is one, even if I rename the package.json, it will inspect the contents and immediately clock it as "node-like".

didibus|4 months ago

This is basically the ideal scenario for coding agents. Easily verifiable through running tests, pure logic, algorithmic problem. It's the case that has worked the best for me with LLMs.

cluckindan|4 months ago

So the “fix” includes a completely new function? In a cryptography implementation?

I feel like the article is giving out very bad advice which is going to end up shooting someone in the foot.

thadt|4 months ago

Can you expand on what you find to be 'bad advice'?

The author uses an LLM to find bugs and then throw away the fix and instead write the code he would have written anyway. This seems like a rather conservative application of LLMs. Using the 'shooting someone in the foot' analogy - this article is an illustration of professional and responsible firearm handling.

OneDeuxTriSeiGo|4 months ago

The article even states that the solution claude proposed wasn't the point. The point was finding the bug.

AI are very capable heuristics tools. Being able to "sniff test" things blind is their specialty.

i.e. Treat them like an extremely capable gas detector that can tell you there is a leak and where in the plumbing it is, not a plumber who can fix the leak for you.

lordnacho|4 months ago

I'm not surprised it worked.

Before I used Claude, I would be surprised.

I think it works because Claude takes some standard coding issues and systematizes them. The list is long, but Claude doesn't run out of patience like a human being does. Or at least it has some credulity left after trying a few initial failed hypotheses. This being a cryptography problem helps a little bit, in that there are very specific keywords that might hint at a solution, but from my skim of the article, it seems like it was mostly a good old coding error, taking the high bits twice.

The standard issues are just a vague laundry list:

- Are you using the data you think you're using? (Bingo for this one)

- Could it be an overflow?

- Are the types right?

- Are you calling the function you think you're calling? Check internal, then external dependencies

- Is there some parameter you didn't consider?

And a bunch of others. When I ask Claude for a debug, it's always something that makes sense as a checklist item, but I'm often impressed by how it diligently followed the path set by the results of the investigation. It's a great donkey, really takes the drudgery out of my work, even if it sometimes takes just as long.
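The first checklist item (the bug in the article was, per the skim above, taking the high bits twice) can be illustrated with a made-up two-byte example; this is not the article's actual code:

```shell
# Toy illustration of the "high bits taken twice" bug class: combining
# two bytes into a 16-bit value, but reusing the high byte in both slots.
hi=$((0x12))
lo=$((0x34))
buggy=$(( (hi << 8) | hi ))   # bug: high byte used twice
fixed=$(( (hi << 8) | lo ))
printf 'buggy: %#06x  fixed: %#06x\n' "$buggy" "$fixed"
# prints: buggy: 0x1212  fixed: 0x1234
```

Both values look plausible in isolation, which is exactly why this class of bug is easy for a patient checklist-follower to catch and easy for a human skimming the code to miss.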

vidarh|4 months ago

> The list is long, but Claude doesn't run out of patience like a human being does

I've flat out had Claude tell me its task was getting tedious, and it will often grasp at straws to use as excuses for stopping a repetitive task and moving on to something else.

Keeping it on task when something keeps moving forward is easy, but when it gets repetitive it takes a lot of effort to make it stick to it.

ay|4 months ago

> Claude doesn't run out of patience like a human being does.

It very much does! I had a debugging session with Claude Code today, and it was about to give up with a message along the lines of “I am sorry I was not able to help you find the problem”.

It took some gentle cheering (pretty easy, just saying “you are doing an excellent job, don’t give up!”) and encouragement, and a couple of suggestions from me on how to approach the debug process for it to continue and finally “we” (I am using plural here because some information that Claude “volunteered” was essential to my understanding of the problem) were able to figure out the root cause and the fix.

zcw100|4 months ago

I just recently found a number of bugs in both the RELIC and MCL libraries. It took a while to track them down but it was remarkable that it was able to find them at all.

deadbabe|4 months ago

With AI, we will finally be able to do the impossible: roll our own crypto.

oytis|4 months ago

It is very much possible, it's just a bad idea. Doubly so with AI.

lisbbb|4 months ago

You're not going to do better than the NSA.

tptacek|4 months ago

That's exactly not what he's doing.

jerf|4 months ago

This is one of the things I've mentioned before; I think it's just hidden a bit and hard to see. This is basically the LLM doing style transfer, which they're really good at. There was a specification for the code (which looks like it was already trained into the LLM, since it didn't have to go fetch it, but it also had intimate knowledge of it), there was an implementation, and it's really good at extracting out the style difference between code and spec. Anything that looks like style transfer is a good use for LLMs.

As another example, I think things like "write unit tests for this code" are usually similar sort of style transfer as well, based on how it writes the tests. It definitely has a good idea as to how to sort of ensure that all the functionality gets tested, I find it is less likely to produce "creative" ways that bugs may come out, but hey, it's a good start.

This isn't a criticism, it's intended to be a further exploration and understanding of when these tools can be better than you might intuitively think.

rvz|4 months ago

As declared by an expert in cryptography who knows how to guide the LLM into debugging low-level cryptography, which is good.

Quite different if you are not a cryptographer or a domain expert.

tptacek|4 months ago

Even the title of the post makes this clear: it's about debugging low-level cryptography. He didn't vibe code ML-DSA. You have to be building a low-level implementation in the first place for anything in this post to apply to it.