The unexpected effectiveness of one-shot decompilation with Claude

[+] simonw|3 months ago|reply

For anyone else who was initially confused by this, useful context is that Snowboard Kids 2 is an N64 game.

I also wasn't familiar with this terminology:

> You hand it a function; it tries to match it, and you move on.

In decompilation "matching" means you found a function block in the machine code, wrote some C, then confirmed that the C produces the exact same binary machine code once it is compiled.

The author's previous post explains this all in a bunch more detail: https://blog.chrislewis.au/using-coding-agents-to-decompile-...

[+] slavik81|3 months ago|reply

Snowboard Kids 2 was a great N64 game. It was one of a number of racing titles inspired by Mario Kart, but the snowboarding added a bit of a different feel. The battle items were clever, and the stages were really well made given the technical limitations they faced. As a kid with two brothers, we played a lot of competitive multiplayer.

I also remember a few things in the singleplayer being very difficult. The number of times I had to fight/race Dameian in his giant robot running down the mountainside... It's carved into my brain like that footrace against Wizpig in DKR or the Donkey Kong arcade game for the Rareware coin in DK64.

The battle items in Snowboard Kids were clever and memorable. The parachute missile that would launch racers up in the air and then deploy the parachute so they slowly float back down was such a frustrating item to be hit with. The pans that would hit all opponents was iconic and it was hilarious that you could somehow doge it with invisibility. Even the basic rock dropped on the course was somehow memorable.

Great game. It's heartwarming to know that others still remember it and care about it.

[+] govping|3 months ago|reply

We've been using LLMs for security research (finding vulnerabilities in ML frameworks) and the pattern is similar - it's surprisingly good at the systematic parts (pattern recognition, code flow analysis) when you give it specific constraints and clear success criteria.

The interesting part: the model consistently underestimates its own speed. We built a complete bug bounty submission pipeline - target research, vulnerability scanning, POC development - in hours when it estimated days. The '10 attempts' heuristic resonates - there's definitely a point where iteration stops being productive.

For decompilation specifically, the 1M context window helps enormously. We can feed entire codebases and ask 'trace this user input to potential sinks' which would be tedious manually. Not perfect, but genuinely useful when combined with human validation.

The key seems to be: narrow scope + clear validation criteria + iterative refinement. Same as this decompilation work.

[+] your_sweetpea|3 months ago|reply

I'd like to see this given a bit more structure, honestly. What occurs to me is constraining the grammar for LLM inference to ensure valid C89 (or close-to, as much can be checked without compilation), then perhaps experimentally switching to a permuter once/if a certain threshold is reached for accuracy of the decompiled function.

Eventually some or many of these attempts would, of course, fail, and require programmer intervention, but I suspect we might be surprised how far it could go.

[+] Animats|3 months ago|reply

In decompilation "matching" means you found a function block in the machine code, wrote some C, then confirmed that the C produces the exact same binary machine code once it is compiled.

They had access to the same C compiler used by Nintendo in 1999? And the register allocation on a MIPS CPU is repeatable enough to get an exact match? That's impressive.

[+] tails4e|3 months ago|reply

Why not follow decompilation like ghidra does, rather than guess, compile, compare? It seems more sensible to actually decompile.

[+] elitan|3 months ago|reply

helpful

[+] saagarjha|3 months ago|reply

It's worth noting here that the author came up with a handful of good heuristics to guide Claude and a very specific goal, and the LLM did a good job given those constraints. Most seasoned reverse engineers I know have found similar wins with those in place.

What LLMs are (still?) not good at is one-shot reverse engineering for understanding by a non-expert. If that's your goal, don't blindly use an LLM. People already know that you getting an LLM to write prose or code is bad, but it's worth remembering that doing this for decompilation is even harder :)

[+] zdware|3 months ago|reply

Agree with this. I'm a software engineer that has mostly not had to manage memory for most of my career.

I asked Opus how hard it would be to port the script extender for Baldurs Gate 3 from Windows to the native Linux Build. It outlined that it would be very difficult for someone without reverse engineering experience, and correctly pointed out they are using different compilers, so it's not a simple mapping exercise. It's recommendation was not to try unless I was a Ghrida master and had lots of time in my hands.

[+] ph4evers|3 months ago|reply

Are they not performing well because they are trained to be more generic, or is the task too complex? It seems like a cheap problem to fine-tune.

[+] t_mann|3 months ago|reply

> The ‘give up after ten attempts’ threshold aims to prevent Claude from wasting tokens when further progress is unlikely. It was only partially successful, as Claude would still sometimes make dozens of attempts.

Not what I would have expected from a 'one-shot'. Maybe self-supervised would be a more suitable term?

[+] wavemode|3 months ago|reply

"one-shot" usually just means, one example and its correct answer was provided in the prompt.

125 comments