“Resurgence” not “resurgance”. I wanted to leave a comment in the article itself but it wants me to sign in with GitHub, which: yuk, so I’m commenting here instead.
I remember working on DCC, a decompiler for C created by Cristina Cifuentes in 1990. It felt like magic and the future, but it was incredibly difficult and interesting. I used it for decompiling firmware and it was hard to convince my boss that we needed it.
Decompilers aren’t just for security research; they’re a key part of data compression for software updates. Delta compressors make deltas between decompiled code. So an improvement in the mapping of decompiled files could yield as much as a 20x improvement in software update size.
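A hedged sketch of the idea with Python's `difflib` (the two-build example is invented): a one-constant change in the decompiled text is a one-line delta, even though the raw instruction bytes of the two binaries may all shift after the change.

```python
import difflib

# Invented example: two builds of the same function, decompiled to text.
old_build = ["int f(int x) {", "  return x + 1;", "}"]
new_build = ["int f(int x) {", "  return x + 2;", "}"]

# A textual delta touches only the changed line.
delta = list(difflib.unified_diff(old_build, new_build, lineterm=""))
changed = [l for l in delta if l[:1] in "+-" and l[:3] not in ("+++", "---")]
assert changed == ["-  return x + 1;", "+  return x + 2;"]
```

The raw binaries, by contrast, can differ at every byte after the modified instruction once addresses shift, which is why diffing at the decompiled level can pay off.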
Don’t delta updates usually use some binary diffing algorithm?
Even if they didn’t, I can understand why they would make deltas using the decompiled code rather than the disassembled one.
I love this use case! Do you have any public links acknowledging/mentioning/showing this use case? Including it in the Applications portion of the Dec Wiki would be great.
The article covers several themes in decompilation, but for academic work in decompilation, just take some papers, study them, study their references, and try to reproduce the experiments, I guess. For the bare basics, you can get a disassembly of a random binary on Linux with `objdump -S`.
rgovostes|1 year ago
Regarding AI-assisted renaming of variables, the author calls this "a strict improvement over traditional decompilation." But looking at the example:
I am reluctant to allow the decompiler to influence my judgment about the meaning of variables. `streamPos` is not equivalent to `end`. Multiply the issue by 20 or 100 similarly incorrect assumptions, and it would severely cloud your understanding of the decompiled code.

Combining this with reasoning models that can justify their labels would be very helpful. UX improvements could also be made to indicate confidence or progressively disclose these assumptions.
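To make the hazard concrete, here is a hypothetical reconstruction (an invented function, not the article's actual snippet) of why `end` vs `streamPos` matters: the variable is a moving cursor the caller depends on, not a fixed bound.

```python
# Invented illustration: the variable a model might label `end` is really
# a stream cursor that advances and is handed back to the caller.
def read_u16(buf, stream_pos):
    # Naming `stream_pos` as `end` would wrongly suggest a fixed upper
    # bound rather than state that moves by two bytes per call.
    value = buf[stream_pos] | (buf[stream_pos + 1] << 8)
    return value, stream_pos + 2

assert read_u16(bytes([0x34, 0x12, 0xFF]), 0) == (0x1234, 2)
```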
summerlight|1 year ago
benob|1 year ago
> If you’ve ever talked to me in person, you’d know that I’m a disbeliever of AI replacing decompilers any time soon

Decompilation, seen as a translation problem, is in every way a job that suits AI methods. Give researchers time to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top-notch decompilers that beat all engineered methods.
jcranmer|1 year ago
My first priority for a decompiler is that the output is (mostly) correct. (I say mostly because there's lots of little niggling behavior you probably want to ignore, like representing a shift instruction as `a << b` over `a << (b & 0x1f)`). When the decompiler's output is incorrect, I can't trust it anymore, and I'm going to go straight back to the disassembly because I need to work with the correct output. And AI--especially LLMs--are notoriously bad at the "correct" part of translation.
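As a sketch of why the masked form is what the machine actually computes (this models x86's 5-bit shift-count mask in Python; the function name is mine):

```python
# Model of the 32-bit x86 SHL instruction: the hardware masks the shift
# count to its low 5 bits and truncates the result to 32 bits, so
# `a << b` and `a << (b & 0x1f)` denote the same instruction.
def shl32(a, b):
    return (a << (b & 0x1F)) & 0xFFFFFFFF

assert shl32(1, 1) == 2
assert shl32(1, 33) == 2          # 33 & 0x1f == 1 -- not zero, not UB
assert shl32(0x80000000, 1) == 0  # high bit shifted out
```

Printing the simpler `a << b` is "mostly" correct: it is wrong only for counts of 32 or more, which are undefined behavior in C anyway.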
If you look at decompilation as a multistep problem, the main steps are a) identify the function/data symbol boundaries, b) lift the functions to IR, c) recover type information (including calling convention for functions), d) recover high-level control flow, and e) recover variable names.
For step b, correctness is so critical that I'm wary of even trusting hand-generated tables for disassembly, since it's way too easy for someone to miscopy something by hand. But on the other hand, this is something that can be machine-generated in a way that is provably correct (see, e.g., https://cs.stanford.edu/people/eschkufz/docs/pldi_16.pdf). Sure, there's also a further step of recognizing higher-level patterns like a manually-implemented bswap, but that's basically "implement a peephole optimizer," and the state of the art for compilers these days is to use formally verifiable techniques for doing that.
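A minimal sketch of what such a peephole target looks like, assuming 32-bit unsigned semantics (the example is mine):

```python
import struct

# The shift/mask idiom below is a "manually implemented bswap" that a
# pattern-recognition pass would want to collapse into one bswap32(x)
# (or __builtin_bswap32) call.
def manual_bswap32(x):
    return (((x & 0xFF) << 24) | ((x & 0xFF00) << 8) |
            ((x >> 8) & 0xFF00) | ((x >> 24) & 0xFF))

# Matches reversing the byte order directly:
assert manual_bswap32(0x11223344) == 0x44332211
assert manual_bswap32(0xDEADBEEF) == struct.unpack("<I", struct.pack(">I", 0xDEADBEEF))[0]
```

Proving the rewrite correct amounts to checking this equivalence for all 2^32 inputs, which is exactly the kind of fact SMT-based translation validation handles.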
For a lot of the other problems, if you instead categorize them as things where the AI being wrong doesn't make it incorrect, AI can be a valuable tool. For example, control flow structuring can be envisioned as identifying which branches are gotos (including breaks/continues/early returns), since a CFG that has no gotos is pretty trivial to structure. So if your actual AI portion is a heuristic engine for working that out, it's never going to generate wrong code, just unnecessarily complicated code.
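A hedged toy illustration of that claim (my own example, not from the article): both functions below structure the same CFG, and misreading the early-exit edge costs clarity, never correctness.

```python
# The heuristic recognizes the early-exit edge as a clean return:
def find_structured(xs, target):
    for i, x in enumerate(xs):
        if x == target:
            return i
    return -1

# Same CFG, early-exit edge not recognized: the structurer falls back to
# a flag variable -- clumsier output, but still semantically correct.
def find_goto_style(xs, target):
    i, result, done = 0, -1, False
    while i < len(xs) and not done:
        if xs[i] == target:
            result, done = i, True
        else:
            i += 1
    return result

for xs in ([], [3, 1, 4], [1, 1]):
    for t in (1, 9):
        assert find_structured(xs, t) == find_goto_style(xs, t)
```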
mahaloz|1 year ago
I agree with many other sentiments here that if it can replace decompilers, then surely it can replace compilers... which feels unlikely soon. So far, I've seen four end-to-end binary-to-code AI approaches, and none have had convincing results. Even those that crawled all of GitHub continue to have issues of fabricating code, not understanding math, omitting portions of code, and (a personal irritant for me) being unable to map which address a line of decompilation came from.

However, I also acknowledge that AI can solve many pattern-based problems well. I think considerable value can be extracted from AI by focusing on micro decisions in the decompilation process, like variable types, as recent work has.
thesz|1 year ago
> Give researchers time to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top-notch decompilers that beat all engineered methods.

Decompilation is about dependencies, which makes it a graph problem.

One such graph problem is boolean satisfiability, and this particular kind of problem is extremely important. It is also very easy to gather mappings between CNF formulas and their solutions; in fact, randomization of standard benchmarks is now part of SAT competitions, AFAIK.
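For instance, generating (CNF, solution) training pairs by brute force is only a few lines (my own toy; clauses use DIMACS-style signed literals, e.g. `3` means x3, `-3` means NOT x3):

```python
from itertools import product

def solve_cnf(clauses, n_vars):
    """Brute-force: return the first satisfying assignment, else None."""
    for bits in product([False, True], repeat=n_vars):
        def lit_true(l):
            v = bits[abs(l) - 1]
            return v if l > 0 else not v
        if all(any(lit_true(l) for l in clause) for clause in clauses):
            return bits
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2)
assert solve_cnf([[1, 2], [-1, 3], [-2]], 3) == (True, False, True)
assert solve_cnf([[1], [-1]], 1) is None  # unsatisfiable
```

Cheap as the pairs are, large predictive models have not displaced engineered SAT solvers, which is the point of the question.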
Have you seen any advances there using large predictive models?
Proper decompilation is even harder; it is closer to the halting problem than to SAT. Imagine a function that gets inlined and therefore specialized. One definitely wants the source for the original function and the calls to it, not a listing of all its specializations.

This moves us into the space of "inverse guaranteed optimization", and as such it requires approximating a solution to the halting problem.
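A toy sketch of the inlining problem (entirely invented example): the source had one general function, but the optimized binary keeps only a specialized body at each call site, which is what a naive decompiler lists back.

```python
# What we *want* the decompiler to recover: one general source function.
def scale(x, k):
    return x * k

# What the optimizer actually leaves in the binary after inlining:
def caller_a_binary(x):
    return x << 1           # scale specialized for k == 2
def caller_b_binary(x):
    return x + (x << 1)     # scale specialized for k == 3

# Recovering `scale` means recognizing that both bodies are instances of
# one function -- an inverse of the optimizer's specialization.
for x in range(-4, 5):
    assert caller_a_binary(x) == scale(x, 2)
    assert caller_b_binary(x) == scale(x, 3)
```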
wzdd|1 year ago
> Decompilation, seen as a translation problem, is in every way a job that suits AI methods.

Compilation is also a translation problem, but I think many people would be leery of an LLM-based rustc or clang -- perhaps simply because they're more familiar with the complexities involved in compilation than with those involved in decompilation.

(Not to say it won't eventually happen in some form.)
donatj|1 year ago
__alexander|1 year ago
> Give researchers time to gather enough mappings between source code and machine code, get used to training large predictive models, and you shall see top-notch decompilers that beat all engineered methods.

Not anytime soon. There is more to a decompiler than assembly being converted to some language X. File parsers, disassemblers, type reconstruction, etc. all have to run before machine code can be turned into even the most basic decompiler output.
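A toy illustration of those stages (the opcodes and stubs are entirely invented): "assembly to C" is only the last step of the pipeline.

```python
# Tiny stand-in opcode table (invented, not real x86 decoding).
MNEMONICS = {0x31: "xor", 0xC3: "ret"}

def parse_file(blob):          # file parser: locate the code section
    return blob["text_section"]

def disassemble(code):         # disassembler: raw bytes -> instructions
    return [MNEMONICS[b] for b in code]

def recover_types(insns):      # type reconstruction: stub tags everything int
    return [(m, "int") for m in insns]

def emit_pseudo_c(typed):      # only now does "decompilation" happen
    return "; ".join(f"{t} {m}" for m, t in typed)

binary = {"text_section": [0x31, 0xC3]}
assert emit_pseudo_c(recover_types(disassemble(parse_file(binary)))) == "int xor; int ret"
```

Each real stage is its own research area; an end-to-end model would have to get all of them right at once.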
kachapopopow|1 year ago
loloquwowndueo|1 year ago
mahaloz|1 year ago
FusspawnUK|1 year ago
It seems to be very capable of having some understanding of what the original code would do.

For instance, I was feeding it some game decomp: a function looking for an entity in a 3D array of tiles. It somehow inferred it was an array of tiles and that it was hunting for a specific entity. None of the decomp I fed it had any variable/function names or comments, just the usual var1, var2, etc.

How did it know what the underlying code was doing?
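A hypothetical reconstruction (my own toy, not the commenter's actual decomp) of the kind of stripped code involved: nested index arithmetic like `(i*H + j)*W + k` survives stripping and is a strong "flattened 3-D array" signal a model can pick up on.

```python
# Invented stripped decompilation: no names, only the arithmetic remains.
def sub_401000(var1, var2, var3, var4, var5):
    # var1: flat buffer; var2..var4: coordinates; var5: (depth, height, width)
    return var1[(var2 * var5[1] + var3) * var5[2] + var4]

grid = list(range(2 * 3 * 4))  # a 2x3x4 "tile map" flattened row-major
assert sub_401000(grid, 1, 2, 3, (2, 3, 4)) == 23  # (1*3 + 2)*4 + 3 == 23
```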
ellis0n|1 year ago
mips_avatar|1 year ago
pjc50|1 year ago
auguzanellato|1 year ago
mahaloz|1 year ago
makz|1 year ago
fulafel|1 year ago