No source? The first thing I did was try it on itself... which I suppose is somewhat of an "acid test". It took a few minutes and an enormous amount of memory, but finally it told me that the function at the very beginning of the executable, definitely a nontrivial one, decompiles to...
void fun_401000() {
}
I'm sure I hit upon some edge case, and much better output can be had from this tool if I play with it some more, but as a first impression, not so good. I'm definitely going to keep this one around, though; it looks promising.
The original and most powerful disassembler is IDA Pro. The project was started in the 90s and has been used for security analysis, antivirus work, protection analysis and research, and hacks, as well as normal development work in closed-source ecosystems.
The author has implemented a decompiler plugin on top of IDA, and it works on real-world code. The approach is to annotate the disassembly bottom-up and then decompile.
IDA is great, but for day-to-day work, especially if you're a casual user, Hopper is a really strong contender, and it's (amazingly) even cheaper than IDA. Hopper was also designed from the get-go to have a first-class Python interface, and it includes a workable decompiler.
C and C++ are different languages. I only saw C examples. How would you even decompile to C++? The only C++ information you have in the object code is the mangled names. How do you use that to get C++ code?
And vtables, RTTI data, systematic sequences of operations for invoking base classes, intrinsic/library functions emitted by the compiler in specific situations, exception handling tables, ...
The first C++ compiler (Cfront) compiled from C++ to C.
Once you get some C code that does things with structures and function pointers like a C++ compiler would do, I think it's not impossible to turn those back into classes if you can recognise the patterns that a C++ compiler uses to compile C++ constructs like classes, virtual functions, etc.
Interesting that gcc removes the \n from the string and calls puts() directly - this avoids the overhead of parsing the string for non-existent format specifiers.
The decompiler could do with a bit of work making dynamic library imports more symbolic. Following the puts call chain quickly disappears into a non-local jump to an address with no further references.
Pretty impressive! I haven't used any disassembly tools in years, and I only remember that the last time I did, I found them useless. Not sure if that was due to my lack of understanding or to the generated output, or a combination of both.
This thing, however: I fed it an OpenGL test app that doesn't do much but still has hundreds of lines of modern C++ spread over different libraries, and I could clearly recognize lots of my functions in the output and follow some program flow starting from main. It was still hard, but at least I didn't feel completely lost like years ago.
This is extremely useful for analysis. Even if you understand x86 ASM, it allows you to quickly jump around a lot more efficiently than you otherwise would.
It won't, for me, recompile back into the source application. So that is a limitation, but even with that limitation it is extremely useful (and the fact that it links the C/C++ back to the ASM makes altering the ASM directly trivial).
I wonder if a statistical machine translation approach could be usefully applied here. Get tons of source from GitHub, compile it with every compiler available, and train on the result. It would be challenging to compile automatically at scale, to align the code with the source, or to get a source representation invariant to identifiers, but it should be doable.
It's both harder and easier. There is a mechanical transformation, without un- or approximately translatable idioms like natural language. On the other hand, the dependency chain is much more complex - with something like link-time optimization, a change to one part of the code can completely change the result (for instance, if it suddenly allows inlining of a function everywhere). There is also the problem of, if not "idiom translation", "idiom generation" - people write code in a particular style that may not be captured by the generated output, even if it compiles the same.
Targeting something like Clang specifically, where you have access not only to the assembler & a potential source, but also a whole AST & intermediate data structures, would be pretty interesting.
It does always evaluate to true. I honestly can't figure out why it's there; I've been googling what the 'frame_dummy' function is supposed to do, and the only information I've found is something about 'setting up the exception frame'. All the code in that function does, though, is force a seg-fault if the test code you posted fails, so I'm not sure what it accomplishes.
This will totally fail for optimised code if it is just using object code without debug information.
There is no information in the resulting machine code that can indicate whether some code has been inlined or not.
Basically any optimisations performed by the compiler will throw this decompilation off.
I question whether you can get any real use out of this...
I don't think the market for this is recovering the actual original code. It's more about understanding what a particular program does: when you see it at a higher level, it's much easier to understand than raw assembly.
There's an enormous community of people who spend all their time worrying about the contents or behavior of Windows binaries. I've met some of them through my work, like malware analysts who deal with malware that's part of phishing attacks. The phishers will often prefer to create Windows-only attacks because Windows has such a commanding market share lead among most populations of phishing targets; in turn, that's what the people trying to defend against or mitigate those attacks will study. To folks in that sort of field, "binary" is virtually synonymous with "Windows binary"!
I guess also historically most of the tools for creating, modifying, and examining binaries for a given platform have been native to that platform, rather than cross tools. That's surely because most people (with the exception of embedded developers) do much more native development than cross development. I can get a small number of packages on my Linux machine that will deal with Windows executables in some relatively shallow way, but I have tons of programs already installed that do complicated and specific things to Linux ELF binaries even though I don't typically use those programs on a day-to-day basis.
userbinator | 11 years ago
webkike | 11 years ago
os_ | 11 years ago
https://www.hex-rays.com/products/ida/index.shtml
https://www.hex-rays.com/products/decompiler/index.shtml
I don't want to bash the author of Snowman - this kind of research is serious fun. Yet, IDA has an insane lead.
tptacek | 11 years ago
pjmlp | 11 years ago
DigitalJack | 11 years ago
jordigh | 11 years ago
_wmd | 11 years ago
pjmlp | 11 years ago
Back in the 90's there was one targeted specifically at executables produced by Borland compilers.
userbinator | 11 years ago
nobotty | 11 years ago
[deleted]
barrkel | 11 years ago
nes350 | 11 years ago
RDeckard | 11 years ago
72deluxe | 11 years ago
[EDIT: Very informative replies below, thanks!]
qzc4 | 11 years ago
stinos | 11 years ago
Someone1234 | 11 years ago
fla | 11 years ago
3rd3 | 11 years ago
ntoshev | 11 years ago
fiatmoney | 11 years ago
daguu | 11 years ago
if (__JCR_END__ == 0 || 1) { return; }
DSMan195276 | 11 years ago
fnordfnordfnord | 11 years ago
unknown | 11 years ago
[deleted]
TickleSteve | 11 years ago
anemic | 11 years ago
unknown | 11 years ago
[deleted]
m00dy | 11 years ago
schoen | 11 years ago
astrange | 11 years ago
Or try and fix up Boomerang on other OSes, I suppose.
cinch | 11 years ago
J_Darnley | 11 years ago