item 8426349

A native code to C/C++ decompiler

137 points | fla | 11 years ago | derevenets.com

54 comments

[+] userbinator|11 years ago|reply
No source? The first thing I did was try it on itself... which I suppose is somewhat of an "acid test". It took a few minutes and an enormous amount of memory, but finally it told me that the function at the very beginning of the executable, definitely a nontrivial one, decompiles to...

    void fun_401000() {
    }
I'm sure I hit upon some edge case and much better output can be had from this tool if I play with it some more, but for a first impression, not so good. But I'm definitely going to keep this one around, it looks promising.
[+] webkike|11 years ago|reply
For all we know, that result is a complete facsimile of the original source. But then again, we don't have the source.
[+] os_|11 years ago|reply
The original, and still the most powerful, disassembler is IDA Pro. The project was started in the '90s and has been used for security analysis, antivirus work, protection analysis/research, and hacks, as well as normal dev work in closed-source ecosystems.

https://www.hex-rays.com/products/ida/index.shtml

The author has implemented a decompiler plugin on top of IDA, and it works on real-world code. The point is to annotate the disassembly bottom-up and then decompile.

https://www.hex-rays.com/products/decompiler/index.shtml

I don't want to bash the author of Snowman - this kind of research is serious fun. Yet, IDA has an insane lead.

[+] tptacek|11 years ago|reply
IDA is great, but for day-to-day work, especially if you're a casual user, Hopper is a really strong contender, and it's (amazingly) even cheaper than IDA. Hopper was also designed from the get-go to have a first-class Python interface, and it includes a workable decompiler.
[+] pjmlp|11 years ago|reply
IDA was not the first one. There have been a few others since the '80s.
[+] DigitalJack|11 years ago|reply
It's also expensive. I'm sure it's worth it from all I've heard, but it's unlikely I'll ever have the money for hobby work.
[+] jordigh|11 years ago|reply
C and C++ are different languages. I only saw C examples. How would you even decompile to C++? The only C++ information you have in the object code is the mangled names. How do you use that to get C++ code?
[+] _wmd|11 years ago|reply
And vtables, rtti data, systematic sequences of ops for invoking base classes, intrinsic/library functions emitted by the compiler in specific situations, exception handling tables, ...
[+] pjmlp|11 years ago|reply
As some other decompilers do, by having a knowledge pool of specific compilers.

Back in the '90s there was one targeted specifically at executables produced by Borland compilers.

[+] userbinator|11 years ago|reply
The first C++ compiler (Cfront) compiled from C++ to C.

Once you get some C code that manipulates structures and function pointers the way a C++ compiler would, I think it's not impossible to turn those back into classes, provided you can recognise the patterns a C++ compiler uses to compile C++ constructs like classes, virtual functions, etc.

[+] barrkel|11 years ago|reply
Interesting that gcc removes the \n from the string and calls puts() directly - this avoids the overhead of parsing the string for non-existent format specifiers.

The decompiler could do with a bit of work making dynamic library imports more symbolic. Following the puts call chain quickly disappears into a non-local jump to an address with no further references.

[+] 72deluxe|11 years ago|reply
That "hello world" decompilation is complex!

[EDIT: Very informative replies below, thanks!]

[+] qzc4|11 years ago|reply
It's because of the #include <stdio.h>, isn't it?
[+] stinos|11 years ago|reply
Pretty impressive! I haven't used any disassembly tools in years, and I only remember that the last time I did, I found the output useless. Not sure if that was due to my lack of understanding, the quality of the generated output, or a combination of both. This thing, however: I fed it an OpenGL test app which doesn't do much but still has hundreds of lines of modern C++ spread over different libraries, and it could clearly recognize lots of my functions and follow some program flow starting from main. It's still hard, but at least I didn't feel completely lost like I did years ago.
[+] Someone1234|11 years ago|reply
This is extremely useful for analysis. Even if you understand x86 ASM, it lets you jump around a lot more efficiently than you otherwise would.

It won't, for me, recompile back into the source application. That is a limitation, but even so it is extremely useful (and the fact that it links the C/C++ back to the ASM makes altering the ASM directly trivial).

[+] fla|11 years ago|reply
This and also the fact it's available as an IDA plugin.
[+] 3rd3|11 years ago|reply
I’m wondering whether one could use machine learning and C/C++ code from GitHub to find reasonable variable names automatically.
[+] ntoshev|11 years ago|reply
I wonder if a statistical machine translation approach could be usefully applied here. Get tons of source from GitHub, compile it with every compiler available, and train on the result. It would be challenging to compile automatically at scale, to align the code with the source, and to get a source representation invariant to identifiers, but it should be doable.
[+] fiatmoney|11 years ago|reply
It's both harder and easier. There is a mechanical transformation, without un- or approximately translatable idioms like natural language. On the other hand, the dependency chain is much more complex - with something like link-time optimization, a change to one part of the code can completely change the result (for instance, if it suddenly allows inlining of a function everywhere). There is also the problem of, if not "idiom translation", "idiom generation" - people write code in a particular style that may not be captured by the generated output, even if it compiles the same.

Targeting something like Clang specifically, where you have access not only to the assembler & a potential source, but also a whole AST & intermediate data structures, would be pretty interesting.

[+] daguu|11 years ago|reply
I'm brand new to C, but wouldn't this line from the hello world example always evaluate to true?

    if (__JCR_END__ == 0 || 1) { return;

[+] DSMan195276|11 years ago|reply
It does always evaluate to true. I honestly can't figure out why it's there; I've been googling what the 'frame_dummy' function is supposed to do, and the only information I've found is something about 'setting up the exception frame'. All the code in that function does, though, is force a seg-fault if the test you posted fails, so I'm not sure what it accomplishes.
[+] fnordfnordfnord|11 years ago|reply
If __JCR_END__ were always a boolean, yes.
[+] TickleSteve|11 years ago|reply
This will totally fail for optimised code if it is just using object code without debug information. There is no information in the resulting machine code that indicates whether some code has been inlined. Basically, any optimisations performed by the compiler will throw this decompilation off.

I question whether you can get any real use out of this...

[+] anemic|11 years ago|reply
I don't think the market for this is to get the actual original code. It's more like understanding what a particular program does: when you see it on a higher level it's much easier to understand the code than reading raw assembly.
[+] m00dy|11 years ago|reply
Why only Windows? I couldn't get it.
[+] schoen|11 years ago|reply
There's an enormous community of people who spend all their time worrying about the contents or behavior of Windows binaries. I've met some of them through my work, like malware analysts who deal with malware that's part of phishing attacks. The phishers will often prefer to create Windows-only attacks because Windows has such a commanding market share lead among most populations of phishing targets; in turn, that's what the people trying to defend against or mitigate those attacks will study. To folks in that sort of field, "binary" is virtually synonymous with "Windows binary"!

I guess also historically most of the tools for creating, modifying, and examining binaries for a given platform have been native to that platform, rather than cross tools. That's surely because most people (with the exception of embedded developers) do much more native development than cross development. I can get a small number of packages on my Linux machine that will deal with Windows executables in some relatively shallow way, but I have tons of programs already installed that do complicated and specific things to Linux ELF binaries even though I don't typically use those programs on a day-to-day basis.

[+] cinch|11 years ago|reply
"The standalone decompiler runs fine under Wine."
[+] J_Darnley|11 years ago|reply
I'm kind of disappointed that there isn't a version available for IDA 5.0. Yeah, I'm cheap.