
Snowman native code to C/C++ decompiler for x86/x86_64/ARM

83 points | pabs3 | 3 years ago | github.com

37 comments

[+] kubb|3 years ago|reply
How about this for an idea: a decompiler that uses Machine Learning to name the decompiled variables and functions. Would be nice even if it worked only sometimes.
[+] kuroguro|3 years ago|reply
For an AI-less solution there is IDA's Lumina, which works pretty well. There's also a reverse-engineered server for it [0], so you could build plugins for other disassemblers/decompilers to use with non-official servers.

It basically hashes machine code (with address parts removed) [1], then when reverse engineers label and push symbols to the server (or get them from some debug build), others can pull and see what the functions are called in completely unrelated projects, that use the same libraries / have the same functions.

[0] https://abda.nl/lumen/ [1] https://github.com/naim94a/lumen/issues/2
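To make the "hash machine code with address parts removed" idea concrete, here is a toy sketch. It is not Lumina's actual algorithm: the function name `masked_hash`, the choice of SHA-256, and the hand-written byte mask are all assumptions made for illustration. A real tool would derive the mask from a disassembler rather than hard-code it.

```python
import hashlib


def masked_hash(code: bytes, mask: bytes) -> str:
    """Hash function bytes after zeroing every bit that is cleared in mask.

    Hypothetical helper: 0xFF in the mask keeps a byte, 0x00 blanks it out
    (used here for address/offset bytes that change between builds).
    """
    if len(code) != len(mask):
        raise ValueError("code and mask must have the same length")
    normalized = bytes(c & m for c, m in zip(code, mask))
    return hashlib.sha256(normalized + mask).hexdigest()


# x86: `call rel32` (E8 + 4 offset bytes) followed by `ret` (C3).
# Keep the opcodes, blank out the 4-byte call offset.
KEEP_OPCODES = bytes([0xFF, 0x00, 0x00, 0x00, 0x00, 0xFF])

copy_a = bytes.fromhex("e810000000c3")  # call +0x10; ret
copy_b = bytes.fromhex("e844220000c3")  # same code, different call target
other = bytes.fromhex("e810000000cc")   # call +0x10; int3 (different function)

# A stand-in for the shared symbol server: hash -> name pushed by someone else.
known_names = {masked_hash(copy_a, KEEP_OPCODES): "call_helper"}

# The same function in an unrelated binary resolves to the pushed name.
print(known_names.get(masked_hash(copy_b, KEEP_OPCODES)))
print(known_names.get(masked_hash(other, KEEP_OPCODES)))
```

The lookup only hits on exact matches after masking, which is why this approach works for shared library code but not for functions that were compiled differently.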

[+] Andoryuuta|3 years ago|reply
I'm surprised nobody has mentioned DIRE[0] yet. They did exactly this and got some very impressive results.

[0]: https://arxiv.org/abs/1909.09029 / J. Lacomis et al., "DIRE: A Neural Approach to Decompiled Identifier Naming," 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 628-639, doi: 10.1109/ASE.2019.00064.

[1]: https://github.com/pcyin/dire

[+] glouwbug|3 years ago|reply
It's certainly possible - Compile all the C projects on github with `gcc -O0`. Map statements, blocks, or functions to ASM output. Put everything in a giant SQL. Repeat for all of gcc's compiler flags.

Wait, did I say it was possible? I'm curious what a neural netted compiler would produce. Probably your average CRUD software.
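For what it's worth, the "map functions to ASM output" step could look roughly like the sketch below. It assumes the assembly text was already produced by something like `gcc -S -O0`; the sample here is hand-written in that style rather than captured from a real build, and `split_functions` is a made-up helper name. Storing the pairs (in SQL or anywhere else) is left out.

```python
import re

# Hand-written sample in the style of `gcc -S -O0` output (illustrative only,
# not captured from a real compiler run).
SAMPLE_ASM = """\
    .text
    .globl  add
    .type   add, @function
add:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %edx
    movl    -8(%rbp), %eax
    addl    %edx, %eax
    popq    %rbp
    ret
    .size   add, .-add
    .globl  answer
    .type   answer, @function
answer:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $42, %eax
    popq    %rbp
    ret
    .size   answer, .-answer
"""

# A global label at column 0, e.g. `add:`. Local labels such as `.L2:` do not match.
LABEL = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*):")


def split_functions(asm_text):
    """Return {function_name: [instruction lines]} for each global label."""
    functions = {}
    current = None
    for line in asm_text.splitlines():
        match = LABEL.match(line)
        if match:
            current = match.group(1)
            functions[current] = []
            continue
        stripped = line.strip()
        # Skip blank lines, assembler directives and local labels (all start with ".").
        if current is None or not stripped or stripped.startswith("."):
            continue
        functions[current].append(stripped)
    return functions


pairs = split_functions(SAMPLE_ASM)
for name, body in pairs.items():
    print(name, len(body), "instructions")
```

Each entry in `pairs` is one (name, assembly) training example; repeating this per optimization level is where the combinatorial explosion comes from.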

[+] planede|3 years ago|reply
It's also something that could be somewhat easy to get a lot of learning material for.
[+] blueflow|3 years ago|reply
If we wrote some code in Rust, compiled it to, for example, x86_64, and then decompiled it to C, it would be perfectly memory-safe C code, right?
[+] kalnins|3 years ago|reply
As Rust is LLVM-based, you don't need to compile it to C. Just write a backend that translates LLVM IR to C instead of x86_64. The IR is very C-looking; it's probably overly complex.

When compiling down to asm, a lot of information is lost regarding memory layout etc., so it's not the best source for generating code.

[+] pjc50|3 years ago|reply
Yes, but it wouldn't necessarily be readable. And it definitely wouldn't be portable!
[+] vnorilo|3 years ago|reply
Some architectural/program assumptions may not be encoded in assembly or preserved in the asm -> C -> asm roundtrip, especially if the assemblers are for different architectures. The obvious example is pointer word size and memory model.
[+] thargor90|3 years ago|reply
I would guess that writing a decompiler that gets C semantics 100% correct is also really hard. Think about inline assembly, memory barriers, etc.

I don't think this would work, unless the C file just contains inline assembly.

[+] astrange|3 years ago|reply
It wouldn't look much like what you'd expect C to look like. If it did decompile back to idiomatic C it could introduce some kind of aliasing bug along the way that'd make it not so memory safe.
[+] shin_lao|3 years ago|reply
Nothing is perfectly memory-safe. Also, not sure I would see the point of this translation?
[+] xphos|3 years ago|reply
Critique for the author: include a picture of the decompiled output so people can get a feel for it and judge the quality. I wanted to check it out, but I was on my phone, so I can't quite build and run the repo.
[+] richardfey|3 years ago|reply
A few months ago I was looking for a 16-bit equivalent. Unfortunately I couldn't find a Snowman for Win3.1!
[+] marcelluscat|3 years ago|reply
as someone who's used snowman for ctfs, here's my recommendation for it.