Sorry for the outages, friends. We're actively working on getting it able to handle higher load but we knew that if we hit HN we'd be swamped no matter what we did. We're spinning up more workers and fixing obvious perf issues as we see them, but if it's not available when you try it make sure to check back later!
But I wonder if we eventually go full circle and it becomes easier and cheaper to send a wasm Linux kernel with virtual disk access over websockets instead of processing stuff server side.
I feel like the decompiler space is a little stuck? I mostly go with Hex-Rays out of habit and because I'm used to IDA, but I haven't really seen x64 decompiler output noticeably improve in recent releases.
A lot of my colleagues use Ghidra a lot now and complain about its decompiler regularly.
Is there any new approach in the works? Maybe something ML-based for optimization? Would be sad if Hex-Rays output is "as good as it's gonna get".
> A lot of my colleagues use Ghidra a lot now and complain about its decompiler regularly.
Are your colleagues decompiling obfuscated code (for example malware)? Publicly available decompilers don't work well for that, but I assume that many specialists have their own little improvements and plugins that they don't share with others because it's their core business.
For non-obfuscated code, Ghidra has served me very well, even for entire applications. Often, it has to be pushed in the right direction (for example, by manually specifying the type of a variable) and it sometimes misses some obvious simplifications, especially when arrays are involved, but I think those issues could be solved relatively easily by polishing/extending its heuristics. Nothing where I would say that ML is needed, although it would be possible. In the end, most programs contain the same patterns and an ML-based system could help identify them.
But yeah, obfuscated code, that's something else. There are some academic publications about the usage of ML for that. No idea what's happening inside the company labs, though.
Rellic [1] implements an algorithm that generates goto-free control flow (citation in README), which would be a significant improvement over what Ghidra/IDA currently generate.
Unfortunately it looks like the maintenance state of the pieces around Rellic isn't very good, and it's quite rocket science to get it building. It doesn't have as much UI/GUI as Ghidra either so it's a bit far from accessible right now.
> Is there any new approach in the works? Maybe something ML-based for optimization?
I'm doing a PhD on this.
My goal is to detect known functions from obfuscated binaries.
The biggest challenge by far is building a good dataset. Unlike computer vision (millions of pictures with the label "dog"), the number of training examples for a typical function is one. For now I'm focusing on C standard libraries, since there are a handful of real-world implementations plus some FOSS or student samples available for things like strlen and atoi.
If anyone wants to collaborate, feel free to message me.
Can any of these decompilers make effective use of a Microsoft PDB file, if I have one, to include original symbols in the decompiled output? What I'd really like to do with a decompiler is feed it a final compiled EXE or DLL of my own code and see what it looks like after it's been run through whole-program optimization. In that case, of course, I have a PDB file.
Ha! That was funny. I wonder, though: since it gets fed tons of code, couldn't Godbolt leverage its code -> compiler object -> assembly pipeline as a means to train an AI decompiler? Food for thought.
I've always wondered about this. Compilers do a LOT of irreversible stuff. For example, symbol names usually aren't needed (unless you have a reflective language).
Where AI would really shine is reversing the (only seemingly reversible) optimizations. For example, GCC converts "x * 14" into "(x << 4) - x - x". Of course, you can never be 100% sure the programmer didn't actually want "shift left by four followed by two subtractions", but I'm convinced that 99% of the code I write is fairly predictable and statistically similar to whatever giant codebase you train it on.
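To make the arithmetic concrete, here is a small sketch checking that the shift-and-subtract form quoted above agrees with a plain multiply by 14. The function names and the 32-bit wraparound are assumptions for illustration, not anything taken from GCC itself.

```python
def times14_shift_sub(x: int, bits: int = 32) -> int:
    """Compute x * 14 as (x << 4) - x - x, wrapped to an unsigned `bits`-wide value."""
    mask = (1 << bits) - 1
    return ((x << 4) - x - x) & mask


def times14_mul(x: int, bits: int = 32) -> int:
    """Compute x * 14 with an ordinary multiply, wrapped the same way."""
    mask = (1 << bits) - 1
    return (x * 14) & mask


if __name__ == "__main__":
    # 16x - x - x == 14x, so both forms must agree, including at the wraparound edge.
    samples = [0, 1, 7, 12345, 2**31 - 1, 2**32 - 1]
    assert all(times14_shift_sub(x) == times14_mul(x) for x in samples)
    print("both forms agree on", len(samples), "samples")
```

A decompiler (ML-based or not) that wants to print `x * 14` has to recognise this kind of pattern and decide that the multiply is the more likely original.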
Throwing AI at the problem might not actually be the worst suggestion. I wonder how the likes of copilot model the AST. Heh, you might even be able to build an approximation of a compiler using AI.
Nope, Ilfak gave us a license for it, and as Binary Ninja devs we're using a legitimately licensed copy of Binary Ninja as well. All above board, and we're hoping to add more commercial decompilers in the future too, provided we can integrate them and the companies behind them are willing.
RE: Demand. We just got 2x the workers, but as the East Coast wakes up I'm not confident it'll hold up too well. Several of the decompilers are... VERY resource intensive, so there's really no good way to scale to heavy demand without an exorbitant amount of compute.
Eventually a better queue system with better pre-processing to filter invalid things is on our todo list.
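As a rough idea of what such pre-processing could look like (this is a sketch, not the site's actual code), an upload could be checked for a known executable magic number before it is queued. The names `sniff_format` and `should_enqueue` and the 2 MiB size cap are all hypothetical.

```python
from typing import Optional

# Well-known magic numbers at the start of common executable formats.
MAGIC_PREFIXES = {
    b"\x7fELF": "ELF",
    b"MZ": "PE/MZ",
    b"\xfe\xed\xfa\xce": "Mach-O",  # 32-bit, big-endian on disk
    b"\xfe\xed\xfa\xcf": "Mach-O",  # 64-bit, big-endian on disk
    b"\xce\xfa\xed\xfe": "Mach-O",  # 32-bit, little-endian on disk
    b"\xcf\xfa\xed\xfe": "Mach-O",  # 64-bit, little-endian on disk
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name if the upload starts with a known magic number."""
    for magic, name in MAGIC_PREFIXES.items():
        if data.startswith(magic):
            return name
    return None


def should_enqueue(data: bytes, max_size: int = 2 * 1024 * 1024) -> bool:
    """Reject empty, oversized, or unrecognised uploads before they reach a worker."""
    return 0 < len(data) <= max_size and sniff_format(data) is not None


if __name__ == "__main__":
    print(should_enqueue(b"\x7fELF\x02\x01\x01"))  # recognised header
    print(should_enqueue(b"just some text"))       # not an executable
```

A check like this only looks at the first few bytes, so it filters obvious junk cheaply but does not prove the file is decompilable.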
Yeah, sorry about that. We're working on getting it up again but no promises. I'm on vacation in Europe while the rest of the team is about to head to sleep so might be a bit before we have it more stable.
Long, long ago a friend lost the source to a CP/M program, and wrote ReSource to help re-create the 8080 assembler source from the executable. I ported it as Com2Asm, back in the MS-DOS days... I wonder how good things are now.
How long should I give this thing to run? My upload was 250k.
A few years ago I tried Hex-Rays/IDA. It gave me reasonable information about program control flow and helped me do hot reverse-engineering without source code. A few years later, Hex-Rays/IDA still seems to be the one that gives the most useful information, even for hello-world examples.
I remember one of these projects coming up on my GitHub homepage, but I never tried it. This is probably the only space where I don't feel left out for not constantly following the updates, compared to the JS space, etc.
Nope, not when you ask them and they provide the license.
This is being run with the permission of all the commercial products. In fact, we (Binary Ninja) and Hex-Rays (once I figure out the exact mechanism with Ilfak) are the ones actually paying hosting costs! It's both good for the community and hopefully shows off the value in commercial decompilers. :-)
The commercial versions of IDA still do COM and MZ executables as far as I'm aware, although it's been dropped from the free versions since 5. (Which is still available from ScummVM's reverse engineering page, but is only a disassembler and doesn't come with the decompiler)
Ghidra does a vaguely OK job with MZ executables, providing they've been unpacked first. It really struggles to represent DOS function calls properly, you'll find arguments go missing from the decompiled code. There are some third-party plugins which improve things a bit. And it doesn't have signatures for any of the libraries so the output will just be a lot of `if ((var26 & 0xFEEE) && var42 > 0xE0) { ... }` and it's up to you to work out when one of the variables is actually a pointer to video memory or whatever.
Reko can also decompile this era of code, it does have a tendency to crash on any more complex program but will be fine for simple files. Similar problem that the decompiled pseudo-C code doesn't really illuminate what's going on any more than just reading the disassembled x86 assembly language and walking through any tricky sections in the DOSBox debugger does. Without all of the Win32 API calls modern programs make there's a lot more work needed to figure out what's going on.
Personally I find I also end up needing to use a vintage tool like Sourcer alongside the modern ones, because the newer stuff doesn't annotate things which were common in the era like directly referencing the BIOS data area or reading the interrupt table from memory rather than using the DOS calls for it. It's that or spending a lot of your reverse-engineering time discovering how things were done in the DOS days.
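For readers who didn't live through the DOS days, a minimal sketch of the addressing involved may help. In real mode a segment:offset pair maps to a linear address as segment * 16 + offset, the interrupt vector table starts at linear address 0 with one 4-byte entry per interrupt (offset word first, then segment word), and the BIOS data area sits at 0040:0000. The helper names below are made up for illustration, and the `memory` argument is assumed to be a raw dump of low memory.

```python
import struct
from typing import Tuple


def linear_address(segment: int, offset: int) -> int:
    """Convert a real-mode segment:offset pair to a linear address."""
    return (segment << 4) + offset


BIOS_DATA_AREA = linear_address(0x0040, 0x0000)     # 0x00400
TEXT_VIDEO_MEMORY = linear_address(0xB800, 0x0000)  # 0xB8000, colour text mode


def read_interrupt_vector(memory: bytes, interrupt: int) -> Tuple[int, int]:
    """Read the handler for `interrupt` from a dump of low memory.

    Each entry is two little-endian 16-bit words: offset, then segment.
    Returns (segment, offset).
    """
    offset, segment = struct.unpack_from("<HH", memory, interrupt * 4)
    return segment, offset


if __name__ == "__main__":
    # Fake 1 KiB table with a made-up handler for INT 21h at 1234:5678.
    table = bytearray(1024)
    struct.pack_into("<HH", table, 0x21 * 4, 0x5678, 0x1234)
    seg, off = read_interrupt_vector(bytes(table), 0x21)
    print(f"INT 21h -> {seg:04X}:{off:04X} (linear {linear_address(seg, off):05X})")
```

So when decompiled output masks or compares against constants like 0x400 or 0xB800, it is worth checking whether the value is really one of these well-known locations.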
BinaryNinja:
Error decompiling: Traceback (most recent call last):
  File "decompile_bn.py", line 66, in <module>
    main()
  File "decompile_bn.py", line 13, in main
    t = tempfile.NamedTemporaryFile()
  File "/usr/local/lib/python3.8/tempfile.py", line 531, in NamedTemporaryFile
    prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 117, in _sanitize_params
    dir = gettempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']
Hex-Rays:
Error decompiling: /tmp/tmpanbyzjw9/tmpqx8sjhpv: is not decompilable
angr and Ghidra still waiting at 150 seconds and counting....
320 seconds and counting....
Boomerang:
Error decompiling: Traceback (most recent call last):
  File "decompile_boomerang.py", line 57, in <module>
    main()
  File "decompile_boomerang.py", line 14, in main
    with tempfile.TemporaryDirectory() as tempdir:
  File "/usr/local/lib/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 347, in mkdtemp
    prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 117, in _sanitize_params
    dir = gettempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']
RecStudio:
Error decompiling: Traceback (most recent call last):
  File "decompile_recstudio.py", line 59, in <module>
    main()
  File "decompile_recstudio.py", line 14, in main
    with tempfile.TemporaryDirectory() as tempdir:
  File "/usr/local/lib/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 347, in mkdtemp
    prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 117, in _sanitize_params
    dir = gettempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']
Reko:
Error decompiling: Traceback (most recent call last):
  File "decompile_recstudio.py", line 59, in <module>
    main()
  File "decompile_recstudio.py", line 14, in main
    with tempfile.TemporaryDirectory() as tempdir:
  File "/usr/local/lib/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 347, in mkdtemp
    prefix, suffix, dir, output_type = _sanitize_params(prefix, suffix, dir)
  File "/usr/local/lib/python3.8/tempfile.py", line 117, in _sanitize_params
    dir = gettempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 286, in gettempdir
    tempdir = _get_default_tempdir()
  File "/usr/local/lib/python3.8/tempfile.py", line 218, in _get_default_tempdir
    raise FileNotFoundError(_errno.ENOENT,
FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/home/decompiler_user']
RetDec and Snowman are the only ones that work on a sample app supplied.
If I get time, I'll upload another app to test it, which will introduce a new technique.
I know for a fact Ghidra should work because I've used it myself.
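For context on the tracebacks above: Python's `tempfile` module raises this `FileNotFoundError` when it cannot create a test file in any of its candidate directories, which typically points at a full disk, a read-only filesystem, or missing permissions rather than at the uploaded binary. Below is a minimal sketch of probing the same kind of candidate list by hand. The helper name `first_usable_tempdir` is hypothetical and this is not how the site's worker scripts are written.

```python
import os
import tempfile
from typing import Iterable, Optional


def first_usable_tempdir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first directory in which a temporary file can actually be created."""
    for directory in candidates:
        try:
            # Creating (and immediately discarding) a file is the real test;
            # checking permissions alone misses full or read-only filesystems.
            with tempfile.TemporaryFile(dir=directory):
                return directory
        except OSError:
            continue
    return None


if __name__ == "__main__":
    # Similar to the list in the error message, with the working directory last.
    candidates = ["/tmp", "/var/tmp", "/usr/tmp", os.getcwd()]
    usable = first_usable_tempdir(candidates)
    if usable is None:
        print("no usable temporary directory; check disk space and permissions")
    else:
        print("usable temporary directory:", usable)
```

If a writable location exists elsewhere, setting the `TMPDIR` environment variable to it before starting the script is usually enough, since `tempfile` consults that variable first.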
Was this due to load or server restarts or are you still seeing errors? Pass me a GUID either publicly or privately (my handle on twitter accepts DMs or an email address at my handle.com as a domain) if you don't mind and I can take a closer look.
[+] [-] psifertex|3 years ago|reply
[+] [-] athrowaway3z|3 years ago|reply
[+] [-] T3RMINATED|3 years ago|reply
[deleted]
[+] [-] meibo|3 years ago|reply
[+] [-] tralarpa|3 years ago|reply
[+] [-] ishitatsuyuki|3 years ago|reply
[1]: https://github.com/lifting-bits/rellic
[+] [-] hoosieree|3 years ago|reply
[+] [-] develatio|3 years ago|reply
[+] [-] baby|3 years ago|reply
[+] [-] ykl|3 years ago|reply
(For anyone that doesn't get it; it's a play on Godbolt)
[+] [-] jraph|3 years ago|reply
Being able to swap two letters from a name and get something nice like this is lucky.
Godbolt is quite a name.
[+] [-] psifertex|3 years ago|reply
[+] [-] mwcampbell|3 years ago|reply
[+] [-] enragedcacti|3 years ago|reply
[+] [-] ok123456|3 years ago|reply
[+] [-] spaintech|3 years ago|reply
[+] [-] KMnO4|3 years ago|reply
[+] [-] sargun|3 years ago|reply
[+] [-] tralarpa|3 years ago|reply
[+] [-] thesz|3 years ago|reply
[+] [-] no_time|3 years ago|reply
EDIT: The site has changed in multiple ways in the last 30 minutes I've been trying to submit my sample. Best of luck in keeping up with demand.
[+] [-] psifertex|3 years ago|reply
[+] [-] saagarjha|3 years ago|reply
[+] [-] rfoo|3 years ago|reply
[+] [-] unnouinceput|3 years ago|reply
[+] [-] psifertex|3 years ago|reply
[+] [-] unknown|3 years ago|reply
[deleted]
[+] [-] mikewarot|3 years ago|reply
[+] [-] WiSaGaN|3 years ago|reply
Decompiler Explorer: dogbolt.org
Compiler Explorer: godbolt.org
[+] [-] chazeon|3 years ago|reply
[+] [-] smcl|3 years ago|reply
[+] [-] unknown|3 years ago|reply
[deleted]
[+] [-] jiggawatts|3 years ago|reply
[+] [-] lpcvoid|3 years ago|reply
[+] [-] psifertex|3 years ago|reply
[+] [-] cinntaile|3 years ago|reply
[+] [-] kangalioo|3 years ago|reply
[+] [-] 1wd|3 years ago|reply
[+] [-] MattKimber|3 years ago|reply
[+] [-] mobilio|3 years ago|reply
[+] [-] Terry_Roll|3 years ago|reply
[+] [-] psifertex|3 years ago|reply
[+] [-] Terry_Roll|3 years ago|reply