Ask HN: How are you using LLMs for traversing decompiler output?

105 points| mjbale116 | 1 year ago

I need to reverse a binary made years ago, and I have zero experience with cpp, so I think it would be a good experiment to get an LLM to help me in any way it can.

45 comments

carom|1 year ago

Binary Ninja has an AI integration called Sidekick. It has a free trial, but I'm not sure it can be used in the free web version. [1]

In my experience, the off-the-shelf LLMs (e.g. ChatGPT) do a pretty poor job with assembly; they cannot reason about the stack or stack frames well.

I think your job will be the same with or without AI: figuring out the data structures and data types a function is operating on, and naming variables.

What are you reverse engineering for? For example, getting a full compilable decompilation has different goals than finding vulnerabilities or patching a bug.

1. https://sidekick.binary.ninja/

aidanhs|1 year ago

Out of curiosity, what would you say the current state of the art is for full compilable decompilation? This is something I have a vague interest in but I'm not involved enough in the space to be on top of the latest and greatest tooling.

th0ma5|1 year ago

This is what I gather from reverse engineering material I've read and groups I've been around. Hidden state, hidden data structures, hidden automations all abound, and there simply isn't enough detail in the assembler itself to bridge the hardware's internal conceptualization and processes.

JosephRedfern|1 year ago

These guys are building foundational models for this purpose: https://reveng.ai/. The results are quite compelling, and they have plugins for your favourite reverse engineering tools.

netsec_burn|1 year ago

I made a site to use LLMs to help me with reverse engineering. The output is surprisingly readable, even with C++ classes. Let me know any feedback you might have: https://decompiler.zeroday.engineering/

readyplayernull|1 year ago

This is great! With Ghidra I had to look for the corresponding libs of a very specific RISC-V vendor; your SRE did it by itself. You should have your own HN thread on the front page!

btown|1 year ago

What kind of file should be uploaded?

__alexander|1 year ago

Do you have experience reverse engineering? If not, LLMs are not going to help much. LLMs are useful for aiding the analysis but they don’t do the analysis.

uncomplexity_|1 year ago

Yea this one. If you have solid fundamentals these LLMs are really handy in assisting and never leading.

For example, I have a minified JavaScript file, way obfuscated. I can paste the code and make it break down the initial structure. And then I tell it which parts to focus on and which parts to dig in deeper.

lumb63|1 year ago

It has nothing to do with LLMs, but Ghidra is a wonderful tool.

Dwedit|1 year ago

Have you tried Ghidra yet? If you still have your debug symbols, then it can do a really good job.

flashgordon|1 year ago

Interesting. Wouldn't this actually be a deterministic problem based on graph analysis? I'd have thought LLMs would be more effective taking the output of some graph recognizer and then identifying what those higher-level constructs map to.

warkdarrior|1 year ago

Deterministic maybe, but surely undecidable in the general case since you need whole program analysis to understand, for example, the purpose of a memory location. ML may help approximate this undecidable problem.

mahaloz|1 year ago

I like using it for library function comments, variable name recovery, and sometimes types. The comments are usually hit or miss, but I find the variable names to be a bit better than auto-generated ones. I implement most of this in my decompiler plugin: https://github.com/mahaloz/DAILA; check it out if you are interested :).
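A minimal sketch of the variable-name-recovery step, assuming the LLM has already returned a mapping from auto-generated names to suggested ones. This is not how DAILA does it; `apply_renames` and the sample names are hypothetical, and a real plugin would rename through the decompiler's own API rather than editing text.

```python
import re

def apply_renames(decompiled: str, renames: dict[str, str]) -> str:
    """Apply suggested names to decompiler text, matching whole identifiers only."""
    if not renames:
        return decompiled
    # \b anchors stop `iVar1` from matching inside `iVar10`.
    names = sorted(renames, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], decompiled)

# Hypothetical decompiler line and hypothetical LLM suggestions.
snippet = "iVar1 = param_1 + iVar10;"
suggested = {"iVar1": "total", "param_1": "base"}
print(apply_renames(snippet, suggested))  # total = base + iVar10;
```

Keeping the suggestions as a plain mapping makes it easy to review them before applying anything, which matters given that the names are only sometimes better than the auto-generated ones.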

stackghost|1 year ago

The Advent of Cyber side quest this year needed some Ghidra and I found Pickman's Model was pretty good at helping me craft a heap exploit from a decompilation.

userbinator|1 year ago

Unfortunately LLMs are not good at precision and details, which is exactly what you need for the sort of analysis you're trying to do.

menaerus|1 year ago

Right. Have a look at the paper above from Meta on how they fine-tuned the Code Llama with LLVM IR to beat the compiler in producing size-optimized binaries.

apatheticonion|1 year ago

Inspired by the work out there that reverse engineers game engines, I've always wanted to try my hand at reverse engineering to contribute to the world of game preservation.

Is it actually legal to decompile a game engine from executables/dll files, write new sources by making sense of the output and rewriting it such that it can be compiled targeting modern APIs?

I feel like that must be illegal.

feznyng|1 year ago

You could use the LLM to help you write utility scripts for whatever disassembler you’re using, e.g. Python for IDA. That might work better than feeding it raw assembly.

Game RE communities also have all sorts of neat utilities for decompiling large cpp binaries. Skyrim’s community is pretty active with Ghidra/IDA.

Guessing you’re not lucky enough to have a PDB?

sitkack|1 year ago

Do you know the compiler and what the source possibly looks like? I found LLMs are pretty good at recovering code from binaries, they need help though.

If you are able to run the program and collect traces, that will help a ton.

svilen_dobrev|1 year ago

cpp? that's a preprocessor. u mean c++?

LLM won't help you much if u can't understand what it's talking about.

Manual way is, given ELF (linux executable format) somexe,

$ strings somexe

$ objdump -d somexe

$ objdump -s -j .rodata somexe

then look+ponder over the results.
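If it helps to see what the first command is doing, here is a rough Python sketch of the same idea: pull runs of printable ASCII out of raw bytes. It only approximates `strings` (minimum run length of 4, no other encodings), and `ascii_strings` and the sample bytes are made up for illustration.

```python
import re

def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII (0x20-0x7e) that are at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Made-up bytes standing in for a file; for a real binary you would read it with
# open("somexe", "rb").read() instead.
sample = b"\x7fELF\x02\x01\x00hello world\x00\x01ab\x00/lib/ld.so\x00"
print(ascii_strings(sample))  # ['hello world', '/lib/ld.so']
```

Short runs like "ELF" and "ab" are dropped by the length cutoff, which is what keeps the output readable on a real binary.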

and/or running ghidra (as mouse'd UI) over it.. which may help somewhat but not 100%

Keep in mind that objdump and ghidra show the operands of move/multi-operand instructions in opposite order - one has mov dest,src , the other has mov src,dest - for the same code.
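On x86, objdump prints AT&T syntax by default (source first), and `objdump -d -M intel somexe` switches it to Intel order (destination first). The sketch below only illustrates the operand flip. It assumes a bare "mnemonic operands" string, and the helper names are hypothetical. It does not translate register sigils, size suffixes or memory operand syntax, so it is not a real syntax converter.

```python
def split_operands(ops: str) -> list[str]:
    """Split on commas that are not inside parentheses, e.g. 8(%rbp,%rax,4)."""
    parts, depth, current = [], 0, ""
    for ch in ops:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current.strip())
            current = ""
        else:
            current += ch
    if current.strip():
        parts.append(current.strip())
    return parts

def att_to_intel_order(line: str) -> str:
    """Reverse operand order only; the operands themselves are left untouched."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = split_operands(rest)
    operands.reverse()
    return f"{mnemonic} {', '.join(operands)}".strip()

print(att_to_intel_order("mov %rsp,%rbp"))            # mov %rbp, %rsp
print(att_to_intel_order("mov 8(%rbp,%rax,4),%eax"))  # mov %eax, 8(%rbp,%rax,4)
```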

no idea on (recent) windoze front. IDA ?

u53rn4m3|1 year ago

RevEng.AI have their own foundational AI models for decompilation with English language summaries.

seba_dos1|1 year ago

Good luck. If that's how you're approaching it, you're going to need it.

2-3-7-43-1807|1 year ago

op apparently never even heard about reddit

ianhawes|1 year ago

Highly recommend it. I reversed an app with o1 Pro Mode and the analysis of the obfuscated C# code matched up accurately with what I eventually discovered by manually reversing.

chc4|1 year ago

Reverse engineering C# is extremely different from C++ binaries.