dobin | 2 years ago

I once had the idea to do malware-similarity analysis. The x86 code should first be lifted into an IL, so it gets "normalized" (e.g. register independent). The problem with all lifters is though that even a trivial "add rax, 1" generated a lot of IL code (probably 50-100 lines in LLVM IL), as the lifter had to implement all side effects of the x86 instructions in a fake memory space (I used remill, if I remember correctly).

Does this lifter have a similar implementation, or will an "add rax, 1" be lifted to something like "register1 += 1"?

aengelke|2 years ago

> The problem with all lifters is though that even a trivial "add rax, 1" generated a lot of IL code (probably 50-100 lines in LLVM IL)

Why is this a problem? The addition is one LLVM-IR instruction (add), followed by flag computation (maybe 10-20 instrs). Dead code elimination will afterwards quickly remove unused instructions (e.g., unused flags).
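
A minimal sketch of that point, in Python rather than real LLVM-IR. The `Instr` type, the op names and the `rax0`/`rax1` value names are made up for illustration, and the flag list is incomplete (OF and AF are left out). It only shows the shape of the argument: one add, a handful of flag computations, and a backward pass that drops whatever nobody reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str    # SSA value defined by this instruction
    op: str      # toy operation name, not a real LLVM opcode
    args: tuple  # operand value names (str) or integer constants


def lift_add_rax_1():
    """Toy lifting of `add rax, 1`: one add plus separate flag computations."""
    return [
        Instr("rax1", "add", ("rax0", 1)),
        Instr("zf", "icmp_eq", ("rax1", 0)),        # zero flag
        Instr("sf", "icmp_slt", ("rax1", 0)),       # sign flag
        Instr("cf", "icmp_ult", ("rax1", "rax0")),  # carry: result wrapped around
        Instr("low", "trunc8", ("rax1",)),          # parity looks at the low byte
        Instr("pop", "ctpop", ("low",)),
        Instr("pf", "is_even", ("pop",)),           # parity flag
    ]


def eliminate_dead_code(instrs, live_out):
    """Keep only instructions whose result is (transitively) needed by live_out."""
    live = set(live_out)
    kept = []
    for ins in reversed(instrs):
        if ins.dest in live:
            kept.append(ins)
            # Operands of a kept instruction become live as well.
            live.update(a for a in ins.args if isinstance(a, str))
    kept.reverse()
    return kept


lifted = lift_add_rax_1()
only_rax = eliminate_dead_code(lifted, {"rax1"})
print(len(lifted), "instructions lifted,", len(only_rax), "left if no flag is read")
```

If a later instruction does read a flag (say the parity flag), only the chain feeding that flag survives; the other flag computations still disappear.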

> register1 += 1

I don't see how this could be beneficial, especially on x86 where you can have "mov rax, rdx; add rax, 1" and "lea rax, [rdx + 1]", which do mostly the same thing (the former clobbers flags, the latter doesn't). SSA removes registers and shows the semantic operations clearly.
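
To illustrate with a toy example (Python, not an actual lifter; the tuple encoding of instructions and the `lift` function are hypothetical): once registers are replaced by the values they hold, both sequences yield the same expression for rax, and the only remaining difference is that the add also defines flags.

```python
def lift(instructions):
    """Symbolically evaluate toy x86 tuples; return a register -> expression map.

    Supported forms (an assumed encoding, for illustration only):
      ("mov", dst, src)         dst = src
      ("add", dst, src)         dst = dst + src, also defines flags
      ("lea", dst, base, disp)  dst = base + disp, flags untouched
    """
    regs = {}

    def value(operand):
        if isinstance(operand, int):
            return operand
        # A register that was never written holds its initial (input) value.
        return regs.get(operand, ("in", operand))

    for ins in instructions:
        op = ins[0]
        if op == "mov":
            _, dst, src = ins
            regs[dst] = value(src)
        elif op == "add":
            _, dst, src = ins
            regs[dst] = ("add", value(dst), value(src))
            # Simplification: real flags also depend on the operands.
            regs["flags"] = ("flags_of", regs[dst])
        elif op == "lea":
            _, dst, base, disp = ins
            regs[dst] = ("add", value(base), disp)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return regs


via_add = lift([("mov", "rax", "rdx"), ("add", "rax", 1)])
via_lea = lift([("lea", "rax", "rdx", 1)])

print(via_add["rax"] == via_lea["rax"])        # same value expression for rax
print("flags" in via_add, "flags" in via_lea)  # only the add defines flags
```

A "register1 += 1" style IL would still show a mov followed by an add in one case and a single lea in the other, so the two would not compare equal without further normalization.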

aleclm|2 years ago

I had some ideas about binary diffing, but it's a difficult topic and I'm too much of a noob in ML to get to something working in a decent time frame.

I think something ABI-, compiler- and architecture-agnostic would be super cool and I started to build a training data set.

I wouldn't diff individual instructions though, I'd go for something more high-level, such as features of the CFG and the types of operations in the nodes.
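
A rough sketch of what such features could look like, in Python. This is hand-written counting plus cosine similarity, not an ML model, and the feature set, the operation kinds ("arith", "load", ...) and the function names are assumptions picked for illustration.

```python
import math
from collections import Counter


def cfg_features(cfg, ops):
    """Build a feature vector that ignores block names and concrete instructions.

    cfg: block name -> list of successor block names
    ops: block name -> list of operation kinds found in that block
    """
    feats = Counter()
    feats["blocks"] = len(cfg)
    feats["edges"] = sum(len(succs) for succs in cfg.values())
    for succs in cfg.values():
        feats[f"out_degree_{len(succs)}"] += 1  # histogram of successor counts
    for kinds in ops.values():
        for kind in kinds:
            feats[f"op_{kind}"] += 1            # histogram of operation kinds
    return feats


def cosine_similarity(a, b):
    """Cosine similarity of two sparse feature vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The same loop-shaped function twice, with different block names.
f1 = cfg_features(
    {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []},
    {"entry": ["arith"], "loop": ["load", "arith", "compare"], "exit": ["ret"]},
)
f2 = cfg_features(
    {"b0": ["b1"], "b1": ["b1", "b2"], "b2": []},
    {"b0": ["arith"], "b1": ["load", "arith", "compare"], "b2": ["ret"]},
)
# A straight-line function with no loop.
f3 = cfg_features(
    {"entry": ["exit"], "exit": []},
    {"entry": ["store", "call"], "exit": ["ret"]},
)

print(cosine_similarity(f1, f2))
print(cosine_similarity(f1, f3))
```

Because only counts of structural properties go in, renaming blocks or registers does not change the vector; whether features this coarse separate real functions well enough is a separate question.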