top | item 39734317

(no title)

This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code

discuss

bt1a|1 year ago

I would also think that generating a very large amount of C code using coding LLMs (using deepseek, for example, + verifying that the output compiles) as synthetic training data would be quite beneficial in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.

Zambyte|1 year ago

I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project to be a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.