This is an excellent use case for LLM fine-tuning, purely because of the ease of generating a massive dataset of input / output pairs from public C code
I would also think that generating a very large amount of C code using coding LLMs (using deepseek, for example, + verifying that the output compiles) as synthetic training data would be quite beneficial in this situation. Generally the quality of synthetic training data is one of the main concerns, but in this case, the ability for the code to compile is the crux.
I would think that the primary benefit of this over existing decompiler tools would be the ability to use sensible names for identifiers, break up a project to be a sensible set of modules, and maybe even add realistic / helpful comments. If you're synthesizing code to do that, you'll probably gain on the front of generating code that compiles, at the cost of these advantages.
bt1a|1 year ago
Zambyte|1 year ago