boomanaiden154 | 1 year ago
Additionally, they train roughly half on assembly and half on LLVM-IR. They don't say much about how they generate the dataset beyond that it was derived from the CodeLlama dataset, but my guess is that they compile as much code as possible to LLVM-IR and then lower that to assembly, leaving gcc out of the loop entirely for the vast majority of the compiler-specific training.
hughleat | 1 year ago
boomanaiden154 | 1 year ago
It seems like build systems were somehow invoked, given the different targets present in the final version?
Was it mostly C/C++ (if so, how did you resolve missing includes/build flags), or something else?