I think GP was probably referring to "Scaling Data-Constrained Language Models" (arXiv:2305.16264) from NeurIPS 2023, which looked at how to optimally scale LLMs when training data is limited. There is a short section on mixing code (Python) into the training data and the effect this has on performance on e.g. natural language tasks. One of their findings was that training data can be up to 50% code without actually degrading performance, and in some cases (benchmarks like bAbI and WebNLG) it even improved results, probably because these tasks emphasize what they call "long-range state tracking capabilities". For reference: in the Llama 3 technical report (arXiv:2407.21783), they mention that they ended up using 17% code tokens in their training data.
eru|1 year ago
YetAnotherNick|1 year ago