top | item 46528495


afspear | 1 month ago

Maybe we should find other datasets not generated by humans to train LLMs?

ACCount37 | 1 month ago

Sadly, we have n=1 for intelligence, and that's humans. The second-best example of intelligence we have is already LLMs. And it's hard to expect imitation learning on data that wasn't produced by anything intelligent to yield intelligence - although there have been some curious findings.

Even for human behavior: we don't have that much data. The current datasets don't capture all of human behavior - only the facets of it that can be glimpsed from text, or from video. And video is notoriously hard to use well in LLM training pipelines.

That LLMs can learn so much from so little is quite impressive in itself. Text being this powerful was, at the time, an extremely counterintuitive finding.

Although some of the power of modern LLMs already comes from nonhuman sources. RLVR and RLAIF are major parts of training recipes for frontier labs.

threethirtytwo | 1 month ago

The datasets going into LLMs have to have an element of human-ness to them.

For example, I can’t just feed it weather data from the past decade and expect it to understand weather. It needs input-output pairs, with the output in human language. So you can feed it weather data, but it has to be paired with a human description of that data. If we give it data from a rainstorm, there has to be an English description paired with it saying it’s a rainstorm.
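A minimal sketch of what that pairing might look like as a training example. The field names and sensor readings here are invented for illustration, not any particular lab's format:

```python
import json

def make_training_example(readings, description):
    """Pair raw weather readings with a human-language description.

    The raw numbers alone are opaque to a language model; the English
    pairing is what connects the data to human concepts like 'rainstorm'.
    """
    return {
        "input": json.dumps(readings),
        "output": description,
    }

# Hypothetical rainstorm reading paired with its English description.
example = make_training_example(
    {"precip_mm_per_hr": 12.4, "wind_kph": 38, "pressure_hpa": 996},
    "A heavy rainstorm with strong winds and falling pressure.",
)
print(example["output"])
```

Collections of such examples (often stored as JSONL) are the shape supervised fine-tuning data typically takes: the model only learns to talk about the sensor data because a human description is attached to it.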