
notaharvardmba | 1 year ago

My reasoning is pretty simple: if you train on the entire web, the first training set is the only one that contains no model-generated data, and thus the most faithful picture of what human-produced text actually looks like. Now the web is full of generated content, and continuing to train from it will tend to bias models over time. There was really only one chance to do the web-scraping thing, and it's over and done.

We'll have to go back to carefully curated training sets, or come up with a truly failsafe way to detect and exclude model-generated content from the crawl. Otherwise you're basically eating your own feces: the model's output feeds back into its input, producing feedback and hysteresis, and the bias compounds with each generation.

This is a very big-picture view, but it does seem the "great leap" of 2020-2023 happened because we got one chance to ingest a huge amount of clean data, and from here on, improvements will have to come from training-data quality instead.
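To make the feedback effect concrete, here's a toy sketch (purely illustrative, nothing any lab actually does): fit a Gaussian to some data, sample from the fit, refit to the samples, repeat. The fitted spread shrinks generation after generation, which is the "eating your own output" loop in miniature. The Gaussian stands in for "the model" and numpy is assumed.

    # Toy model-collapse simulation: each "generation" trains on the
    # previous generation's output instead of the original clean data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Generation 0: "clean" human data
    data = rng.normal(loc=0.0, scale=1.0, size=1000)

    for gen in range(10):
        mu, sigma = data.mean(), data.std()      # fit the "model" (MLE Gaussian)
        print(f"gen {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")
        # Next web crawl is mostly model output: sample from the fit
        data = rng.normal(mu, sigma, size=1000)

Run it and the fitted sigma drifts downward: sampling noise plus refitting steadily loses the tails of the original distribution, even though each individual step looks harmless.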

