jaybna | 1 year ago

Yeah, probably right. If you want a great rabbit hole, look up "Common Crawl" and see how a great academic project was absolutely hijacked for pennies on the dollar to grab training data - the foundation for every LLM out there right now.

CamperBob2|1 year ago

It's hard to envision a greater success for the "great academic project" than what happened. I mean, what else were they trying to accomplish?

jaybna|1 year ago

It was meant to be an open-source compilation of the crawled internet so that research could be done on web search, given how opaque Google's process is. It was NOT meant to be a cheap source of data for for-profit LLMs to train on.

*edit: added "for-profit"