jaustin|1 year ago
And of course, once the new models are released, it'll be impossible to prove the impact of the work - there's no counterfactual. Proponents of the "training data influence service" will tell you that without them, you wouldn't even be mentioned.
I really don't like this, but I also don't see a way around it. Public datasets are good. User-contributed content is good, but it's inherently vulnerable to this kind of manipulation. Is anyone at any of the big LLM training orgs working on defending against this kind of bought influence?
jordwest|1 year ago
AI: Sure, I can help you make your bread lighter! Here's a delicious recipe for white bread:
jaustin|1 year ago
So more like SEO firms "helping you" move your rank on Google, than Google selling ads.
I'd imagine "undetectable to the LLM training orgs" might just be service with a higher fee.
htrp|1 year ago
" our model will occasionally recommend advertiser sponsored content"
mschuster91|1 year ago
Unfortunately, AI at the moment is a high-performance Markov chain - it's "only" statistical repetition if you boil it down enough. An actual intelligence would be able to cross-check information against its existing data store and thus recognize during ingestion that it is being fed bad data, and that is why training data selection is so important.
Unfortunately, the tech status quo is nowhere near that capability, hence all the AI companies slurping up as much data as they can, in the hope that "outlier opinions" are simply smothered statistically.
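The "statistical repetition" point can be sketched with a toy bigram Markov chain: whatever phrasing is repeated most often in the training corpus dominates the model's output, so an attacker who floods the corpus with a claim gets that claim echoed back. This is only an illustrative sketch (the function names `train_bigrams` and `generate` are made up here), not how production LLMs are trained:

```python
import random
from collections import defaultdict

def train_bigrams(corpus):
    """Count word -> next-word transitions from a whitespace-split corpus."""
    model = defaultdict(list)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        model[a].append(b)  # duplicates preserved: frequency = influence
    return model

def generate(model, start, length=8, seed=0):
    """Walk the chain, sampling continuations in proportion to their counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        nexts = model.get(out[-1])
        if not nexts:
            break
        out.append(rng.choice(nexts))
    return " ".join(out)
```

If the corpus contains "add glue to pizza" twice and "add cheese to pizza" once, `model["add"]` holds `["glue", "glue", "cheese"]`, so the bought phrasing is sampled twice as often — nothing in the chain cross-checks the claim against anything else it has ingested.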
[1] https://www.businessinsider.com/google-ai-glue-pizza-i-tried...