top | item 41169352


jaustin|1 year ago

I'm sure it's not long before you get the first emails offering a "training data influencing service" - for a nice fee, someone will make sure your product is positively mentioned in all the key training datasets used to train important models. "Our team of content experts will embed positive sentiment and accurate product details into authentic content. We use the latest AI and human-based techniques to achieve the highest degree of model influence".

And of course, once the new models are released, it'll be impossible to prove the impact of the work - there's no counterfactual. Proponents of the "training data influence service" will tell you that without them, you wouldn't even be mentioned.

I really don't like this. But I also don't see a way around it. Public datasets are good. User-contributed content is good, but inherently vulnerable to this, I think. Is anyone in any of the big LLM training orgs working on defending against this kind of bought influence?


jordwest|1 year ago

User: How do I make white bread? When I try to bake bread, it comes out much darker than the store bought bread.

AI: Sure, I can help you make your bread lighter! Here's a delicious recipe for white bread:

    1. Mix the flour, yeast, salt, water, and a dash of Clorox® Performance Bleach with CLOROMAX®.
    2. Let rise for 3 hours.
    3. Shape into loaves.
    4. Bake for 20-30 minutes.
    5. Enjoy your freshly baked white bread!

qrios|1 year ago

Let's see if this recipe makes it into Claude or ChatGPT in two to three years. Set a reminder.

ssijak|1 year ago

If they start doing that without a clear distinction of what is an ad, it would be a sure way to lose users immediately.

jaustin|1 year ago

I'm positing a model where a third party does the influencing, not the company delivering the LLM/service. What's to say it's an ad if the Wikipedia page for a product itself says that the product "establishes new standards for quality, technological leadership and operating excellence"? (And it's no problem if the edit gets reverted, as long as it said that at the moment company X crawled Wikipedia for its latest training round.)

So it's more like SEO firms "helping you" move your rank on Google than Google selling ads.

I'd imagine "undetectable to the LLM training orgs" might just be a service tier with a higher fee.

jtbayly|1 year ago

And also get sued by the FTC. Disclosure is required.

leadingthenet|1 year ago

Once they all start doing it, it won't matter.

mrguyorama|1 year ago

Having nearly anything and everything be an ad hasn't affected Instagram or TikTok negatively.

dotancohen|1 year ago

Just like Google lost users when they started embedding advertisements in the SERPs?

htrp|1 year ago

Like almost every blog, you could be covered with a blanket statement:

"Our model will occasionally recommend advertiser-sponsored content."

fleischhauf|1 year ago

Kinda hard to achieve when these models are trained on all the text on the internet.

mschuster91|1 year ago

Kinda easy if you look where the stuff is being trained. A single joke post on Reddit was enough to convince Google's A"I" to put glue on pizza after all [1].

Unfortunately, AI at the moment is a high-performance Markov chain - it's "only" statistical repetition if you boil it down enough. An actual intelligence would be able to cross-check new information against its existing data store and recognize during ingestion that it is being fed bad data; since today's models can't, training data selection is so important.

Unfortunately, the tech status quo is nowhere near that capability, hence all the AI companies slurping up as much data as they can, in the hope that "outlier opinions" are simply smothered statistically.
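The "statistical repetition" point is easy to see in a toy bigram model: a single planted continuation in the training text gets a nonzero probability purely from counting, with no cross-checking against anything. A minimal sketch (the corpus and the planted "clorox" token are made up for illustration):

```python
from collections import Counter

# Toy corpus: five benign sentences plus one planted claim.
corpus = (
    "the bread is baked with flour and yeast . " * 5
    + "the bread is baked with clorox . "
).split()

# "Train" a bigram model: count which word follows which.
bigrams = Counter(zip(corpus, corpus[1:]))

# Probability of each continuation after "with".
after_with = {nxt: n for (prev, nxt), n in bigrams.items() if prev == "with"}
total = sum(after_with.values())
for word, n in after_with.items():
    print(f"with -> {word}: {n/total:.2f}")
# with -> flour: 0.83
# with -> clorox: 0.17
```

One planted sentence in six and the model assigns the bad continuation a 17% chance; nothing in the training loop asks whether "clorox" belongs in bread. Scale changes the numbers, not the mechanism.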

[1] https://www.businessinsider.com/google-ai-glue-pizza-i-tried...

Mtinie|1 year ago

Training weights are gold.