
Huge proportion of internet is AI-generated slime, researchers find

61 points | hhs | 2 years ago | futurism.com

29 comments

[+] chacham15|2 years ago|reply
The title is clickbait. The actual study[0] reads as follows:

> Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages

This is talking specifically about text translated into less common languages, which is mostly done by machine translation, and about the fact that in those languages most of the content found online is the result of machine translation. It has nothing to do with English or other more common languages overall.

[0] https://arxiv.org/pdf/2401.05749.pdf

[+] hhs|2 years ago|reply
To be fair, the author does link to that study in the first paragraph of this piece, and then adds some context about languages near the end:

“But while the English-language web is experiencing a steady — if palpable — AI creep, this new study suggests that the issue is far more pressing for many non-English speakers.

What's worse, the prevalence of AI-spun gibberish might make effectively training AI models in lower-resource languages nearly impossible in the long run. To train an advanced LLM, AI scientists need large amounts of high-quality data, which they generally get by scraping the web. If a given area of the internet is already overrun by nonsensical AI translations, the possibility of training advanced models in rarer languages could be stunted before it even starts.”

[+] figassis|2 years ago|reply
So it is AI generated slime for people in those languages.
[+] mschuster91|2 years ago|reply
No surprise there... the worst problem for me is that translation disappears as a profession as the job opportunities vanish, and with it goes the ability to translate material that requires cultural context to render properly (jokes, puns, wordplay).

We will be literally dumbing down as a species as a result.

[+] idiliv|2 years ago|reply
Hmm, are you sure that translations by LLMs like ChatGPT don't incorporate cultural context?
[+] mattgreenrocks|2 years ago|reply
I cannot help but grin as I read this. It’s as if AI will eat itself.

I have to bet this represents a severe existential threat to companies who don’t already have a war chest of decent training data.

[+] 2OEH8eoCRo0|2 years ago|reply
Humans eat themselves. Humans read and watch humans and are still able to learn and produce.

I think a bigger threat is siloing information behind logins or apps. Less information is out there free to be used and learned from and trained with.

[+] r0ckarong|2 years ago|reply
At some point we're going to just produce all this electricity so some AI can troll itself on the blockchain to decide who is going to trigger the nukes.
[+] feverzsj|2 years ago|reply
Soon, ad blockers will only need to supply whitelists.
[+] kevin_thibedeau|2 years ago|reply
I got halfway through an AI article the other day before it veered into a completely different random topic and revealed itself. Kagi should add a way to flag such sites. Implement a reputation system to minimize gaming from adversaries while still taking in revenue from them.
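The reputation system sketched in this comment could work something like the following toy illustration. All names and numbers here are hypothetical, not any actual Kagi feature; a real system would also need identity verification, rate limits, and audit trails to resist adversarial gaming.

```python
# Toy sketch of a flag-weighted domain reputation system.
# Hypothetical design: each user's flag counts proportionally to that
# user's own trustworthiness, so freshly created throwaway accounts
# contribute little toward getting a domain marked as AI slop.
from collections import defaultdict


class DomainReputation:
    def __init__(self):
        self.flags = defaultdict(float)              # domain -> weighted flag mass
        self.user_weight = defaultdict(lambda: 1.0)  # user -> flag weight

    def flag(self, user: str, domain: str) -> None:
        # A flag is worth the flagger's current weight.
        self.flags[domain] += self.user_weight[user]

    def confirm(self, user: str) -> None:
        # When a moderator confirms a user's past flag was accurate,
        # that user's future flags carry more weight (capped).
        self.user_weight[user] = min(self.user_weight[user] * 1.5, 10.0)

    def is_flagged(self, domain: str, threshold: float = 5.0) -> bool:
        return self.flags[domain] >= threshold


rep = DomainReputation()
for u in ["a", "b", "c", "d", "e"]:
    rep.flag(u, "spamsite.example")
print(rep.is_flagged("spamsite.example"))  # five unit-weight flags meet the threshold
```

The cap on user weight and the threshold are arbitrary choices here; the point is only that weighting flags by flagger reputation raises the cost of gaming the system with sockpuppets.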
[+] calamari4065|2 years ago|reply
When I see those, I just block the domain from my results. Sites that publish these unchecked AI pieces usually don't have anything else worthy of my time
[+] Ensorceled|2 years ago|reply
Do we have a backup of the pre-2023 internet (Wayback Machine, Wikipedia, etc.)?
[+] runeofdoom|2 years ago|reply
Good question. Seeing this certainly makes me think I should raise the priority of politely scraping a few valued sites (mostly forums) for backup as a personal reference.
[+] orbital-decay|2 years ago|reply
The study they refer to [1] seems to be about machine-translated versions of human-written pages, which have existed at scale for about a decade already. The article blows it out of proportion; it's not as if most of what you're reading is generated by the current crop of large transformers.

[1] https://arxiv.org/pdf/2401.05749.pdf

[+] alwa|2 years ago|reply
Not my domain, and I know I’m amongst experts. But at the risk of stating the obvious: It feels like the claim here (and elsewhere) is that we’re near the breaking point of the incentive model that’s propelled knowledge out of human minds and onto the web-as-we-know-it in coherent, discoverable, standardized, useful form.

I’ve always been a little old-fashioned, in that I prefer to trust specific bodies of writing and specific humans for knowledge about specific topics, even if that keeps me slow and behind the zeitgeist.

But in any case, the webpages-for-traffic well now seems on the verge of being too polluted to drink from.

What’s the next paradigm? Walled gardens of proven-provenance content for our AI summarizers to wade through? AI-vs-AI arms race? Or does the web become more about underlying facts and structured data, and meaning and insight become less commoditized and more person-to-person again?

I mean are any of these tensions really new, or is this just a Google problem?

[+] SimianLogic|2 years ago|reply
I suspect “AI-generated slime” is on average higher quality than what most content mills have been pumping out without AI for the last 10 years (plus).
[+] runeofdoom|2 years ago|reply
What is better or worse though - a million 2/10 sites or a hundred million 3/10 sites?
[+] nih0|2 years ago|reply
So it's self-destructing.
[+] jug|2 years ago|reply
Note: The premise of this article is that they call machine translations "AI generated slime".

It is NOT about ChatGPT generated articles.

It's ironic to me, because the ONE thing I think modern AI has no doubt improved is machine translation. DeepL is often miles ahead of what we had before, and while LLMs are not trustworthy scientific experts in all fields, if they are anything almost by definition (as LANGUAGE models), it is linguistic experts. Iceland is famously using GPT-4 for language preservation because it's as good at Icelandic as an expert native speaker.

So please, for the love of god, let's abandon the former generation of machine translation and welcome AI translations with open arms for improved accessibility and cross-cultural reach. And let's stop looking down on AI translations just because the word "AI" makes you see red.

Not sure I need to add that I find this article complete junk. As usual with Futurism.com content.

[+] CottonMcKnight|2 years ago|reply
enshittification intensifies

American Dialect Society nailed it.

[+] peter_d_sherman|2 years ago|reply
99.99% of the Internet -- is crap -- AI generated or not.

But this is also true about books, movies, music, products, corporations, etc., etc.

Everything really...

The thing is though, that the other 0.01% of the Internet (and the other 0.01% of everything else) -- are the proverbial "diamonds in the rough" -- the things that have great value...

But you gotta search to find them...

You know, "seek and ye shall find", "leave no stone unturned", etc., etc.

Ironically, Google's search engine, whose main rise to fame was caused by too little information on the Internet -- is now completely overwhelmed with too much crappy/spammy/subprime/agenda-based/advertising/biased/subpar/TL;DR/unnecessary information.

In other words, we've gone from "too thin" to "too fat", from not enough information to too much information...

Google Search Engine's primary virtue -- its ability to find things on the Internet -- has now become its Achilles' Heel, information-wise...

(And I say that as a great fan of Google! At least at this point in time, 2024, i.e., from 1999-2024 -- the first 25 years of the company!)

Perhaps I'm guessing (as opposed to knowing), but I would say that the rise of ChatGPT (and other AI LLM online chat systems) was at least caused in part by too much information.

Think of ChatGPT -- not as a futuristic scary AI (although it could certainly become that too!) -- but as a more human-friendly filter (and that's the keyword, "filter") of information -- than the Google Search Engine is or ever was...

And that's what we need right now more than anything else -- intelligent filters to block out too much information...

The upcoming problem is (or will be!) -- since censorship and intelligent information filtering are twin siblings and run very similar parallel paths -- how do we distinguish one from the other?

How do we permit one, but not the other?

How do we permit intelligent information filtering, but not permit censorship?

You see, there's a very fine line between the two!

A very fine line!

What are we going to do, have a programmer code for what that line is? Have a 3rd party AI determine that exact line? Have a/the government(s) decide what that is?

?

It is, or will be, a future problem with no easy answer and no apparent solution -- and it is starting to form as of this current day!

Perhaps the original unfiltered Google Search of 20+ years ago -- is not looking all that bad in comparison! :-)