The title is clickbait. The actual study[0] reads as follows:
> Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages
This is talking specifically about text translated into less common languages: most of that translation is done by machine, and in those languages most of the content found on the web is the result of machine translation. It has nothing to do with English or other high-resource languages overall.
To be fair, the author does link to that study in the first paragraph of this piece, and then adds some context about languages near the end:
“But while the English-language web is experiencing a steady — if palpable — AI creep, this new study suggests that the issue is far more pressing for many non-English speakers.
What's worse, the prevalence of AI-spun gibberish might make effectively training AI models in lower-resource languages nearly impossible in the long run. To train an advanced LLM, AI scientists need large amounts of high-quality data, which they generally get by scraping the web. If a given area of the internet is already overrun by nonsensical AI translations, the possibility of training advanced models in rarer languages could be stunted before it even starts.”
No surprise there... The worst problem, for me, is that translation vanishes as a profession as the job opportunities dry up, and with it goes the ability to translate material that requires cultural context (jokes, puns, wordplay).
We will be literally dumbing down as a species as a result.
At some point we're going to just produce all this electricity so some AI can troll itself on the blockchain to decide who is going to trigger the nukes.
I got halfway through an AI article the other day before it veered into a completely different random topic and revealed itself. Kagi should add a way to flag such sites. Implement a reputation system to minimize gaming from adversaries while still taking in revenue from them.
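A reputation system like that could be sketched as follows. This is a toy illustration only, assuming nothing about Kagi's actual product: the class, the weights, and the threshold are all invented.

```python
from collections import defaultdict

class FlagAggregator:
    """Toy reputation-weighted flagging (hypothetical, not Kagi's design).

    Each user starts with reputation 1.0. A domain's flag score is the sum
    of its flaggers' reputations. When a flag is later confirmed (say, by
    manual review), the flaggers' reputations grow; when it is rejected,
    they shrink, so throwaway adversarial accounts lose influence over time.
    """

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.reputation = defaultdict(lambda: 1.0)
        self.flags = defaultdict(set)  # domain -> set of user ids

    def flag(self, user, domain):
        self.flags[domain].add(user)

    def score(self, domain):
        return sum(self.reputation[u] for u in self.flags[domain])

    def is_flagged(self, domain):
        return self.score(domain) >= self.threshold

    def resolve(self, domain, confirmed):
        # Reward users whose flags a reviewer confirmed; penalize the rest.
        for u in self.flags.pop(domain, set()):
            self.reputation[u] *= 1.2 if confirmed else 0.5
```

The point of the reputation update is that a botnet of fresh accounts can still trip the threshold once, but loses half its weight every time a reviewer overturns its flags.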
When I see those, I just block the domain from my results. Sites that publish these unchecked AI pieces usually don't have anything else worthy of my time.
Good question. Seeing this certainly makes me think I should prioritize politely scraping a few valued sites (mostly forums) to back them up as a personal reference.
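A minimal sketch of what "politely scraping" could look like in Python, using only the standard library: honor robots.txt, identify yourself, and rate-limit. The user-agent name and delay are placeholders, not recommendations.

```python
import time
import urllib.robotparser
from urllib.request import Request, urlopen

def allowed(robots_txt_lines, user_agent, url):
    """Check a site's robots.txt (given as a list of lines) against one URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp.can_fetch(user_agent, url)

def polite_fetch(url, robots_txt_lines,
                 user_agent="personal-archive-bot", delay=5.0):
    """Fetch one page for a personal backup, skipping disallowed paths."""
    if not allowed(robots_txt_lines, user_agent, url):
        return None          # respect the site's wishes
    time.sleep(delay)        # pause between requests to avoid hammering
    req = Request(url, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.read()
```

In practice you would fetch robots.txt once per site, cache it, and keep the delay generous, since the whole point is to be a good guest.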
The study they refer to [1] seems to be about machine-translated versions of human-written pages, which have existed at large scale for about a decade already. The article blows this out of proportion; it's not as if most of what you're reading was generated by the current crop of large transformers.
Not my domain, and I know I’m amongst experts. But at the risk of stating the obvious: It feels like the claim here (and elsewhere) is that we’re near the breaking point of the incentive model that’s propelled knowledge out of human minds and onto the web-as-we-know-it in coherent, discoverable, standardized, useful form.
I’ve always been a little old-fashioned, in that I prefer to trust specific bodies of writing and specific humans for knowledge about specific topics, even if that keeps me slow and behind the zeitgeist.
But in any case, the webpages-for-traffic well now seems on the verge of being too polluted to drink from.
What’s the next paradigm? Walled gardens of proven-provenance content for our AI summarizers to wade through? AI-vs-AI arms race? Or does the web become more about underlying facts and structured data, and meaning and insight become less commoditized and more person-to-person again?
I mean, are any of these tensions really new, or is this just a Google problem?
I suspect "AI-generated slime" is on average higher quality than what most content mills have been pumping out without AI for the last ten-plus years.
Note: the premise of this article is that it calls machine translations "AI-generated slime".
It is NOT about ChatGPT-generated articles.
It's ironic to me, because the ONE thing I think modern AI has undoubtedly improved is machine translation. DeepL is often miles ahead of what we had, and while LLMs are not trustworthy scientific experts in all fields, they are, almost by definition (as LANGUAGE models), linguistic experts. Iceland is famously using GPT-4 for language preservation because it's as good at Icelandic as an expert native speaker.
So please, for the love of god, let's abandon the former generation of machine translation and welcome AI translations with open arms, for improved accessibility and cross-cultural reach. And let's stop looking down on AI translations just because you see red when you read the word "AI".
Not sure I need to add that I find this article complete junk. As usual with Futurism.com content.
99.99% of the Internet is crap -- AI-generated or not.
But this is also true about books, movies, music, products, corporations, etc., etc.
Everything really...
The thing is though, that the other 0.01% of the Internet (and the other 0.01% of everything else) -- are the proverbial "diamonds in the rough" -- the things that have great value...
But you gotta search to find them...
You know, "seek and ye shall find", "leave no stone unturned", etc., etc.
Ironically, Google's search engine, whose main rise to fame was caused by too little information on the Internet -- is now completely overwhelmed with too much crappy/spammy/subprime/agenda-based/advertising/biased/subpar/TL;DR/unnecessary information.
In other words, we've gone from "too thin" to "too fat", from not enough information to too much information...
Google Search Engine's primary virtue -- its ability to find things on the Internet -- has now become its Achilles' Heel, information-wise...
(And I say that as a great fan of Google! At least at this point in time, 2024, i.e., from 1999-2024 -- the first 25 years of the company!)
Admittedly I'm guessing (as opposed to knowing), but I would say that the rise of ChatGPT (and other AI LLM chat systems) was caused at least in part by too much information.
Think of ChatGPT -- not as a futuristic scary AI (although it could certainly become that too!) -- but as a more human-friendly filter (and that's the keyword, "filter") of information -- than the Google Search Engine is or ever was...
And that's what we need right now more than anything else -- intelligent filters to block out too much information...
The upcoming problem is (or will be!) -- since censorship and intelligent information filtering are twin siblings and run very similar parallel paths -- how do we distinguish one from the other?
How do we permit one, but not the other?
How do we permit intelligent information filtering, but not permit censorship?
You see, there's a very fine line between the two!
A very fine line!
What are we going to do -- have a programmer code for where that line is? Have a third-party AI determine the exact line? Have a government (or governments) decide what it is?
It is, or will be, a future problem with no easy answer and no apparent solution -- and it is starting to form this very day!
Perhaps the original unfiltered Google Search of 20+ years ago -- is not looking all that bad in comparison! :-)
[0] https://arxiv.org/pdf/2401.05749.pdf
I have to bet this represents a severe existential threat to companies that don't already have a war chest of decent training data.
I think a bigger threat is siloing information behind logins or apps. Less information is out there free to be used, learned from, and trained on.
[1] https://arxiv.org/pdf/2401.05749.pdf
American Dialect Society nailed it.