That's literally how this "AI" (an autoregressive language model) works. The logic is essentially: a little plagiarism is plagiarism, a lot of plagiarism is original work. So if it is detected, they need to increase the level of plagiarism until it isn't.
While what I just wrote may come across as a joke, there's a lot of truth in it. These language models aren't actually smart; they're just good at parroting smart-sounding things they've been taught (although one could argue humans often do the same thing). The problem here is that they need more sources to splice together to hide the true origins.
For other "AI", like computer vision, this is less of a problem: even if you're inputting proprietary visual content, those systems aren't typically outputting parts of that content, but rather the tags that have been assigned to it (often by humans). With language models it is becoming a problem, because the outputs are directly assembled from the inputs, and those inputs are proprietary.
> These language models aren't actually smart, they're just good at parroting smart sounding things they've been taught
I still don't understand the purported difference between "sounds smart" and "is smart".
How else can you verify an entity's intelligence other than by its "output"? Every doctor, pilot, and president has obtained that position of power and responsibility literally only by being tested on their outputs.
We can't see inside a human's brain just as we can't see inside an AI's brain. What is the difference? How else do you propose we verify intelligence?
> These language models aren't actually smart, they're just good at parroting smart sounding things they've been taught (although one could argue humans often do the same thing); the problem here is they need more sources to intersplice to hide the true origins.
That's debatable, and it's not necessarily true. Just the other day, we saw a story here (https://news.ycombinator.com/item?id=34474043) suggesting that a language model trained on Othello game transcripts did in fact build a latent-space "mental model" of the board state. By perturbing the latent space, the researchers could argue that the model wasn't simply parroting memorized game sequences.
Of course, this isn't to say that AI plagiarism is impossible or even unlikely. Plagiarism is a shortcut to acceptable quality output, and AI systems excel at finding shortcuts to acceptable output.
The 'parrot' argument is overconfident and wrong. These models have the capacity for abstract reasoning, but also the capacity for memorization and copying, and they use a mixture of these strategies to produce high-quality output. Here's an excellent demonstration of abstract modeling:
Yup. It's essentially a very skillful writer with zero original thoughts; where would those come from, if all the model does is predict likely next words from the preceding string of words?
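To make that "likely next words" mechanic concrete, here's a toy sketch (the corpus and function names are invented for illustration; real models condition on vastly richer context than one word):

```python
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in a tiny made-up
# corpus, then always predict the most frequent follower. Nothing novel
# can come out; it can only recombine sequences it has already seen.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the word most often observed after `word` in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict_next("sat"))  # "on": the only word ever seen after "sat"
```

A real autoregressive model interpolates between far more patterns over far more context, but the training objective is the same kind of thing: predict the next token.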
The examples provided are not that convincing, especially the two headlines. Yes, they are similar, because there are a limited number of ways to ask "Can You Buy a Gift Card With a Credit Card?". Should the topic be off limits now that it's been written about? Take the code equivalent: ask 100 people to write, in secret, a function that takes two ints and returns their sum. I suspect a number of those code blocks will be "plagiarized" from each other too.
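To put that thought experiment in code (function names invented, obviously), two of those hundred secret submissions might plausibly look like this, near-identical with zero copying involved:

```python
# Two "independently written" versions of the same trivial spec:
# take two ints, return their sum. With a spec this narrow,
# near-identical code is convergence, not plagiarism.
def add(a: int, b: int) -> int:
    return a + b

def sum_two(x: int, y: int) -> int:
    return x + y

assert add(2, 3) == sum_two(2, 3) == 5
```

The narrower the prompt, the smaller the space of reasonable outputs, and the higher the chance of coincidental overlap.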
LOL, give me a break. No respectable technology professional considers CNET a bastion of tech journalism. Even before I knew their articles were written by AI, I never would have relied on CNET as any kind of authority on technical matters.
Titan in this context could mean a large entity with a wide reach. No one mistakes it for a cutting-edge source of news, but it's hard to deny that the brand has a significant amount of awareness and name recognition, fostered over two decades. Their audience isn't tech workers. It's people like your mom and dad looking for tech tips.
CNET is the only one still standing. That it looks indistinguishable from a Gizmodo or 9to5Mac style blogsite is a testament to how difficult it is to maintain an independent existence in the fickle world of digital media.
Most 'pure' tech journalism does not last long as the market for it is too niche. BYTE magazine was dead before the Internet. The UK-based "Net" magazine (webdev-based) lasted until 2016. PC World is still kicking but runs on a skeleton crew and is online-only.
CNET has changed formats many times, and the well-known editors who used to make it worthwhile moved on years ago. That it's trundling on with computer-generated articles is, I guess, a kind of ironic curtain call.
So sad to see how they've fallen into being a content farm. There's no purpose in even putting out an article on "Can You Buy a Gift Card With a Credit Card?" except to jam up search engines.
Real techies(tm) might have other sources, but there is an amplification chamber running from such "techy-like websites" to the "techy" parts of newspapers, to "techy" trend projectors like Gartner and their ilk, to the "modern, cutting edge" politicians who influence policy agendas about technology.
Objective and balanced information about what happens in tech was always a problem, as those sites and entities "never bite the hand that feeds them", but it seems to get harder by the day as we drown in an ocean of fakery.
I didn't realize people considered CNET as respectable. I always assumed it was some site designed to bombard you with ads after it tricked you into clicking on it. At least, the auto-playing videos and huge banners told me that.
I miss the old CNET of the 2000s. It had a message board back then, true halcyon days when websites had the genius idea of providing something that users could come back to, instead of clickbait articles with auto-playing video ads.
But CNET is just the canary in the coal mine; I haven't gone there for tech coverage in over a decade because of the bland writing and the lack of clear editorial voice. The problem for me will be when the entire internet reads like this.
There are still some good writers there. But it's become mostly gadget/tech tips. [ADDED: But, really, mainstream consumer tech news in general.] It was a lot broader in the 2000s and had much more personality with a fairly large stable of affiliated blog network writers (which I did for a number of years). After CBS bought them, they started becoming a lot more narrowly focused and homogenized--and click-optimized.
At the start of the 18th century, most work had to be done with the muscles of humans or animals. The development of the steam engine, and later engines, allowed these resources to be applied to more productive pursuits.
Near the end of the 19th century, all data processing was by human clerks. Tabulating machines, and later computers, allowed these resources to be applied to more productive pursuits.
What's the biggest drain of resources today? At least in the developed nations, a good candidate is lies and lack of integrity in decision-making, taken in its most general definition. Whether you are personally most irritated by its presence in government or private industry, it's obvious that this is widespread.
Does AI look like it is going to reduce this waste? No, it looks like it's going to add to it.
> And especially once AIs are drawing from the writing of other AIs, which themselves are quoting AI (dark, I know), it might become quite difficult to detect.
No, it will be easier to detect, because the new AIs will learn to mimic earlier generations of AI, faults and all, rather than mimicking human writing, because that will score better on the training objective.
I'm going to assume the Dead Internet theory, i.e. that most text online is from spambots. It's not strictly true, as humans obstinately continue to use the Internet and pour content into it. But the Internet puts those humans on a level playing field with bots. And AI is basically perfect spambot material: it looks superficially different each time, meaning you can't remove it with simple patterns. So the training set, which is just scraped off the Internet, will be primarily composed of AI-generated material from less sophisticated systems. Human-written content will be a minority of the training set, meaning that no matter how good the optimizer or model architecture is, the system will spend most of its capacity remembering how to sound like a bad AI.
There's still a fitness function based on human interactions with reality: the revenue generated needs to keep up with the cost of running these language models. The revenue ultimately comes from the human economy concerned with housing, food, warmth, water, etc., so the language models need to cater to the concerns of the source of that economic activity, creating content useful enough to real people to cover the costs of production.
The real problem isn't AI content generation, it's a lack of content filtering. There has been a mountain of human garbage pushed out by people trying to make money which is already too big to navigate, AI will make the mountain much larger but the problem won't change.
We need AI discriminators that analyze content and filter out stuff that is poorly written, clichéd, trite, and derivative.
Related: "Inside CNET's AI-powered SEO money machine - Fake bylines. Content farming. Affiliate fees. What happens when private equity takes over a storied news site and milks it for clicks?"[1]
This kind of plagiarism detection catches near-verbatim mashups of small snippets. Has anyone seen similar ideas investigated in image generation?
I'm used to seeing papers show pairs as evidence that their models aren't copying: generated images alongside the closest images in the training set. But that wouldn't catch the visual analog of this rephrase-and-mashup pattern. Has anybody looked at that closely?
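For contrast, the text-side check being discussed can be sketched as word n-gram ("shingle") overlap. This is a minimal illustration under my own assumptions, not the detector the article used, and the function names are made up:

```python
def ngrams(text: str, n: int = 5) -> set:
    """All overlapping word n-grams ("shingles") in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(candidate: str, source: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams also present in the source.
    A high score suggests near-verbatim reuse of small snippets."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(source, n)) / len(cand)

src = "you can buy a gift card with a credit card at most major retailers"
print(overlap_score("yes you can buy a gift card with a credit card online", src))  # 0.75
```

The image analog would need a perceptual equivalent of a shingle (patch-level embeddings, say), which is exactly the open question being asked.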
Human exceptionalists hold up the 0.0001% of human content that isn't plagiarized cliché garbage and act like somehow we're different. If you got an AI to produce 100,000 works and took only the few best, they would be pearls just the same.
Managers are figuring out how much money they can save replacing human journalists with AI.
Computer-generated articles are routine for boilerplate like financial summaries, sports results, and police reports. But now they're creeping into more substantive articles.
This is silly. The "plagiarism" is a bunch of extremely common phrases describing basic facts, used by everyone all over the SEO-spam web.
Fortune, one of the "victims" of this "plagiarism", makes its money by subscribing to the New York Times et al., paraphrasing articles, and placing ads next to the paraphrases.
sdenton4 | 3 years ago:
https://thegradient.pub/othello/
theptip | 3 years ago:
The clickbait article is a formulaic format.
[1] https://www.theverge.com/2023/1/19/23562966/cnet-ai-written-...
mavu | 3 years ago:
What people call AI today is in fact not AI. It is a model that has been trained on input data, and generates prompted output based on that input.
In other words THEY ARE AUTOMATED PLAGIARISM MACHINES. THAT IS HOW ALL "AI" today work.
Truly dumb.
pdntspa | 3 years ago:
All these morons with a bone to pick are going to be the reason language becomes so obtuse for so many.
UncleEntity | 3 years ago:
Journalism will survive. Journalism will always survive.
Shitty content mill “journalists” should probably get the AI to pad out their resumes for them.