Ultimately it shouldn’t be too surprising that a machine that works by generating the most statistically likely text generates text that’s statistically identical to human-generated text
> the machine that works by generating the most statistically likely text
You've just described a “base model” (or pre-trained model), but later training stages (RLHF, GRPO, whatever secret sauce model makers use) induce a strong bias in the output.
Also, being “statistically identical to human-generated text” doesn't mean it's unrecognizable, because human-generated text exhibits many distinct clusters (you're not texting your friends in the same language you'd write a book in), and an LLM can, and in practice does, use language that doesn't match the tone a human expects in a given context (like when bots write LinkedIn-worthy posts in a reddit comment section). The “average human-looking text” is as unnatural to us as a “synthetic average human” with one testicle and half a vagina would be.
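The “generating the most statistically likely text” mechanism can be illustrated with a toy bigram model (a deliberate simplification: real LLMs are neural networks over huge vocabularies, and the counts below are made up for the sketch):

```python
import random

# Toy bigram counts standing in for a learned distribution.
bigram_counts = {
    "the": {"cat": 3, "dog": 1},
    "cat": {"sat": 2, "ran": 1},
}

def next_token(prev: str, greedy: bool = True) -> str:
    """Emit the next token given the previous one."""
    counts = bigram_counts[prev]
    if greedy:
        # "Most statistically likely" = argmax over the distribution.
        return max(counts, key=counts.get)
    # Otherwise sample proportionally to the counts.
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

print(next_token("the"))  # → cat
```

Whether the output distribution matches the training distribution depends on the decoding settings (greedy vs. sampling, temperature) as much as on the model itself, which is part of why post-trained models can drift away from human-looking statistics.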
I'm not so sure I buy that. AI written text is fairly obvious to good writers with exposure to LLM output. Is it a case where it's sort of an average of writing styles, but that average is not human and thus humans can detect it?
AI writing you can recognize as AI writing is obvious. Newer models are better about this and the line will only get more blurry. Here's a benchmark where good writers make the assessment rather than different LLMs ranking each other: https://surgehq.ai/leaderboards/hemingway-bench
The top models are also the latest:
Gemini 3.1 Pro: still a bit of a gremlin, but will probably stay on top until the other model makers go xkcd 810 and target this benchmark
Gemini 3 Flash: current favorite of writers using it as a helper for its speed and decent prompt following
I've never seen the word "delve" show up with such frequency in the pre-AI era, but now it's an overwhelmingly large signal of LLM-generated text, so I'm not sure where that came from. Ditto for vomiting emojis everywhere.
I have heard that the human trainers for early LLMs were overwhelmingly from West Africa, so some of the word choices reflect that, including a preference for the word delve. This means that humans from that part of the world are now frequently, and unfairly, suspected of being AI.
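A crude version of the “delve as a signal” idea is a marker-word frequency check. The marker list below is a hypothetical example drawn from anecdotes like those in this thread, not a validated detector:

```python
import re
from collections import Counter

# Hypothetical marker words; "delve" comes from the thread, the
# rest are common anecdotal LLM tells, not a vetted list.
MARKERS = {"delve", "tapestry", "multifaceted", "nuanced"}

def marker_rate(text: str) -> float:
    """Fraction of word tokens that are suspected LLM marker words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[m] for m in MARKERS) / len(tokens)

print(marker_rate("Let us delve into this nuanced topic"))
```

This also shows why such signals are weak: they flag any human who legitimately favors those words, which is exactly the unfair-suspicion problem described above.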
Kye|2 days ago
https://books.google.com/ngrams/graph?content=delve&year_sta...
blahaj|1 day ago
Mirroring real human text is only the basis of training. Afterwards they get aligned a.k.a. lobotomized.