My pet theory is similar to the training set hypothesis: em-dashes appear often in prestige publications like The Atlantic, The New Yorker, and The Economist, which are considered good writing. Being magazines, they accumulate a lot of articles over time, reinforcing the style. They're also the sort of thing an RLHF rater will think is good, not because of the em-dash but because the general style is polished.
One thing I wondered is whether high-prestige writing is encoded into the models, but it doesn't seem far-fetched that there are various linkages inside the data to say "this kind of thing should be weighted highly."
It also seems that LLMs are using them correctly — as a pause or replacement for a comma (yes, I know this is an imprecise description of when to use them).
Thanks to LLMs I learned that using the short binding dash everywhere is incorrect, and I can improve my writing because of it.
This is mine as well, with the addition of books. If someone wanted to train a bot to sound more human, they would select data that is verifiably human-made.
The approachable tone of popular print media also preselects for the casual, highly-readable style I suspect users would want from a bot.
I think you're correct. The first time I encountered (and recognized) an em-dash in someone's writing was in middle school, and the person who wrote it was someone I considered to be academically superior to myself. I noticed, though, that a lot of people in the same "smart kids" group would use them, almost as if they had worked together on their papers. Maybe they were just reading different material, but it definitely came across as: this will make my writing "look smart".
I guess in the past if you'd shown me a passage with em dashes I'd say it looks good because I associate it with the New Yorker and Economist, both of which I read. Now I'd be a bit more meh due to LLMs.
It’s a real pity to me that em-dashes are becoming so disliked for their association with AI. I have long had a personal soft spot for them because I just like them aesthetically and functionally. I prided myself on searching for and correctly using em, en, and regular dashes, had a Google Docs shortcut for turning `- - -` into `—`, and more recently created an Obsidian auto-replacement shortcut that turns `-em` into `—`. Guess I’ll just have to use it sparingly and keep my prose otherwise human.
I feel you... For 30+ years of my life I prided myself on writing without typos and other mistakes (without autocorrect), using lots of bullet points, dashes, and words such as "delve into" or "underscore".
Now I find myself intentionally adding typos and other msitakes, and using less sophisticated language, just to not be accused of using AI.
Part of it is the guilt-by-association with the other bad writing habits of LLMs, but I think a lot of it is just that LLMs genuinely overuse them, and that homogeneity is grating just like it's grating when you notice a text reuses a particular noticeable word or whatever. As a fellow em-dash user, I have sometimes noticed myself overusing them too, and revised accordingly, starting well before the proliferation of this particular cancer.
So I think you can keep using em-dashes without being associated with LLMs as long as you reserve them for particularly effective/tasteful occasions.
In my mind, their rightful place is transcription of written speech where the speaker pauses, and either inserts an island idea, or changes course. The comma doesn't suffice, because it's bridging an initial idea with expounding on the same idea. But so many times in written text I see it abused, lazily employed, because the author used a sentence fragment for effect, or wanted to amp up the pause and drama when a comma or, hell, even a semi-colon would have served the purpose better.
The advent of the generic AI writing style has had one good effect on my own work: making me take an unflinching look at my own laziness in writing. Now I tend to clean things up while at the same time try to inject some personality in order to NOT be dismissed as AI.
The em dash is just one of a group of traits that make something obviously written by a bot. If you use em dashes in conjunction with good writing then nobody will give a shit.
Let’s spread the word until everyone fancy uses them, and then those who criticize text for coming from LLMs will be ridiculed by our ridiculous skills.
According to the CEO of Medium, the reason is that their founder, Ev Williams, was a fan of typography and asked that their software automatically convert two hyphens (--) into a single em-dash. And since Medium was used as a source of high-quality writing, he believes AI models picked up a preference for em-dashes from it.
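That kind of typographic autocorrect rule is simple to picture. Here's a minimal sketch in Python (purely illustrative; I haven't seen Medium's actual implementation, and the function name is made up):

```python
import re

def smart_dashes(text):
    """Toy autocorrect: collapse a double hyphen, along with any
    spaces around it, into a single em-dash character (U+2014)."""
    return re.sub(r"\s*--\s*", "\u2014", text)

print(smart_dashes("Typography matters -- or so Ev thought."))
# Typography matters—or so Ev thought.
```

Note that real editors usually also close up the surrounding spaces, as above, which is why Medium-style prose has the unspaced em-dash look.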
I would think the most obvious explanation is that they are used as part of the watermark to help OpenAI identify text - i.e. the model isn't doing it at all, but a final-pass process is adding statistical patterns on top of what the model actually generates (along with words like 'delve' and other famous GPT signatures).
I don't have evidence that that's true, but it's what I assume and I'm surprised it's not even mentioned as a possibility.
When I studied author profiling, I built models that could identify specific authors just by how often they used very boring words like 'of' and 'and', given enough text. So I'm assuming that OpenAI plays around with some variables like that, which would be much harder for humans to spot, but probably uses several layers of watermarking to make it harder to strip, which results in some 'obvious' ones too.
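The function-word approach is easy to sketch: build a vector of relative frequencies of a few "boring" words per author and compare the vectors. A toy version in Python (illustrative only, not the commenter's actual models):

```python
from collections import Counter
import math

# A handful of "boring" function words; real stylometry uses many more.
FUNCTION_WORDS = ["of", "and", "the", "to", "in", "a"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Euclidean distance between two author profiles;
    smaller means more stylistically similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

sample_a = "the cat and the dog sat in the garden of the house"
sample_b = "a quick walk to town and a stop in a cafe"
print(distance(profile(sample_a), profile(sample_b)))
```

With enough text, these frequency vectors become surprisingly author-specific, which is exactly why they also make a plausible watermark channel.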
Obvious watermarking that consistently gets a lot of hate from vocal minorities (devs, journalists, etc.) would probably be simply removed for the benefit of those other layers you mention.
But the idea of watermarking layers is fascinating (and extremely likely to exist), thanks!
Historically I would see far more em-dashes in capital "L" literature than I would in more casual contexts. LLMs assign more weight to literature than to things like reddit comments or Daily Mail articles.
I think this is most of it. The most obvious sign of AI slop is mismatched style with the medium. People are posting generated text to Reddit which reads like a school essay or linkedin inspirational post. Something no one did before. So even though the style is not unprecedented, it’s taken out of its original context.
I think the more correct question is why humans don't use em dashes in the first place while LLMs do all the time. And the short answer to that is, because it's Unicode stuff.
Regular computers for human use still only support ASCII in the US, or ISO-8859-1 in the EU, to this day, and Unicode-reliant East Asian users turn off Unicode input modes before typing English words, leaving the Asian part mostly pure Unicode and the alphanumeric part pure ASCII. So Unicode-ASCII mixed text is just odd by itself. This in turn makes use of em dashes odd.
Same with emojis. LLMs generate Unicode-mapped tokens directly, so they can vocalize any characters within full Unicode ranges. Humans with keyboards(physical or touchscreen) can mostly only produce what's on them.
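The keyboard argument above can be made concrete: an em-dash, like an emoji, lives outside the ASCII range that a plain US keyboard layout produces directly. A tiny Python check (hypothetical helper name):

```python
def non_ascii_chars(text):
    """Characters outside the ASCII range (code point > 127),
    i.e. ones a plain US keyboard layout has no key for."""
    return {ch for ch in text if ord(ch) > 127}

# The em-dash (U+2014) and the emoji both show up; nothing else does.
print(non_ascii_chars("LLMs love the em-dash \u2014 and emoji \U0001F600"))
```

A model emitting tokens straight from a Unicode vocabulary pays no such input cost, which is the asymmetry the comment is pointing at.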
> real humans who like em-dashes have stopped using them out of fear of being confused with AI.
Yeah, this is me. I've always liked good type and typography. 5 or 6 years ago I added the em-dash to my keyboard configs to make typing it convenient - mostly because I just think it looks nicer. But lately I don't use it much because... AI.
However, in recent weeks someone accused an HN post of mine as being from a bot, despite the fact I used a plain old hyphen and not an em-dash. There was nothing in the post which seemed AI-like except possibly that hyphen. At the time, I realized that person probably just couldn't tell a hyphen from a real em-dash. So maybe that means I have to not use any dash at all.
The "book scanning" hypothesis doesn't sound so bad — but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is some subtle differences in line length.
You'd think context-less OCR would prefer interpreting it as a simple hyphen, since that's the most common dash. Seems unlikely any bias would go the other way.
I have always found this complaint quite odd. Em-dashes are great. I use them all the time.
Never spent too much time thinking about em-dashes. Writers I like probably use them all the time—again, never really thought about it.
There are many other language model artifacts that are genuinely shite and are worth criticizing. Though, come to think of it, they have been getting stamped out with each model iteration. I used to spend a lot of time trying to get models to refrain from words like "crucial".
What I do find strange is how the latest SOTA models appear to write with contractions by default, which began sometime in the past year. Anthropic models, in particular.
This has always seemed intuitively obvious to me. I use a lot of em dashes... because I read a lot. Including a lot of older, academic, or more formally written books. And the amount used in AI prose has never struck me as odd for the same reason. (Ditto for semicolons.)
The truth is ... most people don't read much. So it's not too surprising they think it looks weird if all they read is posts on the internet, where the average writer has never even learned how to make one on the keyboard.
Delve on the other hand, that shit looks weird. That is waaay over-represented.
"If AI labs wanted to go beyond that, they’d have to go and buy older books, which would probably have more em-dashes."
Actually, they wouldn't have to go and buy those old books: the texts are already available copyright-free, since copyright generally expires 70 years after the author's death (and in the USA, any book published before 1929 is now in the public domain), making the full texts of old books easy to find on the internet!
Very interesting topic. I also wonder why other signs of AI writing, such as negative parallelism ("It's not just X, it's Y"), are preferred by the models.
Also, I wrote a small extension that automatically replaces ChatGPT responses with em dashes with alternative punctuation marks: https://github.com/nckclrk/rm-em-dashes
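I haven't read that extension's source, but the core of such a replacement could look something like this sketch (hypothetical function, not the actual rm-em-dashes code):

```python
import re

def strip_em_dashes(text):
    """Swap each em-dash (U+2014), plus any surrounding spaces,
    for a comma and a space; a crude stand-in for the extension's
    punctuation rewrite."""
    return re.sub(r"\s*\u2014\s*", ", ", text)

print(strip_em_dashes("The model\u2014as usual\u2014overdoes it."))
# The model, as usual, overdoes it.
```

A real version would need more care, e.g. an em-dash before a capitalized clause might read better as a period or semicolon than a comma, but the single-regex pass covers the common parenthetical case.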
What we also learned after GPT-3.5 is that, to circumvent the need for new training data, we could simply resort to existing LLMs to generate new, synthetic data. I would not be surprised if the em dash is the product of synthetically generated data (perhaps forced to be present in this data) used for the training of newer models.
I am no grammarian, but I feel like em-dashes are an easy way to tie together two different concepts without rewriting the entire sentence to flow more elegantly. (Not to say that em-dashes are inelegant, I like them a lot myself.)
And so AI models are prone to using them because they require less computation than rewriting a sentence.
This is sort of my thinking too. It's finding next token once the previous ones have been generated. Dashes are an efficient way to continue a thought once you've already written a nearly complete sentence, but it doesn't create a run-on sentence. They're efficient in the sense that they allow more future grammatically correct options even when you've committed to previous tokens.
As someone who used em-dashes extensively before LLMs I can only hope (?) some of myself is in there. I really liked em-dashes, but now I have to actively avoid them, because many people use them as a marker to recognize text that has been invented by the stochastic machine.
Another reason that I think contributes, at least partially, is that other languages use em-dashes. Most people use LLMs in English, but that's not the only language they know, and many other languages have pretty specific rules and uses for em-dashes. For example, I see em-dashes regularly in local European newspapers, and I would expect those to be written by a human for the most part, simply because LLM output is not good enough in smaller languages.
I've been using em-dashes in my own writing for years and it's annoying when I get accused of using AI in my posts. I've since switched to using commas, even though it's not the same.
You should tell the people that are accusing you to go fuck themselves and you should keep writing the way you like. You were here before AI, don't let it dictate how you behave.
I wonder what happens to all that 18th-century book-scanning data. I imagine it stays proprietary, and I've heard a lot of the books they scan are destroyed afterwards.
I’m now reading Pride and Prejudice (first edition released in 1813) and indeed there are many em dashes. It also includes language patterns the models didn’t pick up (vocabulary, "to morrow" instead of "tomorrow").
In written Russian, lines of dialogue are prefixed with an em-dash instead of being double-quoted as they would be in a typical English book:
Instead of
"The time has come," the Walrus said,
"To talk of many things:"
... it would be spelled as
— The time has come, — the Walrus said,
— To talk of many things:
I wonder how much Russian-language content was in the training data.
I always figured it was because of training on Wikipedia. I used to hate the style zealots (MOStafarians in humorous wiki-jargon) who obsessively enforced typographic conventions like that. Well I still hate them, but I'm sort of thankful that they inadvertently created an AI-detection marker. I've been expecting the AI slop generators to catch on and revert to hyphens though.
Your readers won't care about the dashes as long as the texts read like they have human origins and you have something to say.
Option + Shift + “-” = — (on macOS)
https://youtu.be/1d4JOKOpzqU?si=xXDqGEXiawLtWo5e&t=569
It wasn’t just Ev - I can confirm that many of us were typography nuts ;)
Marcin for example - did some really crazy stuff.
https://medium.design/crafting-link-underlines-on-medium-7c0...
That explains a lot…
Apple has done it across their systems for ages. Microsoft did it in Word for a long time too.
It was more or less standard on any tool that was geared towards writers long before Medium was a thing.