Tortured phrases: A dubious writing style emerging in science

[+] dang|4 years ago|reply

The paper is at https://arxiv.org/abs/2107.06751.

(We merged this thread and https://news.ycombinator.com/item?id=28108111)

[+] atrettel|4 years ago|reply

I have encountered something similar to this for a submission that I reviewed for a scientific journal. I will not list any names or give much detail past those generalities, but I pointed out that the authors were misusing a particular technical term. In my review I defined the term and explained it briefly. I asked the authors to revise their submission accordingly. The paper was not bad but the authors did not know English very well, so it was quite difficult to read. That was its main problem. However, when I received the revised submission, I noticed that the authors plagiarized my definition and explanation almost word for word (from my confidential review). I pointed this out to the editors and they said to just reject the paper with the stated reason being plagiarism, which I did. The journal ended up rejecting the article, but I discovered it a few years later in a different journal. The plagiarized section remained, but the authors swapped out a lot of my phrases for these kinds of "tortured phrases".

That said, the authors did not fabricate their research (as far as I can tell). They just did not know English well, so it was easier to just copy things that you know are phrased well than to learn to write English well. As the saying goes, do not attribute to malice what can be explained by ignorance or laziness. That does not excuse it but it makes it more understandable.

I agree with the article that this is probably just the tip of the iceberg. There are likely many more lesser evils being committed with similar tools that are just much more difficult to spot. I would not have noticed my particular example if I were not a reviewer for the paper, for example. It makes me wonder how big the problem really is.

[+] dkarl|4 years ago|reply

As a native English speaker, but not an academic researcher, I would have naively done the same.

To preface: I am speaking in ignorance of the mechanisms of credit and advancement in your field, so I'm undoubtedly overly harsh. I'm not even trying to be fair, because I admittedly don't have the knowledge to do so.

I know academia is a different world from private sector industry, but I would think if the point of a confidential review is for you to do anonymous work to improve someone else's credited work, and their work was improved by incorporating your feedback, you would be happy with the outcome, or else why are you participating in a confidential review process in the first place? There are different mechanisms for publishing words that you want credit for.

When someone incorporates my feedback on code or documentation word-for-word, I might in the worst case be suspicious that they are trying to get my approval without engaging with my criticism, but in most cases I'm flattered that they respect my idea enough to put their name on it. Although, in my world, putting your name on something is more about responsibility than credit. The command is called "git blame" and not "git credit", after all.

Wanting them to incorporate your feedback, but also wanting them to make some change to the wording to avoid plagiarism, smacks of how freshman essays are graded, not how real work gets done.

Like I said, I'm not trying to be fair and don't have the right background to be fair to you. I only speak up because I grew up in an academic family and know that there is a presumption that academic work is more idealistic, more altruistic, and less mercenary than private sector work, and I think it's worth pointing out when the reverse is true.

[+] turnersr|4 years ago|reply

In your review, did you suggest the definition and explanation that they used? In this situation, would an acknowledgment at the end have been enough? In my mind, it seems like you all had a conversation and the authors took up your suggestions as the reviewer.

[+] hdjjhhvvhga|4 years ago|reply

> The paper was not bad but the authors did not know English very well, so it was quite difficult to read. That was its main problem.

This seems to confirm my suspicion that these cases are not so much about AI-generated content but rather a result of machine translation.

[+] maydup-nem|4 years ago|reply

So you couldn't ask them to attribute this to you as an anonymous reviewer? And instead wanted them to spend their effort using "their own words" because "inexcusable" and what not? Man, are some people stuck up deep in their own ass. Yeah, a copy-paste isn't nice of them, sure, but are you one fragile snowflake.

[+] kome|4 years ago|reply

But why did you call them out for plagiarizing a private note, a confidential review? I mean... In the end they just used your wording to improve their paper, for a concept they already mastered. And you said the paper was good, otherwise. IMHO they acted ethically.

[+] tremon|4 years ago|reply

The solution seems simple though: do not publish in a language you do not master. Get a co-author or fellow researcher who does.

Isn't that what journals are supposed to be for? To help you reach a wider audience?

[+] MikeUt|4 years ago|reply

> the authors plagiarized my definition and explanation almost word for word (from my confidential review).

Is there any way the authors could have kept your definition, and somehow credited you, even anonymously? Because rephrasing definitions is the pinnacle of wasted effort, and leads to confusion - you are asking them to say what you said, but without using your words.

[+] FabHK|4 years ago|reply

Some of these tortured phrases are great. My favourites:

"flag to clamor" for signal to noise

"individual computerized collaborator" for PDA (personal digital assistant)

"haze figuring" for cloud computing

"information stockroom" for data warehouse

"focal preparing unit" for CPU

"discourse acknowledgement" for voice recognition

"mean square blunder" for MSE (mean square error)

"arbitrary right of passage" for random access

"arbitrary timberland" for random forest

"irregular esteem" for random value

ETA:

"notoriety examination" for sentiment analysis
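A list like this can double as a crude screening pass over a manuscript. Here is a minimal Python sketch, assuming a hand-built table copied from the pairs above; the name `find_tortured_phrases` is made up for illustration, and this is not the detection method used in the paper:

```python
import re

# Tortured phrase -> expected term, copied from the examples in this comment.
TORTURED = {
    "flag to clamor": "signal to noise",
    "haze figuring": "cloud computing",
    "information stockroom": "data warehouse",
    "focal preparing unit": "CPU",
    "discourse acknowledgement": "voice recognition",
    "mean square blunder": "mean square error",
    "arbitrary timberland": "random forest",
    "irregular esteem": "random value",
    "notoriety examination": "sentiment analysis",
}

def find_tortured_phrases(text):
    """Return (tortured phrase, expected term) pairs found in text, ignoring case."""
    lowered = text.lower()
    hits = []
    for phrase, expected in TORTURED.items():
        # Word boundaries keep the phrase from matching inside longer words.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            hits.append((phrase, expected))
    return hits

sample = "We train an Arbitrary Timberland model to lower the mean square blunder."
for phrase, expected in find_tortured_phrases(sample):
    print(f"{phrase!r} probably means {expected!r}")
```

A fixed table only catches phrases somebody has already spotted, so at best this flags papers for a human to read.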

[+] aaron-santos|4 years ago|reply

I enjoyed finding "counterfeit consciousness" for artificial intelligence. To me it evokes a kind of science fiction that's shown up occasionally on HN[1].

[1] https://qntm.org/mmacevedo

[+] abecedarius|4 years ago|reply

Reminiscent of https://en.wikipedia.org/wiki/Uncleftish_Beholding

[+] tikwidd|4 years ago|reply

Reminds me of a Google translation of "bass" to Chinese as "low frequency fish"

[+] neoCrimeLabs|4 years ago|reply

I'm very tempted to introduce tortured phrases at work for occasional humor. For example, who needs "continuous integration" when you have "ceaseless incorporation"? Sometimes it's nice to see if anyone reads my notes.

In all seriousness though, I experienced something similar at a Japanese-run American corporation as far back as the '90s. The problem was that Japanese executives and executive assistants who didn't know American tech jargon often accepted mangled suggestions from the spell-checker. A notorious example was the "Data Whorehousing" presentation, which somehow made it through several reviews and rehearsals before being presented to the entire American IT department at an all-hands meeting.

Clearly this made an impact as I remember it 23(ish) years later!

[+] etempleton|4 years ago|reply

I often wonder while reading an academic paper how the writing could be as hopelessly bad as it is.

This type of manipulation and plagiarism may be partially to blame, but the academic writing style has also gone completely off the rails to the point that half the journal articles being published today read as if written by some kind of paper writing AI robot even when I am quite certain that that isn't the case. And no, I am not talking about cases where the author is writing in a non-native language.

I have a theory that it may have to do with imposter syndrome and a need to sound smart. The author, fearing that they don't really belong and at any moment will be found out, therefore never making tenure, starts jamming academic sounding words where they don't belong and stretching sentences with commas and semi colons until the whole thing is just as insufferable to read as it was to write.

There is also the possibility that there are just a lot of terrible writers out there.

[+] zwaps|4 years ago|reply

I am sure this was not your intention or meaning, but please be aware that it is virtually impossible for a non-native speaker to write perfect English. English is a language you have to intuit. In contrast to other languages, it has very few fixed rules. Writing elegantly in English is most certainly an art form.

Of course, writing good science is hard enough for native speakers. It is very difficult for the vast majority of people on the planet - no matter how good their research.

And just so we are clear: Not everyone can afford professional editing services at every point in their career.

We meet in English under the premise that it allows for universal communication. In this, we accept that English natives are almost infinitely more privileged in writing, speaking, conferencing and networking. We also have to accept that the level of English proficiency varies, and - especially English - is easy to learn and so difficult to master.

[+] TheOtherHobbes|4 years ago|reply

There used to be bad writing contests for academics. One of the famous winners was Judith Butler's timeless[1]:

The move from a structuralist account in which capital is understood to structure social relations in relatively homologous ways to a view of hegemony in which power relations are subject to repetition, convergence, and rearticulation brought the question of temporality into the thinking of structure, and marked a shift from a form of Althusserian theory that takes structural totalities as theoretical objects to one in which the insights into the contingent possibility of structure inaugurate a renewed conception of hegemony as bound up with the contingent sites and strategies of the rearticulation of power.

You just can't argue with that.

[1] Ironically.

[+] raincom|4 years ago|reply

A friend submitted a paper to a journal in humanities. The reviewer said "his English is informal". In other words, these reviewers are asking for stilted English.

[+] hutzlibu|4 years ago|reply

"There is also the possibility that there are just a lot of terrible writers out there. "

Surely there are, and writing in a way that is easy to read and understand is an art in itself.

But I would agree that the main reason is probably the intention to sound smarter than they are. Whole scientific disciplines seem to live by that standard.

This is not limited to science, though. I recall a German poet (I think Heinrich Heine) said about his fellow poets:

You only fly so high like the swallow, that no one can actually hear your singing.

[+] yissp|4 years ago|reply

Good essay by Orwell that touches on this sort of thing https://www.orwellfoundation.com/the-orwell-foundation/orwel... I used to be guilty of writing this way and one of my high school English teachers recommended I read it. I've tried to take the message to heart ever since.

[+] quotemstr|4 years ago|reply

Older academic writing was frequently beautiful. See the classic "On Cooling the Mark Out" paper. Every sentence is a joy to read.

https://infofranpro.wdfiles.com/local--files/19520101-on-coo...

[+] gzer0|4 years ago|reply

This is anecdotal evidence at best, but it is worth considering. I know of several individuals who were able to complete their entire Master's thesis utilizing a combination of AI generated content (GPT-3) and a paraphrasing tool.

The generated text was well over 50 pages, completely bypassed all known content/plagiarism checks and was even included in the university's "exemplary examples". To this day, it is still there.

This is of significant concern as some of these GPT-3 based tools are now integrated within MS Word itself. Word 2021 allows for "add-ons", out of which I have noticed several third party content generation and paraphrasing tools.

[+] bjourne|4 years ago|reply

I really doubt you can computer generate a Master's thesis. Completing a Master's thesis at an accredited institution is a heck of a lot of work and even a cursory reading of a thesis by an examiner, supervisor, opponent, or other interested party would give the generated content away. Maybe if you get your degree from a diploma mill you could get away with it, but then your degree wouldn't be worth toilet paper anyway.

I've heard similar stories about generated PhD theses and they are even more implausible. The reason is that writing a thesis is much more than just producing a hundred pages or so of prose. Any university student can poop that out in a few weeks. The main job of a thesis is coming up with a research question, conducting an experiment or a study, and describing the results and how they fit into whatever niche of the scientific world you are working in.

[+] MengerSponge|4 years ago|reply

Does Poe's Law cover parody becoming real? Because BBSpot called this nearly 18 years ago: "Word 2004 to Pioneer AutoUnsummarize Feature" https://www.bbspot.com/News/2003/12/autounsummarize.html

[+] 13415|4 years ago|reply

Please include a link to these theses, because as it stands this anecdote sounds extremely implausible. I don't know what university you were at, but I've been at a few in Europe and at every one of them Master's theses were evaluated from start to end by several humans. GPT-3 is unable to produce even two pages of coherent text, let alone 50 pages good enough to be accepted as a Master's thesis in any discipline at any university I could think of (even the worst ones).

I can imagine that plagiarists use paraphrasing software quite extensively, though, and that it is a problem.

[+] throwawaygh|4 years ago|reply

> Master's thesis

I don't doubt this at all, and I have no doubt that GPT-3 with a bit of human editing can spit out something better than the lower third of masters students at corn row colleges.

Masters degrees are cash cows, which is why no one in unregulated industries cares about them. People in regulated/unionized industries also don't actually care; even educators, who at least nominally see intrinsic value in education, go to borderline diploma mills to get that union-mandated raise at minimal effort.

[+] lostlogin|4 years ago|reply

> third party content generation and paraphrasing tools.

Presumably this is an arms race against things like https://www.turnitin.com/

Empower students ‘to do their best, original work’ and this is what you get. Though what the alternative is, I have no idea.

[+] laurent92|4 years ago|reply

> I know of several individuals who were able to complete their Master’s thesis utilizing…

Doesn’t it stay published forever? Might be an embarrassment for someone later in their career.

On the other hand, even a rewritten chapter of Mein Kampf was accepted by a journal, after the old wording was replaced with newer terminology. Human reviews are hard. Maybe we should put computers in charge of reviewing papers; they’d recognize the work of AI quicker?

https://www.foxnews.com/us/academic-journal-accepts-feminist...

[+] phkahler|4 years ago|reply

Sounds like automated review of automatically generated papers. And people pay money for that...

[+] lelanthran|4 years ago|reply

> I know of several individuals who were able to complete their entire Master's thesis utilizing a combination of AI generated content (GPT-3) and a paraphrasing tool.

That's nice, how did it do the defense?

[+] segfaultbuserr|4 years ago|reply

I searched the keyword "SEO" and didn't find any match in the comments here, I'm surprised.

Anyone who has been a webmaster will immediately recognize this as a technique that has been extremely common in the blackhat SEO scene for decades, used by content farms everywhere. One just copies articles from somewhere else, replaces all words with dictionary synonyms to evade the search engine penalty, and fills the resulting websites with spam.
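To see why blind synonym replacement produces gibberish, here is a minimal Python sketch. The thesaurus entries are invented for illustration (chosen to reproduce substitutions quoted elsewhere in this thread), and `spin` is a hypothetical name, not taken from any real SEO tool:

```python
import re

# Hypothetical thesaurus: each word maps to a single "synonym".
# A real spinner would use a large dictionary; these entries are invented.
SYNONYMS = {
    "random": "arbitrary",
    "forest": "timberland",
    "signal": "flag",
    "noise": "clamor",
    "cloud": "haze",
    "computing": "figuring",
}

def spin(text):
    """Replace every word that has a thesaurus entry, ignoring context."""
    def swap(match):
        word = match.group(0)
        replacement = SYNONYMS.get(word.lower())
        if replacement is None:
            return word
        # Keep a leading capital so the output still looks like prose.
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)

print(spin("Random forest models run on cloud computing."))
# prints: Arbitrary timberland models run on haze figuring.
```

Because the swap works one word at a time, it has no way to know that "random forest" is a fixed technical term, which is exactly how the tortured phrases arise.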

Perhaps it's not as popular in the English world, but common in China, and is a standard tool included in all blackhat SEO software. And no, it doesn't work well, the output is gibberish too in spite of the language differences. Oh, and the article says:

> A high proportion of these papers came from authors in China.

Exactly what I expected. The spammers found a new market, apparently. It's sad to see that some scientific papers and journals are literally becoming blackhat SEO spam and content farms.

[+] funfunfunction|4 years ago|reply

This is called content spinning.

[+] ksaj|4 years ago|reply

I noticed this happening in other areas a few years ago, but with faked blogs. The titles and subjects would sound interesting, but then when you tried to read them, you'd need a specialized decoder to get through the utterly baffling word replacements. But they already got their ad revenue by the time you notice the article is complete gibberish.

The first one I found was about dog illnesses. They kept referring to dogs with phrases like "Your domesticated canine," and it was quite a chore trying to figure out most of the symptoms that they were listing. "Heart worms" was translated to "love snakes," which I thought was delightful.

[+] dimatura|4 years ago|reply

Yes, this may be a specific example of a more widespread phenomenon. There are certain websites out there that republish articles from well-established publications (e.g., New York Times) almost word for word, except that they are rife with synonym swaps that may or may not make sense in context, presumably to escape some kind of automated copy detection. Results can be amusing. For example, the copied article said "“Drukqs” acquired a blended essential response..." where the original said "“Drukqs” received a mixed critical response...".

[+] _kst_|4 years ago|reply

I've seen this kind of thing in articles posted to social media sites (Quora and Facebook, but it probably exists elsewhere).

I don't have a specific example at hand, but it's typically an article a few paragraphs long with really strange phrasing, so strange that it's not explainable by the author not knowing English well.

In a handful of cases, I've managed to find the original source. Common phrases are systematically replaced by ill-fitting synonyms.

I suspect the motivation is to avoid accusations of plagiarism (though I don't know what benefit the posters derive from doing this).

[+] armchairhacker|4 years ago|reply

Nowadays too many real blogs are padded with weird phrasing and sentences which don't really mean anything.

In this case, sometimes you get lucky and can actually find meaningful information between the padding. But sometimes you just read an article that takes 5 paragraphs and 500 words to say "we don't know".

[+] tasty_freeze|4 years ago|reply

I ran into something like this in an Amazon review once. I was looking for a book of transcriptions for the instrument I play, and two of the handful of reviews used the same awkward phrase: "music goals". I scratched my head and then realized what probably happened. The reviewers weren't native English speakers, they were being paid to write reviews, and they had gotten the wrong synonym. "music goals" was supposed to be "music scores".

[+] guyromm|4 years ago|reply

Back in 2004 or so, I was building a distributed CMS with the goal of creating artificial "link pyramids" with the purpose of SEO, which was a rather new thing at the time.

Content generation was one of our bottlenecks, and as Google was already rather successful at detecting duplicate content, we were looking for a way to "uniqify" posts that would be used to stuff sites intended for googlebot, but not humans.

One of the methods that worked was taking source English content, running it through Babelfish, the Altavista translator to French, Spanish or German, and then using the same method to translate it back to English.

This resulted in texts that did not make much sense to humans, were full of precisely such "tortured phrases" but which were considered unique by Google.
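The round trip can be imitated with a toy example. This is only a sketch in Python: the two word lists are invented stand-ins for a real translation service (not Babelfish's actual behaviour), and `translate` and `round_trip` are hypothetical names:

```python
# Toy word-level lexicons, invented for illustration. A real pipeline
# would call a machine translation service in both directions.
EN_TO_FR = {
    "voice": "voix",
    "recognition": "reconnaissance",
    "error": "erreur",
    "warehouse": "entrepôt",
}
# The way back picks a different English word for some entries,
# which is where the meaning drifts.
FR_TO_EN = {
    "voix": "voice",
    "reconnaissance": "acknowledgement",
    "erreur": "blunder",
    "entrepôt": "stockroom",
}

def translate(text, lexicon):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(lexicon.get(word, word) for word in text.split())

def round_trip(text):
    """English -> French -> English using the toy lexicons."""
    return translate(translate(text, EN_TO_FR), FR_TO_EN)

original = "voice recognition error in the data warehouse"
spun = round_trip(original)
print(spun)              # prints: voice acknowledgement blunder in the data stockroom
print(spun == original)  # prints: False
```

The output is no longer an exact copy of the input, so a naive duplicate check passes it, while a human reader sees the same kind of phrases the paper describes.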

[+] thisrod|4 years ago|reply

Authorship is the metric that scientists get paid for, so of course it has been thoroughly corrupted.

Fake papers and plagiarism are the most blatant form of corruption. They tend to come from certain, let's say, large countries with less developed scientific cultures. Those countries need to put an end to it, because the rest of us keep having to work harder to suppress the racist impressions that we're bound to form of colleagues who look and sound like the cheats.

In more traditional scientific countries, the corruption is more subtle. Today, many groups publish every paper with half a dozen authors, and no indication of what each of them contributed. This enables the professors who run those groups to manipulate authorship more or less as they please, and have total control over who gets to have a career in science. It turns out that absolute power corrupts senior scientists as absolutely as it does other people.

No doubt there are more clever ways to game the system, that I haven't noticed. As long as million dollar grants and first-world citizenship keep being doled out for something as contrived as scientific paper authorship, corruption is inevitable.

[+] FabHK|4 years ago|reply

And the journal involved, Microprocessors and Microsystems, is an Elsevier journal. Huge surprise. I am glad the publisher earns their outrageous fees by careful screening, peer-review, and editing of submitted manuscripts. /s

Ceterum censeo Elsevier(um) esse delendum.

[+] Sebb767|4 years ago|reply

> [...] the editor of Microprocessors and Microsystems began having concerns about the integrity and rigour of peer review for papers that had been published in some of the journal’s special issues.

> The journal’s publisher, Elsevier, launched an investigation. This is still under way, but in mid-July the publisher added expressions of concern to more than 400 papers that appeared across six special issues of the journal.

I hate to open up this topic and I hate to pick on people that are trying to fix their mistake even more, but oh boy. Elsevier has been a pain in the butt for universities and researchers alike. They leech money from both sides of the community, they sue people trying to bring science forward and they gate scientific success. And literally their only reason to keep existing was to prevent exactly this.

I've never been a big fan of the current scientific publishing model. But Elsevier is a top publisher. It's pretty damning that they have one - highly overpaid - job and they don't even do it.

[+] ipsum2|4 years ago|reply

A high profile case (on the internet) similar to the one described in the article is when Siraj Raval plagiarized a paper on quantum ML and made some amusing replacement phrases:

complex Hilbert space -> Complicated Hilbert space

Quantum gate -> Quantum door

https://www.theregister.com/2019/10/14/ravel_ai_youtube/

[+] heurisko|4 years ago|reply

I was reading an article on Nature and noticed their definition of "woman" didn't make sense. Tortured phrases aren't confined to plagiarism avoidance.

> Unfortunately, fibroids are just one of many understudied aspects of health in people assigned female at birth. (This includes cisgender women, transgender men and some non-binary and intersex people; the term ‘women’ in the rest of this editorial refers to cis women.)

The article says it only refers to "Cis women" (presumably "cisgender women"), however the article continues to talk about rugby and brains, in which case the word "woman" would not only refer to "cisgender" women, but also to those people who identify as non-binary or transgender men, as surgery and hormone therapy (if that is undertaken by the individual) won't change brain axons, or the person's physical stature.

The article then talks about "male animals", not "animals assigned male at birth". There's no explanation given why animals are not similarly "assigned" a sex.

https://www.nature.com/articles/d41586-021-02085-6

[+] qwerty456127|4 years ago|reply

> software that attempts to disguise plagiarism

AFAIK today it has become necessary to "disguise plagiarism" even when you are not plagiarizing anything because bullshit "anti-plagiarism" software would detect many phrases similar to what somebody else already used. I believe the war on plagiarism brings little good in exchange for the hassle.

[+] FabHK|4 years ago|reply

Got to agree with the conclusion of the paper:

> In our strong opinion, the root of the problems discussed in this work is the notorious publish or perish atmosphere (Garfield, 1996) affecting both authors and publishers. This leads to blind counting and fuels production of uninteresting (and even nonsensical) publications.

[+] Animats|4 years ago|reply

This is a major failure of Elsevier.

Here's "Microprocessors and Microsystems."[1] This is supposed to be about embedded systems, which is generally a no-bullshit field. I'd never heard of this journal. People read Electronic Design, EE Times, "Embedded.com", maybe Control Systems Journal, etc. Those have either articles about how to do something, or "why what we're selling is great" articles.

Now look at the article titles in Microprocessors and Microsystems.[2] Here are the first three.

- COPS: A complete oblivious processing system

- A perceptron-based replication scheme for managing the shared last level cache

- Efficient underdetermined speech signal separation using encompassed Hammersley-Clifford algorithm and hardware implementation

Now those might be legitimate, although what they're doing in an embedded systems journal isn't clear. They're all behind a paywall, so it's hard to tell if they're any good.

"Oblivious processing" is a security concept. That belongs in a journal on security and encryption, where the crypto people will know what holes to look for. (Microsoft was doing work in this area in 2013, but I don't think a product emerged. If you can make it work, some cloud computing company can use it.)

Cache management belongs in a journal on CPU design, where people who have struggled to make caches work will take a look. There are people using perceptrons for this, which makes sense; a cache has to guess which things will be reused. (If this works well, someone should be trying it in web caches such as NGINX to improve cache hit rates.)

Signal separation is an active field, but this isn't a journal where you'd expect to find articles on it. Wikipedia has a good article on signal separation. The history of that article indicates attempts to sneak in citations to sketchy articles.[3] No idea if the Hammersley-Clifford algorithm is even relevant. (If it's a significant advance, there's commercial value in this in improving audio quality for conferencing systems.)

So these papers were all sent to a journal where the odds of getting published are good, and the odds that the editors have no idea about the subject matter are high.

Why is Elsevier even publishing this journal?

[1] https://www.sciencedirect.com/journal/microprocessors-and-mi...

[2] https://www.sciencedirect.com/journal/microprocessors-and-mi...

[3] https://en.wikipedia.org/w/index.php?title=Signal_separation...

[+] jszymborski|4 years ago|reply

As someone who has had to write technically in a second-language (French, funding agencies in Quebec), this rings particularly true.

Luckily, I'm fluent enough to recognise the particularly egregious examples, but finding good translations for technical words is hard!

One example that comes to mind is when trying to translate the phrase "data feed" which came back as "alimentation données" which ostensibly means "animal feed data".

If you're looking for a lot of English-to-French translations of technical terms, check out the theses from any English university in Quebec (McGill, Concordia, etc.). They're made public online [0]. Can't vouch for the quality as I'm sure there are plenty that just use Google Translate, but everyone I know has their abstract edited by a francophone in their field.

A good way to validate translated technical terms is to just give them a quick internet search on e.g. DuckDuckGo or Semanticscholar.

[0] McGill's is https://escholarship.mcgill.ca/

[+] doubtfuluser|4 years ago|reply

Maybe a future direction would be to train new models to identify plagiarism by training on this information. Use "non-matching" back-translations for training classifiers. It’s the typical cat-and-mouse game again, I guess.

[+] tarboreus|4 years ago|reply

Or someone could...read the papers.

256 comments