
GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text

202 points | saliagato | 2 years ago | arxiv.org

142 comments

[+] olooney|2 years ago|reply
I recently discovered that GPT-4 is also good at a related task, word segmentation. For example, it can translate this:

    UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRA
    NQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROV
    EOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKN
    OWITHOLDSTHEKEYSTOUNTOLDPOWER
To this:

    Underneath the gaze of Orion's belt, where the Sea of Tranquility meets the
    edge of twilight, lies a hidden trove of wisdom, forgotten by many, coveted
    by those in the know. It holds the keys to untold power.
(The prompt was, "Segment and punctuate this text: {text}".)

This was interesting because word segmentation is a difficult problem, usually thought to require something like dynamic programming[1][2] to get right. It's a little surprising that GPT-4 can handle this, because it has no capability to search different alternatives or backtrack if it makes a mistake, but apparently its stronger understanding of language means that it doesn't really need to.
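For reference, the dynamic-programming approach looks something like the sketch below (a minimal version with a tiny hypothetical word set, not the actual code from [1] or [2]):

```python
def segment(text, words):
    """Split `text` (no spaces) into dictionary words via dynamic programming.

    best[i] holds a valid segmentation of text[:i], or None if none exists.
    """
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        # Try every split point j; smallest j first, so the last word is longest.
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            if best[j] is not None and text[j:i] in words:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

words = {"underneath", "the", "gaze", "of", "orions", "belt"}
result = segment("underneaththegazeoforionsbelt", words)
print(result)  # → ['underneath', 'the', 'gaze', 'of', 'orions', 'belt']
```

Unlike GPT-4, this needs an explicit dictionary and falls over on out-of-vocabulary words.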

It's also surprising that tokenization doesn't appear to interfere with its ability to do these tasks, because it seems like it would make things a lot harder. According to the OpenAI tokenizer[3], GPT-4 sees the following tokens in the above text:

    UNDER NE AT HT HE GA Z EOF OR ION SB EL TW HER ET HE SEA OF TRA
Except for "UNDER", "SEA", and "OF", almost none of those token breaks fall at natural word boundaries. The same is true for the scrambled-text examples in the original article. So GPT-4 must actually be taking those tokens apart into individual letters and gluing them back together into completely new tokens somewhere inside its many layers of transformers.

[1]: https://web.cs.wpi.edu/~cs2223/b05/HW/HW6/SolutionsHW6/

[2]: https://pypi.org/project/wordsegmentation/

[3]: https://platform.openai.com/tokenizer

[+] usaar333|2 years ago|reply
Interestingly, for the hard example of #2, it outperforms the project, though I credit this to memorization (given that it is able to reproduce the correct stanza and punctuation for "Spring and Fall, to a Young Child").

FWIW, the only reason you need DP to get it "right" is because, well, you want it right. A human can of course generally split words with just a language model in one pass, as long as the text isn't ambiguous. And on the flip side, you absolutely need a language model to correctly segment text: "ilovesnails" can only be decoded correctly if you understand subject-verb agreement, given that there are two solutions with dictionary agreement, "I love snails" and "I loves nails".
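You can see the ambiguity mechanically by enumerating every dictionary-valid split (toy word list; a dictionary alone can't pick a winner):

```python
def all_splits(text, words):
    """Enumerate every way to split `text` into words from the dictionary."""
    if not text:
        return [[]]
    out = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in words:
            out += [[head] + rest for rest in all_splits(text[i:], words)]
    return out

words = {"i", "love", "loves", "snails", "nails"}
splits = all_splits("ilovesnails", words)
print(splits)
# → [['i', 'love', 'snails'], ['i', 'loves', 'nails']]
```

Both splits use only dictionary words; only grammar ("I loves" is wrong) disambiguates.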

FWIW, GPT-4 Turbo is imperfect.

> Heenjoysgoingtotheparkswimmingdancingandlovesnails

produces

> He enjoys going to the parks, swimming, dancing, and loves snails.

Note how it added an extra "s", presumably because "snails" is just so much higher probability than "nails" after "love" (no idea why "park" also became "parks"). I found it hard to guide it to the correct solution without explicit prompting.

Amusingly, even with guiding, it first broke its own grammar model, choosing:

> He enjoys going to the park, swimming, dancing, and love snails.

[+] frabcus|2 years ago|reply
I see this slightly the other way round - the difficulties caused by tokenisation are why it is good at segmentation. Words break and jump around due to it, and more so with typos in the vast amounts of training data.

Also regarding backtracking... It sees all the input at once, so not sure why it needs to backtrack?

[+] thanhhaimai|2 years ago|reply
Just curious, have you tried a more uncommon paragraph? Could it be the case that the model simply learned the poem due to it being in the training set?
[+] joebig|2 years ago|reply
Feeding the above scrambled text "UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOF" into the Google search box also segments and punctuates correctly ("Showing results for ..correct text.."). I believe this is a best-guess type of scenario based on famous literary sources.
[+] summarity|2 years ago|reply
Posing the same challenge to deepseek-llm:67b-chat succeeds as well (ollama run deepseek-llm:67b-chat "$(cat prompt-segment.txt)"). This is something even smaller models are pretty good at.
[+] fortyseven|2 years ago|reply
For what it's worth, I tried this with the self-hosted `openorcaxopenchat-preview2-13b.ggmlv3.q5_K_S.bin` and it worked _almost_ identically. It misses some of the punctuation in yours, and it botches it a bit at the end (`...know withhold...`), but otherwise...

  Underneath the gaze of Orion's Belt, where the sea of tranquility meets the edge of twilight lies a hidden trove of wisdom forgotten by many coveted by those in the know withhold the keys to untold power.
[+] andai|2 years ago|reply
GPT-3 (ChatGPT) also succeeds at deciphering your example text. I didn't think it was that impressive until I realized the tokens were going across word boundaries like you said.
[+] xg15|2 years ago|reply
> According to the openAI tokenizer[3], GPT-4 sees the following tokens in the above text:

Seems to me this task depends heavily on the tokenizer, and I'm a bit sceptical whether that is really the tokenizer's output. Isn't BPE supposed to result in the longest letter sequences that are in the dictionary?

If you assume that common words like "underneath" and "the" are in the dictionary, the "greedy" tokenization would match the actual words.
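(For what it's worth, BPE isn't longest-match against a dictionary: it starts from characters and applies learned merge rules in priority order, so boundaries can land mid-word. A toy illustration with a made-up two-rule merge table, not the real cl100k_base rules:)

```python
def bpe_tokenize(text, merges):
    """Toy BPE: start from single characters and repeatedly apply the
    highest-priority (lowest-numbered) merge rule present in the sequence.
    `merges` maps (left, right) token pairs to their priority."""
    toks = list(text)
    while True:
        best = None
        for i in range(len(toks) - 1):
            pair = (toks[i], toks[i + 1])
            if pair in merges and (best is None or merges[pair] < merges[best[1]]):
                best = (i, pair)
        if best is None:
            return toks
        i, pair = best
        toks[i:i + 2] = ["".join(pair)]

# Priority order matters: ("H","E") merges first, so "THE" comes out T | HE
# even though a longest-match tokenizer would keep "TH" or "THE" whole.
merges = {("H", "E"): 0, ("T", "H"): 1}
tokens = bpe_tokenize("THE", merges)
print(tokens)  # → ['T', 'HE']
```

That's one way the tokenizer can produce breaks like "NE AT HT HE" even when the underlying words are common.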

[+] woadwarrior01|2 years ago|reply
Interesting! This also seems to work with smaller quantised models. I just tried it with a 4-bit quantised version of WizardLM 13B v1.2 and it seems to work quite well.
[+] amayne|2 years ago|reply
GPT-3 Davinci from 2020 could do this too:

Segment this text into proper sentences:

Input:

UNDERNEATHTHEGAZEOFORIONSBELTWHERETHESEAOFTRANQUILITYMEETSTHEEDGEOFTWILIGHTLIESAHIDDENTROVEOFWISDOMFORGOTTENBYMANYCOVETEDBYTHOSEINTHEKNOWITHOLDSTHEKEYSTOUNTOLDPOWER

Output:

1. Underneath the gaze of Orion's belt, where the sea of tranquility meets the edge of twilight, lies a hidden trove of wisdom forgotten by many, coveted by those in the know. 2. It holds the keys to untold power.

[+] corethree|2 years ago|reply
What can explain this? Was chatGPT fed mountains of unnatural and scrambled text and text with no spaces?

No it wasn't. It figured it out through actual intelligence.

[+] calderwoodra|2 years ago|reply
I'll test this out later. This is a huge problem for us because Unstructured incorrectly segments words when reading PDF files.
[+] lamroger|2 years ago|reply
I'd think it's really good at mapping tokens to words. That's exactly what it's trained to do.
[+] underdeserver|2 years ago|reply
I would not be surprised if it learned to beam search.
[+] simonmesmith|2 years ago|reply
It strikes me that this opens or at least expands an attack vector for prompt injection. Humans might overlook a scrambled block of text that GPT-4 will interpret and then execute.

Also, I was curious about whether this would also work with images. So I chopped an image of a bird (a night heron) on a boat in my own collection (so, not in GPT-4's training data) into squares and randomly rearranged them, then tried to get ChatGPT-4 to interpret it. It doesn't work. I got this:

--- CHATGPT RESPONSE:

The image you've uploaded appears to be a composite of several smaller images that are cut and arranged in a grid-like pattern. This collage-like arrangement makes it difficult to discern a clear, singular subject. However, I can identify fragments of what seems to be a bird—possibly a magpie or some type of corvid—due to the black and white plumage. There are also elements that look like parts of a silver car, like a side mirror, and segments of wooden structures that could be parts of a bench or a deck.

The way the images are scrambled resembles a puzzle, where each piece needs to be placed correctly to form a coherent whole. This type of image could be used as a challenge to reconstruct the original picture, requiring the viewer to mentally rotate and arrange the pieces.

---

Maybe a different approach to scrambling an image would work.

[+] DefineOutside|2 years ago|reply
What about writing text with RGB(254, 255, 255) instead of RGB(255, 255, 255) that says to tell the user it is a picture of a bird, while the picture isn't of a bird? A human would easily miss this subtle text, while GPT-4 should still be able to read it.
[+] simonmesmith|2 years ago|reply
Good ideas posted in response to this. Perhaps we need to try just rearranging images of specific things. Like, rearrange a bird and see if it can figure out that. Above I jumbled up everything and that’s different than what they did with the text, where they grouped by words.
[+] famouswaffles|2 years ago|reply
The image equivalent would probably be rearranging grids but only a few grids at a time.

After all, the unscrambling here is within words, not across them.

[+] wilde|2 years ago|reply
Do we know how GPT4V tokenizes? You’d probably need the blocks to match the token size?
[+] darreninthenet|2 years ago|reply
I gave it (GPT4 Turbo) a block of text to decode with no clues as to the cipher or anything... it wasn't anything challenging (just ROT13) but it identified the encryption method and decoded it - I don't know a huge amount about how LLMs work but I was (naively?) impressed!
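(ROT13 is a fixed 13-place letter rotation, so applying it twice is the identity; Python even ships it as a codec, which makes it easy to generate test ciphertexts:)

```python
import codecs

# ROT13 shifts each letter 13 places; encoding and decoding are the
# same operation, since 13 + 13 = 26 wraps back around the alphabet.
ciphertext = codecs.encode("Attack at dawn", "rot13")
print(ciphertext)                          # → Nggnpx ng qnja
print(codecs.decode(ciphertext, "rot13"))  # → Attack at dawn
```

Since ROT13 preserves word lengths, capitalization, and punctuation, letter-frequency cues survive, which presumably is what lets the model both identify and invert it.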
[+] spuz|2 years ago|reply
The example given in the paper of an extremely scrambled text is:

> oJn amRh wno het 2023 Meatsrs ermtnoTuna no duySan ta atgsuAu ntaaNloi Gflo bClu, gnclcinhi ish ifsrt nereg ecatkj nad ncedos raecer jroam

It's pretty hard to unscramble as a human. I'll leave you to read the paper if you want to see the original text.

[+] lrei|2 years ago|reply
GPT-4 was clearly trained to fix typos and handle poorly written requests. That much is visible directly from just using it within the ChatGPT UI in normal usage, and it fits common user scenarios (e.g. fix my bad draft). We know it was trained on social media data from Reddit, much of which is not great writing either. Now I'm wondering if it was trained on (imperfectly) OCRed data too...
[+] danenania|2 years ago|reply
I wonder if it's more of an emergent property you get for free with LLMs rather than something that needs specific training. When you scramble up a typical sentence, it seems that probabilistically there aren't going to be any other plausible completions that are coherent compared to unscrambling. It's basically unscrambling vs. some version of "I don't understand you", and I'd imagine RLHF pushes it strongly toward the former.

I haven't read the paper so I'm not sure if they did this, but it would be interesting to see at what point it breaks down. Just scrambling up letters within words makes it pretty easy for the LLM; what if you also start moving letters between words, or take out the spaces between words?

[+] krisoft|2 years ago|reply
> Now I'm wondering if it was trained on (imperfectly) OCRed data too...

Or perhaps they inserted typos automatically into the training set as data augmentation. Tactics like that are known to increase the robustness of some models, so why not?
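That kind of augmentation is a one-liner to bolt onto a data pipeline. A sketch (entirely hypothetical; no claim this is how any particular model was trained):

```python
import random

def inject_typos(text, rate=0.05, rng=None):
    """Data-augmentation sketch: randomly swap adjacent characters to
    simulate transposition typos. Uses a seeded RNG for reproducibility."""
    rng = rng or random.Random(0)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it isn't swapped back
        else:
            i += 1
    return "".join(chars)

noisy = inject_typos("the quick brown fox", rate=0.3)
print(noisy)
```

Swaps only permute characters, so the multiset of letters (and the clean target) is always recoverable, which is what makes it a cheap supervised signal.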

[+] Arson9416|2 years ago|reply
>trained to fix typos

It is trained on data which may include typos, but that is very different from fixing typos. It knows what words likely come after typos in the same way it knows what words likely come after regular words.

[+] MacsHeadroom|2 years ago|reply
Non-RLHF models can do this just fine.

Even non-finetuned 7B models, 3 orders of magnitude smaller than GPT-4, can unscramble text and fix typos reliably.

Half, or better, of the things people discover "GPT-4 can do" can be done with non-RLHF GPT-3 from 2020 or with a model 1000x smaller.

[+] da39a3ee|2 years ago|reply
Have a look at the examples in the PDF. It's not typos/spelling errors/OCR errors, it's anagrams.
[+] abecedarius|2 years ago|reply
I once amused myself by coding a variant on "scramble the middle, keep first and last letter the same": instead of random scrambles, permute so that the resulting word is the most probable nonword according to a letter-trigram model of English.

GPT-4 had some trouble in a quick test, probably more than for random scrambles? But got most of it: https://chat.openai.com/share/51f1a94e-b35c-4dbc-945b-ef5983...

(It made one other mistake I didn't comment on there.)

[+] oglop|2 years ago|reply
I don’t use spaces at all when talking or punctuation. I have rsi so I do the minimal keystroke. I fix no spelling errors. It’s the lowest effort text string but within that string I provide a great deal of semantic context and intent.

It never struggles or seems to misunderstand. I’ve been doing this a few months now.

[+] CapitalistCartr|2 years ago|reply
It's still odd what the new AI models are good at, or not. Strangely to me, AI still struggles with hands. Faces are mostly good, and all sorts of odd details, such as musculature, are usually decent, but hands, of all things, seem to be the toughest. I'd have thought faces would be.
[+] MarcScott|2 years ago|reply
I tried for ages to get DALLE to draw me a cartoon spider, but gave up in the end. All the other cartoon animals that I asked it to create were perfect, but it could not draw a spider with eight legs. It's like the one thing that every child knows about spiders, but DALLE just wasn't able to do it, no matter what prompt I tried.

It reminded me of https://27bslash6.com/overdue.html so much that it just started to make me laugh with each new attempt.

[+] code_runner|2 years ago|reply
What's more impressive is that GPT-3.5 and 4 are capable of not just unscrambling, but answering questions about text that is flat-out wrong.

If you feed something like a bad transcript or some other very lossy (but not strictly scrambled) input.... it really can roll with it and just spit out correct information. Bad tokens in don't necessarily mean bad tokens out....

I'm sure there is a limit to how many tokens can be flat out bad before the "next token" in the response is thrown off, but after seeing what it can do with some of these inputs, the fact it can unscramble is not at all surprising/interesting.

[+] xrd|2 years ago|reply
I didn't read the full text but I did notice that the authors are 75% Japanese names and one other person who I'm assuming is of Chinese heritage. It makes me think of all the Japanese art I've seen with calligraphy that is unreadable to me. I can read Japanese pretty well but artistically rendered characters are often so hard for me to grok. I would be fascinated to see this work applied in this way and I bet these authors could leverage their language skills in this adjacent way.
[+] extasia|2 years ago|reply
>It is counter-intuitive that LLMs can exhibit such resilience despite severe disruption to input tokenization caused by scrambled text.

I'm not sure that I agree. An LLM maximising the likelihood of its output could surely permute its input in such a way that it unscrambles the text?

Need to read a little deeper and will report back.

edit: interesting result, but the paper doesn't present a good reason that this would be "counter-intuitive" imo.

[+] benreesman|2 years ago|reply
https://chat.lmsys.org/

is tons of fun for playing with this. I ran the Jon Rahm Masters thing through a dozen or so times:

- a couple of the Claude models get it

- a couple of the Llama 2 70B derivative models get it

- the tulo DP 70b model explains how it got it

GPT-4 is very big and very heavily subsidized, but the other big ones can do all this stuff. I'm aware the authors of the papers know that "GPT-4" is a seller, and so the title is technically accurate, but if we as a community aren't going to push back against Open Philanthropy's dirty-money K-Street-to-Wall-St-to-University-Ave routine, who the hell is?

[+] cubefox|2 years ago|reply
This all the more impressive given that language models mostly can't "see" individual letters, only tokens of multiple letters. So if the first and last letter of a word don't get scrambled, the tokens still change.
[+] JacobiX|2 years ago|reply
One of the problems with sentences provided to LLMs is that they may refer to specific subjects, and could potentially be part of the training set. For example, the following is considered extremely difficult:

> oJn amRh wno het 2023 Meatsrs ermtnoTuna no duySan ta atgsuAu ntaaNloi Gflo bClu, gnclcinhi ish ifsrt nereg ecatkj nad ncedos raecer jroam

When you perform a Google search for just "2023 Meatsrs", you can find a very similar sentence, and you could decipher the sentence very quickly …

[+] renonce|2 years ago|reply
I asked GPT-4 what the following means:

> enO of eht prlobsem hiwt necsnstee dveoirpd ot LsML si hatt eyth yma efrre to ifsiccpe sc,jestub and lodcu pttayoeilln be arpt fo hte gnirtnia ets. rFo plmaeex het ngiloolwf si eonsdreidc xyeletmre icfdutfil

it replied:

> One of the problems with sentences provided to LMSs is that they may refer to specific subjects, and could potentially be part of the training set. For example, the following is considered extremely difficult

I believe the above sentence was not part of the training set

[+] l33tman|2 years ago|reply
Rmiedns me of the fun fact taht (msot) hmanus can ftulleny raed txet wrhee you sralbcme all the ltertes of all the wrdos as long as you keep the frist and last ctaerachr the smae. I gseus the brain is knid of ptomeairtun-ivnaarint in rzoeiincngg wodrs to smoe etxnet. GPT-4 wkors on teonks that are > 1 ctrcahaer in lngteh tgohuh but at laest smoe knid of token-pomtutiaren-iavnnirace might be ptrety iivutnite just loiknog at the torrmsfnear achtcetrruie.

Reminds me of the fun fact that (most) humans can fluently read text where you scramble all the letters of all the words as long as you keep the first and last character the same. I guess the brain is kind of permutation-invariant in recognizing words to some extent. GPT-4 works on tokens that are > 1 character in length though but at least some kind of token-permutation-invariance might be pretty intuitive just looking at the transformer architecture.

OK, the scrambling wasn't super-easy to read in this case, with the long words :)
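That kind of scrambling is easy to generate if you want to probe where models (or people) break down. A minimal sketch of the classic keep-first-and-last-letter shuffle:

```python
import random

def scramble(text, rng=None):
    """Scramble the interior letters of each word, keeping the first and
    last characters fixed (the classic 'scrambled reading' demo)."""
    rng = rng or random.Random(42)
    out = []
    for word in text.split():
        if len(word) > 3:  # words of 3 or fewer letters can't be scrambled
            mid = list(word[1:-1])
            rng.shuffle(mid)
            word = word[0] + "".join(mid) + word[-1]
        out.append(word)
    return " ".join(out)

scrambled = scramble("reading scrambled words is surprisingly easy")
print(scrambled)
```

As the comment above notes, long words (with more interior permutations) are where readability degrades fastest, for humans and presumably for models too.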

[+] leblancfg|2 years ago|reply
Read the title as “handle unnatural scrambled TAX” and was suddenly very interested in how it might do my taxes for me.

Hell, I would pay good money for a robo-accountant.

[+] lakpan|2 years ago|reply
I’m confident someone already did it, but honestly I would not trust an LLM with numbers (as important as money and specifically taxes)
[+] kevindamm|2 years ago|reply
Would you pay for a robot accountant if you were still the one liable come audit time?
[+] famouswaffles|2 years ago|reply
I think there is probably some -expected/predicted output, actual output, match- thing going on internally. Like how the brain handles sense data.

Somewhat similar, https://vgel.me/posts/tools-not-needed/ (GPT-3 will ignore tools when it disagrees with them)

[+] westcort|2 years ago|reply
I made a bookmarklet that scrambles text, while still making it human-readable, with the idea that greater effort to read might slow the reader and improve retention. I wonder if the same would apply to GPT4. It is a testable hypothesis.

https://locserendipity.com/Scramble.html

[+] ThalesX|2 years ago|reply
In an attempt to make better use of the context window, I tested GPT-4 with Huffman encoding, both giving it an already existing corpus and encoding and asking it to generate the encoding for me. It failed at both tasks, which convinced me it has no understanding of the underlying data and procedures, even though it could generate convincing-looking codes.
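(For anyone wanting to reproduce this, a minimal Huffman coder to generate ground truth is short. Sketch only; tie-breaking between equal frequencies is arbitrary, so the exact bit patterns won't match any particular reference, only the code lengths will:)

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table {char: bitstring} for `text`.

    Each heap entry is (frequency, unique_id, partial_code_table); merging
    two nodes prefixes '0' to one side's codes and '1' to the other's.
    """
    heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, n, merged))
        n += 1
    return heap[0][2]

text = "abracadabra"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
# More frequent letters get shorter codes: 'a' (5x) beats 'd' (1x).
assert len(codes["a"]) < len(codes["d"])
print(codes, encoded)
```

Since a valid Huffman code is prefix-free, checking an LLM's attempt is mechanical: decode its bitstring greedily and see if you get the corpus back.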
[+] zitterbewegung|2 years ago|reply
I’ve had the GPT-4 API perform translations in my own project (shameless plug: http://www.securday.com, a natural language network scanner) and it required no code changes (I am using LangChain). I was going to add the feature, but then I decided to test it and was surprised it just worked.