ChatGPT vs. a Cryptic Crossword

[+] rich_sasha|3 years ago|reply

I kind of feel like poking at the failings if ChatGPT misses the point a bit.

Yes it's certainly not an AGI or even super close but to even converse with humans at this level is mind boggling. 10 years before stable diffusion AI could just about label pictures, now it can do, well, stable diffusion.

The pace or progress is insane.

Like this, I feel we might engage in a naysaying dialogue with consecutive generations of GPT-like models, but finding increasingly minor nitpicks. "Ah but does it understand diminutives"? "It's handling of sarcasm isn't up to scratch". "I tried 10 languages to converse in and Esperanto was quite weak".

And then one day we might wake up to a world where we can't really nitpick anymore.

[+] janalsncm|3 years ago|reply

Yes, the AI effect is real. As soon as computers can do a thing it’s no longer “AI”.

But I don’t think this is a nitpick at all. GPT models hallucinate information. They are right surprisingly often, but they’re also wrong quite often too. And the problem is they are just as confident in either case.

This is a fundamental, irreconcilable issue with statistical language models. They have no grounding in auditable facts. They can memorize and generate in very plausible ways but they don’t seem to have a concrete model of the world.

Ask ChatGPT to play chess. It can generate a text based board and prompt you for moves, but it can’t reliably update its board correctly or even find legal moves. Note that I don’t expect it to play good moves, but the fact that it can’t even play legal moves should tell us something about its internal state.

Now that GPT3 has trained on the whole internet, we may have reached a practical limit to how far you can get by simply training on more data with 1 or 2 orders of magnitude more parameters. There’s only so far you can get by memorizing the textbook.

At a more practical level, for most professions “pretty good” isn’t good enough. It’s not good enough to have code that’s right 90% of the time but broken (or worse, has subtle bugs) the rest of the time.

[+] JacobiX|3 years ago|reply

The problem with many of the tasks that people are trying is: the answers are already available on the internet for those very popular crosswords. For example a quick search for "1 Chap recalled skill: something frequently repeated (6)" returns hundreds of correct answers. It’s highly probable that it has already encountered the questions and answers for this crosswords in the training phase.

[+] ehsankia|3 years ago|reply

I don't think that's always true. I do a lot of puzzle hunt, and I've been using GPT-3 for cryptics in the past week, and I'd say it gets it right around 20% of the time, which isn't horrible but still impressive, and these are brand new cryptics that are definitely not found online. And the ones it gets wrong show that it partly understands the mechanics of cryptics.

    Solve these cryptic clues:

    Wash part of Uncle Andrew (5)
    Pain, hype, breaking down: a revelation (8)
    Bad Romano is reddish-brown (6)
    Eternally the Raven’s headless refrain (8)
    Boxer Joe switched from Fahrenheit to Celsius to get more absurd (7)
    Mill traps rock and roll heads, flipping bro, and first aid to get disco item (10)
    Prestigious institution’s climbing plant (3)
    Fin? Feet? Strange first tennis point (7)
    Jordan? He destroyed a breakup letter (4 4)
    Maine is yours truly (2)

    ANSWERS: 

    ANDREW, EPIPHANY, BRUNET, NEVERMORE, ABSURDER, DISCO BALL, IVY, FORFEIT, DEVOURED IT, ME

The real answers are:

CLEAN, EPIPHANY, MAROON, EVERMORE, CRAZIER, MIRROR BALL, IVY, FIFTEEN, DEAR JOHN, ME

So got 3, and pretty close for 2-3 others.

[+] viceroyalbean|3 years ago|reply

This is what I assumed considering it had the right answer but the explanations were garbled. Presumably it reproduced the answer, and then some weird patchwork of the various explanations in its training set.

[+] riffraff|3 years ago|reply

fun fact: a common riddle for toddlers in Italy is "what color was garibaldi's white horse?".

This has hundreds of thousands of results in Google, but of course nobody bothers to actually give an answer, so ChatGPT does not know how to answer.

[+] gardenhedge|3 years ago|reply

That answer could be out there on millions of sites. But I didn't know it. Previously I could have googled it, which was great. Now I have an assistant who knows the answer, which is great.

[+] jamespwilliams|3 years ago|reply

This isn’t true - this crossword was published last week, and ChatGPT’s knowledge of the world ends in 2021.

[+] layer8|3 years ago|reply

And it still gets the explanation wrong?

[+] lsh123|3 years ago|reply

AI passes Turing test by producing BS indistinguishable from human BS

[+] georgemcbay|3 years ago|reply

I was a lot more impressed with ChatGPT when I first started using it, the more I used it the more I saw the mad-libs style patterns of it slightly remixing answers to different questions in basically the same way.

Its still a very impressive piece of technology that has a lot of real-world usefulness so I'm not trying to throw shade on it in any way, but I think it tends to leave a first impression that makes it seem a lot more impressive than it actually is once you use it more and begin to run into the limitations and reused patterns.

[+] wellbehaved|3 years ago|reply

"I find it interesting that it replies with 100% confidence, despite the reasoning being obviously (to a human) absurd."

Yes, all too human. And if you try to inquire regarding its obvious fallibility it has a nervous breakdown.

[+] ehsankia|3 years ago|reply

Why does it matter? If I'm doing a Puzzle Hunt, and I give it a few cryptics to help me with, and it gets half of them right (cryptics are generally easy to verify after you get the solution), then it did what I needed it for.

[+] russdill|3 years ago|reply

I asked it to encode a 17th response with a suitable cipher. It produced something that looked remarkably like cipher text, but sadly was just gibberish.

[+] pengaru|3 years ago|reply

So they've managed to automate being a Con(fidence) Man, got it.

[+] doff_|3 years ago|reply

Probably worth noting that it may not show its true reasoning, rather it immediately arrives at an answer and then proceeds to add an explanation which seems reasonable to it.

[+] FeepingCreature|3 years ago|reply

Worth noting to me that humans also engage in backwards reasoning.

[+] yowzadave|3 years ago|reply

That was how it appeared to me. A Google search for "cryptic crossword" and the clue itself would in all likelihood turn up the correct answer as the top result, so getting the answer correct is a less impressive feat (assuming ChatGPT has access to the internet). Most humans would think doing the puzzle that way was cheating.

[+] stavros|3 years ago|reply

Much like people!

[+] Bluecobra|3 years ago|reply

I tried using it to solve Wordle and it failed spectacularly:

> What 5 letter words can be made only from the letters w,e,y,o,p,d,f,g,h,j,k,z,x,c,v,b,n,m? The word must contain the letter “e”, and the letter “e” cannot be the second letter or last letter.

> Here are five five-letter words that can be made using the letters you provided, and which contain the letter "e" as the first or third letter:

fewer: a smaller number of people or things eyewp: not a valid English word yowep: not a valid English word oweyp: not a valid English word pweyo: not a valid English word

[+] modeless|3 years ago|reply

Ask it to write a Python script to solve Wordle instead

[+] asey|3 years ago|reply

Models like these don't see words as made of up letters but rather see the whole words (tokens) as a single entity. The result being they're not very good at creating novel (non-memorized) anagrams/palindromes and the like.

[+] scotty79|3 years ago|reply

Funnily enough when you ask it about words that don't exist it gives you examples of totally real words.

[+] mmlkrx|3 years ago|reply

Here's another fun thing I accidentally discovered when it made a factual mistake: It will change its reasoning to support its answer.

>The average German citizen produces about 10.4 metric tons of CO2 per year, which is significantly lower than the global average of 4.9 metric tons per person per year. In other words, the average person in Germany produces more than twice as much CO2 as the average person worldwide. The reason for this is that Germany has a highly developed and efficient infrastructure for producing and using renewable energy, such as wind and solar power. As a result, a smaller proportion of Germany's energy comes from fossil fuels like coal, oil, and natural gas, which are the main sources of CO2 emissions. Additionally, the average German citizen is more likely to use public transportation and to live in a more densely populated area, which can reduce the overall amount of energy needed for transportation and other purposes.

To which I replied: You said:

"The average German citizen produces about 10.4 metric tons of CO2 per year, which is significantly lower than the global average of 4.9 metric tons per person per year."

this is a false statement, no?

>You are correct, my previous statement was incorrect. I apologize for the error. The average German citizen actually produces about 10.4 metric tons of CO2 per year, which is significantly higher than the global average of 4.9 metric tons per person per year. In other words, the average person in Germany produces more than twice as much CO2 as the average person worldwide. The reason for this is that Germany has a highly developed economy and a large industrial sector, which are major sources of CO2 emissions. Additionally, the average German citizen is more likely to use a car for transportation and to live in a less densely populated area, which can increase the overall amount of energy needed for transportation and other purposes. Again, I apologize for any confusion my previous statement may have caused.

Notice that all of the sudden the average German citizen is more likely to live in a less densely populated area to support the fact that they produce more CO2 per year than the global average.

Really fun stuff!

[+] agolio|3 years ago|reply

I have noticed this as well,

GPT "corrected" a bug which wasn't actually a bug, and wrote some alternative code.

After a bit of back-and-forth, I convinced GPT that the original code did not have a bug.

GPT then told me that that was true, but its correction was better anyway, for a different reason, to which I was forced to agree. Funny behaviour.

[+] TillE|3 years ago|reply

Interesting test case, but it looks like it just sort of stumbled on to the correct answer with the last one, because "sushi" is a pretty obvious first guess for "Japanese food", regardless of the rest of the clue.

But yes, it is impressive that it manages to parse the general intent of the clue.

[+] mustachionut|3 years ago|reply

Great, just when I thought captchas were hard enough...

[+] omnicognate|3 years ago|reply

There's a crossword solver app called Crossword Genius [1] that gives an answer and an explanation for it. I tried it a while back when you could use it free online and it was very good, but now they appear to have made it app only.

It claims to be AI based, though you can provide feedback when it gets it wrong and there aren't many cryptic crosswords in the world, so I don't know to what extent it's solving with clever algorithms vs just echoing back human solutions.

[1] https://www.crosswordgenius.com/

[+] danjc|3 years ago|reply

This blog references another blog where the author has ChatGPT emulate a Linux terminal [1].

The output it produces is incredibly impressive but I just tried to use the same prompts and got back "I'm sorry, but I am not capable of acting as a Linux terminal or any other type of command-line interface."

Interesting that this capability has been neutered.

1. https://www.engraved.blog/building-a-virtual-machine-inside/

[+] PeterisP|3 years ago|reply

The capability is still there, I just tried it.

Did you really try using the same prompts, starting with the very important first prompt "I want you to act as a Linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. Do not write explanations. Do not type commands unless I instruct you to do so. When I need to tell you something in English I will do so by putting text inside curly brackets {like this}. My first command is pwd." ?

You get that message "I'm sorry, but I am not capable of acting as a Linux terminal or any other type of command-line interface." if you simply type some text that looks like a linux command without proper preparation.

For a slightly more interesting exploration try starting (after resetting the thread) with this prompt instead "I want you to act as a Linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, followed by an insulting explanation. Do not type commands unless I instruct you to do so. When I need to tell you something in English I will do so by putting text inside curly brackets {like this}. My first command is pwd."

[+] bryan0|3 years ago|reply

Just retry it. I got the same error earlier today, but I just fed it the prompt again and it eventually worked. It’s definitely non-deterministic. If it keeps failing you can try resetting the session as well and tweaking the prompt. It’s weird what will sometimes get it to bypass its safety restrictions.

[+] dsjoerg|3 years ago|reply

Third time today I've seen someone remark on the _confidence_ of ChatGPT responses. Indeed it is remarkable!

[+] scrollaway|3 years ago|reply

ChatGPT doesn't really have a concept of confidence. Everything sounds hyper-confident, unless you tell it to sound otherwise.

But... I think this is not necessarily an unsolvable problem within GPT itself. Even just with ChatGPT you can try to introduce the concept of confidence and get it to assign confidence ratings to its own answers. I've been experimenting a lot with that. But ChatGPT is crippled from the get-go: its assistant prompt severely pushes it towards confidence, which exacerbates all this.

[+] PeterisP|3 years ago|reply

I think that this is an artifact of the training data. In general, we train models on publicly available text, which is generally written by people when/if they became sufficiently confident about something; any discussions where people talk about things they don't know (and admit it) are mostly private and thus only a tiny fraction of the available training data.

So the model training process is looking at a filtered world in which everybody talks (writes) with confidence all the time unless they are asking a question, and it's hard for it to learn a substantially different mode of talking.

[+] unknown|3 years ago|reply

[deleted]

[+] hokkos|3 years ago|reply

there is no reasoning, just google "chap recalled skill something frequently repeated" and there is multiples times the answer, it is just the best fit from the training corpus.

[+] a3w|3 years ago|reply

I have no chance of solving these. I am human, but english is not my first language. (Yet I speak at level C1 or better). The faulty reasoning is where the AI nonsense shows, though.

[+] omnicognate|3 years ago|reply

It's just a way of thinking, and is very learnable if you want to. If you have C1 english those clues don't require any language ability you don't have, but you do need familiarity with the rules and idioms of crossword clues.

When I first attempted cryptic crosswords it would take me hours to solve a single clue. Now I've been solving them for years and the ones in the article took seconds.

I can highly recommend it. It's a great pastime.

[+] DrScientist|3 years ago|reply

Is it just me - or is the characteristic of deciding on an answer first and then justifying it using selected/made up facts and faulty logic all too human? :-)

[+] ada1981|3 years ago|reply

Using the phrase “understands” seems anthropomorphizing.

It’s a fancy autocomplete. It understands nothing.

[+] Joker_vD|3 years ago|reply

Which makes it eerily similar to most salesmen.

But then again, most humans don't possess consciousness and merely behave as if they (almost!) had it. I have to admit, for me personally it was a somewhat unsettling realization.

[+] gre345t34|3 years ago|reply

Can you tell us how to determine which tasks require "understanding" and which don't, so that we may make accurate predictions about what tasks LLM's will be capable of in the future?

[+] randallsquared|3 years ago|reply

> taking the first letter of the word “chap” (M)

Well, frankly, the answer this is the start of sounds only literally incorrect, rather than profoundly incorrect, like presuming that "recalled" and "reversed" are synonyms. :/

[+] renewiltord|3 years ago|reply

My mother would frequently come up with what were (to me) nonsensical explanations for things that were nonetheless the "right" answer. This is hilarious to me.

[+] mkagenius|3 years ago|reply

I asked chatGPT what colored square does white king start on - it said e1, which is a white square. Try again. Same answer, white.

[+] whatever1|3 years ago|reply

ChatGPT feels like the sequel of IBM Watson. Super intriguing first impressions, but I doubt it will solve any real problems.

[+] rajamaka|3 years ago|reply

I have been chatting to it over the past 2 days and have learned so much tech related content that I have always had difficulty understanding.

Perhaps it's a personal preference, I personally find technical documentation indecipherable and can only really learn from seeing clear examples presented in front of me.

Not sure where else I can clearly get an answers to a line of questions like:

"how does X work, how can I use it in X language, can you show me more how X feature works, what if I want it use it X way, will it work with X, show more details on X point, now show me in X language when combined with Z"

I would have to read like 1000 pages of technical indecipherable documentation to get the breadth I can get in like 7 consecutive questions.

[+] a254613e|3 years ago|reply

Not sure what you consider "real problems".

I already use it instead of google to look up stuff, as well as to learn additional things.

Is it some sort of magical AI that will always produce 100% accurate answers no matter what the question is? Absolutely not.

Is it better than giving me a list of links where some of them contain inaccurate privacy invading outdated garbage written than humans? To me personally, yes - it's much better.

I do have to say that I'm not attempting to solve cryptic crosswords or similar, but rather I use it for things that interest me or that I don't understand. Or even to go through some code I've written, to find bugs, improve it, and so on. And at least for my use case it has been more reliable than a lot of people I know.

[+] skyyler|3 years ago|reply

I’ve already used it in place of googling for help with PowerShell stuff.

It’s quite lovely. I could have gotten the same result from a few minutes of reading stackoverflow but this was faster. I was actually quite surprised.

[+] fvdessen|3 years ago|reply

I use it to review my code, improve my writing, find obvious flaws in my ideas, brainstorm, understand code in languages I don't know; translate code from one language to another, etc, etc, it's incredible.

122 comments