A Better Strategy for Hangman

[+] lorax|14 years ago|reply

An even better strategy would be to weight your guesses towards the most information gain. If a letter appears often, but is in the same position most of the time, getting it won't help you as much as getting a letter that appears in fewer words but in more varied positions. Doing a quick check of 5 letter words in my /usr/share/dict/words file I find: S: 659,62,267,260,2282 E: 122,829,448,1254,679 A: 252,1201,678,410,307

That is, S, when it appears, is heavily skewed towards the last letter in the word, finding it won't help you figure out the word nearly as well as finding the more evenly distributed E or A.

[+] Natsu|14 years ago|reply

Doesn't this all assume that the opponent is choosing a random word? If they're trying to be evil, I'm sure they can pick words where each letter you get gives little information.

For example, which letter do you guess if the puzzle is _og? You still have to go through bcdfhjln even after you've burned however many guesses getting that o & g.

[+] dlss|14 years ago|reply

This is a good insight. To formalize it: each question segments the words into 2^max_word_length groups of various sizes, and we seek to minimize the worst case number of questions still needed, regardless of which answer we get back.

Which, for the ispell words list, appears to be 'e' :( edit: which was the solution found for the suboptimal method in the article, and which is not present for words in the corpus 25% of the time

code at http://pastie.org/3693750

[+] Too|14 years ago|reply

That would be trading risk for gain.

Remember, if you pick a correct letter you will not loose any guesses, in other words you get another guess if your guess is right. It's not about finding the word in a fixed number of guesses.

But otherwise you could also extend your method to increase the weight of letters appearing more than once in a word since that gives even more information gain.

[+] tsewlliw|14 years ago|reply

I won't call out the company because at least until now they were still using the question (I got accidentally cc'ed on a subsequent candidates email), but last summer I got an interview homework question from a "who is hiring" post that was to solve hangman for a given dictionary. I implemented all of the strategies outlined here, and while they seemed impressed with the execution speed of my solution, they implied it could have scored better, saying it was only "average" in that regard.

The thing is, at least for the dictionaries I had available, trying to weight for information gain only considering the guess correctness wasn't all that great a strategy. Even if you know 100% certain that the letter will be correct, its very likely to still cut the search space down by virtue of the pattern of letters, and in aggregate my strategy made no statistical difference. (+/- 1% depending on random seed).

If I were going to try and improve my scores, I guess the next thing would be look at the distribution of letter-patterns for each remaining letter, but my intuition is that it just wouldn't be a big deal.

[+] r00k|14 years ago|reply

I loved this blog post (and the other one I read, about Battelship), so I wondered who was behind these great posts.

Their homepage says this "DataGenetics is a technology consultancy specializing in unlocking the value stored in large databases. Using a variety of techniques we can mine the trends in your data to help you maximize your marketing and advertising campaigns."

Damn.

Yet another group of smart folks focused on more-effective advertising.

[+] apawloski|14 years ago|reply

That's where the money seems to be. It's a shame, but I can't blame groups like these.

[+] jtheory|14 years ago|reply

I kept waiting for the other foot to drop, and it never did.

This is great if you're playing hangman against a computer. Probably, you're not.

You're playing against a person, who probably has some sense of your strategies.

Even when I was a kid, we didn't play hangman by choosing random words. What's the fun in that? You notice how the other person plays, and in the next rounds you pick words that break their strategies.

You notice they are doing the "common letters" or even "common vowels" strategy, and you give them "lynx".

Or you trick them with a word that has a few easy-to-get letters, but is going to be hard to guess the last few because there are so many possible matching words -- I had a great one that I forget now... maybe "budder"?

Did everyone really play with randomly-chosen words? What a waste.

[+] harryh|14 years ago|reply

Clearly we should have a hangman challenge similar to the RoShamBo challenges people used to run to explore possibly strategies in a multi turn game.

http://webdocs.cs.ualberta.ca/~darse/rsb-results1.html

[+] hluska|14 years ago|reply

Great comment - in hangman, choosing letters solely on the basis of data (no matter how complete the data or how compelling the study) is a highly predictable (and thus highly defeatable) strategy.

Of course, this analysis has a counterpoint. If I try to pick words strategically, I suspect that my words would begin to follow patterns...

[+] eru|14 years ago|reply

You can repeat a similar analysis like they did against an opponent who chooses carefully and knows your strategy. With the theory of two person zero-sum games you can even prove that there's a perfect strategy both for the guesser and the chooser each that doesn't get worse if the other party knows it. (Hint: the strategies will involve random selection with carefully weighted probabilities.)

[+] thyrsus|14 years ago|reply

My favorite strategy: four letter word. Say no to everything until they guess U: _ U _ _ then decide on the least likely word which includes none of their guesses, often "junk".

[+] hythloday|14 years ago|reply

It's an interesting article, but I would have thought that the best* letter to guess is the one that appears in as close to 50% of the possible words as possible, rather than the most likely to appear, given that the hangman is drawn per error, rather than per guess. Is my reasoning wrong?

* assuming as he does at the bottom the existence of a computer.

[+] sdevlin|14 years ago|reply

Your intuition is telling you that you should shoot for 50% so that you can divide the search space in half regardless of the outcome. A right or wrong answer will eliminate half of the possible remaining words.

But right and wrong guesses are not equally valuable in hangman. A wrong guess will eliminate all words of a given length containing the given letter. But a correct guess gives you more: you get the location (or locations) of the letter in the word as a bonus. This will reduce your remaining search space drastically.

Thus, you should optimize for the success of your next guess.

[+] Jabbles|14 years ago|reply

It depends whether you care about minimising the number of errors, or just wish to keep it below 11 - thereby minimising your chance of losing. These are subtly different and depend on your "reward function".

[+] klipt|14 years ago|reply

If you guess a letter contained in 99% of remaining possible words, then either

- you are correct (happens 99% of the time). No penalty, and you find out at which positions your guessed letter occurs, a decent information gain.

- you are incorrect. You suffer a penalty but eliminate 99% of the remaining words, a massive information gain.

How would 50% be preferable in either case?

[+] squeakynick|14 years ago|reply

No, we should still go for the letter with the highest frequency.

If we get a hit, it's not just the presence, or not, of the letter, we learn both the number of hits, and the positions of these hits. These are incredibly valuable pieces of information (more valuable that dividing the remaining words into two equal piles).

If we miss, we've carved away and removed the biggest non-possible set of words with one guess.

[+] vasi|14 years ago|reply

This is great, but it's just a first-order strategy! Once your opponent knows you're following these (quite rational) rules, she could pick words where the rules fail particularly badly.

I look forward to seeing your analysis of how to deal with that :)

[+] jerf|14 years ago|reply

The strategy given isn't to solve the puzzle, but to get at least one letter on the board. So if you look at the table at the end, the worst case is a three-letter word that only contains a K out of the letters given, which is the worst case scenario on that entire board. Assuming your opponent sticks to the dictionary there is no way for them to ruin this strategy, only push you into the worst cases given, which once you figure out what's going on will play to the guesser's advantage, not the word-chooser.

[+] user24|14 years ago|reply

I was hoping for a little list at the end saying "By the way, if you're playing against someone who's using this strategy, the best word to play is: xyzzy" (or whatever word it actually would be)

[+] jh3|14 years ago|reply

> There are no two letter words containing the letter C, Q, V or Z.

Does 'qi' count? http://www.merriam-webster.com/dictionary/qi

I know it counts when playing Words with Friends...

I also think 'za' is valid when playing Scrabble type games, but 'za' technically isn't a word.

[+] streptomycin|14 years ago|reply

qi and za are recent additions to the Scrabble dictionary, but in any reasonable friendly word game they wouldn't count because they're not in common usage. IMHO they also significantly changed the gameplay of Scrabble for the worse by making it much easier to get rid of the z and q tiles.

[+] davnola|14 years ago|reply

IIRC "zo" is also valid. It's a Tibetan yak hybrid.

Hmmm... Yak butter tea.

[+] kd5bjo|14 years ago|reply

This came up as an pre-interview employment challenge that I did a few months ago. I've uploaded my code here: https://gist.github.com/2242816.

I could make good arguments for both "most likely to be present" and "most information gain", so I implemented a mixed strategy. I calculated the information gain (in bits) and the probability of a correct guess, then used a linear combination of them to determine which letter to guess. Additionally, the weights changed over the course of the game so that correct guesses were valued more as incorrect guesses became more scarce.

To determine appropriate weights, I ran a crude Monte Carlo simulation. The final weights I ended up with were 0.60:0.03 with no incorrect guesses and 0.54:0.69 once there aren't any strikes left. (bits information gain : p(correct guess))

[+] squeakynick|14 years ago|reply

Good comments. Being paranoid, I double checked the database I used! Neither 'qi' nor 'za' are in the database I used.

I'm not arguing that they are, or are not, words. I'm just saying that they were not in the dictionaty file I used :)

[+] onemoreact|14 years ago|reply

Cool, but assumes all words have an even probability.

Vs a week player the ideal word list should only include 'common' words.

Vs an ideal player you need to assume they will pick from the list of words that you least likely to guess using optimum play.

Generically, optimum play ends up with a somewhat random list, aka 90% of the time pick E, 10% of the time pick I etc.

PS: Actually generating this list is a 'hard' problem and vary dictionary dependent, but you can probably get reasonably close using some sort of genetic algorithm and a enough simulation time.

[+] eru|14 years ago|reply

> PS: Actually generating this list is a 'hard' problem and very dictionary dependent, but you can probably get reasonably close using some sort of genetic algorithm and a enough simulation time.

Sounds like a challenge. I don't think, given a dictionary, finding the ideal strategy for both parties will be that challenging. It's a fairly straightforward two person zero-sum game. You can either model it assuming hidden information, or concurrent play. (Which is basically the same here.)

As an interesting variation, you might allow the chooser to cheat: I.e. don't make them write down the word in the first place, just require their play to be consistent.

[+] K2h|14 years ago|reply

Two possible improvements, that I think may be original (thought I probably missed it in the article and comments)

1) A hit or miss of a letter tells you a lot about what the new subsequent optimal guess is. You could make a flow chart (a really big one) that shows you the optimal letter to call out next based on what has hit and missed.

2) the position that a letter has hit tells you a lot about the target word. if you have a computer, the regex becomes trivial to identify the next optimal letter to guess. an 'optimal' table could be generated based off this pattern, and it would be a huge table.

I kept feeling like every step of the way it was an infomercial, 'but wait... there's more!' and I like that, it got me thinking.

I'd love to have my hangman bot go head to head with yours on random words. that would be a fun little project.

[+] eshvk|14 years ago|reply

Very cool. I was implementing Hangman a few weeks back and I figured out the first strategy, use the frequency distribution for the specific length to initiate your guessing. However, I thought about but discarded the second strategy of conditioning on the previous wrong/right guesses and selecting words and then doing a histogram of the letters. Thinking back, I find it interesting that thinking about it from a complexity point of view made me dread that approach because it becomes significantly more expensive to do the recomputation of frequency distributions compared to blindly uses the naive independence assumption and I was not sure how much of a win I would get by adding that bit of optimization.

[+] hammock|14 years ago|reply

Instead of a table at the end, what if you were to boil it down to a single list of letters that I can easily memorize? You could do this by weighting the final table with the number of 1-, 2-, 3-, etc -letter words in the dictionary.

Put another way, what are the letters I should guess when I don't know the length of the word?

While it wouldn't be perfect strategy, it would be easier to execute.

[+] brownbat|14 years ago|reply

There are lots of good further caveats in this thread.

We should have a hangman AI tournament.

[+] grandalf|14 years ago|reply

Does this mean that RSTLNE on wheel of fortune is non-optimal too?

[+] lorax|14 years ago|reply

If you aren't taking into account the length of the word, then it looks like the optimal choice of 5 consonants and 1 vowel. Look at his section titled "First refinement" the most common letters were (in order) "ESIARNTOL" Eliminate all but the first vowel and you get ESRNTL, the same as WoF.

[+] dwd|14 years ago|reply

Given your perception can generally fill in the vowels - picking any early on would seem non-optimal.

62 comments