An even better strategy would be to weight your guesses towards the most information gain. If a letter appears often, but is in the same position most of the time, getting it won't help you as much as getting a letter that appears in fewer words but in more varied positions. Doing a quick check of 5 letter words in my /usr/share/dict/words file I find:
S: 659,62,267,260,2282
E: 122,829,448,1254,679
A: 252,1201,678,410,307
That is, S, when it appears, is heavily skewed towards the last letter in the word, finding it won't help you figure out the word nearly as well as finding the more evenly distributed E or A.
Doesn't this all assume that the opponent is choosing a random word? If they're trying to be evil, I'm sure they can pick words where each letter you get gives little information.
For example, which letter do you guess if the puzzle is _og? You still have to go through bcdfhjln even after you've burned however many guesses getting that o & g.
This is a good insight. To formalize it: each question segments the words into 2^max_word_length groups of various sizes, and we seek to minimize the worst case number of questions still needed, regardless of which answer we get back.
Which, for the ispell words list, appears to be 'e' :( edit: which was the solution found for the suboptimal method in the article, and which is not present for words in the corpus 25% of the time
Remember, if you pick a correct letter you will not loose any guesses, in other words you get another guess if your guess is right. It's not about finding the word in a fixed number of guesses.
But otherwise you could also extend your method to increase the weight of letters appearing more than once in a word since that gives even more information gain.
I won't call out the company because at least until now they were still using the question (I got accidentally cc'ed on a subsequent candidates email), but last summer I got an interview homework question from a "who is hiring" post that was to solve hangman for a given dictionary. I implemented all of the strategies outlined here, and while they seemed impressed with the execution speed of my solution, they implied it could have scored better, saying it was only "average" in that regard.
The thing is, at least for the dictionaries I had available, trying to weight for information gain only considering the guess correctness wasn't all that great a strategy. Even if you know 100% certain that the letter will be correct, its very likely to still cut the search space down by virtue of the pattern of letters, and in aggregate my strategy made no statistical difference. (+/- 1% depending on random seed).
If I were going to try and improve my scores, I guess the next thing would be look at the distribution of letter-patterns for each remaining letter, but my intuition is that it just wouldn't be a big deal.
I loved this blog post (and the other one I read, about Battelship), so I wondered who was behind these great posts.
Their homepage says this "DataGenetics is a technology consultancy specializing in unlocking the value stored in large databases. Using a variety of techniques we can mine the trends in your data to help you maximize your marketing and advertising campaigns."
Damn.
Yet another group of smart folks focused on more-effective advertising.
I kept waiting for the other foot to drop, and it never did.
This is great if you're playing hangman against a computer. Probably, you're not.
You're playing against a person, who probably has some sense of your strategies.
Even when I was a kid, we didn't play hangman by choosing random words. What's the fun in that? You notice how the other person plays, and in the next rounds you pick words that break their strategies.
You notice they are doing the "common letters" or even "common vowels" strategy, and you give them "lynx".
Or you trick them with a word that has a few easy-to-get letters, but is going to be hard to guess the last few because there are so many possible matching words -- I had a great one that I forget now... maybe "budder"?
Did everyone really play with randomly-chosen words? What a waste.
Great comment - in hangman, choosing letters solely on the basis of data (no matter how complete the data or how compelling the study) is a highly predictable (and thus highly defeatable) strategy.
Of course, this analysis has a counterpoint. If I try to pick words strategically, I suspect that my words would begin to follow patterns...
You can repeat a similar analysis like they did against an opponent who chooses carefully and knows your strategy. With the theory of two person zero-sum games you can even prove that there's a perfect strategy both for the guesser and the chooser each that doesn't get worse if the other party knows it. (Hint: the strategies will involve random selection with carefully weighted probabilities.)
My favorite strategy: four letter word. Say no to everything until they guess U: _ U _ _ then decide on the least likely word which includes none of their guesses, often "junk".
It's an interesting article, but I would have thought that the best* letter to guess is the one that appears in as close to 50% of the possible words as possible, rather than the most likely to appear, given that the hangman is drawn per error, rather than per guess. Is my reasoning wrong?
* assuming as he does at the bottom the existence of a computer.
Your intuition is telling you that you should shoot for 50% so that you can divide the search space in half regardless of the outcome. A right or wrong answer will eliminate half of the possible remaining words.
But right and wrong guesses are not equally valuable in hangman. A wrong guess will eliminate all words of a given length containing the given letter. But a correct guess gives you more: you get the location (or locations) of the letter in the word as a bonus. This will reduce your remaining search space drastically.
Thus, you should optimize for the success of your next guess.
It depends whether you care about minimising the number of errors, or just wish to keep it below 11 - thereby minimising your chance of losing. These are subtly different and depend on your "reward function".
No, we should still go for the letter with the highest frequency.
If we get a hit, it's not just the presence, or not, of the letter, we learn both the number of hits, and the positions of these hits. These are incredibly valuable pieces of information (more valuable that dividing the remaining words into two equal piles).
If we miss, we've carved away and removed the biggest non-possible set of words with one guess.
This is great, but it's just a first-order strategy! Once your opponent knows you're following these (quite rational) rules, she could pick words where the rules fail particularly badly.
I look forward to seeing your analysis of how to deal with that :)
The strategy given isn't to solve the puzzle, but to get at least one letter on the board. So if you look at the table at the end, the worst case is a three-letter word that only contains a K out of the letters given, which is the worst case scenario on that entire board. Assuming your opponent sticks to the dictionary there is no way for them to ruin this strategy, only push you into the worst cases given, which once you figure out what's going on will play to the guesser's advantage, not the word-chooser.
I was hoping for a little list at the end saying "By the way, if you're playing against someone who's using this strategy, the best word to play is: xyzzy" (or whatever word it actually would be)
qi and za are recent additions to the Scrabble dictionary, but in any reasonable friendly word game they wouldn't count because they're not in common usage. IMHO they also significantly changed the gameplay of Scrabble for the worse by making it much easier to get rid of the z and q tiles.
This came up as an pre-interview employment challenge that I did a few months ago. I've uploaded my code here: https://gist.github.com/2242816.
I could make good arguments for both "most likely to be present" and "most information gain", so I implemented a mixed strategy. I calculated the information gain (in bits) and the probability of a correct guess, then used a linear combination of them to determine which letter to guess. Additionally, the weights changed over the course of the game so that correct guesses were valued more as incorrect guesses became more scarce.
To determine appropriate weights, I ran a crude Monte Carlo simulation. The final weights I ended up with were 0.60:0.03 with no incorrect guesses and 0.54:0.69 once there aren't any strikes left. (bits information gain : p(correct guess))
Cool, but assumes all words have an even probability.
Vs a week player the ideal word list should only include 'common' words.
Vs an ideal player you need to assume they will pick from the list of words that you least likely to guess using optimum play.
Generically, optimum play ends up with a somewhat random list, aka 90% of the time pick E, 10% of the time pick I etc.
PS: Actually generating this list is a 'hard' problem and vary dictionary dependent, but you can probably get reasonably close using some sort of genetic algorithm and a enough simulation time.
> PS: Actually generating this list is a 'hard' problem and very dictionary dependent, but you can probably get reasonably close using some sort of genetic algorithm and a enough simulation time.
Sounds like a challenge. I don't think, given a dictionary, finding the ideal strategy for both parties will be that challenging. It's a fairly straightforward two person zero-sum game. You can either model it assuming hidden information, or concurrent play. (Which is basically the same here.)
As an interesting variation, you might allow the chooser to cheat: I.e. don't make them write down the word in the first place, just require their play to be consistent.
Two possible improvements, that I think may be original (thought I probably missed it in the article and comments)
1) A hit or miss of a letter tells you a lot about what the new subsequent optimal guess is. You could make a flow chart (a really big one) that shows you the optimal letter to call out next based on what has hit and missed.
2) the position that a letter has hit tells you a lot about the target word. if you have a computer, the regex becomes trivial to identify the next optimal letter to guess. an 'optimal' table could be generated based off this pattern, and it would be a huge table.
I kept feeling like every step of the way it was an infomercial, 'but wait... there's more!' and I like that, it got me thinking.
I'd love to have my hangman bot go head to head with yours on random words. that would be a fun little project.
Very cool. I was implementing Hangman a few weeks back and I figured out the first strategy, use the frequency distribution for the specific length to initiate your guessing. However, I thought about but discarded the second strategy of conditioning on the previous wrong/right guesses and selecting words and then doing a histogram of the letters. Thinking back, I find it interesting that thinking about it from a complexity point of view made me dread that approach because it becomes significantly more expensive to do the recomputation of frequency distributions compared to blindly uses the naive independence assumption and I was not sure how much of a win I would get by adding that bit of optimization.
Instead of a table at the end, what if you were to boil it down to a single list of letters that I can easily memorize? You could do this by weighting the final table with the number of 1-, 2-, 3-, etc -letter words in the dictionary.
Put another way, what are the letters I should guess when I don't know the length of the word?
While it wouldn't be perfect strategy, it would be easier to execute.
If you aren't taking into account the length of the word, then it looks like the optimal choice of 5 consonants and 1 vowel. Look at his section titled "First refinement" the most common letters were (in order) "ESIARNTOL" Eliminate all but the first vowel and you get ESRNTL, the same as WoF.
[+] [-] lorax|14 years ago|reply
That is, S, when it appears, is heavily skewed towards the last letter in the word, finding it won't help you figure out the word nearly as well as finding the more evenly distributed E or A.
[+] [-] Natsu|14 years ago|reply
For example, which letter do you guess if the puzzle is _og? You still have to go through bcdfhjln even after you've burned however many guesses getting that o & g.
[+] [-] dlss|14 years ago|reply
Which, for the ispell words list, appears to be 'e' :( edit: which was the solution found for the suboptimal method in the article, and which is not present for words in the corpus 25% of the time
code at http://pastie.org/3693750
[+] [-] Too|14 years ago|reply
Remember, if you pick a correct letter you will not loose any guesses, in other words you get another guess if your guess is right. It's not about finding the word in a fixed number of guesses.
But otherwise you could also extend your method to increase the weight of letters appearing more than once in a word since that gives even more information gain.
[+] [-] tsewlliw|14 years ago|reply
The thing is, at least for the dictionaries I had available, trying to weight for information gain only considering the guess correctness wasn't all that great a strategy. Even if you know 100% certain that the letter will be correct, its very likely to still cut the search space down by virtue of the pattern of letters, and in aggregate my strategy made no statistical difference. (+/- 1% depending on random seed).
If I were going to try and improve my scores, I guess the next thing would be look at the distribution of letter-patterns for each remaining letter, but my intuition is that it just wouldn't be a big deal.
[+] [-] r00k|14 years ago|reply
Their homepage says this "DataGenetics is a technology consultancy specializing in unlocking the value stored in large databases. Using a variety of techniques we can mine the trends in your data to help you maximize your marketing and advertising campaigns."
Damn.
Yet another group of smart folks focused on more-effective advertising.
[+] [-] apawloski|14 years ago|reply
[+] [-] jtheory|14 years ago|reply
This is great if you're playing hangman against a computer. Probably, you're not.
You're playing against a person, who probably has some sense of your strategies.
Even when I was a kid, we didn't play hangman by choosing random words. What's the fun in that? You notice how the other person plays, and in the next rounds you pick words that break their strategies.
You notice they are doing the "common letters" or even "common vowels" strategy, and you give them "lynx".
Or you trick them with a word that has a few easy-to-get letters, but is going to be hard to guess the last few because there are so many possible matching words -- I had a great one that I forget now... maybe "budder"?
Did everyone really play with randomly-chosen words? What a waste.
[+] [-] harryh|14 years ago|reply
http://webdocs.cs.ualberta.ca/~darse/rsb-results1.html
[+] [-] hluska|14 years ago|reply
Of course, this analysis has a counterpoint. If I try to pick words strategically, I suspect that my words would begin to follow patterns...
[+] [-] eru|14 years ago|reply
[+] [-] thyrsus|14 years ago|reply
[+] [-] hythloday|14 years ago|reply
* assuming as he does at the bottom the existence of a computer.
[+] [-] sdevlin|14 years ago|reply
But right and wrong guesses are not equally valuable in hangman. A wrong guess will eliminate all words of a given length containing the given letter. But a correct guess gives you more: you get the location (or locations) of the letter in the word as a bonus. This will reduce your remaining search space drastically.
Thus, you should optimize for the success of your next guess.
[+] [-] Jabbles|14 years ago|reply
[+] [-] klipt|14 years ago|reply
- you are correct (happens 99% of the time). No penalty, and you find out at which positions your guessed letter occurs, a decent information gain.
- you are incorrect. You suffer a penalty but eliminate 99% of the remaining words, a massive information gain.
How would 50% be preferable in either case?
[+] [-] squeakynick|14 years ago|reply
If we get a hit, it's not just the presence, or not, of the letter, we learn both the number of hits, and the positions of these hits. These are incredibly valuable pieces of information (more valuable that dividing the remaining words into two equal piles).
If we miss, we've carved away and removed the biggest non-possible set of words with one guess.
[+] [-] vasi|14 years ago|reply
I look forward to seeing your analysis of how to deal with that :)
[+] [-] jerf|14 years ago|reply
[+] [-] user24|14 years ago|reply
[+] [-] jh3|14 years ago|reply
Does 'qi' count? http://www.merriam-webster.com/dictionary/qi
I know it counts when playing Words with Friends...
I also think 'za' is valid when playing Scrabble type games, but 'za' technically isn't a word.
[+] [-] streptomycin|14 years ago|reply
[+] [-] davnola|14 years ago|reply
Hmmm... Yak butter tea.
[+] [-] kd5bjo|14 years ago|reply
I could make good arguments for both "most likely to be present" and "most information gain", so I implemented a mixed strategy. I calculated the information gain (in bits) and the probability of a correct guess, then used a linear combination of them to determine which letter to guess. Additionally, the weights changed over the course of the game so that correct guesses were valued more as incorrect guesses became more scarce.
To determine appropriate weights, I ran a crude Monte Carlo simulation. The final weights I ended up with were 0.60:0.03 with no incorrect guesses and 0.54:0.69 once there aren't any strikes left. (bits information gain : p(correct guess))
[+] [-] squeakynick|14 years ago|reply
I'm not arguing that they are, or are not, words. I'm just saying that they were not in the dictionaty file I used :)
[+] [-] onemoreact|14 years ago|reply
Vs a week player the ideal word list should only include 'common' words.
Vs an ideal player you need to assume they will pick from the list of words that you least likely to guess using optimum play.
Generically, optimum play ends up with a somewhat random list, aka 90% of the time pick E, 10% of the time pick I etc.
PS: Actually generating this list is a 'hard' problem and vary dictionary dependent, but you can probably get reasonably close using some sort of genetic algorithm and a enough simulation time.
[+] [-] eru|14 years ago|reply
Sounds like a challenge. I don't think, given a dictionary, finding the ideal strategy for both parties will be that challenging. It's a fairly straightforward two person zero-sum game. You can either model it assuming hidden information, or concurrent play. (Which is basically the same here.)
As an interesting variation, you might allow the chooser to cheat: I.e. don't make them write down the word in the first place, just require their play to be consistent.
[+] [-] K2h|14 years ago|reply
1) A hit or miss of a letter tells you a lot about what the new subsequent optimal guess is. You could make a flow chart (a really big one) that shows you the optimal letter to call out next based on what has hit and missed.
2) the position that a letter has hit tells you a lot about the target word. if you have a computer, the regex becomes trivial to identify the next optimal letter to guess. an 'optimal' table could be generated based off this pattern, and it would be a huge table.
I kept feeling like every step of the way it was an infomercial, 'but wait... there's more!' and I like that, it got me thinking.
I'd love to have my hangman bot go head to head with yours on random words. that would be a fun little project.
[+] [-] eshvk|14 years ago|reply
[+] [-] hammock|14 years ago|reply
Put another way, what are the letters I should guess when I don't know the length of the word?
While it wouldn't be perfect strategy, it would be easier to execute.
[+] [-] brownbat|14 years ago|reply
We should have a hangman AI tournament.
[+] [-] grandalf|14 years ago|reply
[+] [-] lorax|14 years ago|reply
[+] [-] dwd|14 years ago|reply
[+] [-] rgejman|14 years ago|reply
[+] [-] rgejman|14 years ago|reply
[deleted]
[+] [-] nraynaud|14 years ago|reply