Hi, I am the first author of this paper and I am happy to answer any questions. The technical paper is here: https://arxiv.org/abs/2205.09665.
Hey this is cool, I do the NYT Crossword every day. A few questions.
1. You mention an 82% solve rate. The NYT puzzle gets "harder" each day Monday through Saturday. Do you track the days separately? If so, I'd be curious how many of the 18% unsolved fall on Fridays and Saturdays. (For anyone who doesn't know, the Sunday puzzle is outside the M-Sat range since it's a bigger puzzle.)
2. Related to the above, Thursday puzzles usually have "tricks" (skipped letters and whatnot) or require a Rebus (multiple letters in one space). Do you handle these at all?
3. Is this building an ongoing model and getting better at solving? Or did you have to seed it with a set of solved puzzles and clues?
For handling cross-reference clues, do you think it would be feasible in the future to feed the QA model a representation of the partially-filled puzzle (perhaps only in the refinement step - hard to do for the first step before you have any answers!), in order to give it a shot at answering clues that require looking at other answers?
It feels like the challenges might be that most clues are not cross-referential, and even for those that are, most information in the puzzle is irrelevant - you only care about one answer among many, so it could be difficult to learn to find the information you need.
But maybe this sort of thing would also be helpful for theme puzzles, where answers might be united by the theme even if their clues are not directly cross-referential, and could give enough signal to teach the model to look at the puzzle context?
One thing I was curious about - the ACPT is a crossword speed-solving competition, with time spent solving a major aspect of total score. How did you approach leveling the playing field between the human and computer competitors?
American Crosswords are different in two key ways as I understand it:
Firstly, all "serious" British crosswords are "Cryptic", i.e. once you figure out what the answer is, it's apparent why it's correct, but figuring out the answer from the clue involves lateral thinking and some skills learned from years of staring at such clues.
e.g. Private Eye's crossword 726 (back in April), clue 23 down,
"He finally gets to penetrate agreeable person (relatively) (5)"
The correct answer is "Niece". "Nice" can mean agreeable, the final letter of "He" is E, and so by having the letter E "penetrate" the word nice you produce "niece", a person who is a relative.
[ and yes, Private Eye is a satirical magazine, the crossword clues are, likewise, intended to make you a little uncomfortable while you laugh ]
Secondly, British crosswords are arranged with black "dead" squares between letters to produce more of a lattice, in which many letters take part in only one word. As a result, longer answers are common.
e.g. same crossword, clue 26 across is
"Figure on getting your teeth into our statistical revelations (6,9)"
I'm reminded of an article I read about an AI that competed in a crossword competition and one particularly difficult clue it faced was "Apollo 11 and 12 [180 degrees]". I don't know if it would be allowed as part of a cryptic crossword, but the number of letters in the (words of the) answer were 8, 4.
My dad was a big crossword puzzler. I asked him whether, if you picked one of two possible answers to the first clue, it would be possible to solve the entire puzzle one way or the other. He sat down and created a series of puzzles with "themes", e.g. "north", "south" or "Schiller", "Goethe", where all the major words were from one or the other theme.
Anyway, it would be interesting to see what the AI would do with this: would there be two hotspots in the solution space, one for each variant?
Also, famously, there's the November 5, 1996 NYT puzzle, where a clue about the newly elected president could be solved as either CLINTON or BOBDOLE, and all the crossing words had two solutions.
If they trained the AI on the NYT archive then they would have the results of testing it on this one.
I do the puzzle every day. I've been collecting clues that have caused me trouble, i.e. led me to a wrong answer. I hope someday to be able to construct such a puzzle with one set of clues and two complete but incompatible solutions.
A thesaurus will get you far, but will never get you OREOCOOKIE and CHESSBOARD as answers for "It's all black and white" (from today's puzzle).
Anyone else getting a bit bored with all these "AI does some super-specialised task better than humans after enormous amounts of training" stories? It’s not very interesting anymore.
Sure, it can do crosswords well but the average human that does crosswords well can also do a zillion other things and this type of AI is not getting us any closer to that.
If you skim the paper, you’ll realize what’s most interesting are the new techniques they developed to accomplish this, advancing the field of machine learning in the process.
Do you have any idea just how specialized the human brain is?
I can just imagine if evolution was a side spectator event, people commenting: "Broca's area just regulates breathing. And that Wernicke's area is just pattern recognition in sounds. Those aren't going to get us to anything important."
Point me to actual large generalized models in nature that aren't composed of smaller specialized functions and you might have a leg to stand on.
(Oh wait, no, those leg things are pretty specialized too, and each has its own specialized parts. Bad analogy.)
Well, good luck identifying an example of complex generalization without subspecialties!
But every specialized model like this is getting us closer to "doing a zillion other things." By logic it is exactly one step closer. The general AI agent will be composed of many such models.
ericwallace_ucb | 3 years ago
mikeryan | 3 years ago
Sorry didn't have time to read the whole paper.
Imnimo | 3 years ago
twright0 | 3 years ago
mikeryan | 3 years ago
avrionov | 3 years ago
gardenfelder | 3 years ago
thom | 3 years ago
tialaramex | 3 years ago
The answer was "Number Crunching".
dane-pgp | 3 years ago
The answer to that clue is included here:
https://www.uh.edu/engines/epi2783.htm
jamespwilliams | 3 years ago
interestica | 3 years ago
There may be other sites that allow it -- the software seems to power a few different crossword sites (with certain features enabled/disabled).
zwieback | 3 years ago
mcherm | 3 years ago
evanb | 3 years ago
cinntaile | 3 years ago
r0b05 | 3 years ago
mnd999 | 3 years ago
DantesKite | 3 years ago
kromem | 3 years ago
joshcryer | 3 years ago
flafla2 | 3 years ago
That is not obvious at all.