sawwit | 10 years ago
To summarize, I believe what they do is roughly this: First, they take a large collection of Go moves from expert players and use a convolutional neural network, which simply takes the 19 x 19 board as input, to learn a mapping from position to move (a policy). Then they refine a copy of this network using reinforcement learning, by letting the program play against other instances of the same program. For that they additionally train a mapping from a position to the probability that it leads to winning the game (the value of that state). With these two networks they navigate through state space: First they produce a couple of learned expert moves given the current state of the board with the policy network. Then they check the values of these moves and branch out over the best ones (among other heuristics). When some termination criterion is met, they pick the first move of the best branch, and then it's the other player's turn.
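Roughly what I mean, as a toy sketch (the "networks" here are random stand-in functions, not real CNNs, and none of the names come from the paper):

```python
import random

random.seed(0)
BOARD = 19 * 19  # toy move space: one integer per intersection

def policy_net(state):
    """Stand-in for the policy CNN: a few candidate moves with priors."""
    moves = random.sample(range(BOARD), 5)
    priors = [random.random() for _ in moves]
    total = sum(priors)
    return [(m, p / total) for m, p in zip(moves, priors)]

def value_net(state):
    """Stand-in for the value CNN: estimated win probability
    for the player to move."""
    return random.random()

def search(state, branch_width=3, depth=2):
    """Expand the top policy moves, score leaves with the value net,
    and return (first move of the best branch, its estimated value)."""
    if depth == 0:
        return None, value_net(state)
    candidates = sorted(policy_net(state), key=lambda mp: -mp[1])[:branch_width]
    best_move, best_value = None, -1.0
    for move, _prior in candidates:
        child = state + (move,)
        _, child_value = search(child, branch_width, depth - 1)
        value = 1.0 - child_value  # opponent's win probability flips for us
        if value > best_value:
            best_move, best_value = move, value
    return best_move, best_value
```

The real thing interleaves this with MCTS statistics rather than a fixed-depth sweep, but the shape (policy proposes, value scores, pick the first move of the best branch) is the same.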
sillysaurus3|10 years ago
When some termination criterion is met
How is this calculated? Were these criteria learned automatically, or coded/tweaked manually?
sawwit|10 years ago
2. They just run a certain number of simulations, i.e. they compute n different branches all the way to the end of the game with various heuristics.
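A fixed simulation budget as the "termination criterion" can be sketched like this (toy stand-ins; the real rollouts use a fast learned policy, not random moves):

```python
import random

random.seed(1)

def rollout(state, max_moves=50):
    """Stand-in playout: play random toy moves to the 'end' and return
    True on a win (random here; a fast rollout policy in reality)."""
    for _ in range(max_moves):
        state = state + (random.randrange(361),)
    return random.random() < 0.5

def estimate_win_rate(state, n_simulations=200):
    """Termination criterion = a fixed number of simulations."""
    wins = sum(rollout(state) for _ in range(n_simulations))
    return wins / n_simulations
```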
brian_cloutier|10 years ago
Existing research throws a bunch of professional games at a DCNN and trains it to predict the next move.
It generally does quite well but fails hilariously when you give it a situation which never comes up in pro games. Go involves lots of implicit threats which are rarely carried out. These networks learn to make the threats but, lacking training data, are incapable of following up.
The first step of creating AlphaGo worked the same way (and was actually worse at predicting the next move than the current state of the art), but DeepMind then took that base network and retrained it. Instead of playing the move a pro would play, it now plays the move most likely to result in a win.
For pros these are the same move. But for AlphaGo, embedded in a completely different MCTS environment, they can be quite different. DeepMind then played the engine against older versions of itself and used reinforcement learning to make the network as accurate as possible.
They effectively used the human data to bootstrap a better player. The paper used a lot of other cool techniques and optimizations, but I think this one might be the coolest.
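A toy picture of that self-play loop, REINFORCE-style (one made-up scalar parameter stands in for the whole network; none of this is DeepMind's actual training code):

```python
import math
import random

random.seed(2)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def play_game(theta_a, theta_b):
    """Toy game: each 'policy' is a single probability of playing the
    good move; whoever plays more good moves wins (+1 / -1 for player A)."""
    score_a = sum(random.random() < sigmoid(theta_a) for _ in range(20))
    score_b = sum(random.random() < sigmoid(theta_b) for _ in range(20))
    return 1 if score_a >= score_b else -1

theta = 0.0          # the "supervised" starting point
frozen = [theta]     # pool of older versions to play against
lr = 0.1
for step in range(200):
    opponent = random.choice(frozen)   # play an older self
    result = play_game(theta, opponent)
    # Nudge theta up on wins, down on losses (crude policy gradient).
    theta += lr * result * sigmoid(theta) * (1 - sigmoid(theta))
    if step % 50 == 49:
        frozen.append(theta)           # snapshot this version
```

The essential ingredients are there: start from the supervised parameters, play against frozen older copies, and push the parameters toward whatever won.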
space_fountain|10 years ago
In this case, though, they play and optimize against themselves.