sawwit | 10 years ago
To summarize, I believe what they do is roughly this: First, they take a large collection of Go moves from expert players and use a convolutional neural network, which simply takes the 19 x 19 board as input, to learn a mapping from position to move (a policy). Then they refine a copy of this network using reinforcement learning, by letting the program play against other instances of the same program. For that they additionally train a mapping from a position to the probability that it leads to winning the game (the value of that state). With these two networks they navigate through state space: First they produce a couple of learned expert moves given the current state of the board with the policy network. Then they check the values of these moves and branch out over the best ones (among other heuristics). When some termination criterion is met, they pick the first move of the best branch, and then it's the other player's turn.
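Roughly what I mean, as a toy sketch (the "networks" here are random stand-in functions, not real CNNs, and none of the names come from the paper):

```python
import random

random.seed(0)
BOARD = 19 * 19  # toy move space: one integer per intersection

def policy_net(state):
    """Stand-in for the policy CNN: a few candidate moves with priors."""
    moves = random.sample(range(BOARD), 5)
    priors = [random.random() for _ in moves]
    total = sum(priors)
    return [(m, p / total) for m, p in zip(moves, priors)]

def value_net(state):
    """Stand-in for the value CNN: estimated win probability
    for the player to move."""
    return random.random()

def search(state, branch_width=3, depth=2):
    """Expand the top policy moves, score leaves with the value net,
    and return (first move of the best branch, its estimated value)."""
    if depth == 0:
        return None, value_net(state)
    candidates = sorted(policy_net(state), key=lambda mp: -mp[1])[:branch_width]
    best_move, best_value = None, -1.0
    for move, _prior in candidates:
        child = state + (move,)
        _, child_value = search(child, branch_width, depth - 1)
        value = 1.0 - child_value  # opponent's win probability flips for us
        if value > best_value:
            best_move, best_value = move, value
    return best_move, best_value
```

The real thing interleaves this with MCTS statistics rather than a fixed-depth sweep, but the shape (policy proposes, value scores, pick the first move of the best branch) is the same.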
sillysaurus3|10 years ago
When some termination criterion is met
How is this calculated? Were these criteria learned automatically, or coded/tweaked manually?
sawwit|10 years ago
2. They just run a certain number of simulations, i.e. they compute n different branches all the way to the end of the game with various heuristics.
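A fixed simulation budget as the "termination criterion" can be sketched like this (toy stand-ins; the real rollouts use a fast learned policy, not random moves):

```python
import random

random.seed(1)

def rollout(state, max_moves=50):
    """Stand-in playout: play random toy moves to the 'end' and return
    True on a win (random here; a fast rollout policy in reality)."""
    for _ in range(max_moves):
        state = state + (random.randrange(361),)
    return random.random() < 0.5

def estimate_win_rate(state, n_simulations=200):
    """Termination criterion = a fixed number of simulations."""
    wins = sum(rollout(state) for _ in range(n_simulations))
    return wins / n_simulations
```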
brian_cloutier|10 years ago
Existing research throws a bunch of professional games at a DCNN and trains it to predict the next move.
It generally does quite well but fails hilariously when you give it a situation which never comes up in pro games. Go involves lots of implicit threats which are rarely carried out. These networks learn to make the threats but, lacking training data, are incapable of following up.
The first step of creating AlphaGo worked the same way (and was actually worse at predicting the next move than the current state of the art), but DeepMind then took that base network and retrained it. Instead of playing the move a pro would play, it now plays the move most likely to result in a win.
For pros these are the same move. But for AlphaGo, embedded in a completely different MCTS environment, they can be quite different. DeepMind then played the engine against older versions of itself and used reinforcement learning to make the network as accurate as possible.
They effectively used the human data to bootstrap a better player. The paper used a lot of other cool techniques and optimizations, but I think this one might be the coolest.
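A toy picture of that self-play loop, REINFORCE-style (one made-up scalar parameter stands in for the whole network; none of this is DeepMind's actual training code):

```python
import math
import random

random.seed(2)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def play_game(theta_a, theta_b):
    """Toy game: each 'policy' is a single probability of playing the
    good move; whoever plays more good moves wins (+1 / -1 for player A)."""
    score_a = sum(random.random() < sigmoid(theta_a) for _ in range(20))
    score_b = sum(random.random() < sigmoid(theta_b) for _ in range(20))
    return 1 if score_a >= score_b else -1

theta = 0.0          # the "supervised" starting point
frozen = [theta]     # pool of older versions to play against
lr = 0.1
for step in range(200):
    opponent = random.choice(frozen)   # play an older self
    result = play_game(theta, opponent)
    # Nudge theta up on wins, down on losses (crude policy gradient).
    theta += lr * result * sigmoid(theta) * (1 - sigmoid(theta))
    if step % 50 == 49:
        frozen.append(theta)           # snapshot this version
```

The essential ingredients are there: start from the supervised parameters, play against frozen older copies, and push the parameters toward whatever won.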
space_fountain|10 years ago
In this case, though, they play and optimize against themselves.