
Alpha Go Zero: How and Why It Works

354 points | Mageek | 8 years ago | tim.hibal.org

109 comments

[+] shghs|8 years ago|reply
The main reason AlphaGo Zero learns so much faster than its predecessors is because it uses temporal-difference learning.[1] This effectively removes a huge amount of the value network's state space for the learning algorithm to search through, since it bakes in the assumption that a move's value ought to equal that of the best available move in the following board position, which is exactly what you'd expect for a game like Go.
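A minimal tabular TD(0) update illustrating the bootstrapping the parent describes (the state names, step size, and toy values here are invented for illustration, not from AlphaGo Zero):

```python
# Toy one-step temporal-difference (TD(0)) update for a tabular value function.
# A state's value is nudged toward the reward plus the value of the successor
# state, baking in the assumption that a position is worth its best follow-up.

def td0_update(values, state, next_state, reward, alpha=0.1, gamma=1.0):
    """Move values[state] toward the bootstrapped target reward + gamma * values[next_state]."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])
    return values

values = {"s0": 0.0, "s1": 1.0}        # pretend s1 is a known win (value 1)
td0_update(values, "s0", "s1", reward=0.0)
print(values["s0"])                    # → 0.1 (moved 10% of the way toward 1.0)
```

The point is that the target is built from the *next* state's estimate rather than waiting for the final game outcome, which is what shrinks the effective search problem.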

A secondary reason for AlphaGo Zero's performance is that it combines both value and policy networks into a single network, since it's redundant to have two networks for move selection.
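A toy sketch of that shared-trunk, two-headed design (shapes, initialization, and layer choices here are made up for illustration; the real network is a deep residual convnet):

```python
import numpy as np

# Minimal "one network, two outputs" sketch: a shared trunk feeds both a
# policy head (a distribution over moves) and a value head (a scalar).

rng = np.random.default_rng(0)
W_trunk = rng.normal(size=(361, 64))   # shared representation of a 19x19 board
W_policy = rng.normal(size=(64, 362))  # one logit per move (+1 for pass)
W_value = rng.normal(size=(64, 1))     # scalar value head

def forward(board_vec):
    h = np.tanh(board_vec @ W_trunk)            # shared features
    logits = h @ W_policy
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                      # softmax over moves
    value = np.tanh(h @ W_value)[0]             # value in [-1, 1]
    return policy, value

policy, value = forward(rng.normal(size=361))
print(policy.shape, float(value))               # (362,) and a scalar in [-1, 1]
```

Sharing the trunk means both heads are trained on the same representation, which is the redundancy the parent is pointing at.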

These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.

[1] https://en.wikipedia.org/wiki/Temporal_difference_learning

[+] smallnamespace|8 years ago|reply
Interestingly, the idea behind temporal difference learning is more or less the intuition behind how people price derivatives in finance.

The expected value of a contract at time T, estimated at some time t < T, is assumed to be equal (up to discounting) for all t -- e.g. if today we think the contract will be worth $100 a year later, then we also think that the expected estimate, made n months from now, of the value [12-n] months later, will also be $100. This allows you to shrink the state space considerably.

You can usually work out the payoff of a derivative in different scenarios given rational exercise decisions by all contract participants. The calculation assumes that every market participant makes the best possible decision given the information available at the time, either explicitly or implicitly building a tree and working backwards, propagating the 'future' value back to the root.
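That backward induction can be sketched on a tiny binomial tree (all numbers are illustrative; discounting is omitted and the up-probability is treated as risk-neutral for simplicity):

```python
# Backward induction on a binomial tree: terminal payoffs are propagated back
# to the root, with an "exercise decision" (a max) at each node -- the same
# back-up step a game-tree search performs.

def american_put_binomial(s0, strike, up, down, p, steps):
    """Price an American put by backward induction (no discounting, toy setup)."""
    # terminal stock prices and payoffs
    prices = [s0 * up**k * down**(steps - k) for k in range(steps + 1)]
    values = [max(strike - s, 0.0) for s in prices]
    # walk the tree backwards, backing each level up toward the root
    for level in range(steps - 1, -1, -1):
        prices = [s0 * up**k * down**(level - k) for k in range(level + 1)]
        values = [
            max(max(strike - prices[k], 0.0),              # exercise now
                p * values[k + 1] + (1 - p) * values[k])   # hold (expected continuation)
            for k in range(level + 1)
        ]
    return values[0]

print(american_put_binomial(100, 100, 1.1, 0.9, 0.5, steps=2))  # → 5.25
```

At every node the "rational exercise" decision is just the max over exercise-now vs. continuation value, which is the adversarial-game analogy the parent draws.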

This closely resembles the modeling of a discrete adversarial game, except the payoffs need to make reference to random market variables like the stock price, so the tree nodes are not just indexed by participant action, but also by variables.

There's actually a nice resemblance between the Longstaff-Schwartz method of pricing American options and MCTS + AlphaGo, except that the former uses kernel regressions instead of deep neural nets, and samples from a continuous space with an assumed probability distribution instead of a discrete space guided by a policy network [1].

[1] https://people.math.ethz.ch/~hjfurrer/teaching/LongstaffSchw...

[+] gwern|8 years ago|reply
> These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.

David Silver disagrees. The most critical distinguishing characteristic is the expert/tree iteration which makes stable self-play possible at all.

[+] _0ffh|8 years ago|reply
For anyone interested: Learn more on TD and RL in general from Sutton (inventor of TD-lambda) and Barto's book: http://www.incompleteideas.net/sutton/book/the-book.html

Sidenote: It used to be that simply googling "sutton barto book" would bring you to the right place as the first suggested link. Now this stuff is so popular all of a sudden that I needed to consult the link I had set on my own page in order to find it. It's curious how the growth in popularity of an idea can, with time, obscure its own roots and primary sources. On the plus side, TIL that Sutton's working on a 2nd edition! =)

[+] nilkn|8 years ago|reply
Are you sure AlphaGo Zero even uses temporal difference learning? The Nature paper suggests it does not and merely references some older programs that did. I think it just uses a somewhat custom form of self-play reinforcement learning combined with MCTS.
[+] dmix|8 years ago|reply
Interesting: a TD algorithm developed in the early 1990s by a Canadian AI researcher (now working with DeepMind) was previously used to beat expert players at backgammon, and advanced human understanding of the game:

> TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.

> TD-Gammon achieved a level of play just slightly below that of the top human backgammon players of the time. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.

https://www.wikiwand.com/en/TD-Gammon

[+] igravious|8 years ago|reply
OP explicitly discusses the second big distinguishing characteristic in the opening paragraph of the section titled 'The Alpha Zero Neural Net':

“The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy π and the approximate value function Ŵ are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network f that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)”

Regarding whether OP touches on temporal-difference learning I am unqualified to say, but they do not explicitly mention it. Furthermore I am unqualified to judge how central this technique is to the level of play achieved. However, in the DeepMind paper (pg. 20)† they start talking about temporal-difference learning thus:

“Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo[40, 41] used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning[42] to predict territory in games of self-play, building on prior work[43]. A related approach, RLGO[44], represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.”

I'm no expert but this implies to me that it was probably the sum total of all the subtle architectural decisions made by the DeepMind team plus their AI hardware and software platform that made AlphaGo Zero excel.

https://deepmind.com/documents/119/agz_unformatted_nature.pd...

[+] eru|8 years ago|reply
Of course, the techniques you have mentioned were known and tried before. The big thing in AlphaGo Zero is that they found a way to make them work, and the resulting architecture even manages to look simple.

(Eg AlphaGo using two different networks for policy and valuation was a big breakthrough back when it was new, because people couldn't make a single-network version work so well.)

For temporal difference learning the article about TD-Gammon, a backgammon AI from the early 90s, is great: http://www.bkgm.com/articles/tesauro/tdl.html (It's linked from the Wikipedia article you referenced, too.)

[+] gabrielgoh|8 years ago|reply
could you elaborate on this point? what you're saying sounds like dynamic programming, which does not reduce the state space at all, just saves on redundant computations (and is a favourite of programming interviews everywhere)
[+] saagarjha|8 years ago|reply
If the author's here: some of the math formulas don't render correctly. In particular, 10^170 is parsed as 10^{1}70, and $5478$ shows up without TeX applied to it.
[+] unpseudo|8 years ago|reply
There are two things a human brain does when playing chess or go: evaluating a position, and mentally playing out some positions (by building a search tree).

The AlphaGo neural network is able to do the first part (evaluating positions), but the search tree is still a hand-crafted algorithm. Do they have plans to work on a version with a pure neural network? (i.e. a version which would be able to learn how to do a tree search.)

[+] nemo1618|8 years ago|reply
Would be really cool to see a generic framework for this, where you can plug in the rules of your discrete-deterministic-game-with-perfect-information and get a superhuman bot. Does something like this already exist?
[+] fspeech|8 years ago|reply
Given that AlphaGo Zero was trained on several million games of self-play, each game involving hundreds of moves and each move 1600 MCTS simulations, the total number of board positions it has considered is on the order of trillions. While impressive, it pales in comparison to the number of possible board positions, ~10^170 (https://en.m.wikipedia.org/wiki/Go_and_mathematics). So its amazing performance tells us that:

1. Possibly the elegant rules of the game cut down the search space so much that there is a learnable function that gives us optimal MCTS supervision;

2. Or the CNN approximates human visual intuition so well that, while Zero has not evaluated that many board positions, it has effectively covered all the positions humans have ever considered. So it remains possible that a different network could produce different strategies and be better than Zero.
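The rough arithmetic behind the parent's estimate, using only the ballpark figures given above (these are not exact training statistics):

```python
# Back-of-the-envelope count of positions AlphaGo Zero "considered" during
# training, using the parent comment's ballpark inputs.
games = 5_000_000           # "several million games" of self-play
moves_per_game = 200        # "hundreds of steps" per game
sims_per_move = 1600        # MCTS simulations per move
positions = games * moves_per_game * sims_per_move

print(f"{positions:.1e}")   # ~1.6e12, i.e. trillions
print(positions / 10**170)  # a vanishingly small fraction of legal positions
```

Even this generous count covers roughly 10^-158 of the legal-position space, which is the gap the two hypotheses above try to explain.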

[+] pmarreck|8 years ago|reply
> It is interesting to see how quickly the field of AI is progressing. Those who claim we will be able to see the robot overlords coming in time should take heed - these AI's will only be human-level for a brief instant before blasting past us into superhuman territories, never to look back.

This final paragraph is just editorializing. A computer will never care about anything (including games like Go and domination of other beings) that it is not programmed to care about, or to imitate caring about, and will thus remain perennially unmotivated.

Also, my intuition says that gradient descent is an ugly hack and that there HAS to be some better (direct) way to get at the inverse of a matrix, not just in specific cases but in the general case. But I digress; not being a mathematician, perhaps someone has already proved that a general method to directly and efficiently invert all possible matrices is impossible.

[+] EGreg|8 years ago|reply
I wonder how the STYLE of Alpha Go Zero is regarded by human experts. Is it far different from AlphaGo? Why bother learning from AlphaGo if they can learn from AlphaGo Zero?

Did they unleash a second "Master" program?

I am wondering if the "better" strategy moves are now super wacky and weird and break all theory.

[+] bluetwo|8 years ago|reply
Saw the AlphaGo movie at a festival recently.

Been following the AlphaGo Zero developments, which leap-frog what was going on in the movie (although still very much worth seeing).

One thing I was curious about is whether Go would be considered solved, either strongly or weakly solved, since AlphaGo Zero at this point doesn't seem to be beatable by any living human. Wikipedia does not list it as solved in either sense, and I was wondering if this was an oversight.

[+] gizmo686|8 years ago|reply
Go still has not been ultra-weakly solved (i.e. we do not know who wins under perfect play).

If AlphaGo has weakly solved go, then it should either have a 100% win rate when playing against itself as white, or a 100% win rate when playing against itself as black.

[+] gcb0|8 years ago|reply
if alpha go is just an adversarial network to brute force states, then it is not solved (note I don't research alphaGo, and most of what I know about it is from HN comments)
[+] erikb|8 years ago|reply
I don't get what is new in the set of attributes that this article describes.

Monte Carlo was already used in 2005 in AIs playing on KGS. Gradient descent is a basic algorithm that I saw in an AI class in ~2008 as well. I bet both are even a lot older and well known by all experts.

This is not what makes AlphaGo special or Zero successful. The curious thing about Zero is that usually with gradient descent you run a huge risk of running into a local maximum and then stop evolving, because no incremental change makes you better than the current step.

So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how they overcame the local maximum problem. Additionally there may be other problems involved that experts know better than me.

But an explanation of basic algorithms can't be the answer.

[+] gwern|8 years ago|reply
> The curious thing about Zero is that usually with Gradient Descent you run a huge risk of running into a local maximum and then stop evolving because every evolution makes you not better than the current step.

No. The curious thing is that you can train a godawful huge NN with 40 layers via pure self-play with no checkpoints or baselines or pretraining or library of hard problems or any kind of stabilization mechanism, and it won't diverge but will learn incredibly rapidly and well and stably. As Silver says in the AMA, all their attempts at pure self-play ran into the usual divergence problems where the training explodes and engages in catastrophic forgetting, which is what the RL folklore predicts will happen if you try to do that. Local maxima are not the problem - the problem is that self-play can't even find a local maximum, much less maintain it or improve on it.

> So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how did they overcame the local maximum problem.

Er, this is exactly what OP is all about: the Monte Carlo tree search supervision. That's how they used them.

[+] nilkn|8 years ago|reply
> The curious thing about Zero is that usually with Gradient Descent you run a huge risk of running into a local maximum and then stop evolving because every evolution makes you not better than the current step.

Deep networks have many parameters and represent extremely high-dimensional cost functions. Some basic heuristics suggest that sufficiently random high-dimensional functions have relatively few local extrema, but lots of saddle points. (Look at the signs of the eigenvalues of the Hessian matrix and apply a coin flipping argument. This suggests that local extrema become exponentially rarer as the dimension increases.)

Moreover, stochastic gradient descent is actually pretty good at escaping saddle points.
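The coin-flip argument above can be checked numerically (a rough Monte Carlo experiment with random symmetric matrices standing in for Hessians; eigenvalues of such matrices are correlated, so this only illustrates the trend, not an exact 2^(1-d) rate):

```python
import numpy as np

# For a critical point to be a local extremum, all d eigenvalues of the
# Hessian must share a sign. Estimate how often that happens for random
# symmetric matrices as the dimension d grows.

def extremum_fraction(d, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        a = rng.normal(size=(d, d))
        h = (a + a.T) / 2                 # random symmetric "Hessian"
        eig = np.linalg.eigvalsh(h)
        if np.all(eig > 0) or np.all(eig < 0):
            count += 1
    return count / trials

for d in (1, 2, 5, 10):
    print(d, extremum_fraction(d))        # fraction drops rapidly with d
```

Almost every critical point in high dimension is a saddle, which is the parent's point about why local extrema are rarely the obstacle for deep nets.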

[+] zeristor|8 years ago|reply
Are there any plans to do this for Chess?

I imagine that this is an iteration of the AlphaGo engine, and the people working on it are very current with AlphaGo.

If Chess is similar, then wouldn't DeepMind be able to bootstrap game knowledge? Perhaps this isn't a big goal, but Chess is Chess after all.

[+] partycoder|8 years ago|reply
Go has been studied for hundreds of years. In many cases, by people who study the game since their childhood and work on it as a full-time occupation.

The consequence of Alpha Go Zero is that it can, in a matter of days, disregard and surpass all human knowledge about the game.

Maximizing the score margin has long been equated with maximizing your probability of winning. AlphaGo doesn't play like that... it's a substantial paradigm shift. If you watch the early commentaries, you will see that human players initially called AlphaGo's moves mistakes, because they seemed slow and wasted opportunities to take more territory, only to realize that AlphaGo was actually winning.

[+] yters|8 years ago|reply
Are there adversarial examples for Alpha Go Zero?
[+] javajosh|8 years ago|reply
"On a long enough timeline, everything is a discrete game." (With apologies to _Fight Club_)

Personally, I look forward to the day when the software I own works for me to the extent of optimizing the decisions I make during the day, even many mundane ones. Properly executed, such a system could make a big difference in my quality of life. I believe that a big piece that is missing is a solid life-model, a life-representation, that can be optimized. Once that is defined, an RNN or MCS can optimize it and I can reap the benefits.

[+] ben_w|8 years ago|reply
If it is optimising all your decisions, even the mundane ones, is it really your life any more? If it plugged into your brainstem and took over, by definition nobody else would notice the difference (you always did what it said anyway, or you’d be suboptimal), so why not just flood the 20 watt carbon-based neural network with bliss drugs and let the silicon-based neural network be the person instead?
[+] StavrosK|8 years ago|reply
Complete with creepy predictions like "Predicted optimal life path: $120,000 net worth, 63 year life span".
[+] jeffshek|8 years ago|reply
Site : https://betterself.io/ GitHub : https://github.com/jeffshek/betterself

I do this similarly but with supplements and medication. I track a lot of my supplements that boost productivity and sleep and disregard anything that's negative. I've been continuing this over the course of a year.

I haven't added anything like an RNN to it for recommendations though. I have some fears about messing up the gradient descent component :)

[+] pmontra|8 years ago|reply
What if everybody uses the same software from a handful of companies, maybe only one? It will be interesting to see how it resolves conflicts between the goals of different customers. More interestingly, how does it weigh the goals of its company against those of its customers?

Based on human nature and history I don't expect anything good coming from that.

[+] QML|8 years ago|reply
What makes this different from a minimax algorithm with alpha-beta pruning?
[+] gwern|8 years ago|reply
Aside from MCTS being a different tree search method, in a plain minimax search there is no 'closing of the loop'. In regular MCTS, it is far from unheard of to replace the random playouts with some 'heavier' heuristic to make the playouts slightly better estimators of node value, but those heavy playouts do no learning: the heuristic you start with is the heuristic you end with. What makes Zero analogous to policy iteration (hence the names 'tree iteration' or 'expert iteration' for the Zero algorithm) is that the refined estimates from the many heavy playouts are then used to improve the heavy playout heuristic itself (i.e. a NN, optimized via backpropagation as supervised learning of board position -> value). Then, in a self-play setting, the MCTS continually refines its heavy heuristic (the NN) until it's so good that NN+MCTS is superhuman. At play time you can drop the MCTS entirely and just use the heavy heuristic with a very simple tree search to choose a move (which I think might actually be a fixed-depth minimax, but I forget).
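A toy version of that loop (everything here is invented for illustration: a one-ply three-move "game" with fixed payoffs, a tabular "policy network", and a crude update rule; a miniature of the idea, not AlphaGo Zero's actual training):

```python
import math

# Expert iteration in miniature: a PUCT-style search improves on the raw
# policy, then the policy is regressed toward the search's visit counts.

payoff = {"a": 0.1, "b": 0.9, "c": 0.3}    # true move values (hidden from the policy)
logits = {m: 0.0 for m in payoff}           # the "policy network" (tabular)

def softmax(d):
    mx = max(d.values())
    e = {k: math.exp(v - mx) for k, v in d.items()}
    z = sum(e.values())
    return {k: v / z for k, v in e.items()}

def search(prior, n_sims=100, c=1.0):
    """One-ply search guided by the prior; visit counts form an improved policy."""
    visits = {m: 0 for m in prior}
    value_sum = {m: 0.0 for m in prior}
    for _ in range(n_sims):
        n = sum(visits.values()) + 1
        def puct(m):
            q = value_sum[m] / visits[m] if visits[m] else 0.0
            return q + c * prior[m] * math.sqrt(n) / (1 + visits[m])
        m = max(prior, key=puct)
        visits[m] += 1
        value_sum[m] += payoff[m]           # "playout" returns the true payoff here
    return {m: visits[m] / n_sims for m in visits}

for _ in range(20):                         # close the loop: search output trains the policy
    p = softmax(logits)
    target = search(p)
    for m in logits:
        logits[m] += target[m] - p[m]

print(max(logits, key=logits.get))          # → "b", the highest-payoff move
```

The search is the "expert" that is always a bit stronger than the raw policy, and each round of training pulls the policy up toward it.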
[+] Mageek|8 years ago|reply
The branching factor for games like Go is too high. Alpha Go Zero uses a stochastic policy to focus its attention on useful branches. This policy is optimized alongside the predicted value for each node.
[+] GolDDranks|8 years ago|reply
Minimax with alpha-beta pruning is an exhaustive search algorithm. MCTS is a probabilistic algorithm that optimises for converging faster to a partial "good enough" result.

Introducing MCTS to Go-playing programs in 2006 brought about a revolution in playing strength. For the first time, Go programs stood a chance against a dan-level amateur.

[+] cgearhart|8 years ago|reply
I wonder how AlphaGo Zero would fare against the others if they were all using the same search algorithm, and I wonder how the search depth vs breadth changes in Zero compared to earlier variants.
[+] Blazespinnaker|8 years ago|reply
Mageek, any reason why they haven't applied this to chess yet?
[+] glinscott|8 years ago|reply
MCTS has done very poorly on chess compared to alpha-beta. Chess has a high number of forcing moves in capture sequences, and it's been very difficult to evaluate positions that are not settled. Traditionally an algorithm called a quiescence search is used, but it relies on doing an evaluation at each node of the search, which would be prohibitive with the latency for a network evaluation.

One of the things that amazed me the most about AlphaGo Zero was that they didn't do any tricks to minimize the latency of the network evaluation!

Still, it's certainly worth a try, I'd be extremely interested to see what style of chess a self-trained MCTS chess version would have :).
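The quiescence idea mentioned above can be sketched abstractly (the position helpers `evaluate`/`captures`/`make` are hypothetical stand-ins, not a real chess engine API):

```python
# Quiescence search sketch: at the search horizon, instead of trusting the
# static eval of a tactically "noisy" position, keep searching capture moves
# until the position is quiet. Note it calls evaluate() at every node, which
# is the per-node evaluation cost the parent says makes NN latency painful.

def quiesce(pos, alpha, beta, evaluate, captures, make):
    """Negamax quiescence; evaluate() scores from the side-to-move's perspective."""
    stand_pat = evaluate(pos)          # static eval as a lower bound
    if stand_pat >= beta:
        return beta                    # fail-hard beta cutoff
    alpha = max(alpha, stand_pat)
    for move in captures(pos):         # only tactical (capture) moves
        score = -quiesce(make(pos, move), -beta, -alpha, evaluate, captures, make)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

# Toy positions: (static_eval, [positions reachable by captures])
root = (0, [(-5, [])])                 # one capture wins material (+5 for us)
print(quiesce(root, float("-inf"), float("inf"),
              lambda p: p[0], lambda p: p[1], lambda p, m: m))  # → 5
```

Because the stand-pat evaluation is needed at every node, swapping in a slow neural-network evaluation multiplies the cost of exactly this innermost loop.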

[+] bluetwo|8 years ago|reply
The founder of DeepMind was an internationally ranked chess prodigy, so that's a good question.
[+] jacobkg|8 years ago|reply
Can this technique be used to write a strong chess engine?