Ironically, a lot of the tricks Stockfish is using here are reminiscent of tricks that were used in the original AlphaGo and later discarded in AlphaGoZero.
In particular, the AlphaGo paper mentioned four neural networks of significance:
- a policy network trained on human pro games.
- an RL-enhanced policy network improving on the original SL-trained policy network.
- a value network trained on games generated by the RL-enhanced policy network.
- a cheap policy network trained on human pro games, used only for rapid rollout simulations.
The cheap rollout policy network was discarded because DeepMind found that "slow evaluations of the right positions" beat "rapid evaluations of questionable positions". The independently trained value network was discarded because co-training a value and policy head on a shared trunk saved a significant amount of compute and helped regularize both objectives against each other. The RL-enhanced policy network was discarded in favor of training the policy network to directly replicate MCTS search statistics.
The depth and branching factor in chess and Go are different, so I won't say the solutions ought to be the same, but it's interesting nonetheless to see the original AlphaGo ideas be resurrected in this form.
The incremental updates are also related to Zobrist Hashing, which the Stockfish authors are certainly aware of.
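Zobrist hashing relies on the same incremental idea: a full recomputation XORs one random key per (piece, square) pair, while a move updates the hash with just two XORs. A minimal sketch (the key layout and quiet-move handling here are illustrative, not Stockfish's actual code):

```python
import random

random.seed(0)

# One random 64-bit key per (piece type, square) pair: 12 piece types, 64 squares.
KEYS = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

def full_hash(board):
    """Recompute the hash from scratch. board: iterable of (piece, square) pairs."""
    h = 0
    for piece, square in board:
        h ^= KEYS[piece][square]
    return h

def update_hash(h, piece, from_sq, to_sq):
    """Incremental update for a quiet move: XOR the piece out of its old
    square and into its new one. Two XORs instead of a full pass."""
    return h ^ KEYS[piece][from_sq] ^ KEYS[piece][to_sq]
```

Because XOR is its own inverse, the incremental result is bit-identical to a full recomputation, which is exactly the property the integer NNUE accumulator mirrors with addition.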
It's also different because Stockfish uses alpha-beta tree search instead of MCTS:
MCTS deals with the explosion of the search space by sampling only a very small part of it and relying on a very good heuristic to guide the search. In that setting it is crucial to find the most relevant subset of the tree to explore, so spending more time on your policy makes sense.
Alpha-beta pruning, however, explores the search tree systematically (up to a certain depth, using iterative deepening) and prunes it by discarding bad moves. Here you don't need as good an evaluation function, because you search the whole tree anyway; instead you need the function to be fast, since it is evaluated on many more states.
In general, AB-pruning only needs the heuristic for estimating the tail end of the tree and for sorting moves by usefulness, while MCTS uses all of the above plus guiding the whole search process.
Spending tons of time on the heuristic is not that useful for AB, since even the optimal move ordering can only double your search depth. Don't get me wrong, that is still a lot (especially considering the exponential blowup), but MCTS can surpass this depth easily. The disadvantage is that MCTS loses many of alpha-beta's guarantees and tends to "play down to its opponent" when trained using self-play, because the exploration order is entirely determined by the policy.
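For concreteness, the alpha-beta scheme described above can be sketched in negamax form; the `evaluate`, `moves`, and `apply_move` callbacks are hypothetical stand-ins for a real engine's machinery, which would add move ordering, transposition tables, and iterative deepening on top:

```python
def alphabeta(state, depth, alpha, beta, evaluate, moves, apply_move):
    """Minimal negamax alpha-beta. `evaluate` is the cheap leaf heuristic
    (from the side-to-move's perspective); `moves` lists legal moves,
    ideally pre-sorted best-first, which is the heuristic's ordering role."""
    if depth == 0 or not moves(state):
        return evaluate(state)
    best = float("-inf")
    for m in moves(state):
        score = -alphabeta(apply_move(state, m), depth - 1,
                           -beta, -alpha, evaluate, moves, apply_move)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:  # cutoff: the opponent will never allow this line
            break
    return best
```

The cutoff is where move ordering pays off: the earlier a strong move is tried, the more of the remaining siblings can be skipped.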
Right now there is a lot of experimentation with adjusting the network architecture. The current leading approach is a much larger net which takes in attack information per square (e.g., is this piece attacked by more pieces than it's defended by?). That network is a little slower, but the additional information seems to be enough to be stronger than the current architecture.
Btw, the original Shogi developers really did something amazing. The nodchip trainer is all custom code, and trains extremely strong nets. There are all sorts of subtle tricks embedded in there as well that led to stronger nets. Not to mention, getting the quantization (float32 -> int16/int8) working gracefully is a huge challenge.
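As a toy illustration of that quantization step, here is symmetric per-tensor scaling only; the actual NNUE scheme, with its per-layer scales, int16 accumulators, and clipped activations, is considerably more involved:

```python
def quantize_int8(w):
    """Map float weights onto int8 via a single symmetric scale factor.
    Sketch only: one scale for the whole tensor, range clipped to [-127, 127]."""
    scale = max(abs(x) for x in w) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is bounded by the scale."""
    return [x * scale for x in q]
```

The hard part alluded to above is keeping the accumulated error of many such layers from degrading playing strength, which is why the real scheme needs its subtle tricks.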
Just wanted to say thanks for many years of fantastic work on both Stockfish and Leela. The computer chess community owes you a huge debt of gratitude!
Interesting how before AlphaZero it was mainly "search matters the most", with crazy low effective branching factors to get deeper. It seems that humans were just better at search heuristics than at evaluation ones.
This is very nice. Reminds me a lot of tricks used for simple linear models. And it seems to work, given that Leela Chess Zero is losing with quite a gap.
Most of the time one would query the model after changing just a single feature, so redoing the whole sum made no sense.
A good example is learning a sequence of decisions, where each decision might have a cost associated with it. You can then say that the current decision depends on a previous one, and vary the previous one to learn to recover from errors: if the previous decision was bad, you'd still like to make the best decision for the current state.
So even if your training data does not contain such an error-recovery example, you can just iterate through all previous decisions and make the model learn to recover.
An optimization in that case is to not redo the whole sum when computing the decision function of the linear model.
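That optimization can be sketched for a plain linear model (the function names here are illustrative): cache the score, and when a single feature changes, apply only the delta.

```python
def score(w, x, b):
    """Full decision function of a linear model: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def score_after_change(old_score, w, x, i, new_value):
    """If only feature i changes, adjust the cached score by the delta
    instead of redoing the whole sum: O(1) instead of O(n)."""
    return old_score + w[i] * (new_value - x[i])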
> Leela Chess Zero is losing with quite a gap

Check https://tcec-chess.com/ (at the moment Leela is in front again).
That being said, it is not clear which approach is better: a very smart but relatively slow evaluation function (Leela), or SF's approach of using a relatively dumb eval function with a more sophisticated search. It is pretty clear that Leela's search can be improved (check https://github.com/dje-dev/Ceres for example).
It seems that the strategy is to use the neural network to score various moves and then a search strategy to try to find moves that result in a favourable score. And this post focuses on some of the technical engineering details to design such a scoring network. In particular the scoring is split into two parts: 1. A matrix multiplication by a sparse input vector to get a dense representation of a position, and 2. Some nonlinear and further layers after this first step. And it seems that step 1 is considered more expensive.
The way this is made cheap is by making it incremental: given some board s and the output of the first layer b + Ws, it is cheap to compute b + Wt where t is a board that is similar to s (the difference is W(t-s), and the vector t-s is zero in almost every element).
This motivates some of the engineering choices like using integers instead of floats. If you used floats then this incremental update wouldn’t work.
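The update described above can be sketched directly; this is a toy accumulator with integer weights, where the dimensions and the (index, delta) encoding of a board change are made up for illustration and the real NNUE first layer is far wider:

```python
def first_layer(W, b, s):
    """Full computation of the first-layer accumulator b + W s.
    W is an n_out x n_in integer matrix, s a 0/1 feature vector."""
    return [bi + sum(row[j] * s[j] for j in range(len(s)))
            for bi, row in zip(b, W)]

def update(acc, W, changed):
    """Incremental update: `changed` lists (feature index, delta) pairs for
    the few features that differ between the old and new position.
    Only the touched columns of W are read."""
    return [a + sum(W[i][j] * d for j, d in changed)
            for i, a in enumerate(acc)]
```

With integer arithmetic the incremental path is bit-identical to the full recomputation, so the accumulator can be carried along the search tree indefinitely.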
It seems to me that a lot of the smarts of stockfish will be in the search algorithm getting good results, but I don’t know if that just requires a bit of parallelism (surely some kind of work-stealing scheduler) and brute force or if it mostly relies on some more clever strategies. And maybe I’m wrong and the key is really in the scoring of positions.
> It seems that the strategy is to use the neural network to score various moves and then a search strategy to try to find moves that result in a favourable score
There is a LOT of cleverness built into the stockfish search as well.
In fact, while neural networks have now won the position evaluation game, it is still an open discussion in the chess programming community whether the AlphaZero/Leela search (NN + Monte Carlo) is as good as Stockfish's PVS.
I don't think the integer weights are necessary for sparsity; they are just faster because they allow for low precision.
Of course floating-point addition isn't strictly associative, so you wouldn't get bitwise equivalence between the incremental and non-incremental updates, but I don't see how that would matter in this context.
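The non-associativity is easy to demonstrate:

```python
# Floating-point addition is not associative, so an incrementally maintained
# float accumulator need not be bit-identical to a from-scratch recomputation.
# Integer arithmetic has no such drift.
a, b, c = 0.1, 0.2, 0.3
assert (a + b) + c != a + (b + c)   # the two groupings differ in the last bit
assert (1 + 2) + 3 == 1 + (2 + 3)   # integer sums are exact and order-independent
```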
It's difficult to organise Leela-Stockfish as a fair fight, because they run on different hardware (CPU vs GPU) and both get substantial improvements by playing on better hardware.
Traditionally this wasn't a big problem as every engine was more or less optimised for a fast Intel CPU with a moderate to large amount of RAM. The organisers would decree the specs of the championship hardware some time in advance. Now, (at least for TCEC, the other major engine tournament) they pick two hardware configurations, one CPU-heavy and one GPU-heavy, and give each team the choice.
How do you balance those? It's been suggested you should make them equal in terms of watts of power, or dollar cost to buy, but neither of those are obviously best. In practice the TCEC organisers pick something close to what they picked last time but shade it against the winning engines, making the contest more even. Chess.com likely do something similar though they're less rigorous about the details.
The chess.com version is the old Stockfish from mid 2020.
The NNUE architecture was only put in place around August. If you look at https://github.com/glinscott/fishtest/wiki/Regression-Tests you'll notice that it gave a very significant (100+ Elo) boost.
There is a current tournament going on at tcec-chess.com/ which stockfish has been leading so far, but I see Leela has just caught up in the head to head.
The gap between white and black piece performance is massive in these top engines if I'm reading it right. LCZero won 0/ lost 4 as black and won 24 / lost zero as white (with lots of draws)? I had no idea the split was so big now. Do human tournaments look like this too these days?
(From https://tcec-chess.com/)
Not really: similar for sure, but not as bad for black. Consider that computer chess tournaments do not start with the pieces in their initial positions. The first 5-15 moves are predetermined by human master players to achieve positions that are imbalanced without giving either side a clear advantage. Otherwise it would be even more of a draw fest.
This makes things a bit worse for black, typically, because playing for the win as black means making some positional concession (that white can exploit after successfully defending) in exchange for the initiative and a chance for an attack.
In human play, black has a little more ability than white to steer the game towards their opening preparation[1]. If they want to win, they will try to get a position that they know inside and out from (computer-assisted) home preparation. But there are still a lot of draws in classical (many hours per game) chess.
[1] Chess openings that white chooses are typically called "attack" (King's Indian Attack) or "game" (Scotch Game), while those that black chooses are called "defense" (Sicilian defense). You'll find that there are a lot more "defenses" that black can choose from than "attacks".
Overwhelmingly the result is a draw. White can win sometimes, and wins for black are very rare. It's similar for top grandmasters, but not quite as stark.
The current DivP (TCEC's top division) uses very unbalanced openings so that the engines are able to score at all. Using the starting position or more balanced openings inevitably leads to draws. So it's actually exactly the opposite: white's advantage is so small and the draw tendencies so strong that we need to force positions where white has an advantage to be able to compare two strong engines.
Using a previous Stockfish scorer they trained an NN without any labelling effort. This is also similar to how unsupervised translation is done in some methods. They start from word->word dictionary results and iteratively train lang1->lang2 and lang2->lang1 models feeding on each other's output.
One thing I don't understand is why it wouldn't be smarter to augment the inputs with some of the missing board information, particularly the availability of castling. Even though this network is a quick-and-dirty evaluation, there seems to be room for a few additional bits in the input, and being able to castle very much changes the evaluation of some common positions.
I am not an expert (and don’t have any idea what I am talking about), but wouldn’t this be captured the same way other future positional advantages are when evaluating the next “level” of possible board positions? Ie, the fact that a piece can later give check, or castle, or anything else if you first move it here is a function of that next move’s resulting position score.
This network is just to signal to the final search algorithm how good the board looks in a given spot on tree of possible moves.
I don't know a lot about chess but I have one question: isn't chess a solved game (in the sense that, given a board state, we can always compute the right move)? Why use a neural network that can introduce mistakes a very small percentage of the time? I guess it is for performance reasons?
No, it's not close to being solved. I made a new estimate recently that there are about 8.7E+45 unique positions: https://github.com/lechmazur/ChessCounter. You don't need to analyze all of them to generate the proof, but it's still far beyond our computational and storage capacities to solve chess.
Chess is estimated to have 10^120 possible games. The observable universe is estimated to have between 10^78 and 10^82 atoms. So, unless you had some way of dramatically compressing or simplifying this search space, each atom would need to account for roughly 10^38 games even if you used every atom in the observable universe as your memory. Suffice to say, this has not yet been achieved.
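The arithmetic behind those figures is easy to check (using the rough order-of-magnitude estimates quoted above, which are assumptions, not exact counts):

```python
import math

games = 10 ** 120   # rough upper estimate of possible chess games
atoms = 10 ** 82    # upper estimate of atoms in the observable universe

per_atom = games // atoms            # games each atom must account for: 10^38
bits_per_atom = math.log2(per_atom)  # ~126 bits just to index that many games
```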
You can imagine playing chess as a tree. You start with a game state, and every possible move is an edge to a new game state. Unfortunately, the number of game states grows approximately exponentially (for every game state there are approximately 40 legal moves, so every extra move multiplies the number of game states by about 40). This neural engine is a trick so that Stockfish does not have to simulate all 40 moves at every game state. The neural net outputs the moves that are most likely to be strong moves, and Stockfish will only consider those moves. Of course, this is a very basic explanation and there are many more optimisations Stockfish uses, though the ultimate goal of almost every optimisation is to reduce the amount of simulation Stockfish has to do.
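The exponential blowup is easy to see with the ~40-moves-per-position figure from above:

```python
# With roughly 40 legal moves per position, each extra ply multiplies the
# number of positions to consider by ~40.
branching = 40
tree_sizes = [branching ** depth for depth in range(1, 7)]
# already at depth 6 there are 40^6 = 4,096,000,000 leaf positions
```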
Chess is not a solved game (though I think the stalemate rules make it finite, there are far too many positions). All computer chess programs have to try to determine if positions will be good without playing out all possible ways the game could go. Usually this is some combination of search strategies and scoring positions (the article describes how stockfish scores positions).
You're correct, in that given a board state the right move can always be exhaustively computed, because the game has no randomness or hidden information. But enough computing power doesn't exist on the planet to iterate through all the possible states.
The word you're looking for is "deterministic", rather than solved. Solved is usually used to mean the exhaustive computation has been done. Deterministic would mean it could be if enough computing power existed, but for chess it doesn't.
No, it is not solved. To solve it, you would need to investigate all possible continuations of the game, all the way until the game ends, but there are so many continuations that this is impossible.
I have to disagree with the others. Since AlphaZero, I think chess is close to being solved. Technically it is not, but practically it is.
Chess is a hard game to solve completely, and I think one reason is that there are many states in chess where there is not one superior move, but only a probably optimal move, in the sense that the game tree for winning against that move is smaller than for other moves. Then the agent/player has to guess what the opponent's strategy is given that move, and that depends on the opponent.
AlphaZero's plays are truly creative, and its results speak for themselves.
From the Wikipedia article on AlphaZero (https://en.wikipedia.org/wiki/AlphaZero#Chess):
"In AlphaZero's chess match against Stockfish 8 (2016 TCEC world champion), each program was given one minute per move. Stockfish was allocated 64 threads and a hash size of 1 GB,[1] a setting that Stockfish's Tord Romstad later criticized as suboptimal. AlphaZero was trained on chess for a total of nine hours before the match. During the match, AlphaZero ran on a single machine with four application-specific TPUs. In 100 games from the normal starting position, AlphaZero won 25 games as White, won 3 as Black, and drew the remaining 72. In a series of twelve, 100-game matches (of unspecified time or resource constraints) against Stockfish starting from the 12 most popular human openings, AlphaZero won 290, drew 886 and lost 24."
Lots of pieces about neural network design skip over the design of the representation in the input stage, which is one of the key design issues when building a custom neural network.
I love how much depth this article goes into about that representation.
Analyses like this are a great indication that refining neural networks around chess will make neural networks more exciting and ultimately make chess more boring.
There are two trainers currently, the original one, which runs on CPU: https://github.com/nodchip/Stockfish, and a pytorch one which runs on GPU: https://github.com/glinscott/nnue-pytorch.
The SF Discord is where all of the discussion/development is happening: https://discord.gg/KGfhSJd.
The neural net is only for scoring positions, not moves. Check the article on Stockfish NNUE at the chess programming wiki: https://www.chessprogramming.org/Stockfish_NNUE
Some of the Leela-Stockfish games are analyzed by agadmator on YouTube [2].
[1] https://www.chess.com/news/view/13th-computer-chess-champion...
[2] https://www.youtube.com/watch?v=YtXZjKItuC8
> There is a current tournament going on at tcec-chess.com/ which stockfish has been leading so far, but I see Leela has just caught up in the head to head.

Of course Leela also keeps evolving.
See https://en.wikipedia.org/wiki/Endgame_tablebase#Computer_che...