kotach's comments

kotach | 10 years ago | on: DoorDash Raises $127M in ‘Down’ Round

It would be innovative if they had some super-fast online optimization of delivery routes. Optimized routes would let them chain several restaurant pickups with a chain of deliveries.

Since the optimization is global, they would have an advantage over each restaurant optimizing its own deliveries locally. They could therefore charge less for deliveries while paying their workers the same.
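As a toy illustration of the global-vs-local gap (the locations, the out-and-back local model, and the greedy chaining heuristic are all made up for the sketch):

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Toy orders: (restaurant_location, customer_location), restaurants clustered
# near the origin, customers clustered a few blocks away.
orders = [((0, 0), (1, 5)), ((0, 1), (2, 5)), ((1, 0), (1, 6))]

# Local: each restaurant sends its own courier out and back.
local_cost = sum(2 * dist(r, c) for r, c in orders)

# Global: one courier chains all pickups, then all deliveries,
# visiting the nearest unvisited stop each time (greedy heuristic).
def greedy_tour(start, stops):
    pos, total, remaining = start, 0.0, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda s: dist(pos, s))
        total += dist(pos, nxt)
        remaining.remove(nxt)
        pos = nxt
    return total, pos

pickup_cost, pos = greedy_tour((0, 0), [r for r, _ in orders])
drop_cost, _ = greedy_tour(pos, [c for _, c in orders])
global_cost = pickup_cost + drop_cost

print(local_cost, global_cost)
```

Even this crude greedy chaining cuts total travel to a fraction of the per-restaurant total; a real system would solve a vehicle-routing problem with time windows instead.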

Unfortunately, a Google search of the founders suggests they haven't got the slightest clue how to do that. What the investors can hope is that this money will let them find people who can make the extra effort to raise the business above all of the similar ones.

kotach | 10 years ago | on: FlappyBird hack using Deep Q-Learning

Check out DAgger [2], SEARN [3] and LOLS [1] (LOLS is available via Vowpal Wabbit's search capabilities). There's a lot of interesting material on mimicking optimal policies, local optimality, joint learning and similar topics :D

The whole point of playing is making your decisions jointly, dependent on the previous ones. If you train your model that way, it'll make its decisions so as to minimize future regret.

Local optimality is a very nice property. It means that if you play out a game, no single change of any of the previous moves could have led you to a better result. Of course, local optimality is hard, but for some problems it's pretty easy to achieve if your reference policy is good and your features are adequate (which they will be if you use neural networks).

Of course, Flappy Bird is a pretty local game and all of this might be overkill :D

AlphaGo wasn't trained jointly over Go games, so it's lacking in that regard. But the power of neural networks compensates. Who can imagine what AlphaGo would be like if they trained their policy networks jointly? :D
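The core DAgger loop is tiny. A minimal sketch on a made-up corridor task (the "learner" is just a lookup table, standing in for a real classifier): roll out the current policy, have the expert label the states actually visited, aggregate, retrain.

```python
# DAgger-style loop on a toy corridor: states are positions 0..10,
# the expert always steps toward the goal at 10.
GOAL = 10
expert = lambda s: 1 if s < GOAL else 0

def fit(dataset):
    # Trivial "learner": memorize expert labels, act badly (-1) on states
    # never seen -- this is what makes off-trajectory states hurt.
    table = dict(dataset)
    return lambda s: table.get(s, -1)

data = [(s, expert(s)) for s in (0, 1, 2)]    # initial demos near the start
policy = fit(data)
for _ in range(10):                           # DAgger iterations
    s, visited = 0, []
    for _ in range(15):                       # roll out the current policy
        visited.append(s)
        s = max(0, min(GOAL, s + policy(s)))
    data += [(v, expert(v)) for v in visited] # expert labels visited states
    policy = fit(data)

# After aggregation the policy acts correctly along its own trajectory.
s = 0
for _ in range(15):
    s += policy(s)
print(s)
```

Each iteration drags the dataset one state further along the learner's own trajectory, which is exactly the distribution-mismatch fix DAgger is about.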

A nice introduction to LSTMs: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[1]: http://arxiv.org/pdf/1502.02206.pdf

[2]: http://arxiv.org/pdf/1011.0686.pdf

[3]: http://searn.hal3.name

kotach | 10 years ago | on: Another Way of Looking at Lee Sedol vs. AlphaGo

If it works for chess, it'll work for Go. Chess has lots of games you can learn from; Komodo beats or draws against any grandmaster.

The problem with Go was the lack of an evaluation function to guide the policy, so one had to be learned simultaneously.

You could let AlphaGo play a billion games and then learn a policy that requires little to no search but has almost perfect evaluation (local optimality: minimizing future regret).

Komodo exhibits the same positional play, and it doesn't require that much search depth, while AlphaGo currently rolls out a whole game for every move.

kotach | 10 years ago | on: FlappyBird hack using Deep Q-Learning

An LSTM would converge even faster.

A K-level breadth-first search mimicking the optimal policy, plus a simple learning-to-search algorithm with a cost-sensitive binary linear classifier, would work well too.

After training, deciding what to do next would be a constant-time evaluation.
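A minimal sketch of that idea, with made-up flappy-style dynamics (the bird flaps +1 or falls -1; the "optimal policy" is a K-step BFS toward the pipe gap, and the classifier is a one-feature threshold fit cost-sensitively):

```python
from itertools import product

# Made-up dynamics: each step the bird flaps (+1) or falls (-1).
def bfs_cost(h, a, gap, k):
    # Cost of action `a` now, assuming the best K-1 follow-up moves:
    # the minimal distance to the gap center reachable after k moves total.
    frontier = {h + (1 if a else -1)}
    for _ in range(k - 1):
        frontier = {x + d for x in frontier for d in (1, -1)}
    return min(abs(x - gap) for x in frontier)

# Label every state with the cost of BOTH actions (the cost-sensitive part).
K = 3
data = [(h - gap, (bfs_cost(h, 0, gap, K), bfs_cost(h, 1, gap, K)))
        for h, gap in product(range(-5, 6), repeat=2)]

# Simplest cost-sensitive binary linear classifier on one feature:
# flap iff (height - gap) <= t, with t chosen to minimize total incurred cost.
def total_cost(t):
    return sum(c1 if x <= t else c0 for x, (c0, c1) in data)

best_t = min(range(-10, 11), key=total_cost)
print(best_t, total_cost(best_t))
```

The BFS is only needed at training time to produce the cost labels; at play time the learned rule is a single comparison, i.e. constant-time.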

kotach | 10 years ago | on: AIs Have Mastered Chess. Will Go Be Next? (2014)

A chess grandmaster can easily be beaten by a smartphone app (Komodo). Is the smartphone using more energy than a human brain?

The problem with Go is that there's little data and the game is more complex. But given the small sizes of the DeepMind NNs, if they generate billions of meaningful games they could probably compress their model and require less searching, and thus less energy.

kotach | 10 years ago | on: The Neuroscientist Who Lost Her Mind

Studies on twins, especially those that observe separated ones, pretty much show how deterministically bodies behave: diseases, relationships, names, jobs, wishes.

kotach | 10 years ago | on: Deep Q-Learning: Space Invaders

Now, train the network jointly over the game sequence. Or even better: whenever there's a chance to act, roll out each action and learn jointly on the rest of the gameplay.

Reinforcement learning is very hard. Especially when you create meaningful games and then don't use the fact that a whole game is one long chain of events, and instead force learning on windowed sequences.

The neural network has enough parameters to remember many of these windows and will clearly perform well, but training lasts too long given that no structured information is used.

kotach | 10 years ago | on: Yann LeCun's comment on AlphaGo and true AI

Given Langford's locality-vs-globality argument, this also makes the fourth-game mistake and AlphaGo's overconfidence quite obvious. The growth rate of the compounding error for a local decision maker is going to cause these kinds of mistakes more often than not.

kotach | 10 years ago | on: Lee Sedol Beats AlphaGo in Game 4

Yes, it is true. In the case of Super Mario, he does the learning by simulating a level-K BFS from positions that resulted in errors (unseen states), and thus minimizes the regret for the next K moves.

Although, if you check out his papers, you'll see the problems I've talked about: even when you have more than enough data and know you should be able to generalize well, you can still get subpar performance if you don't optimize jointly. The AlphaGo model isn't optimized jointly, but its power mostly lies in the extreme representational ability of deep neural networks.

kotach | 10 years ago | on: Lee Sedol Beats AlphaGo in Game 4

That's not really a problem. Given a large enough dataset, you want to generalize from it. There are always states not present in the dataset; the whole point is to extract features from your dataset that allow generalization to unseen states. Seeing all possible Go games isn't possible.

The compounding-errors problem that stems from decision bias isn't there because you haven't seen the trajectory; it's there because the model isn't trained jointly.

We're talking about the same thing. You just aren't familiar with the difference between jointly learned discriminative models and local decision classifiers (maximum entropy Markov models vs. conditional random fields, or recurrent neural networks trained on a joint loss over the sequence vs. recurrent neural networks trained to minimize the loss of each local decision).

In the case of Go, one would minimize either the loss over the whole game, or the loss over the local decisions made during the game. The latter results in decision bias, which leads to compounding errors. Joint learning comes with a guarantee that the compounding error has a globally sound bound. (The proofs are information-theoretic and put mathematical guarantees on discriminative models applied to sequence labelling, or sequence decision making.)
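A toy simulation of the compounding-errors argument (the dynamics, error rate, and recovery probability are all made up): a policy that only knows the expert's states never recovers once it drifts off-trajectory, while one that was also trained on off-trajectory states recovers quickly.

```python
import random

random.seed(0)

T, EPS, TRIALS = 50, 0.05, 20000

def run(recovers):
    """One episode of a toy T-step task. The policy errs with probability
    EPS while on the expert's path; `recovers` says whether it also learned
    how to act in off-trajectory states."""
    off = False
    bad_steps = 0
    for _ in range(T):
        if not off:
            if random.random() < EPS:            # local mistake: leave the path
                off = True
        else:
            bad_steps += 1
            if recovers and random.random() < 0.5:  # learned how to get back
                off = False
    return bad_steps

local = sum(run(False) for _ in range(TRIALS)) / TRIALS
joint = sum(run(True) for _ in range(TRIALS)) / TRIALS
print(local, joint)
```

The no-recovery average grows roughly quadratically in the horizon T (one early slip poisons every remaining step), while with recovery it stays roughly linear; that gap is the bound the structured-learning proofs formalize.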

edit:

Check out the lecture below; around the 16-minute mark it has a Super Mario example and describes exactly the problem you mentioned. The presenter is one of the leading figures in joint learning.

https://www.cs.umd.edu/media/2015/12/video/17235-daume-stuff...

It is a completely supervised learning problem. But look at reinforcement learning as a process that has to include a step of generating meaningful games from which a model can learn. After you have generated a bazillion meaningful games, you can discard the reinforcement and just learn. You then try to get as close to the global "optimal" policy as you can, instead of trying to go from an idiot player to a master.

Of course, the data will have flaws if your intermediate model plays with a decision bias. So, instead of training the intermediate model to have a bias, train it without one :D

kotach | 10 years ago | on: Lee Sedol Beats AlphaGo in Game 4

They train using trajectories, but they train the model to guess the trajectory locally, not globally. Discounted long-term rewards are just a hack; they aren't joint learning.

The concept of label bias, or decision bias, is a joint/structured learning concept. It is a machine learning concept; it has nothing to do with the application. There are training modes with a mathematical guarantee that the local decisions will minimize future regret.

Joint learning is done not over the whole permutation but over the Markov chain of decisions, which is sometimes a good enough assumption. For example, AlphaGo's value network is precisely a Markov chain: given a state, tell me which next state has the highest probability of victory. The search then tries to find the sequence of moves that maximizes that probability, and then makes the best local decision (one move). It works like limited-depth minimax or beam search.

They do rollouts (play out the whole game) to train the value network, but the question is whether they train it to minimize the local loss of the decisions made, or to minimize the future regret of each local decision. As I've stated before, minimizing the joint loss over the sequence versus minimizing the local loss of each decision is exactly what determines whether there will be bias or not.

The whole point of reinforcement learning is to create a dataset huge enough to overcome the trajectories-not-seen problem. Training the models to play Go is a whole different kind of problem.

Now that they have hundreds of millions of meaningful games, they can skip the reinforcement learning and just learn from the games.

An illustration of the "label bias" problem is available in one of the sources I referenced; terms like compounding errors and unseen states are there. "Label bias" is present only in discriminative models, not generative ones, which means that AlphaGo, being a discriminative model, can suffer from it if it wasn't trained to avoid it.

kotach | 10 years ago | on: Lee Sedol Beats AlphaGo in Game 4

Yes, "label bias" is more of a structured learning / joint learning term from natural language processing. But reinforcement learning suffers from it only if you train to minimize the local loss of each decision (label), i.e. if you build a classifier that minimizes its loss on local decisions instead of on the sequence of decisions.

Their value network isn't trained jointly and can compound errors. There are deep-neural-network approaches without joint training that work pretty well; the reason is that the networks have pretty good memory/representation and thereby avoid much of the problem. But for huge games like Go, it is quite possible that more games need to be played for these non-structured models to work well.

kotach | 10 years ago | on: Lee Sedol Beats AlphaGo in Game 4

What you are talking about here is called "label bias". [2] It is present only if training is done badly.

Take a game of Go, or a Super Mario level. You don't want to make your decisions by just checking the local features, because compounding errors can land you in a state you never saw, and then none of the future decisions will be good.

One can avoid these situations by training jointly over the whole game.

For example, maximum entropy models can work for decision-making problems, but their training leaves them with "label bias" because it minimizes the loss of local decisions instead of minimizing the future regret of the current local decision.

The solution to these label-bias problems is Conditional Random Fields, or Hidden Markov Models. You could accomplish the same with recurrent neural networks if you trained them properly. For example, there is no search component in RNNs (no Monte Carlo tree search, and no dynamic programming [Viterbi] as in CRFs or HMMs), yet they are completely adequate for decision-based problems (sequence labeling etc.). Why? Because the search results are present in the data; there's no need to search if you can just learn to search from the data.
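A minimal Viterbi sketch (the scores are made up, not from any real model) showing how the joint sequence argmax can disagree with greedy local decisions:

```python
# Toy 2-state sequence model decoded two ways: greedily per step, and
# jointly over the whole sequence with Viterbi dynamic programming.
states = ["A", "B"]
emit = [                      # emit[t][s]: local score of state s at step t
    {"A": 2.0, "B": 1.0},
    {"A": 1.0, "B": 1.1},
    {"A": 3.0, "B": 0.0},
]
trans = {("A", "A"): 1.0, ("A", "B"): 0.0,   # staying in a state is rewarded
         ("B", "A"): 0.0, ("B", "B"): 1.0}

# Greedy/local decoding: best label at each step independently.
local_path = [max(states, key=lambda s: e[s]) for e in emit]

# Viterbi: best-scoring whole sequence.
score = {s: emit[0][s] for s in states}
back = []
for e in emit[1:]:
    new_score, ptr = {}, {}
    for s in states:
        prev = max(states, key=lambda p: score[p] + trans[(p, s)])
        new_score[s] = score[prev] + trans[(prev, s)] + e[s]
        ptr[s] = prev
    score, back = new_score, back + [ptr]
last = max(states, key=lambda s: score[s])
joint_path = [last]
for ptr in reversed(back):
    joint_path.append(ptr[joint_path[-1]])
joint_path.reverse()

print(local_path, joint_path)
```

Greedy takes B at step 2 for its slightly higher local score, but the joint decode keeps A throughout because the transition scores make the whole A-A-A sequence better: exactly the local-loss vs. sequence-loss distinction above.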

If DeepMind open-sourced the hundreds of millions of games that AlphaGo played, it would be quite possible to train a model that wouldn't need Monte Carlo search and would work quite well, because you would train the model to make local decisions that minimize future regret, not its local loss. [1]

The only reason reinforcement learning is used is that there are too few human games of Go available for the model to generalize well. Reinforcement learning can be used in the joint-learning setting because you play out the whole game before you do the learning, which means you can try to learn a classifier that minimizes regret by making a proper local decision. Although, as far as I know and can see from the paper, they didn't train AlphaGo jointly over the game sequence.

But! Now they have a lot of data and they can repeat the process.

[1]: http://arxiv.org/abs/1502.02206

[2]: http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162...

kotach | 10 years ago | on: AlphaGo beats Lee Sedol 3-0 [video]

I believe the whole point of pretraining on reference policies (which is what a collection of "optimally" played human games is) is simply avoiding bad local optima.

It can be the case that training on just a learned policy gets you stuck in a local optimum of worse quality than the one reached with pretraining.

If they stored all of the AI-played games, their reference policy (the data) would be of extreme value. You could train a recurrent neural network, without any reinforcement learning, that you could probably run on a smartphone and beat all human players. You wouldn't need Monte Carlo search either.

There are algorithms [1] with mathematical guarantees of achieving local optimality from reference policies that might not be optimal, and that can even work better than the reference policy (experimentally), assuming the reference policy isn't optimal. An RNN trained with LOLS would make jointly local decisions over the whole game, and each decision would be guaranteed to minimize future regret. Local optimality here doesn't mean finding a locally optimal model that approximates the strong reference policy; it means finding the locally optimal decisions (which piece to put where) without the need for search.

The problem is that these algorithms require a reasonably good reference policy. Given the small number of human-played Go games, reinforcement learning was used instead: it allowed them to construct a huge number of meaningful games, from which their system learned, which in turn allowed them to construct even more meaningful games, etc.

But now that they have games from a pretty good reference policy (AlphaGo is definitely playing at a superhuman level), they can train a model on that reference policy and wouldn't need the search part of the algorithm at all.

The model would try to approximate the reference policy and would definitely be worse than AlphaGo's real search-based policy, but not significantly worse (a mathematical guarantee). The model starts from a good player and tries to approximate that good player; reinforcement learning, on the other hand, starts from an idiot player and tries to become a good player, which is much, much harder.
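The train-from-a-reference-policy idea, in miniature (everything here is a made-up stand-in: the "reference policy" is a hand-coded rule on a toy number-line game, and the "model" is a one-threshold classifier rather than an RNN):

```python
import random

random.seed(1)

TARGET = 7

def reference_policy(s):
    # Stand-in for a strong reference policy (e.g. AlphaGo's own games):
    # step toward the target.
    return 1 if s < TARGET else -1

# Collect demonstrations from the reference policy on random states.
train_states = random.sample(range(-50, 50), 30)
demos = [(s, reference_policy(s)) for s in train_states]

# "Train" the simplest possible model: fit the decision threshold that
# agrees with the most demonstrations.
best_t = max(range(-50, 51),
             key=lambda t: sum((1 if s < t else -1) == a for s, a in demos))

# The cloned policy now plays without any search.
clone = lambda s: 1 if s < best_t else -1
agreement = sum(clone(s) == reference_policy(s) for s in range(-50, 50)) / 100
print(best_t, agreement)
```

The clone matches the reference on states near the demonstrations and only degrades in the gap where it saw no data, which is the flavor of the "not significantly worse than the reference" guarantee (the real LOLS/DAgger bounds are about exactly that gap).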

[1]: http://www.umiacs.umd.edu/~hal/docs/daume15lols.pdf

kotach | 10 years ago | on: AlphaGo shows its true strength in 3rd victory against Lee Sedol

AlphaGo is approximating global optimality by finding local optimality. Local optimality is already computationally very hard, but it is exactly what AlphaGo is doing.

The rollouts they are doing, evaluating every probable move, are a search process trying to find local optimality. You have the current state of the board and you're trying to find the decision that minimizes your future regret; that is by definition local optimality.

Global optimality would be finding a sequence of moves that wins you the game.
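Rollout-based local decision making fits in a few lines on a toy game (Nim-like, not Go; the playout policy is uniformly random, far cruder than AlphaGo's rollout networks): estimate each legal move's win probability by finishing the game randomly many times, then pick the best-scoring move.

```python
import random

random.seed(0)

# Toy game: a pile of stones, each player takes 1 or 2 per turn,
# and whoever takes the last stone wins.
def rollout(n, player):
    # Finish the game with uniformly random moves; return the winner (0 or 1).
    while True:
        n -= random.randint(1, min(2, n))
        if n == 0:
            return player
        player = 1 - player

def choose(n, player, sims=2000):
    # Local decision via rollouts: estimate each move's win probability
    # and pick the move with the highest estimate (minimal future regret).
    best_take, best_p = None, -1.0
    for take in range(1, min(2, n) + 1):
        if n - take == 0:
            return take                      # taking the last stone wins outright
        wins = sum(rollout(n - take, 1 - player) == player for _ in range(sims))
        if wins / sims > best_p:
            best_take, best_p = take, wins / sims
    return best_take

print(choose(4, 0), choose(5, 0))
```

With 4 stones the rollouts favor taking 1 (leaving the losing position 3), and with 5 stones taking 2: the locally optimal move at each state, found by search rather than by a learned evaluation.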
