top | item 25519176

MuZero: Mastering Go, chess, shogi and Atari without rules

197 points | johannesha | 5 years ago | deepmind.com

78 comments

[+] ignoranceprior|5 years ago|reply
Whoa, this is extremely impressive. Quotes from the BBC article:

> "For the first time, we actually have a system which is able to build its own understanding of how the world works, and use that understanding to do this kind of sophisticated look-ahead planning that you've previously seen for games like chess.

> "[It] can start from nothing, and just through trial and error both discover the rules of the world and use those rules to achieve kind of superhuman performance."

> [...] MuZero is effectively able to squeeze out more insight from less data than had been possible before, explained Dr Silver.

https://www.bbc.com/news/technology-55403473

It seems like we're getting much closer to artificial general intelligence from two directions: reinforcement learning (such as MuZero), and sequence prediction (such as GPT-3 and iGPT). Very interesting times to be in the AI field.

[+] bko|5 years ago|reply
I've noticed that all the top-performing AI reinforcement learning algorithms I hear about know next to nothing about the initial rules. And not only do they perform as well as more supervised methods, but much better.

The one exception is self-driving. I listened to the Lex Fridman interview with the CEO of Waymo recently, and he made a case for the structured approach (e.g. separating detection from decision making and planning) and pushed back against the end-to-end approach that doesn't make any preconceived assumptions about the environment. As an example he takes red lights: they're clearly human-engineered signals, so it makes sense to have a module that can explicitly detect the signal as opposed to learning the behavior.

But that's true of other games as well, and end-to-end methods still outperform. Which makes me ask: is end-to-end learning an inevitability for self-driving as well, or is this one domain special due to complexity or other aspects?

[+] chongli|5 years ago|reply
What does it mean to “not be given the rules”? If you set a child down in front of a chess board with the pieces nearby and they are not aware of the rules, I doubt they’d ever figure out how to play even a single correct game of chess. Heck, the child may decide to put the pieces in their mouth or dress them up as make-believe characters.

Without any concept of the rules you have no way of even knowing that you’ve set up the pieces for a legal starting position, never mind executing a legal move to open the game.

This is really bizarre.

[+] thomasahle|5 years ago|reply
>> "[It] can start from nothing, and just through trial and error both discover the rules of the world

Unless they've changed a lot of things since the original paper, this is a bit exaggerated.

MuZero learns which moves are allowed in a given position/situation, but it still needs to be given a finite overall set of possible actions.

E.g. for chess, it isn't told which forty-odd moves are available at each point in its search tree, but it still knows to only consider 64x64 discrete options.
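To make that concrete, here's a minimal sketch (my own illustration, not DeepMind's code) of what a fixed from-square x to-square action space looks like:

```python
# Sketch of a fixed 64x64 chess action space: every (from, to)
# square pair gets one index, legal or not. The network only ever
# has to score this fixed set; which of them are legal in a given
# position is what gets learned (or masked) during search.
NUM_SQUARES = 64
ACTION_SPACE_SIZE = NUM_SQUARES * NUM_SQUARES  # 4096 discrete options

def encode_action(from_sq: int, to_sq: int) -> int:
    """Map a (from, to) square pair to a single action index."""
    return from_sq * NUM_SQUARES + to_sq

def decode_action(index: int) -> tuple:
    """Inverse mapping: action index back to (from, to) squares."""
    return divmod(index, NUM_SQUARES)
```

(Real MuZero's chess action space also covers promotions, etc., so the true set is a bit larger; the point is just that the set is fixed and known up front.)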

[+] syntaxing|5 years ago|reply
As impractical as the idea is, reinforcement learning is so damn fun. I highly recommend playing around with it. I originally used Stable Baselines, the famous fork of OpenAI Baselines, but had issues tuning it with Optuna. I recently stumbled across Ray from Berkeley [1], which has a newer and fancier built-in hyperparameter tuner. Even as a hardware engineer who's only a software hobbyist, I can make the computer play some Atari games. I think my next step is to try to make my own Super Mario agent.

[1] https://docs.ray.io/en/latest/index.html

[+] nightcracker|5 years ago|reply
Eh, I'd say it's fun if you have a couple thousand TPUs lying around.

If you're just messing around with 1 GPU and a desktop PC you should be happy to get Atari breakout to work.

[+] jedharris|5 years ago|reply
Same topic as a year ago, but deserves much more examination than it got then.
[+] MasterScrat|5 years ago|reply
Yeah it's basically the exact same thing as in Nov 2019 right?

They're just hyping up their Nature publication. Or did I miss something?

[+] skybrian|5 years ago|reply
The full Atari game list (Appendix I) is interesting. It's not better at every game, and scores a zero on Pitfall and Montezuma's Revenge.
[+] lacker|5 years ago|reply
You might be interested to read https://deepmind.com/blog/article/Agent57-Outperforming-the-... which has an alternative approach that does better on those games. Basically those games involve lots of “exploration” to find the winning states, so you need some algorithm that is incentivized to explore through a large state space even when it hasn’t seen any reward there.
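For a rough feel of what such an incentive looks like, here's a toy count-based exploration bonus (my own simplification for illustration; Agent57 actually uses learned intrinsic rewards over huge state spaces, not raw visit counts):

```python
from collections import Counter

# Toy count-based exploration bonus: pay the agent more for
# visiting states it has rarely seen, so it keeps pushing into
# unexplored territory even where the environment reward is zero
# (as in Pitfall or Montezuma's Revenge).
visit_counts = Counter()

def intrinsic_reward(state) -> float:
    """Bonus that decays as 1/sqrt(visits) for a given state."""
    visit_counts[state] += 1
    return 1.0 / visit_counts[state] ** 0.5

first_bonus = intrinsic_reward("room_1")   # 1.0 on the first visit
second_bonus = intrinsic_reward("room_1")  # smaller on the second
```

The agent maximizes environment reward plus this bonus, so "go somewhere new" is itself rewarding.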
[+] webmaven|5 years ago|reply
I may be missing something, but it seems that what is being described is a neural net architecture that can be trained on any of several games to get impressive results for those games.

NOT that one neural net can be trained to play all of the games.

So, while this is an interesting result that makes reusing the same architecture for specific applications easier and a bit more plug-and-play, with little to no modification of the code, what it accomplishes is reducing the effort a software engineer or researcher spends adapting the software before training even begins. It still requires pretty much the same amount of training.

What this doesn't seem to do is allow the same trained network to be applied to multiple tasks (which I think most of the AGI comments are assuming), and it certainly doesn't generalize anything among the games it is trained on.

[+] aleem|5 years ago|reply
Point this thing at the stock market and see how that game plays out.
[+] hnracer|5 years ago|reply
Many very smart people have tried and failed. State of the art remains very basic supervised models with hand engineered features. In the markets, data is permanently scarce, so these methods don't work well. In the RL problems that DeepMind is solving, data is literally unlimited, and that's the problem space that these methods have been designed for.
[+] patagurbon|5 years ago|reply
It's not so clear to me how you would train a reinforcement learning agent for the stock market. You have historic data for prices etc., but that's more of a supervised learning thing. You could set it loose on one of those realtime market simulators, but the agent's actions wouldn't have any impact on the simulation, right?
[+] Buttons840|5 years ago|reply
I'd still be most impressed to see an AI beat the top Civilization players. No mechanical advantage since it's turn based, but there are several different types of decisions to make beyond just "move a piece". AIs haven't yet conquered such environments.

It would also give the gaming industry a kick in the pants to start making better AIs.

[+] dwohnitmok|5 years ago|reply
I am fairly confident a team of DeepMind's calibre could put together an AI in fairly short order that would demolish top-level Civilization players. Despite my confidence, I still would love to see such a thing made.

DeepMind made a good effort with AlphaStar at building an AI that could compete with top-level humans in Starcraft. It wasn't superhuman; it could still be consistently beaten by the absolute best Starcraft players, especially as Zerg or Terran. However, as Protoss, AlphaStar was truly a pro-level player. I'm somewhat surprised DeepMind didn't go further and try to optimize AlphaStar to truly be superhuman. I'm not sure if that indicates a fundamental limitation of their approach or whether it was a shift in approach. This was with successively refined limitations on AI action speeds that caused AlphaStar to really rely on strategy and tactics rather than brute force speed.

Regardless, real-time strategy games feel much more difficult than turn-based strategy games to develop a good AI for. Just being able to split things into discrete turns seems like a massive simplification.

[+] iamcreasy|5 years ago|reply
Would it be able to play (and win) Among Us as an impostor, or are we still far away from that?
[+] lacker|5 years ago|reply
Far away - the Atari games tested do not include multiplayer logic or any communication with other agents.
[+] vermilingua|5 years ago|reply
DeepMind seems to be building the Wintermute to OpenAI’s Neuromancer. Where’s Turing?
[+] willowwonder45|5 years ago|reply
Curious as to what is "Turing" in this context?
[+] kovek|5 years ago|reply
Amazing! Anyone has ideas on how to: 1. Bet on AGI, 2. Encourage AGI?

I have a strong belief that it could grow and I’d like to contribute (and join the development)

[+] mellosouls|5 years ago|reply
It's not obvious this has much to do with AGI in the sense of human level sentience.
[+] xiphias2|5 years ago|reply
Buy Alphabet or OpenAI stock. Or even Tesla, which also helps us to get closer to self driving cars. Although I can't comment on whether the trades themselves will give you profit or not, as that's impossible to say at this point :)
[+] deegles|5 years ago|reply
Is there a way to learn Go from scratch using these AIs? I wonder if it would pay off in the long run to be fully trained by one.
[+] cgreerrun|5 years ago|reply
What you can do is check out the algorithm at particular stages of development. AlphaZero & friends start out not being very good at the game, then over time they learn and eventually become superhuman. You typically checkpoint the model's weights at various stages. So early on, the algo would be like a 600-Elo chess player, and eventually it gets to superhuman Elo levels. If you wanted to train against an AlphaX algo, you could gradually play against underdeveloped versions of it, loading up the weights at increasing stages of development until you can beat each one.

If you're curious how it would work, I implemented AlphaZero (but not Mu yet) using GBDTs instead of NNs here: https://github.com/cgreer/alpha-zero-boosted. Instead of saving the "weights" for a GBDT, you save the split points for the value/policy model trees, but the concept is the same.
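The checkpoint-ladder idea sketches out to something like this (hypothetical helper names, not the actual repo's API; `load_model` and `play_game` are stand-ins for whatever your implementation provides):

```python
# Hypothetical training curriculum against increasingly strong
# checkpoints: only move up to the next saved model once you beat
# the current one more than half the time.
def train_against_checkpoints(checkpoint_paths, load_model, play_game,
                              games_per_rung=100, win_threshold=0.5):
    """checkpoint_paths is ordered weakest to strongest;
    play_game(opponent) returns 1 for a win, 0 for a loss."""
    for path in checkpoint_paths:
        opponent = load_model(path)
        while True:
            wins = sum(play_game(opponent) for _ in range(games_per_rung))
            if wins / games_per_rung > win_threshold:
                break  # graduated from this rung; face the next checkpoint
    return len(checkpoint_paths)  # rungs climbed
```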

[+] kadoban|5 years ago|reply
You can play against open source reimplementations of some of the ideas behind AlphaGo family AIs. LeelaZero was one of the early ones, KataGo is probably your best bet right now. Sai is also in the mix.

All are _very_ strong. KataGo is ungodly strong, it beats pros.

Learning Go is about more than just playing against strong players, but it could help. The biggest difficulty is that the strong AIs aren't actually that good at playing handicap games, and they're also almost completely unable to explain to you why you should play one move over another.

[+] devindotcom|5 years ago|reply
I think it would just smoke you from the outset. As far as I know it doesn't have a structured intelligence it can scale back - it would make the optimal move every time, destroying you like it destroyed top-tier players.

I tried learning Go a little while back but hit a wall. Was thinking about trying this more gamified option:

https://www.wolfeystudios.com/TheConquestOfGo.html

[+] hakuseki|5 years ago|reply
Yes, you can just `pip install katrain` followed by `python -m katrain` to get started. Personally I would recommend at least reading about the rules first (unlike MuZero).

I think the strength or lack thereof of your opponent is actually much less important than the strength of the AI you use to review your games. After each game you should study the AI's advice and learn the moves it recommends.

[+] mark_l_watson|5 years ago|reply
I watched the Alpha Go vs. Lee Sedol games live. Big fan.

That said, I think Deep Mind should go all in for solving practical real world problems.

[+] syntheticmindai|5 years ago|reply
They have used MuZero to do video compression and saved 5% of bits. Source: David Silver's wired.co.uk interview.
[+] tobessebot|5 years ago|reply
I mean they pretty much solved protein folding this year...
[+] ngcc_hk|5 years ago|reply
We live only once. Could this uniqueness mean that some life decisions must be made without repeating the scenario billions of times, which is obviously impossible?

I read QM. But is this actually useful for partial information? That's another life situation, where you never have full information.

I still wonder about the intelligence.

[+] asbund|5 years ago|reply
Yeah, wake me up when these models understand the rules of physics and the rules of law.