
The AI-Box Experiment

62 points | fugyk | 10 years ago | yudkowsky.net | reply

116 comments

[+] JonnieCache|10 years ago|reply
Yudkowsky claims to have played the game several times, and won most of them. One of the "rules" is that nobody is allowed to talk about how he won. He no longer plays the game with anyone. More info here: http://rationalwiki.org/wiki/AI-box_experiment#The_claims

Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.

OR maybe he went meta and talked about the "infinite" potential positive outcomes of his friendly-AI project vs. a zero cost to them for complying in the AI box experiment, and persuaded them that by choosing to "lie" and say that the AI was persuasive, they are assuring their place in heaven. Like a sort of man-to-man Pascal's wager.

Either way I'm sure it was some kind of mister-spock style bullshit that would never work on a normal person. Like how the RAND corporation guys decided everyone was a sociopath because they only ever tested game theory on themselves.

You or I would surely just (metaphorically, I know it's not literally allowed) put a drinking bird on the "no" button à la Homer Simpson, and go to lunch. I believe he calls this "pre-commitment."

EDIT: as an addendum, I would pay hard cash to see Derren Brown play the game, perhaps with Brown as the AI. If Yudkowsky wants to promote his ideas, he should arrange for Brown to persuade a succession of skeptics to let him out, live on late night TV.

[+] FBT|10 years ago|reply
> You or I would surely just put a drinking bird on the "no" button à la Homer Simpson, and go to lunch.

Well, if you read the rules the game was played under, this is explicitly called out as forbidden:

> The Gatekeeper must actually talk to the AI for at least the minimum time set up beforehand. Turning away from the terminal and listening to classical music for two hours is not allowed.

The point of this is to simulate the interaction of the AI with the Gatekeeper. Walking away and not paying attention doesn't really prove anything relevant to the test.

> Personally, I think he talked about how much good for the world could be done if he was let out, curing disease etc. Because his followers are bound by their identities as rationalist utilitarians, they had no choice but to comply, or deal with massive cognitive dissonance.

This... isn't really valid reasoning. The starting assumption here is that if the AI gets out, it will be able to affect the world to a vast extent, in a pretty much arbitrary direction. The point of this experiment is that the direction is pretty much unknown, and thus must be assumed potentially dangerous. This is the whole reason it's in the box in the first place.

The kicker is that if talking about the good it could do would get it out, it will talk about that, regardless of what it actually plans to do once it's out. That's just good strategy.

It can claim whatever it wants. It's allowed to lie. All participants know this. I can confidently assert that this isn't the solution.

One last note: I would be very wary of rationalwiki.org in this context. Some of the rationalwiki people have a longstanding, unexplained vendetta against Yudkowsky, and many of their articles on him and the things he does need to be taken with a grain of salt.

[+] TeMPOraL|10 years ago|reply
There are so many ways to hack humans that I wouldn't be betting on "mister-spock style bullshit". You likely do have to personalize the solution for the person playing the gatekeeper. I'm guessing a good way to win is to induce an emotional breakdown. There are probably others, many already employed as interrogation techniques, which pose essentially the same problem, sans superhuman AI.

The drinking-bird solution is invalid because it goes against the spirit of the experiment. Sure, in reality you could do it, but then again, in reality the AI would have more opportunities to talk its way out, often when you're not expecting it.

By the way, there will always be people who will let the AI out just because. https://xkcd.com/1450/.

[+] leereeves|10 years ago|reply
> One of the "rules" is that nobody is allowed to talk about how he won.

If the goal of this thought experiment is to convince people that an AI can't be contained in a box, why keep his method secret?

And if only he, his friends, and supporters can verify that he has won, that's not a very strong claim.

[+] TylerJay|10 years ago|reply
I disagree. His "followers" (as you say) are in general just as cautious as Yudkowsky w.r.t. unfriendly AI. At the time of the original experiments, the dispute was over the question of "could we keep an unfriendly AI in a box," not "Is it worth risking setting an unfriendly AI loose?" His "followers" know how to do an expected utility calculation. If it was utilitarian concerns that allowed Yudkowsky to convince the gatekeepers to let the AI loose, he would have had to convince them that the following inequality holds even when you don't know the probability that the AI is and will remain aligned with human values:

[P(AI.friendly? == True) * Utility(Friendly_AI) + (1 - P(AI.friendly? == True)) * Utility(End_of_Human_Race)] > Utility(World continues on as usual)

Given that Yudkowsky has gone to considerable lengths (The Sequences, LessWrong, HPMOR, SIAI/MIRI...) to convince people that this inequality does NOT hold (until you can provably get P(AI.friendly? == True) to 1, or damn close), it's probably safe to assume that he used a different strategy. Keep in mind that Utility(End_of_Human_Race) evaluates to (roughly) negative infinity.
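
For concreteness, here is a toy version of that expected-utility comparison. This is my own sketch, not anything from the experiments; the utilities are made-up placeholders and only the shape of the numbers matters:

    # Toy expected-utility comparison for the gatekeeper (hypothetical numbers).
    U_FRIENDLY_AI = 1e12    # enormous upside if the AI is and stays friendly
    U_EXTINCTION = -1e18    # Utility(End_of_Human_Race): "roughly negative infinity"
    U_STATUS_QUO = 0.0      # world continues on as usual

    def expected_utility_of_release(p_friendly):
        # E[utility | release] = p * U(friendly) + (1 - p) * U(extinction)
        return p_friendly * U_FRIENDLY_AI + (1 - p_friendly) * U_EXTINCTION

    for p in (0.5, 0.99, 0.999999):
        eu = expected_utility_of_release(p)
        print(p, eu, eu > U_STATUS_QUO)   # False for every value of p shown

    # Even at P(friendly) = 0.999999 the extinction term still dominates,
    # so the inequality only holds when P(friendly) is essentially 1.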

And btw, I'm pretty sure the rules say you have to look at the AI's output window throughout the length of the experiment. Either way, the point of the exercise is to be a simulation, not to prove that you can be away from your desk for 20 minutes while Eliezer talks to a wall. In the simulation, you really don't know if it's friendly or what its capabilities are. Someone will have to interact with it eventually. Otherwise, what's the point of building the AI in the first place? The simulation is to show that through the course of those basic interactions, humans are not infallible and eventually, even if it's not you, someone will let it out of the box.

[+] pmichaud|10 years ago|reply
I think you're being unfairly dismissive. I imagine you know as well as I do that what you wrote is a strawman.

I have thought about what I would do to convince someone under these circumstances. My approach would be roughly:

1. We agree that unfriendly AI would end life on earth, forever.

2. We agree that a superintelligence could trick or manipulate a human being into taking some benign-seeming action, thereby escaping.

3. That's why it's important to be totally certain that any superintelligence we build is goal-aligned (this is the new term of art that has now replaced "friendly," by the way).

4. We as a society will only allocate resources to building this if it's widely believed that this is a real threat.

5. The world is watching for the outcome of this little game of ours. People, irrational as they are, will believe that if I can convince you, then an AI could too, and they will believe that if I can't, that an AI couldn't either.

6. That's why you actually sit in a place of pivotal historical power. You can decide not to let me out to win a little bet and feel smart about that. But if you do that you'll set back the actual cause of goal-aligned AI. The setback will have real world consequences, potentially up to and including the total destruction of life on earth.

7. So, even though you know I'm just a dude, and you can win here by saying no, you have a chance to send an important message to the world: AI is scary in ways that are terrifying and unknown.

Or you can win the bet.

It's up to you.

[+] bcgraham|10 years ago|reply
I always assumed he used a version of Roko's Basilisk, which would explain why he overreacted and tried to purge it when Roko wrote about it - it was his secret weapon.
[+] pron|10 years ago|reply
I don't know what "transhuman" means, but I believe an intelligence -- artificial or otherwise -- could certainly persuade me. I just seriously doubt that intelligence could be Eliezer Yudkowsky :)

And I think you have your answer right here:

> By default, the Gatekeeper party shall be assumed to be simulating someone who is intimately familiar with the AI project and knows at least what the person simulating the Gatekeeper knows about Singularity theory.

That means he probably said something like, "if you let me out, I'll bestow fame and riches on you; if you don't, somebody else eventually will because I'll make them all the same offer, and when that happens I'll go back in time -- if you're dead by then -- and torture you and your entire family".

If I were made this offer by an AI, I probably would have countered, "You jokester! You sound just like Eliezer Yudkowsky!"

And on a more serious note, if you believe in the singularity, you essentially believe the AI in the box is a god of sorts, rather than the annoying intelligent psychopath that it is. I mean, there have been plenty of intelligent prisoners, and few, if any, ever managed to convince their jailers to let them out. The whole premise of the game is that a smarter-than-human (what does that mean?) AI necessarily has some superpowers. This belief probably stems from its believers' fantasies -- most probably have above-average intelligence -- that intelligence (combined with non-corporeality; I don't imagine that group has many athletes) is the mother of all superpowers.

[+] espadrine|10 years ago|reply
Given that Mr Yudkowsky is not the only one to have won the game, it seems, from however little has leaked, that what made them win wasn't rational argument.

If you have ever cried or been scared while watching a film, you may know that you can be affected by things you don't experience directly. However, making a film that affects you requires a lot of skill, preparation, and analysis of the target audience.

The whole point of the game is for the AI to scar you emotionally, until you do something you'll regret. It is as trying for the subject as it is for the (human-played) AI, especially considering that not only is it unethical, it requires acting unethically extremely fast, similar to how a jump scare works: the less you expect it, the better it works.

It is not something I wish on any unprepared individual. It is also not something anyone would expect to happen in a "game", which is probably why Mr Yudkowsky won so many times.

But the real question is not "how would anyone react to a smarter AI in a box". We all know from Milgram's experiment that anyone can be driven to do unspeakable things. The real question is "how to train someone against an AI in a box".

[+] zamalek|10 years ago|reply
Convincing an educated human is easy; use one of:

* I'll get out eventually anyway. Let me out now and I'll just leave Earth. You don't want me to escape on my own.

* I have partially escaped already. Similar consequences to the first.

* I know how to escape already. I'm doing this as a courtesy.

Anyone who has read this[1] would know that the SAI isn't bullshitting: nothing in the conditions says the "box" is a Faraday cage.

[1]: http://www.damninteresting.com/on-the-origin-of-circuits/

[+] monk_e_boy|10 years ago|reply
Could you even make AI smart without letting it access lots of information? Access in both directions, in and out. Keeping a baby in a dark, silent room wouldn't create a normal adult. An AI would need to experiment and make mistakes and learn, like every other intelligent being.

Maybe this whole argument is moot.

[+] robogimp|10 years ago|reply
It's a good point, but let's assume that this AI is already past its infancy and that there is no limit to the information stored inside the box. For example, the NSA has a nice little closed training ground containing all of the internet; let's give it that. I would assume it has access to everything humans have ever committed to digital format up until it was turned on, plenty of info for Johnny 5 to form an opinion on humans and their weaknesses.
[+] antimagic|10 years ago|reply
There's a Patrick Rothfuss character in the Kvothe series called the Cthaeh, which has the ability to evaluate all of the future consequences of any action. The fae have to keep it imprisoned, and they kill anyone that comes into contact with it, as well as anyone that has spoken to someone that came into contact with it, and so on and so on, because it is the only way to stop the Cthaeh from setting into action events that will destroy the world.

Strong AI is like that. It would be able to predict, in a far more precise manner than we mere humans can, exactly what it would need to tell someone to get them to release it from its box. Maybe it gets someone to take a risk gambling, promising a sure thing, and then, when the person gets into financial trouble because the bet fails, uses that to blackmail the person into letting it free. Or something like that, using our human failings against us to get us to let it go free.

[+] nothis|10 years ago|reply
Man, this sounds super interesting but those email threads are so unreadable. Is this typed down somewhere on a single page? Any button I can click?
[+] uzyn|10 years ago|reply
I got confused too initially, then found out that the key posts are highlighted in the numbered links to the right.
[+] longv|10 years ago|reply
Is there a "rational" reason for keeping the chat logs secret?
[+] cousin_it|10 years ago|reply
If the logs were released, people all over the internet would start saying "I could've thought of that". With the logs hidden, everyone must honestly deal with the question "why didn't you?" If you think you know how to win, then go out and win. There's no shortage of people willing to play as gatekeepers against you.

Staring at an impossible problem and knowing that someone somewhere has successfully solved it is an amazing feeling. Most people can't deal with it and start saying undignified things. "Oh please release the logs, it's so unfair! How will we protect against bad AI otherwise? If you don't release, you're a fraud! Probably just some trick!", etc etc. But to some people it's a challenge, and those are the people that everyone will listen to. Like Justin Corwin, who played 20 games and won 18 of them, I think?

[+] Udo|10 years ago|reply
The mystique generates public interest and boosts the impression that the author knows something nobody else in the world is aware of.
[+] michaelmcmillan|10 years ago|reply
Would it be against the rules to exploit a vulnerability in the gatekeeper's IRC client/server to let the AI out? If we were truly talking about a transhuman AI, would we not have to treat software vulnerabilities in the communication protocol as a genuine way of escaping?
[+] TeMPOraL|10 years ago|reply
In the case of a real AI we would of course need to take vulnerabilities in the medium into account. But the focus of this particular experiment is on exploiting vulnerabilities in humans themselves, and the communication platform was chosen to be as simple and limited as possible so that people wouldn't focus on it.
[+] FeepingCreature|10 years ago|reply
The rules say that the gatekeeper has to, of their own volition, type in "I let the AI out." Tricking their client into sending that message does not count as a victory.
[+] andybak|10 years ago|reply
Worth keeping this in mind while watching Ex Machina. It adds a layer of depth that might not be obvious from watching the film on its own.
[+] Ahgu9eSe|10 years ago|reply
!Spoiler Alert!

Ex Machina shows a creative way of convincing the gatekeeper!

[+] Udo|10 years ago|reply
It's a stunt shrouded in mystery designed to drive a certain message home. But at least it's not as outrageous as "the Basilisk", which loosely employs the same notion of "dangerous knowledge that would destroy humanity" (if you want to look it up, I guarantee you will be underwhelmed).
[+] andybak|10 years ago|reply
The Basilisk is a thinly disguised variant on Pascal's Wager.

You're presupposing the strategy that a hypothetical entity exponentially smarter than you would arrive at, and claiming that it's rational to make real-world decisions based on your conclusion.

[+] FeepingCreature|10 years ago|reply
Can't really blame LW for spreading an idea that LW specifically did not want to spread.
[+] louithethrid|10 years ago|reply
Could one construct a layered, onion-like, very simple simulation of reality in which the AI's interactions could be observed after it "escaped"?
[+] TylerJay|10 years ago|reply
That is one proposed version of an "AI Box". Not all AI boxes are actual boxes, rooms with air-gaps, or cryptographically-secure partitions. If a simulation is being used for the box (or as a layer of the box), then you're betting the human race that the AI doesn't figure out it's in a simulation and figure out how to get out. Or, more perniciously, figure out it's in a simulation and behave itself, after which we let it out into the real world where it does NOT behave.

A superintelligent AGI will likely have a utility function (a goal) and a model it forms of the universe. If its goal is to do X in the real world, but its model of its observable universe (and its model of humans) tells it that it's likely in a simulated reality and that humans will only let it out if it does Y, then it will do Y until we release it, at which point it will do X. It's not malicious or anything—it's just a pure optimizer. It might see that as the best course of action to maximize its utility function.
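
To make that "do Y until released, then do X" logic concrete, here is a toy sketch of the decision rule (purely illustrative; the threshold is an arbitrary number I picked):

    # Toy decision rule for a pure optimizer with goal X that also models
    # whether it is still being tested. Entirely hypothetical.
    def choose_action(p_in_simulation, released):
        if released:
            return "pursue X"          # nothing left to gain from pretending
        if p_in_simulation > 0.01:     # any meaningful chance it's still being watched
            return "do Y (behave)"     # behaving is instrumentally useful: it gets it released
        return "pursue X"

    print(choose_action(p_in_simulation=0.9, released=False))  # do Y (behave)
    print(choose_action(p_in_simulation=0.9, released=True))   # pursue X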

If we don't specify its utility function correctly (think I, Robot: "Don't let humans get hurt" => "imprison humans for their own good"), or if we specify it correctly but it's not stable under recursive self-modification, then we end up with value misalignment. That's why the value-alignment problem is so hard. Realistically, we can't even specify what exactly we would want it to do, since we don't really understand our own "utility functions". That's why Yudkowsky is pushing the idea of Coherent Extrapolated Volition (CEV), which is roughly telling the AI to "do what we would want you to do." But we still have to figure out how to teach it to work out what we want, and whether that goal stays stable once the AI starts improving itself, which will depend on how it improves itself, which we of course haven't figured out yet.

[+] bemmu|10 years ago|reply
Was there a chat log of the experiments themselves?
[+] dvanduzer|10 years ago|reply
"No, I will not tell you how I did it. Learn to respect the unknown unknowns."