gtoubassi|3 years ago
As much as I want to be enthusiastic about this, it’s not entirely clear to me that it is surprising that such a feat can be achieved. For example, it may be possible to train a 2-layer MLP to predict the state of a tile directly from the inputs. It may be that the most influential activations are closer to the inputs than to the outputs, which would imply that Othello-GPT itself doesn’t have a world model; it would only show that you can predict board colors from the transcript. Again, I’m not a practitioner, but once you are indirecting internal state through a 2-layer MLP, it gets less obvious to me that the world model is really there: an expressive enough probe could be computing the board state itself. I think it would be more impressive if they took only “later” activations (further from the input) and used a linear classifier, to ensure the world model isn’t in the tile predictor instead of Othello-GPT. I would appreciate it if somebody could illuminate or set my admittedly naive intuitions straight!
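To make the distinction concrete, here’s roughly what I mean as a PyTorch sketch (the hidden size, probe width, and class count are just assumptions on my part):

```python
import torch
import torch.nn as nn

D_MODEL = 512    # Othello-GPT hidden size (assumed)
N_CLASSES = 3    # each square: empty / black / white

# A linear probe can only read out what is already linearly
# encoded in the activations.
linear_probe = nn.Linear(D_MODEL, N_CLASSES)

# A 2-layer MLP probe can do nontrivial computation of its own,
# so some of the "board state" could live in the probe.
mlp_probe = nn.Sequential(
    nn.Linear(D_MODEL, 256),
    nn.ReLU(),
    nn.Linear(256, N_CLASSES),
)

def train_probe(probe, activations, labels, epochs=10, lr=1e-3):
    """Fit one square's probe on frozen GPT activations."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        opt.step()
    return probe
```

If only the MLP version decodes the board well, some of the decoding work may be happening inside the probe rather than inside Othello-GPT.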
That said, I am reminded of another OpenAI paper [1] from way back in 2017 that blew my mind: unsupervised “predict the next character” training on 82 million Amazon reviews, then using the activations to train a linear classifier to predict sentiment. It turns out a single neuron’s activation is responsible for the bulk of the sentiment signal!
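If I remember the setup right, the classifier there was L1-regularized logistic regression on the LSTM’s hidden state, which is also how a single neuron pops out: the L1 penalty zeroes out almost every other coefficient. A minimal sketch (file names and the C value are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hidden: (n_reviews, n_units) final LSTM states from the unsupervised LM
# labels: (n_reviews,) binary sentiment from a small labelled set
hidden = np.load("review_states.npy")   # hypothetical file
labels = np.load("review_labels.npy")   # hypothetical file

# The L1 penalty drives most coefficients to zero, which is how a
# single dominant "sentiment neuron" becomes visible.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(hidden, labels)

weights = clf.coef_.ravel()
print("most influential units:", np.argsort(-np.abs(weights))[:5])
```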
codeulike|3 years ago
It turns out that the error rates of these probes are reduced from 26.2% on a randomly-initialized Othello-GPT to only 1.7% on a trained Othello-GPT. This suggests that there exists a world model in the internal representation of a trained Othello-GPT.
I take that to mean that the 64 trained probes are then shown other Othello-GPT internals and can tell us what the state of their particular 'square' is 98.3% of the time. (We know what the board would look like, but the probes don't.)
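Put differently, each probe is just a classifier scored on held-out positions, and the 1.7% is an error rate along these lines (a sketch, assuming one probe per square over the 3 classes empty/black/white):

```python
import torch

@torch.no_grad()
def probe_error_rate(probe, activations, labels):
    """Fraction of held-out positions where this square's probe
    reads the wrong colour from the GPT's activations."""
    preds = probe(activations).argmax(dim=-1)  # (N,) predicted classes
    return (preds != labels).float().mean().item()
```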
As you say "Again, not a practitioner but once you are indirecting internal state through a 2 layer MLP it gets less obvious to me that the world model is really there."
But then they go back and actually mess around with Othello-GPT's internal state (using the probes to work out how), changing black counters to white and so on, and this directly affects the next move Othello-GPT makes. They even do this for impossible board states (e.g. two unlinked sets of discs) and Othello-GPT still comes up with correct next moves.
So surely this proves that the probes were actually pointing to an internal model? Because when you mess with the internal state in a way that should affect the next move, it changes Othello-GPT's behaviour in the expected way.
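As I understand it, the 'messing around' is done by gradient descent on the activation vector itself: keep nudging it until the probe reads the flipped colour, then resume the forward pass from there. A rough sketch (step count and learning rate are made up):

```python
import torch

def intervene(activation, probe, target_class, steps=30, lr=1.0):
    """Nudge one layer's activation vector by gradient descent until the
    probe reads `target_class` (e.g. flip a square from black to white),
    then let the rest of the forward pass run on the edited activation."""
    act = activation.clone().detach().requires_grad_(True)
    loss_fn = torch.nn.CrossEntropyLoss()
    target = torch.tensor([target_class])
    for _ in range(steps):
        loss = loss_fn(probe(act).unsqueeze(0), target)
        loss.backward()
        with torch.no_grad():
            act -= lr * act.grad   # move the activation toward the new colour
        act.grad = None
    return act.detach()
```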
moomoo11|3 years ago
What's an MLP?
pedrosorio|3 years ago
A “classic” neural network (a multi-layer perceptron), where every node in layer i is connected to every node in layer i+1
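In PyTorch terms, a 2-layer MLP like the one mentioned above is just this (the sizes are arbitrary):

```python
import torch.nn as nn

# A 2-layer MLP: every unit in one layer feeds every unit in the next.
mlp = nn.Sequential(
    nn.Linear(64, 128),   # layer i -> layer i+1, fully connected
    nn.ReLU(),            # elementwise nonlinearity between the layers
    nn.Linear(128, 3),
)
```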