gtoubassi|3 years ago
As much as I want to be enthusiastic about this, it’s not entirely clear to me that it is surprising that such a feat can be achieved. For example, it may be possible to train a 2-layer MLP to predict the state of a tile directly from the inputs. It may be that the most influential activations are closer to the inputs than to the outputs, which would imply that Othello-GPT itself doesn’t have a world model; it would only show that you can predict board colors from the transcript. Again, I’m not a practitioner, but once you are indirecting internal state through a 2-layer MLP, it gets less obvious to me that the world model is really there: an expressive enough probe could be computing the board state itself. I think it would be more impressive if they took only “later” activations (further from the input) and used a linear classifier, to ensure the world model isn’t in the tile predictor instead of Othello-GPT. I would appreciate it if somebody could illuminate or set my admittedly naive intuitions straight!
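To make the distinction concrete, here’s roughly what I mean as a PyTorch sketch (the hidden size, probe width, and class count are just assumptions on my part):

```python
import torch
import torch.nn as nn

D_MODEL = 512    # Othello-GPT hidden size (assumed)
N_CLASSES = 3    # each square: empty / black / white

# A linear probe can only read out what is already linearly
# encoded in the activations.
linear_probe = nn.Linear(D_MODEL, N_CLASSES)

# A 2-layer MLP probe can do nontrivial computation of its own,
# so some of the "board state" could live in the probe.
mlp_probe = nn.Sequential(
    nn.Linear(D_MODEL, 256),
    nn.ReLU(),
    nn.Linear(256, N_CLASSES),
)

def train_probe(probe, activations, labels, epochs=10, lr=1e-3):
    """Fit one square's probe on frozen GPT activations."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        opt.step()
    return probe
```

If only the MLP version decodes the board well, some of the decoding work may be happening inside the probe rather than inside Othello-GPT.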
That said, I am reminded of another OpenAI paper [1] from way back in 2017 that blew my mind: unsupervised “predict the next character” training on 82 million Amazon reviews, then using the activations to train a linear classifier to predict sentiment. It turns out a single neuron’s activation is responsible for the bulk of the sentiment signal!
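If I remember the setup right, the classifier there was L1-regularized logistic regression on the LSTM’s hidden state, which is also how a single neuron pops out: the L1 penalty zeroes out almost every other coefficient. A minimal sketch (file names and the C value are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hidden: (n_reviews, n_units) final LSTM states from the unsupervised LM
# labels: (n_reviews,) binary sentiment from a small labelled set
hidden = np.load("review_states.npy")   # hypothetical file
labels = np.load("review_labels.npy")   # hypothetical file

# The L1 penalty drives most coefficients to zero, which is how a
# single dominant "sentiment neuron" becomes visible.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(hidden, labels)

weights = clf.coef_.ravel()
print("most influential units:", np.argsort(-np.abs(weights))[:5])
```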
codeulike|3 years ago
It turns out that the error rates of these probes are reduced from 26.2% on a randomly-initialized Othello-GPT to only 1.7% on a trained Othello-GPT. This suggests that there exists a world model in the internal representation of a trained Othello-GPT.
I take that to mean that the 64 trained probes are then shown other Othello-GPT internals and can tell us what the state of their particular 'square' is 98.3% of the time. (We know what the board would look like, but the probes don't.)
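Put differently, each probe is just a classifier scored on held-out positions, and the 1.7% is an error rate along these lines (a sketch, assuming one probe per square over the 3 classes empty/black/white):

```python
import torch

@torch.no_grad()
def probe_error_rate(probe, activations, labels):
    """Fraction of held-out positions where this square's probe
    reads the wrong colour from the GPT's activations."""
    preds = probe(activations).argmax(dim=-1)  # (N,) predicted classes
    return (preds != labels).float().mean().item()
```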
As you say "Again, not a practitioner but once you are indirecting internal state through a 2 layer MLP it gets less obvious to me that the world model is really there."
But then they go back and actually mess around with Othello-GPT's internal state (using the probes to work out how), changing black counters to white and so on, and this directly affects the next move Othello-GPT makes. They even do this for impossible board states (e.g. two unlinked sets of discs) and Othello-GPT still comes up with correct next moves.
So surely this proves that the probes were actually pointing to an internal model? Because when you mess with the internal state in a way that should affect the next move, it changes Othello-GPT's behaviour in the expected way.
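As I understand it, the 'messing around' is done by gradient descent on the activation vector itself: keep nudging it until the probe reads the flipped colour, then resume the forward pass from there. A rough sketch (step count and learning rate are made up):

```python
import torch

def intervene(activation, probe, target_class, steps=30, lr=1.0):
    """Nudge one layer's activation vector by gradient descent until the
    probe reads `target_class` (e.g. flip a square from black to white),
    then let the rest of the forward pass run on the edited activation."""
    act = activation.clone().detach().requires_grad_(True)
    loss_fn = torch.nn.CrossEntropyLoss()
    target = torch.tensor([target_class])
    for _ in range(steps):
        loss = loss_fn(probe(act).unsqueeze(0), target)
        loss.backward()
        with torch.no_grad():
            act -= lr * act.grad   # move the activation toward the new colour
        act.grad = None
    return act.detach()
```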
moomoo11|3 years ago
What's an MLP?
pedrosorio|3 years ago
A “classic” neural network (a multi-layer perceptron), where every node in layer i is connected to every node in layer i+1
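In PyTorch terms, a 2-layer MLP like the one mentioned above is just this (the sizes are arbitrary):

```python
import torch.nn as nn

# A 2-layer MLP: every unit in one layer feeds every unit in the next.
mlp = nn.Sequential(
    nn.Linear(64, 128),   # layer i -> layer i+1, fully connected
    nn.ReLU(),            # elementwise nonlinearity between the layers
    nn.Linear(128, 3),
)
```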