jayalammar | 2 years ago

This is a field I find fascinating. It's generally known as the research field of Machine Learning Interpretability. The BlackboxNLP workshop is one of the main venues for this kind of investigation and is a very popular academic workshop: https://blackboxnlp.github.io/

One of the most interesting presentations from the last session of the workshop is this talk by David Bau titled "Direct Model Editing and Mechanistic Interpretability". David and his team locate where specific facts are stored in the model and edit them. For example, they edit the location of the Eiffel Tower to be Rome, so whenever the model generates anything involving the tower's location (e.g., the view from the top), it actually describes Rome.

Talk: https://www.youtube.com/watch?v=I1ELSZNFeHc

Paper: https://rome.baulab.info/

Follow-up work: https://memit.baulab.info/
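The core trick in this line of work can be illustrated with a toy numpy sketch. This is a deliberate simplification of ROME: the paper treats an MLP layer as a key-value store and solves a covariance-regularized least-squares problem to insert a new fact; here I use only the plain rank-one projection, which already gives the flavor — the edited weight matrix maps a chosen "key" vector to a new "value" vector while changing the original weights as little as possible. All names and dimensions below are made up for illustration.

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Rank-one update so the edited layer maps key k to v_new.

    Toy version of the idea behind ROME: the real method regularizes the
    update with covariance statistics of the keys; here we use the plain
    projection W' = W + (v_new - W k) k^T / (k^T k), so W' @ k == v_new.
    """
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # stand-in for an MLP weight matrix
k = rng.normal(size=4)        # "key" vector (imagine it encodes "Eiffel Tower")
v_new = rng.normal(size=8)    # desired "value" (imagine it encodes "Rome")

W_edit = rank_one_edit(W, k, v_new)
print(np.allclose(W_edit @ k, v_new))  # True: edited layer maps k -> v_new
print(np.linalg.norm(W_edit - W))      # the change to W is small and rank one
```

Because the update is rank one, behavior on vectors orthogonal to k is untouched, which is why the model's other "knowledge" mostly survives the edit.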

There is also work on "probing" the representation vectors inside the model to investigate what information is encoded at the various layers. One early Transformer explainability paper (BERT Rediscovers the Classical NLP Pipeline, https://arxiv.org/abs/1905.05950) found that "the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way: POS tagging, parsing, NER, semantic roles, then coreference". In other words, the representations in the earlier layers encode things like whether a token is a verb or a noun, while later layers encode other, higher-level information. I've made an intro to these probing methods here: https://www.youtube.com/watch?v=HJn-OTNLnoE
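The probing recipe itself is simple: freeze the model, collect hidden states at some layer, and train a small linear classifier to predict a property from them — if the probe does well, that property is linearly decodable at that layer. Here's a minimal numpy sketch with synthetic vectors standing in for layer activations (in a real probe you'd run sentences through BERT and take, say, layer-4 token vectors, with labels like POS tags):

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained on frozen representations X
    with binary labels y. The model itself is never updated."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # gradient of cross-entropy wrt logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    return (((X @ w + b) > 0) == y).mean()

# Synthetic stand-ins for hidden states at one layer: two classes
# (think noun vs. verb tokens), separable by construction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(200, 16)),
               rng.normal(+1.0, 1.0, size=(200, 16))])
y = np.array([0] * 200 + [1] * 200)

w, b = train_linear_probe(X, y)
print(probe_accuracy(X, y, w, b))  # high accuracy -> property is encoded here
```

Comparing this accuracy across layers is exactly how papers like the one above argue that, e.g., POS information lives early and coreference late.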

A lot of applied work doesn't require interpretability and explainability at the moment, but I suspect interest will continue to increase.

HarHarVeryFunny | 2 years ago

Thanks, Jay!

I wasn't aware of that BERT explainability paper - will be reading it, and watching your video.

Are there any more recent Transformer Explainability papers that you would recommend - maybe ones that build on this and look at what's going on in later layers?