Emergent Linear Representations in World Models of Self-Supervised Sequence Models

Neel Nanda, Andrew Lee, Martin Wattenberg


Abstract
How do sequence models represent their decision-making process? Prior work suggests that an Othello-playing neural network learns nonlinear models of the board state (Li et al., 2023a). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for “my colour” vs. “opponent’s colour” may be a simple yet powerful way to interpret the model’s internal state. This precise understanding of the internal representations allows us to control the model’s behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.
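The two techniques the abstract names, a linear probe for “my colour” vs. “opponent’s colour” and intervention by vector arithmetic, can be sketched on synthetic data. This is a minimal illustration, not the paper’s implementation: the dimension sizes, variable names, and the use of a least-squares probe and a reflection-style intervention are all assumptions made for the sketch.

```python
# Hedged sketch: a linear probe for a binary "mine vs. theirs" feature,
# followed by an intervention that edits an activation along the probe
# direction. All data here is synthetic; d_model and the least-squares
# probe are illustrative choices, not details from the paper.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64        # hypothetical residual-stream width
n_samples = 500

# Pretend the model encodes "my colour" (+1) vs "opponent's colour" (-1)
# along one hidden direction, plus noise.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
labels = rng.choice([-1.0, 1.0], size=n_samples)
acts = labels[:, None] * true_dir + 0.1 * rng.normal(size=(n_samples, d_model))

# Linear probe: least-squares fit predicting the label from the activation.
probe, *_ = np.linalg.lstsq(acts, labels, rcond=None)
accuracy = (np.sign(acts @ probe) == labels).mean()

# Vector-arithmetic intervention: reflect one activation across the probe's
# decision boundary, flipping the colour the probe reads out.
direction = probe / np.linalg.norm(probe)
x = acts[0]
x_flipped = x - 2 * (x @ direction) * direction
assert np.sign(x @ probe) != np.sign(x_flipped @ probe)
```

Because the synthetic feature really is linear, the probe separates the two classes almost perfectly; the interesting empirical claim in the paper is that real Othello-model activations admit a comparably simple linear readout.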
Anthology ID:
2023.blackboxnlp-1.2
Volume:
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
December
Year:
2023
Address:
Singapore
Editors:
Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, Hosein Mohebbi
Venues:
BlackboxNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
16–30
URL:
https://aclanthology.org/2023.blackboxnlp-1.2
DOI:
10.18653/v1/2023.blackboxnlp-1.2
Cite (ACL):
Neel Nanda, Andrew Lee, and Martin Wattenberg. 2023. Emergent Linear Representations in World Models of Self-Supervised Sequence Models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore. Association for Computational Linguistics.
Cite (Informal):
Emergent Linear Representations in World Models of Self-Supervised Sequence Models (Nanda et al., BlackboxNLP-WS 2023)
PDF:
https://aclanthology.org/2023.blackboxnlp-1.2.pdf
Video:
https://aclanthology.org/2023.blackboxnlp-1.2.mp4