Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

Róbert Csordás, Christopher Potts, Christopher D Manning, Atticus Geiger


Abstract
The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
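To make the idea of magnitude-based storage concrete, the following is a minimal toy sketch and not the trained RNNs or the learned interventions from the paper. It assumes a fixed scaling factor of 0.25 per position and hand-built token codes with entries in {-1, +1}; the names `SCALE`, `token_codes`, `store`, and `generate` are hypothetical and chosen only for illustration. Every position writes into the same directions of a single vector, and only the order of magnitude tells the positions apart.

```python
import numpy as np

# Assumed, illustrative scaling factor: each later position is written 4x smaller.
SCALE = 0.25


def token_codes(vocab_size, dim):
    """Give every token a distinct code vector with entries in {-1, +1}."""
    assert vocab_size <= 2 ** dim, "need at least log2(vocab_size) dimensions"
    # Binary expansion of the token id, mapped from {0, 1} to {-1, +1}.
    bits = (np.arange(vocab_size)[:, None] >> np.arange(dim)) & 1
    return 2.0 * bits - 1.0


def store(tokens, codes):
    """Sum all token codes into one vector, scaling position i by SCALE**i."""
    h = np.zeros(codes.shape[1])
    for pos, tok in enumerate(tokens):
        h += SCALE ** pos * codes[tok]
    return h


def generate(h, codes, length):
    """Read tokens back by peeling off the largest-magnitude layer each step."""
    h = h.copy()
    out = []
    for _ in range(length):
        # The remaining layers sum to at most 1/3 per coordinate, so the sign
        # of each coordinate is decided by the outermost layer alone.
        layer = np.sign(h)
        tok = int(np.argmax(codes @ layer))
        out.append(tok)
        # Remove the decoded layer and rescale so the next layer has magnitude 1.
        h = (h - codes[tok]) / SCALE
    return out


tokens = [3, 0, 7, 7, 1]
codes = token_codes(vocab_size=8, dim=3)
h = store(tokens, codes)
recovered = generate(h, codes, len(tokens))
```

In this sketch, five tokens from a vocabulary of eight are held in a three-dimensional vector, so the positions cannot each occupy their own linear subspace; they are separated by scale alone. The factor 0.25 is a convenience that keeps the arithmetic exact in floating point for short sequences, and any factor below 0.5 would preserve the sign-based readout in exact arithmetic.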
Anthology ID:
2024.blackboxnlp-1.17
Volume:
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
248–262
URL:
https://aclanthology.org/2024.blackboxnlp-1.17
Cite (ACL):
Róbert Csordás, Christopher Potts, Christopher D Manning, and Atticus Geiger. 2024. Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 248–262, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations (Csordás et al., BlackboxNLP 2024)
PDF:
https://aclanthology.org/2024.blackboxnlp-1.17.pdf