An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Jett Janiak, Can Rager, James Dao, Yeu-Tong Lau


Abstract
Prior work suggests that language models manage the limited bandwidth of the residual stream through a “memory management” mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
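The following is a minimal sketch (not the authors' code) of how DLA is typically computed for the GELU-4L model via the TransformerLens library, together with a simple erasure diagnostic in the spirit of the paper. The prompt, head indices, and target token are illustrative assumptions, not results from the paper.

import torch
from transformer_lens import HookedTransformer

# Load the 4-layer GELU model studied in the paper (a TransformerLens checkpoint).
model = HookedTransformer.from_pretrained("gelu-4l")
model.set_use_attn_result(True)  # cache each head's write to the residual stream

prompt = "The Eiffel Tower is in"           # hypothetical prompt
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

layer, head = 0, 2                           # hypothetical head, not from the paper
target = model.to_single_token(" Paris")     # hypothetical target token

# Direct logit attribution of this head at the final position: take its write to
# the residual stream, rescale by the final LayerNorm's cached normalization
# factor, and project onto the target token's unembedding direction.
head_out = cache["result", layer][0, -1, head]       # [d_model]
scale = cache["ln_final.hook_scale"][0, -1]          # [1]
dla = (head_out / scale) @ model.W_U[:, target]
print(f"DLA of L{layer}H{head} to ' Paris': {dla.item():+.3f}")

# Erasure diagnostic: how much of this head's output does a later head remove?
# A projection coefficient near -1 means the later head writes roughly the
# negative of the earlier head's output, cancelling its direct effect on the
# logits, which is exactly the case DLA misses.
later_layer, later_head = 2, 1               # hypothetical
later_out = cache["result", later_layer][0, -1, later_head]
proj = (later_out @ head_out) / (head_out @ head_out)
print(f"Projection of L{later_layer}H{later_head} onto L{layer}H{head}: {proj.item():+.3f}")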
Anthology ID: 2024.blackboxnlp-1.15
Volume: Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Month: November
Year: 2024
Address: Miami, Florida, US
Editors: Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller, Hanjie Chen
Venue: BlackboxNLP
Publisher: Association for Computational Linguistics
Pages: 232–237
URL: https://aclanthology.org/2024.blackboxnlp-1.15
Cite (ACL): Jett Janiak, Can Rager, James Dao, and Yeu-Tong Lau. 2024. An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 232–237, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal): An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L (Janiak et al., BlackboxNLP 2024)
PDF: https://aclanthology.org/2024.blackboxnlp-1.15.pdf