@inproceedings{brown-etal-2023-understanding,
title = "Understanding the Inner-workings of Language Models Through Representation Dissimilarity",
author = "Brown, Davis and
Godfrey, Charles and
Konz, Nicholas and
Tu, Jonathan and
Kvinge, Henry",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.403/",
doi = "10.18653/v1/2023.emnlp-main.403",
pages = "6543--6558",
abstract = "As language models are applied to an increasing number of real-world applications, understanding their inner workings has become an important issue in model trust, interpretability, and transparency. In this work we show that representation dissimilarity measures, which are functions that measure the extent to which two model`s internal representations differ, can be a valuable tool for gaining insight into the mechanics of language models. Among our insights are: (i) an apparent asymmetry in the internal representations of model using SoLU and GeLU activation functions, (ii) evidence that dissimilarity measures can identify and locate generalization properties of models that are invisible via in-distribution test set performance, and (iii) new evaluations of how language model features vary as width and depth are increased. Our results suggest that dissimilarity measures are a promising set of tools for shedding light on the inner workings of language models."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="brown-etal-2023-understanding">
    <titleInfo>
      <title>Understanding the Inner-workings of Language Models Through Representation Dissimilarity</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Davis</namePart>
      <namePart type="family">Brown</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Charles</namePart>
      <namePart type="family">Godfrey</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nicholas</namePart>
      <namePart type="family">Konz</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jonathan</namePart>
      <namePart type="family">Tu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Henry</namePart>
      <namePart type="family">Kvinge</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2023-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Houda</namePart>
        <namePart type="family">Bouamor</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Juan</namePart>
        <namePart type="family">Pino</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Kalika</namePart>
        <namePart type="family">Bali</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Singapore</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>As language models are applied to an increasing number of real-world applications, understanding their inner workings has become an important issue in model trust, interpretability, and transparency. In this work we show that representation dissimilarity measures, which are functions that measure the extent to which two models' internal representations differ, can be a valuable tool for gaining insight into the mechanics of language models. Among our insights are: (i) an apparent asymmetry in the internal representations of models using SoLU and GeLU activation functions, (ii) evidence that dissimilarity measures can identify and locate generalization properties of models that are invisible via in-distribution test set performance, and (iii) new evaluations of how language model features vary as width and depth are increased. Our results suggest that dissimilarity measures are a promising set of tools for shedding light on the inner workings of language models.</abstract>
<identifier type="citekey">brown-etal-2023-understanding</identifier>
<identifier type="doi">10.18653/v1/2023.emnlp-main.403</identifier>
<location>
<url>https://aclanthology.org/2023.emnlp-main.403/</url>
</location>
<part>
<date>2023-12</date>
<extent unit="page">
<start>6543</start>
<end>6558</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Understanding the Inner-workings of Language Models Through Representation Dissimilarity
%A Brown, Davis
%A Godfrey, Charles
%A Konz, Nicholas
%A Tu, Jonathan
%A Kvinge, Henry
%Y Bouamor, Houda
%Y Pino, Juan
%Y Bali, Kalika
%S Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F brown-etal-2023-understanding
%X As language models are applied to an increasing number of real-world applications, understanding their inner workings has become an important issue in model trust, interpretability, and transparency. In this work we show that representation dissimilarity measures, which are functions that measure the extent to which two models' internal representations differ, can be a valuable tool for gaining insight into the mechanics of language models. Among our insights are: (i) an apparent asymmetry in the internal representations of models using SoLU and GeLU activation functions, (ii) evidence that dissimilarity measures can identify and locate generalization properties of models that are invisible via in-distribution test set performance, and (iii) new evaluations of how language model features vary as width and depth are increased. Our results suggest that dissimilarity measures are a promising set of tools for shedding light on the inner workings of language models.
%R 10.18653/v1/2023.emnlp-main.403
%U https://aclanthology.org/2023.emnlp-main.403/
%U https://doi.org/10.18653/v1/2023.emnlp-main.403
%P 6543-6558
Markdown (Informal)
[Understanding the Inner-workings of Language Models Through Representation Dissimilarity](https://aclanthology.org/2023.emnlp-main.403/) (Brown et al., EMNLP 2023)
ACL
Davis Brown, Charles Godfrey, Nicholas Konz, Jonathan Tu, and Henry Kvinge. 2023. Understanding the Inner-workings of Language Models Through Representation Dissimilarity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6543–6558, Singapore. Association for Computational Linguistics.