Feature Interactions Reveal Linguistic Structure in Language Models

Jaap Jumelet; Willem Zuidema

doi:10.18653/v1/2023.findings-acl.554

Abstract

We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations for its input, and might provide an ideal starting point for the investigation into linguistic structure in language models. However, uncovering the exact role that these interactions play is also difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We work out a grey-box methodology, in which we train models to perfection on a formal language classification task, using PCFGs. We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings, we extend our evaluation to a case study on language models, providing novel insights into the linguistic structure that these models have acquired.

Anthology ID: 2023.findings-acl.554
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8697–8712
URL: https://aclanthology.org/2023.findings-acl.554/
DOI: 10.18653/v1/2023.findings-acl.554
Cite (ACL): Jaap Jumelet and Willem Zuidema. 2023. Feature Interactions Reveal Linguistic Structure in Language Models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8697–8712, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Feature Interactions Reveal Linguistic Structure in Language Models (Jumelet & Zuidema, Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.554.pdf
