An Investigation of Language Model Interpretability via Sentence Editing

Samuel Stevens, Yu Su

Abstract
Pre-trained language models (PLMs) like BERT are used for almost all language-related tasks, but interpreting their behavior remains a significant challenge, and many important questions are largely unanswered. In this work, we re-purpose a sentence editing dataset, from which faithful, high-quality human rationales can be automatically extracted and compared with extracted model rationales, as a new testbed for interpretability. This enables a systematic investigation of an array of questions regarding PLMs’ interpretability, including the role of the pre-training procedure, a comparison of rationale extraction methods, and the behavior of different layers in the PLM. The investigation yields new insights; for example, contrary to common understanding, we find that attention weights correlate well with human rationales and work better than gradient-based saliency for extracting model rationales. Both the dataset and code will be released to facilitate future interpretability research.
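The attention-based rationale extraction the abstract refers to can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: the aggregation scheme (averaging attention over heads and reading the [CLS] row as per-token saliency, then keeping the top-k tokens) is an assumption for demonstration purposes.

```python
import numpy as np

def attention_rationale(attn, k=3):
    """Extract a top-k token rationale from one layer's attention weights.

    attn: array of shape (num_heads, seq_len, seq_len), row-normalized
          attention weights; position 0 is assumed to be [CLS].
    k:    number of token positions to keep as the rationale.
    """
    # Average over heads, then take the [CLS] row: how strongly the
    # sentence representation attends to each token position.
    saliency = attn.mean(axis=0)[0]            # shape (seq_len,)
    saliency[0] = 0.0                          # ignore [CLS] -> [CLS]
    top = np.argsort(saliency)[::-1][:k]       # k most-attended positions
    return sorted(top.tolist())

# Toy example: random "attention" over a 6-token sentence, normalized
# along the last axis so each row sums to 1 (like a softmax output).
rng = np.random.default_rng(0)
raw = rng.random((12, 6, 6))
attn = raw / raw.sum(axis=-1, keepdims=True)
rationale = attention_rationale(attn, k=3)
print(rationale)  # three token positions, sorted ascending
```

In practice the attention tensor would come from a real model (e.g. via `output_attentions=True` in Hugging Face Transformers), and the extracted positions would be compared against the human rationale spans from the sentence-editing dataset.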
Anthology ID:
2021.blackboxnlp-1.34
Volume:
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Jasmijn Bastings, Yonatan Belinkov, Emmanuel Dupoux, Mario Giulianelli, Dieuwke Hupkes, Yuval Pinter, Hassan Sajjad
Venue:
BlackboxNLP
Publisher:
Association for Computational Linguistics
Pages:
435–446
URL:
https://aclanthology.org/2021.blackboxnlp-1.34
DOI:
10.18653/v1/2021.blackboxnlp-1.34
Cite (ACL):
Samuel Stevens and Yu Su. 2021. An Investigation of Language Model Interpretability via Sentence Editing. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 435–446, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
An Investigation of Language Model Interpretability via Sentence Editing (Stevens & Su, BlackboxNLP 2021)
PDF:
https://aclanthology.org/2021.blackboxnlp-1.34.pdf
Code
samuelstevens/bert-edits (+ additional community code)
Data
100DOH