Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui


Abstract
The Transformer architecture has become ubiquitous in natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed solely of multi-head attention; other components can also contribute to its performance. In this study, we extend the scope of Transformer analysis from the attention patterns alone to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new, intuitive explanations of existing observations; for example, discarding the learned attention patterns tends not to adversely affect performance. The code for our experiments is publicly available.
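
The idea behind the analysis can be pictured with a small sketch: for a single attention head, the block's output at each position is an attention-weighted sum of transformed input tokens plus the residual term, so one can compare the norm of the part contributed by other tokens ("context mixing") with the norm of the part that preserves the token itself. The code below is only a minimal illustration under simplified assumptions (single head, biases and layer normalization omitted, random tensors); it is not the authors' exact formulation, and all names and shapes are illustrative.

# Minimal sketch of norm-based contribution analysis for one attention block.
# Not the paper's exact method: layer normalization and biases are omitted.
import numpy as np

def contribution_norms(x, W_v, W_o, alpha):
    """
    x:     (n, d)  input representations to the attention block
    W_v:   (d, d)  value projection (single head, for simplicity)
    W_o:   (d, d)  output projection
    alpha: (n, n)  attention weights; each row sums to 1
    Returns, for each position i, the norm of the summed contribution from
    other tokens (context mixing) and the norm of the term that preserves
    token i itself (its own attention contribution plus the residual).
    """
    n, d = x.shape
    f = x @ W_v @ W_o                  # (n, d): transformed value of each token j
    mixing_norm = np.zeros(n)
    preserving_norm = np.zeros(n)
    for i in range(n):
        contrib = alpha[i][:, None] * f            # (n, d): weighted per-token contributions
        others = contrib.sum(axis=0) - contrib[i]  # contributions from tokens j != i
        own = contrib[i] + x[i]                    # token i's own term + residual connection
        mixing_norm[i] = np.linalg.norm(others)
        preserving_norm[i] = np.linalg.norm(own)
    return mixing_norm, preserving_norm

# Toy usage with random tensors (hypothetical sizes)
rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
W_o = rng.normal(size=(d, d)) / np.sqrt(d)
logits = rng.normal(size=(n, n))
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mix, keep = contribution_norms(x, W_v, W_o, alpha)
print(mix, keep)

In a trained masked language model, the paper's finding corresponds to the "mixing" norms being smaller relative to the "preserving" norms than the raw attention weights would suggest.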
Anthology ID:
2021.emnlp-main.373
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4547–4568
URL:
https://aclanthology.org/2021.emnlp-main.373
DOI:
10.18653/v1/2021.emnlp-main.373
Cite (ACL):
Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, and Kentaro Inui. 2021. Incorporating Residual and Normalization Layers into Analysis of Masked Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4547–4568, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Incorporating Residual and Normalization Layers into Analysis of Masked Language Models (Kobayashi et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.373.pdf
Video:
https://aclanthology.org/2021.emnlp-main.373.mp4
Code:
gorokoba560/norm-analysis-of-transformer (+ additional community code)
Data:
CoNLL 2003, MultiNLI, SST, SST-2