How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT’s Attention

Yue Guan, Jingwen Leng, Chao Li, Quan Chen, Minyi Guo


Abstract
Recent research on the multi-head attention mechanism, especially in pre-trained models such as BERT, has offered heuristics and clues for analyzing various aspects of the mechanism. Because most of this research focuses on probing tasks or hidden states, previous works have identified some primitive patterns of attention-head behavior through heuristic analysis, but a systematic analysis focused specifically on the attention patterns is still lacking. In this work, we cluster the attention heatmaps into clearly distinct patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further examine the functions corresponding to these patterns through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.
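The pipeline described in the abstract (extract per-head attention maps, compute distance-based features, cluster them unsupervised) can be sketched roughly as below. The specific feature (attention-weighted mean token distance per head) and the use of k-means are illustrative assumptions, not necessarily the authors' exact formulation.

```python
# Hypothetical sketch: cluster BERT attention heads by a distance-based feature.
# The feature choice (attention-weighted mean |i - j| distance) and k-means are
# illustrative assumptions, not the paper's exact method.
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentences = ["The quick brown fox jumps over the lazy dog."]
features = []  # one feature vector per (layer, head)

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq)
        seq_len = attentions[0].shape[-1]
        positions = torch.arange(seq_len)
        # |i - j| distance between query position i and key position j
        dist = (positions[:, None] - positions[None, :]).abs().float()
        for layer_att in attentions:
            for head_att in layer_att[0]:  # (seq, seq) map for one head
                # attention-weighted distance, averaged over query positions
                mean_dist = (head_att * dist).sum(dim=-1).mean()
                features.append([mean_dist.item()])

# Group heads into a few pattern clusters (e.g. local vs. long-range attention).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)
```

In practice one would aggregate such features over many sentences and possibly several feature dimensions before clustering; the single-sentence, single-feature setup above is only meant to show the shape of the computation.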
Anthology ID:
2020.coling-main.342
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3853–3860
URL:
https://aclanthology.org/2020.coling-main.342
DOI:
10.18653/v1/2020.coling-main.342
Cite (ACL):
Yue Guan, Jingwen Leng, Chao Li, Quan Chen, and Minyi Guo. 2020. How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT’s Attention. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3853–3860, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT’s Attention (Guan et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.342.pdf
Data
MRPC, MultiNLI, SQuAD