Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns

Zihan Wang, Jiuxiang Gu, Jason Kuen, Handong Zhao, Vlad Morariu, Ruiyi Zhang, Ani Nenkova, Tong Sun, Jingbo Shang


Abstract
We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning-only approach yields a slightly worse but still competitive model. We then compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns, during fine-tuning, a different attention pattern for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the method identifies important tokens for each task and model layer and focuses attention on those. It does not require pre-training to accommodate the sparse patterns, and it performs competitively with, and sometimes better than, fixed sparse attention patterns that require resource-intensive pre-training.
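To make the patterns being compared concrete, the following is a minimal PyTorch sketch of the two fixed sparse attention masks discussed in the abstract (a local window pattern and a global-token pattern). The helper names, the window size, and the choice of global positions are illustrative assumptions, not the authors' implementation; the paper's Adaptive Axis Attention additionally learns, per layer and per task, which tokens to focus on during fine-tuning, which this fixed-mask sketch does not show.

import torch

def local_mask(seq_len, window):
    # Local pattern: each token attends only to neighbors within a fixed window.
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def global_mask(seq_len, global_idx):
    # Global pattern: a few designated tokens attend to, and are attended by, all positions.
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

def sparse_attention(q, k, v, mask):
    # Scaled dot-product attention restricted to the positions allowed by the mask.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example usage (hypothetical sizes): combine a local window with one global token.
seq_len, d = 16, 8
q = k = v = torch.randn(seq_len, d)
mask = local_mask(seq_len, window=2) | global_mask(seq_len, global_idx=[0])
out = sparse_attention(q, k, v, mask)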
Anthology ID:
2022.findings-acl.74
Volume:
Findings of the Association for Computational Linguistics: ACL 2022
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
916–925
URL:
https://aclanthology.org/2022.findings-acl.74
DOI:
10.18653/v1/2022.findings-acl.74
Cite (ACL):
Zihan Wang, Jiuxiang Gu, Jason Kuen, Handong Zhao, Vlad Morariu, Ruiyi Zhang, Ani Nenkova, Tong Sun, and Jingbo Shang. 2022. Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns. In Findings of the Association for Computational Linguistics: ACL 2022, pages 916–925, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns (Wang et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-acl.74.pdf
Video:
https://aclanthology.org/2022.findings-acl.74.mp4
Data:
GLUE, LRA, QNLI