Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Huanru Henry Mao


Abstract
Autoregressive Transformers are strong language models but incur O(T) time and memory per generated token due to the self-attention mechanism. Recent work proposes kernel-based methods that approximate causal self-attention by replacing it with recurrent formulations using various update rules and feature maps, achieving O(1) per-token time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
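As a rough illustration of the idea in the abstract, the sketch below shows a generic decaying fast-weight recurrence for a single attention head: the quadratic attention read-out is replaced by a fixed-size state that is decayed and updated once per token, so generation cost no longer grows with T. The specific update rule, decay parameterization, feature map (elementwise exp here), and normalization are illustrative assumptions and need not match the paper's exact formulation.

```python
# Minimal sketch of a decaying fast-weight recurrence for one attention head.
# Generic illustration only: outer-product state update with a per-dimension
# decay gate; the paper's exact rule may differ. All names/shapes are assumed.
import numpy as np

def decaying_fast_weight_step(S, z, q, k, v, decay, eps=1e-6):
    """One generation step with O(1) time/memory w.r.t. sequence length.

    S:     (d_k, d_v) fast-weight state (decayed sum of k v^T outer products)
    z:     (d_k,)     running normalizer (decayed sum of keys)
    q,k,v: (d_k,), (d_k,), (d_v,) current-token query/key/value (feature-mapped)
    decay: (d_k,)     per-dimension decay in (0, 1)
    """
    S = decay[:, None] * S + np.outer(k, v)   # decay old associations, write the new one
    z = decay * z + k                         # keep the normalizer consistent with S
    y = (q @ S) / (q @ z + eps)               # read out: a weighted average of stored values
    return y, S, z

# Usage: run a toy sequence token by token, carrying only (S, z) as state.
d_k, d_v, T = 8, 8, 16
rng = np.random.default_rng(0)
decay = 1.0 / (1.0 + np.exp(-rng.normal(size=d_k)))  # sigmoid-squashed decay gate
S, z = np.zeros((d_k, d_v)), np.zeros(d_k)
for t in range(T):
    q, k = np.exp(rng.normal(size=d_k)), np.exp(rng.normal(size=d_k))  # positive feature map
    v = rng.normal(size=d_v)
    y, S, z = decaying_fast_weight_step(S, z, q, k, v, decay)
```

Because the state (S, z) has fixed size, each decoding step costs the same regardless of how many tokens have been generated, which is the O(1) per-token property the abstract refers to.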
Anthology ID: 2022.emnlp-main.697
Volume: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates
Editors: Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10236–10242
URL: https://aclanthology.org/2022.emnlp-main.697
DOI: 10.18653/v1/2022.emnlp-main.697
Cite (ACL):
Huanru Henry Mao. 2022. Fine-Tuning Pre-trained Transformers into Decaying Fast Weights. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10236–10242, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights (Mao, EMNLP 2022)
PDF: https://aclanthology.org/2022.emnlp-main.697.pdf