One Wide Feedforward Is All You Need

Telmo Pires, António Vilarinho Lopes, Yannick Assogba, Hendra Setiawan


Abstract
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN and find that, despite taking up a significant fraction of the model’s parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
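To make the architectural change described in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: every decoder layer keeps only its attention sub-layers, and all encoder layers reuse one shared FFN module whose hidden dimension is widened to recover capacity. The module names, pre-norm layout, dimensions (d_model=512, d_ff_wide=8192), and the omission of embeddings, masking, and dropout are illustrative assumptions.

```python
# Sketch: decoder layers without FFNs, one widened FFN shared by all encoder layers.
# Hyperparameters and layer layout are assumptions for illustration only.
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class EncoderLayer(nn.Module):
    """Self-attention block; the FFN is passed in and shared across layers."""

    def __init__(self, d_model: int, n_heads: int, shared_ffn: FeedForward):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = shared_ffn  # same module object in every layer, so weights are tied

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


class DecoderLayer(nn.Module):
    """Self- and cross-attention only; the per-layer FFN is removed entirely."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, y: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        h = self.norm1(y)
        y = y + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(y)
        return y + self.cross_attn(h, memory, memory, need_weights=False)[0]


class OneWideFFNTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff_wide=8192, n_layers=6):
        super().__init__()
        # One widened FFN, instantiated once and shared by every encoder layer.
        shared_ffn = FeedForward(d_model, d_ff_wide)
        self.encoder = nn.ModuleList(
            [EncoderLayer(d_model, n_heads, shared_ffn) for _ in range(n_layers)]
        )
        self.decoder = nn.ModuleList(
            [DecoderLayer(d_model, n_heads) for _ in range(n_layers)]
        )

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        for layer in self.encoder:
            src = layer(src)
        for layer in self.decoder:
            tgt = layer(tgt, src)
        return tgt


if __name__ == "__main__":
    model = OneWideFFNTransformer()
    src = torch.randn(2, 10, 512)   # (batch, source length, d_model)
    tgt = torch.randn(2, 7, 512)    # (batch, target length, d_model)
    print(model(src, tgt).shape)    # torch.Size([2, 7, 512])
```

Because the encoder layers hold references to the same FeedForward instance, its parameters are counted once; widening d_ff_wide is how this sketch mirrors the paper's strategy of scaling the shared FFN back up to the original parameter budget.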
Anthology ID:
2023.wmt-1.98
Volume:
Proceedings of the Eighth Conference on Machine Translation
Month:
December
Year:
2023
Address:
Singapore
Editors:
Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
1031–1044
URL:
https://aclanthology.org/2023.wmt-1.98
DOI:
10.18653/v1/2023.wmt-1.98
Cite (ACL):
Telmo Pires, António Vilarinho Lopes, Yannick Assogba, and Hendra Setiawan. 2023. One Wide Feedforward Is All You Need. In Proceedings of the Eighth Conference on Machine Translation, pages 1031–1044, Singapore. Association for Computational Linguistics.
Cite (Informal):
One Wide Feedforward Is All You Need (Pires et al., WMT 2023)
PDF:
https://aclanthology.org/2023.wmt-1.98.pdf