Rethinking the Value of Transformer Components

Wenxuan Wang, Zhaopeng Tu


Abstract
The Transformer has become the state-of-the-art translation model, yet how each intermediate component contributes to model performance is not well studied, which poses significant challenges for designing optimal architectures. In this work, we bridge this gap by evaluating the impact of individual components (sub-layers) in trained Transformer models from different perspectives. Experimental results across language pairs, training strategies, and model capacities show that certain components are consistently more important than others. We also report a number of interesting findings that might help humans better analyze, understand, and improve Transformer models. Based on these observations, we further propose a new training strategy that can improve translation performance by distinguishing the unimportant components in training.
Anthology ID:
2020.coling-main.529
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
6019–6029
URL:
https://aclanthology.org/2020.coling-main.529
DOI:
10.18653/v1/2020.coling-main.529
PDF:
https://aclanthology.org/2020.coling-main.529.pdf