On the Sub-layer Functionalities of Transformer Decoder

Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, Zhaopeng Tu


Abstract
There have been significant efforts to interpret the encoder of Transformer-based encoder-decoder architectures for neural machine translation (NMT); meanwhile, the decoder remains largely unexamined despite its critical role. During translation, the decoder must predict output tokens by considering both the source-language text from the encoder and the target-language prefix produced in previous steps. In this work, we study how Transformer-based decoders leverage information from the source and target languages, developing a universal probe task to assess how information is propagated through each module of each decoder layer. We perform extensive experiments on three major translation datasets (WMT En-De, En-Fr, and En-Zh). Our analysis provides insight into when and where decoders leverage different information sources. Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance, significantly reducing computation and parameter count and consequently speeding up both training and inference.
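To make the architectural change concrete, below is a minimal PyTorch sketch of a post-norm Transformer decoder layer in which the residual feed-forward sub-layer can be toggled off. This is an illustration of the idea only, not the authors' implementation; the class name, hyperparameters (d_model, n_heads, d_ff), and the use_ffn flag are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Illustrative post-norm decoder layer (not the paper's code).
    With use_ffn=False, the residual feed-forward sub-layer is dropped,
    the modification the paper reports costs minimal performance."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, use_ffn=True):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.use_ffn = use_ffn
        if use_ffn:
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, memory, tgt_mask=None):
        # Self-attention over the target-language prefix (residual + norm).
        a, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)
        x = self.norm1(x + a)
        # Cross-attention over the encoder's source-language representations.
        a, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + a)
        # Residual feed-forward sub-layer; skipped entirely when dropped.
        if self.use_ffn:
            x = self.norm3(x + self.ffn(x))
        return x

# Comparing parameter counts with and without the feed-forward sub-layer:
full = DecoderLayer(use_ffn=True)
slim = DecoderLayer(use_ffn=False)
print(sum(p.numel() for p in full.parameters()),
      sum(p.numel() for p in slim.parameters()))
```

With the default sizes above, the two Linear layers of the FFN account for the bulk of the layer's parameters, which is why removing this sub-layer yields the computation and speed savings the abstract describes.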
Anthology ID: 2020.findings-emnlp.432
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Venues: EMNLP | Findings
Publisher: Association for Computational Linguistics
Pages: 4799–4811
URL: https://aclanthology.org/2020.findings-emnlp.432
DOI: 10.18653/v1/2020.findings-emnlp.432
PDF: https://aclanthology.org/2020.findings-emnlp.432.pdf
Video: https://slideslive.com/38940141