Towards a Better Understanding of Label Smoothing in Neural Machine Translation

Yingbo Gao, Weiyue Wang, Christian Herold, Zijian Yang, Hermann Ney


Abstract
In order to combat overfitting and in pursuit of better generalization, label smoothing is widely applied in modern neural machine translation systems. The core idea is to penalize over-confident outputs and regularize the model so that its outputs do not diverge too much from some prior distribution. While training perplexity generally gets worse, label smoothing is found to consistently improve test performance. In this work, we aim to better understand label smoothing in the context of neural machine translation. Theoretically, we derive and explain exactly what label smoothing is optimizing for. Practically, we conduct extensive experiments by varying which tokens to smooth, tuning the probability mass to be deducted from the true targets and considering different prior distributions. We show that label smoothing is theoretically well-motivated, and by carefully choosing hyperparameters, the practical performance of strong neural machine translation systems can be further improved.
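As a rough illustration of the idea described in the abstract, the sketch below builds a smoothed target by deducting some probability mass from the true token and redistributing it according to a prior, then compares the cross-entropy of an over-confident and a more moderate model output. It uses the commonly used mixture form, which may differ from the exact formulation derived in the paper. The function names, the uniform prior, the four-token vocabulary, and the mass of 0.1 are all assumptions chosen for illustration, not values taken from the paper.

```python
import math

def smooth_targets(target_index, prior, mass):
    """Mix a one-hot target with a prior: (1 - mass) * one_hot + mass * prior.

    `mass` is the probability mass deducted from the true target;
    `prior` is any distribution over the vocabulary (uniform here, by assumption).
    """
    return [
        (1.0 - mass) * (1.0 if i == target_index else 0.0) + mass * p
        for i, p in enumerate(prior)
    ]

def cross_entropy(target_dist, model_probs):
    """Cross-entropy between a target distribution and model output probabilities."""
    return -sum(t * math.log(q) for t, q in zip(target_dist, model_probs) if t > 0.0)

# Hypothetical toy setup: 4-token vocabulary, true token at index 2.
vocab_size = 4
uniform_prior = [1.0 / vocab_size] * vocab_size
one_hot = smooth_targets(target_index=2, prior=uniform_prior, mass=0.0)
smoothed = smooth_targets(target_index=2, prior=uniform_prior, mass=0.1)

# Two hypothetical model outputs: one over-confident, one more moderate.
confident = [0.001, 0.001, 0.997, 0.001]
moderate = [0.025, 0.025, 0.925, 0.025]

# With unsmoothed targets the over-confident output has the lower loss;
# with smoothed targets it is penalized and the moderate output wins.
hard_loss_confident = cross_entropy(one_hot, confident)
hard_loss_moderate = cross_entropy(one_hot, moderate)
loss_confident = cross_entropy(smoothed, confident)
loss_moderate = cross_entropy(smoothed, moderate)
```

Under these assumptions, the smoothed loss is minimized when the model output matches the smoothed target itself, which is one way to see why over-confident outputs are penalized. Swapping `uniform_prior` for another distribution, such as unigram frequencies, corresponds to the "different prior distributions" the abstract mentions.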
Anthology ID:
2020.aacl-main.25
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
212–223
URL:
https://aclanthology.org/2020.aacl-main.25
PDF:
https://aclanthology.org/2020.aacl-main.25.pdf