Investigating Softmax Tempering for Training Neural Machine Translation Models

Raj Dabre, Atsushi Fujita


Abstract
Neural machine translation (NMT) models are typically trained using a softmax cross-entropy loss where the softmax distribution is compared against the gold labels. In low-resource scenarios, NMT models tend to perform poorly because the model training quickly converges to a point where the softmax distribution computed using logits approaches the gold label distribution. Although label smoothing is a well-known solution to address this issue, we further propose to divide the logits by a temperature coefficient greater than one, forcing the softmax distribution to be smoother during training. This makes it harder for the model to quickly over-fit. In our experiments on 11 language pairs in the low-resource Asian Language Treebank dataset, we observed significant improvements in translation quality. Our analysis focuses on finding the right balance of label smoothing and softmax tempering, which indicates that they are orthogonal methods. Finally, a study of softmax entropies and gradients reveals the impact of our method on the internal behavior of our NMT models.
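The core idea in the abstract, dividing the logits by a temperature coefficient greater than one before the softmax, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`softmax`, `entropy`, `tempered_cross_entropy`), the toy logits, and the uniform form of label smoothing used here are all assumptions made for illustration, and the sketch covers only a single prediction step rather than a full NMT training loop.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits that are first divided by a temperature coefficient."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tempered_cross_entropy(logits, gold_index, temperature=1.0, label_smoothing=0.0):
    """Cross-entropy between a (smoothed) gold label and the tempered softmax.

    Assumption: label smoothing spreads `label_smoothing` mass uniformly
    over all classes, which is one common formulation.
    """
    probs = softmax(logits, temperature)
    num_classes = len(logits)
    loss = 0.0
    for i, p in enumerate(probs):
        target = label_smoothing / num_classes
        if i == gold_index:
            target += 1.0 - label_smoothing
        loss -= target * math.log(p)
    return loss


# Hypothetical logits for a three-word vocabulary; index 0 is the gold label.
toy_logits = [2.0, 1.0, 0.1]
plain = softmax(toy_logits, temperature=1.0)
tempered = softmax(toy_logits, temperature=2.0)
```

With these toy values, the tempered distribution is flatter than the plain one: its entropy is higher and the probability assigned to the gold label is lower, so the loss stays larger for the same logits. That is the sense in which tempering makes it harder for training to converge quickly onto the gold label distribution. Because temperature acts on the model's distribution and label smoothing acts on the target, the two can be set independently in `tempered_cross_entropy`.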
Anthology ID:
2021.mtsummit-research.10
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
114–126
URL:
https://aclanthology.org/2021.mtsummit-research.10
Cite (ACL):
Raj Dabre and Atsushi Fujita. 2021. Investigating Softmax Tempering for Training Neural Machine Translation Models. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 114–126, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Investigating Softmax Tempering for Training Neural Machine Translation Models (Dabre & Fujita, MTSummit 2021)
PDF:
https://aclanthology.org/2021.mtsummit-research.10.pdf