Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training

Minghao Wu, Yitong Li, Meng Zhang, Liangyou Li, Gholamreza Haffari, Qun Liu


Abstract
Learning a multilingual and multi-domain translation model is challenging because heterogeneous and imbalanced data make the model converge inconsistently across different corpora in the real world. One common practice is to adjust the share of each corpus in training so that the learning process is balanced and low-resource cases can benefit from high-resource ones. However, automatic balancing methods usually depend on intra- and inter-dataset characteristics, which are usually unknown or require human priors. In this work, we propose an approach, MultiUAT, that dynamically adjusts the training data usage based on the model’s uncertainty on a small set of trusted clean data for multi-corpus machine translation. We experiment with two classes of uncertainty measures in multilingual (16 languages with 4 settings) and multi-domain settings (4 in-domain and 2 out-of-domain on English-German translation) and demonstrate that our approach, MultiUAT, substantially outperforms its baselines, including both static and dynamic strategies. We analyze the cross-domain transfer and show the deficiency of static and similarity-based methods.
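To make the balancing idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that uncertainty is measured as mean per-token predictive entropy on each corpus's trusted clean set, and that corpus sampling probabilities come from a temperature-scaled softmax over those scores. Both choices, along with the names `predictive_entropy`, `sampling_probabilities`, and `temperature`, are illustrative assumptions; the abstract does not specify the uncertainty measures or the update rule.

```python
import math


def predictive_entropy(token_distributions):
    """Mean per-token entropy (in nats) over a list of probability distributions.

    Assumption: each inner list is the model's output distribution for one
    target token of the trusted clean set of a single corpus.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def sampling_probabilities(uncertainties, temperature=1.0):
    """Map per-corpus uncertainty scores to corpus sampling probabilities.

    Assumption: a softmax with a hypothetical temperature, so that corpora
    on which the model is more uncertain are sampled more often.
    """
    names = list(uncertainties)
    scaled = [uncertainties[name] / temperature for name in names]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    norm = sum(weights)
    return {name: weight / norm for name, weight in zip(names, weights)}


# Hypothetical output distributions on two trusted sets (made-up numbers).
scores = {
    "high_resource": predictive_entropy([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]),
    "low_resource": predictive_entropy([[0.4, 0.3, 0.3], [0.5, 0.25, 0.25]]),
}
probabilities = sampling_probabilities(scores, temperature=0.5)
```

In this sketch the scores would be recomputed periodically during training and the corpus sampler updated with the new probabilities; a lower `temperature` makes the sampler concentrate more strongly on the most uncertain corpus.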
Anthology ID:
2021.emnlp-main.580
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7291–7305
URL:
https://aclanthology.org/2021.emnlp-main.580
DOI:
10.18653/v1/2021.emnlp-main.580
Cite (ACL):
Minghao Wu, Yitong Li, Meng Zhang, Liangyou Li, Gholamreza Haffari, and Qun Liu. 2021. Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7291–7305, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Uncertainty-Aware Balancing for Multilingual and Multi-Domain Neural Machine Translation Training (Wu et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.580.pdf
Video:
https://aclanthology.org/2021.emnlp-main.580.mp4
Data
WMT 2014