EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

Yulin Xu, Zhen Yang, Fandong Meng, Jie Zhou


Abstract
Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves superior performance against the conventional MNMT by constructing multi-way aligned corpus, i.e., aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, since exactly identical sentences from different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To handle this problem, this paper proposes “Extract and Generate” (EAG), a two-step approach to construct large-scale and high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing the bilingual examples from different language pairs with highly similar source or target sentences; and then generate the final aligned examples from the candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale and multi-way aligned corpus whose diversity is almost identical to the original bilingual corpus. Experiments on two publicly available datasets i.e., WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with +1.1 and +1.4 BLEU points improvements on the two datasets respectively.
Anthology ID:
2022.acl-long.560
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8141–8153
Language:
URL:
https://aclanthology.org/2022.acl-long.560
DOI:
10.18653/v1/2022.acl-long.560
Bibkey:
Cite (ACL):
Yulin Xu, Zhen Yang, Fandong Meng, and Jie Zhou. 2022. EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8141–8153, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation (Xu et al., ACL 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.acl-long.560.pdf
Software:
 2022.acl-long.560.software.zip
Data
OPUS-100