Advances and Challenges in Unsupervised Neural Machine Translation

Unsupervised cross-lingual language representation initialization methods, together with mechanisms such as denoising and back-translation, have advanced unsupervised neural machine translation (UNMT), which has achieved impressive results. Meanwhile, several challenges remain for UNMT. This tutorial first introduces the background and the latest progress of UNMT. We then examine a number of challenges to UNMT and give empirical results on how well the technology currently holds up. Tutorial materials are available at https://wangruinlp.github.io/unmt.html.

1 Tutorial Content

Introduction
Machine translation (MT) is a classic topic in the NLP community. Since the 2010s, deep learning methods have been adopted for MT, and neural MT (NMT) has achieved promising performance (Bahdanau et al., 2015). Recently, NMT has been adapted to the unsupervised scenario. Unsupervised NMT (UNMT) (Artetxe et al., 2018b; Lample et al., 2018a) requires only monolingual corpora and relies on a combination of mechanisms such as initialization with bilingual word embeddings, denoising auto-encoders, back-translation, and shared latent representations.

Methods
Cross-lingual language representation initialization. In supervised NMT, language representation initialization is less critical because the bilingual corpus helps NMT learn cross-lingual representations. In contrast, only monolingual corpora are available for UNMT. Therefore, pre-trained unsupervised bilingual word embeddings (Artetxe et al., 2017; Lample et al., 2018b) or an unsupervised cross-lingual language model (Lample and Conneau, 2019) provide the naive translation knowledge that enables back-translation to generate pseudo-parallel corpora at the beginning of UNMT training.

Denoising auto-encoder. Noise, obtained by randomly performing local substitutions and word reorderings (Vincent et al., 2010), is added to the input sentences to improve the model's learning ability and regularization. The denoising auto-encoder objective is optimized by maximizing the probability of reconstructing the original sentence from its noisy version.

Back-translation. Back-translation plays a key role in achieving unsupervised translation that relies only on monolingual corpora in each language (Sennrich et al., 2016). The pseudo-parallel sentence pairs produced by the model at the previous iteration are used to train the new translation model.

Sharing latent representations. Encoders and decoders are (partially) shared between the two languages, which must therefore use the same vocabulary. The entire training of UNMT alternates between back-translation in both directions and denoising for each language.
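To make the denoising and back-translation signals concrete, the following is a minimal Python sketch. The noise model (word dropping plus local shuffling) is illustrative, and model_t2s stands in for any seq2seq translation model; its translate method is a hypothetical interface, not the API of a specific library.

import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3):
    """Word dropping plus local reordering, as in the denoising objective."""
    # Randomly drop words (always keep at least one token).
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # Light shuffle: each word moves at most about `shuffle_k` positions.
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def backtranslation_batch(mono_tgt_sentences, model_t2s):
    """Build pseudo-parallel pairs from target-side monolingual data.

    The model from the previous iteration translates target sentences
    into the source language; the resulting (pseudo-source, target)
    pairs then train the source-to-target model as in supervised NMT.
    `model_t2s.translate` is a hypothetical interface.
    """
    return [(model_t2s.translate(t), t) for t in mono_tgt_sentences]

# Denoising example (runnable as-is):
print(add_noise("unsupervised NMT needs only monolingual corpora".split()))

The denoising auto-encoder trains each language's encoder-decoder to map add_noise(x) back to x, while back-translation alternates between the two directions, each iteration regenerating the pseudo-parallel data with the latest model.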

Recent Advances
USMT and UNMT. Since 2016, statistical MT (SMT) has been significantly surpassed by NMT. Lample et al. (2018c) and Artetxe et al. (2018a) proposed an alternative approach, unsupervised statistical machine translation (USMT). In the unsupervised scenario, the performance of USMT is comparable with that of UNMT. In addition, several works (Marie and Fujita, 2018; Ren et al., 2019; Artetxe et al., 2019) combined UNMT and USMT to improve unsupervised machine translation performance. In WMT-2019, the unsupervised MT task (German-Czech) became an official WMT task for the first time, and the system from NICT (Marie et al., 2019) won first place and achieved state-of-the-art performance by combining USMT and UNMT. However, after advanced pre-training technologies were developed, USMT became less important.

Advanced Pre-Training Technologies. As in other NLP tasks, the quality of language representation pre-training significantly affects the performance of UNMT. Several works focus on improving language representation pre-training. Sun et al. (2019b) proposed to train UNMT jointly with a bilingual word embedding agreement objective. More recently, it has been shown that pre-trained cross-lingual language models (Lample and Conneau, 2019; Song et al., 2019) achieve better UNMT performance than bilingual word embeddings. In high-resource scenarios, UNMT has achieved remarkable performance; however, the performance of low-resource UNMT is still far below expectations.

Multilingualism. To improve low-resource UNMT, multilingual UNMT (MUNMT) has been proposed (Sun et al., 2020; Liu et al., 2020). The translation of low-resource and zero-shot language pairs can be enhanced by similar languages in the shared latent representation. In addition, pivot-based methods have been proposed: Leng et al. (2019) introduced unsupervised pivot translation for distant language pairs. The SJTU-NICT team used monolingual corpora together with parallel corpora of third-party languages to enhance low-resource UNMT performance (Li et al., 2020b), and their system achieved the best performance in the WMT-2020 unsupervised task (Li et al., 2020a).
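As a concrete illustration of the cross-lingual pre-training signal, below is a minimal sketch of masked-language-model token selection over a vocabulary shared by both languages, in the spirit of Lample and Conneau (2019). The MASK_ID constant and the 15%/80-10-10 masking scheme are the conventional BERT-style choices, shown here as assumptions rather than the exact recipe of any cited system.

import random

MASK_ID = 0  # hypothetical ID of the [MASK] token in the shared vocabulary

def mask_tokens(token_ids, vocab_size, mask_prob=0.15):
    """Select ~15% of tokens to predict; corrupt the input accordingly."""
    inputs, targets = [], []
    for tid in token_ids:
        if random.random() < mask_prob:
            targets.append(tid)          # the model must predict this token
            r = random.random()
            if r < 0.8:
                inputs.append(MASK_ID)   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(vocab_size))  # 10%: random token
            else:
                inputs.append(tid)       # 10%: keep unchanged
        else:
            inputs.append(tid)
            targets.append(-100)         # conventional "ignore" label
    return inputs, targets

Because the vocabulary is shared, identical subwords from the two languages receive identical IDs, which encourages the pre-trained model to align the two languages in a common representation space before UNMT training starts.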

Challenges
Most existing works focus on modeling UNMT systems; few investigate why UNMT works and in which scenarios it works. UNMT still has limited performance for distant language pairs and in domain-specific scenarios.
Distant Language Pairs. We will first empirically show that the performance of UNMT on distant language pairs (Chinese/Japanese-English) is much worse than on similar language pairs (German/French-English). Then, we will present two hypotheses: 1) the syntactic structures of distant language pairs are quite different, and without parallel supervision it is very difficult for UNMT to learn the syntactic correspondence; 2) there are too few shared words/subwords in a distant language pair to learn the shared latent representation for UNMT. Finally, we will present some potential solutions, such as 1) syntactic methods (Eriguchi et al., 2016; Chen et al., 2017, 2018) and 2) artificial shared words/code-switching methods (Yang et al., 2020), together with initial results; a sketch of the code-switching idea follows.
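The sketch below illustrates the code-switching idea for distant language pairs: replacing some source words with counterparts from a bilingual lexicon so the two languages share more surface anchors. The toy lexicon is a stand-in; in practice it would be induced in an unsupervised fashion, e.g., from bilingual word embeddings.

import random

lexicon = {"cat": "猫", "drinks": "喝", "milk": "牛奶"}  # toy stand-in lexicon

def code_switch(tokens, lexicon, switch_prob=0.3):
    """Randomly substitute known words with their lexicon translations."""
    return [lexicon[t] if t in lexicon and random.random() < switch_prob
            else t
            for t in tokens]

print(code_switch("the cat drinks milk".split(), lexicon))
# possible output: ['the', '猫', 'drinks', '牛奶']

Training on such mixed sentences gives the shared encoder artificial anchor points between languages that otherwise share almost no subwords.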
Domain Adaptation. Domain adaptation methods for UNMT have not been well studied, although UNMT has recently achieved remarkable results in some specific domains for several language pairs. For UNMT, in addition to the domain mismatch between training data and test data that also exists for supervised NMT, there can be a domain mismatch between the monolingual training data of the two languages. In practice, it is difficult for some language pairs to obtain enough source and target monolingual corpora from the same domain in real-world scenarios.
In this tutorial, we will empirically examine different scenarios for unsupervised domain-specific neural machine translation. Based on these scenarios, we will present and analyze several potential solutions, including batch weighting, data selection, and fine-tuning, to improve the performance of domain-specific UNMT systems (Sun et al., 2019a). The sketch below illustrates one such data selection strategy.
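As one illustration (not necessarily the exact method of Sun et al., 2019a), the following sketch applies Moore-Lewis-style cross-entropy difference selection to pick in-domain monolingual sentences; lm_in_domain and lm_general are hypothetical language-model scorers returning per-token cross-entropy.

def select_in_domain(sentences, lm_in_domain, lm_general, keep_ratio=0.5):
    """Keep the sentences that an in-domain LM finds easiest relative to
    a general-domain LM (lower cross-entropy difference = more in-domain).
    Both `lm_*` objects are hypothetical interfaces with a
    `cross_entropy(sentence)` method."""
    scored = sorted(
        sentences,
        key=lambda s: lm_in_domain.cross_entropy(s) - lm_general.cross_entropy(s),
    )
    return scored[: int(len(scored) * keep_ratio)]

Batch weighting follows the same intuition but, instead of discarding sentences, upweights (or samples more often from) the batches drawn from in-domain data during UNMT training.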
Efficiency. Compared with supervised NMT, UNMT requires substantially longer training time. In addition, learning shared latent representations ties together the performance of the two translation directions, especially for distant language pairs, while denoising dramatically delays convergence by continuously modifying the training data. Efficient training of UNMT is thus an issue that still needs to be solved.

Relevance to the Computational Linguistics Community
This tutorial attempts to review the latest progress on UNMT by introducing its advances and challenges. MT is a classic topic in the NLP community. Recently, UNMT has attracted great interest from researchers in the MT/NLP community and in industry.
This tutorial is primarily aimed at researchers who have a basic understanding of deep-learning-based NLP. We believe that this tutorial will help the audience gain a deeper understanding of UNMT.

Tutorial Outlines
We will present our tutorial in three hours. The detailed tutorial outline is shown in Table 1.

Specification of Any Prerequisites for the Attendees
This tutorial is primarily aimed at researchers who have a basic understanding of NLP and deep learning.

Small reading list
• Neural Machine Translation: the basic method "Neural machine translation by jointly learning to align and translate" (Bahdanau et al., 2015) and the related deep learning background "Deep learning" (LeCun et al., 2015).
• UNMT: the basic methods "Unsupervised neural machine translation" (Artetxe et al., 2018b) and "Unsupervised machine translation using monolingual corpora only" (Lample et al., 2018a), and state-of-the-art UNMT systems (Marie et al., 2019; Li et al., 2020a).

Presenters
His research interest is natural language processing. He has published more than 120 papers at ACL, EMNLP, COLING, ICLR, AAAI, and IJCAI, and in IEEE TPAMI/TKDE/TASLP. He has won first place in several NLP shared tasks, such as CoNLL and SIGHAN Bakeoff, and achieved top rankings on notable machine reading comprehension leaderboards such as SQuAD 2.0 and RACE. He has taught the course "Natural Language Processing" at SJTU for more than 10 years and has given several tutorials, e.g., at EACL-2021 and EMNLP-2021. He served as an area chair for parsing at ACL-2017 and as a (senior) area chair for morphology and word segmentation at ACL-2018/2019.