Improving Multilingual Neural Machine Translation with Auxiliary Source Languages

Multilingual neural machine translation models typically handle one source language at a time. However, prior work has shown that translating from multiple source languages improves translation quality. Unlike existing approaches to multi-source translation, which are limited to test scenarios where parallel source sentences from multiple languages are available at inference time, we propose to improve multilingual translation in a more common scenario by exploiting synthetic source sentences from auxiliary languages. We train our model on synthetic multi-source corpora and apply random masking to enable flexible inference with single-source or bi-source inputs. Extensive experiments on Chinese/English → Japanese and a large-scale multilingual translation benchmark show that our model outperforms the multilingual baseline significantly, by up to +4.0 BLEU, with the largest improvements on low-resource or distant language pairs.


Introduction
Neural machine translation (NMT) has achieved state-of-the-art performance across domains and language pairs (Wu et al., 2016; Bojar et al., 2018; Barrault et al., 2019). One advantage of NMT over statistical machine translation is that it enables information sharing between high-resource and low-resource languages: training a multilingual model on parallel data from multiple language pairs has been shown to improve translation quality, especially for low-resource language pairs (Firat et al., 2016a; Ha et al., 2016; Aharoni et al., 2019).
Figure 1: An example of translating a Chinese sentence into English using Japanese as the auxiliary language. The synthetic Japanese source helps translate the word "我孙子市" (Abiko, a city in Japan), which standard Chinese-English MT models often mistranslate as "my grandson city", while the other words can be translated more accurately from the Chinese source.

Although multilingual NMT models typically handle one language pair at a time during both training and inference (Ha et al., 2016; Johnson et al., 2017), prior work has shown that translating from multiple parallel source sentences can further improve translation quality (Och and Ney, 2001; Zoph and Knight, 2016; Garmash and Monz, 2016; Nishimura et al., 2018). These multi-source translation models exploit multiple source inputs at inference time, but they are limited to the application scenario where the source sentence has already been manually translated into multiple languages. We argue that, in the more common scenario where only one source sentence is provided, we can still improve the translation quality of multilingual NMT models by augmenting the source input with a synthetic sentence generated by translating it into another language. As the example in Figure 1 shows, the additional synthetic sentence can help translate low-frequency and domain-specific words that are difficult to translate directly from the source.
In this paper, we propose a novel bi-source multilingual NMT model that leverages a synthetic source sentence from an auxiliary language to better translate a source sentence into the target language. We train our bi-source NMT model on a synthetic multi-source translation corpus generated by translating the source side of the parallel data into other source languages using pre-trained NMT models. We contribute a novel training algorithm that 1) randomly selects the auxiliary language at each training iteration, which improves the multilinguality of the encoder representations, and 2) randomly masks out the auxiliary sentence during training, so that the model can flexibly perform inference in two modes: a) single-source inference, where the model takes a single source sentence as input, and b) bi-source inference, where we first translate the original source into another language using an NMT model and then feed both source sentences into our model to predict the target translation. This allows end users to trade off translation quality against latency by choosing between the two inference modes.
We experiment on the ASPEC Chinese and English to Japanese translation task and on a large-scale English-to-many translation benchmark that includes 10 language pairs from WMT. Results show that our method is simple yet effective: it improves English→Japanese translation on out-of-domain test sets and outperforms strong baselines by an average of +1.9 BLEU on the English-to-many benchmark. The largest improvements are on low-resource languages, where it brings gains of up to +4.0 BLEU. Further analysis confirms our hypothesis that bi-source inference helps the model disambiguate word senses during translation.

Bi-Source Multilingual NMT
Inspired by prior work on multi-source translation (Zoph and Knight, 2016; Nishimura et al., 2018), we hypothesize that multilingual translation models can benefit from additional synthetic source sentences that are automatically translated from the original source.

Model
Formally, the model computes the probability of the target sentence $y_{l_t}$ in language $l_t$ given the original source sentence $x_{l_s}$ from language $l_s$ and a synthetic source sentence $\hat{x}_{l_a}$ translated from $x_{l_s}$ into an auxiliary language $l_a$ ($l_a \neq l_s$, $l_a \neq l_t$) by an MT model:

$$P(y_{l_t} \mid x_{l_s}, \hat{x}_{l_a}) = \prod_{i} P\big(y_{l_t}^{(i)} \mid y_{l_t}^{(<i)}, f(x_{l_s}, \hat{x}_{l_a}; \Theta_{enc}); \Theta_{dec}\big)$$

where $\Theta_{enc}$ and $\Theta_{dec}$ represent the encoder and decoder parameters, respectively, and $f(\cdot; \Theta_{enc})$ produces the encoder representations of the inputs.
Our encoder-decoder model is based on the Transformer architecture (Vaswani et al., 2017). As shown in Figure 2, we adopt techniques from context-aware machine translation (Voita et al., 2018) to integrate the additional source input into the model: Multi-Encoder Approach encodes the source sentences using separate encoders (Voita et al., 2018) to obtain the hidden representations $f(x_{l_s}; \Theta_{enc}) = H_s$ and $f(\hat{x}_{l_a}; \Theta_{enc}) = H_a$. The decoder then attends to $H_s$ and $H_a$ separately, yielding context vectors $c_i^s = \mathrm{Attn}(h_i^{tgt}, H_s)$ and $c_i^a = \mathrm{Attn}(h_i^{tgt}, H_a)$, and applies a gating mechanism to obtain the fusion vector $h_i$:

$$g_i = \sigma\big(W_g [c_i^s; c_i^a] + b_g\big), \qquad h_i = g_i \odot c_i^s + (1 - g_i) \odot c_i^a$$

where $h_i^{tgt}$ represents the hidden state of the $i$-th target token, $W_g$ and $b_g$ are model parameters, and $\sigma$ is the logistic sigmoid function.
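For illustration, a minimal PyTorch sketch of this gated fusion; tensor shapes, the attention interface, and module names are our assumptions rather than the paper's implementation.

```python
# A minimal sketch of the gating fusion in the multi-encoder approach,
# assuming the decoder's two attention passes have already produced
# context vectors over H_s and H_a.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # W_g and b_g: the gate projects the concatenated context vectors.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, ctx_src: torch.Tensor, ctx_aux: torch.Tensor) -> torch.Tensor:
        """ctx_src/ctx_aux: contexts from attending to H_s and H_a,
        shape (batch, tgt_len, d_model). Returns the fused vector h_i."""
        g = torch.sigmoid(self.gate(torch.cat([ctx_src, ctx_aux], dim=-1)))
        return g * ctx_src + (1.0 - g) * ctx_aux

# Usage: fuse = GatedFusion(512); h = fuse(ctx_src, ctx_aux)
```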
Single-Encoder Approach concatenates the source sentences into a single long sequence (Dabre et al., 2017; Tiedemann and Scherrer, 2017), which is fed through an embedding layer and a stack of self-attention and position-wise feed-forward layers to produce a sequence of hidden representations $f([x_{l_s}; \hat{x}_{l_a}]; \Theta_{enc}) = H$. The decoder then applies the encoder-decoder attention to the full sequence of encoder representations:

$$c_i = \mathrm{Attn}(h_i^{tgt}, H)$$

The single-encoder approach is simpler than the multi-encoder one and can easily be extended to multiple auxiliary languages as inputs.
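A small sketch of how a concatenated bi-source input could be assembled; the target-language tag and separator token formats are illustrative assumptions, as the paper only states that the two sentences are concatenated and fed through a single encoder.

```python
# Assemble the single-encoder input: target-language tag, original source,
# and (optionally) the synthetic auxiliary source after a separator token.
def build_single_encoder_input(src_tokens, aux_tokens, tgt_lang):
    sep = ["<sep>"] if aux_tokens else []
    return [f"<2{tgt_lang}>"] + src_tokens + sep + aux_tokens

# Bi-source input: original Chinese source followed by synthetic Japanese source.
print(build_single_encoder_input(["我", "孙子", "市"], ["我孫子", "市"], "en"))
# Single-source input: the auxiliary part is simply omitted (or masked out).
print(build_single_encoder_input(["我", "孙子", "市"], [], "en"))
```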

Training
Our bi-source multilingual model is trained on a combination of datasets $D = \bigcup_{l_s \in S, l_a \in A, l_t \in T} D_{l_s \times l_a \times l_t}$, where $S$ is the set of source languages, $T$ is the set of target languages, $A$ is the set of auxiliary languages, and $D_{l_s \times l_a \times l_t} = \{(x_{l_s}, \hat{x}_{l_a}, y_{l_t})\}$ is a bi-source translation dataset that can be formed by data augmentation via MT. The objective is to maximize the log-likelihood of the target sentences given the original and auxiliary source sentences:

$$\mathcal{L}(\Theta) = \sum_{(x_{l_s}, \hat{x}_{l_a}, y_{l_t}) \in D} \log P(y_{l_t} \mid x_{l_s}, \hat{x}_{l_a}; \Theta_{enc}, \Theta_{dec})$$

At each training iteration, we randomly pick a triplet of mutually distinct source, auxiliary, and target languages $(l_s, l_a, l_t)$. Next, we randomly sample a batch of training examples $(x_{l_s}, \hat{x}_{l_a}, y_{l_t})$ from $D_{l_s \times l_a \times l_t}$ and maximize the log probability of the target sentence $y_{l_t}$ given the source sentence $x_{l_s}$ and the auxiliary sentence $\hat{x}_{l_a}$. To enable more flexible decoding and to improve model robustness, we randomly mask out the auxiliary sentences with probability $p_{mask}$ during training (we set $p_{mask} = 0.5$ in all our experiments).

Figure 2: An overview of the generation process of the auxiliary source sentence (a), and the single-encoder (b) and multi-encoder (c) approaches for integrating the auxiliary source sentence into the translation model. In the multi-encoder approach, we share the parameters of the two encoders to learn representations in a shared space.

Creating Pseudo Training Data We adopt data augmentation techniques (Sennrich et al., 2016a; Nishimura et al., 2018) to construct the bi-source data from parallel data in multiple language pairs. More specifically, we first train a multilingual NMT model $M_{S \to A}$ to translate between the source and auxiliary languages. Next, we extend each parallel dataset $(x_{l_s}, y_{l_t})$ to a pseudo bi-source dataset $(x_{l_s}, \hat{x}_{l_a}, y_{l_t})$ by translating $x_{l_s}$ into the auxiliary languages $l_a$ using $M_{S \to A}$. Finally, we combine all pseudo bi-source datasets into the training data $D$ for our bi-source model.
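The overall procedure can be summarized with the following sketch; `datasets` and `model.train_step` are hypothetical stand-ins for the bi-source corpora and a fairseq-style update step, while the sampling and masking logic follows the description above.

```python
# A minimal sketch of one training epoch: pick a random language triplet,
# sample a batch, and mask the auxiliary sentence with probability p_mask.
import random

P_MASK = 0.5  # probability of dropping the auxiliary sentence

def train_epoch(model, datasets, num_iterations):
    triplets = list(datasets.keys())  # mutually distinct (l_s, l_a, l_t)
    for _ in range(num_iterations):
        l_s, l_a, l_t = random.choice(triplets)   # pick a language triplet
        x_src, x_aux, y_tgt = datasets[(l_s, l_a, l_t)].sample_batch()
        if random.random() < P_MASK:
            x_aux = None  # mask the auxiliary input -> single-source training
        model.train_step(x_src, x_aux, y_tgt)     # maximize log P(y | x, x_aux)
```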

Inference
Table 1: Domains, provenance (Prov.), and numbers of sentence pairs (#Sent) of the Zh-Ja and En-Ja datasets.

Prior work on multi-source NMT (Zoph and Knight, 2016; Nishimura et al., 2018) assumes access to multi-source inputs at inference time, which has limited its scope of application in the real world. Instead, we test our model in a more realistic scenario where only a single source sentence is provided for each test instance. We experiment with two inference modes: 1) single-source inference, where we provide our model with only a single source sentence; and 2) bi-source inference, where we first augment the source sentence by translating it into an auxiliary language using the NMT model $M_{S \to A}$ and then feed the original and auxiliary source sentences to our bi-source model to generate the target translation.
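The two modes can be summarized as follows; `aux_mt` stands for $M_{S \to A}$ and the `translate` interfaces are hypothetical, but the control flow mirrors the description above.

```python
# A sketch of the two inference modes of the bi-source model.
def translate(bi_source_model, aux_mt, src_sentence, src_lang, aux_lang, tgt_lang,
              mode="bi-source"):
    if mode == "single-source":
        # Fast path: decode directly from the original source only.
        return bi_source_model.translate(src_sentence, None, src_lang, tgt_lang)
    # Bi-source path: first create the synthetic auxiliary sentence, then decode
    # from both sources; higher quality at the cost of an extra decoding pass.
    aux_sentence = aux_mt.translate(src_sentence, src_lang, aux_lang)
    return bi_source_model.translate(src_sentence, aux_sentence, src_lang, tgt_lang)
```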

Data
We evaluate our approach on two translation tasks: Chinese/English→Japanese (Zh/En→Ja) and a large-scale En→X task that translates from English into 10 languages: French (Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu).
Zh/En→Ja We set the source and auxiliary language sets S = A = {Zh, En} and the target language set T = {Ja}. The training data consist of 0.67M sentence pairs for Japanese-Chinese and 3.0M sentence pairs for Japanese-English from the ASPEC corpus (Nakazawa et al., 2016). We use the provided development set and test the models on both the in-domain test set from ASPEC and the out-of-domain test sets shown in Table 1. To train the Chinese→English translation model for data augmentation, we use the training corpora (21.2M sentence pairs) from WMT18 (Bojar et al., 2018), with newstest2017 as the development set and newstest2018 as the test set.
En→X We set the source language set S = {En} and the auxiliary and target language sets A = T = {Fr, Cs, De, Fi, Lv, Et, Ro, Hi, Tr, Gu}. The training data are from the WMT corpora (Bojar et al., 2013, 2014, 2016, 2017, 2018; Barrault et al., 2019). We use all the available parallel data except for the WikiTitles corpus released by WMT19. For French and Czech, we randomly sample 10M sentence pairs from the full data.

Preprocessing
Zh/En→Ja We tokenize English sentences with Moses (Koehn et al., 2007) and segment Chinese and Japanese sentences with Jieba and MeCab, respectively. We remove duplicated sentence pairs from the training corpora, filter them with langid, and remove sentence pairs whose length ratio exceeds 2.0 using clean-corpus-n.perl. We apply byte-pair encoding (Sennrich et al., 2016b) to each language separately with 16K merge operations. Table 1 shows the number of sentence pairs after preprocessing.
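A sketch of these filtering steps, assuming the jieba, mecab-python3, and langid Python packages are installed (Moses tokenization and BPE are omitted for brevity):

```python
# Corpus filtering for a Zh-Ja sentence pair: language-ID check followed by
# a length-ratio filter in the style of clean-corpus-n.perl.
import jieba
import MeCab
import langid

mecab = MeCab.Tagger("-Owakati")  # whitespace-segmented Japanese output

def keep_pair(zh: str, ja: str, max_ratio: float = 2.0) -> bool:
    # Language-identification filter.
    if langid.classify(zh)[0] != "zh" or langid.classify(ja)[0] != "ja":
        return False
    # Length-ratio filter on segmented tokens.
    zh_tokens = jieba.lcut(zh)
    ja_tokens = mecab.parse(ja).split()
    ratio = max(len(zh_tokens), len(ja_tokens)) / max(1, min(len(zh_tokens), len(ja_tokens)))
    return ratio <= max_ratio
```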
En→X We follow the preprocessing steps in Wang et al. (2020): we remove duplicated sentence pairs and the pairs with the same source and target sequences from the training corpora and then tokenize all data using SentencePiece (Kudo and Richardson, 2018) with a shared vocabulary size of 64K tokens. Table 2 shows the training data size after preprocessing and the test set for each language pair.
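The subword step could be reproduced roughly as follows with the sentencepiece Python package; the input path and all options other than the 64K shared vocabulary size are assumptions.

```python
# Train a shared 64K SentencePiece model and encode a sentence with it.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.all.txt",      # concatenated multilingual training text (hypothetical path)
    model_prefix="spm64k",
    vocab_size=64000,           # shared vocabulary of 64K tokens
)
sp = spm.SentencePieceProcessor(model_file="spm64k.model")
pieces = sp.encode("This is a test.", out_type=str)
```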

Training
We use Transformer models (Vaswani et al., 2017) implemented in fairseq.

Zh/En→Ja We use the Transformer base architecture with $d_{model} = 512$, $d_{hidden} = 2048$, $n_{heads} = 8$, $n_{layers} = 6$, and $p_{dropout} = 0.1$, and apply label smoothing of 0.1. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0005, a batch size of 48,000 tokens, and 4,000 warm-up updates, and train for a maximum of 500,000 steps or 50 epochs. We select the best checkpoint based on validation perplexity. During inference, we use beam search with a beam size of 8 and a length penalty of 1.0.

Table 3: BLEU scores on the Zh/En→Ja translation task. We compare our models in the single-source and bi-source inference modes. We boldface the highest scores and underline their ties based on paired bootstrap with p < 0.05 (Clark et al., 2011). Our model with bi-source inference significantly outperforms both the Multilingual baseline and Multilingual + pseudo on En→Ja, and achieves on par performance on Zh→Ja.

Baselines and Evaluation
We compare our method against the following baselines: 1) Bilingual baseline: an NMT model trained on each language pair separately. 2) Multilingual baseline: a multilingual NMT model trained on the Zh-Ja and En-Ja data for Zh/En→Ja, and on all English-centric data for En→X. 3) Multilingual + pseudo: a multilingual NMT model trained on the concatenation of the original parallel data $(x_{l_s}, y_{l_t})$ and the pseudo data $(\hat{x}_{l_a}, y_{l_t})$. 4) Multilingual + pivot: a multilingual NMT model with pivot decoding, which first translates the source into the auxiliary language and then translates from the auxiliary language into the target. For all multilingual models, we add a target language tag and use temperature-based sampling (Aharoni et al., 2019) with temperature τ = 5. We evaluate translation quality using sacreBLEU (Post, 2018); for Japanese, we tokenize with MeCab before computing BLEU.
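To make the effect of the temperature concrete, the following sketch computes the sampling distribution over language pairs implied by τ = 5; the data sizes in the example are placeholders, not the actual corpus statistics.

```python
# Temperature-based sampling: each language pair is sampled with probability
# proportional to its data size raised to 1/tau, which upsamples low-resource
# pairs relative to proportional (tau = 1) sampling.
def sampling_probs(sizes, tau=5.0):
    weights = [n ** (1.0 / tau) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Placeholder sizes: a high-resource pair (10M pairs) vs a low-resource one (11K).
print(sampling_probs([10_000_000, 11_000]))  # ~[0.80, 0.20] instead of ~[0.999, 0.001]
```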

Zh/En→Ja Results

As shown in Table 3, our single-encoder model with single-source inference outperforms Multilingual + pseudo by 0.7 BLEU on the En→Ja query test set, with on par performance on the other test sets. The multi-encoder variant achieves performance competitive with the single-encoder model on Zh→Ja but obtains significantly lower BLEU on En→Ja. Using bi-source inference with our single-encoder model further improves BLEU by 0.3-0.4 over single-source inference. It significantly outperforms the Multilingual baseline by 0.8-1.8 BLEU on the En→Ja query and news (out-of-domain) test sets, while achieving on par performance on Zh→Ja and the En→Ja science (in-domain) test set. This is probably because English and Japanese are more distant, so adding a high-quality synthetic Chinese source sentence helps translate the domain-specific English words and phrases that are infrequent in the training data.
To better understand the improvements in BLEU, we conduct the following analysis. Our model improves accuracy on low-frequency words. We compute the target word F1 binned by word frequency in the training data (Neubig et al., 2019) on the three out-of-domain test sets. As shown in Figure 3, on En→Ja, where our model obtains the largest BLEU improvements, the largest gains over the baseline models are on low-frequency words: in the news domain, the largest improvements are on words with frequency between 10 and 50, while in the query domain, the model improves more on words with frequency between 50 and 100. It also improves F1 on rare words with frequency below 10, but not as much as on words with frequency above 10. In addition, bi-source inference improves over single-source inference more on low-frequency words on the En→Ja news set. On the Zh→Ja news set, the largest gain is on medium-frequency words (Figure 6 in the Appendix).
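The binned F1 analysis follows the idea popularized by compare-mt (Neubig et al., 2019); a minimal sketch is shown below, with illustrative bin boundaries matching those discussed above (function and variable names are ours).

```python
# Compute target word F1 per training-frequency bin: for each bin, count
# matched word occurrences between hypotheses and references, then combine
# precision and recall into F1.
from collections import Counter

def binned_f1(hyps, refs, train_freq, bins=((0, 10), (10, 50), (50, 100))):
    scores = {}
    for lo, hi in bins:
        in_bin = lambda w: lo <= train_freq.get(w, 0) < hi
        match = hyp_total = ref_total = 0
        for hyp, ref in zip(hyps, refs):
            h = Counter(w for w in hyp if in_bin(w))
            r = Counter(w for w in ref if in_bin(w))
            match += sum((h & r).values())  # clipped matches
            hyp_total += sum(h.values())
            ref_total += sum(r.values())
        prec = match / hyp_total if hyp_total else 0.0
        rec = match / ref_total if ref_total else 0.0
        scores[(lo, hi)] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```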
The analysis above also suggests that the bi-source training objective learns more aligned representations across languages, which explains the superiority of our model over the multilingual baselines even with single-source inference.

Bi-Source Inference
As shown in Table 4, adding an auxiliary source sentence improves BLEU over single-source inference on most target languages except French, Czech, and Romanian (we use the same model in single-source inference mode to generate the auxiliary sentences). It achieves an average improvement of +0.5 BLEU over single-source inference and outperforms the multilingual baselines by +1.9 BLEU on average and by up to +4.0 BLEU on low-resource languages such as Gujarati.

Figure 5 shows the BLEU improvements from adding different auxiliary languages over single-source inference. The choice of the auxiliary language has little impact on the BLEU improvement. To explain this phenomenon, we conduct the following analysis to verify that the performance gains come from the additional source information provided by the auxiliary sentence. We compare single-source and bi-source inference on synthetic noisy test sets in which we randomly mask τ% of the source words (τ ∈ {5, 10}). As shown in Table 5, with single-source inference, BLEU drops by 5.1 and 9.9 points after masking 5% and 10% of the source words, respectively. With the help of the auxiliary language, the drop becomes smaller: it is reduced by 0.4 and 0.9 points when 5% and 10% of the source words are masked, respectively. These results indicate that our model effectively leverages the complementary information in the auxiliary sentence, which compensates for the missing source information. Furthermore, they suggest that the cross-lingual representations in our model are well-aligned, enabling it to combine information from both the source and auxiliary sentences. This also explains why the choice of the auxiliary language has little impact on BLEU: as the representations of auxiliary sentences from different languages are close in the hidden space, they complement the source context similarly.
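The noisy test sets described above can be generated with a simple word-masking procedure; a sketch follows, where the choice of "<mask>" as the replacement token is an assumption for illustration.

```python
import random

def mask_source(tokens, tau, rng=random.Random(0)):
    """Randomly replace tau% of the source words with a mask token."""
    n_mask = max(1, round(len(tokens) * tau / 100))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return ["<mask>" if i in masked else tok for i, tok in enumerate(tokens)]

# e.g. mask 10% of the words in a tokenized source sentence
print(mask_source("this is a longer example source sentence for testing".split(), 10))
```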
Typological Analysis To better understand which target languages benefit the most from bi-source inference, we compute Spearman's correlation between the average BLEU improvement on each target language and several features: 1) the training data size of each language pair, and 2) the linguistic distance between the source (English) and target languages, measured by the geographic distance, the genetic distance based on the world language family tree, the syntactic distance, and the phonological distance from the URIEL Typological Database (Littell et al., 2017). Results show that geographic distance correlates best with the BLEU improvement, with a correlation score of 0.74, which suggests that more distant language pairs benefit more from the auxiliary source sentences. In addition, the BLEU improvement correlates negatively with the training data size, with a correlation score of -0.57, which suggests that lower-resource language pairs gain more from bi-source inference. The genetic, syntactic, and phonological distances do not correlate well with the BLEU improvement (we consider a correlation weak if its absolute value is below 0.4).

Word Sense Disambiguation To test whether bi-source inference helps disambiguate word senses, we compare our En→X model under single-source and bi-source inference with the Multilingual baseline on MuCoW (Raganato et al., 2019), a word sense disambiguation test suite. Table 6 shows that our model with bi-source inference achieves higher coverage scores than both its single-source counterpart and the Multilingual baseline on En→Cs and En→De. This confirms our hypothesis that adding an auxiliary language input at inference time helps disambiguate word senses.
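For reference, the typological correlation above could be reproduced along the following lines, assuming the scipy and lang2vec packages (lang2vec exposes the URIEL distances); the BLEU improvements in the example are placeholders, not our results.

```python
# Correlate per-language BLEU gains with URIEL geographic distance to English.
from scipy.stats import spearmanr
import lang2vec.lang2vec as l2v

# Placeholder BLEU improvements per target language (ISO 639-3 codes).
bleu_gain = {"fra": 0.1, "ces": 0.2, "deu": 0.5, "fin": 1.0, "guj": 4.0}
geo = [l2v.distance("geographic", "eng", code) for code in bleu_gain]
rho, p = spearmanr(geo, list(bleu_gain.values()))
print(f"Spearman correlation with geographic distance: {rho:.2f}")
```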

Related Work
Since the recent success of end-to-end NMT models (Sutskever et al., 2014; Bahdanau et al., 2015), multilingual NMT has become a promising research direction. Dong et al. (2015) propose to perform one-to-many translation using a dedicated decoder for each target language. Firat et al. (2016a) further extend it to support many-to-many translation using language-specific encoders and decoders with a shared attention module. Ha et al. (2016) and Johnson et al. (2017) show that training a shared encoder-decoder model for many-to-many translation allows translation between unseen language pairs. More advanced techniques for further improving translation quality include optimizing parameter sharing strategies (Gu et al., 2018; Sachan and Neubig, 2018) and multi-stage fine-tuning for low-resource translation (Dabre et al., 2019). Although we focus on improving the overall translation quality of a shared multilingual NMT model in this paper, our approach can also be combined with the aforementioned techniques to build better language-specific NMT models via fine-tuning, which we leave for future work.
Orthogonal to these techniques, multi-source translation (Och and Ney, 2001; Zoph and Knight, 2016; Garmash and Monz, 2016) has been shown to improve translation quality by exploiting source sentences manually translated into multiple languages. Most studies assume access to multi-source inputs during both training and inference. Choi et al. (2018) and Nishimura et al. (2018) introduce data augmentation methods to fill in the missing source sentences in the training data. Firat et al. (2016b) explore translating the source into a pivot language and feeding both the original source and the pivot sentence to a multilingual model to improve zero-resource translation; however, the pivot sentence is added only at inference time, so the approach is better suited to the zero-resource setting. More recently, Taitelbaum et al. (2019) show that translating the source word into auxiliary languages improves word translation.
Our work is also related to multi-task learning for machine translation. Tu et al. (2017) propose multi-task learning with an auxiliary reconstruction objective that reconstructs the source sentence from the decoder hidden states. Niu et al. (2019) further show that adding a reconstruction objective that back-translates the target sentences into the source helps low-resource translation. Zhou et al. (2019) propose multi-task training with a denoising objective to improve the robustness of NMT models. Wang et al. (2020) show that multi-task learning with two additional denoising tasks on monolingual data can effectively improve translation quality. Our training strategy can also be viewed as multi-task learning, since we train our multilingual model on single-source and bi-source inputs jointly.

Conclusion
We introduced a novel bi-source multilingual translation model that exploits an additional source input from an auxiliary language to improve translation quality. Our model can flexibly perform single-source inference and bi-source inference, where it takes both the original source and a synthetic source sentence from an auxiliary language as input. Experiments show that our method is simple yet effective: it substantially improves the translation quality of multilingual models, with the largest improvements on low-resource or distant language pairs. Further analysis indicates that adding an auxiliary language input during inference helps the model disambiguate source words. This work also sheds new light on multilingual NMT training, as our multi-source training strategy brings substantial improvements over the multilingual baseline even without auxiliary inputs at inference time.

Figure 6: Target word F1 score binned by word frequency in the training data on Zh→Ja.