Language Tags Matter for Zero-Shot Neural Machine Translation

Multilingual Neural Machine Translation (MNMT) has attracted widespread interest due to its efficiency. An exciting advantage of MNMT models is that they can also translate between unsupervised (zero-shot) language directions. Language tag (LT) strategies are often adopted to indicate the translation directions in MNMT. In this paper, we demonstrate that LTs are not only indicators of translation directions but also crucial to zero-shot translation quality. Unfortunately, previous work tends to ignore the importance of LT strategies. We demonstrate that a proper LT strategy can enhance the consistency of semantic representations and alleviate the off-target issue in zero-shot directions. Experimental results show that by ignoring the source language tag (SLT) and adding the target language tag (TLT) to the encoder, zero-shot translation can achieve a +8 BLEU improvement over other LT strategies on the IWSLT17, Europarl, and TED talks translation tasks.

Unlike bilingual NMT, MNMT requires language-specific signals so that the model can distinguish the translation directions. Ha et al. (2016) first introduced a universal encoder-decoder framework for MNMT with a language-specific coded vocabulary to indicate different languages. The encoder-decoder architecture is identical to that of bilingual models (Bahdanau et al., 2015; Vaswani et al., 2017). To further simplify MNMT models, Johnson et al. (2017) proposed adding language tags (LTs) to the beginning of the input data to indicate the target language. A shared vocabulary can then be learned for all languages, and the training data of different languages can be mixed to train the MNMT model. This greatly simplifies the training and decoding procedure; we call it the LT strategy. This paper investigates the impact of LT strategies on zero-shot translation directions in MNMT (zero-shot MNMT). We conduct translation experiments (Section 3) and visualization analysis (Section 4) on several multilingual benchmarks with different LT strategies. We observe that:
• The TLT is more important than the SLT. The SLT even causes negative effects on zero-shot translation.
T-ENC is identical to Johnson et al. (2017), which adds the TLT to the encoder (source) side. T-DEC places the TLT on the decoder (target) side of the model. S-ENC-T-ENC and S-ENC-T-DEC both place the SLT on the encoder side; the former also places the TLT on the encoder side, while the latter places it on the decoder side.
Our contributions are: (i) We find that LT strategies are crucial for zero-shot MNMT translation quality. Ignoring the SLTs and placing the TLTs on the encoder side achieves the best performance in our experiments. (ii) We conduct extensive visualization analysis to demonstrate that a proper LT strategy enhances the consistency of semantic representations and alleviates the off-target issue (Zhang et al., 2020), thus improving translation quality. To the best of our knowledge, this is the first paper to systematically study the importance of LT strategies for zero-shot translation quality.

Background and Notations
Improving the consistency of semantic representations and alleviating the off-target issue (Zhang et al., 2020) are effective ways to improve zero-shot translation quality (Al-Shedivat and Parikh, 2019; Arivazhagan et al., 2019; Zhu et al., 2020). The semantic representations of different languages should be close to each other to obtain better translation quality (Ding et al., 2017). The off-target issue means that the MNMT model tends to translate input sentences into the wrong languages, leading to low translation quality. Due to its simplicity and efficiency, the LT strategy has become fundamental to MNMT (Dabre et al., 2020). Although previous work adopted different LT strategies (Blackwood et al., 2018; Conneau and Lample, 2019; Liu et al., 2020b), their usage is intuitive and lacks systematic study. In this paper, we investigate four popular LT strategies, namely T-ENC, T-DEC, S-ENC-T-ENC, and S-ENC-T-DEC. Each of them requires only simple modifications to the input data. Table 1 illustrates the strategies with an English-to-Spanish translation pair (Hello World! → ¡Hola Mundo!).
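The four strategies in Table 1 amount to simple string edits on each training pair. A minimal sketch, assuming `<en>`-style tag tokens prepended to the text (the exact tag format is an illustrative assumption; it varies by toolkit):

```python
# Sketch of the four LT strategies applied to one training pair.
# Tag tokens like <en>/<es> are assumptions for illustration.

def apply_lt_strategy(strategy, src, tgt, src_lang, tgt_lang):
    """Return the (encoder input, decoder output) after tagging."""
    slt, tlt = f"<{src_lang}>", f"<{tgt_lang}>"
    if strategy == "T-ENC":        # TLT on the encoder side (Johnson et al., 2017)
        return f"{tlt} {src}", tgt
    if strategy == "T-DEC":        # TLT on the decoder side
        return src, f"{tlt} {tgt}"
    if strategy == "S-ENC-T-ENC":  # SLT and TLT both on the encoder side
        return f"{slt} {tlt} {src}", tgt
    if strategy == "S-ENC-T-DEC":  # SLT on encoder, TLT on decoder
        return f"{slt} {src}", f"{tlt} {tgt}"
    raise ValueError(f"unknown strategy: {strategy}")

for s in ("T-ENC", "T-DEC", "S-ENC-T-ENC", "S-ENC-T-DEC"):
    print(s, apply_lt_strategy(s, "Hello World!", "¡Hola Mundo!", "en", "es"))
```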

Experiment Settings
Datasets We carry out our experiments on the publicly available IWSLT17 (Cettolo et al., 2017), TED talks (Qi et al., 2018), and Europarl v7 (Koehn, 2005) datasets. Table 2 shows an overview of the datasets. We choose four different languages (English included) for both IWSLT17 and Europarl, and 20 languages for TED talks. All the training data are English-centric parallel data, meaning either the source or the target side of each sentence pair is English. This yields 6, 6, and 342 zero-shot translation directions, with an average of 145k, 1.96M (M = million), and 187k sentence pairs per direction for the three datasets, respectively. We choose the official tst2017, WMT newstest08, and the TED talks test sets (Qi et al., 2018) as our test sets, respectively. We learn a joint SentencePiece model (Kudo and Richardson, 2018) for sub-word tokenization on all languages with 40,000 merge operations per dataset, and limit the joint vocabulary size to 40,000 for all three datasets.
Settings We use the open-source implementation (Ott et al., 2019) of the Transformer model (Vaswani et al., 2017).
Following the settings of Liu et al. (2020a), we use a 5-layer encoder, 5-layer decoder variant of the Transformer-base model (Vaswani et al., 2017) for TED and IWSLT17. For Europarl v7, we use a standard Transformer-big model (Vaswani et al., 2017).

Table 3: Translation results on the 3 datasets. The supervised and zero-shot columns denote the averaged BLEU score over supervised and zero-shot directions, respectively. The off-target (%) column denotes the averaged percentage of sentences translated into the wrong language in zero-shot directions.

We train each model for 100,000 steps to ensure convergence. We use beam search for decoding with a beam size of 4, and SacreBLEU (Papineni et al., 2002; Post, 2018) to evaluate the translation results. To calculate the percentage of off-target translations, we use the langdetect tool (https://github.com/Mimino666/langdetect) to detect the language of each translated sentence.
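The off-target metric can be sketched as follows; the `detect` callback is left pluggable so that the langdetect tool cited above (or any detector returning ISO language codes) can be dropped in:

```python
# Sketch of the off-target metric: the percentage of hypotheses whose
# detected language differs from the intended target language.

def off_target_rate(hypotheses, target_lang, detect):
    """Percentage of hypotheses detected as a language other than target_lang.

    `detect` maps a sentence to an ISO code such as "it" or "ru";
    with langdetect installed, pass `langdetect.detect` here.
    """
    wrong = sum(1 for hyp in hypotheses if detect(hyp) != target_lang)
    return 100.0 * wrong / len(hypotheses)

# Usage with the cited tool would be:
#   from langdetect import detect
#   rate = off_target_rate(translations, "it", detect)
```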

Experimental Results
We show the translation results on the IWSLT17, Europarl, and TED talks datasets in Table 3. For all three datasets, the different strategies achieve comparable BLEU scores on supervised directions. However, for the zero-shot directions, the BLEU score varies significantly across LT strategies. One observation is that the T-ENC strategy consistently outperforms the other three strategies on all datasets in terms of BLEU score by a large margin, regardless of corpus size and number of languages. In terms of the off-target issue, T-ENC achieves the best performance in most cases. Besides, ignoring the SLT (T-ENC vs. S-ENC-T-ENC) also helps the zero-shot BLEU score. The percentage of off-target translations reaches 94.14% on the IWSLT17 dataset with the S-ENC-T-ENC strategy, but only 9.16% with T-ENC: the model translates almost all sentences into the wrong languages under S-ENC-T-ENC, but into the right languages under T-ENC. This confirms again that the SLT hurts zero-shot translation.
Another interesting observation is that placing the TLT on the encoder side also helps the zero-shot performance: compared with T-ENC, both the translation quality and the off-target rate are significantly worse under T-DEC. We study the reasons behind these observations via visualization analysis in Section 4.

Visualization Analysis
We conduct visualizations on the TED talks data to analyze the impact of different LT strategies on semantic representation consistency and the off-target issue in MNMT.
Enhancing the Semantic Representation Consistency Figure 1 shows the kernel density estimation (KDE) (Parzen, 1962) of the t-SNE-reduced (Van der Maaten and Hinton, 2008) average encoder outputs for different languages. For clearer visualization, we randomly chose 5 source languages (nl, ro, fr, it, ru → zh) instead of all languages. We choose 100 sentences for each language, and each sentence has its corresponding translation in the other 4 languages. Contour lines drawn by a kernel density estimation tool (https://seaborn.pydata.org/generated/seaborn.kdeplot.html) estimate the semantic distribution of the encoder outputs. From Figure 1c, we can see that ignoring the SLT greatly helps the model learn more consistent representations. Comparing Figure 1a and Figure 1b, placing the TLT on the encoder side instead of the decoder side also helps semantic consistency. Both comparisons validate that T-ENC learns the most consistent semantic representations and thus achieves the best BLEU score, which may be why the shape of the contour lines for T-ENC differs significantly from the other strategies.
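The visualization pipeline above can be sketched as follows. To keep the sketch dependency-free and runnable, PCA via SVD stands in for t-SNE, and random arrays stand in for the mean-pooled encoder outputs; the real pipeline would use `sklearn.manifold.TSNE` and `seaborn.kdeplot` as cited.

```python
# Sketch of the Figure 1 pipeline: mean-pooled encoder outputs per sentence,
# reduced to 2-D, ready for a per-language KDE plot. The encoder outputs are
# random placeholders, and PCA (via SVD) stands in for t-SNE.
import numpy as np

rng = np.random.default_rng(0)
langs = ["nl", "ro", "fr", "it", "ru"]
# Placeholder: 100 mean-pooled encoder outputs (dim 512) per source language.
pooled = {lang: rng.normal(loc=i, size=(100, 512)) for i, lang in enumerate(langs)}

# Stack all languages and reduce to 2-D.
X = np.vstack(list(pooled.values()))
X = X - X.mean(axis=0)                 # center before PCA
_, _, vt = np.linalg.svd(X, full_matrices=False)
X2 = X @ vt[:2].T                      # (500, 2) points to feed a KDE plot

print(X2.shape)
# With the real stack this step would be:
#   from sklearn.manifold import TSNE; import seaborn as sns
#   X2 = TSNE(n_components=2).fit_transform(X)
#   sns.kdeplot(x=X2[:, 0], y=X2[:, 1], hue=labels)
```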
Alleviating the Off-target Issue Figure 2 shows the attention visualization of a Russian-to-Italian translation example under different LT strategies. The x-axis is the Italian translation.
In Figure 2a, the T-ENC strategy pays attention to the TLT (here, the token it on the red background) throughout the whole translation procedure (left to right). Compared to T-ENC, both T-DEC and S-ENC-T-DEC pay less attention to the TLT after a few tokens are generated. This validates that placing the TLT on the encoder side helps the model keep track of the target language. S-ENC-T-ENC pays nearly equal attention to the SLT and the TLT, which might confuse the model about which one is the target language. Both comparisons show that the T-ENC strategy best distinguishes the target language, thus alleviating the off-target issue.
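The quantity behind this figure, how much cross-attention mass each decoding step places on the TLT position, can be sketched with a toy attention matrix (a random placeholder for the model's real cross-attention weights):

```python
# Sketch: attention mass on the TLT position across decoding steps.
# The attention matrix here is a random placeholder; in practice it would be
# the model's encoder-decoder cross-attention for one translated sentence.
import numpy as np

rng = np.random.default_rng(1)
num_tgt, num_src = 6, 8                          # decoding steps x source tokens
attn = rng.random((num_tgt, num_src))
attn = attn / attn.sum(axis=1, keepdims=True)    # normalize rows, like softmax

tlt_attention = attn[:, 0]   # assumption: the TLT sits at source position 0
print(tlt_attention.round(3))
# Under T-ENC this vector stays high at every step; under T-DEC and
# S-ENC-T-DEC it decays after the first few generated tokens.
```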
Combining Both Semantic Consistency and the Off-target Issue Figure 3 visualizes the cosine similarity of the layer-wise encoder and decoder outputs across different languages in the zero-shot setting (English excluded). We sample 100 multi-way parallel sentences from the test set and average the cosine similarity between each pair of languages.
In the many-to-one setting, we randomly select Russian as the target language and translate the other 18 languages into Russian to obtain the model outputs. The similarity improves from encoder layer 0 to 3 and from decoder layer 0 to 4, indicating that semantic consistency improves as the layer goes up. Interestingly, the similarity drops from encoder layer 3 to layer 4. This might be because the decoder interacts directly with the top encoder layer, thus interfering with the top-layer encoder output; the drop is less pronounced for T-ENC than for the other strategies. T-ENC achieves the highest similarity at the last layer, showing that it learns the most consistent semantic representations. In the one-to-many setting, we treat Russian as the source language and translate it into the other 18 languages to obtain the model outputs. The semantic similarity drops as the layer goes up for all four strategies, indicating that the model distinguishes different target languages in higher layers. T-ENC achieves the lowest similarity at the last-layer output among all strategies, showing again that T-ENC best alleviates the off-target issue.
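The layer-wise metric in Figure 3 reduces to the average pairwise cosine similarity between languages' pooled representations at one layer. A sketch with random placeholder states in place of real encoder or decoder outputs:

```python
# Sketch of the Figure 3 metric: average pairwise cosine similarity between
# per-language, per-sentence representations at a single layer.
import numpy as np

def avg_pairwise_cosine(reps):
    """reps: list of (num_sentences, dim) arrays, one array per language.
    Sentences are aligned across languages (multi-way parallel data)."""
    sims = []
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            a, b = reps[i], reps[j]
            cos = np.sum(a * b, axis=1) / (
                np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
            sims.append(cos.mean())   # mean over the 100 aligned sentences
    return float(np.mean(sims))      # mean over all language pairs

rng = np.random.default_rng(0)
# Placeholder: one (100, 512) layer-output matrix for each of 18 languages.
layer_reps = [rng.normal(size=(100, 512)) for _ in range(18)]
print(avg_pairwise_cosine(layer_reps))
```

Running this per layer, for encoder and decoder separately, yields the curves that Figure 3 compares across the four LT strategies.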

Conclusion
We show that language tags in MNMT are not just indicators of translation directions but also significantly impact zero-shot translation quality. Through extensive experiments and visualization analysis, we found that (i) ignoring the SLTs helps the model learn consistent semantic representations, and (ii) placing the TLTs on the encoder side helps the decoder pay more attention to the target language, thus alleviating the off-target issue. Zero-shot translation quality could be improved further by optimizing LT strategies to enhance semantic representation consistency and alleviate the off-target issue; we will explore such methods in future work.

References
Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation.