On the Off-Target Problem of Zero-Shot Multilingual Neural Machine Translation

While multilingual neural machine translation has achieved great success, it suffers from the off-target issue, where the translation is in the wrong language. The problem is more pronounced on zero-shot translation tasks. In this work, we find that failure to encode a discriminative target-language signal leads to off-target translation, and that a closer lexical distance (i.e., a smaller KL-divergence between two languages' token distributions) is associated with a higher off-target rate. We also find that simply isolating the vocabularies of different languages in the decoder alleviates the problem. Motivated by these findings, we propose Language-Aware Vocabulary Sharing (LAVS), a simple and effective algorithm for constructing the multilingual vocabulary that greatly alleviates the off-target problem by increasing the KL-divergence between languages. We conduct experiments on a multilingual machine translation benchmark covering 11 languages. The off-target rate over 90 translation directions is reduced from 29\% to 8\%, while the overall BLEU score improves by an average of 1.9 points, without extra training cost or sacrificing performance on supervised directions. We release the code at https://github.com/PKUnlp-icler/Off-Target-MNMT for reproduction.


Introduction
Multilingual NMT makes it possible to translate among multiple languages using only one model, even for zero-shot directions (Johnson et al., 2017; Aharoni et al., 2019). It has been gaining increasing attention since it greatly reduces the deployment cost of MT systems and enables knowledge transfer among different translation tasks, which is especially beneficial for low-resource languages. Despite this success, off-target translation is a harsh and widespread problem in zero-shot translation with existing multilingual models: for zero-shot directions, the model translates the source sentence into a wrong language, which severely degrades the system's credibility.
As shown in Table 1, the average off-target rate over the 90 directions is 29%, and it reaches 95% for some language pairs (e.g., tr->gu) on the WMT'10 dataset.
Researchers have noticed and worked on solving the problem from different perspectives. For models trained on English-centric datasets, a straightforward method is to add pseudo training data for the zero-shot directions through back-translation (Gu et al., 2019; Zhang et al., 2020). Adding pseudo data is effective since it directly turns zero-shot translation into a weakly supervised task. Despite its effectiveness, it incurs considerable extra cost for generating data and training on the augmented corpus, and the performance on supervised directions is also reported to decrease due to the model-capacity bottleneck (Zhang et al., 2020; Yang et al., 2021). Rios et al. (2020) find that, instead of regarding all languages as one during vocabulary building, language-specific BPE can alleviate the off-target problem, yet it still costs performance on the supervised directions.
In this work, we perform a comprehensive analysis of the off-target problem. We find that failure to encode a discriminative target-language signal leads to off-target translation, and we observe a strong correlation between the off-target rate of a direction and the lexical similarity of the languages involved. A simple solution, separating the vocabularies of different languages in the decoder, decreases lexical similarity among languages and proves to improve zero-shot translation performance. However, it also greatly increases the model size (308M -> 515M) because a much larger embedding matrix is applied to the decoder.
For a better performance-cost trade-off, we further propose Language-Aware Vocabulary Sharing (LAVS), a novel algorithm for constructing the multilingual vocabulary that increases the KL-divergence of token distributions among languages by splitting particular tokens into language-specific ones.
LAVS is simple and effective. It introduces no extra training cost and maintains supervised performance. Our experiments show that LAVS reduces the off-target rate from 29% to 8% and improves the BLEU score by 1.9 points on average over 90 translation directions. Together with back-translation, the performance can be further improved. LAVS is also effective on larger datasets with more languages, such as OPUS-100 (Zhang et al., 2020), where we observe that it greatly improves English-to-Many performance (+0.9 BLEU) in the large-scale setting.

Related Work
Off-Target Problem in Zero-Shot Translation Without parallel training data for zero-shot directions, an MNMT model is easily caught up in the off-target problem (Ha et al., 2016; Aharoni et al., 2019; Gu et al., 2019; Zhang et al., 2020; Rios et al., 2020; Wu et al., 2021; Yang et al., 2021), where it ignores the target-language signal and translates into a wrong language. Several methods have been proposed to eliminate the off-target problem. Zhang et al. (2020) and Gu et al. (2019) resort to different back-translation techniques to generate data for non-English directions. Back-translation is straightforward and effective since it provides pseudo data for the zero-shot directions, but it incurs considerable additional cost for generating data and training on the augmented corpus. Gu et al. (2019) introduced decoder pretraining to prevent the model from capturing spurious correlations, and Wu et al. (2021) explored how language-tag settings influence zero-shot translation. However, the cause of off-target translation remains underexplored.
Vocabulary of Multilingual NMT The vocabulary-building method is essential for multilingual NMT since it decides how texts from different languages are turned into tokens before being fed to the model. Several word-splitting methods, such as Byte-Pair Encoding (Sennrich et al., 2016), WordPiece (Wu et al., 2016), and SentencePiece (Kudo and Richardson, 2018), have been proposed to handle rare words with a limited vocab size. In multilingual NMT, most current studies and models (Conneau et al., 2019; Ma et al., 2021; Team et al., 2022) regard all languages as one and learn a shared vocabulary across languages. Xu et al. (2021a) adopted optimal transport to find the vocabulary with the most marginal utility. Chen et al. (2022) study the relation between vocabulary sharing and label smoothing for NMT. Closely related to our work, Rios et al. (2020) find that training with language-specific BPE that allows token overlap can improve zero-shot scores, at the cost of supervised directions' performance and a much larger vocab, whereas our method brings no extra cost.
To the best of our knowledge, we are the first to explore how vocabulary similarity between languages affects off-target translation in zero-shot MNMT, and to reveal that isolating the vocabulary only in the decoder can alleviate the off-target problem without extra training cost or sacrificing the supervised directions' performance.
3 Delving into the Off-Target Problem

Multilingual NMT System Description
We adopt the Transformer-Big (Vaswani et al., 2017) model as the baseline. For multilingual translation, we add a target-language identifier <XX> at the beginning of the input tokens to encode direction information. We train the model on the English-centric WMT'10 dataset (Callison-Burch et al., 2010). Zero-shot translation performance is evaluated on the Flores-101 (Goyal et al., 2021) dataset. We use a public language detector to identify the sentence-level language and compute the off-target rate (OTR), the ratio of translations that deviate to wrong languages. Full training details can be found in Section 5.1.
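The OTR metric above can be sketched in a few lines; the sentence-level detector is a stand-in assumption here (the paper uses a public detector), so `dummy_detect` below is purely illustrative.

```python
# A minimal sketch of computing the off-target rate (OTR): the fraction of
# translations whose detected language differs from the intended target.
# The `detect` callable is an assumption standing in for the real detector.
def off_target_rate(translations, target_lang, detect):
    """Fraction of translations whose detected language != target_lang."""
    if not translations:
        return 0.0
    off = sum(1 for sent in translations if detect(sent) != target_lang)
    return off / len(translations)

# Toy usage with a dummy detector that "detects" English by a keyword.
def dummy_detect(sent):
    return "en" if "the" in sent.split() else "de"

outputs = ["the cat sat", "der Hund lief", "the dog ran"]
print(off_target_rate(outputs, "de", dummy_detect))  # 2 of 3 are off-target
```

In practice any sentence-level language identifier can be plugged in as `detect`.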

Off-Target Statistics Safari
Off-Target Rate Differs across Directions We first train the multilingual NMT model on the 10 EN-X directions and the 10 inverse directions from WMT'10 simultaneously. Then we test the model on the 90 X-Y zero-shot directions using semantically parallel sentences for the same 10 languages provided by Flores-101. We compute the off-target rate of all directions and list the results in Table 1.

An Off-Target Case
Direction: FR -> DE
Input: <DE> Un sondage effectué auprès de 1 400 personnes avant les élections fédérales de 2010 a révélé que le nombre d'opposants à la transformation de l'Australie en république avait augmenté de 8 % depuis 2008.
Output: A survey of 1400 people prior to the 2010 federal elections revealed that the number of opponents of Australia's transformation into a republic had increased by 8 % since 2008.
Gold: Von den 1.400 Personen, die vor den Bundeswahlen 2010 befragt wurden, hat der Anteil derjenigen, die sich dagegen aussprechen, dass Australien zur Republik wird, seit 2008 um 8 Prozent zugenommen.

Figure 1: A real off-target case observed in our multilingual NMT system. The output is literally English while the intended target is German.
In addition to the individual scores, we split the languages into High (cs, fr, de, fi), Mid (lv, et), and Low (ro, tr, hi, gu) resource groups according to data abundance. We then compute the average OTR of High-to-High, High-to-Low, Low-to-High, and Low-to-Low directions and rank the results: Low-to-Low (50.28%) > High-to-High (27.16%) > Low-to-High (23.18%) > High-to-Low (20.78%). Based on this observation, the language with the lowest resource (gu) contributes a large portion of the off-target cases. This is reasonable, since the model may be unfamiliar with the language identifier <GU>, and the same goes for Low-to-Low translations.
However, it is surprising that translations between high-resource languages suffer from more severe off-target translation than directions involving a low-resource language. There seem to be other factors influencing the off-target phenomenon.
In other words, if data imbalance is not the key factor behind off-target translation between high-resource languages, what are the real reasons and possible solutions? To answer these questions, we delve deeper into real off-target cases.
The Major Symptom of Off-Target When the model encounters an off-target issue, a natural question is which language the model most likely deviates to. We find that across directions, a majority (77%) of the off-target cases are wrongly translated into English, the centric language of the dataset. A small part (15%) of the cases copy the input sentence as output. Our observation agrees with the findings of Zhang et al. (2020). This raises the question of why most off-target cases deviate to English.

Failing to Encode a Discriminative Target Language Signal Leads to Off-Target Considering the encoder-decoder structure of the model, we hypothesize that the encoder fails to encode discriminative target-language information into the hidden representations before passing them to the decoder.
To test the hypothesis, we analyze the output of the trained Transformer's encoder: 1) We choose French as the source language and conduct French-to-Many translation (covering all languages in WMT'10) on Flores-101. 2) We collect all pooled encoder output representations of the French-to-Many translation and project them into 2D space using t-SNE. The visualization is shown in Figure 2.
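The probing setup can be sketched as follows. The shapes and the random stand-in for encoder states are illustrative assumptions; for brevity the 2D projection here uses plain PCA via SVD, whereas the paper uses t-SNE (e.g., `sklearn.manifold.TSNE`).

```python
# Sketch of the probing pipeline: mean-pool encoder hidden states per
# sentence, then project the pooled vectors to 2D for visualization.
import numpy as np

rng = np.random.default_rng(0)

def pooled_encoder_output(n_sent, seq_len=16, d_model=64):
    # Stand-in for running the trained encoder: (n_sent, seq_len, d_model).
    hidden = rng.normal(size=(n_sent, seq_len, d_model))
    return hidden.mean(axis=1)  # mean-pool over the sequence dimension

# One batch of pooled vectors per target direction (fr->de, fr->cs, ...).
pooled = np.concatenate([pooled_encoder_output(50) for _ in range(4)])
pooled -= pooled.mean(axis=0)
# 2D projection via truncated SVD (PCA); the paper instead applies t-SNE,
# e.g. sklearn.manifold.TSNE(n_components=2).fit_transform(pooled).
U, S, Vt = np.linalg.svd(pooled, full_matrices=False)
coords = pooled @ Vt[:2].T
print(coords.shape)  # (200, 2)
```

Each 2D point can then be colored by its target-language identifier to inspect cluster structure.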
The visualization supports our hypothesis. We can tell from the distribution that only representations belonging to the fr-tr and fr-ro directions form tight clusters with clear boundaries. The representations for high/mid-resource language pairs are completely mixed together, including with the fr-en representations. Those languages generally have a higher off-target rate in French-to-Many translation according to Table 1.
The decoder cannot distinguish the target-language signal in the encoder's output when it receives representations from the mixed area. Moreover, during training the decoder generates English far more frequently than other languages, so it allocates a higher prior to English.

Passing a hidden representation similar to an English one can therefore confuse the decoder into generating English regardless of the given target language, which explains why most off-target cases deviate to English: the decoder struggles to tell the correct direction from the encoder's output. We now have a key clue to the off-target issue. The remaining questions are what causes the degradation of the target-language signal in some directions, and whether we can make the representations of different target languages more discriminative to eliminate off-target cases.

Language Proximity Correlates with Zero-Shot Off-Target Rate To explore how off-target translation occurs differently across language pairs, we conduct experiments on a balanced subset of the WMT'10 dataset, which precludes the influence of data size. We randomly sample 500k sentences per direction to form a balanced training set and remove the directions (hi, tr, and gu) that do not have enough sentences.
Language Proximity Is an Important Characteristic of a Translation Direction Our motivation is intuitive: if two languages are close, the probability distributions of n-grams in their tokenized corpora should be nearly identical. Given the large number of distinct n-grams in a corpus, we only consider 1-grams when computing the distribution, which we call the Token Distribution. We use the Kullback-Leibler divergence from the Token Distribution of language B to that of language A to reflect how difficult it is to encode sentences from B using A, which can also be interpreted as (inverse) lexical similarity:

D_KL(B || A) = Σ_{x ∈ V} B(x) log (B(x) / A(x)),  (1)

where V denotes the shared vocabulary and A(x) is the probability of token x in language A. To avoid zero probabilities when computing the Token Distribution, we add 1 to the frequency of every token in the vocabulary as a smoothing factor.
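The measure above can be computed directly from token counts. The toy corpora and vocabulary below are illustrative, not WMT'10 data.

```python
# Sketch of the lexical-similarity measure: add-one-smoothed unigram
# ("token") distributions over a shared vocabulary, compared with KL
# divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)).
import math
from collections import Counter

def token_distribution(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)  # add-one smoothing over the vocab
    return {t: (counts[t] + 1) / total for t in vocab}

def kl_divergence(p, q, vocab):
    return sum(p[t] * math.log(p[t] / q[t]) for t in vocab)

vocab = ["the", "der", "la", "house", "Haus", "maison"]
en = token_distribution(["the", "the", "house"], vocab)
de = token_distribution(["der", "der", "Haus"], vocab)
fr = token_distribution(["la", "la", "maison", "the"], vocab)  # shares "the"

# French shares a token with this toy English corpus, so its divergence
# from English is smaller than German's (higher lexical similarity).
print(kl_divergence(fr, en, vocab) < kl_divergence(de, en, vocab))  # True
```

Smoothing guarantees every vocabulary token has nonzero probability, so the log ratio is always defined.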

Lexical Similarity is related to Off-Target Rate
We compute the KL divergence between language pairs on the training data. After training on the balanced dataset, zero-shot translation is conducted on the Flores-101 dataset. We visualize the results for the top-3 highest-resource languages in WMT'10 (fr, cs, de) for analysis.
As shown in Figure 3, language proximity is highly related to the off-target rate. The Pearson correlation coefficients between the off-target rate and the KL-divergence from target to source for the three x-to-many translations are -0.75±0.02, -0.9±0.03, and -0.92±0.03. The average Pearson correlation over all x-to-many directions is -0.77±0.11. This indicates that a language pair with higher lexical similarity from target to source (i.e., lower KL-divergence) has a higher chance of going off-target than a pair of less similar languages.
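The correlation analysis amounts to a Pearson coefficient between per-direction OTR and KL values. The numbers below are illustrative placeholders, not the paper's measurements.

```python
# Sketch of the correlation analysis: Pearson r between per-direction
# off-target rates and target-to-source KL divergences.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

kl = [0.5, 1.0, 2.0, 3.5, 4.0]        # KL(target || source), per direction
otr = [0.60, 0.45, 0.30, 0.12, 0.08]  # off-target rate, per direction
print(round(pearson(kl, otr), 2))     # strongly negative, as in the paper
```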

Shared Tokens in the Decoder Might Bias the Zero-Shot Translation Direction
The previous section shows a correlation between lexical similarity and the off-target rate of a language pair. We are more interested in whether lexical similarity causes the representation degradation in Figure 2, which in turn causes off-target translation. Greater lexical similarity implies more shared tokens between languages, which makes the decoder output more overlapping tokens during supervised training. This token overlap across different targets in the output space is harmful for zero-shot translation: during training, the decoder may not be able to tell which language it is generating directly from the output tokens. In other words, the relation between the target language and the output tokens is weakened by the tokens shared across target languages, which can cause representation degradation in the encoder and further lead to off-target translation at zero-shot test time.

Separating Vocab of Different Languages is Effective yet Expensive
Based on the previous discussion, one idea is to ease the off-target problem by decreasing the lexical similarity among languages, i.e., by decreasing the number of shared tokens. When building the vocab for a multilingual NMT model, most work regards all languages as one and learns a unified tokenization model. We argue that this leads to a low divergence between token distributions, since many sub-words are shared across languages.
There is an easy method to decrease the shared tokens without changing the tokenization: we can separate the vocab of each language, as shown in Figure 9 in the Appendix. Under this scheme, no two languages share the same token.
As shown in Table 2, with a separate decoder vocab the average off-target rate over the 90 directions is reduced from 29% to 5%, and the BLEU score rises from 10.2 to 12.4. We conduct the same probing experiment on encoder representations with the original WMT'10 dataset. As shown in Figure 4, the representations for different targets are now well separated; the mixed area no longer exists.
We also train a model with separate encoder and decoder vocabs and find that it suffers from worse zero-shot performance than the baseline, which agrees with the findings of Rios et al. (2020).
We believe that without any vocabulary sharing among languages, the model learns a spurious correlation between input language and output language and ignores the target-language identifier during English-centric training.
This experiment supports our assumption in Section 3.5 that shared tokens in the decoder cause the representation problem. However, although isolating all vocabularies achieves a great improvement, it is much more parameter-consuming: in our experiment, the number of parameters increases from 308M to 515M. Based on the previous observations, lexical similarity causes the representation degradation problem and further leads to off-target translation, so our goal is to decrease the lexical similarity. We can achieve this without changing the original tokenizer by splitting shared tokens into language-specific ones.
As shown in Figure 5, instead of splitting all shared tokens, we can choose specific tokens to split. After decoding, we simply remove all language-specific tags to restore the literal output sentence. By adding language-specific tokens, the number of tokens shared between languages decreases, which makes the token distributions more different and thus increases the KL divergence.

Algorithm 1 Language-Aware Vocabulary Sharing
Input: shared vocabulary set V', language list L, the languages' token distributions P, and the number of extra language-specific tokens N.
Output: the output vocabulary set V_out.
1: MaxFreqs = PriorityQueue(length=N) ▷ queue that ranks the input elements from high to low by shared frequency
2: for each token t in V' do
3:   for each language pair (A, B) in L do
4:     push (SharedFreq(t, A, B), A, B, t) into MaxFreqs
5: V_out = V'
6: for each (f, A, B, t) in MaxFreqs do
7:   V_out = V_out ∪ {t_A, t_B}
8: return V_out

SentencePiece tokenization: I love sing @@ing .
LAVS tokenization: I_en love_en sing @@ing_en .
SentencePiece detokenization: I love singing.
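The tagging and tag-stripping steps above can be sketched in a few lines. The `_en` suffix convention follows Figure 6; the particular set of split tokens below is an illustrative assumption, not the set LAVS would actually choose.

```python
# Minimal sketch of LAVS-style tagging: selected shared tokens get a
# language suffix before training, and the suffix is stripped after
# decoding so detokenization yields the literal sentence.
SPLIT_TOKENS = {"I", "love"}  # hypothetical tokens chosen for splitting

def tag(tokens, lang):
    return [f"{t}_{lang}" if t in SPLIT_TOKENS else t for t in tokens]

def untag(tokens):
    # Remove any trailing language suffix to restore the original piece.
    return [t.rsplit("_", 1)[0] if "_" in t else t for t in tokens]

pieces = ["I", "love", "sing", "@@ing", "."]
tagged = tag(pieces, "en")
print(tagged)         # ['I_en', 'love_en', 'sing', '@@ing', '.']
print(untag(tagged))  # ['I', 'love', 'sing', '@@ing', '.']
```

Because tagging is a pure relabeling, the round trip `untag(tag(...))` always restores the original token sequence.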

Optimization Goal
Given the original vocab set V' and language list L, we aim to create a new vocab V that maximizes the average KL divergence over language pairs under the new vocabulary, with the restriction of adding N new language-specific tokens. Our objective becomes:

V = argmax_V (1/|L|^2) Σ_{m,n ∈ L} D_KL(P_V^m || P_V^n)  s.t. |V| = |V'| + N,  (2)

where P_V^m denotes the m-th language's token distribution over vocabulary V; add-one smoothing is applied to avoid zero probabilities. This is a combinatorial optimization problem, and the search space of V is astronomically large.

Greedy Selection Algorithm that Maximizes Divergence Increment
Based on the previous discussion, we propose the Language-Aware Vocabulary Sharing algorithm, listed in Algorithm 1, to add language-specific tokens. Intuitively, LAVS prefers to split shared tokens that have high frequency across languages, which reduces the appearance of shared tokens in the decoder to the maximum extent. First, we maintain a priority queue of token candidates. Second, for each token in the shared vocabulary, we compute the shared token frequency for each language pair and add the (frequency, languageA, languageB, token) tuple to the queue. Last, since the queue ranks elements by frequency, we create language-specific tokens for the top-N tuples and return the new vocab. We give more details in Appendix B.
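The greedy selection can be sketched as follows. This is a simplified rendering of Algorithm 1 under stated assumptions: a token's shared frequency for a pair is taken as the minimum of its per-language frequencies, and all candidates are ranked at once rather than through a fixed-length priority queue.

```python
# Sketch of the LAVS greedy selection: rank (frequency, langA, langB,
# token) candidates and split the top-N shared tokens into per-language
# copies. Data structures are simplified relative to the paper.
import heapq
from itertools import combinations

def lavs(shared_vocab, token_freqs, n_split):
    """token_freqs: {lang: {token: frequency}}; returns the new vocab set."""
    candidates = []
    for token in shared_vocab:
        for a, b in combinations(token_freqs, 2):
            # Simplifying assumption: shared frequency = min of the two.
            shared = min(token_freqs[a].get(token, 0),
                         token_freqs[b].get(token, 0))
            if shared > 0:
                candidates.append((shared, a, b, token))
    vocab = set(shared_vocab)
    # Split the N highest-frequency shared tokens into per-language copies.
    for shared, a, b, token in heapq.nlargest(n_split, candidates):
        vocab.add(f"{token}_{a}")
        vocab.add(f"{token}_{b}")
    return vocab

freqs = {"en": {"in": 900, "song": 50}, "de": {"in": 800, "Lied": 60}}
new_vocab = lavs({"in", "song", "Lied"}, freqs, n_split=1)
print(sorted(new_vocab))  # 'in' gains 'in_en' and 'in_de' copies
```

Here only "in" is shared with high frequency in both languages, so it is the token chosen for splitting.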
The whole tokenization process with LAVS is illustrated in Figure 6. In practice, given an original shared vocab with M tokens, we can first learn a vocab with M − N tokens and then run LAVS to add N language-specific tokens, keeping the vocab size M unchanged.

Vocabulary Building
Vocab Sharing We adopt SentencePiece (Kudo and Richardson, 2018) as the tokenization model. We randomly sample 10M examples from the training corpus with a temperature of 5 (Arivazhagan et al., 2019) across directions and learn a shared vocabulary of 64k tokens.
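Temperature-based sampling rescales each direction's data share before drawing examples; a sketch of the weight computation, with toy corpus sizes, is below.

```python
# Sketch of temperature sampling (T = 5): each direction's sampling
# probability is proportional to its data share raised to 1/T, which
# upsamples low-resource directions relative to their raw share.
def sampling_probs(sizes, temperature=5.0):
    total = sum(sizes)
    weights = [(s / total) ** (1.0 / temperature) for s in sizes]
    z = sum(weights)
    return [w / z for w in weights]

# Toy corpus sizes: one high-resource and two low-resource directions.
sizes = [10_000_000, 500_000, 100_000]
probs = sampling_probs(sizes)
print([round(p, 3) for p in probs])  # low-resource shares are upsampled
```

With T = 1 the probabilities equal the raw data shares; as T grows they approach the uniform distribution.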
Separate Vocab Based on the baseline model's shared vocab, we separate the vocab of each language, forming a 266k vocab.
LAVS We first learn a 54k vocabulary using the same method as for the baseline model and then add 10k language-specific tokens with LAVS.

Training Details of MNMT
Architecture We use the Transformer-Big model (Vaswani et al., 2017) implemented in fairseq (Ott et al., 2019), with d_model = 1024, d_hidden = 4096, n_heads = 16, and n_layers = 6. We add a target-language identifier <XX> at the beginning of the input tokens to indicate the translation direction, as suggested by Wu et al. (2021).
Optimization We train the models using Adam (Kingma and Ba, 2015), with a total batch size of 524,288 tokens, for 100k steps in all experiments on 8 Tesla V100 GPUs. The sampling temperature, learning rate, and warmup steps are set to 5, 3e-4, and 4000, respectively.

Table 4: The zero-shot translation performance (Off-Target Rate, BLEU, and BERTScore) averaged over x-to-many and many-to-x directions using LAVS (Dec) compared to the baseline.
Back-Translation Back-translation is effective at improving zero-shot performance by adding pseudo parallel data generated by the model (Gu et al., 2019; Zhang et al., 2020). For simplicity, we apply offline back-translation to both the baseline and LAVS. With the trained model, we sample 100k English sentences and translate them into the other 10 languages, which creates 100k parallel sentences for every zero-shot language pair and results in a fully connected corpus of 9M sentence pairs. We add the generated data to the training set and train the model for another 100k steps.
Evaluation We report detokenized BLEU using sacrebleu. We also report the off-target rate with a language detector and conduct model-based evaluation using BERTScore (Zhang et al., 2020).

Results
LAVS improves zero-shot translation by a large margin. Tables 3 and 4 list the overall results on both zero-shot and supervised directions. According to Table 3, LAVS improves all x-to-many and many-to-x directions, with a maximum average improvement of -61.6% OTR, +3.7 BLEU, and +0.036 BERTScore over the baseline vocab. It gains an average improvement of -21% OTR, +1.9 BLEU, and +0.02 BERTScore on the 81 zero-shot directions. Compared with the Separate Vocab (Dec) method, which also significantly improves the x-y directions, LAVS does not increase the model size at all.
LAVS with back-translation further improves zero-shot performance. As shown in Table 5, back-translation improves zero-shot performance by a large margin, as expected. Under this setting, LAVS still outperforms Vocab Sharing by 0.4 average BLEU on the zero-shot directions.
We also observe performance degradation in English-to-Many directions for both models compared to not using back-translation, which agrees with the results of Zhang et al. (2020) and Rios et al. (2020). A possible reason is that English-to-Many performance suffers interference as the number of translation tasks increases.

How does LAVS calibrate the direction?
We visualize the encoder-pooled representations of the model with LAVS (Dec) in Figure 7. The distribution is similar to Figure 4: representations for different targets are almost fully separated, suggesting that LAVS works similarly to separating all vocabularies across languages. We also give a case study in Section 6.2.
We further visualize the language identifiers' hidden outputs among high-resource languages and compare the original Vocabulary Sharing with LAVS. As shown in Figure 10 in the Appendix, LAVS encodes more discriminative target-language information into the <XX> token's hidden output.

Case Study
We compare the outputs of different models in Figure 8. The baseline output is off-target, while the LAVS output is in the correct language. From the direct token output of LAVS, we can see that many tokens are language-specific. A model with LAVS can learn the relation between the target-language signal and the corresponding language-specific tokens, which further decreases the probability of off-target translation.

Scalability of LAVS
As shown in Table 6, we explore how the number of language-specific (LS) tokens influences zero-shot performance. The OTR keeps decreasing as the number of LS tokens increases, suggesting that more LS tokens better relieve the off-target issue without harming supervised performance.
To test how LAVS generalizes to datasets with more languages, we compare LAVS and VS on OPUS-100 (Zhang et al., 2020); more experimental details can be found in Appendix D. To alleviate the inference burden, we select all 42 languages with 1M training sentences for evaluation, which results in 1722 zero-shot directions and 84 supervised directions (en-x and x-en). As shown in Table 7, LAVS improves the zero-shot performance under this setting (-14% OTR; detailed results in Table 12 in the appendix). However, the overall performance is much lower than when training on WMT'10: with more languages, the lack of supervision signal becomes more problematic for zero-shot translation. LAVS also improves en-x performance by a large margin (+0.9 BLEU; detailed scores in Table 13 in the appendix); we conjecture that separating the vocab of different languages in the decoder has a positive influence on general en-x performance.

LAVS's Compatibility with Masked Constrained Decoding
We propose another method to prevent off-target translation: masked constrained decoding (MCD). During decoding, the decoder only considers tokens that belong to the target language's vocab in the softmax; the target vocab is computed from the training corpus. We implement MCD for both the original Vocab Sharing and LAVS. The sizes of the different target vocabs are listed in the appendix.
As shown in Table 8, MCD can further improve the zero-shot performance of LAVS (+1.2 BLEU for de-cs, +0.6 BLEU for fr-de). Notably, for some directions such as FR->DE, the benefit of MCD for the baseline model is rather small (+0.1 BLEU). We believe the reason is that the original Vocab Sharing produces many tokens shared between languages, which weakens the effect of the constraint. Thus, with more language-specific tokens, LAVS works better with constrained decoding.
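The masking step of MCD can be sketched as follows; the logit values and vocab indices are toy stand-ins for the real decoder output and per-language vocabularies.

```python
# Minimal sketch of masked constrained decoding (MCD): at each decoding
# step, logits for tokens outside the target language's vocab are set to
# -inf before the softmax, so off-vocabulary tokens get zero probability.
import math

def mask_logits(logits, allowed_ids):
    return [x if i in allowed_ids else float("-inf")
            for i, x in enumerate(logits)]

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]  # token 3 is not in the target-language vocab
target_vocab = {0, 1, 2}       # ids observed for the target language
probs = softmax(mask_logits(logits, target_vocab))
print(probs[3])  # 0.0 -- the off-target token can never be generated
```

Note that without the mask, token 3 would have had the highest probability; the constraint is what forces the output to stay in the target vocabulary.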

Conclusion
In this paper, we delve into the hidden cause of the off-target problem in zero-shot multilingual NMT and propose Language-Aware Vocabulary Sharing (LAVS), which significantly alleviates the problem without extra parameters. Our experiments show that LAVS creates a better multilingual vocab than the original Vocabulary Sharing for multiple languages.

Limitation
LAVS is proposed to overcome the off-target problem among languages that share alphabets, because those languages tend to have more shared tokens after sub-word tokenization. For language pairs without shared tokens, LAVS may not have a direct influence on zero-shot translation, although it can still increase the overall performance for those languages; this needs further exploration.

A Method for Completely Separating Vocab
It is easy to turn a shared vocabulary into separate vocabularies for different languages. As shown in Figure 9, we split a shared token into language-specific tokens if it appears in more than one language.

B Separating Tokens by Frequency
We can also view LAVS from the perspective of the optimization goal. Consider only two languages J and Q, and compute the change in KL divergence when a single shared token i is split into two language-specific tokens:

∆D^i_KL = −J(i) log (J(i)/Q(i)) − Q(i) log (Q(i)/J(i)) + C(λ) = (J(i) − Q(i)) log (Q(i)/J(i)) + C(λ),  (3)

where the new vocabulary contains two copies of the i-th token, one per language, and C(λ) collects the remaining terms involving the smoothing factor λ, which can be treated as a constant. According to Equation 3, splitting a token whose occurrence probabilities in the two languages are more similar leads to a higher increment in the languages' KL divergence: if J(i) ≠ Q(i), either the (J(i) − Q(i)) term or the log term is negative, so their product is negative; if J(i) = Q(i), the term is zero, reaching its maximum. Also, since high-frequency tokens influence the training process far more than near-zero ones, we should first split tokens that appear with high frequency in two or more languages.
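The argument above can be checked numerically: under add-λ smoothing, splitting a token whose probability is similar in both languages (comparing tokens of equal total mass) yields a larger increase in symmetric KL divergence. The distributions and λ below are toy values, and renormalization is skipped since λ is tiny.

```python
# Numeric sanity check for the splitting criterion: equal-probability
# shared tokens give the largest divergence gain when split.
import math

def sym_kl(p, q):
    return (sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
            + sum(qi * math.log(qi / pi) for pi, qi in zip(p, q)))

def split_gain(j, q, i, lam=1e-4):
    """Change in symmetric KL after splitting token i into two
    language-specific copies: each language keeps its own mass for its
    copy, while the other language sees only the smoothing mass lam."""
    j2 = j[:i] + [lam] + j[i + 1:] + [j[i]]   # i_Q slot gets lam, i_J appended
    q2 = q[:i] + [q[i]] + q[i + 1:] + [lam]   # i_Q keeps q[i], i_J gets lam
    return sym_kl(j2, q2) - sym_kl(j, q)

j = [0.25, 0.40, 0.35]
q = [0.25, 0.10, 0.65]
# Tokens 0 and 1 carry the same total mass (0.5), but token 0 has equal
# probability in both languages; splitting it should help more.
print(split_gain(j, q, 0) > split_gain(j, q, 1))  # True
```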

D Experiment on OPUS-100 dataset
OPUS-100 (Zhang et al., 2020) is an English-centric dataset consisting of parallel data between English and 100 other languages. We removed 5 languages (An, Dz, Hy, Mn, Yo) from OPUS, since they are not paired with a dev or test set, and train the models on all remaining data. The training configuration is the same as in our WMT'10 experiments. The baseline vocab size is 64k.
We also implemented the baseline model with a larger vocab (256k), but its performance was much lower than the 64k version, so we keep the vocab size at 64k. For LAVS, we set the number of language-specific tokens to 150k instead of 10k because of the larger number of languages. We evaluate the supervised and zero-shot performance on the Flores-101 dataset.

E Visualize the language identifiers' representation
During zero-shot translation, the language-identifier token <XX> is the only element indicating the correct direction. Similar to the visualization in Section 3.3, in Figure 10 we visualize the <XX> tokens' hidden outputs (instead of the pooled result over all input tokens) during French-to-Many translation among high-resource languages, comparing the original Vocabulary Sharing and LAVS. LAVS encodes more discriminative target-language information into the <XX> token's hidden output, while the original Vocabulary Sharing fails to do so.
In the original Vocabulary Sharing, the mapping between the target-language identifier <XX> and the output tokens is many-to-one, since different languages can share output tokens. With LAVS, the mapping becomes one-to-one for a subset of tokens, pushing the encoder to learn more discriminative representations for the target-language identifier.

Figure 2: Encoder pooled output visualization using t-SNE for French-to-Many translations. The input French sentences are the same for all directions. Note that there are only French sentences on the encoder side.

Figure 3: Scatter plot of off-target rate and KL-divergence for different language pairs. We draw the linear regression line with a 95% confidence interval.

Figure 4: Encoder pooled output visualization using t-SNE for French-to-Many translation with the separate vocab. The result is comparable to Figure 2, which shows the result with the shared vocab.

Figure 5: Illustration of LAVS. Tokens with higher shared frequency are split into language-specific ones.

Figure 6: Illustration of the tokenization and detokenization process with Language-Aware Vocabulary Sharing.

Following Wang et al. (2020), we collect the WMT'10 datasets for training. The devtest split of Flores-101 is used for evaluation. Full dataset information is in Appendix C.
Back-translation also brings much extra cost: the total training time with back-translation is almost twice that of vanilla training. Applying only LAVS brings no extra training cost and does not influence the supervised performance.

Figure 7: The encoder-pooled representations learned by multilingual NMT with LAVS on fr-x directions.

Figure 8: Case study of DE->FR zero-shot translation. The baseline model goes off-target to English. Tokens in blue are language-specific tokens.

Figure 9: Illustration of completely separating the vocabulary of different languages. Note that we do not need to learn a new vocab: given the original shared vocab, we split tokens shared by two or more languages into language-specific ones and obtain a fully separate vocab for each language.

Table 1: Zero-shot off-target rate of the model with traditional vocab sharing on the WMT'10 dataset. High values are in red and low values in blue. The average OTR over the 90 zero-shot directions is about 29%.

Table 2: Average zero-shot results for models with different vocabs. (Dec) means only the decoder uses the separate vocab; (Enc, Dec) means both the encoder and the decoder use it.

Table 5: Results with back-translation.

Table 6: Exploration of the number of language-specific tokens in LAVS (Dec) and the off-target rate on Flores-101. We report the average OTR on zero-shot directions and the average BLEU on supervised directions.

Table 7: Results on the OPUS dataset. We evaluate 1722 zero-shot directions and 84 supervised directions.

Table 12: Detailed zero-shot OTR of the X-to-Many experiment on OPUS-100. Each score is the average OTR from X to the other 41 languages.