Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation

Despite advances in multilingual neural machine translation (MNMT), we argue that there are still two major challenges in this area: data imbalance and representation degeneration. The data imbalance problem refers to the imbalance in the amount of parallel corpora across language pairs, especially for long-tail languages (i.e., very low-resource languages). The representation degeneration problem refers to encoded tokens tending to occupy only a small subspace of the full space available to the MNMT model. To address these two issues, we propose Bi-ACL, a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model. We define two modules, named bidirectional autoencoder and bidirectional contrastive learning, which we combine with an online constrained beam search and a curriculum learning sampling strategy. Extensive experiments show that our proposed method is more effective than strong baselines both on long-tail languages and on high-resource languages. We also demonstrate that our approach is capable of transferring knowledge between domains and languages in zero-shot scenarios.


Introduction
Multilingual neural machine translation (MNMT) makes it possible to train a single model that supports translation from multiple source languages into multiple target languages. This has attracted a lot of attention in the field of machine translation (Johnson et al., 2017; Aharoni et al., 2019; Fan et al., 2021). MNMT is appealing for two reasons: first, it can transfer the knowledge learned by the model from high-resource to low-resource languages, especially in zero-shot scenarios (Gu et al., 2019; Zhang et al., 2020a); second, it uses only one unified model to translate between multiple language pairs, which saves on training and deployment costs.

Our source code is available at https://github.com/lavine-lmu/Bi-ACL

Figure 1: Method Overview. Our approach mainly consists of three parts: online constrained beam search, bidirectional autoencoder and bidirectional contrastive learning. Our approach explores the scenario of using only target-side monolingual data and a bilingual dictionary to simultaneously alleviate the data imbalance and representation degeneration issues in a large-scale MNMT model.
Although significant improvements have been made recently, we argue that there are still two major challenges to be addressed: i) MNMT models suffer from poor performance on long-tail languages (i.e., very low-resource languages), for which parallel corpora are insufficient or nonexistent. We call this the data imbalance problem. For instance, 21% of the language pairs in the m2m_100 model (Fan et al., 2021) have a BLEU score of less than 1 and more than 50% have a BLEU score of less than 5; only 13% have a BLEU score over 20. For example, the average BLEU score for the language pairs with Irish as the target language is only 0.09. ii) Degeneration of MNMT models stems from the anisotropic distribution of token representations, i.e., their representations reside in a narrow subset of the entire space (Zhang et al., 2020b). This is called the representation degeneration problem. It can lead to a prevalent issue in large-scale MNMT: the model copies sentences from the source or translates them into the wrong language (off-target problem; Zhang et al., 2020a).
To address the data imbalance problem, prior work has attempted to improve the performance of a machine translation model without using any parallel data. On the one hand, unsupervised machine translation (Lample et al., 2018a,b) attempts to learn models relying only on monolingual data. On the other hand, bilingual dictionaries have been shown to be helpful for machine translation models (Duan et al., 2020; Wang et al., 2022). What these approaches have in common is that they only require data that is both more accessible and cheaper than parallel data. As an example, 70% of the languages in the world have bilingual lexicons or word lists available (Wang et al., 2022).
Representation degeneration is a prevalent problem in text generation (Gao et al., 2018) and machine translation models (Kudugunta et al., 2019). Contrastive learning (Hadsell et al., 2006) aims to bring similar sentences close together and push dissimilar sentences far apart in the model's representation space. This is an effective solution to the representation problem in machine translation (Pan et al., 2021; Li et al., 2022). However, the naïve contrastive learning framework that utilizes random non-target sequences as negative examples is suboptimal, because they are easily distinguishable from the correct output (Lee et al., 2020).
To address both problems mentioned above, we present a novel multilingual NMT approach which leverages two plentiful data sources: target-side monolingual data and a bilingual dictionary. Specifically, we start by using constrained beam search (Post and Vilar, 2018) to construct pseudo-parallel data in an online mode. To overcome the data imbalance problem, we propose training a bidirectional autoencoder, while to address representation degeneration, we use bidirectional contrastive learning. Finally, we use a curriculum learning (Bengio et al., 2009) sampling strategy. This uses the score given by token coverage in the bilingual dictionary to rearrange the order of training examples, such that sentences with more tokens in the dictionary are seen earlier and more frequently during training.
In summary, we make the following contributions: i) We propose a novel approach that uses only target-side monolingual data and a bilingual dictionary to improve MNMT performance. ii) We define two modules, bidirectional autoencoder and bidirectional contrastive learning, to address the data imbalance and representation degeneration problems. iii) We show that our method demonstrates zero-shot domain transfer and language transfer capability. iv) We also show that our method is an effective solution for both the repetition (Fu et al., 2021) and the off-target (Zhang et al., 2020a) problems in large-scale MNMT models.

Related Work
Multilingual Neural Machine Translation. MNMT is rapidly moving towards developing large models that enable translation between an increasing number of language pairs. Fan et al. (2021) proposed the m2m_100 model, which enables translation between 100 languages. Siddhant et al. (2022) and Costa-jussà et al. (2022) extend current MNMT models to support translation between more than 200 languages using supervised and self-supervised learning methods.
Autoencoder. An autoencoder (AE) is a generative model that is able to generate its own input. There are many variants of AE that can be useful for machine translation. Zhang et al. (2016) and Eikema and Aziz (2019) propose using a variational autoencoder to improve the performance of machine translation models. A variant of the same generative model is the denoising autoencoder, which is an important component of unsupervised machine translation models (Lample et al., 2018a). However, the utility of autoencoders has not been fully explored for MNMT. To the best of our knowledge, we are the first to propose training an autoencoder using only target-side monolingual data and a bilingual dictionary to improve low-resource MNMT.

Contrastive Learning. Contrastive learning is a technique that clusters similar data together in a representation space while simultaneously separating the representations of dissimilar sentences. It is useful for many natural language processing tasks (Zhang et al., 2022). Recently, Pan et al. (2021) and Vamvas and Sennrich (2021) used contrastive learning to improve machine translation and obtained promising results. However, these methods use random replacement to construct the negative examples, which often leads to a significant divergence between the semantically similar sentences and the ground-truth sentence in the model's representation space. This large divergence makes it more difficult for the model to distinguish correct sentences from incorrect ones. We instead use small perturbations to construct negative examples, ensuring their proximity to the ground-truth sentence within the semantic space, which significantly mitigates the aforementioned issue.

Method
Our goal is to overcome the data imbalance and representation degeneration issues in the MNMT model. We aim to improve the performance of MNMT without using parallel data, instead relying only on target-side monolingual data and a bilingual dictionary. Our approach contains three parts: online pseudo-parallel data construction (Section 3.1), bidirectional autoencoder (Section 3.2) and bidirectional contrastive learning (Section 3.3). Figure 1 illustrates the overview of our method. The architectures of the bidirectional autoencoder (left) and bidirectional contrastive learning (right) are presented in Figure 2.

Online Pseudo-Parallel Data Construction
Let us assume that we want to improve performance when translating from source language $\ell_s$ to target language $\ell_t$. We start with a monolingual set of sentences from the target language, denoted as $D^{\ell_t}_{mono}$, a bilingual dictionary, denoted as $D^{\ell_t \to \ell_s}_{dict}$, and a target monolingual sentence with $T$ tokens, denoted as $X^{\ell_t}_i = \{x_1, \dots, x_T\} \in D^{\ell_t}_{mono}$. We use lexically constrained decoding (i.e., constrained beam search; Post and Vilar, 2018) to generate a pseudo source language sentence $X^{\ell_s}_i = \{x_1, \dots, x_S\}$ in an online mode:

$$X^{\ell_s}_i = \mathrm{gen}(X^{\ell_t}_i, D^{\ell_t \to \ell_s}_{dict}; \theta) \qquad (1)$$

where $\mathrm{gen}(\cdot)$ is the lexically constrained beam search function and $\theta$ denotes the parameters of the model. It is worth noting that the parameters $\theta$ are not updated during the generation process, but are updated in the following steps (Section 3.2 and Section 3.3).
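The dictionary-lookup step that supplies the decoding constraints can be sketched as follows. This is an illustrative sketch, not the released implementation; the function name and the plain-dict interface are our own assumptions:

```python
def lexical_constraints(target_sentence, tgt2src_dict):
    """Collect source-language words that constrained beam search must
    include in the pseudo source sentence, by looking up every token of
    the target monolingual sentence in the bilingual dictionary."""
    constraints = []
    for token in target_sentence.lower().split():
        src_word = tgt2src_dict.get(token)  # None if the token is uncovered
        if src_word is not None and src_word not in constraints:
            constraints.append(src_word)
    return constraints
```

The collected words would then be tokenized and handed to a lexically constrained decoder, e.g. the dynamic beam allocation of Post and Vilar (2018) as exposed by common toolkits.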

Bidirectional Autoencoder
An autoencoder (Vincent et al., 2008) first aims to learn how to efficiently compress and encode data, then to reconstruct the data back from the reduced encoded representation to a representation that is as close as possible to the original input.
We propose performing autoencoding using only target-side monolingual data. This is different from prior work on UNMT, which uses both source and target-side data (Lample et al., 2018a). Our bidirectional autoencoder contains two parts: a backward autoencoder (Section 3.2.1) and a forward autoencoder (Section 3.2.2).

Backward Autoencoder
After we obtain $X^{\ell_s}_i$ from Eq. 1, we have the pseudo-parallel pairs $(X^{\ell_t}_i, X^{\ell_s}_i) \in D_{pse}$. Then, we feed $X^{\ell_t}_i$ to the MNMT model to get the contextual output embedding $Z^{\ell_s}_{bkd}$. Formally, the encoder generates a contextual embedding $H^{\ell_t}_i$ given $X^{\ell_t}_i$ and $\ell_t$ as input, which is in turn given as input to the decoder (together with $\ell_s$) to generate $Z^{\ell_s}_{bkd}$:

$$Z^{\ell_s}_{bkd} = \mathrm{Dec}(\mathrm{Enc}(X^{\ell_t}_i, \ell_t), \ell_s) \qquad (2)$$

Finally, the backward autoencoder loss is formulated as the negative log-likelihood of the pseudo source sentence:

$$\mathcal{L}_{AE\_bkd} = -\log P_\theta(X^{\ell_s}_i \mid X^{\ell_t}_i) \qquad (3)$$

Forward Autoencoder
Given $Z^{\ell_s}_{bkd}$ from Eq. 2, we feed it to the MNMT model and get the contextual output denoted as $Z^{\ell_t}_{fwd}$:

$$Z^{\ell_t}_{fwd} = \mathrm{Dec}(\mathrm{Enc}(Z^{\ell_s}_{bkd}, \ell_s), \ell_t) \qquad (4)$$

The forward autoencoder loss is given by the negative log-likelihood of the original target sentence:

$$\mathcal{L}_{AE\_fwd} = -\log P_\theta(X^{\ell_t}_i \mid Z^{\ell_s}_{bkd}) \qquad (5)$$
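Under the assumption that the MNMT model exposes per-token logits, the two reconstruction losses above can be sketched with NumPy as follows. This is a simplified stand-in: `model` is any callable returning `(T, V)` logits for a token-id sequence, and greedy argmax replaces beam search for the intermediate backward output:

```python
import numpy as np

def token_nll(logits, targets):
    """Mean negative log-likelihood of target token ids under the
    logits; logits: (T, V), targets: length-T sequence of token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def bidirectional_ae_loss(model, x_tgt, x_src_pseudo):
    """Backward AE: translate the target sentence and score it against the
    pseudo source (Eq. 3); forward AE: feed the backward output back into
    the model and reconstruct the original target sentence (Eq. 5)."""
    z_src_logits = model(x_tgt, "src")            # X^{l_t} -> Z^{l_s}
    loss_bkd = token_nll(z_src_logits, x_src_pseudo)
    z_src = z_src_logits.argmax(axis=-1)          # greedy intermediate output
    z_tgt_logits = model(z_src, "tgt")            # Z^{l_s} -> Z^{l_t}
    loss_fwd = token_nll(z_tgt_logits, x_tgt)
    return loss_bkd + loss_fwd
```

In the real system the two directions share the m2m_100 parameters and gradients flow through both losses; the sketch only shows how the pseudo-parallel pair enters each term.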

Bidirectional Contrastive Learning
The main challenge in contrastive learning is to construct the positive and negative examples. Naive contrastive learning (Pan et al., 2021) uses random non-target sentences in the same batch as negative examples, which are easily distinguishable from the correct output. Instead, we construct both positive and negative examples by adding small perturbations to the contextual representations; the perturbations ensure that the resulting embeddings are neither trivially close to nor trivially far from the original embedding space. More details on how to generate the perturbations $\delta_i$ and $\zeta_i$ can be found in Appendix A.

Backward Contrastive Learning
Given pseudo-parallel pairs $(X^{\ell_t}_{i'}, X^{\ell_s}_{i'}) \in D_{pse}$ from Eq. 1, we first feed $X^{\ell_t}_{i'}$ to the MNMT model to generate the contextual embedding $H^{\ell_t}_{i'}$. Then, we add a small perturbation $\delta_{bkd}$ to $H^{\ell_t}_{i'}$ to form the negative example, denoted as $H^{-\ell_t}_{i'}$. The contextual output of the decoder $C^{\ell_s}_{bkd}$ is generated by feeding $H^{-\ell_t}_{i'}$ to the decoder, and the positive example $H^{+\ell_s}_{i'}$ is generated by adding another small perturbation $\zeta_{bkd}$. Finally, the backward contrastive learning loss $\mathcal{L}_{CL\_bkd}$ (Eq. 7) contrasts the positive example against the negative example.

Forward Contrastive Learning
After we get $C^{\ell_s}_{bkd}$ from Eq. 6, we feed $C^{\ell_s}_{bkd}$ and the small perturbation $\delta_{fwd}$ to the MNMT model to obtain the contextual output denoted as $C^{\ell_t}_{fwd}$ and the negative example $H^{-\ell_s}_{i'}$. Then, we feed $C^{\ell_t}_{fwd}$ and another small perturbation $\zeta_{fwd}$ to generate a positive example denoted as $H^{+\ell_t}_{i'}$.
Finally, the forward contrastive learning loss $\mathcal{L}_{CL\_fwd}$ (Eq. 9) is defined analogously to the backward case.
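The extracted text omits Eqs. 7 and 9, so the sketch below uses a generic InfoNCE-style formulation with one perturbed positive and one perturbed negative per anchor; the paper's exact loss may differ:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def contrastive_loss(anchor, positive, negative, tau=0.1):
    """InfoNCE-style loss: pull the anchor towards the (perturbed) positive
    and push it away from the (perturbed) negative in representation space.
    tau is the usual temperature hyperparameter."""
    pos = np.exp(cosine(anchor, positive) / tau)
    neg = np.exp(cosine(anchor, negative) / tau)
    return -np.log(pos / (pos + neg))
```

The backward loss would use the decoder output $C^{\ell_s}_{bkd}$ as the anchor with $H^{+\ell_s}_{i'}$ and $H^{-\ell_t}_{i'}$ as positive and negative; the forward loss mirrors this with the forward-direction representations.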

Curriculum Learning
Curriculum learning (Bengio et al., 2009) suggests starting with easier tasks and progressively gaining experience in order to handle more complex tasks, which has proven useful in machine translation (Stojanovski and Fraser, 2019; Zhang et al., 2019; Lai et al., 2022b). In our training process, we first compute token coverage for each monolingual sentence using the bilingual dictionary. This score is used to determine a curriculum to sample the sentences for each batch, so that higher-scored sentences are selected early on during training.
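The coverage score and the resulting ordering can be sketched as follows (illustrative; whitespace tokenization and the exact sampling policy are our own simplifications):

```python
def coverage_score(sentence, dictionary):
    """Fraction of tokens in the monolingual sentence that are covered by
    the bilingual dictionary: the curriculum 'easiness' score."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in dictionary for tok in tokens) / len(tokens)

def curriculum_order(sentences, dictionary):
    """Rearrange training sentences so that higher-coverage (easier)
    sentences are seen earlier during training."""
    return sorted(sentences, key=lambda s: coverage_score(s, dictionary),
                  reverse=True)
```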

Training Objective
The model is trained by minimizing a composite loss built from Eqs. 3, 5, 7 and 9:

$$\mathcal{L} = \lambda(\mathcal{L}_{AE\_bkd} + \mathcal{L}_{AE\_fwd}) + (1-\lambda)(\mathcal{L}_{CL\_bkd} + \mathcal{L}_{CL\_fwd}) \qquad (10)$$

where $\lambda$ is the balancing factor between the autoencoder and contrastive learning components.
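One plausible reading of the composite objective, with λ interpolating between the two components (the experiments use λ = 0.7), is the following; the exact weighting in the original Eq. 10 may differ:

```python
def total_loss(l_ae_bkd, l_ae_fwd, l_cl_bkd, l_cl_fwd, lam=0.7):
    """Composite objective: lam balances the autoencoder losses against the
    contrastive learning losses. This interpolation form is an assumption
    based on the stated role of lambda, not a transcription of Eq. 10."""
    return lam * (l_ae_bkd + l_ae_fwd) + (1.0 - lam) * (l_cl_bkd + l_cl_fwd)
```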

Experiments
Datasets. We conduct three groups of experiments: a bilingual setting, a multilingual setting and a high-resource setting. In the bilingual setting, we focus on improving the performance on a specific long-tail language pair. We choose 10 language pairs at random that have BLEU < 2.5 in the original m2m_100 model and a considerable amount of monolingual data in the target language in news-crawl. The language pairs cover the following languages (ISO 639-1 language codes): en, ta, kk, ar, ca, ga, bs, ko, ka, tr, af, hi, jv, ml. In the multilingual setting, we aim to improve the performance on long-tail language pairs which share the same target language. We randomly select 10 languages where the average BLEU score on the language pairs with the same target language is less than 2.5. For the languages not covered by news-crawl, we use the monolingual data from CCAligned (El-Kishky et al., 2020). The languages we use are: ta, hy, ka, be, kk, az, mn, gu. For the high-resource setting, we aim to validate whether our proposed method also works for high-resource languages.
We randomly select 6 language pairs that cover the following language codes: en, de, fr, cs.
Dictionaries. We extract bilingual dictionaries using the wiktextract tool. For pairs not involving English, we pivot through English: given a source language $\ell_s$ and a target language $\ell_t$, the intersection of the two respective bilingual dictionaries with English creates a bilingual dictionary $D^{\ell_s \to \ell_t}_{dict}$ from $\ell_s$ to $\ell_t$. The statistics of the dictionaries can be seen in Appendix D.1.

Data Preprocessing. For the monolingual data, we first use a language detection tool (langid) to filter out sentences with mixed languages. We proceed to remove the sentences containing at least 50% punctuation and filter out duplicated sentences. To control the influence of corpus size on our experimental results, we limit the monolingual data of all languages to 1M sentences. For the dictionaries, we also use langid to filter out wrong languages on both the source and target side.
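The preprocessing steps above (minus the langid language-identification pass) can be sketched as:

```python
import string

PUNCT = set(string.punctuation)

def clean_monolingual(sentences, max_sents=1_000_000):
    """Filter monolingual data: drop sentences that are at least 50%
    punctuation, deduplicate, and cap the corpus at 1M sentences."""
    seen, kept = set(), []
    for sent in sentences:
        chars = [c for c in sent if not c.isspace()]
        if not chars:
            continue  # drop empty / whitespace-only lines
        punct_ratio = sum(c in PUNCT for c in chars) / len(chars)
        if punct_ratio >= 0.5:
            continue  # mostly punctuation
        if sent in seen:
            continue  # exact duplicate
        seen.add(sent)
        kept.append(sent)
        if len(kept) >= max_sents:
            break
    return kept
```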
Baselines. We compare our method to the following baselines: i) m2m: Using the original m2m_100 model (Fan et al., 2021) to generate translations. ii) pivot_en: Using English as a pivot language, we leverage m2m_100 to translate target-side monolingual data to English and then translate English to the source language. Following this method, we finetune the m2m_100 model using the pseudo-parallel data. iii) BT: Back-translate (Sennrich et al., 2016) target-side monolingual data using the m2m_100 model to generate a pseudo source-target parallel dataset, then finetune the m2m_100 model using this data. iv) wbw_lm: Use a bilingual dictionary, cross-lingual word embeddings and a target-side language model to translate word-by-word and then improve the translation through a target-side denoising model (Kim et al., 2018). v) syn_lexicon: Replace the words in the target monolingual sentence with the corresponding source language words in a bilingual dictionary and use the pseudo-parallel data to finetune the m2m_100 model (Wang et al., 2022).
Implementation. We use the m2m_100 model released in the HuggingFace repository (Wolf et al., 2020). For the wbw_lm baseline, monolingual word embeddings are directly obtained from the fastText website and cross-lingual embeddings are trained using a bilingual dictionary as a supervision signal.
We set λ = 0.7 in all our experiments (the effect of different λ values can be found in Appendix E.5).
Evaluation. We measure case-sensitive detokenized BLEU and perform statistical significance testing as implemented in SacreBLEU. All results are computed on the devtest dataset of Flores101 (Goyal et al., 2022). To evaluate the isotropy of the MNMT model, we adopt the $I_1$ and $I_2$ isotropy measures from Wang et al. (2019), with $I_1(W) \in [0, 1]$ and $I_2(W) \geq 0$, where $W$ is the embedding matrix taken from the model parameters $\theta$. Larger $I_1(W)$ and smaller $I_2(W)$ indicate a more isotropic embedding space in the MNMT model. Please refer to Appendix B for more details on $I_1$ and $I_2$.

Results
Table 1 shows the main results on low-resource language pairs in a bilingual and multilingual setting.
Table 3 shows results on high-resource language pairs in a bilingual setting, while Table 2 presents an isotropic embedding space analysis for the bilingual setting.
Low-Resource Language Pairs in a Bilingual Setting. As shown in Table 1, the baselines perform poorly and several of them are worse than the original m2m_100 model. This can be attributed to the fact that their performance depends on the translation quality in the direction of source language to English and English to target language (pivot_en), the quality in the reverse direction (BT), the quality of cross-lingual word embeddings (wbw_lm) and the token coverage in the bilingual dictionary (syn_lexicon). Our method outperforms the baselines across all language pairs, even when the performance of the language pair is poor in the original m2m_100 model. In addition, using the curriculum learning sampling strategy further improves our model's performance.

Low-Resource Language Pairs in a Multilingual Setting. In the middle part of Table 1, we show the average BLEU scores of all language pairs with the same target language. Our approach consistently shows promising results across all languages.
Based on the results shown in the lower part of the same table, we notice that the BLEU scores obtained in the multilingual setting on a specific language pair outperform the scores obtained in the bilingual setting. For example, we get 3.68 BLEU points for af→ta in the bilingual setting, while we get 4.16 in the multilingual setting. This confirms our intuition that the MNMT model benefits from knowledge transfer between different languages in the multilingual setting (see more details in Section 6.2).

High-Resource Language Pairs in a Bilingual Setting. As shown in Table 3, baseline systems do not perform well on all high-resource pairs, for the same reasons as in the long-tail setting. Our approach outperforms the baselines on all high-resource pairs. In addition, curriculum learning takes full advantage of the original model in the high-resource setting, with stronger gains in performance than in the low-resource setting. Interestingly, our findings reveal that back-translation does not yield optimal results in either the low- or the high-resource setting. For low-resource languages, the performance of the language pair and its reverse direction in the original m2m_100 model is very poor (i.e., nearly zero); consequently, the use of back-translation results in a performance that is inferior to that of m2m_100. For high-resource languages, the language pairs already exhibit strong performance in the original m2m_100 model, which makes it difficult for additional pseudo-parallel data to yield further gains. Another potential concern is that the large amount of monolingual data we employ, coupled with the substantial amount of pseudo-parallel data derived from back-translation, may disrupt the pre-trained model. This observation aligns with the findings of Liao et al. (2021) and Lai et al. (2021).
Statistical Significance Tests. The use of BLEU in isolation as the single metric for evaluating the quality of a method has recently received criticism (Kocmi et al., 2021). Therefore, we conduct statistical significance testing in the low-resource setting to demonstrate the difference as well as the superiority of our method over the baseline systems. As can be seen in Table 1, our method outperforms the baselines by significant margins, which is even more evident in the case study in Table 10. This is because the baseline systems face the serious problems of generating duplicate words (repetition problem) and translating into the wrong language (off-target problem), while our method avoids both.
Isotropy Analysis. It is clear from Table 2 that the embedding space on the encoder side is more isotropic than on the decoder side. This is because we only use the target-side monolingual data to improve the decoder of the MNMT model. Compared to the baseline systems, we get a higher $I_1$ and a lower $I_2$ score, which shows a more isotropic embedding space for our method. An interesting finding is that the difference in isotropy between high-resource language pairs is not significant. This is because the original m2m_100 model already performs very well on high-resource language pairs, so representation degeneration is not substantial for those pairs. In addition, this phenomenon is consistent with the findings in Table 4.

Analysis
In this section, we conduct additional experiments to better understand the strengths of our proposed method. We first investigate the impact of the four loss components on the results through an ablation study (Section 6.1). Then, we evaluate the zero-shot domain transfer ability and language transfer ability of our method (Section 6.2). Finally, we evaluate the impact of two factors (the quality of the bilingual dictionary and the amount of monolingual data) on our proposed method (Section 6.3) and present a case study to show the strengths of our approach in solving the repetition and off-target issues in MNMT models.

Ablation Study
Our training objective function, shown in Eq. 10, contains four loss functions. We perform an ablation study on en→ta, ta→ar and en→de translation tasks to understand the contribution of each loss function. The experiments in Table 4 are divided into four groups, each group representing the number of loss functions. We have the following four findings: i) #1 clearly shows that the bidirectional autoencoder losses ($\mathcal{L}_{AE\_bkd}$ and $\mathcal{L}_{AE\_fwd}$) play a more critical role than the bidirectional contrastive learning losses ($\mathcal{L}_{CL\_bkd}$ and $\mathcal{L}_{CL\_fwd}$) in terms of BLEU score. However, the bidirectional contrastive losses are more important than the bidirectional autoencoder losses in terms of the $I_1$ and $I_2$ scores. This could be the case because contrastive learning aims to improve the MNMT model's isotropic embedding space rather than the translation from source language to target language. ii) Using forward-direction losses results in better translation quality than backward-direction losses (#1). This is because our goal is to improve the performance from source language to target language, which is the forward direction in the loss functions. iii) The more loss functions there are, the better the performance; the combination of all four loss functions yields the best performance. iv) The $I_1$ and $I_2$ scores for high-resource language pairs (en→de) do not change significantly, as the original embedding space is already isotropic.

Domain Transfer and Language Transfer
Motivated by recent work on the domain and language transfer ability of MNMT models (Lai et al., 2022a), we conduct a number of experiments with extensive analysis to validate the zero-shot domain transfer ability, as well as the language transfer ability, of our proposed method. We have the following findings: i) Our proposed method works well not only on the Flores101 datasets (domains similar to the training data of the original m2m_100 model), but also on other domains. This supports the domain transfer ability of our proposed method. ii) We show that the transfer ability is more obvious in the multilingual setting than in the bilingual setting, which is consistent with the conclusion from Table 6 in the multilingual setting. More details can be found in Appendix E.3 and E.4.

Further Investigation
To investigate two other important factors in our proposed method, we conducted additional experiments to evaluate the impact of the quality of the dictionary and the amount of monolingual data. In general, we observe that better performance can be obtained by utilizing a high-quality bilingual dictionary. In addition, the size of the monolingual data used is not proportional to the performance improvement. More details can be found in Appendix E.1 and E.2. Also, compared with the baseline models, our method has strengths in solving the repetition and off-target problems, which are two common issues in large-scale MNMT models. More details can be found in Appendix E.6.

Conclusion
To address the data imbalance and representation degeneration problems in MNMT, we present a framework named Bi-ACL which improves the performance of MNMT models using only target-side monolingual data and a bilingual dictionary. We employ a bidirectional autoencoder and bidirectional contrastive learning, which prove to be effective both on long-tail languages and on high-resource languages. We also find that Bi-ACL shows language transfer and domain transfer ability in zero-shot scenarios. In addition, Bi-ACL demonstrates a paradigm in which an inexpensive bilingual lexicon and monolingual data are fully exploited when no parallel corpora are available, which we believe more researchers in the community should be aware of.

Limitations
This work has two main limitations. i) We only evaluated the proposed method on the machine translation task; however, Bi-ACL should work well on other NLP tasks, such as text generation or question answering, because our framework only depends on a bilingual dictionary and monolingual data, which can easily be found on the internet for many language pairs. ii) We only evaluated Bi-ACL using m2m_100 as a pretrained model. However, we believe that our approach would also work with other pretrained models, such as mT5 (Xue et al., 2021) and mBART (Liu et al., 2020), because the two components we propose (bidirectional autoencoder and bidirectional contrastive learning) can be seen as plugins that could easily be added to any pretrained model.

A Contrastive Learning
Our approach is different from traditional contrastive learning, which takes a ground-truth sentence pair as a positive example and a random non-target sentence pair in the same batch as a negative example. Motivated by Lee et al. (2020), we construct positive and negative examples automatically.

A.1 Negative Example Formulation
As described in Section 3.3, to generate a negative example, we add a small perturbation $\delta_i = \{\delta_1, \dots, \delta_T\}$ to $H_i$, the hidden representation of the source-side sentence. As seen in Eq. 6, the negative example, denoted as $H^{-\ell_t}_{i'}$, is the sum of the original contextual embedding $H^{\ell_t}_{i'}$ of the target language sentence $X^{\ell_t}_{i'}$ and the perturbation $\delta_{bkd}$, which is obtained by maximizing the conditional log-likelihood loss with respect to $\delta$. The resulting negative example is semantically very dissimilar to $X^{\ell_t}_{i'}$, but remains very close to the original hidden representation in the embedding space. Here, $\epsilon \in (0, 1]$ is a parameter that controls the size of the perturbation and $\theta$ denotes the parameters of the MNMT model.

A.2 Positive Example Formulation
As shown in Eq. 6, we create a positive example of the target sentence by adding a perturbation to the hidden state of the target-side sentence. The objective of the perturbation $\zeta_{bkd}$ is to minimize the KL divergence between the perturbed conditional distribution and the original conditional distribution, where $\theta^*$ is a frozen copy of the model parameters $\theta$. As a result, the positive example is semantically similar to $X^{\ell_s}_{i'}$ and dissimilar to the contextual embedding of the target sentence in the embedding space.
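Both perturbations can be viewed as a single normalized gradient step with opposite signs; the sketch below is our simplification of the optimization described above (the paper solves a small constrained optimization for $\delta$ and $\zeta$ instead of taking one step):

```python
import numpy as np

def perturb(hidden, grad, eps, sign=+1.0):
    """One normalized gradient step on a hidden state (a hedged sketch).
    `grad` is the gradient of the relevant loss w.r.t. the hidden state:
      negative example (sign=+1): step along the NLL gradient, so the
        perturbed state is semantically dissimilar yet stays nearby;
      positive example (sign=-1): step against the KL-divergence gradient,
        keeping the perturbed state semantically close to the original."""
    direction = grad / (np.linalg.norm(grad) + 1e-12)
    return hidden + sign * eps * direction
```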

B Evaluation of Isotropy
We use the $I_1$ and $I_2$ scores from Wang et al. (2019) to characterize the isotropy of the output embedding space.

Figure 3: BLEU score statistics of the m2m_100 model (Fan et al., 2021) on the Flores101 dataset (Goyal et al., 2022) for 102 × 101 = 10302 language pairs. Each bar denotes the number of language pairs in the corresponding BLEU score interval.
$$Z(s) = \sum_{i=1}^{n} \exp(s^{\top} w_i)$$

where $Z(s)$ is close to some constant with high probability for all unit vectors $s$ if the embedding matrix $W$ is isotropic ($w_i \in W$). Let $S$ be the set of eigenvectors of $W^{\top}W$. Then

$$I_1(W) = \frac{\min_{s \in S} Z(s)}{\max_{s \in S} Z(s)},$$

and $I_2(W)$ is the sample standard deviation of $Z(s)$ over $s \in S$, normalized by its average $\bar{Z}(s)$. We have $I_1(W) \in [0, 1]$ and $I_2(W) \geq 0$. Larger $I_1(W)$ and smaller $I_2(W)$ indicate a more isotropic word embedding space.
In this work, we randomly select 128 sentences from Flores101 benchmark to compute these two criteria.The results are shown in Table 2.
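A direct NumPy implementation of the two measures as defined above, evaluating $Z(s)$ at the eigenvectors of $W^{\top}W$ (a sketch; the paper computes the scores over sampled sentence representations):

```python
import numpy as np

def isotropy_scores(W):
    """I1 and I2 from Wang et al. (2019). W: (n, d) embedding matrix.
    Z(s) = sum_i exp(s^T w_i) is evaluated at the eigenvectors of W^T W;
    I1 = min Z / max Z in [0, 1], I2 = sample std of Z over its mean."""
    _, eigvecs = np.linalg.eigh(W.T @ W)   # columns are the eigenvectors s
    Z = np.exp(W @ eigvecs).sum(axis=0)    # Z(s) for every eigenvector
    I1 = Z.min() / Z.max()
    I2 = Z.std(ddof=1) / Z.mean()
    return float(I1), float(I2)
```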

C Model Configuration
We use the m2m_100 model with 418M parameters implemented in Huggingface. In our experiments, we use the AdamW (Loshchilov and Hutter, 2018) optimizer and the learning rate is initialized to 2e-5 with a dropout probability of 0.1. We trained our models on one machine with 4 NVIDIA V100 GPUs. The batch size is set to 8 per GPU during training. For a fair comparison, all experiments are trained for 3 epochs.

D.1 Statistics of BLEU scores in m2m_100
Figure 3 shows the BLEU scores of the m2m_100 model on all 10302 supported language pairs. We see that 21% of the language pairs have a BLEU score of almost 0 and more than 50% have a BLEU score of less than 5.

D.2 Statistics of Bilingual Dictionaries
Table 7 shows the size of the bilingual dictionaries used in a bilingual setting.For the multilingual setting, we will publish our code to generate the bilingual dictionary for any language pair.

E Further Analysis

E.1 Quality of the Bilingual Dictionary
To investigate whether the quality of the bilingual dictionary affects the performance of our method, we conduct additional experiments using the Panlex dictionary, a large dataset that covers 5,700 languages. We evaluate the performance on en→ta, ca→ta, ga→bs and ta→tr translation tasks.
As seen in Table 8, using the dictionary mined from Wiktionary results in better performance than using the Panlex dictionary. The reason for this is that, while Panlex supports bilingual dictionaries for many language pairs, we discovered that their quality is quite low, especially when English is not one of the two languages in the pair.

E.2 Amount of Monolingual Data
As described in Section 3.4, we use the bilingual dictionary coverage ϕ as the curriculum to order the training batches. In this section, we aim to investigate how the amount of monolingual data affects the experimental results. A smaller ϕ means a larger amount of monolingual data. We conduct experiments on en→ta, en→kk, ar→ta and ca→ta translation tasks with different values of ϕ.
As the results show, the amount of monolingual data is not proportional to the experimental performance. This is because a large percentage of words in a sentence are not covered by the lexicons in the bilingual dictionaries, so the performance of constrained beam search is limited. This phenomenon is consistent with findings on the effect of pseudo-parallel corpus size in data augmentation (Fadaee et al., 2017) and back-translation (Sennrich et al., 2016), i.e., that machine translation performance is not proportional to the size of the pseudo-parallel corpus.

E.3 Domain Transfer
To investigate the domain transfer ability of our approach, we first conduct experiments on the en→ta, ka→ar and ta→tr translation tasks, and then evaluate the performance in a zero-shot setting on three different domains (TED, QED and KDE), which are publicly available datasets from OPUS (Tiedemann, 2012), and on the Flores101 benchmark. The results are shown in Table 5.
According to Table 5, the performance of the baseline systems is even worse than that of the original m2m_100 model, which suggests that they exhibit neither domain robustness nor domain transfer ability, owing to their poor performance (see Table 1). In contrast, our proposed method works well not only on the Flores101 datasets (whose domains are similar to the training data of the original m2m_100 model), but also on the other domains.

E.4 Language Transfer
To investigate the language transfer ability, we use the model trained on a specific language (pair) to generate text for another language (pair), both in the bilingual and in the multilingual setting. For the bilingual setting, we run experiments to assess the language transfer ability between the en→ta and ar→ta translation tasks. For the multilingual setting, we focus on translation scores between ta and be. The results are shown in Table 6.
As indicated in Table 6, our method outperforms the other baselines both in the bilingual and in the multilingual setting. We also observe that the transfer ability is more pronounced in the multilingual setting than in the bilingual setting, which is consistent with the conclusion drawn from Table 1 for the multilingual setting. We attribute this to the fact that, in the multilingual setting, the target language is shared by all language pairs with that target language, and can thus be seen as common information for all of these pairs.

Table 10 :
Case study (repetition and off-target examples).

E.5 The effect of λ
In Section 3.5, we set a λ to balance the importance of the autoencoding and contrastive losses in our model. Figure 4 shows that the autoencoding loss plays a more important role than the contrastive loss in terms of BLEU. We obtain the best performance with λ = 0.7, both for the long-tail and for the high-resource language pair.
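One plausible form of this λ-weighted combination is a convex mixture of the two losses, sketched below. This is only an illustration of the weighting scheme; the actual objective in Section 3.5 combines the bidirectional autoencoding and contrastive losses, and `combined_loss` is a hypothetical name.

```python
def combined_loss(ae_loss, ctr_loss, lam=0.7):
    """Convex combination: lam * autoencoding loss + (1 - lam) * contrastive loss.

    lam = 0.7 is the value reported to work best; lam closer to 1 emphasizes
    the autoencoding objective, lam closer to 0 the contrastive objective.
    """
    return lam * ae_loss + (1.0 - lam) * ctr_loss

# With lam = 0.7, the autoencoding term dominates the total loss.
print(combined_loss(2.0, 1.0))  # ~1.7: 0.7 * 2.0 + 0.3 * 1.0
```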

E.6 Case Study
We now present qualitative results on how our method addresses the repetition and off-target problems. For the first example in Table 10, we find that the other baseline systems suffer from a severe repetition problem, which we attribute to a poor decoder.
In contrast, our method does not exhibit the repetition problem, most likely because we enhance the representation of the decoder through the bidirectional contrastive loss. For the second example, we show that while the off-target problem is prevalent in the baseline systems, our method provides an effective solution to it.

Table 1 :
Main Results: BLEU scores for low-resource language pairs in the bilingual and multilingual settings, and for 10 randomly selected language pairs in the multilingual setting. Language pairs marked with * in the multilingual setting (ar→ta, ca→ta, af→ta, el→ta, en→kk, hi→kk) are also covered by the bilingual setting. ∆ denotes the improvement over the original m2m_100 model, while Φ denotes the improvement over the bilingual setting. Best results are shown in bold. † and ‡ denote significance over the original m2m_100 model at the 0.05/0.01 level, evaluated by bootstrap resampling.

Table 2 :
Main Results: Isotropy analysis of the embedding space for the ar→ta and ta→tr translation tasks. The definitions of I1 and I2 can be found in Appendix B.

Table 3 :
Main Results: BLEU scores for high-resource language pairs in the bilingual setting.

Table 4 :
Ablation study of the four loss functions on the en→ta and ta→ar translation tasks. "√" means the loss function is included in the training objective, while "×" means it is not. Both the I1 and I2 scores are computed on the decoder side.

Table 5 :
Domain transfer: BLEU scores on en→ta, ka→ar, ta→tr in different domains.

Table 7 :
Statistics of bilingual dictionaries.

Table 8 :
The effect of bilingual dictionary quality on experimental performance in terms of BLEU score.

Table 9 :
The effect of the monolingual corpus size on experimental results in terms of BLEU score. The smaller the value of ϕ (bilingual dictionary coverage), the larger the monolingual corpus.