Contrastive Learning for Many-to-many Multilingual Neural Machine Translation

Existing multilingual machine translation approaches mainly focus on English-centric directions, while the non-English directions still lag behind. In this work, we aim to build a many-to-many translation system with an emphasis on the quality of non-English language directions. Our intuition is based on the hypothesis that a universal cross-language representation leads to better multilingual translation performance. To this end, we propose mRASP2, a training method to obtain a single unified multilingual translation model. mRASP2 is empowered by two techniques: a) a contrastive learning scheme to close the gap among representations of different languages, and b) data augmentation on both parallel and monolingual data to further align token representations. For English-centric directions, mRASP2 achieves competitive or even better performance than the strong pre-trained model mBART on tens of WMT benchmarks. For non-English directions, mRASP2 achieves an average improvement of 10+ BLEU over the multilingual baseline.


Introduction
Transformer (Vaswani et al., 2017) has achieved decent performance for machine translation with rich bilingual parallel corpora. Recent work on multilingual machine translation aims to create a single unified model to translate many languages (Johnson et al., 2017; Aharoni et al., 2019; Siddhant et al., 2020). Multilingual translation models are appealing for two reasons. First, they are model efficient, enabling easier deployment (Johnson et al., 2017). Further, parameter sharing across different languages encourages knowledge transfer, which benefits low-resource translation directions and potentially enables zero-shot translation (i.e. direct translation between a language pair not seen during training) (Ha et al., 2017; Gu et al., 2019; Ji et al., 2020).

[Figure 1: The proposed mRASP2. It takes a pair of parallel sentences (or an augmented pseudo-pair) and computes the normal cross entropy loss with a multilingual encoder-decoder. In addition, it computes a contrastive loss on the representations of the aligned pair (positive example) and a randomly selected non-aligned pair (negative example).]
Despite these benefits, challenges remain in multilingual NMT. First, previous multilingual NMT models do not always perform as well as their corresponding bilingual baselines, especially on rich-resource language pairs. This performance gap grows with the number of accommodated languages, as model capacity must necessarily be split among them (Arivazhagan et al., 2019). In addition, an optimal setting for multilingual NMT should be effective for any language pair, while most previous work focuses on improving English-centric 1 directions (Johnson et al., 2017; Aharoni et al., 2019). A few recent exceptions trained many-to-many systems by introducing more non-English corpora, through data mining or back translation.
In this work, we take a step towards a unified many-to-many multilingual NMT with only English-centric parallel corpora and additional monolingual corpora. Our key insight is to close the representation gap between different languages to encourage transfer learning as much as possible.
As such, many-to-many translations can make the most of the knowledge from all supervised directions and the model can perform well for both English-centric and non-English settings. In this paper, we propose a multilingual COntrastive Learning framework for Translation (mCOLT or mRASP2) to reduce the representation gap of different languages, as shown in Figure 1.
The objective of mRASP2 ensures the model to represent similar sentences across languages in a shared space by training the encoder to minimize the representation distance of similar sentences. In addition, we also boost mRASP2 by leveraging monolingual data to further improve multilingual translation quality. We introduce an effective aligned augmentation technique by extending RAS (Lin et al., 2020) -on both parallel and monolingual corpora to create pseudo-pairs. These pseudo-pairs are combined with multilingual parallel corpora in a unified training framework.
Simple yet effective, mRASP2 achieves consistent translation performance improvements for both English-centric and non-English directions on a wide range of benchmarks. For English-centric directions, mRASP2 outperforms a strong multilingual baseline in 20 translation directions on WMT test sets. On 10 WMT translation benchmarks, mRASP2 even obtains better results than the strong bilingual mBART model. For zero-shot and unsupervised directions, mRASP2 obtains surprisingly strong results on 36 translation directions 2, with 10+ BLEU improvements on average.

1 "English-centric" means having English as the source or target language.
2 6 unsupervised directions + 30 zero-shot directions.

Methodology

mRASP2 unifies parallel corpora and monolingual corpora with contrastive learning. This section explains our proposed mRASP2. The overall framework is illustrated in Figure 1.

Multilingual Transformer
A multilingual neural machine translation model learns a many-to-many mapping function f to translate from one language to another. To distinguish different languages, we add a language identification token preceding each sentence, on both the source side and the target side. The base architecture of mRASP2 is the state-of-the-art Transformer (Vaswani et al., 2017). Slightly different from previous work, we choose a larger setting with a 12-layer encoder and a 12-layer decoder to increase the model capacity. The model dimension is 1024 with 16 attention heads. To ease the training of this deep model, we apply Layer Normalization to the word embeddings and pre-norm residual connections, following Wang et al. (2019a), in both the encoder and decoder. Our multilingual NMT baseline is therefore much stronger than the Transformer-big model. More formally, we define L = {L_1, ..., L_M} as the collection of M languages involved in the training phase. D_{i,j} denotes a parallel dataset of (L_i, L_j), and D denotes the set of all parallel datasets. The training loss is the cross entropy:

\mathcal{L}_{ce} = -\sum_{(x_i, x_j) \in \mathcal{D}} \log P_\theta(x_j \mid x_i) \qquad (1)

where x_i represents a sentence in language L_i, and θ denotes the parameters of the multilingual Transformer model.
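The language-token convention described above can be sketched as follows. The tag format `<en>` is an illustrative assumption, not necessarily the paper's exact vocabulary entries:

```python
def add_lang_tokens(src: str, tgt: str, src_lang: str, tgt_lang: str):
    """Prefix both sides of a parallel pair with language-identification
    tokens, so a single model can serve many translation directions."""
    return f"<{src_lang}> {src}", f"<{tgt_lang}> {tgt}"

pair = add_lang_tokens("I love you.", "Je t'aime.", "en", "fr")
# pair == ("<en> I love you.", "<fr> Je t'aime.")
```

At inference time, changing only the target-language token switches the translation direction without any change to the model.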

Multilingual Contrastive Learning
The multilingual Transformer implicitly learns a shared representation of different languages. mRASP2 introduces a contrastive loss to explicitly map different languages into a shared semantic space.
The key idea of contrastive learning is to minimize the representation gap between similar sentences and maximize that between irrelevant sentences. Formally, given a bilingual translation pair (x_i, x_j) ∈ D, (x_i, x_j) is the positive example, and we randomly choose a sentence y_j from language L_j to form a negative example 3 (x_i, y_j).

[Figure 2: Aligned augmentation on both parallel and monolingual data by replacing words with synonyms from synonym dictionaries. It either creates a pseudo-parallel example (left) or a pseudo self-parallel example (right).]
The objective of contrastive learning is to minimize the following loss:

\mathcal{L}_{ctr} = -\sum_{(x_i, x_j) \in \mathcal{D}} \log \frac{e^{sim^{+}(R(x_i), R(x_j))/\tau}}{\sum_{y_j} e^{sim^{-}(R(x_i), R(y_j))/\tau}} \qquad (2)

where sim(·) calculates the similarity between sentences; + and − denote positive and negative examples respectively; R(s) denotes the average-pooled encoder output of a sentence s; and τ is the temperature, which controls the difficulty of distinguishing positive from negative examples 4. In our experiments, it is set to 0.1. The similarity of two sentences is calculated as the cosine similarity of their average-pooled encoder outputs. To simplify implementation, negative samples are drawn from the same training batch. Intuitively, maximizing the positive term sim^+(R(x_i), R(x_j)) inside the softmax pulls the semantic representations of the aligned pair close to each other, while the softmax simultaneously pushes down the similarity sim^-(R(x_i), R(y_j)) of non-matched pairs.
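A minimal NumPy sketch of this loss with in-batch negatives, as described above (a simplification of the authors' implementation: here every other target sentence in the batch serves as a negative, and the representations are assumed to be already average-pooled):

```python
import numpy as np

def contrastive_loss(src_repr, tgt_repr, tau=0.1):
    """In-batch contrastive loss over pooled sentence representations.

    src_repr, tgt_repr: (batch, dim) arrays; row i of tgt_repr is the
    positive for row i of src_repr, and all other rows act as negatives.
    """
    src = src_repr / np.linalg.norm(src_repr, axis=1, keepdims=True)
    tgt = tgt_repr / np.linalg.norm(tgt_repr, axis=1, keepdims=True)
    logits = src @ tgt.T / tau                    # cosine sims / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # -log p(positive pair)
```

Lowering tau sharpens the softmax so that hard negatives dominate the loss, matching the footnote: a higher temperature makes positives harder to distinguish from negatives.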
During the training of mRASP2, the model is optimized by jointly minimizing the contrastive loss and the translation loss:

\mathcal{L} = \mathcal{L}_{ce} + \lambda |s| \mathcal{L}_{ctr} \qquad (3)

where λ is the coefficient balancing the two training losses. Since L_ctr is calculated at the sentence level while L_ce is calculated at the token level, L_ctr is multiplied by the average sequence length |s|.

Aligned Augmentation
We now introduce how to improve mRASP2 with data augmentation, which adds noised bilingual and noised monolingual data to multilingual NMT. The two types of training samples are illustrated in Figure 2. Lin et al. (2020) propose the Random Aligned Substitution technique (RAS 5), which builds code-switched sentence pairs (C(x_i), x_j) for multilingual pre-training. In this paper, we extend it to Aligned Augmentation (AA), which can also be applied to monolingual data.

4 A higher temperature increases the difficulty of distinguishing positive samples from negative ones.
For a bilingual or monolingual sentence pair (x_i, x_j) 6, AA creates a perturbed sentence C(x_i) by replacing aligned words using a synonym dictionary 7. For every word contained in the synonym dictionary, we randomly replace it with one of its synonyms with a probability of 90%.
For a bilingual sentence pair (x_i, x_j), AA creates a pseudo-parallel training example (C(x_i), x_j). For monolingual data, AA takes a sentence x_i and generates its perturbation C(x_i) to form a pseudo self-parallel example (C(x_i), x_i). Both (C(x_i), x_j) and (C(x_i), x_i) are then used in training, contributing to both the translation loss and the contrastive loss. For a pseudo self-parallel example (C(x_i), x_i), the contrastive loss is essentially a reconstruction loss from the perturbed sentence to the original one.
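A toy sketch of the AA perturbation C(x). The tiny dictionary and whitespace tokenization are illustrative assumptions; the paper uses real multilingual synonym dictionaries:

```python
import random

# Toy synonym dictionary (En -> Fr); the real dictionaries cover
# many languages and far more entries.
TOY_DICT = {"love": ["aime"], "you": ["toi"]}

def aligned_augment(sentence, dictionary, p=0.9, seed=None):
    """Create a code-switched perturbation C(x): each word found in the
    synonym dictionary is replaced by a random synonym with probability p."""
    rng = random.Random(seed)
    tokens = [
        rng.choice(dictionary[tok]) if tok in dictionary and rng.random() < p
        else tok
        for tok in sentence.split()
    ]
    return " ".join(tokens)
```

For parallel data, the perturbed source is paired with the original target (a pseudo-parallel pair); for monolingual data, it is paired with the unperturbed sentence itself (a pseudo self-parallel pair).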

Experiments
This section shows that mRASP2 achieves substantial improvements over previous many-to-many multilingual translation models on a wide range of benchmarks. In particular, it obtains substantial gains on zero-shot directions.

Settings and Datasets
Parallel Dataset PC32 We use the parallel dataset PC32 provided by Lin et al. (2020). It contains a large public parallel corpus of 32 English-centric language pairs, with 97.6 million sentence pairs in total. We apply AA on PC32 by randomly replacing words in source-side sentences with synonyms from the bilingual dictionaries provided by Lample et al. (2018) 8. For words in the dictionaries, we replace them with one of their synonyms with a probability of 90% and keep them unchanged otherwise. We apply this augmentation in the pre-processing step before training.

[Table 1: Results for Transformer-6 (6 layers for encoder and decoder) are from Lin et al. (2020). Results for Transformer-12 (12 layers each for encoder and decoder) are from Liu et al. (2020). (*) Note that for the En→Ro direction, we follow the previous setting and calculate the BLEU score after removing Romanian diacritics. (**) For mRASP w/o finetune we report the results of our own implementation, with a 12-layer encoder and decoder and our data. Both m-Transformer and our mRASP2 have 12 layers for encoder and decoder.]

Monolingual Dataset MC24
We create a dataset MC24 with monolingual text in 24 languages 9. It is a subset of the Newscrawl 10 dataset, retaining only the languages in PC32, plus three additional languages that are not in PC32 (Nl, Pl, Pt). In order to balance the volume across different languages, we apply temperature-based sampling with probability

p_i = \frac{q_i^{1/T}}{\sum_j q_j^{1/T}}, \qquad q_i = \frac{n_i}{\sum_k n_k},

where n_i is the number of sentences in the i-th language. We then apply AA on the monolingual data. The total number of sentences in MC24 is 1.01 billion. The details of the data volume are listed in the Appendix.
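The temperature-based sampling step can be sketched as follows. The default T=5 is an assumption for illustration; the paper does not state its exact temperature here:

```python
import numpy as np

def sampling_probs(sentence_counts, T=5.0):
    """Temperature-based data sampling: p_i ∝ (n_i / Σ_k n_k)^(1/T).
    T > 1 flattens the distribution, up-sampling low-resource languages
    relative to their raw share of the corpus."""
    n = np.asarray(sentence_counts, dtype=float)
    q = (n / n.sum()) ** (1.0 / T)
    return q / q.sum()
```

With T=5, a language with 32x more data is sampled only 2x more often (since 32^(1/5) = 2), which keeps low-resource languages visible during training.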
We apply AA on MC24 by randomly replacing words in the source-side sentences with synonyms from a multilingual dictionary. The source side may therefore contain tokens from multiple languages (while preserving the semantics of the original sentence), and the target is simply the original sentence. The replacement probability is also set to 90%. We apply this augmentation in the pre-processing step before training. We will release the multilingual dictionary and the script for producing the noised monolingual dataset.
Evaluation Datasets For supervised directions, most of our evaluation datasets are from the WMT and IWSLT benchmarks; for pairs that are not available in WMT or IWSLT, we use OPUS-100 instead.
For zero-shot directions, we use the OPUS-100 zero-shot test set. The test set comprises 6 languages (Ru, De, Fr, Nl, Ar, Zh), resulting in 15 language pairs and 30 translation directions.
We report de-tokenized BLEU with SacreBLEU (Post, 2018). For tokenized BLEU, we tokenize both reference and hypothesis using the Sacremoses 11 toolkit and then report BLEU using the multi-bleu.pl script 12. For Chinese (Zh), the BLEU score is calculated at the character level.

Experiment Results
This section shows that mRASP2 provides consistent performance gains for supervised and unsupervised English-centric translation directions as well as for non-English directions.

English-Centric Directions
Supervised Directions As shown in Table 1, mRASP2 clearly improves over multilingual baselines by a large margin in 10 translation directions. Previously, multilingual machine translation underperformed bilingual translation in rich-resource scenarios. It is worth noting that our multilingual machine translation baseline is already very competitive: it is on par with the strong mBART bilingual model, which benefits from pre-training on a large-scale unlabeled monolingual dataset. mRASP2 further improves the performance. We summarize the key factors for the successful training of our baseline 13 m-Transformer: a) Batch size plays a crucial role in the success of training multilingual NMT. We use 8 × 4 NVIDIA V100 GPUs with an update frequency of 50 to train the models, and each batch contains about 3 million tokens. b) We enlarge the number of layers from 6 to 12 and observe significant improvements for multilingual NMT. By contrast, the gains from increasing the bilingual model size are not as large. mBART also uses 12 encoder and 12 decoder layers. c) We use gradient norm clipping to stabilize the training. Without this regularization, large-scale training sometimes collapses.
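The gradient-norm regularization in (c) is standard global-norm clipping; a NumPy sketch is below (the max_norm value is an assumption for illustration, not the paper's setting):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not
    exceed max_norm, preventing occasional loss spikes from destabilizing
    large-scale multilingual training."""
    total_norm = np.sqrt(sum(float((g * g).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm
```

In practice this is applied once per update, after gradient accumulation across the batch and before the optimizer step.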
Unsupervised Directions In Table 2, we observe that mRASP2 achieves reasonable results on unsupervised translation directions. The language pairs En-Nl, En-Pt, and En-Pl are never observed by m-Transformer. m-Transformer sometimes achieves reasonable BLEU for X→En, e.g. 10.7 for Pt→En, since there are many similar languages in PC32, such as Es and Fr. Not surprisingly, it fails entirely on En→X directions. By contrast, mRASP2 obtains a +14.13 BLEU improvement on average without explicitly introducing supervision signals for these directions. Furthermore, mRASP2 achieves reasonable BLEU scores on Nl↔Pt directions even though it has only been trained on monolingual data for both sides. This indicates that by simply incorporating monolingual data alongside parallel data in the unified framework, mRASP2 successfully enables unsupervised translation through its unified multilingual representation.

Zero-shot Translation for non-English Directions
Zero-shot translation has been an intriguing topic in multilingual neural machine translation. Previous work shows that a multilingual NMT model can perform zero-shot translation directly; however, the translation quality is quite poor compared with pivot-based models. We evaluate mRASP2 on the OPUS-100 zero-shot test set, which contains 6 languages 14 and 30 translation directions in total. To make the comparison clear, we also report results for several baselines: mRASP2 w/o AA adopts only contrastive learning on top of m-Transformer, and mRASP2 w/o MC24 excludes monolingual data from mRASP2.
The evaluation results are listed in the Appendix and summarized in Table 3. We find that mRASP2 significantly outperforms m-Transformer and substantially narrows the gap with pivot-based models. This is in line with our intuition that bridging the representation gap between languages improves zero-shot translation.
The main reason is that the contrastive loss, aligned augmentation, and additional monolingual data enable a better language-agnostic sentence representation. It is worth noting that prior work achieves BLEU improvements on zero-shot translation at the sacrifice of about 0.5 BLEU on English-centric directions. By contrast, mRASP2 improves zero-shot translation by a large margin without losing performance on English-centric directions. Therefore, mRASP2 has great potential to serve many-to-many translation, including both English-centric and non-English directions.

Analysis
To understand what contributes to the performance gain, we conduct analytical experiments in this section. First, we summarize and analyze the performance of mRASP2 in different scenarios. Second, we use the sentence representations of mRASP2 to retrieve similar sentences across languages, to verify our argument that the improvements come from the universal language representation learned by mRASP2. Finally, we visualize the sentence representations and show that mRASP2 indeed draws the representations closer.

Ablation Study
To better understand the effectiveness of mRASP2, we evaluate models under different settings. We summarize the experiment results in Table 4:

• 1 vs. 3: 3 performs comparably with m-Transformer in supervised and unsupervised scenarios, while achieving a substantial BLEU improvement for zero-shot translation. This indicates that introducing the contrastive loss improves zero-shot translation quality without harming other directions.
• 2 vs. 4: 2 performs poorly on zero-shot directions, which means the contrastive loss is crucial for performance on zero-shot directions.
• 5: mRASP2 further improves BLEU in all three scenarios, especially on unsupervised directions. It is therefore safe to conjecture that by incorporating monolingual data, mRASP2 learns a better representation space.

Similarity Search
In order to verify whether mRASP2 learns a better representation space, we conduct a set of similarity search experiments. Similarity search is the task of finding the nearest neighbor of each sentence in another language according to cosine similarity. We argue that mRASP2 benefits this task because it bridges the representation gap across languages; we therefore use the accuracy of similarity search as a quantitative indicator of cross-lingual representation alignment.
We conducted comprehensive experiments on mRASP2 and mRASP2 w/o AA to support our argument. We divide the experiments into two scenarios: first, we evaluate our method on the English-centric Tatoeba dataset (Artetxe and Schwenk, 2019); then, we conduct a similar search task on non-English language pairs. Following Tran et al. (2020), we construct a multi-way parallel test set (Ted-M) of 2,284 samples by filtering the test split of the TED dataset 15 for sentences that have translations in all 15 languages 16.
Under both settings, we follow the same strategy: We use the average-pooled encoded output as the sentence representation. For each sentence from the source language, we search the closest sentence in the target set according to cosine similarity.
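The retrieval protocol described above can be written compactly, assuming the sentence vectors have already been average-pooled:

```python
import numpy as np

def top1_search_accuracy(src_repr, tgt_repr):
    """Top-1 similarity-search accuracy: for each source sentence, retrieve
    the most cosine-similar target sentence. Row i of each matrix holds the
    same sentence in two languages, so the correct answer is index i."""
    src = src_repr / np.linalg.norm(src_repr, axis=1, keepdims=True)
    tgt = tgt_repr / np.linalg.norm(tgt_repr, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)   # cosine nearest neighbors
    return float((nearest == np.arange(len(src))).mean())
```

On a multi-way parallel set such as Ted-M, this function can be evaluated for every ordered language pair to produce the accuracy tables and heat maps discussed below.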

English-Centric: Tatoeba
We display the evaluation results in Table 5 and observe two trends: (i) the overall accuracy follows the ordering m-Transformer < mRASP2 w/o AA < mRASP2; (ii) mRASP2 brings larger improvements for languages with less data volume in PC32. Together, these trends suggest that mRASP2 increases translation BLEU by bridging the representation gap across languages.
Non-English: Ted-M The argument that mRASP2 bridges the representation gap would be more convincing if similarity search accuracy also increased on zero-shot directions. We list the averaged top-1 accuracy of 210 non-English directions 17 in Table 6. The ordering m-Transformer < mRASP2 w/o AA < mRASP2 is consistent with the results in the English-centric scenario, showing that our method generally narrows the representation gap across languages.

[Table 6: Non-English: averaged sentence similarity search top-1 accuracy on the Ted-M test set.]
To better understand the specifics behind the averaged accuracy, we plot the accuracy improvements as a heat map in Figure 3. mRASP2 w/o AA brings general improvements over m-Transformer. mRASP2 improves especially on Dutch (Nl), because mRASP2 introduces monolingual Dutch data while mRASP2 w/o AA includes no Dutch data at all.

Visualization
In order to visualize the sentence representations across languages, we retrieve the sentence representation R(s) for each sentence in Ted-M, resulting in 34,260 samples in the high-dimensional space.
To facilitate visualization, we apply t-SNE dimension reduction to reduce the 1024-dimensional representations to 2 dimensions. We then select three representative languages (English, German, and Japanese) and depict the bivariate kernel density estimation based on the 2-dimensional representations. Figure 4 clearly shows that m-Transformer cannot align the three languages, whereas mRASP2 draws their representations much closer.
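The visualization pipeline can be sketched with scikit-learn, an assumed dependency (the paper does not specify its tooling), here on random stand-in data in place of the 34,260 x 1024 representation matrix:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the pooled sentence representations (rows = sentences).
reps = np.random.RandomState(0).rand(60, 32)

# Reduce to 2-D; per-language KDE contours (as in Figure 4) would then be
# drawn over these coordinates, e.g. with seaborn.kdeplot.
coords = TSNE(n_components=2, perplexity=10, init="random",
              random_state=0).fit_transform(reps)
print(coords.shape)  # (60, 2)
```

Each language's points can then be colored separately so that overlap (or lack of it) between language clusters is visible at a glance.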

Related Work
Multilingual Neural Machine Translation While initial research on NMT started with building translation systems between two languages, Dong et al. (2015) extended bilingual NMT to one-to-many translation by sharing encoders across 4 language pairs. Since then, there has been a massive increase in work on MT systems involving more than two languages (Chen et al., 2018; Choi et al., 2018; Chu and Dabre, 2019; Dabre et al., 2017). Recent efforts mainly focus on designing language-specific components for multilingual NMT to enhance performance on rich-resource languages (Bapna and Firat, 2019; Kim et al., 2019; Wang et al., 2019b; Escolano et al., 2020). Another promising line of work enlarges the model size and training data to improve model capability (Arivazhagan et al., 2019; Aharoni et al., 2019). Different from these approaches, mRASP2 explicitly closes the semantic representations of different languages and makes the most of cross-lingual transfer.

Zero-shot Machine Translation Typical zero-shot machine translation models rely on a pivot language (e.g. English) to combine source-pivot and pivot-target translation models (Chen et al., 2017; Ha et al., 2017; Gu et al., 2019; Currey and Heafield, 2019). Johnson et al. (2017) show that a multilingual NMT system enables zero-shot translation without explicitly introducing pivot methods; this is promising, but the performance still lags behind pivot-based competitors. Most follow-up studies focused on data augmentation: some improved zero-shot translation with online back-translation; Ji et al. (2020) and Liu et al. (2020) show that large-scale monolingual data can improve zero-shot translation with unsupervised pre-training; others proposed simple and effective data mining methods to enlarge the training corpus of zero-shot directions. Some work also attempted to explicitly learn shared semantic representations of different languages to improve zero-shot translation. Lu et al.
(2018) suggest that by learning an explicit "interlingua" across languages, a multilingual NMT model can significantly improve zero-shot translation quality. Al-Shedivat and Parikh (2019) introduce a consistent agreement-based training method that encourages the model to produce equivalent translations of parallel sentences in auxiliary languages. Different from these efforts, mRASP2 learns a universal many-to-many model and bridges cross-lingual representations with contrastive learning and aligned augmentation. Its performance is competitive on both zero-shot and supervised directions in large-scale experiments.
Contrastive Learning Contrastive learning has become a rising domain and has achieved significant success in various computer vision tasks (Zhuang et al., 2019; Tian et al., 2020; He et al., 2020; Misra and van der Maaten, 2020). A related work (2019) proposes the TLM objective, which simply concatenates parallel sentences as input. By contrast, mRASP2 leverages the supervision signal by pulling the representations of parallel sentences closer together.

Conclusion
We demonstrate that contrastive learning can significantly improve zero-shot machine translation directions. Combined with additional unsupervised monolingual data, we achieve substantial improvements on all translation directions of multilingual NMT. We analyze and visualize our method, and find that contrastive learning tends to close the representation gap between different languages. Our results also show the possibility of training a true many-to-many multilingual NMT model that works well on any translation direction. In future work, we will scale up the current training to more languages, e.g. PC150, so that a single model can handle more than 100 languages and outperform the corresponding bilingual baselines.

A Case Study
We plot the locations of multi-way parallel sentences in the representation space of mRASP2 in Figure 5 and list sentences number 1 and 100 in Table 7.

B Details of Evaluation Results

We list detailed results of evaluation on a wide range of test sets.

B.1 Results on OPUS-100
Detailed results on the OPUS-100 zero-shot evaluation set are listed in Table 8.

B.2 Results on WMT

Detailed results on the WMT evaluation set are listed in Table 9.

C Example of AA

D Details of MC24
We describe the details of MC24 in Table 10.