Improving Zero-Shot Multilingual Translation with Universal Representations and Cross-Mappings

A many-to-many multilingual neural machine translation model can translate between language pairs unseen during training, i.e., perform zero-shot translation. Improving zero-shot translation requires the model to learn universal representations and cross-mapping relationships in order to transfer the knowledge learned on the supervised directions to the zero-shot directions. In this work, we propose the state mover's distance, based on the optimal transport theory, to model the differences between the representations output by the encoder. We then bridge the gap between the semantically equivalent representations of different languages at the token level by minimizing the proposed distance to learn universal representations. Besides, we propose an agreement-based training scheme, which helps the model make consistent predictions based on semantically equivalent sentences to learn universal cross-mapping relationships for all translation directions. The experimental results on diverse multilingual datasets show that our method consistently improves over the baseline system and other contrast methods. The analysis proves that our method can better align the semantic space and improve the prediction consistency.


Introduction
The many-to-many multilingual neural machine translation (NMT) model (Ha et al., 2016; Firat et al., 2016; Johnson et al., 2017; Gu et al., 2018; Fan et al., 2020; Zhang et al., 2020a) can support multiple translation directions in a single model. The shared encoder encodes the input sentence into the semantic space, and the shared decoder decodes from this space to generate the translation in the target language. This paradigm allows the model to translate between language pairs unseen during training, i.e., to perform zero-shot translation.
Zero-shot translation can improve inference efficiency and reduce the amount of bilingual training data the model requires. Performing zero-shot translation requires universal representations that encode language-agnostic features and cross-mapping relationships that map semantically equivalent sentences of different languages to the particular space of the target language. In this way, the model can transfer the knowledge learned in the supervised translation directions to the zero-shot translation directions. However, the existing model structure and training scheme cannot guarantee universal representations and cross-mappings because they lack explicit constraints. Specifically, the encoder may map different languages to different semantic subspaces, and the decoder may learn different mapping relationships for different source languages, especially when the model has high capacity.
Many researchers have attempted to solve this problem. Pham et al. (2019) propose to compress the output of the encoder into a fixed number of states so that only language-independent features are encoded. Arivazhagan et al. (2019) add a regularization loss to maximize the similarity between the sentence representations of the source and target sentences. Pan et al. (2021) propose contrastive learning schemes to minimize the gap between the sentence representations of similar sentences and maximize that between irrelevant sentences. All of the above works try to minimize the representation discrepancies between languages at the sentence level, which brings two problems for NMT. Firstly, these works usually obtain the sentence-level representation of the encoder output by max-pooling or averaging, which can discard the sentence length, word alignment relationships, and other token-level information. Secondly, regularizing the sentence representation mismatches the working paradigm of the NMT model, because the decoder performs cross-attention over the whole state sequence rather than the sentence representation. Besides, all of the above works focus on the encoder side and cannot help the decoder learn universal mapping relationships.
Given the above, we propose a method that learns universal representations and cross-mappings to improve zero-shot translation performance. Based on the optimal transport theory, we propose the state mover's distance (SMD) to model the difference between two state sequences at the token level. To map semantically equivalent sentences from different languages to the same place in the semantic space, we add an auxiliary loss that minimizes the SMD between the source and target sentences. Besides, we propose an agreement-based training scheme to learn universal mapping relationships for the translation directions with the same target language. We mix up the source and target sentences to obtain a pseudo sentence. Then, the decoder makes predictions separately conditioned on this pseudo sentence and on the corresponding source or target sentence. We improve the prediction consistency by minimizing the KL divergence between the two output distributions. The experimental results on diverse multilingual datasets show that our method brings 2~3 BLEU improvements over a strong baseline system and consistently outperforms other contrast methods. The analysis proves that our method can better align the semantic space and improve the prediction consistency.

Background
In this section, we give a brief introduction to the Transformer (Vaswani et al., 2017) model and to many-to-many multilingual translation.

The Transformer
We denote the input sequence of symbols as $x = (x_1, \ldots, x_{n_x})$ and the ground-truth sequence as $y = (y_1, \ldots, y_{n_y})$. The Transformer model is based on the encoder-decoder architecture. The encoder is composed of N identical layers. Each layer has two sublayers: the first is a multi-head self-attention sublayer, and the second is a fully connected feed-forward network. Both sublayers are followed by a residual connection and layer normalization. The input sequence x is first converted into a sequence of vectors. This sequence of vectors is then fed into the encoder, and the output of the N-th layer is taken as the source state sequence, denoted as $H_x$. The decoder is also composed of N identical layers. In addition to the same two sublayers as in each encoder layer, a cross-attention sublayer is inserted between them, which performs multi-head attention over the output of the encoder. We can thus get the predicted probability of the k-th target word conditioned on the source sentence and the k-1 previous target words. The model is optimized by minimizing the cross-entropy loss of the ground-truth sequence with teacher forcing:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{n_y}\sum_{k=1}^{n_y} \log p(y_k \mid y_{<k}, x; \theta), \quad (1)$$

where $n_y$ is the length of the target sentence and $\theta$ denotes the model parameters.
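To make the training objective concrete, the following is a minimal PyTorch sketch of the teacher-forced cross-entropy loss; the tensor shapes and the padding handling are our own assumptions rather than details from the paper.

```python
import torch.nn.functional as F

def cross_entropy_loss(logits, targets, pad_id):
    """Teacher-forced cross-entropy (Equation 1).

    logits:  (batch, n_y, vocab) decoder outputs, one distribution per target position
    targets: (batch, n_y) ground-truth token ids, aligned with the logits
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch * n_y, vocab)
        targets.view(-1),                  # flatten to (batch * n_y,)
        ignore_index=pad_id,               # skip padding positions
    )
```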

Multilingual Translation
We define $L = \{l_1, \ldots, l_M\}$ as the collection of M languages involved in the training phase. Following Johnson et al. (2017), we share all the model parameters across all the languages. Following Liu et al. (2020), we add a particular language id token at the beginning of the source and target sentences, respectively, to indicate the language.

Method
The main idea of our method is to help the encoder output universal representations for all the languages and to help the decoder map the semantically equivalent representations of different languages to the target language's space. We propose two approaches to fulfill this goal. The first is to directly bridge the gap between state sequences that carry the same semantics. The second is to force the decoder to make consistent predictions based on semantically equivalent sentences. Figure 1 shows the overall training scheme.

Optimal Transport
Earth Mover's Distance Based on the optimal transport theory (Villani, 2009; Peyré et al., 2019), the earth mover's distance (EMD) measures the minimum cost of transporting the probability mass from one distribution to another. Assume there are two discrete probability distributions $\mu$ and $\mu'$, defined as:

$$\mu = \sum_{i=1}^{n} m_i \delta_{w_i}, \qquad \mu' = \sum_{j=1}^{n'} m'_j \delta_{w'_j}, \quad (2)$$

where each data point $w_i \in \mathbb{R}^d$ carries a probability mass $m_i$ ($m_i > 0$), and there are $n$ data points in $\mu$ and $n'$ in $\mu'$. We define a cost function $c(w_i, w'_j)$ that determines the cost per unit of mass between two points $w_i$ and $w'_j$. Given the above, the EMD is defined as:

$$\mathrm{EMD}(\mu, \mu') = \min_{T \ge 0} \sum_{i=1}^{n}\sum_{j=1}^{n'} T_{ij}\, c(w_i, w'_j) \quad \text{s.t.} \quad \sum_{j=1}^{n'} T_{ij} = m_i, \;\; \sum_{i=1}^{n} T_{ij} = m'_j, \quad (3)$$

where $T_{ij}$ denotes the mass transported from $w_i$ to $w'_j$.

State Mover's Distance Following EMD, we define the state mover's distance (SMD) to measure the minimum 'travel cost' between two state sequences. Given a pair of translations $x = (x_1, \ldots, x_{n_x})$ and $y = (y_1, \ldots, y_{n_y})$, we can get their corresponding state sequences after feeding them to the encoder, which are denoted as:

$$H_x = (h_1, \ldots, h_{n_x}), \qquad H_y = (h'_1, \ldots, h'_{n_y}), \quad (4)$$

where $n_x$ and $n_y$ denote the lengths of the source and target sentences. We can regard $H_x$ as a discrete distribution over the space $\mathbb{R}^d$, where probability mass occurs only at the specific points $h_i$. Several previous studies (Schakel and Wilson, 2015; Yokoi et al., 2020) have confirmed that the embedding norm is related to word importance, with important words having larger norms. Inspired by these findings, we also observe that the state vectors have similar properties: the state vectors of essential words, such as content words and medium-frequency words, have larger norms than those of unimportant ones, such as function words and high-frequency words. Therefore, we propose to use the normalized vector norm as the probability mass of each state point:

$$m_i = \frac{\|h_i\|}{\sum_{k=1}^{n_x} \|h_k\|}, \quad (5)$$

where $\|\cdot\|$ denotes the norm of the vector.
Given the above, we can convert the state sequences into distributions:

$$\mu_x = \sum_{i=1}^{n_x} m_i \delta_{h_i}, \qquad \mu_y = \sum_{j=1}^{n_y} m'_j \delta_{h'_j}. \quad (6)$$

Then, the SMD is formally defined as:

$$\mathrm{SMD}(x, y) = \mathrm{EMD}(\mu_x, \mu_y). \quad (7)$$

As illustrated before, we want the decoder to make consistent predictions conditioned on equivalent state sequences. Since both the vector norm and the vector direction affect the cross-attention results of the decoder, we use the Euclidean distance as the cost function; we do not use a cosine-similarity-based metric because it only considers the vector direction. The proposed SMD is a fully unsupervised algorithm for aligning the contextual representations of two semantically equivalent sentences.

Approximation of SMD The exact computation of SMD is a linear programming problem with typically super-cubic $O(n^3)$ complexity, which would slow down training greatly. We can obtain a relaxed bound of SMD by removing one of the two constraints in Equation 3. Following Kusner et al. (2015), we remove the second constraint, which yields:

$$\widehat{\mathrm{SMD}}(x, y) = \sum_{i=1}^{n_x} m_i \min_{j} c(h_i, h'_j). \quad (8)$$

This approximation yields a lower bound of the exact SMD: the exact solution satisfies both constraints and is therefore feasible for the relaxed problem, so the relaxed optimum cannot exceed the exact distance. Under the approximation, the optimal solution is for each state vector $h_i$ to move all of its probability mass to the most similar state vector $h'_j$; therefore, the approximation also enables many-to-one alignment relationships during training. We have also tried approximation algorithms that give a more accurate estimate of SMD, e.g., the Sinkhorn algorithm (Cuturi, 2013) and IPOT (Xie et al., 2020). However, we did not observe consistent improvements in our preliminary experiments, and these algorithms also slow down training significantly.
Objective Function We define a symmetrical loss that minimizes the SMD in both directions:

$$\mathcal{L}_{\mathrm{OT}} = \frac{1}{2}\left(\widehat{\mathrm{SMD}}(x, y) + \widehat{\mathrm{SMD}}(y, x)\right). \quad (9)$$
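For illustration, a minimal PyTorch sketch of the relaxed SMD and the symmetric objective is shown below; it operates on a single sentence pair, and the function names and the absence of batching are our own assumptions, not the authors' released code.

```python
import torch

def state_mass(states):
    """Turn a state sequence (n, d) into a discrete distribution:
    the mass of each state is its normalized vector norm (Equation 5)."""
    norms = states.norm(dim=-1)               # (n,)
    return norms / norms.sum()

def relaxed_smd(h_src, h_tgt):
    """Lower-bound approximation of SMD (Equation 8): each source state
    ships all of its mass to the closest target state under Euclidean cost."""
    mass = state_mass(h_src)                  # (n_src,)
    cost = torch.cdist(h_src, h_tgt, p=2)     # (n_src, n_tgt) Euclidean costs
    min_cost, _ = cost.min(dim=1)             # cheapest destination per source state
    return (mass * min_cost).sum()

def ot_loss(h_x, h_y):
    """Symmetric optimal transport loss L_OT (Equation 9)."""
    return 0.5 * (relaxed_smd(h_x, h_y) + relaxed_smd(h_y, h_x))
```

Because the inner minimum replaces the full linear program, the whole computation reduces to one pairwise distance matrix and a row-wise minimum, which keeps the overhead per batch small.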

Agreement-based Training
Theoretical Analysis In zero-shot translation, the decoder should map the semantic representations of different languages to the target language space, even if it has never seen those translation directions during training. This ability requires the model to make consistent predictions based on semantically equivalent sentences, whatever the input language is. To improve the prediction consistency of the model, we propose an agreement-based training method. Because the source sentence x and the target sentence y are semantically equivalent, the probability of predicting any other sentence z based on either of them should theoretically always be equal:

$$p(z \mid x; \theta) = p(z \mid y; \theta). \quad (10)$$

Specifically, the predicted probabilities of the k-th target word conditioned on the first k-1 words of z and on the source or target sentence are equal:

$$p(z_k \mid z_{<k}, x; \theta) = p(z_k \mid z_{<k}, y; \theta), \quad (11)$$

where $\theta$ denotes the model parameters. Optimizing Equation 11 not only helps the encoder produce universal semantic representations but also helps the decoder map different source languages to the particular target language space indicated by z.
Mixup for z Although Equation 11 is theoretically attractive, the choice of the sentence z has a significant influence on the above optimization. If we use a random sentence as z that is unrelated to x and y, the prediction makes no sense and the model learns nothing helpful. If we use either x or y directly, this causes information leakage on one side of Equation 11. As a result, the prediction difficulty differs significantly between the two sides, and it is hard for one side to catch up with the other. Given the above, we need an intermediate sentence that lies "between" x and y. Inspired by the success of the mixup technique in NLP (Zhang et al., 2020b; Cheng et al., 2021), we generate a pseudo sentence by hard-mixing x and y at the token level. We truncate the longer of x and y to make the two sentences equal in length. Since the two sentences are a translation pair, their lengths are usually close, so truncation neither significantly shortens the longer sentence nor encourages the decoder to learn shorter outputs. We denote the truncated sentences as x' and y' and their length as n'.
Then we can generate z as:

$$z = g \odot x' + (1 - g) \odot y', \quad (12)$$

where $g \in \{0, 1\}^{n'}$ and $\odot$ denotes the element-wise product. Each element of g is sampled from Bernoulli(λ), where the parameter λ is sampled from Beta(α, β), and α and β are two hyperparameters. The language tag in z, which determines the translation direction, comes from either x or y.
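A minimal sketch of this token-level hard mixup is given below; the Bernoulli/Beta sampling follows the description above, while the handling of the language tag position is our assumption.

```python
import torch

def hard_mixup(x_ids, y_ids, alpha=6.0, beta=3.0):
    """Build the pseudo sentence z from a translation pair (Equation 12).

    x_ids, y_ids: 1-D LongTensors of token ids, each starting with its language tag.
    """
    n = min(x_ids.size(0), y_ids.size(0))                        # truncate the longer sentence
    x_t, y_t = x_ids[:n], y_ids[:n]
    lam = torch.distributions.Beta(alpha, beta).sample().item()  # mixing ratio lambda ~ Beta(alpha, beta)
    g = torch.bernoulli(torch.full((n,), lam)).long()            # per-token gate g ~ Bernoulli(lambda)
    z = g * x_t + (1 - g) * y_t                                  # hard token-level mixup
    z[0] = x_t[0] if torch.rand(1).item() < 0.5 else y_t[0]      # language tag from either x or y
    return z
```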
Objective Function Similar to Equation 9, we define another symmetrical loss based on the KL divergence between the two prediction distributions of the model:

$$\mathcal{L}_{\mathrm{AT}} = \frac{1}{2}\sum_{k=1}^{n'} \Big( \mathrm{KL}\big(p(z_k \mid z_{<k}, x) \,\|\, p(z_k \mid z_{<k}, y)\big) + \mathrm{KL}\big(p(z_k \mid z_{<k}, y) \,\|\, p(z_k \mid z_{<k}, x)\big) \Big), \quad (13)$$

where we omit the model parameters for convenience.
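A sketch of the agreement loss follows, assuming the decoder is run twice (once conditioned on x and once on y) and returns per-token log-probabilities for z; the exact reduction over positions is our assumption.

```python
import torch.nn.functional as F

def agreement_loss(logp_z_given_x, logp_z_given_y):
    """Symmetric KL loss L_AT (Equation 13).

    Both inputs have shape (n_z, vocab) and contain log-probabilities of the
    next token of z, conditioned on the z prefix plus x or plus y.
    """
    kl_xy = F.kl_div(logp_z_given_y, logp_z_given_x,
                     reduction="sum", log_target=True)  # KL(p(.|x) || p(.|y))
    kl_yx = F.kl_div(logp_z_given_x, logp_z_given_y,
                     reduction="sum", log_target=True)  # KL(p(.|y) || p(.|x))
    return 0.5 * (kl_xy + kl_yx)
```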

The Final Loss
The final loss consists of three parts: the cross-entropy loss (Equation 1), the optimal transport loss based on SMD (Equation 9), and the KL divergence loss for the agreement-based training (Equation 13):

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \gamma_1 |x|\, \mathcal{L}_{\mathrm{OT}} + \gamma_2 \mathcal{L}_{\mathrm{AT}}, \quad (14)$$

where $\gamma_1$ and $\gamma_2$ are two hyperparameters that control the contributions of the two regularization terms. Since $\mathcal{L}_{\mathrm{OT}}$ is computed at the sentence level while the other two losses are computed at the token level, we multiply $\mathcal{L}_{\mathrm{OT}}$ by the averaged sequence length |x|. Among these three losses, the first term dominates the parameter update of the model and mostly determines the model performance; the latter two regularization terms only slightly modify the directions of the gradients.
Because the first loss term does not depend on $H_y$, we apply a stop-gradient operation to $H_y$ (Figure 1), which means that gradients do not pass through $H_y$ to the encoder.
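Putting the pieces together, a hedged sketch of the full objective with the stop-gradient on $H_y$ could look as follows; it reuses the helper functions sketched above, and the way the scalars are passed in is our assumption.

```python
def total_loss(ce_loss, h_x, h_y, logp_z_x, logp_z_y, avg_src_len, gamma1, gamma2):
    """L = L_CE + gamma1 * |x| * L_OT + gamma2 * L_AT (Equation 14)."""
    # stop-gradient: the OT term must not pull on H_y, which is not anchored by L_CE
    ot = ot_loss(h_x, h_y.detach()) * avg_src_len   # ot_loss from the sketch above
    at = agreement_loss(logp_z_x, logp_z_y)         # agreement_loss from the sketch above
    return ce_loss + gamma1 * ot + gamma2 * at
```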

Data Preparation
We conduct experiments on the following multilingual datasets: IWSLT17, PC-6, and OPUS-7. Brief statistics of the training sets are given in Table 1; more details are in the appendix.

IWSLT17 (Cettolo et al., 2017) We simulate two scenarios. The first (IWSLT) is English-pivot, where we only retain the parallel sentences from/to English. The second (IWSLT-b) has a chain of pivots, where two languages are connected by a chain of pivot languages. Each translation direction has about 0.22M sentence pairs. Both scenarios have eight supervised translation directions and twelve zero-shot translation directions. We use the official validation and test sets.

PC-6
The PC-6 dataset is extracted from the PC-32 corpus (Lin et al., 2020). The data amounts of the different language pairs are unbalanced, ranging from 0.12M to 1.84M sentence pairs. This dataset has ten supervised and twenty zero-shot translation directions. We use validation and test sets collected from WMT16~19 for the supervised directions. The zero-shot validation and test sets are extracted from WikiMatrix (Schwenk et al., 2021), each containing about 1K~2K sentence pairs.

OPUS-7 The OPUS-7 dataset is extracted from the OPUS-100 corpus (Zhang et al., 2020a). The language pairs come from different language families and differ significantly. This dataset has twelve supervised translation directions and thirty zero-shot translation directions. We use the standard validation and test sets released by Zhang et al. (2020a). We concatenate the zero-shot test sets with the same target language for convenience.
We use the Stanford word segmenter (Tseng et al., 2005; Monroe et al., 2014) to segment Arabic and Chinese, and the Moses toolkit (Koehn et al., 2007) to tokenize the other languages. Besides, 32K merge operations are performed to learn BPE (Sennrich et al., 2016).

Systems
We use the open-source toolkit Fairseq-py (Ott et al., 2019) as our Transformer system. We implement the following systems:

• Zero-Shot (ZS) The baseline system, which is trained only with the cross-entropy loss (Equation 1). The model is then tested directly on the zero-shot test sets.
• Pivot Translation (PivT) (Cheng et al., 2017) The same translation model as ZS. The model first translates the source language into the pivot language and then generates the target language.
• Sentence Representation Alignment (SRA) (Arivazhagan et al., 2019) This method adds a regularization loss to minimize the discrepancy between the source and target sentence representations:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \gamma \cdot \mathrm{Dis}\big(\mathrm{Enc}(x), \mathrm{Enc}(y)\big), \quad (15)$$

where 'Dis' denotes the distance function and 'Enc(•)' denotes the sentence representation. We use the averaged sentence representation and the Euclidean distance function because we find that they work better. We vary the hyperparameter γ from 0.1 to 1 to tune the performance.
• Softmax Forcing (SF) (Pham et al., 2019) This method enables the decoder to generate the target sentence from itself by adding an extra loss.

Table 2: The overall BLEU scores on the test sets. "Zero Avg." and "Sup. Avg." denote the average BLEU scores on the zero-shot and supervised directions. The "x" in the third table denotes all languages except for the target language. The highest scores are marked in bold for all models except for the "PivT" system in each column.
• Contrastive Learning (CL) (Pan et al., 2021) This method adds an extra contrastive loss to minimize the representation gap between similar sentences and maximize that between irrelevant sentences:

$$\mathcal{L}_{\mathrm{CL}} = -\log \frac{e^{\,\mathrm{sim}^{+}(R(x),\,R(y))/\tau}}{\sum_{w} e^{\,\mathrm{sim}^{-}(R(x),\,R(w))/\tau}}, \quad (17)$$

where + and − denote positive and negative sample pairs and R(•) denotes the averaged state representation. We set τ to 0.1 as suggested in the paper and tune γ as in the 'SRA' system.
• Disentangling Positional Information (DisPos) (Liu et al., 2021) This method removes the residual connections in a middle layer of the encoder to obtain language-agnostic representations.
• Target Gradient Projection (TGP) (Yang et al., 2021b) This method projects the training gradient so that it does not conflict with the oracle gradient computed on a small amount of direct parallel data.
• Language Model Pre-training (LMP) (Gu et al., 2019) This method strengthens the decoder by pre-training it as a language model before machine translation training.
The following systems are implemented based on our method:

• ZS+OT We only add the optimal transport loss (Equation 9) during training. We vary the hyperparameter γ1 from 0.1 to 1 and find that it consistently improves performance for any value in this range. The detailed results and the final hyperparameter setting are given in the appendix.
• ZS+AT We only add the agreement-based training loss (Equation 13) during training. The α and β of the Beta distribution are set to 6 and 3, respectively. We vary the hyperparameter γ2 from $10^{-4}$ to 0.1.
• ZS+OT+AT (Ours) The model is trained with the complete objective function (Equation 14). The hyperparameters are set according to the search results of the above two systems and are listed in the appendix.

Implementation Details All the systems strictly follow the base model configuration of Vaswani et al. (2017). We employ the Adam optimizer with β1 = 0.9 and β2 = 0.98. We use the inverse square root learning rate scheduler with warmup_steps = 4000 and lr = 0.0007. We set dropout to 0.3 for the IWSLT datasets and 0.1 for the PC-6 and OPUS-7 datasets. All the systems are trained on 4 RTX3090 GPUs with an update frequency of 2. The maximum number of tokens per GPU is 4096. For the IWSLT datasets, we first pretrain the model with the cross-entropy loss (Equation 1) for 20K steps and then continue training with the proposed loss terms for 80K steps. For the PC-6 and OPUS-7 datasets, the pretraining and continued-training steps are both 100K.

Main Results
All the results (including the intermediate results of the 'PivT' system) are generated with beam size 5 and length penalty α = 0.6. The translation quality is evaluated with case-sensitive BLEU (Papineni et al., 2002) using the SacreBLEU tool (Post, 2018). We report tokenized BLEU for Arabic, character-based BLEU for Chinese, and detokenized BLEU for the other languages. The main results are shown in Table 2. For display convenience, we report the averaged BLEU over directions with the same target language on the PC-6 and OPUS-7 datasets; the detailed results are in the appendix. The 'Ours' system significantly improves over the 'ZS' baseline system and outperforms the other zero-shot-based systems on all datasets. The two proposed methods, OT and AT, both help the model learn universal representations and cross-mappings, so each of them can improve the model performance independently. The two methods also complement each other and further improve performance when combined. Besides, the 'Ours' system can even exceed the 'PivT' system when the distant language pairs in IWSLT-b or the low-resource language pairs in PC-6 cause severe error accumulation. We also compare the training speed and put the results in the appendix.

Analysis
In this section, we try to understand how our method improves the zero-shot translation.

Sentence Representation Visualization
To verify whether our method can better align the semantic spaces of different languages, we visualize each model's encoder output on the IWSLT test sets. We first select three languages: German, Italian, and Dutch. Then we filter out the overlapping sentences of the three languages from the corresponding test sets and create a new three-way-parallel test set. Next, we feed all the sentences to the encoder of each model and average the encoder output to get the sentence representation. Last, we reduce the dimensionality of the representations with t-SNE (Van der Maaten and Hinton, 2008). The visualization in Figure 2(a) shows that the 'ZS' system cannot align the three languages well, which partly confirms our assumption that the conventional MNMT model cannot learn universal representations for all languages. In contrast, the 'Ours' system (d) draws the representations closer and achieves results comparable to the 'CL' system (c) without requiring large amounts of negative instances to contrast against. The visualization results confirm that our method can learn good universal representations for different languages.
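The visualization procedure can be reproduced with a few lines of scikit-learn and matplotlib; the sketch below assumes the sentence representations have already been extracted and is not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

LANG_COLORS = {"de": "tab:blue", "it": "tab:orange", "nl": "tab:green"}

def plot_sentence_representations(sent_reps, langs):
    """Project mean-pooled encoder outputs to 2-D with t-SNE and color by language.

    sent_reps: (n_sentences, d) array of averaged encoder states
    langs:     list of language codes ('de', 'it', 'nl'), one per sentence
    """
    points = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(sent_reps))
    for lang, color in LANG_COLORS.items():
        idx = [i for i, l in enumerate(langs) if l == lang]
        plt.scatter(points[idx, 0], points[idx, 1], s=4, c=color, label=lang)
    plt.legend()
    plt.show()
```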

Inspecting Prediction Consistency
To verify whether our method helps map the semantic representations of different languages to the same space of the target language, we inspect the prediction consistency of the models when they are fed synonymous sentences from different languages. Precisely, we measure the pair-wise BLEU on the above IWSLT three-way-parallel test set. We choose one language as the target language, e.g., German, and then translate the other two languages, e.g., Italian and Dutch, into the target language. After obtaining these two translation files, we use one file as the reference and the other as the translation to calculate BLEU, and then we swap the roles of the two files and calculate BLEU again. We average the two BLEU scores to get the pair-wise BLEU. The results in Table 3 show that our method achieves higher scores, which proves that it improves the prediction consistency.
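As a reference, the pair-wise BLEU described above can be computed directly with the SacreBLEU Python API; the function below is a sketch under the assumption that the two hypothesis files have already been read into lists of detokenized sentences.

```python
import sacrebleu

def pairwise_bleu(hyps_a, hyps_b):
    """Pair-wise BLEU between two translations into the same target language
    (e.g., it->de and nl->de on the three-way-parallel test set): each file
    serves once as hypothesis and once as reference, and the scores are averaged."""
    a_vs_b = sacrebleu.corpus_bleu(hyps_a, [hyps_b]).score
    b_vs_a = sacrebleu.corpus_bleu(hyps_b, [hyps_a]).score
    return 0.5 * (a_vs_b + b_vs_a)
```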

Inspecting Spurious Correlations
Zero-shot translation usually suffers from spurious correlations captured in the supervised directions, which means that the model overfits the mapping relationship from the input language to the output language observed in the training set (Gu et al., 2019). This problem often causes the off-target prediction phenomenon, where the model generates translations in the wrong target language. To check whether our method can alleviate this phenomenon, we use the Langdetect toolkit to identify the target language and calculate the prediction accuracy as $1 - n_{\text{off-target}} / n_{\text{total}}$. We also compare our method with the 'SRA' and 'CL' methods. The results are shown in Table 4. The 'ZS' baseline system achieves high prediction accuracy on the IWSLT dataset, but its performance begins to decline as the amount of data becomes unbalanced and the languages become more unrelated. On all the datasets, our method achieves higher prediction accuracy and outperforms all the contrast methods. We conclude from the results that our method reduces the spurious correlations captured by the model.

Related Work

(2019) first translate the source and target languages into a third language and then make consistent predictions based on this pseudo sentence. Zhang et al. (2020a) propose random online back translation to enforce the translation of unseen training language pairs. Chen et al. (2021) fuse a pretrained multilingual model into the NMT model. Compared with these works, our method needs neither additional data nor additional time to generate a pseudo corpus. If necessary, our method can also be combined with these works to further improve the zero-shot performance of the model. Yang et al. (2021a) propose to substitute some fragments of the source language with their counterpart translations to obtain code-switched sentences. Compared to this work, our agreement-based method mixes up the translation pairs to generate a pseudo sentence as the decoder input and then helps the model make consistent predictions.

Conclusion
In this work, we focus on improving the zero-shot ability of multilingual neural machine translation. To reduce the discrepancy between the encoder outputs, we propose the state mover's distance based on the optimal transport theory and directly minimize this distance during training. We also propose an agreement-based training method that helps the decoder make consistent predictions based on semantically equivalent sentences. The experimental results show that our method obtains consistent improvements on diverse multilingual datasets. Further analysis shows that our method can better align the semantic space, improve the prediction consistency, and reduce spurious correlations.

Limitations
Although our method improves the performance of the zero-shot translation directions, it has limited benefit for the supervised translation performance. On the one hand, the vanilla MNMT model is already able to learn a lot of shared knowledge across languages. On the other hand, the language-specific knowledge learned by the model also helps it achieve good translation performance in the supervised translation directions. Therefore, our method is limited in improving the supervised translation performance. Besides, some reviewers pointed out that, according to the main experimental results, our method degrades the supervised translation performance. This is because we select checkpoints based on the performance on the zero-shot validation sets, which may cause a slight decline in the performance of the supervised directions. If we select checkpoints based on the supervised validation sets, our method improves the zero-shot performance without degrading the BLEU of the supervised directions.

Figure 1 :
Figure 1: The training scheme of our method. x and y denote a pair of translations; H_x and H_y denote the corresponding state sequences. z is the pseudo sentence obtained by mixing up x and y. 'Dec' denotes the decoder; there is only one decoder in the model. 'stop-grad' denotes the stop-gradient operation during back-propagation. L_CE, L_OT, and L_AT denote the cross-entropy loss, the optimal transport loss, and the agreement-based training loss.

Figure 2 :
Figure 2: The visualization of sentence representations after dimension reduction on the IWSLT three-way-parallel test sets. Blue denotes German, orange denotes Italian, and green denotes Dutch.

Table 1 :
The statistics of our datasets.

Table 3 :
The pair-wise BLEU on the IWSLT three-way-parallel test sets.

Table 4 :
The target language prediction accuracy.

Table 5 :
The results of each zero-shot translation direction on the PC-6 corpus. The notations have the same meaning as in Table 2.

Table 6 :
The statistics about the PC-6 corpus.

Table 8 :
The averaged BLEU with different α and β for the 'ZS+AT' system.

Table 9 :
The training speed on the IWSLT dataset.