Synergy with Translation Artifacts for Training and Inference in Multilingual Tasks

Translation has played a crucial role in improving performance on multilingual tasks: (1) generating target-language data from source-language data for training, and (2) generating source-language data from target-language data for inference. However, prior works have not considered using both translations simultaneously. This paper shows that combining them can synergize results on various multilingual sentence classification tasks. We empirically find that translation artifacts stylized by translators are the main factor behind the performance gain. Based on this analysis, we adopt two training methods, SupCon and MixUp, that account for translation artifacts. Furthermore, we propose a cross-lingual fine-tuning algorithm called MUSC, which uses SupCon and MixUp jointly and further improves performance. Our code is available at https://github.com/jongwooko/MUSC.


Introduction
Large-scale pre-trained multilingual language models (Devlin et al., 2019; Conneau and Lample, 2019; Huang et al., 2019; Conneau et al., 2020; Luo et al., 2021) have shown promising transferability in zero-shot cross-lingual transfer (ZSXLT), where pre-trained language models (PLMs) are fine-tuned on a labeled task-specific dataset from a rich-resource source language (e.g., English or Spanish) and then evaluated on zero-resource target languages. Multilingual PLMs yield a universal representation space across languages, thereby improving multilingual task performance (Pires et al., 2019; Chen et al., 2019). Recent work has enhanced cross-lingual transferability by reducing the discrepancies between languages through translation-based approaches during fine-tuning (Fang et al., 2021; Zheng et al., 2021; Yang et al., 2022). Our paper focuses on the setting where translated datasets are available for cross-lingual transfer (XLT).

Conneau et al. (2018) provided two translation-based XLT baselines: translate-train and translate-test. The former fine-tunes a multilingual PLM (e.g., multilingual BERT) on the original source language and the machine-translated target languages simultaneously and then evaluates it on the target languages. The latter fine-tunes a source-language PLM (e.g., English BERT) on the original source language and then evaluates it on the source language machine-translated from the target languages. Both baselines improve performance over ZSXLT; however, they are sensitive to the translator, including translation artifacts, i.e., characteristics stylized by the translator (Conneau et al., 2018; Artetxe et al., 2020). Artetxe et al. (2020) showed that, under translate-test, matching the types of text (i.e., original or translationese) between training and inference is essential due to the presence of translation artifacts. Recently, Yu et al. (2022) proposed a training method that projects the original and translated texts into the same representation space under translate-train. However, prior works have not considered the two baselines simultaneously.
In this paper, we combine translate-train and translate-test using a pre-trained multilingual BERT to improve performance. Next, we identify that fine-tuning on the translated target dataset is required to improve performance on the translated source dataset, due to translation artifacts, even if the languages used for training and inference are different. Finally, to account for translation artifacts during fine-tuning, we adopt two training methods, supervised contrastive learning (SupCon; Khosla et al., 2020) and MixUp (Zhang et al., 2018), and propose MUSC, which combines them and improves performance on multilingual sentence classification tasks.

We use the given translated datasets $T^{MT}_{trn}$ for PAWSX and XNLI. However, for MARC and MLDoc, translated datasets are not provided, so we use the m2m_100_418M translator (Fan et al., 2021) from the open-source library EasyNMT to create them. $T^{MT}_{trn}$ is translated from $S_{trn}$ (i.e., $S_{trn} \rightarrow T^{MT}_{trn}$), and $T^{BT}_{trn}$ is back-translated from $T_{trn}$ (i.e., $T_{trn} \rightarrow S^{MT}_{trn} \rightarrow T^{BT}_{trn}$; Sennrich et al., 2016). Similarly, for inference, $S^{MT}_{tst}$ is translated from $T_{tst}$. The notations used in this paper are listed in Table 1.
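As an illustration, the translated and back-translated training sets could be produced with EasyNMT roughly as follows; the language pair, example sentences, and helper function are placeholders rather than the exact preprocessing script.

```python
# Sketch: building T^MT_trn and T^BT_trn with EasyNMT (m2m_100_418M).
# Language codes and example sentences are illustrative only.
from easynmt import EasyNMT

translator = EasyNMT("m2m_100_418M")

def translate(sentences, src, tgt):
    # EasyNMT translates a list of sentences from src to tgt in one call.
    return translator.translate(sentences, source_lang=src, target_lang=tgt)

# S_trn -> T^MT_trn: machine-translate the original source training set.
s_trn = ["the movie was surprisingly good", "terrible battery life"]
t_mt_trn = translate(s_trn, src="en", tgt="de")

# T_trn -> S^MT_trn -> T^BT_trn: back-translate the original target training set,
# so that it carries the same translator's artifacts as T^MT_trn.
t_trn = ["der Akku hält erstaunlich lange"]
s_mt_trn = translate(t_trn, src="de", tgt="en")
t_bt_trn = translate(s_mt_trn, src="en", tgt="de")

print(t_mt_trn, t_bt_trn)
```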
We use the pre-trained cased multilingual BERT (Devlin et al., 2019) from HuggingFace Transformers (Wolf et al., 2020) and use accuracy as the metric. Detailed fine-tuning information is provided in Appendix B. In this section, we demonstrate that the two baselines, translate-train and translate-test, are easily combined to improve performance; we call this combination translate-all. Table 2 describes the differences between the algorithms.
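For reference, a minimal sketch of loading and fine-tuning the backbone with HuggingFace Transformers is shown below; the number of labels, example inputs, and training hyperparameters are placeholders rather than the exact configuration reported in Appendix B.

```python
# Minimal sketch: fine-tuning cased multilingual BERT for sentence classification.
# num_labels and the example batch are placeholders (see Appendix B for the real setup).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=3
)

batch = tokenizer(
    ["an original English sentence", "ein maschinell übersetzter Satz"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)  # cross-entropy loss over the [CLS] logits
outputs.loss.backward()
```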
Table 3 presents the results according to the inference dataset when the models are fine-tuned using $S_{trn}$ and $T^{MT}_{trn}$. Inference on $T_{tst}$ is the standard way to evaluate the models, i.e., translate-train.
In addition, we evaluate the models on $S^{MT}_{tst}$, as in translate-test. Furthermore, we ensemble the two results from the different test datasets by averaging the predicted probabilities, i.e., translate-all, because averaging predictions over models or data points is widely used to improve the predictive performance and uncertainty estimation of models (Gontijo-Lopes et al., 2022; Kim et al., 2020a).
Table 3 shows that even if the multilingual PLMs are fine-tuned with $S_{trn}$ and $T^{MT}_{trn}$, the performance on the translated source data $S^{MT}_{tst}$ is competitive with that on the target data $T_{tst}$. Furthermore, ensemble inference increases performance on all datasets. This can be interpreted as the effectiveness of test-time augmentation (Kim et al., 2020a; Ashukha et al., 2021), because the results on the two test datasets, $T_{tst}$ and $S^{MT}_{tst}$ (augmented from $T_{tst}$), are combined.
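As a concrete illustration of the translate-all ensemble, the following sketch averages the predicted class probabilities of the paired examples from $T_{tst}$ and $S^{MT}_{tst}$; the model is assumed to follow the HuggingFace sequence classification interface, and the tensor names are illustrative.

```python
# Sketch of translate-all inference: average the predicted class probabilities
# obtained from the target test set T_tst and its translation S^MT_tst.
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(model, batch_tgt, batch_src_mt):
    # batch_tgt / batch_src_mt hold the same examples in the target language
    # and machine-translated into the source language, respectively.
    probs_tgt = F.softmax(model(**batch_tgt).logits, dim=-1)
    probs_src = F.softmax(model(**batch_src_mt).logits, dim=-1)
    return ((probs_tgt + probs_src) / 2).argmax(dim=-1)
```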
To explain the changes in predictions via test-time augmentation, we plot the predicted probability of the correct label when the models are evaluated on $T_{tst}$ and $S^{MT}_{tst}$, as depicted in Figure 1. The green and orange dots represent the gains and losses from the ensemble, respectively. The improved performance through the ensemble means that the number of green samples is greater than the number of orange samples in Figure 1.
To analyze where the performance gain comes from, we focus on the green samples. They are concentrated around the bottom-right corner, which implies that wrong predictions on $T_{tst}$ can become correct, high-confidence predictions on $S^{MT}_{tst}$. This phenomenon is the opposite of what we expected: the samples should be concentrated around the $y = x$ line, because the semantic meaning of $T_{tst}$ and $S^{MT}_{tst}$ is similar even though the languages differ. This implies that semantic meaning is not the main factor explaining the performance gain of the ensemble.

Translation Artifacts for Training
To find the main factor behind the performance gain, we hypothesize, expanding on Artetxe et al. (2020), that matching the types of text (i.e., original or translated) between training and inference is important even if the languages used for training and inference differ. For this analysis, we use MARC and MLDoc because they provide $T_{trn}$, which contains no artifacts.
Table 4 describes the results according to whether the types of text match between training and inference. Well-matched texts perform better than badly matched ones. In particular, the fact that $T^{BT}_{trn}$–$S^{MT}_{tst}$ outperforms $T_{trn}$–$S^{MT}_{tst}$ supports our hypothesis. This implies that biasing the training and inference datasets with the same translator can improve performance, and that translation artifacts can turn wrong predictions on $T_{tst}$ into correct predictions on $S^{MT}_{tst}$ when the models are trained on $T^{MT}_{trn}$, as shown in Section 3.

Proposed Method: MUSC
We propose an XLT method called MUSC, which applies SupCon (Khosla et al., 2020) and MixUp (Zhang et al., 2018) jointly. Namely, our method is contrastive learning with mixture sentences in a supervised setting. Several works have attempted to employ the idea of mixtures in unsupervised contrastive learning (Kim et al., 2020b; Shen et al., 2022); however, ours is the first to leverage the label information in a mixture. In this section, the loss functions are formulated at the batch level with a batch size of $N$, and ↑ and ↓ indicate the normal and reverse order, respectively, within a batch. All methods are built upon translate-all.

SupCon. We adopt SupCon, which pulls samples of the same class closer together (Gunel et al., 2021), to reduce the discrepancies between original and translated texts. Namely, SupCon helps models learn both the originality of $S_{trn}$ and the artifacts of $T^{MT}_{trn}$ comprehensively. The loss function of SupCon ($\mathcal{L}_{sc}$) with $I \equiv [1, \ldots, 2N]$ is as follows:

$$\mathcal{L}_{sc} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in I \setminus \{i\}} \exp(z_i \cdot z_a / \tau)},$$

where $\tau$ is a temperature hyperparameter, $Z = [Z_S; Z_T] \in \mathbb{R}^{2N \times d_p}$ denotes the projections of the [CLS] token representations obtained through an encoder $f$ and a projector $g$, i.e., $g(f(E)_{[CLS]})$, and $z_i$ is the $i$-th row of $Z$. $Z$ is concatenated along the batch dimension and $d_p$ is the dimension of the projections. The positive set of sample $i$, $P(i)$, is defined as $\{j \mid y'_j = y'_i,\ j \in I \setminus \{i\}\}$, where $[y'_1, \ldots, y'_N] = [y'_{N+1}, \ldots, y'_{2N}] = y$.

Figure 2: Overview of MUSC. $E_S$ and $E_T$ are the embeddings of the paired source- and target-language sentences, and each row indicates one sentence. Note that in the mix operation, addition and multiplication are applied elementwise. $f$, $g$, and $h$ are the encoder, projector, and classifier, respectively. $g_{[CLS]}(f(E))$ and $h_{[CLS]}(f(E))$ denote $g(f(E)_{[CLS]})$ and $h(f(E)_{[CLS]})$, respectively; $g(f(E)_{[CLS]})$ is written as $Z$. In this figure, the batch size is assumed to be four, and the blue- and green-colored samples share the same class.
MixUp. We adopt MixUp to densify the original and translated texts, respectively. MixUp is performed on the word embeddings, following Chen et al. (2020), since it is infeasible to apply MixUp directly to discrete word tokens. MixUp with $\alpha \in [0, 1]$ is as follows:

$$\tilde{E} = \alpha E^{\uparrow} + (1 - \alpha) E^{\downarrow},$$

where $E^{\uparrow} = XW_e \in \mathbb{R}^{N \times L \times d}$ is the output of the embedding layer for a given batch $X \in \mathbb{R}^{N \times L \times |V|}$ with weight matrix $W_e \in \mathbb{R}^{|V| \times d}$. $L$, $|V|$, and $d$ indicate the maximum sequence length, the vocabulary size, and the dimension of the word embeddings, respectively. $E^{\downarrow}$ is the batch reversed along the batch dimension. We apply MixUp within the same language to densify each type of text. For convenience of implementation, we mix a normal batch (↑) and a reversed batch (↓), following Shen et al. (2022). The mixing is performed elementwise. The loss function of MixUp ($\mathcal{L}_{mu}$) with cross-entropy ($\mathcal{L}_{ce}$) is as follows:

$$\mathcal{L}_{mu} = \alpha\, \mathcal{L}_{ce}(Q, y^{\uparrow}) + (1 - \alpha)\, \mathcal{L}_{ce}(Q, y^{\downarrow}),$$

where $Q = h(f(\tilde{E})_{[CLS]})$ is the logits of the [CLS] token for the mixed embeddings, with an encoder $f$ and a classifier $h$, and $y$ is the set of labels in the same batch.

MUSC. We replace the original projected representations in $\mathcal{L}_{sc}$ with the mixture ones, i.e., $Z_S \rightarrow \tilde{Z}_S$ or $Z_T \rightarrow \tilde{Z}_T$, to use MixUp and SupCon jointly. We calculate the resulting loss, $\mathcal{L}_{musc}$, by decomposing it over the two opposite orders, similar to $\mathcal{L}_{mu}$. Finally, the total loss function ($\mathcal{L}$), depicted in Figure 2, combines $\mathcal{L}_{ce}$ with these auxiliary losses.

Table 5 presents an ablation study over the applied loss functions. − denotes the baseline, which only applies $\mathcal{L}_{ce}$; the other methods add the corresponding loss on top of $\mathcal{L}_{ce}$. SupCon ($\mathcal{L}_{sc}$) and MixUp ($\mathcal{L}_{mu}$) improve performance on most datasets even when used separately, and their effect is especially strong when the dataset is small. Moreover, our total loss ($\mathcal{L}$), which trains the model with SupCon and MixUp jointly ($\mathcal{L}_{musc}$), outperforms both SupCon and MixUp on all datasets. In addition, our total loss ($\mathcal{L}$) brings larger gains than the simple conjunction of SupCon and MixUp ($\mathcal{L}_{sc} + \mathcal{L}_{mu}$) on all datasets except MARC. These results demonstrate that MUSC effectively combines SupCon and MixUp. The optimized hyperparameters are reported in Appendix B.
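To make the combination concrete, below is a minimal PyTorch sketch of one possible realization of $\mathcal{L}_{musc}$: the source-language embeddings are mixed with their batch-reversed counterparts, the mixed [CLS] projections are contrasted against the plain target-language projections, and the loss is decomposed over the two label orders as for $\mathcal{L}_{mu}$. The function and tensor names, the choice to mix only the source side, and the temperature are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch of a SupCon loss over mixed [CLS] projections (not the reference code).
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over a batch of projections z (B x d_p)."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                    # pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True
    )
    n_pos = pos.sum(dim=1)
    per_anchor = -(log_prob * pos.float()).sum(dim=1) / n_pos.clamp(min=1)
    return per_anchor[n_pos > 0].mean()

def mix(e, alpha):
    # Elementwise MixUp between the batch and its reverse along the batch dimension.
    return alpha * e + (1.0 - alpha) * e.flip(0)

def musc_step(encoder, projector, emb_src, emb_tgt, labels, alpha=0.5):
    # encoder maps embeddings (B x L x d) to hidden states (B x L x d); [CLS] is position 0.
    z_src = projector(encoder(mix(emb_src, alpha))[:, 0])    # mixed source projections
    z_tgt = projector(encoder(emb_tgt)[:, 0])                # plain target projections
    z = torch.cat([z_src, z_tgt], dim=0)
    y_up = torch.cat([labels, labels], dim=0)                # labels in normal order
    y_down = torch.cat([labels.flip(0), labels], dim=0)      # mixed half takes reversed labels
    # Decompose over the two label orders, as in the MixUp loss.
    return alpha * supcon_loss(z, y_up) + (1.0 - alpha) * supcon_loss(z, y_down)
```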

Conclusion
In this paper, we showed that translate-train and translate-test can be easily synergized from the test-time augmentation perspective, and found that the improved performance stems from translation artifacts. Based on this analysis, we proposed MUSC, supervised contrastive learning with mixture sentences, to enhance generalizability with respect to translation artifacts. Our work highlights the role of translation artifacts in XLT.

Limitations
Our work addressed the role of translation artifacts in cross-lingual transfer. A limitation of our work is that we experimented only on sentence classification tasks using multilingual BERT, because it is almost impossible to obtain token-level ground truths through a translator.

Ethics Statement
Our work does not raise ethical issues. Furthermore, we showed that a new baseline, translate-all, achieves higher performance, and proposed MUSC, which is designed upon the translate-all approach. We believe that various algorithms can be developed on top of translate-all for multilingual tasks.

D Additional Related Works
Cross-lingual Transfer. As recent advances in NLP have demonstrated the effectiveness of pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), XLT performance has rapidly improved by extending monolingual PLMs to multilingual settings (Conneau and Lample, 2019; Conneau et al., 2020). While these multilingual PLMs show state-of-the-art performance in ZSXLT, one promising approach for improving cross-lingual transferability is instance-based transfer via translation, such as translate-train and translate-test (Conneau et al., 2018). Due to the effectiveness and accessibility of translation, most recent works (Fang et al., 2021; Zheng et al., 2021; Yang et al., 2022) focus on better utilization of translation.

Test-time augmentation. Data augmentation, which expands a dataset by adding transformed copies of each example, is a common practice in supervised learning. While data augmentation is also widely used in XLT during training (Zheng et al., 2021), it can also be applied at test time to obtain greater robustness (Prakash et al., 2018), improved accuracy (Matsunaga et al., 2017), and estimates of uncertainty (Smith and Gal, 2018). Test-time augmentation (TTA) combines predictions over multiple views of a single input to obtain a "smoothed" prediction. We point out that using translation with XLT can be viewed as TTA, which gains performance from the different views given by the original and translated sentences. Following the call for further study of TTA (Kim et al., 2020a), we propose better utilization of translation artifacts in XLT.

Translation artifacts. "Translationese" refers to characteristics of a translated text that differentiate it from original text in the same language. While the effect of translationese has been widely studied in translation tasks (Graham et al., 2020; Freitag et al., 2020), its efficacy in XLT is under-explored. Artetxe et al. (2020) and Kaneko and Bollegala (2021) investigate the effect of translationese in the translate-test and ZSXLT settings; however, these works are apart from the general training approaches of XLT. Recently, Yu et al. (2022) made a first attempt to study translationese under translate-train, focusing on a single QA task.

Figure 1: Predicted probability of the correct label when the models are evaluated on $T_{tst}$ and $S^{MT}_{tst}$. The colors indicate right or wrong predictions: right on $T_{tst}$ and right on Ens. (blue), right on $T_{tst}$ and wrong on Ens. (orange), wrong on $T_{tst}$ and right on Ens. (green), and wrong on $T_{tst}$ and wrong on Ens. (red).
Scope of the Study

In this study, four datasets are used: MARC and MLDoc for single-sentence classification, and PAWSX and XNLI from XTREME (Hu et al., 2020) for sentence-pair classification. Details of the datasets are provided in Appendix A. Each dataset consists of the source dataset for training $S_{trn}$ and the target dataset for inference $T_{tst}$, where $S_{trn}$ is original and $T_{tst}$ is original (for MARC and MLDoc) or human-translated (for PAWSX and XNLI). For MARC and MLDoc, the original target dataset for training $T_{trn}$ is additionally given.

Table 3: Results according to the inference datasets (Acc. in %). $S_{trn}$ and $T^{MT}_{trn}$ are used for training. The number in parentheses for MLDoc is the number of training samples. 'Ens.' indicates the ensemble of the results on the two different test datasets at inference. XNLI results are reported in Appendix C.

Table 4: Results according to the matching between types of text for training and inference (Acc. in %). $S_{trn}$ is also used for training. Columns: Dataset, Training, Inference, and per-language accuracies for EN, ZH, FR, DE, RU, ES, IT, JA, and Avg.

Table 5: Results according to the losses. $S_{trn}$ and $T^{MT}_{trn}$ are used for training, and $T_{tst}$ and $S^{MT}_{tst}$ are used for ensemble inference, i.e., under translate-all. − denotes the baseline, which only applies $\mathcal{L}_{ce}$; $\mathcal{L}_{ce}$ is added for all methods. XNLI results are reported in Appendix C. Columns: Dataset, Method, and per-language accuracies for EN, ZH, FR, DE, RU, ES, IT, KO, JA, and Avg.

Table 8: XNLI results according to the inference datasets.

Table 9: XNLI results according to the training methods.