Effective Fine-Tuning Methods for Cross-lingual Adaptation

Large-scale multilingual pre-trained language models have shown promising results in zero- and few-shot cross-lingual tasks. However, recent studies have shown their lack of generalizability when the languages are structurally dissimilar. In this work, we propose a novel fine-tuning method based on co-training that aims to learn more generalized semantic equivalences, as a complement to multilingual language modeling, using unlabeled data in the target language. We also propose an adaptation method based on contrastive learning to better capture the semantic relationship in the parallel data when a few translation pairs are available. To show our method's effectiveness, we conduct extensive experiments on cross-lingual inference and review classification tasks across various languages. We report significant gains compared to directly fine-tuning multilingual pre-trained models and other semi-supervised alternatives.


Introduction
Self-supervised pre-trained models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019) have revolutionized natural language processing (NLP). Such pre-training with language modeling objectives provides a useful initialization of model parameters that adapts well to new tasks with supervised fine-tuning. Building on the success of monolingual pre-trained language models (LMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), multilingual models like mBERT (Devlin et al., 2019), XLM, and XLM-R (Conneau et al., 2020) have pushed the state of the art on cross-lingual tasks by pre-training large Transformer (Vaswani et al., 2017) models jointly on many languages.
Multilingual pre-trained LMs support zero-shot transfer from a source language to target languages: fine-tuning the pre-trained LM on labeled data in the source language, such as English, can transfer well to other languages. Recent research (Wu and Dredze, 2019; K et al., 2020; Pires et al., 2019) has shown that the transfer capability of these multilingual LMs relies mainly on the structural similarity between the source and target languages. When the target language is structurally dissimilar to the source, transfer in the zero-shot setting is shown to be poor.
Since multilingual LMs are generally trained with a self-supervised masked language modeling (MLM) objective, without considering parallel information or semantic equivalences, they cannot capture semantic similarity across languages well, as reflected by their low Tatoeba scores (Phang et al., 2020). This could also harm their zero-shot transfer performance: Dufter and Schütze (2020) show that injecting cross-lingual signals by replacing masked tokens with semantically similar words from other languages improves mBERT's multilinguality and zero-shot cross-lingual inference (XNLI) results.
Concurrently, multilingual embedding models such as LASER (Artetxe and Schwenk, 2019) and LaBSE (Feng et al., 2020) use parallel data to learn language-invariant sentence representations by encoding texts from different languages into a shared embedding space. These models capture semantic similarity well, as often shown by their high Tatoeba scores and their success in tasks that involve cross-lingual similarity, such as cross-lingual retrieval and bitext mining. However, these models generally lag behind the multilingual LMs on zero-shot cross-lingual classification tasks like XNLI (Wang, 2019). We hypothesize that source-language task information might be necessary to achieve better zero-shot transfer, as shown empirically by Phang et al. (2020) with intermediate-task fine-tuning in the source language (English).
In this work, we argue that multilingual embedding models and multilingual LMs are complementary to each other: the task adaptability of the multilingual LMs can be complemented by the semantic awareness of the sentence embedding models, and vice versa. To this end, we first propose a co-training (Blum and Mitchell, 1998) framework that facilitates the multilingual LMs and sentence embedding models learning from each other using unlabeled data from target languages. Second, we aim to improve multilingual LM fine-tuning when small amounts of parallel pairs are available within the task training data. Compared to existing translate-train methods (Singh et al., 2020; Conneau et al., 2018) that simply use translation as a data-augmentation method to generate labeled data in the target language, we propose a novel language adaptation approach based on contrastive learning that aligns the parallel data to model the semantic relationship between translation pairs for effective fine-tuning.
We perform extensive experiments on the XNLI (Conneau et al., 2018) and Multilingual Amazon Review Corpus (MARC) (Keung et al., 2020) datasets. The experimental results demonstrate that our approach outperforms previous methods on various classification tasks across different languages by a clear margin. In particular, on XNLI, our proposed co-training method improves over the original mBERT and XLM-R by 2.3% and 1.7% on average for zero-shot cross-lingual transfer. On MARC, our approach achieves average gains of 8% and 1.1% for mBERT and XLM-R, respectively.

Related Works
Since the introduction of the Transformer network (Vaswani et al., 2017), it has become a common model of choice for language representation learning. Pre-trained Transformer-based models such as mBERT (Devlin et al., 2019) have proven effective at learning cross-lingual information. mBERT was pre-trained on raw Wikipedia text in 104 languages using masked language modeling (MLM) and next sentence prediction (NSP) tasks, with no explicit cross-lingual objective. XLM-R (Conneau et al., 2020) improves over mBERT by training longer with more data from CommonCrawl, and without the NSP objective.
Meanwhile, several studies examine what makes these pre-trained language models multilingual and why they work well for cross-lingual transfer. Pires et al. (2019) hypothesize that the cross-lingual capability of mBERT benefits from having a shared (sub-word) vocabulary for all languages, which helps to bind the languages by mapping the token representations into a shared space. K et al. (2020), on the other hand, point out that the contribution of shared subwords is minimal, and that structural similarity (e.g., word order and word frequency) is more important for effective cross-lingual transfer.
Another line of work on multilingual pre-training focuses on generating multilingual sentence embeddings such that semantically similar sentences across different languages are closer in a shared vector space. LASER (Artetxe and Schwenk, 2019) uses an encoder-decoder architecture (Sutskever et al., 2014) and trains on large parallel data to learn multilingual fixed-length sentence embeddings for 93 languages on a translation task. The Multilingual Universal Sentence Encoder (mUSE) (Chidambaram et al., 2019) uses a dual-encoder architecture trained on one billion crawled question-answering pairs with a translation ranking task: given a sentence from the source language and a group of candidate texts from target languages, the model must recognize the corresponding translation of the source-language text among the candidates. LaBSE (Feng et al., 2020) is based on the BERT architecture and uses the same translation ranking task as mUSE, but is trained on a much larger dataset of six billion translation pairs.

Some researchers have also introduced cross-lingual alignment from parallel data as an auxiliary objective alongside the original MLM pre-training. Cao et al. (2020) align mBERT embeddings in a post-hoc manner: they first apply a statistical word aligner to align tokens in the parallel sentences, then tune mBERT by minimizing the mean-squared error between the embeddings of English words and the corresponding words in other languages. Chi et al. (2020) minimize the vector-space distance between a source-language sentence and its translation during pre-training. The problem with these methods is that they either need to pre-train a new model from scratch (Chi et al., 2020) or need a second round of pre-training on top of the original multilingual LM (Cao et al., 2020). From both the computation and data perspectives, the cost is very high.
In another line of work, researchers use data augmentation to address the language adaptation problem in cross-lingual tasks. For example, Bari et al. (2021) use XLM-R's masked language model to augment the data with vicinal samples. Liu et al. (2021) propose a labeled sequence translation method that translates source-language NER training data to target languages, and train a generation-based multilingual data augmentation model. These methods are orthogonal to ours.
Summary Current work on multilingual pre-training either does not consider sentence-level cross-lingual alignment during pre-training (sacrificing transfer capability to structurally dissimilar languages), or relies on alignment signals that make the models expensive to train. In contrast, we focus on utilizing parallel information in the fine-tuning phase with minimal cost, avoiding building a pre-trained model from scratch or using large amounts of parallel data. Our approach can also naturally cooperate with different multilingual models based on different pre-training objectives.

Methods
In this section, we describe our co-training framework for cross-lingual transfer (§3.1), followed by the contrastive language adaptation method (§3.2).

Co-training based Model Transfer
The goal of our approach is to use the cross-lingual semantic information from the multilingual embedding model to improve zero-shot classification performance on downstream cross-lingual tasks when fine-tuning the multilingual LM with source-language data.
Background Co-training (Blum and Mitchell, 1998) is a widely used semi-supervised method in which two complementary classifiers utilize unlabeled data to iteratively bootstrap each other's performance. Within the co-training framework, each classifier is trained on a unique view of the data. In each iteration, the algorithm uses each classifier to select high-confidence data to form a pseudo-labeled dataset. The intuition is that if one classifier can confidently predict the class of an example that is very similar to some of the labeled ones, it can provide one more training example for the other classifier. An example that is easy for the first classifier is not necessarily easy for the second, so the second classifier gains useful information from it, and vice versa. Co-training also avoids the confirmation bias issue (Tarvainen and Valpola, 2017) of single-model self-training, where the model accumulates its own errors.

Algorithm 1: Co-training for cross-lingual task adaptation
Proposed Co-Training Framework In our proposed co-training framework (Fig. 1), we have two cross-lingual classifiers, which use two separate pre-trained multilingual models to obtain the cross-lingual representation of a text. The first classifier is based on a multilingual LM like XLM-R (Conneau et al., 2020) that captures the structural similarity across languages by pre-training on MLM; we denote it as f^lm_θ. The second is based on the multilingual sentence embedding model LaBSE (Feng et al., 2020), which is pre-trained on parallel texts to capture the semantic similarity across languages; we denote it as f^emb_θ. Every input text is encoded by f^lm_θ or f^emb_θ into a cross-lingual representation h ∈ R^d, where d is the dimension of the sentence representation. Each model then has a task-specific classification module for task fine-tuning, consisting of a dense layer followed by a softmax that maps h ∈ R^d to Y, where Y is the set of target classes.
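As a concrete picture of the task-specific classification module, here is a minimal numpy sketch (the function and weight names are illustrative, not from the paper's code): a dense layer followed by a softmax maps the pooled representation h ∈ R^d to a distribution over the classes in Y.

```python
import numpy as np

def task_head(h, W, b):
    """Map a pooled representation h (d,) to class probabilities (|Y|,)
    via a dense layer followed by a softmax."""
    z = h @ W + b                      # dense layer: (d,) @ (d, |Y|) + (|Y|,)
    z = z - z.max()                    # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

The max-subtraction does not change the softmax output but keeps the exponentials from overflowing for large logits.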
We first fine-tune the two classifiers for K epochs using the labeled data D = {(x_i, y_i)}_{i=1}^n, where n is the amount of labeled data, x_i is a text from the source language, and y_i is the corresponding ground-truth label. The next step is to make predictions on the unlabeled data D_u = {x_u}_{u=1}^m from the target languages, with m being the number of unlabeled samples. Given an arbitrary unlabeled example x_u, the two classifiers f^lm_θ and f^emb_θ yield pseudo labels y^lm_u and y^emb_u as well as confidence scores c^lm_u and c^emb_u, respectively. We then set a threshold t and randomly select a subset of the pseudo-labeled data with confidence scores larger than t. The items selected by f^lm_θ are merged into the original labeled data D for f^emb_θ to re-train on. Similarly, the training dataset for f^lm_θ is updated with the pseudo-labeled data selected by f^emb_θ. We repeat this process iteratively; see Alg. 1 for pseudocode.
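The loop above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes each classifier exposes scikit-learn-style fit/predict_proba methods, and the names select_confident and co_train are ours.

```python
import random

def select_confident(probs, threshold, max_n, rng=random):
    """Return (index, label) pairs whose top class probability exceeds
    threshold, randomly subsampled to at most max_n items."""
    picked = [(i, max(range(len(p)), key=p.__getitem__))
              for i, p in enumerate(probs) if max(p) > threshold]
    if len(picked) > max_n:
        picked = rng.sample(picked, max_n)
    return picked

def co_train(clf_lm, clf_emb, X, y, X_unlab, threshold=0.9, rounds=3):
    """Iteratively exchange confident pseudo-labels between two classifiers."""
    data_lm, data_emb = (list(X), list(y)), (list(X), list(y))
    max_n = max(1, len(X) // 3)          # cap selections at n/3, as in Sec. 4
    for _ in range(rounds):
        clf_lm.fit(*data_lm)
        clf_emb.fit(*data_emb)
        # each model's confident predictions extend the *other* model's data
        for src, (Xd, yd) in ((clf_lm, data_emb), (clf_emb, data_lm)):
            for i, label in select_confident(src.predict_proba(X_unlab),
                                             threshold, max_n):
                Xd.append(X_unlab[i])
                yd.append(label)
    return clf_lm, clf_emb
```

Note the cross-over in the inner loop: pseudo-labels produced by f^lm_θ grow the training set of f^emb_θ and vice versa, which is what distinguishes co-training from self-training.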

Contrastive Language Adaptation
Our co-training method in §3.1 exploits unlabeled target-language data. In practical scenarios, it is easy to acquire some translations of the source texts, either via machine translation (MT) or human translators, even for low-resource languages. Let T represent a system (or human) that can translate a source text x_i into arbitrary target languages. With this, we can create a new dataset D^bi = {(x_i^t, y_i)}, where x_i^t = T(x_i, t) is the translation of x_i into language t. In our case, we translate only a small portion of D to build D^bi to make the setup realistic, as getting good translations can be expensive.
One straightforward way to fine-tune f^lm_θ using D and D^bi is to optimize the following cross-entropy loss:

L_CE = −(1/|D ∪ D^bi|) Σ_i log p^lm_{y_i}(x_i),   (1)

where p^lm_{y_i}(x_i) is the probability that f^lm_θ predicts for the ground-truth label y_i.
However, such a method does not fully benefit from the parallel information in D^bi. We therefore develop a language adaptation approach that effectively uses the commonality between the source- and target-language data, in terms of their label space and semantic relationship, to deal with the limited size of the parallel training data when fine-tuning the multilingual LM f^lm_θ, as presented below.

Label Alignment (LA) We encourage data with the same class label to be nearby in the embedding space. We utilize the supervised contrastive method (Gunel et al., 2020) to capture the similarities between examples of the same class and contrast them with examples from the other classes. Specifically, given a batch B of training data from D ∪ D^bi containing examples of various classes, we optimize f^lm_θ using the following loss:

L_LA = Σ_{x_i ∈ B} −(1/N_{y_i}) Σ_{j≠i} 1[y_j = y_i] log ( exp(h_i · h_j / τ) / Σ_{k≠i} exp(h_i · h_k / τ) ),   (2)
where h_k and h_j respectively denote the L2-normalized [CLS] representations (encoded by f^lm_θ) of x_k and x_j, drawn from the same batch B as x_i; N_{y_i} denotes the number of examples in B that have the same class as y_i; and τ is a temperature parameter.
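As an illustration, the label-alignment loss can be sketched in numpy as below. This is a simplified sketch assuming pre-computed L2-normalized embeddings; label_alignment_loss is our name for it, not the paper's, and a training implementation would compute it on the framework's tensors instead.

```python
import numpy as np

def label_alignment_loss(h, y, tau=0.1):
    """Supervised contrastive loss over a batch.
    h: (B, d) L2-normalized embeddings; y: (B,) integer class labels."""
    sim = h @ h.T / tau                              # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                   # exclude self-pairs
    logits = sim - sim.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    n_pos = same.sum(axis=1)                         # N_{y_i}: same-class count
    pos_sum = np.where(same, log_prob, 0.0).sum(axis=1)
    # average the positives' log-probabilities per anchor; skip anchors
    # that have no same-class partner in the batch
    per_anchor = np.where(n_pos > 0, pos_sum / np.maximum(n_pos, 1), 0.0)
    return -per_anchor.mean()
```

The loss is smaller when same-class embeddings are close and different-class embeddings are far apart, which is exactly the geometry the label-alignment objective rewards.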
Combined with the cross-entropy loss (Eq. 1), the final loss function for fine-tuning f^lm_θ is:

L = L_CE + λ_LA · L_LA,   (3)

where λ_LA is a hyper-parameter tuning the importance of the label alignment loss. Note that the label alignment loss can be applied to only D, only D^bi, or both. In our model, we apply it to both (i.e., D ∪ D^bi) to effectively use the commonality between the source- and target-language data in their label space.
Semantic Alignment (SA) When fine-tuning with the parallel data, we encourage a source text and its corresponding translations in different languages to be nearby in the embedding space (they have the same label). Concurrently, for texts with different labels from different languages, we encourage their embeddings to be far apart. For each source-language labeled instance (x_i, y_i) in D^bi, we construct a batch B of size |B| = 2b. The first half of the batch, {(x_i, x_j)}_{j=1}^{b}, contains parallel texts of x_i from different languages; they have the same label y_i and are considered positive pairs. The second half of the batch, {(x_i, x_j)}_{j=b+1}^{2b}, is constructed from source-language texts with a different label (i.e., y_j ≠ y_i) and considered negative pairs. The contrastive loss for one data point (x_i, y_i) in D^bi is:

L_SA(x_i) = −(1/b) Σ_{j=1}^{b} log ( exp(h_i · h_j / τ) / Σ_{k=1}^{2b} exp(h_i · h_k / τ) ).   (4)

Note that the negative pairs {(x_i, x_j)}_{j=b+1}^{2b} can be sampled from only D, only D^bi, or both. In our model, we sample negative pairs only from D^bi.
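The per-anchor semantic-alignment loss can be sketched in numpy as follows (a simplified sketch under the same L2-normalization assumption as above; semantic_alignment_loss is an illustrative name, not the paper's code):

```python
import numpy as np

def semantic_alignment_loss(anchor, pos, neg, tau=0.1):
    """Contrastive loss for one source example.
    anchor: (d,) embedding of x_i; pos: (b, d) its translations (positives);
    neg: (b, d) source texts with a different label (negatives).
    All vectors are assumed L2-normalized."""
    cand = np.vstack([pos, neg])                  # the 2b candidates
    logits = cand @ anchor / tau
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())  # stabilized log-partition
    # average negative log-probability assigned to the translation pairs
    return -(logits[: len(pos)] - log_z).mean()
```

Minimizing this pulls the anchor toward its translations and pushes it away from differently labeled source texts, matching the batch construction described above.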
Combined with the cross-entropy loss, the final loss function for fine-tuning f^lm_θ on D^bi is:

L = L_CE + λ_SA · L_SA,   (5)

where λ_SA is a hyper-parameter tuning the importance of the semantic alignment loss.
Overall, when fine-tuning f^lm_θ on D and D^bi, we first train it with Eq. 3 on D ∪ D^bi. Then, we train it on D^bi using Eq. 5 with a smaller learning rate. See Fig. 2 for an illustration of the proposed language adaptation method.

Experiments
To show the effectiveness of our model, we evaluate our proposed methods on two cross-lingual classification tasks as we describe below.

Evaluation Tasks & Datasets
Multilingual Amazon Review Corpus (MARC) MARC (Keung et al., 2020) is a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, and is balanced across the five possible star ratings, so each rating constitutes 20% of the reviews in each language. We test our model on the binarized classification task from Keung et al. (2020), where we predict whether the reviewer gave a negative review (1-2 stars) or a positive review (4-5 stars). We drop the 3-star reviews from the training and evaluation data, and use only the review body for training and testing. The training data for the target languages are used as unlabeled data.
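The binarization step can be illustrated as follows. Treat this as a hypothetical sketch rather than the paper's preprocessing code; the field names "stars" and "review_body" are assumptions about the record layout.

```python
def binarize_reviews(reviews):
    """Keep only the review body; map 1-2 stars to negative (0) and
    4-5 stars to positive (1); drop 3-star reviews entirely."""
    out = []
    for r in reviews:                  # r: dict with "stars" and "review_body"
        if r["stars"] == 3:
            continue
        out.append((r["review_body"], 0 if r["stars"] <= 2 else 1))
    return out
```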
XNLI XNLI (Conneau et al., 2018) is an evaluation benchmark for cross-lingual NLI that covers 15 languages. The dataset was created by human translation of the development and test sets of the English MultiNLI dataset. Given a premise-hypothesis sentence pair, the task is to classify their relationship as entailment, contradiction, or neutral. On XNLI, we directly use the translations that come with the dataset as the unlabeled data for co-training.

Experimental Setup
On both XNLI and MARC, we experiment with three setups: (i) We train the model using 1.2% of the data sampled from the original XNLI and MARC English training sets, which we denote as 1.2% zero-shot. This investigates our co-training method's performance when there is only a small amount of labeled data in the source language for zero-shot transfer. (ii) We add translations from English to the target languages for some (200 samples) of the data in the 1.2% zero-shot setting to show the effectiveness of our proposed language adaptation method; we denote this as 1.2% few-shot. (iii) We also report our co-training method's performance using the full English training datasets from XNLI and MARC, which we denote as 100% zero-shot. The details of the settings are shown in Table 1.

Multilingual Pre-trained Models In the experiments, we consider two multilingual language models as f^lm_θ: mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), using the base version of each. We use LaBSE (Feng et al., 2020) as the multilingual embedding model f^emb_θ in the co-training framework.
Baselines. We compare our co-training method with self-training (Dong and de Melo, 2019). Instead of using the multilingual embedding model to generate pseudo labels for f^lm_θ, self-training uses f^lm_θ's own predictions on the unlabeled data to obtain pseudo-labeled training data. We also compare our language adaptation method with translate-train on mBERT and XLM-R.
Training Details We use the AdamW (Loshchilov and Hutter, 2019) optimizer with an initial learning rate of 5e-5, a weight decay of 0.01, and a linear learning-rate scheduler for all experiments. We use a batch size of 16 and a maximum sequence length of 128 when fine-tuning on both MARC and XNLI. We fine-tune for five epochs on XNLI and two epochs on MARC.
Co-training We pre-define a threshold t. During each iteration of co-training, we randomly choose n/3 pseudo labels from the qualified candidates (pseudo labels with a confidence score larger than t), where n is the size of the original labeled data. If the number of qualified candidates is less than n/3, we choose all of them.

Model checkpoints: mBERT (https://github.com/google-research/bert/blob/master/multilingual.md), XLM-R (https://github.com/pytorch/fairseq/tree/master/examples/xlmr), LaBSE (https://tfhub.dev/google/LaBSE/1).

Figure 3: Accuracy after each iteration during self-training and co-training for LaBSE and mBERT.

Results and Analysis
For evaluation, we report results on the entire test sets of MARC and XNLI. Table 2 shows the results on MARC, measured by accuracy. Both self-training and co-training can use unlabeled data to improve the model's performance, and co-training further enhances it over self-training. We observe that under the 1.2% zero-shot setup, the improvement from self-training is minor, whereas our co-training can still learn useful knowledge from LaBSE: on average, co-training outperforms self-training by 12% and 2.4% for mBERT and XLM-R, respectively. To show that the multilingual embedding model and the multilingual LM are complementary, we report the accuracy after each iteration of co-training and self-training for mBERT and LaBSE under the 1.2% zero-shot setup in Fig. 3. Although LaBSE outperforms mBERT by a large margin (18%), it can still learn more useful information from mBERT than from itself in self-training, gaining about 1% in co-training compared to self-training. On the full (100%) setup, our co-training method gives an average gain of 8% for mBERT and 1.1% for XLM-R compared to their respective fine-tuned versions.

MARC
Finally, under the 1.2% few-shot setup (i.e., with translation data), our language adaptation method improves mBERT by 0.9% on average and is also 0.4% better than XLM-R. We also demonstrate that our language adaptation method and co-training framework can be combined to further improve mBERT and XLM-R by 5.1% and 1.7% on average, respectively.

Table 2: MARC accuracy for English (en), French (fr), German (de), Chinese (zh), and Japanese (ja). We report the mean accuracy over three different random seeds.

Table 3: XNLI accuracy for English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw), and Urdu (ur). We report the mean accuracy over three different random seeds.

Table 3 shows the XNLI results in accuracy. Overall, our model outperforms the baselines on almost all 15 test languages in the three experimental setups.

XNLI
Under the 1.2% zero-shot setup, our co-training method gives a sizeable improvement of 6.4% and 6.6% for mBERT and XLM-R, respectively, compared to their fine-tuned versions. Our co-training method also gives an average increase of 3.4% and 1.7% over self-training. Specifically, we observe over 2% gains for ur, bg, ar, th, zh, and tr when comparing our co-training with self-training. Under the full (100%) setup, our co-training method yields an average gain of 2.3% for mBERT and 1.7% for XLM-R compared to their respective fine-tuned versions. This shows that even with a large amount of labeled data in the source language, the model can still benefit from the multilingual embedding model through co-training.
We also show the importance of target-language data by adding a small number (200 in our case) of translation pairs in the few-shot cross-lingual transfer (1.2%) setting. The traditional translate-train method gives mBERT and XLM-R average gains of 2.5% and 1% over all languages, respectively. We further improve these to 3.4% and 1.4% by adopting language adaptation. Similar to the experimental results on MARC, combining language adaptation and co-training achieves the best performance in this setup.

Ablation
To better understand the contribution from different optimization objectives, we perform an ablation study on the MARC dataset by ablating one component at a time from the complete model.
From the results in Table 4, we observe that generally removing one of the objectives would reduce the performance on average, indicating that all objectives contribute to the overall performance.
When we remove the semantic alignment loss (-SA), we observe an accuracy drop of 0.5% in mBERT compared to the full system. The accuracy drop for XLM-R is 0.1%.
Removing the label alignment loss on the source-language data (-LA(D)) leads to a ∼0.7% accuracy drop across the board. Removing label alignment on the translation data (-LA(D^bi)) leads to a 1.4% accuracy drop on mBERT and a 0.4% drop on XLM-R. This shows the positive effect of the label alignment loss on both the source-language data D and the target-language translation data D^bi.

Conclusion & Future Work
In this paper, we have proposed an effective fine-tuning method to improve the cross-lingual transfer capability of multilingual pre-trained LMs. In contrast to previous work, our proposed co-training framework lets multilingual pre-trained LMs learn cross-lingual semantic relationships from the multilingual embedding model. Moreover, we propose a novel language adaptation approach based on contrastive learning. When translation pairs exist within the training dataset, our language adaptation approach can better model the semantic relationship across languages for effective fine-tuning. Extensive experiments on the XNLI and Multilingual Amazon Review datasets show that our method outperforms previous methods on both zero-shot and few-shot transfer.
For future studies, we will investigate the data selection policies for the co-training methods. In some cases, the distribution of the labeled data could be different from that of the unlabeled data. It may yield a sampling bias in the training iterations of co-training that shifts towards the unlabeled set, thus hurting the model performance. A more robust data selection policy could solve this problem.