XeroAlign: Zero-Shot Cross-lingual Transformer Alignment

The introduction of pretrained cross-lingual language models brought decisive improvements to multilingual NLP tasks. However, the lack of labelled task data necessitates a variety of methods aiming to close the gap to high-resource languages. Zero-shot methods, in particular, often use translated task data as a training signal to bridge the performance gap between the source and target language(s). We introduce XeroAlign, a simple method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R. XeroAlign uses translated task data to encourage the model to generate similar sentence embeddings for different languages. The XeroAligned XLM-R, called XLM-RA, shows strong improvements over the baseline models, achieving state-of-the-art zero-shot results on three multilingual natural language understanding tasks. XLM-RA's text classification accuracy exceeds that of XLM-R trained with labelled data and performs on par with state-of-the-art models on a cross-lingual adversarial paraphrasing task.


Introduction
In just a few years, transformer-based (Vaswani et al., 2017) pretrained language models have achieved state-of-the-art (SOTA) performance on many NLP tasks (Wang et al., 2019a). Transfer learning enabled self-supervised pretraining on unlabelled datasets to learn linguistic features such as syntax and semantics, improving tasks with limited training data (Wang et al., 2019b). Pretrained cross-lingual language models (PXLMs) soon followed, learning general linguistic features and properties of dozens of languages (Lample and Conneau, 2019; Xue et al., 2020). For multilingual tasks, however, adequate labelled data is usually only available for a few well-resourced languages such as English. Zero-shot approaches were introduced to transfer task knowledge to languages without the requisite training data. To this end, we introduce XeroAlign, a conceptually simple, efficient and effective method for task-specific alignment of sentence embeddings generated by PXLMs, aimed at effective zero-shot cross-lingual transfer. XeroAlign is an auxiliary loss function, which uses translated data (typically from English) to bring the zero-shot performance in the target language closer to the source (labelled) language, as illustrated in Figure 1. We apply our proposed method to the publicly available XLM-R transformer but instead of pursuing large-scale model alignment with general parallel corpora such as Europarl (Koehn, 2005), we show that a simplified, task-specific model alignment is an effective and efficient approach to zero-shot transfer for cross-lingual natural language understanding (XNLU). We evaluate our method on 4 datasets that cover 11 unique languages.

[Figure 1: The XeroAligned XLM-R model (called XLM-RA) for cross-lingual NLU. The XeroAlign loss is added to the otherwise unaltered training to encourage the sentence embeddings in different languages to be similar, enabling zero-shot reuse of the classifier(s).]
The XeroAligned XLM-R model (XLM-RA) achieves SOTA scores on three XNLU datasets, exceeds the text classification performance of XLM-R trained with labelled data and performs on par with SOTA models on an adversarial paraphrasing task.

Related Work
In order to cluster prior work, we formulate an approximate taxonomy in Table 1 for the purposes of positioning our approach in the most appropriate context. The relevant zero-shot transfer methods can generally be grouped by a) whether the alignment is targeted at each task, i.e. is task-specific [TS] or is task-agnostic [TA] and b) whether the alignment is applied to the model [MA] or data [DA]. Our contribution falls mostly into the [MA,TS] category although close methodological similarities are also found in the [MA,TA] group.

[Table 1: Approximate taxonomy of zero-shot cross-lingual transfer methods, grouped by task-specific [TS] vs. task-agnostic [TA] and model alignment [MA] vs. data alignment [DA].]

Transformer-based PXLMs. For transformer-based PXLMs, two basic types of representations are commonly used: 1) a sentence embedding for tasks such as text classification (Conneau et al., 2018) or sentence retrieval (Zweigenbaum et al., 2018), which use the [CLS] representation of the full input sequence, and 2) token embeddings, which are used for structured prediction (Pan et al., 2017).

[MA,TA] Hu et al. (2020a) proposed two objectives for cross-lingual zero-shot transfer: a) sentence alignment and b) word alignment. While contrastive learning (CL) is not mentioned, the proposed sentence alignment closely resembles contrastive learning with one encoder (e.g. SimCLR). Taking the average of the contextualised token representations as the input representation (as an alternative to the [CLS] token), the model predicts the correct translation of a sentence within a batch of negative samples. An improvement is observed for text classification and sentence retrieval but not structured prediction. The alignment was applied to a 12-layer multilingual BERT and the scores are comparable to the translate-train baseline (translate the data and train normally). Instead, we use one of the best publicly available models, XLM-R from Huggingface, as our starting point, since an improvement in a weaker baseline is not guaranteed to carry over to a stronger model that may have already subsumed those upgrades during pretraining.
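The batch-level sentence alignment described above (predicting the correct translation within a batch, with the other batch members acting as negatives) can be sketched as follows. This is our illustrative reconstruction, not the authors' code; the function name and cosine normalisation are our assumptions:

```python
import numpy as np

def in_batch_alignment_loss(src_emb, tgt_emb):
    """Cross-entropy over a batch similarity matrix: the i-th source
    sentence should rank its own translation (the diagonal entry)
    above the other in-batch candidates, which act as negatives."""
    # Cosine-normalise so dot products are bounded similarity scores.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    logits = src @ tgt.T                                 # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_probs)))
```

With correctly paired source/target rows the loss is lower than with shuffled targets, which is the training signal that pulls translations together.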
Contrastive alignment based on MoCo with two PXLM encoders was proposed by Pan et al. (2020). Using an L2-normalised [CLS] token with a non-linear projection as the input representation, the model was aligned on 250K to 2M parallel sentences with added Translation Language Modelling (TLM) and a code-switching augmentation. No ablation for MoCo was provided to estimate its effect, although the combination of all methods did provide improvements with multilingual BERT as the base learner. Another model inspired by CL is InfoXLM (Chi et al., 2020), pretrained with TLM, multilingual Masked Language Modelling (mMLM) and cross-lingual contrastive learning called XLCo. Like MoCo, it uses two encoders that take the [CLS] token (or the layer average) as the sentence representation, from layer 8 (base model) or layer 12 (large model). Ablation showed a 0.2-0.3 accuracy improvement on XNLI and MLQA (Lewis et al., 2019). Reminiscent of earlier work (Hermann and Blunsom, 2014), the task-agnostic sentence embedding model LaBSE (Language-agnostic BERT Sentence Embeddings; Feng et al., 2020) uses the [CLS] representations of two BERT encoders (compared to our single encoder) with a margin loss and 6 billion parallel sentences to generate multilingual representations. While similarities exist, our multi-task alignment is an independently devised, more efficient, task-specific and simplified version of the aforementioned approaches.
[DA,TS] Zero-shot cross-lingual models often use machine translation to provide a training signal. This is a straightforward data transformation for text classification tasks, given that adequate machine translation models exist for many language pairs. However, for structured prediction tasks such as Slot Filling or Named Entity Recognition, the non-trivial task of aligning token/data labels can lead to improved cross-lingual transfer as well. One of the most widely used word alignment methods is fastalign (Dyer et al., 2013). Frequently used as a baseline, it aligns the word indices in parallel sentences in an unsupervised manner, prior to regular supervised learning. In some scenarios, fastalign can approach SOTA scores for slot filling (Schuster et al., 2018); however, the quality of alignment varies between languages and can even degrade performance below baseline. An alternative data alignment approach called CoSDA (Qin et al., 2020) uses code-switching as data augmentation. Random words in the input are translated and replaced to make model training highly multilingual, leading to improved cross-lingual transfer. Attempts were also made to automatically learn how to code-switch. While improvements were reported, it is uncertain how much SOTA models would benefit.
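A code-switching augmentation of the kind described above can be sketched in a few lines. This is a toy illustration in the spirit of CoSDA, not its actual implementation; the function name, dictionary format and replacement rate are our assumptions:

```python
import random

def code_switch(tokens, bilingual_dict, p=0.3, rng=None):
    """Randomly replace words with a translation drawn from a toy
    bilingual dictionary, making the training input multilingual.
    p is the per-token replacement probability (illustrative value)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for tok in tokens:
        translations = bilingual_dict.get(tok.lower())
        if translations and rng.random() < p:
            out.append(rng.choice(translations))
        else:
            out.append(tok)
    return out
```

For example, with a one-entry dictionary mapping "weather" to German "wetter" and `p=1.0`, every dictionary word in the utterance is switched, while `p=0.0` leaves the input untouched.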
[MA,TS] Continuing with label alignment for slot filling, the soft-align approach predicts and aligns slot labels jointly during training instead of modifying data labels explicitly before fine-tuning. While soft-align improves on fastalign, the difficulty of label alignment makes it challenging to improve on the SOTA. For text classification tasks such as Cross-lingual Natural Language Inference (Conneau et al., 2018), an adversarial cross-lingual alignment was proposed by Qi and Du (2020). Adding a self-attention layer on top of multilingual BERT (Devlin et al., 2018) or XLM (Lample and Conneau, 2019), the model learns the XNLI task while trying to fool a language discriminator in order to produce language-agnostic input representations. While improvements over baselines were reported, the best scores were around 2-3 points behind the standard XLM-R model.

Methodology
We introduce XeroAlign, a conceptually simple, efficient and effective method for task-specific alignment of sentence embeddings generated by PXLMs, aimed at effective zero-shot cross-lingual transfer. XeroAlign is an auxiliary loss function that is jointly optimised with the primary task, e.g. text classification and/or slot filling, as shown in Figure 1. We use a standard architecture for each task and only add the minimum required number of new parameters. For text classification tasks, we use the [CLS] token of the PXLM as our pooled sentence representation. A linear classifier (hidden size × number of classes) is learnt on top of the [CLS] embedding using cross-entropy as the loss function (TASK A in Figure 1). For slot filling, we use the contextualised representations of each token in the input sequence. Once again, a linear classifier (hidden size × number of slots) is learnt with a cross-entropy loss (TASK B in Figure 1).
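The two linear task heads can be sketched as follows. This is a minimal numpy illustration of the described architecture, not the actual implementation; `task_losses`, `softmax_xent` and their signatures are our invention:

```python
import numpy as np

def softmax_xent(logits, label):
    """Cross-entropy for a single example, computed stably in log-space."""
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def task_losses(hidden, intent_W, slot_W, intent_label, slot_labels):
    """hidden: (seq_len, dim) contextual embeddings, hidden[0] = [CLS].
    intent_W: (dim, n_intents) linear head on [CLS] (TASK A);
    slot_W: (dim, n_slots) linear head on each token (TASK B)."""
    intent_loss = softmax_xent(hidden[0] @ intent_W, intent_label)
    slot_loss = np.mean([softmax_xent(h @ slot_W, y)
                         for h, y in zip(hidden[1:], slot_labels)])
    return float(intent_loss), float(slot_loss)
```

Both heads are plain linear projections over the PXLM outputs, which is the "minimum required number of new parameters" the text refers to.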
Algorithm 1 shows a standard training routine augmented with XeroAlign. Let PXLM be a pretrained cross-lingual transformer language model, X be the standard English training data and U be the machine-translated parallel utterances (from X). The English utterances were translated into each target language using our internal machine translation service; a public online translator, e.g. Google Translate, can also be used. For the PAWS-X task, we use the public version of the translated data 1.

Algorithm 1: XeroAlign training
for (x_s, y), x_t ∈ X, U do
    task_loss ← task_loss_fn(x_s, y)
    align_loss ← sim(CLS_S, CLS_T)
    total_loss ← task_loss + align_loss
    # update model parameters
end for

We obtain the CLS_S and CLS_T embeddings by taking the first token of the PXLM output sequence for the source x_s and target x_t sentences respectively. Using a Mean Squared Error loss as our similarity function sim, we compute the distance/loss between CLS_S and CLS_T. The sum of the losses (total_loss) is then backpropagated normally. We conducted all XeroAlign training as multi-task learning for the following reason: when the PXLM is aligned first, followed by primary task training, the PXLM exhibits poor zero-shot performance. Similarly, learning the primary task first, followed by XeroAlign, fails as the primary task is partially unlearned during alignment. This is most likely due to the catastrophic forgetting problem in deep learning (Goodfellow et al., 2013), hence the need for joint optimisation.
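The joint objective of Algorithm 1 can be sketched as a single function, assuming the [CLS] embeddings and the primary task loss have already been computed by the encoder and task head; the function name is ours:

```python
import numpy as np

def xeroalign_loss(cls_src, cls_tgt, task_loss):
    """total_loss = task_loss + align_loss, where the auxiliary
    align_loss is the mean squared error between the source- and
    target-language [CLS] embeddings (the sim function in Algorithm 1)."""
    align_loss = float(np.mean((cls_src - cls_tgt) ** 2))
    return task_loss + align_loss, align_loss
```

When the two [CLS] embeddings coincide, the alignment term vanishes and only the primary task loss is backpropagated; any cross-lingual mismatch adds a penalty that pulls the embeddings together.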

Experimental Setup
In order to make our method easily accessible and reproducible 2, we use the publicly available XLM-R transformer from Huggingface (Wolf et al., 2019) built on top of PyTorch (Paszke et al., 2019). We set a single seed for all experiments and a single learning rate for each dataset. No hyperparameter sweep was conducted, to ensure a robust, low-resource, real-world deployment and to make a fair comparison with SOTA models. XLM-R was XeroAligned over 10 epochs and optimised using Adam (Kingma and Ba, 2014) with a OneCycleLR (Smith and Topin, 2019) scheduler.

Datasets
We evaluate XeroAlign with four datasets covering 11 unique languages (en, de, es, fr, th, hi, ja, ko, zh, tr, pt) across three tasks (intent classification, slot filling, paraphrase detection).

2 Email Milan Gritta to request code and/or data.

PAWS-X (Yang et al., 2019) is a multilingual version of PAWS, a binary classification task for identifying paraphrases. Examples were sourced from Quora Question Pairs 3 and Wikipedia, chosen to mislead simple 'word overlap' models. PAWS-X contains 4,000 random examples from PAWS for the development and test set, covering seven languages (en, de, es, fr, ja, ko, zh), totalling 48,000 human-translated paraphrases. We use the multilingual train sets, which contain approximately 49K machine-translated examples.
MTOD is a Multilingual Task-Oriented Dataset provided by Schuster et al. (2018). It covers three domains (alarm, weather, reminder) and three languages of different sizes: English (43K), human-translated Spanish (8.3K) and Thai (5K). MTOD comprises two correlated NLU tasks, intent classification and slot filling. The SOTA scores are reported by Schuster et al. (2018), among others.
MTOP is a Multilingual Task-Oriented Parsing dataset covering interactions with a personal assistant. We use the standard flat version, which has the highest reported zero-shot SOTA scores. A tree-like compositional version of the data, designed for nested queries, is also provided. MTOP contains 100K+ human-translated examples in 6 languages (en, de, es, fr, th, hi) spanning 11 domains.
MultiATIS++ is an extension of the Multilingual version of ATIS (Upadhyay et al., 2018), initially translated into Hindi and Turkish only. Six new (human-translated 4) languages (de, es, fr, zh, ja, pt) were added with ∼4 times as many examples each (around 6K per language), for 9 languages in total. Both of these datasets are based on the original English-only ATIS (Price, 1990), featuring users interacting with an automated air travel information service (via intent recognition and slot filling tasks).
3 https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
4 We have encountered some minor issues with slot annotations. Around 60-70 entities across 5 languages (fr, zh, hi, ja, pt) had to be corrected as the number of slot tags did not agree with the number of tokens in the sentence. However, this only concerns a tiny fraction of the ∼400K+ tags/tokens covered by those languages. We are happy to share the corrections, too.

Metrics
We use standard evaluation metrics, that is, accuracy for paraphrase detection and intent classification, and F-Score 5 for slot filling.
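Slot-filling F-Score is conventionally computed over entity spans rather than individual tokens. A minimal sketch of such a span-level F1 follows; this is our reconstruction of the standard CoNLL-style metric, not the paper's exact evaluation script:

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes last span
        boundary = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype)
        if boundary and start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

def slot_f1(gold_seqs, pred_seqs):
    """Micro-averaged span F1 over a corpus of BIO-tagged sequences.
    A span counts as correct only if its boundaries AND type match."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(bio_spans(gold)), set(bio_spans(pred))
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Note the all-or-nothing span matching: a prediction that recovers only part of a multi-token slot scores zero for that entity, which is one reason slot F-Scores trail classification accuracy.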

Results and Analysis
We use 'XLM-R Target' to refer to model performance on the labelled target language. We provide zero-shot scores (denoted 'XLM-R 0-shot'), the XLM-RA results and the reported SOTA figures. For PAWS-X, we provide a second baseline called 'Translate-Train', which comprises the union of Target and English train data. Scores are given for the large 6 model unless specified otherwise.

Zero-shot Text Classification
The intent classification accuracy of our XeroAligned XLM-R exceeds that of XLM-R trained with labelled data, averaged across three task-oriented XNLU datasets and 15 test sets (Tables 2, 5 and 6). Starting from an already competitive baseline model, XeroAlign improves intent classification by ∼5-10 points (more for XLM-R-base, see Table 7 in Section 4.4). The benefits of cross-lingual alignment are particularly evident in low-resource languages (tr, hi, th), which is encouraging for real-world applications with limited resources. Zero-shot paraphrase detection is another instance of text classification. We report XLM-RA accuracy in Table 3, which exceeds both the Target and Translate-Train averages by over 1 point and the zero-shot XLM-R baseline by almost 3 points (even more so for XLM-RA-base).
Note that the amount of training data is the same for XeroAlign and Target (except MTOD), thus there is no advantage from using additional data. The primary task, which is learnt in English, has a somewhat higher average performance (∼1.5 points) than the Target languages. We hypothesise that transferring this advantage from a high-resource language via XeroAlign is the primary reason behind its effectiveness compared to using target data directly. Given that Target performance has recently been exceeded with MoCo (He et al., 2020) and the similarities between contrastive learning and XeroAlign, our finding seems in line with this recent work, which is the subject of ongoing research (Zhao et al., 2020).

Zero-shot Structured Prediction
While XLM-RA is able to exceed Target accuracy for text classification tasks, even our best F-Scores for slot filling are 8-19 points behind Target. This is despite a strong average improvement of +4.1 on MTOP, +5.7 on MultiATIS++ and +5.2 on MTOD for the XLM-R-large model (greater for the XLM-RA-base model). We think the gap is primarily down to the difficulty of the sequence labelling task, i.e. zero-shot text classification is 'easier' than zero-shot slot filling, as manifested by a ∼10-20 point gap between scores. Sentences in various languages have markedly different input lengths and token/entity order; thus, word-level inference in cross-lingual zero-shot settings becomes significantly more challenging than sentence-level prediction, since syntax plays a less critical role in sequence classification.
A less significant reason, related to XeroAlign's architecture, may be our choice to align the PXLM on the [CLS] embedding, which is subsequently used 'as is' for text classification tasks. Aligning individual token representations through the [CLS] embedding improves structured prediction as well; however, as the token embeddings are not directly used, the parameters in the uppermost transformer layer (following Multi-Head Attention) never receive any gradient updates from XeroAlign. Closing this gap is a challenging opportunity, which we reserve for future work. Once again, the languages with lower NLP resources (th, hi, tr) tend to benefit the most from cross-lingual alignment.

XeroAlign Generalisation
We briefly investigate the generalisation of XeroAlign, taking the PAWS-X task as our use case. We are interested in finding out whether aligning on just one language has any zero-shot benefits for other languages. Table 4 shows the XLM-RA results when aligned on a single language (rows) and tested on other languages (columns).

[Table 4: XLM-RA aligned on one PAWS-X language (rows), evaluated on others (columns). AVE = average, EU = European languages, AS = Asian languages.]
We can see that aligning on Asian languages (Japanese in particular) attains the best average improvement compared to aligning with European languages. This seems to reflect the known performance bias of XLM-R towards (high-resource) European languages, all of which show a strong improvement regardless of the alignment language. Aligning jointly on the European languages (de, es, fr) improves the average to 90.4, but aligning jointly on the Asian languages (zh, ko, ja) does not improve over Japanese alone (90.8).
In any case, it is notable that the XLM-R model XeroAligned on just a single language is able to carry this advantage well beyond that language, improving average accuracy by 1.5-2.5 points over the baseline (88.3) from Table 3. This effect is even stronger for MTOP (+4 accuracy, +3 F-Score).

Smaller Language Models
We observed that the XeroAligned XLM-R-base model shows an even greater relative improvement than its larger counterpart with 24 layers and 550M parameters. To this end, we report the XLM-RA-base results (12 layers, 270M parameters) in Table 7 as the average scores over all languages for MTOP, PAWS-X, MTOD and MultiATIS++. We use the relative % improvement over the baseline XLM-R to compare the models fairly. The paraphrase detection accuracy improves by 3.3% for the large (L) PXLM versus 6.5% for the base (B) model. Across three XNLU datasets, XeroAlign improves the standard XLM-R by 9.5% (L) versus 14.2% (B) on structured prediction (slot filling) and by 7.1% (L) versus 19.8% (B) on text classification (intent recognition). Therefore, applications with lower computational budgets can also achieve competitive performance with our simple cross-lingual alignment method for transformer-based PXLMs. In fact, the base XLM-RA can reach (on average) up to 90-95% of the performance of its larger sibling using lower computational resources.

Discussion
The XLM-RA intent classification accuracy is (on average) within ∼1.5 points of English accuracy across three task-oriented XNLU datasets. However, the PAWS-X paraphrase detection accuracy is almost 5 points below the English model, which is also the case for other state-of-the-art PXLMs in Table 3. Why does XLM-R struggle more to generalise on this task for languages other than English? We can exclude translation issues since all models used the publicly available PAWS-X machine-translated data. Instead, we think that the greater than expected deficit may be caused by a) domain/topic shift within the dataset and b) possible data leakage for English. The original PAWS data was sourced from Quora Question Pairs and Wikipedia, with neither being limited to any particular domain. As the English Wikipedia provides a large chunk of the English training data for XLM-R, it is possible that some of the English PAWS sentences may have been seen during pretraining, which could explain the smaller generalisation gap for English.
We also want to find out whether this gap diminishes if we artificially remove the domain shift. To this end, we use parallel utterances (but not task labels) from the development and test sets to XeroAlign the XLM-R on an extended vocabulary that may not be present in the train set. We observe that the (Exp) model in Table 3 shows an average improvement of over 2 points compared to the best XLM-RA and other SOTA models, suggesting that the increased generalisation gap may be caused by a domain shift for non-English languages on this task. When that topic shift is (perhaps artificially) removed, the model is able to bring accuracy back to within ∼2 points of the English model (in line with the XNLU tasks). Note that this effect can be masked for English due to the language biases in the data used for pretraining.
In Section 2, we outlined the most conceptually similar methods, which conducted large-scale model pretraining with task-agnostic parallel sentence alignment as part of the training routine (Hu et al., 2020a; Feng et al., 2020; Pan et al., 2020; Chi et al., 2020). Where ablation studies were provided, the average improvement attributed to contrastive alignment was ∼0.2-0.3 points (though the tasks were slightly different). While we do not directly compare XeroAlign to contrastive alignment, it seems that task-specific alignment may be a more effective and efficient technique to improve zero-shot transfer, given the magnitude of our results. This leads us to conclude that the effectiveness of our method comes primarily from cross-lingual alignment of the task-specific vocabulary. Language is inherently ambiguous and the semantics of words and phrases shift somewhat from topic to topic; therefore, a cross-lingual alignment of sentence embeddings within the context of the target task should lead to better results. Our simplified, lightweight method only uses translated task utterances, a single encoder model and positive samples, the alignment of which is challenging enough without arbitrary negative samples. In fact, this is the main barrier to applying contrastive alignment in task-specific NLP scenarios, i.e. the lack of carefully constructed negative samples. For smaller datasets, random negative samples would mean that the task is either too easy to solve, resulting in no meaningful learning, or that the model would receive conflicting signals by training on false positive examples, leading to degenerate learning.

Future Work
Our recommendations for promising follow-up research involve any of the following: i) aligning more tasks such as Q&A, Natural Language Inference, Sentence Retrieval, etc., ii) including additional languages, especially low-resource ones (Joshi et al., 2020), and iii) attempting large-scale, task-agnostic alignment of PXLMs followed by task-specific alignment, which is reminiscent of the common transfer learning paradigm of pretraining with Masked Language Modelling before fine-tuning on the target task. To that end, there is already some emergent work on monolingual fine-tuning with an additional contrastive loss (Gunel et al., 2020). For the purposes of multilingual benchmarks (Hu et al., 2020b; Liang et al., 2020) or other purely empirical pursuits, an architecture or language-specific hyperparameter search should optimise XLM-RA for significantly higher performance, as the large transformer does not always outperform its smaller counterpart and because our hyperparameters remained fixed for all languages. Most importantly, follow-up work needs to improve zero-shot transfer for cross-lingual structured prediction such as Named Entity Recognition (Pan et al., 2017), POS Tagging (Nivre et al., 2016) or Slot Filling (Schuster et al., 2018), which still lags behind Target scores.

Conclusions
We have introduced XeroAlign, a conceptually simple, efficient and effective method for task-specific alignment of sentence embeddings generated by PXLMs, aimed at effective zero-shot cross-lingual transfer. XeroAlign is an auxiliary loss function that is easily integrated into the unaltered primary task/model. XeroAlign leverages translated data to bring the sentence embeddings in different languages closer together. We evaluated XeroAligned XLM-R models (named XLM-RA) on zero-shot cross-lingual text classification, adversarial paraphrase detection and slot filling tasks, achieving SOTA (or near-SOTA) scores across 4 datasets covering 11 unique languages. Our ultimate vision is a level of zero-shot performance at or near that of Target. The XeroAligned XLM-R partially achieved that goal by exceeding the intent classification and paraphrase detection accuracies of XLM-R trained with labelled data.