Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification

Recent multilingual pre-trained models, such as XLM-RoBERTa (XLM-R), have been shown to be effective on many cross-lingual tasks. However, gaps remain between the contextualized representations of similar words in different languages. To address this problem, we propose a novel framework named Multi-View Mixed Language Training (MVMLT), which leverages code-switched data with multi-view learning to fine-tune XLM-R. MVMLT uses gradient-based saliency to extract the keywords most relevant to downstream tasks and dynamically replaces them with their counterparts in the target language. Furthermore, MVMLT utilizes multi-view learning to encourage contextualized embeddings to align into a more refined language-invariant space. Extensive experiments on four languages show that our model achieves state-of-the-art results on zero-shot cross-lingual sentiment classification and dialogue state tracking tasks, demonstrating the effectiveness of the proposed model.


Introduction
Thanks to the availability of large labeled datasets and parallel corpora, neural network models have achieved remarkable performance on a variety of natural language processing (NLP) tasks. However, large-scale, high-quality training data is generally available in only a few languages. Manually collecting or translating training data for different languages is time-consuming and expensive, which inevitably creates a large performance gap between models for high-resource languages (e.g., English and French) and models for low-resource languages (e.g., Swahili and Urdu).
Cross-lingual transfer learning (CLTL) aims to bridge this gap by transferring knowledge learned from a resource-rich (source) language to a resource-lean (target) language (Yarowsky and Wicentowski, 2001). The main idea of CLTL is to learn a shared language-invariant feature space for both languages, so that a model trained on the source language can be applied to the target language directly. Recently, cross-lingual contextualized embedding methods such as multilingual BERT (mBERT) (Devlin et al., 2018), XLM (Conneau and Lample, 2019), and XLM-RoBERTa (XLM-R) have achieved state-of-the-art results on a variety of zero-shot cross-lingual tasks. However, these BERT-style Transformer (Vaswani et al., 2017) architectures, which train cross-lingual embeddings via self-supervised masked language modeling on monolingual corpora, may not capture the semantic similarity of subwords across languages well.
To alleviate inconsistent contextualized representations across languages, prior work has introduced supervised cross-lingual signals (Kulshreshtha et al., 2020a), e.g., bilingual dictionaries and parallel corpora. Qin et al. (2020) propose a data augmentation framework called code-switching, or mixed language training, which randomly chooses a set of words and replaces them with their counterparts in a different language. For example, "I 喜欢 this 电影 so much" ("I like this movie so much") is a code-switched sentence. They use only a bilingual dictionary to generate code-switched data for fine-tuning mBERT, which encourages the model to align representations across languages. Nevertheless, this method has two main problems: (1) the importance of different words in a document is ignored, since all words are replaced with the same probability; replacing unimportant words increases the translation burden and can even introduce noise that impairs the semantic coherence of the sentence; (2) only the code-switched corpus is used to fine-tune mBERT, while the relation between original and code-switched sentences is ignored entirely, which may discard interactive information and hinder further alignment of the contextualized embeddings.
To address these issues, we propose a new framework named Multi-View Mixed Language Training (MVMLT), which leverages code-switched data with multi-view learning for zero-shot cross-lingual transfer. MVMLT first uses a gradient-based saliency method to find keywords with high saliency scores for the downstream task (Section 3.1). For example, in cross-lingual sentiment classification, words carrying sentiment information (e.g., "excellent", "interesting", and "boring") should receive higher saliency scores than background words (e.g., "the", "a", and "what"). Relying on a bilingual dictionary, we replace these keywords with their counterparts in the target language to generate code-switched data (Section 3.2). These code-switched keywords are essential for effective cross-lingual transfer because they intersect the two languages and allow the shared encoder to learn a direct tying of meaning across languages. Selecting the most task-related keywords by saliency detection therefore facilitates cross-lingual performance by providing a strong tie between languages.
Furthermore, MVMLT acquires comprehensive cross-lingual information from different perspectives and exploits the consistency of multiple views by means of multi-view learning (Xu et al., 2013). Specifically, MVMLT constructs two views from the multilingual pre-trained model XLM-R: (1) the encoded feature representation of the original sentence; (2) the encoded feature representation of the corresponding code-switched sentence. The key to cross-lingual transfer is learning a language-invariant feature space, so these two feature representations should be as similar as possible. We therefore use multi-view learning to enforce a consensus between the two views, which encourages similar words in different languages to align into a shared latent space (Section 3.3).
In summary, our main contributions are as follows: • We propose a saliency-based mixed language training (MLT) framework that utilizes gradient-based saliency to select task-related words for code-switching. Focusing on these keywords allows the model to transfer cross-lingual signals more efficiently.
• We leverage multi-view (MV) learning to constrain the representations of the original and code-switched sentences to be consistent, building a refined language-invariant space that is more robust to language shift than previous zero-shot cross-lingual transfer work (Liu et al., 2020; Fei and Li, 2020; Qin et al., 2020).
• We extensively evaluate MVMLT on cross-lingual sentiment classification and dialogue state tracking tasks in four languages under the zero-shot setting; it achieves state-of-the-art results on 10/11 tasks, demonstrating its effectiveness.
Related Work

Cross-Lingual Transfer Learning
Cross-lingual transfer learning aims to leverage knowledge learned in the source language to handle the related task in the target language. Learning cross-lingual word embeddings (CLWE) (Mikolov et al., 2013) is a successful approach to CLTL, which uses a bilingual dictionary to project words with the same meaning close to each other. More recently, cross-lingual contextualized embeddings use some form of language modeling to pre-train multilingual representations, which are then fine-tuned on the relevant tasks and transferred to different languages directly. Multilingual pre-trained models such as multilingual BERT (Devlin et al., 2018), XLM (Conneau and Lample, 2019), and XLM-RoBERTa have been successfully applied to zero-shot cross-lingual transfer on various tasks (Wu and Dredze, 2019; Pires et al., 2019), e.g., document classification, named entity recognition, and dependency parsing. In addition, these multilingual pre-trained models can be further improved by alignment methods (Kulshreshtha et al., 2020b; Cao et al., 2020) such as rotation-based alignment and fine-tuning alignment. Our work is inspired by Qin et al. (2020), who propose a data augmentation framework and use task-related parallel word pairs to generate code-switched sentences for fine-tuning mBERT. The difference is that we use saliency detection to choose keywords rather than selecting words randomly. Moreover, we leverage code-switched data with multi-view learning to further align representations across languages.

Multi-View Learning
Multi-view learning, which learns from different views containing complementary information and exploits the consistency among them, has been widely used in NLP. Clark et al. (2018) proposed Cross-View Training (CVT), a self-training algorithm that works well for neural sequence models. Other work unified multiple views of entities to learn better embedding representations for entity alignment. Fei and Li (2020) proposed the multi-view encoder-classifier (MVEC) for sentiment classification, which enforces a consensus between multiple views (i.e., the encoded sentences in the source language and the encoded back-translations of the source sentences from the target language) generated by an encoder-decoder framework. Unlike MVEC, our model employs multi-view training to constrain the encoded representations of the original and code-switched sentences to be consistent, without using a parallel corpus.

Saliency Detection
Since attention mechanisms (Bahdanau et al., 2014) boosted performance on many NLP tasks, using attention weights to explain model predictions has become a common approach (Wang et al., 2016; Lin et al., 2017; Ghaeini et al., 2018). However, recent work (Serrano and Smith, 2019; Jain and Wallace, 2019) casts doubt on the interpretability of attention. Moreover, Bastings and Filippova (2020) argued that saliency methods are more suitable for model explanation.
There are three families of saliency methods for NLP as alternatives to attention (Arras et al., 2019): gradient-based (Denil et al., 2014), propagation-based (Bach et al., 2015), and occlusion-based (Zeiler and Fergus, 2014). In our work, we adopt the gradient-based saliency method to select important words for code-switching.

Methodology
Suppose we have two monolingual datasets: D_src, the labeled data available only in the source language L_S, and D_tgt, the unlabeled data in the target language L_T. We aim to use D_src to train a universal classification model and to predict the corresponding labels for the unseen-language data D_tgt.
The architecture of our model is illustrated in Figure 1 and consists of three components: (1) gradient-based keyword selection: selecting keywords in the training set and building a code-switched dictionary; (2) dynamic code-switching: code-switching the input sentence dynamically; (3) multi-view training: training the encoder with multi-view learning. We elaborate on each component in this section.

Gradient-based Keyword Selection
Intuitively, each word in a sentence has a different influence when training a classification model. We call the words with greater impact on the model keywords. Different tasks or domains usually have different keywords: for news classification, the keyword set should include words like "military", "salary", and "sport"; for sentiment classification, it should include words like "interesting", "fascinating", and "unworthy". Given a vocabulary V containing v words from a dataset, we need to find a salient subset of keywords K ⊆ V for code-switching that benefits the downstream task, so we use saliency scores to select keywords. Gradient-based saliency computes the gradient of the loss L with respect to each token in the input text, and the magnitude of the gradient serves as a feature importance score (Arras et al., 2019).
Formally, let x_i^S = (w_i^1, w_i^2, ..., w_i^n) denote the i-th sentence with n words from D_src, and let \mathcal{L}_{\hat{y}} be the loss between the model's prediction \hat{y}_i and the ground truth y_i. For each token w_i ∈ x_i^S, we define the saliency score as:

S(w_i) = \nabla_{e(w_i)} \mathcal{L}_{\hat{y}} \cdot e(w_i),    (1)

where e(w_i) is the embedding of w_i. Thus, the saliency value is the dot product between the gradient of the prediction loss and the word embedding, which is referred to as Gradient × Input (Shrikumar et al., 2017). The gradient shows how much a word embedding contributes to the final decision, and the input leverages the sign and magnitude of the embedding. Note that multilingual pre-trained models tokenize words into subwords, so we average the subword saliency scores of each word to obtain its final score.
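As a concrete illustration, the Gradient × Input score can be computed with automatic differentiation. The following PyTorch sketch uses a toy linear model and squared-error loss as stand-ins for the paper's actual XLM-R setup; the function name and the toy setup are our assumptions, not the authors' implementation:

```python
import torch

def gradient_x_input_saliency(embeddings, loss):
    """Gradient x Input: per-token dot product between the gradient of the
    loss w.r.t. each token embedding and the embedding itself (Equation 1)."""
    grads = torch.autograd.grad(loss, embeddings, retain_graph=True)[0]
    # Sum over the embedding dimension -> one saliency score per token.
    return (grads * embeddings).sum(dim=-1)

# Toy example: 4 token embeddings of dimension 8 feeding a linear "model".
torch.manual_seed(0)
emb = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8)
logit = (emb.sum(dim=0) * w).sum()   # scalar prediction
loss = (logit - 1.0) ** 2            # squared error against target 1.0
scores = gradient_x_input_saliency(emb, loss)
print(scores.shape)                  # one score per token
```

In practice the per-subword scores would then be averaged per word, as the paper describes.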

[Figure 1: Model architecture. Left panel: gradient-based keyword selection with inverse document frequency builds a code-switched dictionary. Right panel: gradient-based code-switching. ORG: "nice picture book, it can be a good source of inspiration." CS: "hübsche picture buchen, it can be a gutes source of inspirieren."]
Equation 1 computes the local contribution of a token in one sentence, but we aim to build a global keyword set K over D_src. Following Yuan et al. (2019), we sum the saliency scores of all occurrences of a token w in D_src and multiply the sum by the inverse document frequency (IDF) of w:

G(w) = \mathrm{IDF}(w) \cdot \sum_{i:\, w_i = w} S(w_i),    (2)

where the sum runs over all occurrences of w in D_src and N is the total number of words in D_src. The IDF term balances word frequency against saliency by assigning words with high document frequency a lower weight, and vice versa. This is necessary because some irrelevant stop words (e.g., "of" and "a") accumulate high total saliency scores simply because they appear many times in the corpus.
The top-k salient words compose the keyword set K, and the bilingual dictionary MUSE (Conneau et al., 2017) is used to build a code-switched dictionary D = {(s_1, t_1), ..., (s_k, t_k)}, where s and t denote source- and target-language words, respectively, and k is the number of keywords. The influence of k on model performance is discussed in Section 5.3. The process of constructing the code-switched dictionary is illustrated in the left part of Figure 1.
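The IDF-weighted aggregation and top-k selection above can be sketched as follows. This is a simplified illustration with made-up tokens and scores; `select_keywords` and the log-based IDF formula are our assumptions, not the paper's exact implementation:

```python
import math
from collections import Counter, defaultdict

def select_keywords(sentences, saliency, k):
    """Sum per-occurrence saliency scores for each word, weight the sums by
    IDF (Equation 2), and return the top-k salient words.
    `sentences`: list of token lists; `saliency`: matching per-token scores."""
    total = defaultdict(float)
    doc_freq = Counter()
    n_docs = len(sentences)
    for toks, scores in zip(sentences, saliency):
        for w, s in zip(toks, scores):
            total[w] += s
        doc_freq.update(set(toks))          # count each word once per document
    # IDF down-weights frequent background words like "the".
    global_score = {w: total[w] * math.log(n_docs / doc_freq[w]) for w in total}
    return sorted(global_score, key=global_score.get, reverse=True)[:k]

sents = [["nice", "book", "the"], ["boring", "book", "the"], ["nice", "movie", "the"]]
sal = [[0.9, 0.2, 0.4], [0.8, 0.1, 0.5], [0.7, 0.3, 0.6]]
print(select_keywords(sents, sal, 2))  # → ['boring', 'nice']
```

Note how "the", despite a high total saliency, gets IDF weight log(3/3) = 0 and is filtered out.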

Dynamic Code-Switching
Given a source-language sentence x_org = (w_1, w_2, ..., w_n), we replace each word that appears in D with its translation with a certain probability. After this code-switching process, we obtain a code-switched sentence x_cs in which the selected words are replaced by their target-language counterparts. Because a source-language word may have multiple translations in the target language, we randomly choose one for each replacement. In addition, we reset the replacements after each epoch, i.e., different words are replaced at different epochs, which can be viewed as a form of data augmentation.
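The dynamic replacement step can be sketched as below. The function, the toy English-German dictionary, and the parameter names are illustrative assumptions; the paper's actual dictionary comes from MUSE:

```python
import random

def code_switch(tokens, cs_dict, p=0.5, rng=random):
    """Replace each token found in the code-switched dictionary with one of
    its translations with probability p. Re-running this every epoch yields
    different replacements, acting as data augmentation."""
    out = []
    for w in tokens:
        if w in cs_dict and rng.random() < p:
            out.append(rng.choice(cs_dict[w]))  # a word may have several translations
        else:
            out.append(w)
    return out

cs_dict = {"nice": ["hübsche"], "book": ["buchen", "buch"], "good": ["gutes"]}
sent = "nice picture book it can be a good source".split()
print(code_switch(sent, cs_dict, p=1.0, rng=random.Random(0)))
```

With p=1.0 every dictionary word is replaced; with the paper's ratios (0.5-0.7), each epoch produces a different mixed-language sentence.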

Multi-View Training
We train MVMLT on the XLM-R architecture with multi-view learning. We first feed the original sentence x_org and the code-switched sentence x_cs into a shared XLM-R model separately:

h_{org} = \mathrm{XLM\text{-}R}(x_{org}), \quad h_{cs} = \mathrm{XLM\text{-}R}(x_{cs}),    (3)

where h_org and h_cs are the aggregated sentence representations of the original and code-switched sentences, respectively.
For classification tasks, we feed h_org and h_cs into a classification layer:

p_{org} = \mathrm{softmax}(W h_{org} + b), \quad p_{cs} = \mathrm{softmax}(W h_{cs} + b),    (4)

where p_org and p_cs are the task-specific probabilities over all candidates, and W and b are learnable parameters.
Our main learning objective is to train the classifier to match the predicted labels with the ground truth, so we minimize the cross-entropy loss between p_org and the ground-truth label p:

\mathcal{L}_{ce} = \mathrm{CE}(p, p_{org}).    (5)

On the other hand, we want the encoder output to be language-invariant. To achieve this, we leverage multi-view learning to exploit a more comprehensive representation from multiple views, which usually contain complementary information. We consider two views: (1) the original sentence representation h_org; (2) the code-switched sentence representation h_cs. The central assumption of MVMLT is that an ideal model for cross-lingual transfer should learn feature representations that perform well in the source language and are invariant to the shift to the target language. We therefore enforce a consensus between the two views, i.e., the predicted distributions on the two views should be as similar as possible:

\mathcal{L}_{kl} = \mathrm{KL}(p_{org} \,\|\, p_{cs}),    (6)

where KL denotes the Kullback-Leibler divergence (Kullback and Leibler, 1951), which measures the difference between two distributions.
The final objective combines the cross-entropy loss (Equation 5) and the KL divergence loss (Equation 6):

\mathcal{L} = \mathcal{L}_{ce} + \lambda_{kl} \mathcal{L}_{kl},    (7)

where λ_kl is a hyper-parameter trading off the cross-entropy loss against the KL divergence loss, preventing the latter from drifting too far. The process of multi-view learning is illustrated in the right part of Figure 1.
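The combined objective can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the function name is ours, and the direction of the KL term (code-switched view pulled toward the original view) is an assumption, since the paper does not specify it:

```python
import torch
import torch.nn.functional as F

def mvmlt_loss(logits_org, logits_cs, labels, lambda_kl=1.0):
    """Cross-entropy on the original view plus a KL consensus term
    between the two views' predicted distributions (Equations 5-7)."""
    ce = F.cross_entropy(logits_org, labels)
    kl = F.kl_div(
        F.log_softmax(logits_cs, dim=-1),   # input must be log-probabilities
        F.softmax(logits_org, dim=-1),      # target is a probability distribution
        reduction="batchmean",
    )
    return ce + lambda_kl * kl

torch.manual_seed(0)
logits_org = torch.randn(4, 3)              # batch of 4, 3 classes
labels = torch.tensor([0, 2, 1, 0])
# Identical views -> the KL term vanishes and only cross-entropy remains.
same = mvmlt_loss(logits_org, logits_org.clone(), labels)
ce_only = F.cross_entropy(logits_org, labels)
print(torch.allclose(same, ce_only))  # True
```

Because the KL term is non-negative, the total loss is always at least the cross-entropy of the original view; λ_kl controls how strongly the two views are pulled together.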

Experiments
We evaluate the effectiveness of our proposed method on zero-shot cross-lingual dialogue state tracking and sentiment classification tasks in four languages. Specifically, English is the source language, and the target languages are German, Italian, French, and Japanese.

Datasets
Sentiment Classification (SC) For the sentiment classification task, we use the multilingual multi-domain Amazon review dataset (Prettenhofer and Stein, 2010), which contains three domains: book, DVD, and music. Each domain contains reviews in four languages: English, German, French, and Japanese, giving 9 tasks in total. There are 1,000 positive and 1,000 negative reviews for each domain in each language. We use English as the source language and the others as target languages. Following Fei and Li (2020), we combine the English training and test sets and randomly sample 20% (800)

Training Details
We use XLM-R-base as the encoder in Equation 3, with 12 Transformer blocks, 768 hidden units, and 12 self-attention heads. For the DST task, we use the Adam optimizer (Kingma and Ba, 2014) and set the learning rate to 1e-5, λ_kl to 1, the batch size to 8, the word replacement ratio to 0.5, and the keyword ratio to 0.1. For the SC task, the learning rate is 1e-6, λ_kl is 5, the batch size is 12, the replacement ratio is 0.7, and the keyword ratio is 0.4 for German and French and 0.5 for Japanese. Our approach is implemented in PyTorch, and all experiments are conducted on an NVIDIA Tesla P100. All reported results are averages over 5 runs with random seeds.

Comparison Methods
We compare MVMLT with the following strong baselines. MVEC: Fei and Li (2020) leveraged an unsupervised machine translation system to construct an encoder-decoder framework with a language discriminator.

Table 2: Results on Multilingual WOZ 2.0. Slot accuracy individually compares each slot-value pair to its ground-truth label; joint goal accuracy compares the predicted dialogue state to the ground truth at each dialogue turn. '†' denotes results from Liu et al. (2020). '‡' denotes our re-implementation of the method based on XLM-R.

Overall Performance
Results on SC and DST are shown in Table 1 and Table 2, respectively. The fine-tuned multilingual pre-trained models (mBERT, XLM, and XLM-R) outperform all previous methods by a large margin, which indicates that multilingual pre-trained models have a strong capacity for zero-shot cross-lingual transfer. Compared with these strong baselines, our model MVMLT yields significant improvements and achieves state-of-the-art performance on 10/11 tasks. In particular, on the SC task, MVMLT improves over CoSDA (Qin et al., 2020) by 3.37, 3.46, and 3.69 points on average for de, fr, and jp, respectively. On the DST task, MVMLT also achieves notable gains in both languages, especially in joint goal accuracy. These results demonstrate the effectiveness of the proposed MVMLT, which we mainly attribute to leveraging code-switched data with multi-view learning for cross-lingual transfer.

Table 3: Ablation study on the Amazon reviews dataset for three languages.
We also find that MVMLT improves XLM-R more when the target language is similar to the source language. For example, MVMLT brings large improvements when transferring to German, French, and Italian, but only limited improvement for Japanese. We hypothesize that this is because English and Japanese belong to different language families and have very different linguistic structures. In the code-switching process, word-for-word replacement disrupts the linguistic structure, especially for distant languages, so the English and Japanese sentence representations cannot simply be mapped into the same space.

Ablation Study
We conduct an ablation study to explore the effect of saliency detection and multi-view learning on the overall performance. The results are reported in Table 3.
w/o saliency: selecting keywords randomly rather than extracting them by saliency leads to approximately 1% degradation, which indicates that saliency detection reliably picks out the most important words in different downstream-task documents.
w/o multi-view: performance also degrades significantly when multi-view learning is replaced by simply mixing the original and code-switched sentences together and feeding them to the encoder independently. Without multi-view learning, the interactive information between original and code-switched sentences is ignored completely, so the distributions of the latent representations of the source and target languages diverge, which leads to a 2% performance degradation.

Number of keywords
[Figure 2: Test accuracy on the German book domain as a function of the keyword ratio k_v. Random denotes selecting keywords randomly; Saliency denotes selecting keywords by saliency detection.]

Figure 2 shows the influence of the two strategies (selecting keywords by gradient-based saliency vs. selecting them randomly) with respect to different keyword ratios k_v.

Effectiveness of Saliency Detection
The performance of random keyword selection declines significantly as k_v drops, while the saliency-based method still performs well even with just 1% of the words as keywords (about 200 words). This is because gradient-based saliency helps MVMLT prioritize the most indicative keywords for code-switching. These keywords serve as powerful anchor points (i.e., identical strings that appear in both languages in the training corpus) for cross-lingual transfer and provide sufficient cross-lingual information for aligning the representations of different languages into a shared space. As k_v increases, the additional keywords are less indicative, so they have a minor or even negative effect on model performance.
This demonstrates that MVMLT remains effective under a minimal translation budget by leveraging gradient-based saliency to detect the most task-related keywords. Appendix A.1 lists the top 10 extracted keywords and their translations into German, French, and Japanese for the SC corpus.

Visualization
We visualize the encoder outputs of different methods with t-SNE (Van der Maaten and Hinton, 2008) for 2,000 sampled English-German parallel sentences from the Amazon reviews dataset. The XLM-R results in Figure 3(a) show almost no overlap between the two languages' representations. CoSDA in Figure 3(b) reduces the distance between the representations by introducing code-switched sentences, but some parts of the space remain mismatched. By leveraging multi-view learning, MVMLT in Figure 3(c) significantly decreases the distributional discrepancy between English and German instances.
This demonstrates that MVMLT effectively learns language-invariant representations across languages through multi-view training.

Compared with other alignments
Furthermore, we try two other strategies that align multilingual embeddings directly.
Distance-based alignment minimizes the distance between the two contextual representations:

\mathcal{L}_{dist} = \| h_{org} - h_{cs} \|_2.    (8)

Similarity-based alignment maximizes the similarity between the two contextual representations; we use cosine similarity here:

\mathcal{L}_{sim} = -\cos(h_{org}, h_{cs}).    (9)

As the results in Table 4 show, minimizing the KL divergence between the two predicted distributions via multi-view learning is better than aligning contextual embeddings directly. Owing to the different semantic structures and translation biases across languages, forcing the encoded features to be exactly identical harms their representation ability, whereas multi-view learning merely encourages the two predicted distributions to be as close as possible, giving the model a softer way to learn language-invariant representations.
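The two direct-alignment baselines can be sketched as below. The function names are ours, and the exact reductions (mean over the batch) are assumptions; the sketch only illustrates the losses in Equations 8 and 9:

```python
import torch
import torch.nn.functional as F

def distance_alignment(h_org, h_cs):
    """L2 distance between the two sentence representations (Equation 8)."""
    return (h_org - h_cs).norm(dim=-1).mean()

def similarity_alignment(h_org, h_cs):
    """Negative cosine similarity (Equation 9): minimizing this
    pulls the two representations toward the same direction."""
    return -F.cosine_similarity(h_org, h_cs, dim=-1).mean()

torch.manual_seed(0)
h = torch.randn(4, 16)  # batch of 4 sentence representations, dim 16
# Identical representations -> zero distance and similarity loss near -1.
print(float(distance_alignment(h, h.clone())))    # 0.0
print(float(similarity_alignment(h, h.clone())))  # ≈ -1.0
```

Both losses force the encoded features themselves to match, which, as Table 4 suggests, is a harder constraint than matching the predicted distributions via KL divergence.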

MVMLT with Translate-Train
In this section, we add a third view, Translate-Train, which is the translation of the source-language sentences produced by a machine translation system trained on the Europarl corpus. The objective becomes:

\mathcal{L} = \mathcal{L}_{ce} + \lambda_{kl1} \mathrm{KL}(p_{org} \,\|\, p_{cs}) + \lambda_{kl2} \mathrm{KL}(p_{org} \,\|\, p_{trans}),

where p_trans is the predicted distribution of the translate-train view, and λ_kl1 and λ_kl2 are both set to 1.
The results are shown in Table 5. Translate-train further improves the performance of MVMLT by offering an additional view. On the one hand, translate-train compensates for code-switching occasionally breaking semantic coherence; on the other hand, code-switching offers more target-related information than translate-train. The model can therefore learn more robust cross-lingual representations from these complementary views.
However, introducing a full translation system is often overkill, because large parallel corpora may not be available for every language. Overall, MVMLT remains a simple yet efficient framework that achieves promising scores, making it better suited to low-resource and limited-budget scenarios.

Conclusion
In this paper, we propose Multi-View Mixed Language Training (MVMLT), a novel zero-shot cross-lingual transfer framework. Our approach uses gradient-based saliency to replace a few task-related words with their target-language translations and fine-tunes on the resulting code-switched data for downstream tasks. In addition, we introduce multi-view learning to construct a language-invariant feature space. Experiments show that our model achieves state-of-the-art results on cross-lingual sentiment classification and dialogue state tracking tasks. In the future, we will investigate the effectiveness of our approach in the multilingual setting and apply our model to more tasks.