Improving Cross-lingual Text Classification with Zero-shot Instance-Weighting

Cross-lingual text classification (CLTC) is a challenging task, made even harder by the lack of labeled data in low-resource languages. In this paper, we propose zero-shot instance-weighting, a general, model-agnostic zero-shot learning framework for improving CLTC through source-instance weighting. It adds a module on top of pre-trained language models that computes instance weights from similarity scores, thus aligning each source instance with the target language. During training, the framework updates parameters via gradient descent weighted by the instance weights. We evaluate the framework on three fundamental tasks across seven target languages and demonstrate its effectiveness and extensibility: it improves F1 score by up to 4% in single-source transfer and 8% in multi-source transfer. To the best of our knowledge, our method is the first to apply instance weighting to zero-shot CLTC. It is simple yet effective and easily extensible to multi-source transfer.


Introduction
Natural language processing (NLP) has benefited greatly from recent advances in deep learning and large-scale labeled data. Unfortunately, such labeled corpora are not available for all languages. Cross-lingual transfer learning is one way to spread this success from high-resource to low-resource languages. Cross-lingual text classification (CLTC) (Prettenhofer and Stein, 2010; Ni et al., 2011) learns a classifier in a low-resource target language by transferring from a resource-rich source language (Chen et al., 2018; Esuli et al., 2019).
Previous work has learned a classifier in the target language using a very small sample of labeled target instances or external corpora of unlabeled instances (Wang et al., 2019; Xu and Wan, 2017).
In addition, other resources that may be utilized to the same end include, but are not limited to, parallel corpora of unlabeled instances in the target language (Xu and Wan, 2017). In this work, we address the most challenging setting, zero-shot CLTC (Arnold et al., 2007; Joachims, 2003), where no resource in the target language is given. Among the many transfer learning methods that have been successfully employed in NLP (Mogadala and Rettinger, 2016; Zhou et al., 2016; Eriguchi et al., 2018), instance (re-)weighting is perhaps one of the oldest and best known (Wang et al., 2017, 2019). It is best illustrated when we are given access to a few labeled target instances (few-shot learning). For example, both Dai et al. (2007) and Wang et al. (2019) learn a classifier iteratively by assigning weights to each instance in the source training data. While Dai et al. (2007) assign weights to both source and target instances, Wang et al. (2019) pre-train a classifier on the source training data and then re-weight the labeled target instances. Crucially, the weights are a function of the error between the current classifier's prediction for an instance and that instance's gold label.
In the few-shot case, it is easy to see the appeal of re-weighting target language instances: an instance that incurs a higher prediction loss can be given a larger weight so as to improve the classifier. In the zero-shot case, however, it seems impossible to compute instance weights based on prediction loss. In this work, we make it possible to assign such weights to instances in zero-shot CLTC. To the best of our knowledge, this is the first attempt to apply such a method to NLP tasks.
Our contributions are two-fold. First, we introduce zero-shot instance-weighting, a simple, effective, and extensible framework that enables instance-weighted transfer learning for zero-shot CLTC. Second, we evaluate it on three cross-lingual classification tasks in seven different languages. Results show that it improves F1 score by up to 4% in single-source transfer and 8% in multi-source transfer, identifying a promising direction for utilizing knowledge from unlabeled data.

Proposed Method
We illustrate the zero-shot CLTC framework in Figure 1. The source and target language inputs are $x_s$ and $x_t$, respectively. During training, only the source label $y_s$ is available, and the task is to predict the target label $y_t$. We first apply the pre-trained model as an encoder to encode the inputs; the encoded representations are denoted by $h_s$ and $h_t$. The figure illustrates four instances per language in the mini-batch. An Instance Weighting module then assigns weights to source language instances based on the hidden representations $h_s$ and $h_t$. Note that the encoder layers are shared between the two languages. We train the task layer and fine-tune the pre-trained language model layers.
We evaluate on multiple tasks in Section 3, so the pre-trained models are used in different ways. For the sentiment and document classification tasks, we train a fully-connected layer on top of the output of the [CLS] token, which is taken as the representation of the input sequence. For the opinion target extraction task, we formulate it as a sequence labeling task (Agerri and Rigau, 2019; Jebbara and Cimiano, 2019): extracting opinion target tokens amounts to classifying each token as Beginning, Inside, or Outside of an aspect. We follow the standard IOB scheme for this task (Toh and Wang, 2014; San Vicente et al., 2015; Álvarez-López et al., 2016). Here, every token receives a label, so we use a fully-connected layer that is shared across tokens. We note that the results could likely be improved further with more powerful task layers and modules such as conditional random fields (Lafferty et al., 2001), but we keep things relatively simple since our main goal is to evaluate instance weighting for zero-shot CLTC.
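To make the task layers concrete, below is a minimal PyTorch sketch of a shared multilingual encoder with the two heads described above. The class name, default model, and hyper-parameters are our illustrative assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
from transformers import AutoModel

class ZeroShotCLTCModel(nn.Module):
    """Shared multilingual encoder with a task-specific linear head (a sketch)."""

    def __init__(self, encoder_name="bert-base-multilingual-cased",
                 num_labels=3, token_level=True):
        super().__init__()
        # The same encoder weights are applied to source and target inputs.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One fully-connected task layer; shared across tokens when tagging.
        self.classifier = nn.Linear(hidden, num_labels)
        self.token_level = token_level  # True for IOB opinion target extraction

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        if self.token_level:
            # Sequence labeling: one B/I/O label per token.
            return self.classifier(out.last_hidden_state)
        # Sentence/document classification: use the [CLS] position output.
        return self.classifier(out.last_hidden_state[:, 0])
```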

Instance Weighting
The intuition behind instance weighting is the following: if the difference between a source instance and the target language is small, the instance shares more common features with the target language and should therefore make a larger contribution. For each instance in the source language, a large weight indicates a large contribution by that instance during training. Ideally, when deciding an instance's weight, we would compare it with all instances from the target language, but doing so would be computationally prohibitive. We therefore approximate: at each training step, we calculate the weights by comparing how similar the source instances are to the target instances within a small batch.

Instance Weighting-based Gradient Descent
Vanilla mini-batch gradient descent is defined as:

$\theta \leftarrow \theta - \frac{\alpha}{m} \sum_{i=1}^{m} \nabla_\theta f(g_\theta(x_i), y_i)$ (1)

where $\alpha$ is the learning rate, $\theta$ is the parameter we want to update, $m$ is the mini-batch size, $g_\theta(x_i)$ is the model prediction for $x_i$, $\nabla_\theta$ is the partial derivative with respect to $\theta$, and $f(\cdot)$ is the loss function. We modify Equation 1 to include instance weights:

$\theta \leftarrow \theta - \alpha \sum_{i=1}^{m} w_i \nabla_\theta f(g_\theta(x_i), y_i)$ (2)

where we assign a weight $w_i$ to each instance within a mini-batch, take the weighted sum of the gradients over all instances in the mini-batch, and then update the parameter $\theta$. This extends easily to multiple source languages, in which case $x_s$ may contain training samples from more than one language.

Unsupervised Weighting Metrics
In each batch, to obtain the weight $w_i$ for each source instance $i$, we follow a similarity-based approach. We define a scoring function between the current source instance representation $h_i$ and a target instance representation $h_j$, and sum the scores over the set of target instances $D_t$ within the batch to obtain the final score for source instance $i$. For $i \in D_s$:

$w_i = \sum_{j \in D_t} \mathrm{score}(h_i, h_j)$ (3)

We normalize the $w_i$ within the batch so that they sum to 1, and plug them into Equation 2.
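Because the weighted sum of per-instance gradients in Equation 2 equals the gradient of the weighted sum of per-instance losses, the update can be implemented simply by weighting the losses before backpropagation. Below is a minimal PyTorch sketch; `weighted_loss` is our hypothetical helper name, not the authors' code.

```python
import torch.nn.functional as F

def weighted_loss(logits, labels, weights):
    # Per-instance losses f(g_theta(x_i), y_i), kept un-averaged.
    per_instance = F.cross_entropy(logits, labels, reduction="none")
    # Weighted sum of losses; calling .backward() on this value yields
    # the weighted gradient of Equation 2 (weights are assumed to sum to 1).
    return (weights * per_instance).sum()
```

Calling `weighted_loss(...).backward()` followed by `optimizer.step()` then realizes the update in Equation 2.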
Multiple ways exist to define the scoring function $\mathrm{score}(i, j)$. A cosine-similarity-based scoring function is defined as:

$\mathrm{score}(i, j) = \frac{h_i \cdot h_j}{\|h_i\| \, \|h_j\|}$ (4)

We also investigated two other scoring functions: one based on Euclidean distance and the CORAL function (Sun et al., 2016). The cosine scoring function performs best, so we report it in our main experiments and omit the other two.
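A sketch of the cosine weighting metric under the same assumptions: each source representation is scored against every target representation in the batch (Equation 3 with the scoring function of Equation 4), and the weights are normalized to sum to 1. The function name is ours.

```python
import torch
import torch.nn.functional as F

def cosine_instance_weights(h_s, h_t):
    # h_s: (n_s, d) source representations; h_t: (n_t, d) target representations.
    # Row-normalizing both matrices makes the matrix product a table of
    # pairwise cosine similarities.
    sim = F.normalize(h_s, dim=-1) @ F.normalize(h_t, dim=-1).T  # (n_s, n_t)
    w = sim.sum(dim=-1)  # w_i = sum over target instances j of score(h_i, h_j)
    # Normalize so the batch weights sum to 1; assumes predominantly positive
    # similarities, as is typical for BERT-style sentence representations.
    return w / w.sum()
```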

Evaluation
We test on three tasks: opinion target extraction, document classification, and sentiment classification. English is the source language in all experiments. We evaluate four settings: 1) direct adaptation of mBERT-base (mBERT), 2) mBERT with instance weighting (mBERT+IW), 3) direct adaptation of XLMR-base (XLMR), and 4) XLMR with instance weighting (XLMR+IW).

Opinion Target Extraction
We choose SemEval 2016 Workshop Task 5 (Pontiki et al., 2016) for opinion target extraction. It includes restaurant reviews in five languages: English, Spanish (es), Dutch (nl), Russian (ru), and Turkish (tr). Given a sentence as input, the task is to classify each token into one of the three classes of the IOB scheme. Training and test set sizes vary from 144 to 3,655. We compare against a list of models. Pontiki et al. (2014) and Kumar et al. (2016) are supervised and require extra corpora or resources to train. Agerri and Rigau (2019) exploit additional resources such as unlabeled corpora. Jebbara and Cimiano (2019) use multiple source languages (including the target) to train a classifier with cross-lingual embeddings and evaluate in a zero-shot manner. We summarize the results in Table 1.
Cross-lingual Document Classification
We conduct the cross-lingual document classification task on the MLDoc dataset (Schwenk and Li, 2018), a set of news articles with balanced class priors in eight languages. Each language has 1,000 training documents and 4,000 test documents, split into four classes. We select a strong baseline (Schwenk and Li, 2018) that applies pre-trained MultiCCA word embeddings (Ammar et al., 2016) and is trained in a supervised way. Another baseline is a zero-shot method proposed by Artetxe and Schwenk (2019), which applies a single BiLSTM encoder with a shared vocabulary among all languages and a decoder trained with parallel corpora. A further baseline applies mBERT directly for zero-shot language transfer. Table 2 shows the results of our comparison study.

Sentiment Classification
Finally, we evaluate the sentiment classification task on the Amazon multilingual reviews dataset (Prettenhofer and Stein, 2010). It contains positive and negative reviews from three domains (DVD, Music, and Books) in four languages: English (en), French (fr), German (de), and Japanese (ja). For each domain, there are 1,000 positive and 1,000 negative samples per language for both training and testing. We choose the following baselines: a translation baseline, UMM (Xu and Wan, 2017), CLDFA (Xu and Yang, 2017), and MAN-MoE (Chen et al., 2019). For the translation baseline, we translate the training and testing data for each target language into English using the Watson Language Translator (https://www.ibm.com/watson/services/language-translator/, version 2018-05-01) and train on the mBERT model. We report results in Table 3 for each domain.
Results
Among the three tasks, both base models achieve competitive results for all languages, owing to the choice of pre-trained models. Instance weighting produces consistent improvements over the base models for nearly all target languages. In particular, in Table 1, the best model XLMR+IW beats the best baseline by 4.65% on average, improving over XLMR by 4% on Russian and gaining substantially on the other target languages; in Table 2, XLMR+IW outperforms the baselines and surpasses XLMR consistently, with impressive gains on Russian, Chinese, and Spanish. In Table 3, the best model shows the same trend in most cases. Our approach is model-agnostic: when the base model or the embedding improves, instance weighting still helps, as shown by the gains obtained when switching from mBERT to XLMR. These observations confirm that the framework is simple but effective. Most importantly, it requires no additional external data and is easily adaptable to any deep model.

Discussion
Multi-source Expansion
Studies show that multilingual transfer outperforms bilingual transfer (Guo et al., 2018). We run an experiment on the opinion target extraction task to illustrate how our approach can easily be extended to multi-source transfer (see Table 4). Here, we take the SemEval dataset and, for each target language, train on the union of all other available languages. By simply expanding to multi-source training, we obtain a significant boost across the board in all target languages; in particular, there is an 8.1% improvement on Russian. This easy adaptation demonstrates the extensibility of the framework and suggests that multilingual transfer in zero-shot learning is a promising direction.
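Multi-source expansion requires no change to the weighting module itself; only the source data loader changes. A hypothetical sketch using PyTorch's ConcatDataset:

```python
from torch.utils.data import ConcatDataset, DataLoader

def multi_source_loader(source_datasets, batch_size=32):
    # source_datasets: one labeled dataset per source language,
    # e.g. all SemEval languages except the target.
    union = ConcatDataset(source_datasets)
    return DataLoader(union, batch_size=batch_size, shuffle=True)
```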
Case Study
Intuitively, we should focus on the source instances that differ least from the target language, because they share more common features with it. If those instances contribute more, the model may perform better on the target language. As an example, Table 5 shows a positively-labeled French review containing adjectives with positive emotions (e.g., "exceptionnel", "superbe") and the instance weights for two English reviews, where the weights are generated using our best model XLMR+IW. Since English instance 1 contains adjectives with positive emotions (e.g., "favorite", "great"), it receives a higher score than English instance 2, which contains adjectives with negative emotions (e.g., "weak", "unconvincing").

Conclusion
We proposed instance weighting for zero-shot CLTC and evaluated it on three fundamental tasks. The benefits of our approach include simplicity, ensuring wide applicability across NLP tasks; extensibility, allowing multiple source languages; and effectiveness, significantly outperforming a variety of baselines. In the future, we plan to evaluate on more tasks such as natural language inference (Conneau et al., 2018).