Exploring Cross-Lingual Transfer Learning with Unsupervised Machine Translation

In Natural Language Understanding (NLU), to facilitate Cross-Lingual Transfer Learning (CLTL), especially CLTL between distant languages, we integrate CLTL with Machine Translation (MT), and thereby propose a novel CLTL model named Translation Aided Language Learner (TALL). TALL is constructed as a standard transformer, where the encoder is a pre-trained multilingual language model. The training of TALL includes an MT-oriented pre-training and an NLU-oriented fine-tuning. To make use of unannotated data, we implement the recently proposed Unsupervised Machine Translation (UMT) technique in the MT-oriented pre-training of TALL. The experimental results show that the application of UMT enables TALL to consistently achieve better CLTL performance than our baseline model, which is the pre-trained multilingual language model serving as the encoder of TALL, without using more annotated data, and the performance gain is relatively prominent in the case of distant languages.


Introduction
Virtual assistants, such as Amazon Alexa, Apple Siri, and Google Assistant, are increasingly popular due to the convenience they bring to customers. A core function of virtual assistants is Natural Language Understanding (NLU), which is a combination of slot filling and intent classification. NLU models behind virtual assistants are generally trained in a supervised manner, which requires a large amount of annotated data. Collecting annotated data is feasible for high-resource languages, but difficult or even impossible for low-resource languages. As a result, when ported to a low-resource language, an NLU model may suffer from the so-called "data hungriness" (van der Ploeg et al., 2014). This problem can be alleviated by conducting Cross-Lingual Transfer Learning (CLTL) (Yarowsky et al., 2001), where annotated data in a high-resource source language is used to bootstrap an NLU model aimed at a low-resource target language.
* Work done during internship at Amazon Alexa AI.
The key to CLTL is to learn a shared representation space for the given source-target language pair. A traditional way to achieve this goal is to leverage cross-lingual word embeddings, which are obtained by mapping the words in both languages to a shared word embedding space (Zhang et al., 2017; Conneau et al., 2017; Artetxe et al., 2018a). However, most studies on this topic only consider similar languages (e.g. English-German) but ignore distant languages (e.g. English-Japanese), since it is more challenging to conduct CLTL between distant languages than between similar languages. Recently, contextualized word embeddings generated by pre-trained language models have shown significant advantages over ordinary word embeddings (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019). For the purpose of CLTL, many efforts have been made to develop multilingual variants of pre-trained language models. These efforts have in turn brought about pre-trained multilingual language models, each of which is pre-trained on a multilingual corpus so that the learned representation space is not only rich in contextual clues but also shared by all the involved languages (Mulcaire et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020). However, in this pre-training, the collection of the multilingual corpus is not obviously biased to any language; thus, in the learned representation space, similar languages are still similar to each other, and distant languages are still distant from each other. As a result, although pre-trained multilingual language models have greatly promoted CLTL, it is still more challenging to conduct CLTL between distant languages than between similar languages.
This observation has been verified by several empirical studies on a popular pre-trained multilingual language model named Multilingual BERT (M-BERT) (Devlin et al., 2019), where the CLTL performance of M-BERT between similar languages is decent, but that between distant languages is still far from satisfactory (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2020).
From our point of view, CLTL can be analogized to the process of a human being learning a foreign language, where the prior knowledge of the native language plays an important role. Language educators believe that a foreign language learner can benefit a lot from translation, since translation not only involves all aspects of foreign language learning but also helps to enhance the correlation between the native language and the foreign language (Witte et al., 2009). According to our observation and experience, this is especially the case when the native language and the foreign language are distant from each other. Inspired by these thoughts, to facilitate CLTL, especially CLTL between distant languages, we propose a novel CLTL model named Translation Aided Language Learner (TALL), where CLTL is integrated with Machine Translation (MT). Specifically, we adopt a pre-trained multilingual language model, which is now recognized as the state of the art in CLTL, as our baseline model, and construct TALL by appending a decoder to it. On this basis, we directly fine-tune the baseline model as an NLU model to conduct CLTL, but put TALL through an MT-oriented pre-training before its NLU-oriented fine-tuning. We believe that the MT-oriented pre-training can help TALL to enhance the correlation between the given source-target language pair in its representation space, and thus can make CLTL easier to conduct in its NLU-oriented fine-tuning, especially in the case of distant languages.
To make use of unannotated data, which is not only large in amount but also available for every language, we implement the recently proposed Unsupervised Machine Translation (UMT) technique (Artetxe et al., 2018b; Lample et al., 2018a; Yang et al., 2018) in the MT-oriented pre-training of TALL.
To verify the effectiveness of TALL, we carry out a series of experiments to compare the CLTL performance of TALL with that of the baseline model. In these experiments, we address not only CLTL tasks between similar languages but also those between distant languages. For each given CLTL task, we separately use two popular pre-trained multilingual language models for model construction. To implement UMT, we collect unannotated sentences from Wikipedia dumps. To conduct CLTL, we separately collect annotated sentences from two multilingual NLU datasets. The experimental results show that the application of UMT enables TALL to consistently achieve better CLTL performance than the baseline model without using more annotated data, and the performance gain is relatively prominent in the case of distant languages.

Task Definition
NLU is a combination of slot filling and intent classification. Given a sentence $x$ consisting of $m$ words $\{w_1, \dots, w_m\}$, slot filling is to predict a slot label $y^{\sigma}_i$ for each word $w_i$, and intent classification is to predict an intent label $y^{\iota}$ for $x$. In this paper, NLU models are required to be trained under a zero-shot CLTL scenario, where annotated sentences in the given source language are used for model optimization, while those in the given target language are used for model evaluation.

Baseline Model
A transformer (Vaswani et al., 2017) is a sequence-to-sequence model consisting of an encoder and a decoder. A main feature of transformers is that they use multi-head self-attention and multi-head cross-attention to model dependencies in sequential data. These attention mechanisms enable transformers to extract long-term contextual clues from text. As a result, transformers have been intensively used in transfer learning to develop pre-trained language models, which generate contextualized word embeddings. For example, some pre-trained language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), are implemented as transformer encoders, and some other ones, such as the GPT family (Radford et al., 2018, 2019), are implemented as transformer decoders. As a sub-field of transfer learning, CLTL has witnessed the wide application of transformers in developing pre-trained multilingual language models. Most of the existing pre-trained multilingual language models, such as M-BERT, XLM (Conneau and Lample, 2019), and XLM-RoBERTa (XLM-R) (Conneau et al., 2020), are implemented as transformer encoders. Actually, these pre-trained multilingual language models are the multilingual variants of BERT and RoBERTa, since each of them is identical to either BERT or RoBERTa except being pre-trained on a multilingual corpus. The representation space learned through this pre-training is not only rich in contextual clues but also shared by all the involved languages. Therefore, in theory, each of these pre-trained multilingual language models can be simply fine-tuned to address any CLTL task between its involved languages. The pre-trained multilingual language models mentioned above are now recognized as the state of the art in CLTL. To push the state of the art, we adopt one of them as our baseline model, and fine-tune it as an NLU model to conduct CLTL.
As shown in Figure 1, in this NLU-oriented fine-tuning, we feed each given sentence to the baseline model, and feed the final hidden states of the baseline model to an NLU predictor. Since the baseline model is fitted with a sub-word tokenizer, a given sentence $x$ consisting of $m$ words $\{w_1, \dots, w_m\}$ is tokenized into $n$ tokens ($n \geq m$) such that the baseline model generates $n$ final hidden states $\{h_1, \dots, h_n\}$. For slot filling, the NLU predictor first performs an average pooling on the final hidden states related to each word $w_i$, and then uses a dense layer with a softmax normalization to map the pooling result to a slot distribution for $w_i$:
$$p^{\sigma}_i = \mathrm{softmax}\left(W^{\sigma} f_a(\{h_{k_i}, \dots, h_{l_i}\}) + b^{\sigma}\right)$$
where $W^{\sigma}$ is a trainable weight, $b^{\sigma}$ is a trainable bias, $k_i$ and $l_i$ separately represent the start position and end position of the final hidden states related to $w_i$, and $f_a(\cdot)$ represents average pooling. For intent classification, the NLU predictor first performs an average pooling on all the final hidden states, and then uses another dense layer with another softmax normalization to map the pooling result to an intent distribution for $x$:
$$p^{\iota} = \mathrm{softmax}\left(W^{\iota} f_a(\{h_1, \dots, h_n\}) + b^{\iota}\right)$$
where $W^{\iota}$ is a trainable weight, and $b^{\iota}$ is a trainable bias. On this basis, for model optimization, we minimize a joint loss over slot filling and intent classification through stochastic gradient descent on annotated sentences in the given source language. For model evaluation, we run inference with the baseline model on annotated sentences in the given target language to measure three evaluation metrics, namely Slot F1, Intent Accuracy, and Semantic Accuracy (i.e. sentence-level joint accuracy).
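The NLU predictor described above can be sketched in plain Python as follows. This is only an illustration on toy numbers, not the authors' implementation: the helper names (`average_pool`, `dense`, `nlu_predict`, `joint_loss`) are hypothetical, word spans are 0-based and inclusive, and the form of the joint loss (intent cross-entropy plus summed slot cross-entropy) is an assumption, since the text only states that a joint loss is minimized.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors (f_a)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def dense(weight, bias, vector):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

def nlu_predict(hidden, word_spans, w_slot, b_slot, w_intent, b_intent):
    """hidden: n token states; word_spans: one (k_i, l_i) pair per word."""
    slot_dists = [
        softmax(dense(w_slot, b_slot, average_pool(hidden[k:l + 1])))
        for k, l in word_spans
    ]
    intent_dist = softmax(dense(w_intent, b_intent, average_pool(hidden)))
    return slot_dists, intent_dist

def joint_loss(slot_dists, intent_dist, slot_labels, intent_label):
    """Assumed form: summed slot cross-entropy plus intent cross-entropy."""
    slot_nll = -sum(math.log(d[y]) for d, y in zip(slot_dists, slot_labels))
    return slot_nll - math.log(intent_dist[intent_label])

# Toy setup: 3 sub-word tokens, 2 words (the first word spans tokens 0..1),
# 3 slot labels, 2 intent labels, hidden size 2. All numbers are made up.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_spans = [(0, 1), (2, 2)]
w_slot, b_slot = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
w_intent, b_intent = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

slot_dists, intent_dist = nlu_predict(
    hidden, word_spans, w_slot, b_slot, w_intent, b_intent)
loss = joint_loss(slot_dists, intent_dist, slot_labels=[2, 2], intent_label=0)
```

Pooling over the span of each word yields exactly one slot distribution per word, regardless of how many sub-word tokens the tokenizer produces.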

Proposed Model
Since the baseline model is pre-trained on a multilingual corpus, all its involved languages are correlated with each other in its representation space. Normally, the larger such correlation between languages, the easier it is to conduct CLTL. To equally treat all possible CLTL tasks, the multilingual corpus used in the pre-training of the baseline model is collected in a subtle way that is not obviously biased to any language. However, there are two side effects of doing so. On the one hand, instead of focusing on a specific CLTL task, the baseline model pays equal attention to all possible CLTL tasks. On the other hand, in the representation space of the baseline model, the correlation between languages is proportional to their linguistic similarity, or in other words, similar languages are still similar to each other, and distant languages are still distant from each other. This implies that the CLTL ability of the baseline model can be pertinently improved for each given CLTL task, and the room for improvement is relatively large when the CLTL task is between distant languages. To pertinently improve the CLTL ability of the baseline model for each given CLTL task, we would like to transform its representation space, which is used for all possible CLTL tasks, into a specialized one, where the correlation between the given source-target language pair is expressly enhanced. This goal can be achieved by resorting to MT, since translation is the most direct way to correlate languages with each other. As shown in Figure 2, for MT to be workable, we treat the baseline model as an encoder and append a decoder to it. Considering that the encoder is implemented as a transformer encoder, we implement the decoder as a transformer decoder to keep the model architecture consistent. Besides, as in Vaswani et al. (2017), we also share the token embeddings between the encoder and the decoder. 
The resulting new model can be seen as a standard transformer, where the encoder is a pre-trained multilingual language model. We expect this model to learn the correlation between the given source-target language pair by addressing a two-way MT task, and thus name it Translation Aided Language Learner (TALL). Before conducting CLTL with TALL, we need to pre-train it as a two-way MT model that translates between the given source-target language pair. As shown in Figure 2, in this MT-oriented pre-training, we feed each given sentence to the encoder, feed a prompt for the translated sentence to the decoder, and feed the final hidden states of the decoder to a token predictor. Given a sentence $x$ and a prompt $x'$ for the translated sentence, suppose the decoder generates a final hidden state $h'_i$ for the $i$-th token in $x'$; then $h'_i$ can be seen as a memory of both $x$ and the first $i$ tokens in $x'$. The token predictor uses a dense layer with a softmax normalization to map this memory to a token distribution for the position $i$ in the translated sentence:
$$p^{\tau}_i = \mathrm{softmax}\left(W^{\tau} h'_i + b^{\tau}\right)$$
where $W^{\tau}$ is a trainable weight tied to the token embeddings, and $b^{\tau}$ is a trainable bias. Since two-way MT requires the translated sentence to be in either the source language or the target language, which depends on the current direction, we extend the token vocabulary with two language identifiers, which separately represent the two languages, and thereby inform the decoder about the currently required language by setting the first token of $x'$ to the corresponding language identifier. In addition, since the token vocabulary is highly multilingual, most probabilities in the above token distribution belong to tokens outside the given source-target language pair and are thus meaningless. Therefore, we ignore these probabilities when running inference with TALL to generate translated sentences. By convention, the training of MT models is supervised and thus requires parallel corpora.
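One way to ignore the probabilities of out-of-pair tokens at inference time is to restrict the softmax to an allowed token set, as sketched below. The text does not say how the probabilities are ignored, so masking with renormalization is an assumption; the function names and the toy vocabulary ids are hypothetical.

```python
import math

def restricted_token_distribution(logits, allowed_ids):
    """Softmax over the allowed token ids only; all other ids get 0.0."""
    allowed = sorted(set(allowed_ids))
    peak = max(logits[i] for i in allowed)
    exps = {i: math.exp(logits[i] - peak) for i in allowed}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0
            for i in range(len(logits))]

def build_decoder_prompt(lang_id, generated_ids):
    """The language identifier is always the first token of the prompt."""
    return [lang_id] + list(generated_ids)

# Toy vocabulary of 4 tokens; ids 0 and 1 are assumed to belong to the
# source-target language pair, ids 2 and 3 to unrelated languages.
logits = [2.0, 1.0, 5.0, 0.5]
probs = restricted_token_distribution(logits, allowed_ids=[0, 1])
best_token = max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical id 100 stands for one of the two added language identifiers.
prompt = build_decoder_prompt(lang_id=100, generated_ids=[7, 8])
```

In this toy case the token with the highest raw logit (id 2) lies outside the language pair, so the restricted distribution selects id 0 instead.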
However, parallel corpora are generally expensive to collect, which makes them scarce or even unavailable for many source-target language pairs. Since TALL is designed to be a general-purpose CLTL model, a supervised training on parallel corpora is not applicable to its MT-oriented pre-training. Recently, an unsupervised training technique for MT models, which is named Unsupervised Machine Translation (UMT), has been proposed. Instead of relying on parallel corpora, UMT relies on monolingual corpora of unannotated sentences. This is attractive to us, since a large amount of unannotated sentences is always available for every language. Therefore, we implement the UMT training recipe proposed by Lample et al. (2018b) in the MT-oriented pre-training of TALL. Specifically, for model optimization, we collect a source-language corpus $S$ and a target-language corpus $T$, each of which is a set of unannotated sentences. On this basis, we measure the following two losses:
• Denoising auto-encoding loss. As in Lample et al. (2018a), we implement a noise injector $f_n(\cdot)$, which injects noise into each given sentence by randomly dropping and swapping its tokens. For each source-language sentence $s \in S$, we first run the noise injector to obtain a noise-injected sentence $f_n(s)$, which can be seen as a sentence in a different language, and then use TALL to translate $f_n(s)$ to the source language, the expected result of which is $s$. Besides, we also perform this process on each target-language sentence $t \in T$. This is the so-called "denoising auto-encoding", whose loss is defined as the cross-entropy loss on recovering the original sentences from the noise-injected sentences:
$$\mathcal{L}_{dae} = \mathbb{E}_{s \in S}\left[-\log P_{S}(s \mid f_n(s))\right] + \mathbb{E}_{t \in T}\left[-\log P_{T}(t \mid f_n(t))\right]$$
where $P_{S}(\cdot \mid \cdot)$ and $P_{T}(\cdot \mid \cdot)$ denote the probabilities that TALL assigns to an output sentence when decoding into the source language and the target language, respectively.
• Back-translation loss. Let us use $f_m(\cdot)$ to represent the inference of TALL, which translates each given sentence to its opposite language in the given source-target language pair. For each source-language sentence $s \in S$, we first run inference with TALL to obtain a TALL-translated sentence $f_m(s)$, which is in the target language, and then use TALL to translate $f_m(s)$ to the source language, the expected result of which is $s$. Besides, we also perform this process on each target-language sentence $t \in T$. This is the so-called "back-translation", whose loss is defined as the cross-entropy loss on recovering the original sentences from the TALL-translated sentences:
$$\mathcal{L}_{bt} = \mathbb{E}_{s \in S}\left[-\log P_{S}(s \mid f_m(s))\right] + \mathbb{E}_{t \in T}\left[-\log P_{T}(t \mid f_m(t))\right]$$
We sum up the above two losses to obtain a joint loss, and minimize the joint loss through stochastic gradient descent. For model evaluation, we collect another source-language corpus and another target-language corpus, each of which is also a set of unannotated sentences. On this basis, we implement the round-trip translation trick proposed by Lample et al. (2018a), where we first translate each given sentence to its opposite language in the current source-target language pair, and then translate the resulting sentence back to the original language. By running inference with TALL, we perform this process on all the sentences in the above two corpora to obtain two reconstructed corpora. We then measure the BLEU score between the two original corpora and the two reconstructed corpora to evaluate the translation performance of TALL.
The above MT-oriented pre-training encourages TALL to learn a representation space in which the given source-target language pair are expressly correlated with each other. As a result, it will be easier to conduct CLTL with the pre-trained TALL than with the baseline model. This is especially the case when the given source-target language pair are distant from each other, since translating between distant languages reveals more knowledge than translating between similar languages.
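The way the two UMT losses obtain their training pairs can be sketched as follows. This is a minimal illustration under stated assumptions: the drop and swap probabilities, the adjacent-swap scheme, the `"<src>"` identifier string, and the helper names are hypothetical choices, since the text only says that tokens are randomly dropped and swapped; `toy_translate` is a stand-in for the model inference $f_m(\cdot)$.

```python
import random

def inject_noise(tokens, drop_prob=0.1, swap_prob=0.1, rng=None):
    """Noise injector f_n: randomly drop tokens, then swap adjacent ones."""
    if rng is None:
        rng = random.Random()
    kept = [t for t in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    noisy = list(kept)
    i = 0
    while i < len(noisy) - 1:
        if rng.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2  # skip the swapped pair so a token moves at most one step
        else:
            i += 1
    return noisy

def denoising_pair(sentence, lang_id, rng=None):
    """(encoder input, output language identifier, expected output)."""
    return inject_noise(sentence, rng=rng), lang_id, list(sentence)

def back_translation_pair(sentence, lang_id, translate):
    """Same triple, with the model's own translation as encoder input."""
    return translate(sentence), lang_id, list(sentence)

def toy_translate(tokens):
    """Stand-in for f_m; a real system would run TALL inference here."""
    return [t.upper() for t in tokens]

sentence = ["a", "b", "c", "d"]
unchanged = inject_noise(sentence, drop_prob=0.0, swap_prob=0.0)
all_swapped = inject_noise(sentence, drop_prob=0.0, swap_prob=1.0)
dae_example = denoising_pair(sentence, "<src>", rng=random.Random(0))
bt_example = back_translation_pair(sentence, "<src>", toy_translate)
```

In both cases the expected output is the original sentence, so neither loss needs parallel data; only the encoder input differs.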
However, considering that the representation space of TALL is jointly carried by the encoder and the decoder, we have to fine-tune them together as an NLU model when we conduct CLTL with the pre-trained TALL. To this end, we implement the fine-tuning approach of BART in the NLU-oriented fine-tuning of TALL. Specifically, as shown in Figure 2, we feed each given sentence to the encoder, feed this sentence again as a prompt to the decoder with the corresponding language identifier prefixed to it, and feed the final hidden states of the decoder except the last one to the NLU predictor. On this basis, both the model optimization and the model evaluation remain the same as in the NLU-oriented fine-tuning of the baseline model.
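The input construction for this fine-tuning step can be sketched as below. It is a simplified illustration: the function names and token ids are hypothetical, and any special tokens that a real tokenizer would add are left out.

```python
def build_finetuning_inputs(token_ids, lang_id):
    """Encoder sees the sentence; decoder sees it with the language id first."""
    encoder_input = list(token_ids)
    decoder_input = [lang_id] + list(token_ids)
    return encoder_input, decoder_input

def select_states_for_nlu(decoder_states):
    """Drop the last decoder state before passing states to the NLU predictor."""
    return decoder_states[:-1]

token_ids = [11, 12, 13]   # made-up sub-word ids of one sentence
lang_id = 100              # made-up id of the language identifier
encoder_input, decoder_input = build_finetuning_inputs(token_ids, lang_id)

# One placeholder state per decoder position stands in for real hidden states.
decoder_states = [f"state_{i}" for i in range(len(decoder_input))]
nlu_states = select_states_for_nlu(decoder_states)
```

Because the prompt is one token longer than the sentence, dropping the last state leaves exactly one decoder state per sentence token, which is what the word-level pooling of the NLU predictor expects.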

Related Works
Cross-lingual word embeddings. A traditional way to conduct CLTL is to leverage cross-lingual word embeddings, which are usually learned in an unsupervised manner. For example, Zhang et al. (2017) formulate the learning of cross-lingual word embeddings as an adversarial game, and explore several adversarial training methods to implement it. Conneau et al. (2017) first use adversarial training to learn a linear mapping from the word embeddings of a source language to those of a target language, and then use a Procrustes solution to refine it. Artetxe et al. (2018a) first use an unsupervised initialization scheme to create an initial mapping, and then use a self-learning procedure to iteratively improve it. A language-adversarial training method has also been proposed and used to address cross-lingual sentiment classification. Besides, there are also several studies on multilingual word embeddings. For example, an unsupervised approach to learning multilingual word embeddings has been proposed, which directly exploits the relations between all the involved languages. On this basis, a multi-source CLTL model has been proposed, which not only uses adversarial training to learn language-invariant features, but also uses a mixture-of-experts method to dynamically exploit the similarity between a target language and multiple source languages.
Pre-trained multilingual language models. The currently dominant way to conduct CLTL is to fine-tune pre-trained multilingual language models, which are multilingual variants of pre-trained language models, and are each pre-trained on a multilingual corpus.
[...] et al. (2018a) construct an MT model consisting of a language-invariant pair of encoder and decoder, and train it on a non-parallel corpus not only through denoising auto-encoding and back-translation but also through adversarial training. Yang et al. (2018) construct an MT model consisting of two pairs of encoder and decoder, which partially share their parameters, and train it on a non-parallel corpus not only through denoising auto-encoding and back-translation but also through adversarial training. Lample et al. (2018b) propose a simple but effective approach based on the above works, where the constructed MT model only consists of a language-invariant pair of encoder and decoder, and its training on a non-parallel corpus only requires denoising auto-encoding and back-translation. BART has also been pre-trained on a non-parallel multilingual corpus through denoising auto-encoding, and then fine-tuned for downstream MT tasks.

Experimental Settings
CLTL tasks. For generality, we address not only CLTL tasks between distant languages but also those between similar languages. Specifically, we separately conduct CLTL between three source-target language pairs, which include two distant language pairs, namely English-Japanese and German-Japanese, and one similar language pair, namely English-German.
Pre-trained multilingual language models. For compatibility, we use different pre-trained multilingual language models for model construction. Specifically, for each given CLTL task, we separately use two popular pre-trained multilingual language models, namely M-BERT (base and cased) and XLM-R (base), to construct both the baseline model and TALL.
Training data. For practicality, we adopt large-scale corpora and real-world datasets as training data. Specifically, to implement UMT, we collect a source-language corpus of 1M unannotated sentences and a target-language corpus of 1M unannotated sentences from Wikipedia dumps for model optimization, and also collect a source-language corpus of 10K unannotated sentences and a target-language corpus of 10K unannotated sentences from Wikipedia dumps for model evaluation. To conduct CLTL, we collect annotated sentences in real-world domains from two multilingual NLU datasets. The first multilingual NLU dataset is MultiATIS++ (Xu et al., 2020), which is an extension to Multilingual ATIS (Upadhyay et al., 2018). It provides 5K annotated sentences for each language, which are all in the domain of airline travel. The second multilingual NLU dataset is a multi-domain dataset collected from a virtual assistant. It provides 100K annotated sentences for each language, which are evenly distributed in five domains, namely music, notifications, smart home, weather, and books. In both multilingual NLU datasets, each word is annotated with a slot label in the B-I-O format, and each sentence is annotated with an intent label.
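To make the B-I-O annotation format concrete, the sketch below converts word-level tags into labelled spans. The sentence, the slot label names, and the intent name are invented for illustration and are not taken from either dataset; treating a stray `I-` tag as the start of a new span is a common convention assumed here.

```python
def bio_to_spans(tags):
    """Convert B-I-O tags to (label, start, end) spans, end exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the current span keeps growing
        if label is not None:
            spans.append((label, start, i))  # close the open span
            label, start = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            label, start = tag[2:], i  # open a new span
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

# Hypothetical annotated sentence: one slot tag per word, one intent label.
words = ["flights", "from", "new", "york", "to", "denver"]
tags = ["O", "O", "B-fromloc", "I-fromloc", "O", "B-toloc"]
intent = "flight"

spans = bio_to_spans(tags)
slot_values = {label: " ".join(words[a:b]) for label, a, b in spans}
```

Here `spans` contains one span covering "new york" and one covering "denver", so a multi-word slot value is recovered from its `B-` tag and the `I-` tags that follow it.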

Implementation details. We use WikiExtractor (Attardi, 2015) to extract paragraphs from Wikipedia dumps, use Stanza (Qi et al., 2020) to split paragraphs into sentences, use HuggingFace's Transformers (Wolf et al., 2019) to tokenize sentences into tokens and load pre-trained multilingual language models, and use PyTorch (Paszke et al., 2019) to implement both the baseline model and TALL. For model optimization, we apply an AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 0.0001, a weight decay factor of 0.01, and a batch size of 64 in the MT-oriented pre-training of TALL, and apply another AdamW optimizer with an initial learning rate of 0.00005, a weight decay factor of 0.01, and a batch size of 256 in the NLU-oriented fine-tuning of both the baseline model and TALL.
After each epoch, we evaluate the validation performance, which refers to BLEU score in the MT-oriented pre-training of TALL and Semantic Accuracy in the NLU-oriented fine-tuning of both the baseline model and TALL. If the obtained performance number is improved, we save the model; otherwise we cancel the finished epoch by restoring the model to the last saved version. We decay the learning rate by a factor of 0.5 after each cancelled epoch, and terminate the model optimization after the 5th cancelled epoch. For model evaluation, we use NLTK (Loper and Bird, 2004) to measure BLEU score, and use the evaluation script for the CoNLL-2000 shared task to measure Slot F1.
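The save-or-cancel schedule described above can be sketched independently of any deep learning framework. In this illustration `train_epoch` and `evaluate` are hypothetical callbacks, the "model" is a single number, and the toy objective is made up; only the control flow (strict improvement check, rollback, learning-rate decay by 0.5, stop after the 5th cancelled epoch) follows the text.

```python
import copy

def train_with_rollback(initial_state, train_epoch, evaluate, lr,
                        decay=0.5, max_cancelled=5):
    """Keep an epoch only if validation improves; otherwise roll it back."""
    saved_state = copy.deepcopy(initial_state)
    best_score = float("-inf")
    cancelled = 0
    while cancelled < max_cancelled:
        candidate = train_epoch(copy.deepcopy(saved_state), lr)
        score = evaluate(candidate)
        if score > best_score:
            saved_state, best_score = candidate, score  # save the model
        else:
            cancelled += 1  # cancel the epoch: saved_state stays as it was
            lr *= decay
    return saved_state, best_score, lr

# Toy problem: the state moves by `lr` per epoch, and the validation score
# peaks when the state equals 10.
def toy_train_epoch(state, lr):
    return state + lr

def toy_evaluate(state):
    return -(state - 10.0) ** 2

final_state, final_score, final_lr = train_with_rollback(
    0.0, toy_train_epoch, toy_evaluate, lr=4.0)
```

In the toy run the state advances 0 → 4 → 8, the step to 12 is cancelled, the halved learning rate reaches 10, and four further cancelled epochs end the optimization.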

Experimental Results
As shown in Table 1, we carry out a series of experiments on the unannotated sentences collected from Wikipedia and the annotated sentences collected from MultiATIS++. Each of these experiments is aimed at a different combination of CLTL task and pre-trained multilingual language model, and includes the corresponding MT-oriented pre-training of TALL and the corresponding NLU-oriented fine-tuning of both the baseline model and TALL. On this basis, we first evaluate the translation performance of TALL in its MT-oriented pre-training, then evaluate the CLTL performance of both the baseline model and TALL in their NLU-oriented fine-tuning, and finally calculate the CLTL performance gain of TALL over the baseline model in percentage. Besides, as shown in Table 2, we also repeat the NLU-oriented fine-tuning of both the baseline model and TALL on the annotated sentences collected from the multi-domain multilingual NLU dataset, and thereby obtain another CLTL performance gain of TALL over the baseline model. The experimental results show that due to the application of UMT in the MT-oriented pre-training, TALL consistently achieves better CLTL performance than the baseline model in the NLU-oriented fine-tuning without using more annotated data, and the performance gain is relatively prominent in the case of distant languages.
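For reference, Intent Accuracy, Semantic Accuracy, and a percentage gain can be computed as sketched below. The predictions and scores are invented and are not results from the tables; interpreting the gain "in percentage" as a relative gain over the baseline is an assumption, and Slot F1 is omitted because the paper measures it with the CoNLL-2000 evaluation script.

```python
def intent_accuracy(gold, pred):
    """Share of sentences whose intent label is predicted correctly."""
    hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    return hits / len(gold)

def semantic_accuracy(gold, pred):
    """Share of sentences with the intent and every slot label correct."""
    hits = sum(g[0] == p[0] and g[1] == p[1] for g, p in zip(gold, pred))
    return hits / len(gold)

def relative_gain(baseline_score, proposed_score):
    """Assumed definition: gain relative to the baseline, in percent."""
    return 100.0 * (proposed_score - baseline_score) / baseline_score

# Each item is (intent label, list of slot tags); all values are made up.
gold = [("flight", ["O", "B-city"]), ("music", ["B-artist"]), ("weather", ["O", "O"])]
pred = [("flight", ["O", "B-city"]), ("music", ["O"]), ("weather", ["O", "O"])]

int_acc = intent_accuracy(gold, pred)
sem_acc = semantic_accuracy(gold, pred)
gain = relative_gain(baseline_score=50.0, proposed_score=55.0)
```

The second sentence has the right intent but a wrong slot tag, so it counts toward Intent Accuracy but not toward Semantic Accuracy.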

Ablation Study
Denoising auto-encoding. In the MT-oriented pre-training of TALL, we try to discard the denoising auto-encoding loss and only minimize the back-translation loss in the UMT training. As a result, we observe that TALL achieves a very poor translation performance and a very poor CLTL performance. This implies that TALL learns little cross-lingual knowledge through the UMT training without denoising auto-encoding.
Back-translation. In the MT-oriented pre-training of TALL, we also try to discard the back-translation loss and only minimize the denoising auto-encoding loss in the UMT training. As a result, we observe that TALL achieves an almost perfect translation performance but a very poor CLTL performance. This is because the UMT training without back-translation makes TALL a copying model instead of an MT model, and a copying model can work perfectly in the model evaluation based on round-trip translation.
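The reason a copying model scores perfectly under round-trip evaluation can be shown with a small sketch. Exact-match rate is used here as a simple stand-in for BLEU, and `copy_model`, the language codes, and the toy corpus are hypothetical.

```python
def round_trip(translate, sentence, original_lang, opposite_lang):
    """Translate to the opposite language, then back to the original one."""
    forward = translate(sentence, opposite_lang)
    return translate(forward, original_lang)

def exact_match_rate(translate, corpus, original_lang, opposite_lang):
    """Share of sentences reconstructed exactly (a stand-in for BLEU)."""
    hits = sum(
        round_trip(translate, s, original_lang, opposite_lang) == s
        for s in corpus
    )
    return hits / len(corpus)

def copy_model(tokens, target_lang):
    """Degenerate 'translator' that ignores the requested language."""
    return list(tokens)

corpus = [["play", "some", "jazz"], ["what", "is", "the", "weather"]]
copy_score = exact_match_rate(copy_model, corpus, "en", "ja")
```

The copying model reconstructs every sentence exactly although it never produces a single target-language token, so a high round-trip score alone does not confirm that real translation was learned.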
BART-style fine-tuning. In the NLU-oriented fine-tuning of TALL, instead of following the fine-tuning approach of BART, we try to discard the decoder and only fine-tune the encoder following the way we fine-tune the baseline model. As a result, we observe a very poor CLTL performance. This implies that the decoder of TALL is necessary for its NLU-oriented fine-tuning.

Further Discussion
Is a startup supervision necessary for the back-translation? In several existing UMT training recipes, the back-translation is supervised during its startup stage, where the supervision is provided by replacing the inference of TALL with a bilingual dictionary (Lample et al., 2018a; Artetxe et al., 2018b). This startup supervision is aimed at initializing a shared representation space for the given source-target language pair. However, since the encoder of TALL is a pre-trained multilingual language model, TALL already possesses a properly initialized representation space, which is shared by all the involved languages, and thus does not need a startup supervision. Actually, we tried to use a parallel corpus generated by a naive MT model to provide a startup supervision, which is equivalent to using a bilingual dictionary, but did not observe any translation performance gain.
How does the UMT training affect the CLTL performance? The UMT training uses the denoising auto-encoding and the back-translation to enhance the correlation between the given source-target language pair in the representation space of TALL. Since the encoder of TALL is a pre-trained multilingual language model, the representation space of TALL can be seen as an extension to that of the pre-trained multilingual language model. In the representation space of the pre-trained multilingual language model, similar languages are already more correlated with each other than distant languages. That is to say, in the representation space of TALL, there is more potential to enhance the correlation between distant languages than between similar languages. As a result, although the CLTL performance between similar languages is better than that between distant languages, the CLTL performance gain between distant languages is larger than that between similar languages.

Conclusion
The contribution of this paper is three-fold. First, we construct a novel CLTL model, TALL, based on a pre-trained multilingual language model. Second, we train TALL to conduct CLTL through an MT-oriented pre-training and an NLU-oriented fine-tuning. Third, we implement UMT in the MT-oriented pre-training of TALL to make use of unannotated data. Compared with the baseline model, which is the pre-trained multilingual language model used to construct TALL, TALL consistently achieves better CLTL performance without using more annotated data, and the performance gain is relatively prominent in the case of distant languages. In the future, we will collect unannotated corpora that are linguistically compatible with the downstream NLU tasks for the UMT training, which we believe can further boost the CLTL performance of TALL.