Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, the augmented data sets are often noisy and thus impede the performance of SLU models. In this paper, we focus on mitigating noise in augmented data. We develop a denoising training approach in which multiple models are trained with data produced by different augmentation methods and provide supervision signals to each other. The experimental results show that our method outperforms the existing state of the art by 3.05 and 4.24 percentage points on two benchmark datasets, respectively. The code will be open-sourced on GitHub.


Introduction
Spoken language understanding (SLU) is a key component in task-oriented dialogue systems. SLU consists of two subtasks: intent detection and slot tagging (Wang et al., 2005; Tur and De Mori, 2011). Although promising progress has been achieved on SLU in English (Liu and Lane, 2016; Huang et al., 2020), those methods need large amounts of training data and thus cannot be applied to low-resource languages, where little or no training data is available.
In this paper, we target the extreme setting for cross-lingual SLU in which no labeled data in the target languages is assumed. This setting is critical for industry practice, since annotating a large, high-quality SLU dataset for every language is simply infeasible.
Existing cross-lingual transfer learning methods mainly build on pre-trained cross-lingual word embeddings (Ruder et al., 2019) or contextual models (Wu and Dredze, 2019; Huang et al., 2019; Lample and Conneau, 2019; Conneau et al., 2020), which represent texts with similar meanings in different languages close to each other in a shared vector space. Those approaches often show good performance on intent detection. The results on slot tagging, however, are often unsatisfactory, especially for distant languages, which differ dramatically from English in script, morphology, or syntax (Upadhyay et al., 2018; Schuster et al., 2019).
Several studies (Schuster et al., 2019; Liu et al., 2020b) show that adding translated data to the fine-tuning process of pre-trained models can substantially improve cross-lingual SLU when no golden-labeled training data in the target languages is available. For example, machine translation can be employed to translate the English training data into the target languages, and alignment methods, such as attention weights (Schuster et al., 2019), fastalign (Dyer et al., 2013), or giza++ (Och and Ney, 2003), can then be applied to label the translated data.
Another approach to alleviating data scarcity is to generate training data automatically. Recently, some methods for monolingual SLU (Anaby-Tavor et al., 2020; Kumar et al., 2020; Zhao et al., 2019) automatically label domain-specific data or use pre-trained language models to generate additional data.
However, the synthesized training data derived from both the translation approach and the data generation approach may be quite noisy and may contain label errors. For the translation approach, both the translation process and the alignment process may introduce errors (Li et al., 2020b). For the data generation approach, it is often hard to strike the right trade-off between generating correct but less diverse data and generating more diverse but noisier data. Moreover, generating synthetic training data across languages poses further challenges to the robustness of the generation methods.
* Work done during an internship at NLP Group, Microsoft STCA. † Corresponding author.
To filter out noise in the synthesized training data, several methods have been proposed, such as the mask mechanism, the soft alignment method, the unsupervised adaptation method (Li et al., 2020b), the rule-based filtering method, the classifier-based filtering method (Anaby-Tavor et al., 2020), and the language-model-score-based filtering method (Shakeri et al., 2020). These methods rely on either ad hoc rules or extra models. Although they have shown promising results, each of them considers only a single source of data augmentation. It remains challenging to differentiate noisy instances from useful ones, since all those instances are sampled from the same distribution generated by the same method.
In this paper, we regard both the translation approach and the generation approach as data augmentation methods. We tackle the problem of reducing the impact of noise in augmented data sets. We develop a principled method to learn from multiple noisy augmented data sets for cross-lingual SLU, where no golden-labeled target language data exists. Our major technical contribution consists of a series of denoising methods including instance relabeling, co-training and instance re-weighting.

First, motivated by the self-learning method (Zoph et al., 2020), we design a model-ensemble-based instance relabeling approach to correct the noisy labels of augmented training data in low-resource languages. As teacher models cannot always generate correct labels, the original self-learning method tends to suffer from accumulated errors caused by model predictions. To alleviate this problem, our instance relabeling approach uses crowd intelligence from multiple models to derive more reliable labels for the pseudo training instances. In addition, we filter out noisy instances with co-training and instance re-weighting strategies to reduce the impact of incorrect predictions on subsequent training. Our training strategy does not follow the traditional teacher-student paradigm. Instead, we use the model predictions from the previous epoch as pseudo labels in the current epoch to compute the loss, which saves training time.
Second, to filter out noisy instances, we adopt a co-training mechanism, which uses instances selected by the other models to train the current model. Different from the co-teaching method (Han et al., 2018), where two models are trained with the same data, we propose that multiple models be trained with different noisy augmented data sets whose noise may be largely independent. The reason is that deep neural networks have a high capacity to fit noisy labels: when two models are trained with the same data, they tend to become similar, and the co-teaching method gradually degenerates into a naive selection method with a two-model ensemble. In our co-training method, by employing very different data generation methods, we expect the noise from different sources to be largely independent, so that the models can learn different knowledge from them. Therefore, the instances that pass the screening of the other models can serve as supervision signals for the current model, which alleviates the problem of accumulated errors caused by selection bias.
Last, we further propose an instance re-weighting technique to adjust the weights of training instances adaptively. As we do not have real training data in the target languages, we use the consistency among the soft labels predicted by different models to estimate the reliability of the instances. Intuitively, if the predictions of different models are highly inconsistent on an instance, the instance is likely to be noisy: the larger the deviation, the higher the uncertainty and the smaller the weight. This idea further increases the robustness of the selected training instances.
We conduct extensive experiments on two public datasets. The experimental results clearly indicate that, by deliberately combining multiple noisy data sources derived from very different augmentation methods, our approach is more effective than using any single source. Our methods improve the state of the art (SOTA) by 3.05 and 4.24 percentage points on the two benchmark datasets, respectively.

Related Work
The cross-lingual spoken language understanding methods can be divided into two main categories: the model transfer methods and the data transfer methods.
The model transfer methods build on pre-trained cross-lingual models to learn language-agnostic representations, such as MUSE (Lample et al., 2017), CoVE (McCann et al., 2017), mBERT (Wu and Dredze, 2019), XLM (Lample and Conneau, 2019), Unicoder (Huang et al., 2019), and XLM-R (Conneau et al., 2020). The English training data is used to fine-tune the pre-trained models, and the fine-tuned models are then directly applied to the target languages (Liu et al., 2020b; Schuster et al., 2019; Qin et al., 2020; Upadhyay et al., 2018). To better align embeddings between source and target languages, Liu et al. (2019) use domain-related word pairs and employ a latent variable model to cope with the variance of similar sentences across different languages. Liu et al. (2020b) and Qin et al. (2020) use parallel word pairs to construct code-switching data for fine-tuning. Their methods encourage the model to align similar words in different languages into the same space and to attend to keywords. Liu et al. (2020c) propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
The data transfer methods construct pseudo-labeled data in the target languages. These methods usually employ machine translators to translate training instances from a source language into the target languages and then apply alignment methods, such as attention weights (Schuster et al., 2019), fastalign (Dyer et al., 2013), or giza++ (Och and Ney, 2003), to project slot labels onto the target-language side. The derived training instances are combined with the training data in the source language to fine-tune the pre-trained cross-lingual models. Previous studies (Upadhyay et al., 2018; Schuster et al., 2019) show that adding translated training data can significantly improve model performance, especially on languages that are distant from the source language. Data generation approaches can also construct additional training data. Some methods (Wang and Yang, 2015; Marivate and Sefara, 2020) make slight changes to the original training instances through word replacement or paraphrasing. More sophisticated methods generate training data with large-scale neural networks, such as generative adversarial networks (Goodfellow et al., 2020), variational autoencoders (Doersch, 2016; dos Santos Tanaka and Aranha, 2019; Russo et al., 2020), and pre-trained language models (Anaby-Tavor et al., 2020; Kumar et al., 2020).

Method
In this section, we define the problem and then propose our method.

Problem Definition and Solution Framework
The SLU task aims to parse user queries into a predefined semantic representation format. Formally, given an utterance $x = (x_1, \ldots, x_L)$ with a sequence of $L$ tokens, an SLU model aims to produce an intent label $y^I$ for the whole utterance and a sequence of slot labels $y^S = (y^S_1, \ldots, y^S_L)$, where $y^S_i$ is the slot label for the $i$-th token $x_i$. Here, we target the extreme cross-lingual setting where only some training data $D_{src}$ in English (or, in general, a rich-resource source language) and some development data $D_{dev}$ in English exist. In addition, some annotated data $D_{test}$ in the target languages is used as the test set. Cross-lingual SLU aims to learn a model from $D_{src}$ that performs well on $D_{test}$, using $D_{dev}$ for parameter tuning.
We add a special token [CLS] in front of each input sequence. Then we feed $x$ into an encoder $M_{enc}$ to obtain the contextual hidden representations $H = (h_0, h_1, \ldots, h_L) = M_{enc}(x; \Theta)$, where $\Theta$ denotes the parameters of the encoder.
We take $h_0$ as the sentence representation for intent classification and take $h_i$ ($1 \le i \le L$) as the token representations for slot filling. We apply a linear transformation and the softmax operation to obtain the intent probability distribution $p^I(x; \Theta)$ and the slot probability distribution $p^S_i(x; \Theta)$, that is, $p^I(x; \Theta) = \mathrm{softmax}(W^I h_0 + b^I)$ and $p^S_i(x; \Theta) = \mathrm{softmax}(W^S h_i + b^S)$, where $p^I \in \mathbb{R}^{1 \times |C^I|}$, $p^S_i \in \mathbb{R}^{1 \times |C^S|}$, $C^I$ is the set of intent labels, $C^S$ is the set of slot labels under the BIO annotation schema (Ramshaw and Marcus, 1999), $W^I \in \mathbb{R}^{|C^I| \times d}$ and $W^S \in \mathbb{R}^{|C^S| \times d}$ are the output matrices, and $b^I$ and $b^S$ are the biases.
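The two output heads can be illustrated with a minimal pure-Python sketch. It assumes the encoder output is already available as a list of vectors H = [h_0, ..., h_L]; the function names and the toy parameter values are hypothetical and only serve to show the linear-plus-softmax computation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, h, b):
    """Compute W h + b for a |C| x d matrix W, a d-vector h and a |C|-vector b."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]

def slu_heads(H, W_I, b_I, W_S, b_S):
    """H[0] is the [CLS] vector used for the intent; H[1:] are the token vectors."""
    p_intent = softmax(linear(W_I, H[0], b_I))
    p_slots = [softmax(linear(W_S, h, b_S)) for h in H[1:]]
    return p_intent, p_slots

# Toy (hypothetical) values: d = 2, two intents, three slot labels, L = 2 tokens.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_I, b_I = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W_S, b_S = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
p_intent, p_slots = slu_heads(H, W_I, b_I, W_S, b_S)
```

In this sketch the intent distribution has one entry per intent label and there is one slot distribution per token, matching the shapes given above.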
The overall architecture of our proposed method is shown in Figure 1. It consists of two major modules, the data augmentation module and the denoising module.

The Data Augmentation Module
In this module, we augment training data in target languages via translation and generation. The left part of Figure 1 shows the details.

Translation
We use Google Translator to translate the training corpus $D_{src}$ from the source language (English) into the target languages. In addition to translation, we need a word alignment method to project the slot labels onto the target-language side. We try giza++ (Och and Ney, 2003) and fastalign (Dyer et al., 2013) to obtain word alignment information and find that the pseudo slot labels projected by giza++ generally lead to better performance (about a 2% increase in F1 on the SNIPS dataset). Thus, in the rest of the paper, we use Google Translator and giza++ to produce the translated data. We denote the translated training corpus by $D_{trans}$.
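To make the label projection step concrete, the following is a minimal sketch, not the exact procedure used in the paper. It assumes the alignment is given as whitespace-separated `i-j` pairs (source index, target index), the format written by fastalign; the function name `project_slots`, the slot type `time` and the example alignment are hypothetical.

```python
def project_slots(src_labels, alignment, tgt_len):
    """Project BIO slot labels from source tokens onto target tokens.

    alignment: string of "i-j" pairs, source token i aligned to target token j.
    Unaligned target tokens are labeled "O".
    """
    pairs = [tuple(int(k) for k in p.split("-")) for p in alignment.split()]
    # Slot type (without the B-/I- prefix) for each target token.
    tgt_types = ["O"] * tgt_len
    for i, j in pairs:
        if src_labels[i] != "O":
            tgt_types[j] = src_labels[i].split("-", 1)[1]
    # Re-derive BIO prefixes from contiguous runs on the target side.
    # Simplification: adjacent slots of the same type are merged into one span.
    tgt_labels, prev = [], "O"
    for t in tgt_types:
        if t == "O":
            tgt_labels.append("O")
        elif t == prev:
            tgt_labels.append("I-" + t)
        else:
            tgt_labels.append("B-" + t)
        prev = t
    return tgt_labels

# "set alarm for 7 am" -> "pon una alarma para las 7 am" (toy alignment)
src_labels = ["O", "O", "O", "B-time", "I-time"]
tgt_labels = project_slots(src_labels, "0-0 1-2 2-3 3-5 4-6", tgt_len=7)
```

Alignment errors propagate directly into the projected labels here, which is one source of the label noise that the denoising module is meant to address.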

Generation
To further increase the diversity of the synthesized training data, we leverage multilingual BART (mBART) (Liu et al., 2020a) as the generator to synthesize an additional target-language training corpus. Specifically, we first fine-tune the pre-trained mBART model on the translated training data $D_{trans}$ with the denoising objective (Liu et al., 2020a), that is, the cross-entropy loss between the decoder's output and the original input. The input $X$ to mBART consists of the dialog act and the utterance in $D_{trans}$, where the dialog act comprises the intent and the slot-value pairs and $v_i$ denotes the value of the $i$-th slot. Here, $v_i$ in the target language is obtained by word alignment between the utterances in the source and target languages. Following Liu et al. (2020a), we apply text infilling as the injected noise in the fine-tuning stage.
After fine-tuning, we apply the same noise function to the input data $X$ and leverage the fine-tuned mBART to generate $m$ candidates for each instance. To increase the diversity of the generated data, we adopt the top-p sampling strategy (Fan et al., 2018). Each generated instance consists of an utterance, the corresponding intent, and the slot-value pairs. Then, we perform a preliminary data selection by filtering out the generated utterances that do not contain the required slot values. Last, we randomly sample $a$ instances from the candidate set for each input to construct the generated corpus $D_{gen}$.
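The preliminary selection and sampling steps can be sketched as follows. This is an illustration under stated assumptions: each candidate is represented as a hypothetical (utterance, intent, slot_values) triple, and "containing the required slot values" is approximated by a plain substring check.

```python
import random

def select_generated(candidates, a, rng):
    """Keep candidates whose utterance contains every slot value, then sample up to a of them.

    candidates: list of (utterance, intent, slot_values) triples for one input,
    where slot_values maps a slot name to its value string.
    """
    kept = [c for c in candidates if all(v in c[0] for v in c[2].values())]
    return rng.sample(kept, min(a, len(kept)))

# Toy candidates for one input (hypothetical data).
candidates = [
    ("pon una alarma para las 7 am", "create_alarm", {"time": "7 am"}),
    ("pon una alarma pronto", "create_alarm", {"time": "7 am"}),
    ("quiero una alarma a las 7 am", "create_alarm", {"time": "7 am"}),
]
selected = select_generated(candidates, a=1, rng=random.Random(0))
```

The second candidate is dropped because the slot value is missing from the utterance; one of the remaining two is then sampled at random.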

The Denoising Module
To tackle the noisy label problem introduced by the data augmentation module, we design a denoising module shown in Algorithm 1.
At the initialization stage, we first train $K$ models using the augmented data derived from $K$ different augmentation methods. All models are optimized with the cross-entropy loss computed using the original intent and slot labels. For the $k$-th model ($k \le K$), the loss combines the cross-entropy between $y^I$ and $p^I(x; \Theta_k)$ with the cross-entropy between each $y^S_j$ and $p^S_j(x; \Theta_k)$, where $x$ is a training utterance, $y^I$ is the intent label, $y^S_j$ is the slot label of the $j$-th word in the utterance, and $p^I(x; \Theta_k)$ and $p^S_j(x; \Theta_k)$ are the predicted probability distributions of the intent and the slot, respectively.
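A minimal sketch of such a joint loss for a single instance is given below. The unweighted sum of the two terms and the averaging of the slot term over tokens are assumptions made for illustration; the names `cross_entropy` and `joint_loss` are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def joint_loss(y_intent, y_slots, p_intent, p_slots):
    """Intent cross-entropy plus slot cross-entropy averaged over tokens (assumed form)."""
    intent_term = cross_entropy(y_intent, p_intent)
    slot_term = sum(cross_entropy(y, p) for y, p in zip(y_slots, p_slots)) / len(y_slots)
    return intent_term + slot_term

# One-hot labels for an utterance with two tokens (toy values).
y_intent = [1.0, 0.0]
y_slots = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
p_intent = [0.9, 0.1]
p_slots = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
loss = joint_loss(y_intent, y_slots, p_intent, p_slots)
```

Because the targets are passed as distributions, the same function also accepts soft labels, which is convenient once hard labels are replaced by predicted distributions.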
To keep the discussion simple, in this paper we mainly consider the training corpora derived from machine translation and generation. Thus, we maintain $K = 2$ SLU networks with the same structure in the training process. $M_1$ and $M_2$ are trained on different training corpora, $D_1 = \{D_{src}, D_{trans}\}$ and $D_2 = \{D_{src}, D_{trans}, D_{gen}\}$, respectively. In our experiments, $D_{gen}$ alone is too noisy, so we combine it with $D_{trans}$. In general, our training framework can handle more than two models; we present experimental results with more than two models in Section 4.4.
After the initialization stage, each model has learned some knowledge from its augmented training data. Since the augmented training data (such as $D_{trans}$ and $D_{gen}$) contains noise, we then enter the relabeling stage, which combines a series of strategies (instance relabeling, co-training, and instance re-weighting) to reduce the impact of the noise. In the relabeling stage, model training and instance relabeling are conducted iteratively. Motivated by the idea of model ensembling, we use the ensemble of model predictions to correct label errors in a self-learning manner. Specifically, all models are trained on all training corpora $D = \{D_{src}, D_{trans}, D_{gen}\}$. The slot labels and the intent labels of the training instances in $D_{trans}$ and $D_{gen}$ are replaced by the corresponding ensemble-predicted probability distributions, which serve as the pseudo labels for computing the loss in the next epoch. That is, the loss in Equation 3 has the same cross-entropy form as in the initialization stage, with the original labels replaced by these pseudo labels.
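As an illustration of the relabeling step, the sketch below forms pseudo labels for one instance from the predictions of K models. It assumes the ensemble is a simple unweighted average of the predicted distributions, which is one common choice and not necessarily the one used here; the function names are hypothetical.

```python
def ensemble_labels(predictions):
    """Element-wise average of K probability distributions (one per model)."""
    k = len(predictions)
    return [sum(col) / k for col in zip(*predictions)]

def relabel_instance(intent_preds, slot_preds):
    """Build soft pseudo labels for one instance.

    intent_preds: K intent distributions.
    slot_preds: K lists, each holding one slot distribution per token.
    """
    pseudo_intent = ensemble_labels(intent_preds)
    # zip(*slot_preds) groups the K distributions predicted for the same token.
    pseudo_slots = [ensemble_labels(token_preds) for token_preds in zip(*slot_preds)]
    return pseudo_intent, pseudo_slots

# Predictions of K = 2 models on an utterance with two tokens (toy values).
intent_preds = [[0.9, 0.1], [0.6, 0.4]]
slot_preds = [
    [[0.8, 0.2], [0.3, 0.7]],  # model 1
    [[0.6, 0.4], [0.1, 0.9]],  # model 2
]
pseudo_intent, pseudo_slots = relabel_instance(intent_preds, slot_preds)
```

The resulting soft labels would be stored after each epoch and used as targets in the following epoch, so no separate teacher pass is needed.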

Co-Training
Heuristically, instances with small losses are more likely to have clean labels. When the noise from different augmentation methods is more or less independent, each model can learn from the instances to which the other models assign small cross-entropy losses. Specifically, when $K = 2$, in each batch of the training data, each network discards a fraction $\delta$ of the instances with the largest losses computed by Equation 3 and then passes the remaining instances to the other network for training.
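The small-loss exchange for K = 2 can be sketched as follows, assuming the per-instance losses of both networks on the current batch are already computed. The function names are hypothetical, and rounding the number of discarded instances down is an assumption.

```python
def small_loss_indices(losses, delta):
    """Indices kept after discarding the delta fraction of instances with the largest losses."""
    n_keep = len(losses) - int(delta * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])

def co_training_step(losses_m1, losses_m2, delta):
    """Each network is updated on the instances selected by its peer."""
    train_m1_on = small_loss_indices(losses_m2, delta)  # chosen by M2, used by M1
    train_m2_on = small_loss_indices(losses_m1, delta)  # chosen by M1, used by M2
    return train_m1_on, train_m2_on

# Per-instance losses of the two networks on a batch of five instances (toy values).
losses_m1 = [0.2, 1.5, 0.1, 0.4, 0.3]
losses_m2 = [0.3, 0.2, 0.1, 2.0, 0.5]
train_m1_on, train_m2_on = co_training_step(losses_m1, losses_m2, delta=0.2)
```

With δ = 0.2 and five instances, each network drops its single largest-loss instance, so the two networks end up training on different subsets of the batch.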

Instance Re-weighting
Another way to reduce the impact of noisy instances is to assign different weights to different instances: the noisier an instance, the smaller its weight. We design a re-weighting mechanism to implement this idea. The intuition is that if the predictions of multiple models are quite inconsistent, the instance is likely to be noisy. Technically, we design an uncertainty-based weight to re-weight the training instances: the larger the deviation, the higher the uncertainty and the smaller the weight. Specifically, we define an uncertainty $u$ for each instance from the deviation among the model predictions, compute the weight as $w = e^{-u}$, and incorporate this weight into Equation 3 to obtain the new training objective for the relabeling stage.
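A small sketch of an uncertainty-based weight is shown below. Only the mapping $w = e^{-u}$ comes from the text; measuring the deviation as the mean squared distance of each model's distribution from the ensemble mean, and adding the intent term to the token-averaged slot term, are assumptions chosen for illustration, as are the function names.

```python
import math

def deviation(predictions):
    """Mean squared deviation of K distributions from their element-wise mean."""
    k = len(predictions)
    mean = [sum(col) / k for col in zip(*predictions)]
    return sum((p - m) ** 2 for dist in predictions for p, m in zip(dist, mean)) / k

def instance_weight(intent_preds, slot_preds):
    """w = exp(-u), where u is the intent deviation plus the average per-token slot deviation."""
    token_devs = [deviation(token_preds) for token_preds in zip(*slot_preds)]
    u = deviation(intent_preds) + sum(token_devs) / len(token_devs)
    return math.exp(-u)

# Two models that agree exactly versus two models that strongly disagree (toy values).
w_agree = instance_weight([[0.9, 0.1], [0.9, 0.1]], [[[0.8, 0.2]], [[0.8, 0.2]]])
w_disagree = instance_weight([[0.9, 0.1], [0.1, 0.9]], [[[0.8, 0.2]], [[0.2, 0.8]]])
```

Under these assumptions, an instance on which the models agree keeps a weight of 1, while disagreement shrinks the weight toward 0, so its loss contributes less to the update.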

Experiments
In this section, we report our experiments on two benchmark datasets.
Table 1: Methods marked reimp are re-implemented in this paper with a different translator and alignment tool. EN, Trans., and Gen. denote the source-language data $D_{src}$, the translated target-language data $D_{trans}$, and the generated target-language data $D_{gen}$, respectively. Denoise denotes the proposed denoising module.

Settings
We evaluate the effectiveness of our proposed approach over five languages on two benchmark datasets: SNIPS (Schuster et al., 2019) and MTOP. The details of the datasets are provided in the Appendix.
For generation, we generate m = 10 candidates for each input and randomly sample a = 1 instance from each candidate set to construct $D_{gen}$. Our SLU model is based on the pre-trained XLM-R large model, which has 24 layers, with two additional task-specific linear layers for intent classification and slot filling. More implementation details, including hyper-parameters, are described in the Appendix.
Following previous work (Schuster et al., 2019), we use the F1 score to measure slot filling quality and accuracy to evaluate intent classification quality on the SNIPS dataset, and we use Exact Match Accuracy on the MTOP dataset. We employ the following SOTA baselines in two groups. The first group consists of the model transfer methods: Multi.CoVe (Schuster et al., 2019), Transferable Latent Variable (Liu et al., 2019), Attention-Informed Mixed (Liu et al., 2020b), CoSDA-ML (Qin et al., 2020), LR&ALVM (Liu et al., 2020c), and EN. The second group consists of the data transfer methods: EN+Trans. and EN+Trans.+mask. Table 1 reports the results of our approach and the SOTA baselines. As the translator used in prior work is not publicly available, we use Google Translator instead, so our results on some languages differ slightly from those originally reported.
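For reference, the sentence-level metrics can be sketched as below. The sketch assumes a flat representation in which each instance is an (intent, slot label sequence) pair and counts an instance as an exact match only when both the intent and every slot label are correct; the function names and the toy labels are hypothetical.

```python
def intent_accuracy(gold, predicted):
    """Fraction of instances whose predicted intent equals the gold intent."""
    return sum(1 for g, p in zip(gold, predicted) if g[0] == p[0]) / len(gold)

def exact_match_accuracy(gold, predicted):
    """Fraction of instances whose intent and full slot sequence are both correct."""
    correct = sum(
        1 for g, p in zip(gold, predicted)
        if g[0] == p[0] and list(g[1]) == list(p[1])
    )
    return correct / len(gold)

gold = [
    ("create_alarm", ["O", "B-time"]),
    ("play_music", ["O", "B-artist"]),
    ("get_weather", ["O", "O"]),
    ("create_alarm", ["B-time", "O"]),
]
predicted = [
    ("create_alarm", ["O", "B-time"]),   # fully correct
    ("play_music", ["O", "O"]),          # slot error
    ("get_weather", ["O", "O"]),         # fully correct
    ("play_music", ["B-time", "O"]),     # intent error
]
em = exact_match_accuracy(gold, predicted)
acc = intent_accuracy(gold, predicted)
```

Exact match is the stricter of the two: a single wrong slot label or a wrong intent makes the whole instance count as incorrect.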

Results
Our method outperforms the SOTA baselines and achieves new SOTA performance. Our method improves the Exact Match Accuracy on MTOP from 65.59 to 69.83, the F1 score on SNIPS from 81.44 to 84.49, and the accuracy on SNIPS from 98.64 to 98.71. These results clearly demonstrate the effectiveness of our proposed method.
One interesting finding is that the performance on Spanish and French becomes slightly worse after adding the translated data. A likely reason is that the noise introduced by the machine translation and alignment processes hurts the performance. Our denoising training approach is designed to handle such noise in the synthesized data.

Comparison with Other Denoising Methods
To verify the effectiveness of our approach, we compare it with two kinds of denoising approaches used in previous work. First, we consider the classifier-based selection approach. Following Anaby-Tavor et al. (2020), we train an extra classifier using the English corpus and the translated corpus and filter out noisy data according to the probability scores predicted by the classifier. Second, we consider the LM-based selection approach. Following Shakeri et al. (2020), we use the language model score as the indicator to select high-quality data. For a fair comparison, for each noise filtering baseline, we also fine-tune two pre-trained XLM-R large models using different random seeds and take the ensembled model as the final model. Additionally, to isolate the effect of instance relabeling, we also apply it to the baselines in the same way as in our approach. Table 2 shows the comparison results on the MTOP dataset. Our approach outperforms those two methods (w/ relabeling) by 0.91 and 1.32 percentage points, respectively. This suggests that the gain of our method does not come from simply ensembling two models. Instead, our method removes noise from the synthesized data more effectively than the previous noise filtering methods.

Ablation Study
To validate the contribution of each component in our method, we conduct the ablation study on the MTOP dataset. We consider several ablation options. (1) w/o generation removes the generated training data. (2) w/o instance relabeling keeps the intent and slot labels of data unchanged throughout the training process. (3) w/o co-training trains models using all training data without filtering. (4) w/o instance re-weighting skips the instance reweighting strategy.
As shown in Table 3, without the denoising strategies, the approach using the generated data and the source-language data performs 3.11 percentage points worse than the approach using the translated data and the source-language data. This is due to the considerable noise introduced by the generation process. When combined with our denoising approach, the approach with the generated training corpus outperforms the approach without it by 0.47 percentage points. We believe that multiple augmented data sets increase data diversity and lead to better supervision signals. Table 3 also shows that removing any of the other components generally leads to a clear performance drop, which confirms that all of the proposed techniques contribute in the cross-lingual setting.

Effect of Instance Relabeling
To better understand the effect of the instance relabeling strategy, Figure 2 shows the Exact Match Accuracy of our method with and without the relabeling strategy on the MTOP test set after each training epoch. The performance of our method with the relabeling strategy keeps improving and is consistently better than the baseline during the relabeling stage. This suggests that the relabeling method corrects many label errors in the noisy training data and that the corrected labels contribute to the performance improvement.

Effect of Number of Models
We explore the effect of the number $K$ of models (XLM-R). Specifically, we conduct experiments using one or three models. In the setting of one model ($K = 1$), we train only one network on all training corpora $D = \{D_{src}, D_{trans}, D_{gen}\}$ and adopt the instance relabeling and instance filtering strategies. In the setting of $K = 3$, three models are trained on $\{D_{src}, D_{trans}\}$, $\{D_{src}, D_{gen}\}$, and $\{D_{src}, D_{trans}, D_{gen}\}$, respectively.
The results shown in Table 5 indicate that our method can effectively extend beyond two models. When the number of networks increases, the performance improves. The intuition is that more models can produce more reliable predictions, and thus can lead to better instance relabeling as well as instance filtering.

Effect of Filtering Rate in Co-Training
To study the impact of the co-training strategy, we conduct experiments with different filtering rates on the MTOP dataset. Table 6 shows the results for different filtering rates and different training corpora. For both approaches, with and without the generated corpus, the performance first improves as the filtering rate increases. This demonstrates that the filtering strategy can indeed filter out noisy instances effectively. However, increasing the filtering rate further degrades the performance, mainly because too much useful information in the training data is discarded. Another finding is that the best filtering rate for the training corpus $\{D_{src}, D_{trans}, D_{gen}\}$ is larger than that for $\{D_{src}, D_{trans}\}$. A possible explanation is that the generated corpus $D_{gen}$ is more diverse than the translated corpus $D_{trans}$ but also contains more noise.

Case Study
We conduct a case analysis of the instance relabeling results on the MTOP dataset to examine the capability of our approach. We statistically analyze the differences between the original labels and the modified labels after the relabeling stage.
We find that our instance relabeling method effectively corrects wrong labels of the synthesized data, including intent label and slot label as shown in Table 4. Specifically, there are four types of label modifications: 1) Intent Change: the intent label of an utterance is modified; 2) Slot Change: the slot type of a text span is modified; 3) Boundary Change: the BIO boundaries of a slot are modified; and 4) Slot and Boundary Change: both the slot type and the BIO boundaries are modified. For the MTOP dataset, intent labels of 4.99% of the translated and generated data are modified and the slot labels of 33.10% of those data sets are modified.
From the case study, we can see that the synthesized data indeed contains much noise and our relabeling strategy is able to greatly reduce the negative impact of the noise by correcting different types of label errors.

Conclusions
In this paper, we propose a denoising training approach in which multiple models trained on data from different augmentation methods provide supervision signals to each other. Extensive experimental results show that the proposed method outperforms previous approaches and alleviates the noisy label problem. Our proposed method is independent of the backbone network (e.g., the XLM-R model) and the task. As future work, we plan to investigate the performance of our method on other cross-lingual tasks.

A.1 Datasets
We evaluate the effectiveness of our proposed approach over five target languages on two benchmark datasets: SNIPS (Schuster et al., 2019) and MTOP. Statistics of the data used are detailed in Table 7.

A.2 Implementation Details
For generation, we fine-tune mBART pre-trained on 25 languages with a dropout of 0.3, label smoothing of 0.2, 2500 warm-up steps, a maximum learning rate of $3 \times 10^{-5}$, and 1024 tokens in each batch. For text infilling, we mask 35% of the words in each instance by randomly sampling span lengths according to a Poisson distribution ($\lambda = 3.5$). Then we append to each instance an end-of-sentence token (</S>) and the corresponding language id symbol (<LID>). We do not search for the best generation parameters but use the default values in the open-source code. The final models are selected based on validation likelihood.
For SLU, we use the XLM-R large model with about 550M parameters as the backbone network. In the fine-tuning process, we set the batch size to 128, the number of fine-tuning epochs to $E_{all} = 10$, the number of initialization epochs to $E = 4$, and the dropout to 0.1 for both benchmark datasets. The maximum filtering rates are $\delta = 0.2$ and $\delta = 0.3$ for the SNIPS and MTOP datasets, respectively. The learning rates are $2 \times 10^{-5}$ and $5 \times 10^{-5}$ for the SNIPS and MTOP datasets, respectively. We select the best hyper-parameters by searching over combinations of batch size, learning rate, number of fine-tuning epochs, number of initialization epochs, and filtering ratio in the following ranges: batch size {32, 64, 128}, learning rate {1, 2, 3, 4, 5} $\times 10^{-5}$, fine-tuning epochs {5, 10, 15}, initialization epochs {2, 3, 4}, and filtering ratio {10%, 20%, 30%, 40%}. The models are selected by their performance on the English development corpus and the translated target-language development corpus. The models are trained using mini-batch back-propagation with the AdamW optimizer (Loshchilov and Hutter, 2019). We fine-tune the models on two V100-32GB GPUs, which takes about 4 hours.
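The text-infilling noise described for generation can be illustrated with a simplified, standard-library-only sketch. It is not the open-source implementation: spans of length zero are clamped to one, each span is replaced by a single `<mask>` token, and spans that do not fit are shortened. The function names and the mask token string are assumptions.

```python
import math
import random

def poisson(lam, rng):
    """Sample from a Poisson distribution with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def text_infilling(tokens, rng, mask_ratio=0.35, lam=3.5, mask_token="<mask>"):
    """Replace random spans by a single mask token until mask_ratio of the words are masked."""
    budget = int(round(mask_ratio * len(tokens)))
    out = list(tokens)
    masked = 0
    while masked < budget:
        span = min(max(poisson(lam, rng), 1), budget - masked)
        starts = []
        while not starts:
            # Shrink the span until it fits into a stretch without mask tokens.
            starts = [i for i in range(len(out) - span + 1)
                      if mask_token not in out[i:i + span]]
            if not starts:
                span -= 1
        start = rng.choice(starts)
        out[start:start + span] = [mask_token]
        masked += span
    return out

tokens = [f"w{i}" for i in range(20)]
noised = text_infilling(tokens, random.Random(0))
```

With 20 input words and a 35% ratio, seven words are removed in total, while the number of mask tokens depends on how the sampled spans happen to fall.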

B More Experimental Results and Discussions

B.1 Effect of Generated Data Size
To further analyze the effect of the generated data, we randomly sample $a \in \{0, 1, 2, 3\}$ instances from the candidate set for each input to construct the generated corpus. As shown in Table 8, the performance improves as the size of the generated corpus increases. However, when the data size reaches a certain scale (a = 3 in our experiments), the performance regresses slightly but still exceeds that of the baseline without generated data. This suggests that the augmentation module indeed increases the diversity of the training data and thereby improves the performance. Although the increasing noise limits further improvement, our approach is robust enough to achieve comparable performance.

B.2 Variance Analysis
We conduct 5 training runs and calculate the mean and standard deviation (Stdev) for our approach and the baseline on the MTOP dataset. The results are listed in Table 9. We also conduct a two-sided t-test with a significance threshold of 0.05 to compare the baseline with our method. The results show that the variance of our approach is similar to that of the baseline. Moreover, with a p-value of $4.5 \times 10^{-8}$, the improvement of our method over the baseline is statistically significant.