Self-training Improves Pre-training for Few-shot Learning in Task-oriented Dialog Systems

Because labeling data for the different modules of task-oriented dialog (ToD) systems is expensive, a major challenge is to train these modules with the least amount of labeled data. Recently, large-scale pre-trained language models have shown promising results for few-shot learning in ToD. In this paper, we devise a self-training approach that utilizes the abundant unlabeled dialog data to further improve state-of-the-art pre-trained models in few-shot learning scenarios for ToD systems. Specifically, our self-training approach iteratively labels the most confident unlabeled data to train a stronger Student model. Moreover, a new text augmentation technique (GradAug) is proposed to better train the Student by replacing non-crucial tokens using a masked language model. We conduct extensive experiments and present analyses on four downstream tasks in ToD, including intent classification, dialog state tracking, dialog act prediction, and response selection. Empirical results demonstrate that the proposed self-training approach consistently improves state-of-the-art pre-trained models (BERT, ToD-BERT) when only a small amount of labeled data is available.


Introduction
Large-scale pre-trained language models, such as BERT (Devlin et al., 2019), UniLM (Dong et al., 2019), GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019), and GPT-3 (Brown et al., 2020), have shown strong few-shot and zero-shot learning abilities on various NLP tasks, aided by the task-agnostic language knowledge acquired during pre-training. In task-oriented dialog (ToD) systems, the labeling cost is very high, so the amount of well-labeled data is often small. Few-shot learning in ToD is therefore important and valuable in many practical applications. Many attempts (Peng et al., 2020a,b) have been made to leverage large-scale pre-trained language models to improve few-shot learning in ToD. Specifically, a model pre-trained on general text corpora is further trained on public ToD datasets. Although the amount of labeled data is often small, a practical ToD system typically has access to a large amount of unlabeled dialog data. Utilizing this unlabeled data to improve a ToD system is thus practically important. In this paper, we take a semi-supervised self-training (ST) perspective and iteratively train a better Student model using unlabeled data (Scudder, 1965; Yarowsky, 1995). ST has been successfully applied to a variety of tasks, including image classification (Yalniz et al., 2019; Xie et al., 2020; Zoph et al., 2020), automatic speech recognition (Synnaeve et al., 2019; Kahn et al., 2020; Park et al., 2020; Likhomanenko et al., 2020), sequence generation (He et al., 2020), and natural language understanding (Du et al., 2020).
We study the following research question: can self-training provide complementary benefits on top of strong pre-trained models for few-shot learning in ToD? Recently, Xie et al. (2020) and Zoph et al. (2020) studied a similar question in the context of image classification, showing that ST effectively refines pre-trained models. Du et al. (2020) also recently showed the benefit of ST over pre-training for general natural language understanding. However, their main proposal is to crawl a large amount of similar unlabeled data from the web.
In this paper, we propose a self-training approach based on iterative pseudo-labeling (Lee, 2013). It first trains a Teacher on the labeled samples. The Teacher then iteratively generates pseudo-labels for the most confident subset of unlabeled samples to train a better Student. To train a more robust Student during self-training, we propose a data augmentation technique called GradAug. GradAug first "masks" a fraction of the tokens of a dialog input. Then, it reconstructs the corrupted text with the pre-trained masked language model (MLM) of BERT. Different from Ng et al. (2020), the probability of masking a token is conditioned on the gradient of the corresponding token embedding w.r.t. the downstream task. In this way, GradAug avoids replacing tokens that are critical for the downstream task.
The main contribution of this paper is three-fold:
• This is the first attempt to study the effect of self-training on top of existing strong pre-trained models for ToD in few-shot learning scenarios.
• We propose a self-training method that gradually trains a stronger Student by iteratively labeling the most confident unlabeled data, together with a new text augmentation technique (GradAug).
• We conduct extensive experiments on four downstream tasks in ToD, including intent classification, dialog state tracking, dialog act prediction, and response selection. Empirical results demonstrate that self-training consistently improves state-of-the-art pre-trained models (BERT, ToD-BERT (Wu et al., 2020)).

Pre-training for ToD Systems
Budzianowski and Vulic (2019) first applied GPT-2 to train a response generation model by taking the system belief state, database entries, and the last dialog turn as input. Henderson et al. (2019) pre-trained a response selection model for ToD by first pre-training on general-domain conversational corpora (Reddit). Ham et al. (2020) trained the pre-trained GPT-2 for dialog state tracking and response generation on MultiWOZ (Budzianowski et al., 2018). Hosseini-Asl et al. (2020) proposed SimpleToD, which trains the pre-trained GPT-2 on three subtasks of ToD (dialog state tracking, dialog act prediction, and response generation) as a sequence prediction problem. Recent studies have shown that large-scale pre-trained language models are good few-shot learners (Brown et al., 2020). Several studies have confirmed these findings for ToD. For the task of generating responses conditioned on a semantic representation, GPT-2 was leveraged by Peng et al. (2020b) to improve few-shot learning and by Mi et al. (2020) in a continual learning setting. Peng et al. (2020a) utilized GPT-2 for end-to-end response generation from dialog contexts in a few-shot learning scenario. ToD-BERT (Wu et al., 2020) further trained a BERT model on multiple ToD corpora to improve few-shot learning performance on four different downstream tasks.

Self-training
The first focus of self-training is designing better policies to label unlabeled samples. Zhang and Zhou (2011) evaluated confidence via a statistic-based data editing technique. Lee (2013) designed an annealing function that gradually increases the loss of labeled samples during training. Amiri (2019) utilized a Leitner queue (Dempster, 1989) to gradually move confident samples to the front. Niu et al. (2020) selected the most confident samples with prediction loss below a threshold. Kumar et al. (2010); Ma et al. (2017); Li et al. (2019); Mukherjee and Awadallah (2020) proposed to learn sampling weights for unlabeled data to control the selection process. Reinforcement learning (RL) methods (Chen et al., 2018; Wu et al., 2018) designed an additional Q-agent as the sample selector. Nevertheless, methods using learnable weights or RL provide marginal benefits relative to the added optimization cost. As designing new sample selection schemes is not our primary focus, we adopt a simple and effective pipeline, described in Section 4.1. Specialized explorations of this topic are orthogonal to the focus of this paper and are left as future work.

The second focus of self-training is improving the robustness of the Student model trained from potentially noisy pseudo-labeled samples. Data augmentation techniques are widely used. In computer vision, recent works demonstrated the benefit of various stochastic augmentation tricks, including input transformations (Laine and Aila, 2017; Xie et al., 2020; Zoph et al., 2020), dropout (Laine and Aila, 2017; Xie et al., 2020; Zoph et al., 2020), adversarial samples (Miyato et al., 2019), and Mixup (Berthelot et al., 2019, 2020). Text augmentation is more challenging because of the complex syntactic and semantic structures of language. Miyato et al. (2017) utilized adversarial training to apply perturbations to word embeddings.
Wei and Zou (2019) proposed EDA, which uses basic synonym replacement, random insertion, swap, and deletion. Kumar et al. (2019) proposed to maximize a monotone submodular function to obtain diverse paraphrases. Xie et al. (2019) proposed UDA, applying back-translation (Edunov et al., 2018) and word replacement using a Tf-Idf metric. He et al. (2020) studied the effect of dropout compared to back-translation during self-training for neural sequence generation. Chen et al. (2020) proposed MixText, which utilizes Manifold Mixup (Verma et al., 2019) to interpolate hidden layers corresponding to semantic representations of BERT. Ng et al. (2020) proposed SSMBA, which utilizes the masked language model of BERT to replace words. In experiments, we compare the proposed GradAug technique with these state-of-the-art text augmentation methods.

Background of Using Pre-trained Models for Downstream Tasks in ToD
In this section, we briefly overview the pipeline of utilizing large-scale pre-trained models for four common downstream tasks in ToD: intent classification, dialog state tracking, dialog act prediction, and response selection. We denote the input and label of a downstream task as x and y, and a prediction model as ŷ_x = F(x). F can often be decomposed into two parts. The first part is a feature extractor h = A(x) ∈ R^l, which computes a hidden representation h of x; the second part is an output network for prediction. Large-scale pre-trained language models serve as the feature extractor A. For example, we use the [CLS] embedding of BERT as the hidden representation h when BERT is adopted as A. Different output networks are designed for the different downstream tasks; the details, following ToD-BERT (Wu et al., 2020), are described below.
Intent classification. This is a multi-class classification problem to predict the single intent label y of an input utterance x. The model computes the probability over the I possible intents as

P(y | x) = Softmax(W_1 h) ∈ R^I,

where W_1 ∈ R^{I×l} is a trainable weight matrix, and the model is optimized with the standard cross-entropy loss against the ground truth.
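As a minimal NumPy sketch of this classification head (the encoder output h is simulated with random values, and the sizes are toy choices rather than the paper's):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # stabilize before exponentiation
    e = np.exp(z)
    return e / e.sum()

def intent_probs(h, W1):
    # Probability over the I intents given the [CLS] representation h
    return softmax(W1 @ h)

def cross_entropy(probs, y):
    # Standard cross-entropy loss against the ground-truth intent index y
    return float(-np.log(probs[y]))

rng = np.random.default_rng(0)
l, I = 8, 5                    # toy hidden size and number of intents
h = rng.normal(size=l)         # stand-in for the BERT [CLS] embedding
W1 = rng.normal(size=(I, l))   # trainable weight matrix W_1 in R^{I x l}
p = intent_probs(h, W1)        # sums to 1 over the I intents
```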
Dialog state tracking. This is a multi-class classification problem based on a predefined ontology. Unlike intent classification, the dialog history (a sequence of utterances) is used as the input x. For each (domain, slot) pair, the model predicts a score over all potential slot values. For the i-th slot value v^j_i of the j-th pair, a cosine similarity score against the input x is computed as

S^j_i = Cos(G_j(h), A(v^j_i)),

where G_j is the slot projection layer of the j-th pair, and the number of projection layers |G| equals the number of (domain, slot) pairs. The model is trained with the cross-entropy loss summed over all pairs.
Dialog act prediction. This is a multi-label classification problem to predict the dialog act (DA) intents of the next system response. The model takes a dialog history as input x and predicts a Bernoulli outcome for each possible DA intent as

a = Sigmoid(W_2 h) ∈ R^N,

where W_2 ∈ R^{N×l} is a trainable weight matrix and N is the number of possible DA intents. Values in a lie in [0, 1], and the model is optimized with a binary cross-entropy loss w.r.t. the ground truth. A threshold of 0.5 is applied during inference.
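The multi-label head and its 0.5 inference threshold can be sketched as follows (again with random stand-ins for h and W_2, and a toy N = 4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_dialog_acts(h, W2, threshold=0.5):
    # One Bernoulli score per DA intent; scores above the threshold become labels
    a = sigmoid(W2 @ h)
    return a, (a >= threshold).astype(int)

def bce_loss(a, y):
    # Binary cross-entropy summed over the N possible DA intents
    return float(-(y * np.log(a) + (1 - y) * np.log(1 - a)).sum())

rng = np.random.default_rng(0)
h = rng.normal(size=8)            # stand-in for the encoded dialog history
W2 = rng.normal(size=(4, 8))      # toy W_2 with N = 4 DA intents
a, labels = predict_dialog_acts(h, W2)
```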
Response selection. This task predicts the most relevant system response from a candidate pool.
A dual-encoder model (Henderson et al., 2019) is adopted to compute the similarity between the input dialog history x and the i-th candidate response c_i as

S_i = Cos(h, A(c_i)).

During training, we randomly sample 20 negative responses for each ground-truth response. A cross-entropy loss is applied to rank the ground truth highest.
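A toy NumPy sketch of this dual-encoder objective, with random vectors standing in for the encoded history and responses (in this toy setup the ground-truth candidate is taken to encode identically to the history, which is not the case in practice):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranking_loss(h, candidates, gt_index):
    # Cross-entropy over cosine scores, pushing the ground truth to rank first
    scores = np.array([cos(h, c) for c in candidates])
    log_probs = scores - scores.max()
    log_probs = log_probs - np.log(np.exp(log_probs).sum())
    return float(-log_probs[gt_index])

rng = np.random.default_rng(1)
h = rng.normal(size=16)                           # encoded dialog history
negs = [rng.normal(size=16) for _ in range(20)]   # 20 sampled negative responses
cands = [h] + negs                                # ground truth at index 0
loss = ranking_loss(h, cands, gt_index=0)
```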

Self-training
In this section, we introduce our self-training (ST) algorithm. The overall ST algorithm is introduced in Section 4.1, and a new text augmentation method (GradAug) for ST to train a more robust Student is elaborated in Section 4.2.

Overall ST Algorithm
During training, two data pools are maintained and denoted as U (unlabeled data) and L (labeled data).
Two versions of the model are maintained: the Teacher (F^T) and the Student (F^S). Before the ST iterations start, the Teacher is first trained on the initial small set of labeled data L to "warm up".
Pseudo-Labeling. In the beginning of an ST iteration, the Teacher first makes predictions on U .
For every data input x ∈ U, the Teacher predicts the label of x as ŷ_x = F^T(x). We take the predicted score of ŷ_x as the confidence score s_x of this prediction. When there is only a single label in the prediction ŷ_x (c.f. intent classification, response selection), s_x is the prediction score corresponding to the predicted label. When there are multiple labels in the prediction ŷ_x (c.f. dialog state tracking, dialog act prediction), s_x is the mean of the prediction scores corresponding to the predicted labels. In each iteration, the Selector chooses the top-k instances from U with the highest confidence scores and assigns the corresponding predictions ŷ_x to them as labels. These newly labeled instances are then moved from U to L.
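The confidence scoring and top-k selection described above can be sketched as follows (`teacher_predict` is a hypothetical stand-in interface returning a pseudo-label and its score or scores):

```python
def confidence(scores):
    # Single-label tasks: the predicted label's score itself.
    # Multi-label tasks: the mean over the predicted labels' scores.
    if isinstance(scores, (int, float)):
        return float(scores)
    return sum(scores) / len(scores)

def select_top_k(U, teacher_predict, k):
    # Pseudo-label the k most confident samples of U and split them out
    scored = []
    for x in U:
        y_hat, s = teacher_predict(x)
        scored.append((confidence(s), x, y_hat))
    scored.sort(key=lambda t: t[0], reverse=True)
    chosen = [(x, y) for _, x, y in scored[:k]]     # moved from U to L
    remaining = [x for _, x, _ in scored[k:]]       # stays in U
    return chosen, remaining
```

For example, with a toy predictor whose confidence grows with the input value, the two largest inputs are selected and pseudo-labeled first.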
Iterative Student training. The updated L is used to train a stronger Student model. We apply dropout (Srivastava et al., 2014) and a new text augmentation technique (GradAug), introduced in Section 4.2, which augments L to L_Aug. At the end of each iteration, the Teacher model is replaced by the current Student to be used in the next iteration. We re-initialize the Student in every iteration to avoid over-fitting to the initial and earlier data in L over multiple training iterations. As noted by Xie et al. (2020) and Du et al. (2020), the Student should have an equal or larger capacity than the Teacher to gradually learn from L as it grows. In this paper, we set the Student to the same size as the Teacher, and we demonstrate in experiments that consistent improvements can be achieved without increasing model capacity. Details of our ST algorithm are described in Algorithm 1, and the pipeline of one ST iteration (i.e., the "while" loop in Algorithm 1) is visualized in Figure 1.
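One possible sketch of the overall loop, with `train`, `predict`, and `augment` as hypothetical stand-ins for Student training (from a fresh initialization), Teacher inference, and GradAug respectively; this is a schematic of the procedure above, not the paper's implementation:

```python
def self_train(L, U, train, predict, augment, n_iters, k):
    # train(data) returns a freshly initialized model fitted on data;
    # predict(model, x) returns (pseudo_label, confidence).
    teacher = train(L)                       # warm up on the initial labeled pool
    for _ in range(n_iters):
        if not U:
            break
        # Pseudo-label the k most confident unlabeled samples and move them to L
        scored = sorted(((predict(teacher, x), x) for x in U),
                        key=lambda t: t[0][1], reverse=True)
        L = L + [(x, y) for (y, _), x in scored[:k]]
        U = [x for _, x in scored[k:]]
        # Re-initialize and train the Student on the augmented pool, then promote it
        teacher = train(augment(L))
    return teacher, L, U
```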

Text Augmentation (GradAug)
Next, we propose a novel text augmentation technique called "GradAug" to augment the data in L and train a more robust Student. Our method employs the masked language model (MLM; Devlin et al., 2019), a common pre-training objective for BERT-like architectures. In MLM, some tokens are replaced by the special token [MASK], and the model is asked to reconstruct the original tokens from the context.
To utilize a pre-trained MLM (e.g., that of BERT) for text augmentation, the first step is to decide which tokens to mask. Random sampling is used by the original BERT framework and by a recent text augmentation method (SSMBA, Ng et al. (2020)). However, if some crucial tokens are masked, the semantics may change after reconstruction. For example, if the important token "status" in Figure 2 is masked, the top predictions from the MLM of BERT include "purpose", "cost", and "route", which could change the original semantics.
[Figure 2: An illustrative example of GradAug. First, the smooth saliency M̃ is computed for each token, and tokens important for the intent label "flight_status" are highlighted in blue. Less important tokens are more likely to be masked. Then, the masked token ("american") is reconstructed by the MLM of BERT, and the replacement token "scheduled" does not change the semantics of the original sentence.]

Gradient-based token masking. Instead of randomly masking tokens, we compute a masking probability p = [p_1, ..., p_n] for an input x of n tokens. For an input x with token embedding matrix X = [X_1, ..., X_n] ∈ R^{n×d} and label y, the importance of the tokens in x to the label y is computed by a saliency map (Simonyan et al., 2014):

M(X_i) = Σ_{j=1}^{d} |∂F^T_y(X) / ∂X_{i,j}|,
where F^T_y(X) is the Teacher model's prediction score for the label y. M(X_i) measures the importance of the i-th token by accumulating the gradients of all elements of its embedding X_i ∈ R^d, obtained by differentiating F^T_y(X) w.r.t. X_i. The intuition is that tokens with large gradients are important to the label y. However, previous studies (Sundararajan et al., 2017; Smilkov et al., 2017) pointed out that raw gradients can be very noisy and may fluctuate sharply. We therefore compute a smooth saliency measure (Smilkov et al., 2017) M̃(X_i) for the i-th token as

M̃(X_i) = (1/m) Σ_{j=1}^{m} M(X_i + z_j),

where m Gaussian noise vectors z_j ~ N(0, Σ) ∈ R^d, with mean 0 and diagonal covariance matrix Σ, are added to X_i to obtain m regular saliency measures, which are averaged into the smooth saliency M̃(X_i). The probability p_i of masking the i-th token is inversely correlated with M̃(X_i) as

p_i = exp(−β M̃(X_i)) / Σ_{k=1}^{n} exp(−β M̃(X_k)),

where β controls the flatness of the distribution p, which is normalized by its sum. As the probability p_i of masking a token x_i is inversely correlated with its importance M̃(X_i) to the downstream task, more important tokens are less likely to be masked. We sample 15%^2 of the tokens of x based on p and replace them with [MASK] to corrupt x into x'. As F^T is updated in each ST iteration, p is recomputed in each ST iteration.
2 This is the default ratio used by BERT and SSMBA.
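The masking distribution and the 15% sampling step could be sketched as follows (the softmax-over-negative-saliency form of the inverse correlation is an assumption here, not taken verbatim from the paper):

```python
import numpy as np

def masking_probs(saliency, beta=1.0):
    # One plausible form of the inverse correlation: a softmax over -beta * saliency.
    # Larger beta concentrates masking on the least important tokens.
    z = -beta * np.asarray(saliency, dtype=float)
    z -= z.max()                  # stabilize before exponentiation
    p = np.exp(z)
    return p / p.sum()            # normalized by its sum

def sample_mask(p, ratio=0.15, rng=None):
    # Sample ~15% of token positions (at least one) to replace with [MASK]
    rng = rng or np.random.default_rng(0)
    k = max(1, round(ratio * len(p)))
    return rng.choice(len(p), size=k, replace=False, p=p)

sal = [0.1, 2.0, 0.05, 0.3, 1.5, 0.2]   # toy smooth saliencies; 2.0, 1.5 = important
p = masking_probs(sal, beta=2.0)        # least important token gets highest p
idx = sample_mask(p)
```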

Algorithm 2 GradAug
Input: Labeled data L, Teacher F^T, number of augmentations per input q
Output: Augmented labeled data L_Aug
1: Initialize L_Aug ← L
2: for {x, y} ∈ L do
3:    Compute the masking probability p using F^T
4:    for j ∈ 1 ... q do
5:       x' ← Mask tokens of x based on p
6:       x̃ ← Predict the masked tokens of x' with the MLM
7:       L_Aug ← L_Aug ∪ {x̃, y}

We reconstruct each [MASK] by sampling one token from the 10 most likely tokens according to their predicted probabilities. Afterwards, we obtain a paraphrase x̃ of the original x as an augmentation. As our gradient-based masking scheme avoids replacing tokens crucial to the meaning of x, the label of x̃ is preserved as that of x.
An illustrative example of GradAug is given in Figure 2, and the detailed procedure of applying GradAug to L is described in Algorithm 2.
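The augmentation loop of Algorithm 2 can be sketched as follows; `token_probs` and `mlm_fill` are hypothetical stand-ins for the gradient-based masking distribution and for sampling each [MASK] from the MLM's top-10 predictions:

```python
import random

def grad_aug(labeled, token_probs, mlm_fill, q=3, ratio=0.15):
    # token_probs(x): per-token masking distribution p over the tokens of x;
    # mlm_fill(tokens): reconstructs each [MASK] (stand-in for the BERT MLM).
    augmented = list(labeled)
    for x, y in labeled:
        p = token_probs(x)
        k = max(1, round(ratio * len(x)))          # ~15% of tokens, at least one
        for _ in range(q):
            idx = set(random.choices(range(len(x)), weights=p, k=k))
            masked = ["[MASK]" if i in idx else t for i, t in enumerate(x)]
            augmented.append((mlm_fill(masked), y))   # the label of x is preserved
    return augmented
```

Each input keeps its original label across all q paraphrases, mirroring the label-preservation argument above.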

Dataset Description
We evaluate on four datasets across four downstream tasks, following Wu et al. (2020).
OOS (Larson et al., 2019) is a benchmark dataset for intent classification in ToD. It consists of 150 in-domain intents and 1 out-of-scope intent. The full dataset contains 15,100/3,100/5,500 samples for train/validation/test, and all data are balanced across 151 different intents.
MWOZ (Eric et al., 2020) is used for three downstream tasks: dialog state tracking, dialog act prediction, and response selection. It contains 8,420/1,000/1,000 dialogues for train/validation/test. For dialog act prediction, we remove the domain information from the original labels as in Wu et al. (2020), resulting in 13 DA intents.

Experiment Settings
We randomly sample 1% or 10% of the training data to serve as the initial labeled data L, while the remainder is used as unlabeled data U. We report the mean and standard deviation over three different random seeds for each experiment to reduce data sampling variance. We also report the upper bound of pre-trained models without ST using all labeled training data, referred to as "Full". We test two pre-trained models: (i) the uncased base BERT with 110M parameters; (ii) ToD-BERT 3, which is further pre-trained on nine public ToD datasets on top of BERT. When ST is applied, the corresponding MLM is used by GradAug to reconstruct masked tokens. Basic model parameters for the first three downstream tasks are set the same as in Wu et al. (2020). For response selection, we reduced the training batch size from 25 to 20 to fit our computation constraints.
BERT and ToD-BERT without ST are trained on the initial labeled data L until validation performance does not improve for 20 epochs 4. For ST, when the Student is trained on L_Aug in one ST iteration (c.f. Algorithm 1, line 12), we apply early stopping once validation performance does not improve for 10 epochs. Moreover, the best Student across multiple ST iterations is selected based on validation performance (c.f. Algorithm 1, line 2). This means that the best Student model does not necessarily use up all unlabeled data. Other hyper-parameters of ST, selected based on validation performance, are reported in Appendix A.3.

3 We used their joint version (ToD-BERT-jnt), pre-trained with the MLM and "response contrastive loss" objectives.
4 Our different (often better) results compared to the ToD-BERT paper mainly come from this stricter early stopping criterion.

Main Results of Four Downstream Tasks
Intent classification. Results of intent classification on OOS are presented in Table 1, reporting accuracy on all 151 intents, on the 150 in-domain intents, and on the out-of-scope intent, as well as the recall of the out-of-scope intent. ST significantly improves the pre-trained BERT and ToD-BERT. When only 1% labeled data are used, ST achieves 33.6% and 36.8% higher accuracy on all 151 intents for BERT and ToD-BERT, respectively. For 10% labeled data, these two margins are 7.0% and 9.8%. Furthermore, ST largely improves the recall of the out-of-scope intent, indicating that it is more robust to out-of-scope intents with noisy distributions.
Dialog state tracking. Results of dialog state tracking on MWOZ are presented in Table 2. Two common evaluation metrics (Budzianowski et al., 2018) are used: slot accuracy and joint goal accuracy. Slot accuracy is computed for each individual (domain, slot, value) state to check whether the value is correctly predicted. Joint goal accuracy checks whether the predicted states exactly match the ground-truth states.

Dialog act prediction. Experiments are conducted on three datasets, and results are reported in Table 3. We report micro-F1 and macro-F1 scores for this multi-label classification task. Again, the benefit of ST can be observed in the improvements for both BERT and ToD-BERT. When 10% labeled data are used, BERT and ToD-BERT perform similarly to their upper bound (Full), and the improvement margin of ST is limited. When 1% labeled data are used, more notable margins of ST can be seen on the two simpler datasets (DSTC2, GSIM) and on the macro-F1 score of MWOZ.
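The two dialog state tracking metrics (slot accuracy and joint goal accuracy) could be computed as in this minimal sketch, using toy dialog states rather than MWOZ data:

```python
def dst_accuracies(pred_states, gold_states):
    # Slot accuracy: fraction of gold (domain, slot) values predicted correctly.
    # Joint goal accuracy: fraction of turns whose full state matches exactly.
    slot_hits = total_slots = joint_hits = 0
    for pred, gold in zip(pred_states, gold_states):
        for key, value in gold.items():
            total_slots += 1
            slot_hits += int(pred.get(key) == value)
        joint_hits += int(pred == gold)
    return slot_hits / total_slots, joint_hits / len(gold_states)

gold = [{"hotel-area": "north", "hotel-stars": "4"},
        {"taxi-destination": "museum"}]
pred = [{"hotel-area": "north", "hotel-stars": "3"},   # one slot value wrong
        {"taxi-destination": "museum"}]
slot_acc, joint_acc = dst_accuracies(pred, gold)       # 2 of 3 slots; 1 of 2 turns
```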
Response selection. Results of response selection on three datasets are reported in Table 4. We randomly sample 100 responses as negatives and report Recall@1 and Recall@3 (Henderson et al., 2019), indicating whether the true response is ranked among the top-1 or top-3 predicted responses. When 1% labeled data are used, ST achieves 6%, 12.3%, and 16.4% higher Recall@1 over ToD-BERT on the three datasets, respectively. For 10% labeled data, the three margins are 13.0%, 7.5%, and 14.4%, respectively. Larger improvements can be observed for ST on top of BERT.
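The Recall@k metric used above reduces to a simple rank check, sketched here with toy similarity scores:

```python
def recall_at_k(scores, gt_index, k):
    # 1 if the ground-truth response is among the k highest-scoring candidates
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(gt_index in ranked[:k])

scores = [0.2, 0.9, 0.5, 0.7]   # similarities of 4 candidates; ground truth = index 2
r1 = recall_at_k(scores, gt_index=2, k=1)   # index 1 ranks first, so this is 0
r3 = recall_at_k(scores, gt_index=2, k=3)   # index 2 is in the top 3, so this is 1
```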
Altogether, our experiments on four different downstream tasks reveal the following:
• Self-training provides complementary benefits on top of pre-training. ST consistently improves both BERT and ToD-BERT on all four downstream tasks with only 1% or 10% labeled data.
• Self-training is on par with customized pre-training for ToD. BERT performs worse than ToD-BERT, yet BERT-ST achieves comparable or even better performance than ToD-BERT, which is heavily pre-trained on ToD corpora.
• Self-training bridges the gap between few-shot learning and full supervision. BERT and ToD-BERT with 10% labeled data perform much worse than models using all labeled data ("Full") for intent classification and response selection. ST largely improves performances in these two cases with results comparable to "Full".
• The benefit of self-training is most evident on the two simpler single-label prediction tasks (intent classification, response selection), with 6-37% gains using 1% labeled data and 7-15% gains using 10% labeled data. The margin is smaller on the two more challenging multi-label prediction tasks.

In-depth Analyses of Self-training
In this section, we provide in-depth analyses of the proposed self-training approach. As case studies, we limit our discussion to intent classification (IC) on OOS and response selection (RS) on GSIM, using ToD-BERT-ST with 10% labeled data. Reported results are accuracy on all intents and Recall@3, respectively.
Ablation study. In Table 5, we compare three simplified versions of ToD-BERT-ST to understand the effects of different components. We observe that: (i) masking tokens using the smooth saliency M̃ in GradAug is beneficial, because replacing it with the vanilla saliency M ("w/o Smooth Saliency") degrades performance by 3.4% and 0.5% on IC and RS; (ii) training a more robust Student on data augmented by GradAug is advantageous, because dropping this augmentation step ("w/o Augmentation") impairs performance by 4.9% and 10.1%; (iii) the pseudo-labeling operation that iteratively labels unlabeled data is important for ST, as indicated by the 8.4% and 15.2% performance drops of "w/o Pseudo-Labeling", which only applies GradAug to the initial labeled data without utilizing unlabeled data.
Comparison to other Selectors in ST. In Table 6, we compare our scheme of selecting the samples with the top-k most confident predictions from U in each iteration against (i) Random-k: randomly select k samples; (ii) Least-k: select the samples with the k least confident predictions; (iii) Select-all (Xie et al., 2020; Du et al., 2020): label all samples of U in one iteration and relabel them in the next iteration.
We can see that "Random-k" and "Least-k" perform worse than our scheme, yet both outperform "Select-all" by large margins. This means that the initial Teacher trained on limited labeled data is not good enough to assign reliable labels to a large amount of unlabeled data.

Table 7: Comparison of text augmentation methods (IC accuracy / RS Recall@3):

SSMBA (Ng et al., 2020)       84.6%   64.2%
MixText (Chen et al., 2020)   83.6%   62.7%
UDA (Xie et al., 2019)        82.5%   62.2%
EDA (Wei and Zou, 2019)       77.2%   57.6%
w/o Augmentation              80.4%   54.8%

Comparison to other text augmentation methods. In Table 7, we compare GradAug with four representative text augmentation methods for augmenting L. We follow the default settings of these techniques and apply them in our ST pipeline to generate three paraphrases per input, as in GradAug. We can see that GradAug consistently outperforms the current state of the art (SSMBA, MixText, UDA), and it outperforms EDA by large margins. As EDA can easily change the input semantics, it even performs worse than using no data augmentation for intent classification. This result reinforces the importance of preserving semantics during augmentation for ToD.

Conclusion
We study using self-training to improve strong pre-trained models for few-shot learning tasks in ToD. An iterative self-training method with a new text augmentation technique (GradAug) is proposed to gradually train a stronger Student model using unlabeled data. Extensive empirical results on four downstream tasks in ToD demonstrate the consistent improvements of self-training on top of pre-trained models. Our findings on using self-training to improve learning from limited labeled data may inspire future studies towards building more sample-efficient and scalable ToD systems.