Self-training with Few-shot Rationalization

While pre-trained language models have obtained state-of-the-art performance for several natural language understanding tasks, they are quite opaque in terms of their decision-making process. While some recent works focus on rationalizing neural predictions by highlighting salient concepts in the text as justifications or rationales, they rely on thousands of labeled training examples with both task labels and annotated rationales for every instance. Such extensive large-scale annotations are infeasible to obtain for many tasks. To this end, we develop a multi-task teacher-student framework based on self-training pre-trained language models with limited task-specific labels and rationales, and judicious sample selection to learn from informative pseudo-labeled examples. We study several characteristics of what constitutes a good rationale and demonstrate that the neural model performance can be significantly improved by making it aware of its rationalized predictions, particularly in low-resource settings. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach.


Introduction
Recent success in several natural language understanding tasks can be attributed to training large-scale and complex neural network models. While these models work very well for specific tasks, they offer limited insight into their inner workings and are often used as black-box predictors. To address these shortcomings, recent works (DeYoung et al., 2020; Paranjape et al., 2020; Yu et al., 2019) have focused on designing interpretable NLP systems that can explain the model's predictions. A typical approach to studying this decision-making process has been to annotate rationales, short and sufficient parts of the input text leading to the specific prediction, which can also be used as auxiliary supervision for training. Appropriate use of such rationales can improve downstream task performance, as the model learns to focus on the task-relevant parts of the input (Pruthi et al., 2020b). Code is available at https://aka.ms/RationaleST.
However, a significant resource challenge is to obtain large-scale annotated rationales to train these models as explored in fully supervised setting in recent work (DeYoung et al., 2020). This requires models to have access to both instance-level task labels as well as token-level binary rationale labels depicting whether a token should be included in the rationale or not. Such extensive annotations are infeasible to obtain for many tasks, hence devising models that can effectively exploit a limited number of annotated rationales is utterly important. Therefore, our objective is two-fold: (1) improve downstream task performance and (2) improve rationale extraction -with few labeled examples for the downstream task and corresponding rationales.
Recent works (Mukherjee and Awadallah, 2020; Wang et al., 2020; Xie et al., 2019) on few-shot learning have explored self-training as a mechanism to train neural network models with limited labeled data. These methods usually train a teacher model and then a student model that imitates the teacher. They typically assume access to a set of unlabeled instances and use stochastic regularization techniques such as dropout, along with data augmentation obtained from pseudo-labeled examples. In this work, we leverage self-training as a mechanism to train neural network models with self-generated rationales and task labels over unlabeled data. Since pseudo-labeled rationales from the teacher model can be noisy, we show that judicious sample selection that upweights informative examples and downweights noisy ones is beneficial. Furthermore, we predict task and rationale labels in a multi-task learning (MTL) setup, sharing parameters between the task objective and the rationale prediction objective. We show that the MTL setup for joint learning is more effective than decoupled learning, which first extracts rationales and then uses them for classification, as explored in some prior works.
Given the paucity of rationale labels, a critical part of the MTL setup is to understand what constitutes a good rationale. We build on insights from prior work (Yu et al., 2019; Lei et al., 2016), focusing on low-resource settings with access to limited labels via multi-task self-training. To this end, we explore several characteristics of a good rationale: (i) sufficiency, such that the extracted rationale is adequate for the model to make its decision; (ii) completeness, such that the model is less confident in its predictions if it ignores the rationale text; and (iii) coherency, such that the model extracts phrases as rationales rather than isolated, disconnected words. In practice, we enforce (i) by matching the predictions of the student model given the rationale as input with those of the teacher model given the full input; (ii) by maximizing the entropy of the student predictive distribution when it sees the complement of the rationale as input; and (iii) by resorting to additional regularization methods. We show that our multi-task joint optimization captures all of the above salient aspects of rationale extraction while improving downstream task performance. In summary, our contributions are: (a) We develop a multi-task self-training framework to train neural models with limited labels while extracting rationales as justifications for their predictions. Furthermore, we show the impact of judicious sample selection for sample- and token-level re-weighting to learn from informative pseudo-labeled examples during self-training. (b) We build on prior work on rationale extraction to encode desired rationale characteristics via judiciously designed loss functions in our multi-task self-training algorithm.
(c) Extensive experiments on five datasets from the ERASER benchmark (DeYoung et al., 2020) demonstrate the effectiveness of our approach. Ablation experiments demonstrate the impact of different components of our framework.

Related Work
Rationale Extraction Prior works (Lei et al., 2016; Yu et al., 2019) on rationale extraction explore encoder-generator models with two components: one for extracting rationales and one for using them to make a prediction. Alternately, Jain et al. (2020) propose decoupled architectures for the extractor (using attention weights) and predictor. Following these works, DeYoung et al. (2020) develop the ERASER benchmark, which contains human-annotated rationales in extractive format, and provide BERT-to-BERT baselines for the tasks. Paranjape et al. (2020) propose a weakly supervised model with a user-controlled sparsity threshold for rationale extraction and predictions based on the extracted rationale. Similarly, Pruthi et al. (2020a) propose a semi-supervised BERT-CRF architecture with few gold annotations and abundant task labels. In contrast to all these prior works, which require thousands of annotations for either rationales or task labels, our framework is geared toward low-resource settings with access to very few labels for both. We incorporate insights from prior work on rationale extraction via judiciously designed loss functions in our multi-task self-training framework.
Self-training Self-training (Yarowsky, 1995; Nigam and Ghani, 2000; Lee, 2013) trains a base (teacher) model on limited labeled data and applies it to unlabeled data to generate pseudo-labels. The generated pseudo-labels are used to train the student model in an iterative fashion. Self-training has demonstrated state-of-the-art performance in several tasks, including text classification (Mukherjee and Awadallah, 2020; Wang et al., 2020) and image classification (Xie et al., 2019; Zoph et al., 2020). We leverage self-training with re-weighting of noisy pseudo-labels for both the task and rationale extraction in a multi-task learning framework, while encoding the desired characteristics of a good rationale.

Problem Statement
Let x_1, ..., x_n be a set of n documents with corresponding task labels y_1, ..., y_n, where each x_i = {x_ij} is a sequence of tokens. We consider each document to be associated with a ground-truth rationale sequence r_i = {r_ij}, where r_ij = 1 if the j-th token in document x_i is part of the rationale, and 0 otherwise. We consider a low-resource setup with very few documents labeled with both the task labels (instance-level) and the rationale labels (token-level) for each task, and additional unlabeled data.
Let us denote D_l = {(x_i, y_i, r_i)} as the joint task- and rationale-labeled training set. We also assume access to a set of unlabeled documents D_u = {u_1, ..., u_m} for which neither rationales nor task labels are available, with |D_u| >> |D_l|. Our goal is to learn a model from the few task and token-level rationale labels and the additional unlabeled documents to improve its performance on the downstream tasks.

Self-training
We leverage self-training as the backbone of our framework. The algorithm is composed of two phases that are executed iteratively until convergence. In the first phase, we perform multi-task learning of a teacher model on the few-shot labeled set D_l by jointly learning to predict instance-level task labels and token-level rationale labels. Optimizing these losses yields the parameters of a teacher model p^T.

In the second phase, we leverage the teacher model p^T to infer pseudo-labels for the unlabeled set D_u and train a student model p^S to mimic the teacher's predictions for both the task and the associated rationale. Finally, the teacher model is updated with the student model's parameters, and the above steps are repeated until convergence.
Due to the noisy nature of pseudo-labels, the above self-training process may result in gradual drift (Zhang et al., 2016). To address this, we train the student model to explicitly account for the teacher's confidence in the generated pseudo-labels with a special weighting scheme. Furthermore, we explore several characteristics of a good rationale and enrich the above framework with additional auxiliary losses.
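The two-phase loop above can be sketched in a few lines of Python. Everything here is a toy stand-in: `train_model`, `pseudo_label`, the scalar "parameters", and the thresholds are hypothetical placeholders, not the paper's BERT-based implementation.

```python
def train_model(params, data):
    # Stub "training": nudge the scalar parameter toward the mean label.
    return params + 0.5 * (sum(y for _, y in data) / len(data) - params)

def pseudo_label(params, unlabeled):
    # Stub teacher inference: a toy threshold stands in for p^T(y|u).
    return [(u, 1.0 if u * params > 0.25 else 0.0) for u in unlabeled]

def self_train(labeled, unlabeled, iters=5):
    teacher = train_model(0.0, labeled)               # phase 1: teacher on D_l
    for _ in range(iters):
        pseudo = pseudo_label(teacher, unlabeled)     # pseudo-label D_u
        student = train_model(teacher, pseudo)        # phase 2: student mimics teacher
        teacher = train_model(student, labeled)       # student becomes the new teacher,
                                                      # re-tuned on ground-truth D_l
    return teacher
```

The structural points to note are the iterative loop and the final fine-tuning of the new teacher on D_l, which counteracts drift from the ground-truth distribution.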

Multi-Task Teacher
In the first phase, we leverage the small amount of labeled data D_l to train the teacher model p^T to jointly predict the task labels and the rationale labels. To this end, we leverage a shared BERT encoder h^T with two separate softmax classification layers for the two tasks. We denote p^T(y|x) = softmax(h^T(x); θ^T_t) and p^T(r_j|x) = softmax(h^T(x)_j; θ^T_r) as the corresponding task and rationale predictions of the teacher model given an instance x, where h^T(x)_j is the BERT hidden-state representation of the j-th token and θ^T_t, θ^T_r are the task-specific head parameters. For brevity, we omit the parameter specification in what follows and denote by p^T the shared BERT and task-specific parameters.

We jointly optimize the following losses with respect to (the parameters of) p^T:

L_l(p^T) = -E_(x,y,r)∼D_l [ log p^T(y|x) + Σ_j log p^T(r_j|x) ]    (1)

where y is the ground-truth task label for input x and r_j is the ground-truth rationale label for the j-th token in input x.
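The shared-encoder, two-head design can be sketched as follows. This is an illustrative numpy stand-in: the random projection replaces BERT, and the dimensions and names (`W_enc`, `theta_t`, `theta_r`) are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class MultiTaskModel:
    """Shared encoder h with a task head (theta_t) and a rationale head (theta_r)."""
    def __init__(self, d=8, n_classes=2):
        self.W_enc = rng.normal(size=(d, d))       # stand-in for the BERT encoder
        self.theta_t = rng.normal(size=(d, n_classes))
        self.theta_r = rng.normal(size=(d, 2))     # per-token binary rationale head

    def forward(self, x):                          # x: (seq_len, d) token embeddings
        h = np.tanh(x @ self.W_enc)                # shared hidden states h(x)_j
        p_task = softmax(h.mean(axis=0) @ self.theta_t)   # p(y|x) from pooled h
        p_rat = softmax(h @ self.theta_r, axis=-1)        # p(r_j|x) for each token
        return p_task, p_rat
```

Both heads read the same hidden states, so gradients from the task loss and the rationale loss update a shared representation, which is the source of the cross-task interactions described above.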
In contrast to prior work (DeYoung et al., 2020) that leverages two decoupled BERT models in a stage-wise fashion, first extracting the rationales and then using them for task label prediction, our framework learns a single model to predict both jointly in a multi-task learning setup. This allows the model to capture richer interactions between the two tasks.
There are a few design choices for optimizing the teacher in Eq. 1. For instance, we can optimize the teacher parameters with the few-shot labeled data in each self-training iteration after the student becomes the new teacher, or only once at the beginning of self-training to initialize a good teacher. We observe that executing this phase in each self-training loop is more effective, as it diminishes drifting from the ground-truth data distribution.

Multi-Task Student
In the second phase, we self-train a student model p^S on the teacher-generated pseudo-labels with a pseudo-labeled task loss and rationale loss. In contrast to the teacher model, which operates on the few labeled examples D_l, the student model operates on the unlabeled data D_u. The student model has the same architecture as the teacher, with a shared encoder h^S and task-specific classification heads θ^S_t, θ^S_r. The pseudo-labeled multi-task loss is formulated as:

L_u(p^S) = -E_u∼D_u [ log p^S(y^T|u) + Σ_j log p^S(r^T_j|u) ]    (2)

where y^T is the teacher-generated task pseudo-label for input u and r^T_j is the teacher-generated rationale pseudo-label for the j-th token in input u.

Student-Teacher Update
At the end of every self-training iteration, we transfer the knowledge acquired by the student back into the teacher model by setting h^T = h^S, θ^T_t = θ^S_t, θ^T_r = θ^S_r, and start again by fine-tuning the newly obtained teacher on the ground-truth data D_l.

Re-weighting Pseudo-labeled Samples
Instead of directly imitating the teacher's predictions as described in Eq. 2, we found it extremely effective to train the student model to explicitly account for the teacher's confidence in the generated pseudo-labels. This allows us to filter noisy pseudo-labels, as the student model can selectively focus on the pseudo-labeled samples the teacher is more confident about, compared to the less certain ones. Therefore, we optimize a weighted version of the pseudo-labeled loss in Eq. 2:

L_w(p^S) = -E_u∼D_u [ w(u) log p^S(y^T|u) + Σ_j w_j(u) log p^S(r^T_j|u) ], with w(u) ∝ p^T(y^T|u) and w_j(u) ∝ p^T(r^T_j|u)    (3)

The proportionality sign reflects the fact that these weights are normalized across each batch when training with minibatch SGD, so the weights depend on the batch and sum to one over it. Re-weighting noisy labels with different weighting schemes has been explored with meta-learning (Ren et al., 2018) and uncertainty-aware self-training (Mukherjee and Awadallah, 2020).
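The batch-level normalization of the confidence weights might look like this. It is a sketch: `pseudo_label_weights` is a hypothetical name, and using the teacher's predictive probability of its own pseudo-label as the confidence score is an illustrative assumption.

```python
import numpy as np

def pseudo_label_weights(teacher_probs):
    """teacher_probs: per-sample confidence of the teacher in its own
    pseudo-label (e.g. the probability it assigns to the predicted class).
    Returns weights proportional to confidence that sum to one over the
    minibatch, matching the proportionality described in the text."""
    w = np.asarray(teacher_probs, dtype=float)
    return w / w.sum()
```

Because the normalization is per batch, a confidently labeled sample is upweighted relative to the uncertain samples that happen to share its minibatch.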

Rationale Characteristics
In this section, we encode several characteristics of what constitutes a good rationale, drawn from prior work, into our self-training framework via auxiliary loss functions.

Sufficiency
A desired property of a good rationale is sufficiency. It requires the model's predictions about the task label from the entire input text to be similar to the predictions made by looking only at the rationale text. This concept can be promptly translated into a consistency objective:

L_suf(p^S) = -E [ log p^S(y^T | u ⊙ r^T) ]    (4)

where the expectation is taken w.r.t. u ∼ D_u, r^T ∼ p^T(r|u), y^T ∼ p^T(y|u), and u ⊙ r^T is the masked version of document u in which tokens that are not part of the rationale (as predicted by the teacher) are replaced with a special [MASK] token. Here, the teacher model looks at the full input while the student model looks only at the rationale tokens.
The sufficiency loss can be interpreted as an alternative way of integrating rationale information in the model. Current efforts either predict the rationale first and then use it for task prediction sequentially as in BERT-to-BERT (DeYoung et al., 2020); or employ attention regularization such that the BERT attention weights are as close as possible to uniform on the rationale tokens (Pruthi et al., 2020b). The first approach can be very sensitive to error propagation from the rationale generator since the task label is predicted using only the generated rationales at test-time. The second approach strictly assumes uniform attention on the rationale tokens. In contrast, our sufficiency loss makes very few assumptions on how the model should attend to the rationale tokens, and only requires the student distribution of the task labels to be similar to that of the teacher. This yields more robustness to rationale errors, given that at test-time our model can use the full input to predict the task label.
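A minimal sketch of the masking step and the consistency term follows, assuming cross-entropy between the teacher and student distributions as the divergence; `mask_complement` and `sufficiency_loss` are illustrative names, not the paper's implementation.

```python
import numpy as np

MASK = "[MASK]"

def mask_complement(tokens, rationale):
    """Replace tokens outside the (teacher-predicted) rationale with [MASK]."""
    return [t if r == 1 else MASK for t, r in zip(tokens, rationale)]

def sufficiency_loss(p_teacher_full, p_student_rationale, eps=1e-12):
    """Cross-entropy between the teacher's distribution on the full input
    and the student's distribution on the rationale-only input: low when
    the rationale alone supports the teacher's prediction."""
    p = np.asarray(p_teacher_full, dtype=float)
    q = np.asarray(p_student_rationale, dtype=float)
    return float(-(p * np.log(q + eps)).sum())
```

The loss is near zero when the two distributions agree and grows as the student, seeing only the rationale, diverges from the teacher.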

Completeness
Another desideratum of a rationale is completeness. Completeness implies that the rationale should capture all the aspects of the input text that are predictive of the task label. We translate this concept by requiring the student model to be maximally uncertain about the task label when it does not see the rationale, i.e., when the teacher-predicted rationale tokens are masked out of the input text:

L_com(p^S) = -E [ H( p^S(y | u ⊙ (1 - r^T)) ) ]    (5)

where the expectation is w.r.t. u ∼ D_u and r^T ∼ p^T(r|u), H is the entropy of the student predictive distribution, and u ⊙ (1 - r^T) is the document obtained by masking out the tokens in the rationale.
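The entropy term can be sketched as a negative-entropy penalty: minimizing it pushes the student toward maximal uncertainty on the rationale-masked input. The function name is illustrative.

```python
import numpy as np

def completeness_loss(p_student_masked, eps=1e-12):
    """Negative entropy -H(p) of the student's prediction on the input with
    the rationale tokens masked out. It is smallest (most negative) for a
    uniform distribution, so minimizing it rewards maximal uncertainty."""
    p = np.asarray(p_student_masked, dtype=float)
    return float((p * np.log(p + eps)).sum())
```

If the student can still make a confident prediction without the rationale, the rationale has missed predictive evidence and this term penalizes it.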

Coherence Loss
Finally, we desire the rationales to be short and composed of contiguous chunks of text rather than unigrams. To ensure this, we adopt the regularization losses introduced in Lei et al. (2016), which explicitly penalize the rationale generator for predicting long rationales and encourage rationales to span contiguous chunks of the input text.
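The two regularizers of Lei et al. (2016) have a compact form on a binary (or relaxed) rationale mask r: a sparsity term penalizing rationale length and a total-variation-style term penalizing transitions, which discourages isolated words. A sketch, with an illustrative function name:

```python
import numpy as np

def coherence_regularizers(r):
    """Sparsity and contiguity penalties on a rationale mask r over the
    token sequence, in the spirit of Lei et al. (2016)."""
    r = np.asarray(r, dtype=float)
    sparsity = r.mean()                      # penalize long rationales
    contiguity = np.abs(np.diff(r)).mean()   # penalize on/off transitions,
                                             # i.e. scattered single tokens
    return sparsity, contiguity
```

A scattered mask such as [1, 0, 1, 0, 1] has the same sparsity as the contiguous [0, 1, 1, 1, 0] but a higher contiguity penalty, which is exactly the behavior the coherence loss targets.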

Training Objectives
Our overall training objective in the teacher learning phase is simply the loss on labeled data L l (p T ).
For the student, we use a combination of the previously presented loss functions on unlabeled data:

L(p^S) = L_w(p^S) + λ_suf L_suf(p^S) + λ_com L_com(p^S) + λ_coh L_coh(p^S)    (6)

where L_w is the re-weighted pseudo-labeled loss of Eq. 3, L_suf and L_com are the sufficiency (Eq. 4) and completeness (Eq. 5) losses, L_coh is the coherence regularization, and the λ's are scalar weights. We perform early stopping on validation performance for the teacher multi-task training in each loop of self-training, and select the best model based on validation loss across the self-training iterations. For Evidence, we use a hyper-parameter, set inversely proportional to the number of pseudo-labeled samples per class, to upweight examples from the minority pseudo-labeled class in the self-training loop and combat class imbalance.
Baselines We compare our model performance to the following methods for both fully supervised (with access to all training labels) and our few-shot setting with 100 labels per class.
(1) BERT w/o explanation fine-tunes BERT on a set of labeled examples without accounting for rationales.
(2) BERT with explanation is our multi-task learning setup where the classifiers are trained to predict both task labels and rationales without encoding the rationale characteristics. We also compare against the semi-supervised setting from prior work (Paranjape et al., 2020) that uses 25% rationale annotations and 100% task labels for each dataset. Finally, we compare our method against the fully supervised BERT-to-BERT model from the ERASER benchmark (DeYoung et al., 2020), which first performs rationale extraction followed by downstream task prediction using the extracted rationale.

Results
Overall performance Table 2 summarizes the results of our proposed model and the baselines across all the datasets. We observe that our model trained with only N = 100 labels per class performs within 10.6% of fully supervised BERT trained with thousands of labels, while obtaining an aggregate F1 of 66%. Our self-training framework iteratively improves over the teacher model via a judiciously designed student network, with a performance gain of 6.45%.
For the fully supervised models, we observe that BERT with explanation in our multi-task learning setup improves by 2.6% over vanilla BERT, thereby demonstrating the usefulness of rationales and the effectiveness of multi-task learning.
Our model's performance on the downstream tasks and rationales further improves with an increasing amount of labeled data. With 25% labeled training data, our model performs on par with Paranjape et al. (2020) (which has been trained with 100% task labels) in 3 out of 4 tasks and is comparable to the fully supervised BERT-to-BERT baseline. Additionally, our model does not require additional user input in the form of a desired sparsity threshold, as in prior work (Paranjape et al., 2020).
To validate the effectiveness of the sufficiency and completeness losses as a way of integrating rationales into the model, in contrast to attention regularization (Pruthi et al., 2020b), we perform the following experiment: instead of using teacher-predicted rationales, we use the ground-truth rationales. We observe that our model with the sufficiency and completeness losses outperforms the prior method of integrating explanations via attention regularization (results in Figure 3 in the Appendix). This may reflect the fact that attention regularization imposes the stricter assumption of uniform attention on the explanation tokens.

We observe the self-training performance to improve on initializing our encoders with pre-trained domain-specific checkpoints. For instance, using the BioBERT checkpoint, we observe a gain of 4.8% over BERT-base on the Evidence dataset. Table 4 presents a few examples from our rationale extractor and the corresponding task labels. Table 3 summarizes the impact of different components of our framework with N = 100 labels per class as training data for Movies and e-SNLI.

Ablation Study
Re-weighting pseudo-labels We found it extremely useful to re-weight noisy pseudo-labeled samples from the teacher model by its confidence. We observe that re-weighting the rationales and task labels works quite well for the Movies dataset. However, it has limited impact for e-SNLI, where the teacher model produces rationales with low coverage; correspondingly, we did not observe a difference in model performance from re-weighting the task and rationale pseudo-labels.
Impact of different loss functions. We observe that using only the sufficiency loss (Eq. 4) results in the model extracting the entire input as the rationale. This is counteracted by adding penalization via sparsity loss to obtain rationales that are concise yet informative about the task label. From Table 3, we observe sparsity loss to significantly reduce the number of tokens included in the rationale.
However, adding the sparsity loss also caused instability in some cases (Figure 2 (b)). We empirically demonstrate that this instability is mitigated by including the completeness loss that forces the model to be maximally uncertain when it does not look at important tokens constituting the rationale.
Impact of amount of labeled training data We observe that our self-training framework improves both task and rationale extraction performance as the number of labeled training samples increases (Appendix, Figure 4).

Impact of number of self-training iterations Figure 2 (c) shows the improvement in the task accuracy of our model over several self-training iterations for Movies; the corresponding plots for the other datasets are provided in Figure 6 in the Appendix. We observe that self-training gradually improves model performance in the first few iterations for the majority of the tasks and converges within 12-15 iterations. However, the rationale extraction module drifts after 10 self-training iterations for most of the datasets (Figure 2 (d)) due to error propagation from noisy pseudo-labels, thereby necessitating early stopping based on rationale validation loss.

Table 4 examples (rationale highlighting from the original is not reproduced):

Dataset: Movies. Ground truth: Negative, Prediction: Negative. There're so many things to criticize about I don't know where to start. Recommendation: turn off your brain - don't be like me, decreasing the rating everyday because I think about it too much ..... Firstly, there is nothing outstandingly inferior about the making of the film (nor is there anything outstandingly good about it), but the plot holes make the film corny and stupid.

Dataset: Movies. Ground truth: Negative, Prediction: Negative. Yet another brainless teen flick, ...... stars Katie Holmes and Sarah Polly couldn't look more bored. One thing you need to know is I really hated this movie. Everything about it annoyed the hell out of me. The acting, and script, the plot, and ending.

Dataset: e-SNLI. Ground truth: Contradiction, Prediction: Contradiction. A man playing electric guitar on the stage. A man playing banjo on the floor.

Dataset: Movies. A guy whose entire life is broadcast 24 hours a day?.... which is why I was pleasantly surprised by "edtv," which turns out to be a fresh, insightful, and often times hilarious film about the follies of instant celebrity.

Dataset: e-SNLI. Ground truth: Entailment, Prediction: Neutral. A woman tired from her long day takes a nap on her bed above the sheets and covers. A lady is lying in bed.
Few labeled data fine-tuning At each self-training iteration, the teacher is fine-tuned on labeled data. This avoids drifting from the original task via the few annotated labels (Figure 2 (a)). Figure 2 demonstrates the change in the accuracy of our student model with and without this teacher fine-tuning at every self-training iteration.

Error analysis Table 5 presents a snapshot of the qualitative error analysis of our model. On analyzing the extracted rationales for mis-classified instances, we observe some common failure modes: shifts in context; presence of satire or sarcasm; rationales relying on background knowledge; and noisy or incomplete annotated rationales. For instance, the overall polarity of the first example from Movies is positive, although the majority of the text describing the movie plot carries a negative connotation. We observe a similar trend with reviews involving sarcasm or satire. In the last example from e-SNLI, the annotators marked {woman, nap} as entailing {lady, lying in bed}. In this rationale, the annotators do not follow the guidelines for sufficiency and completeness, since the spatial qualifier for nap is missing from the ground truth. Surprisingly, our model does not pick up the spatial concept either and marks the sequences as neutral to each other.

Conclusion
We develop a multi-task self-training framework for rationale extraction, focusing on low-resource settings with access to very few training labels. To this end, we build on insights from prior work on the characteristics of a good rationale and encode them via judiciously designed loss functions in our self-training framework. Extensive experiments on benchmark datasets show our model to outperform other state-of-the-art methods with access to limited labels. We further demonstrate that the performance of a pre-trained language model can be improved by making it aware of the rationales for its decision-making process in both high-resource (fully supervised) and low-resource (few-label) settings.

Table 5 examples (error analysis; rationale highlighting from the original is not reproduced):

Dataset: Movies. Though good-looking, its lavish sets, fancy costumes and luscious cinematography can do little to compensate for the emotional wasteland... this is Jodie Foster's first movie since the jaw-droppingly brilliant contact came out more than two years ago and it isn't the best choice to show off her acting chops.

Dataset: e-SNLI. Ground truth: Entailment, Prediction: Neutral. A woman tired from her long day takes a nap on her bed above the sheets and covers. A lady is lying in bed.

Dataset: e-SNLI. Ground truth: Contradiction, Prediction: Entailment. A mountainous photo is complete with a blue sky. The photo was taken on a cloudy night.

Legend: ground-truth rationales not detected by the model; rationales extracted by the model but absent in the ground truth; rationales present in both the model output and the ground truth.