Few Shot Rationale Generation using Self-Training with Dual Teachers

Self-rationalizing models, which generate a free-text explanation alongside their predicted labels, are an important tool for building trustworthy AI applications. Since generating explanations for annotated labels is a laborious and costly process, recent models rely on large pretrained language models (PLMs) as their backbone and few-shot learning. In this work we explore a self-training approach that leverages both labeled and unlabeled data to further improve few-shot models, under the assumption that neither human-written rationales nor annotated task labels are available at scale. We introduce a novel dual-teacher learning framework, which learns two specialized teacher models for task prediction and rationalization using self-training and distills their knowledge into a multi-tasking student model that can jointly generate the task label and rationale. Furthermore, we formulate a new loss function, Masked Label Regularization (MLR), which encourages explanations to be strongly conditioned on predicted labels. Evaluation on three public datasets demonstrates that the proposed methods are effective in modeling task labels and generating faithful rationales.


Introduction
Interpretable NLP has emerged to learn models which explain their predictions through either extractive (DeYoung et al., 2020) or natural language explanations (Camburu et al., 2018; Narang et al., 2020; Wiegreffe et al., 2020). Due to the higher expressivity of free text, generative self-rationalizing models have gained much research interest. However, the early works assume a fully supervised setup and require a large annotated dataset (Narang et al., 2020). Collecting large-scale, manual annotations for task labels and corresponding explanations is challenging and expensive. On the other hand, a much larger unlabeled corpus is often available, making semi-supervised approaches like few-shot learning (Brown et al., 2020) and self-training (He et al., 2019) attractive solutions. In the context of self-rationalizing models, Marasovic et al. (2022) explore few-shot learning, while Zelikman et al. (2022) seek to improve a supervised labeler by augmenting it with rationale generation. In this work we start from a few-shot setup, assuming only a handful of examples are available with their labels and hand-written rationales. We leverage a large unlabeled dataset and self-training techniques to improve over the simple few-shot model.

* Work done during an internship at Amazon.
We hypothesize that, with only a few examples, learning to generate meaningful explanations jointly with predicting the labels themselves is a particularly challenging objective, and self-training can suffer from a weak initial model. To address this, we propose a novel Dual Teacher learning approach to learn a self-rationalizing model from two teacher models in a cascading manner. First, a Predictor model is learned for predicting task labels, and then a Rationalizer model is learned to generate an explanation conditioned on an input and the task labels predicted by the Predictor model. We iteratively improve both models via self-training. In contrast to learning the Joint model directly, the Rationalizer model allows for much richer representation learning by moving the label information from the decoder to the encoder, utilizing the encoder's self-attention mechanism to extract input-label correlations. A stronger few-shot model for rationale generation provides higher quality pseudo labels, consequently making self-training more effective.
Although the two conditional models (Predictor and Rationalizer) might perform better, a single self-rationalizing model is still desirable for practical applications, due to its ease of maintenance and parameter efficiency for faster inference. We apply principles from knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016) to the two conditional models to learn a joint model that generates the task label and explanation as a single sequence. The teacher models are used for generating pseudo labels on the entire unlabeled dataset. The initial few-shot labeled data and the pseudo-labeled dataset are finally combined to train the joint model.
Faithfulness of explanations is an imperative property for practical applications of interpretability analysis. A model-generated explanation is considered faithful if it accurately explains the decision making of the model (Alvarez Melis and Jaakkola, 2018; Wiegreffe et al., 2020). Similar to prior study (Jacovi and Goldberg, 2020), we also observe that a free-text explanation generated by models might sound plausible without satisfying the faithfulness criterion of explaining the predicted task label. This motivates us to design a masking-based regularization function, Masked Label Regularizer (MLR), to encourage the model to condition on the task label while generating an explanation. MLR is an entropy-based constraint that forces the Rationalizer model to be maximally uncertain in generating an explanation in the absence of label tokens, and is used to ensure that the Rationalizer model preserves faithfulness through the self-training iterations. To summarize, our contributions are:
• Proposing to utilize self-training for learning self-rationalizing models with free-text explanations, demonstrating that it provides a significant performance boost compared to few-shot learning.
• Proposing a novel Dual Teacher framework, where two teacher models are trained with self-training in a cascading manner for learning two tasks, and a multi-task joint student model is learned through distillation from the teachers.
• Extensively studying the faithfulness property of free-text explanations, and designing an entropy-based regularization to encourage label-explanation conditioning.
• Conducting experiments on three public benchmark datasets and demonstrating the effectiveness of our proposed model in improving both task accuracy and explanation quality.

Related Work
Prior works on generating free-text rationales have explored joint models (Narang et al., 2020; Marasovic et al., 2022) as well as several variants of pipeline models (Wiegreffe et al., 2020; Jang and Lukasiewicz, 2021). We also use sequence-to-sequence models (Raffel et al., 2019) as our backbone models. While most of the self-rationalizing literature assumes fully supervised setups, STaR (Zelikman et al., 2022) explores an alternate bootstrapping setup where limited rationales are available, but the task labels are present for the whole dataset. We consider the more generic and restrictive setting where only limited annotations are available for both the task label and the rationale.
For the limited labeled data scenario, many NLP applications have reported success with self-training (Mehta et al., 2022; Yu et al., 2022; He et al., 2019; Bhat et al., 2021). Inspired by these works, we apply self-training to the self-rationalization problem. We introduce a new training framework with two conditional models, using them as teachers in a further distillation step to train the joint model. Besides its popular use for model compression, knowledge distillation has also shown superior performance when using the same model architecture and size for both the student and teacher models (Furlanello et al., 2018), and when distilling from multiple teachers (Yuan et al., 2021; Liu et al., 2020). Recently, a work in the computer vision domain (Ghiasi et al., 2021) has explored using pseudo-labels from multiple teachers to train a joint student model. However, they have multiple specialized teachers trained independently through full supervision, in contrast to the cascading nature of our dual teacher self-training setup.
Evaluating the quality of free-text rationales is significantly challenging, and several works have proposed metrics to evaluate explanations around fluency and faithfulness (Hase and Bansal, 2020; Hase et al., 2020; Marasovic et al., 2022). A recent work (Wang et al., 2022) also tries to imbue faithfulness through a regularizing coefficient. However, they apply the regularizer to perturb the rationale while generating the task label. In contrast, we use a label masking regularizer to enforce that the Rationalizer model generates an explanation which is faithful to the label.

Background
We first provide some necessary background on self-rationalizing models and a theoretical outline of self-training based learning.

Self-Rationalization: A self-rationalization model tries to learn the joint distribution of output (O) and explanation (E), given an input (I), i.e., P(O, E|I). A common approach is modeling it as a sequence-to-sequence problem and generating the task prediction and the rationale jointly (Narang et al., 2020). The input-output format for a self-rationalizing joint model is illustrated in Figure 1. The input consists of a task prompt (e.g., explain nli), and in the output sequence the task label is generated first (e.g., contradiction), followed by a separator token (explanation:), and then the free-text explanation. During inference, greedy decoding is used to generate the sequence until an EOS token is produced.

Self-Training is a semi-supervised learning method which assumes access to a small labeled dataset (D_l) and a large, unlabeled in-domain dataset (D_u). The algorithm progresses iteratively in four steps. First, a teacher model is trained on the labeled dataset (D_l) to obtain θ_T. The trained teacher is then used to infer pseudo-labels on D_u, generating the pseudo-labeled dataset D_pl. A student model is then trained on D_pl to obtain θ_S. In the next iteration the teacher model is updated with the learned parameters from the student, and the process repeats until a convergence criterion is met.
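The four-step loop above can be sketched as follows; this is a minimal sketch, and `train`, `pseudo_label`, and `converged` are hypothetical stand-ins for model fitting, pseudo-label inference, and the convergence check:

```python
def self_train(labeled, unlabeled, train, pseudo_label, converged):
    """Iterate teacher -> pseudo-labels -> student until convergence."""
    teacher = train(labeled)                       # step 1: fit teacher on D_l
    while True:
        pseudo = pseudo_label(teacher, unlabeled)  # step 2: label D_u, giving D_pl
        student = train(labeled + pseudo)          # step 3: fit student on D_l + D_pl
        if converged(teacher, student):            # step 4: stop or iterate
            return student
        teacher = student                          # student becomes the next teacher
```

Whether the student sees D_l in addition to D_pl varies between self-training variants; combining both is one common choice.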

Dual Teacher for Self-Rationalization
We combine the strengths of self-training and knowledge distillation to train a self-rationalizing joint model from dual teachers. The following sections describe the components, their losses and the learning procedures in more detail. Input-output formats of the models are shown in Figure 1, and the overall framework is illustrated in Figure 2.

Problem Setup
We tackle the self-rationalization problem with few-shot labels. We assume access to a small labeled set, D_l = {(i_j, o_j, e_j)}_{j=1..N}, where i_j is the input, o_j is the task output, and e_j is the natural language explanation. We also leverage a much larger unlabeled dataset denoted by D_u = {i_j}_{j=1..M}, where M ≫ N. In the unlabeled dataset only the input text is available, and no annotation is provided for either the task label or the rationale.
To keep all models identical, we model all distributions in a sequence-to-sequence manner using T5 (Raffel et al., 2019). The teacher model in self-training is trained on few-shot ground truth output sequences, and the trained teacher is then used for generating output sequences for the unlabeled dataset. These sequences are treated as pseudo labels to train the student model. We re-weight the loss of each example with the confidence of the teacher model. This limits error propagation through self-training iterations due to the noisy nature of pseudo labels. We use the likelihood of the generated sequence as the confidence estimate. Following Bhat et al. (2021), we normalize the weights within a batch.
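The confidence weighting can be sketched as below; function names are assumptions, and `token_logprobs` would come from the teacher's decoder scores:

```python
import math

def sequence_confidence(token_logprobs):
    """Teacher confidence: the likelihood of the generated pseudo-label
    sequence, i.e. exp of the summed token log-probabilities."""
    return math.exp(sum(token_logprobs))

def batch_normalized_weights(confidences):
    """Normalize the confidence weights within a batch so they sum to 1,
    following the batch normalization of weights described above."""
    total = sum(confidences)
    return [c / total for c in confidences]
```

Each example's sequence loss would then be scaled by its normalized weight before averaging over the batch.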

Splitting the Joint into Conditionals
In order to make the learning task easier, we break down the joint probability of the task and rationale into its conditionals:

P(O, E|I) = P(O|I) · P(E|I, O)
This allows us to build two separate models in a cascading manner: (1) a Predictor model for predicting the task label, i.e., P(O|I), and (2) a Rationalizer model for rationalizing the task label for an input, i.e., P(E|I, O). Prior work (Jang and Lukasiewicz, 2021) has shown that factorizing this distribution into predicting the output first (Prediction) and then generating an explanation for the prediction (Rationalization) obtains better performance than alternate factorizations.
We hypothesize that with limited labeled examples, learning a joint distribution for the <task label+rationale> sequence is much harder than focusing on learning to predict only the task label. More importantly, for rationale generation we move the task label from the output sequence (in the joint model) to the input sequence (in the Rationalizer model). This allows the encoder to capture much richer interactions between the task label and the input through its self-attention network, compared to only the decoder in the joint model. The stronger initial few-shot Predictor and Rationalizer models are then further boosted through self-training, generating higher quality pseudo labels.
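The split can be illustrated with the input/output strings of the three models. The exact prompts and separator tokens below are assumptions, since Figure 1 is not reproduced here:

```python
def joint_format(task, inp, label, explanation):
    """Joint model: the label and explanation form one decoder output."""
    return (f"explain {task}: {inp}",
            f"{label} explanation: {explanation}")

def predictor_format(task, inp, label):
    """Predictor teacher: models P(O|I) only."""
    return (f"{task}: {inp}", label)

def rationalizer_format(task, inp, label, explanation):
    """Rationalizer teacher: the label moves into the encoder input, so
    encoder self-attention can relate it to the input, modeling P(E|I, O)."""
    return (f"explain {task}: {inp} label: {label}", explanation)
```

Note that only the Rationalizer places the label on the encoder side, which is the source of the richer input-label interactions discussed above.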

Predictor Teacher
In the first step of our framework, we train a Predictor model with self-training. The Predictor is trained to model the probability of the task output given the input, i.e., P(O|I). The task output is decomposed into subwords, and the model is trained to minimize the negative log likelihood of the output token sequence:

L_pred = − Σ_t log P_θ(o_t | o_<t, i)

The Predictor model is trained within its own self-training loop, utilizing the few-shot ground truth task labels and unlabeled inputs. After self-training has converged, we store the Predictor and use it for generating pseudo task labels on all unlabeled data.
These pseudo labels are then used for training the Rationalizer model and the Joint model.

Rationalizer Teacher
In the second stage we train a Rationalizer model that can generate natural language explanations given an input and the predicted task output, modeling the conditional distribution P (E|I, O).
For training the teacher model we use the few-shot ground truth labeled dataset for the task label and rationale. For generating rationale pseudo-labels on the unlabeled set, we use the task pseudo-labels generated by the Predictor model as input. The generated rationale pseudo-labels are then used to train a student Rationalizer model in a self-training loop until convergence.

Faithfulness of Explanations
For a Rationalizer model to generate a faithful explanation, we want the explanation to be strongly conditioned on the label.The rationalizer should not be able to generate an explanation solely based on the input, but must take into consideration the label for which it is rationalizing.We introduce a regularizing constraint in our rationalizer model to explicitly encode this property.

Masked Label Regularization
We design an entropy-based regularization which encourages the model to be maximally uncertain in generating the explanation in the absence of a task label. We achieve this by replacing the task output with mask tokens and maximizing the per-token entropy of the explanation sequence:

L_MLR = − H_θ[e|i]

where H_θ[e|i] refers to the entropy of producing an explanation from the input directly, with the label tokens masked.
There could be alternate ways of encoding the constraint of label-explanation association. We experimented with one such variant where the ground truth explanation would be generated with high entropy in case of a wrong label. We observed similar empirical results for this alternative. However, it is strictly less general, since it is limited to categorical problems, and it is also computationally more expensive due to the necessity of computing the entropy for multiple wrong labels. Therefore, we use the simpler and more generic form of masking the label tokens.
The overall loss of the Rationalizer is a weighted sum of the sequence generation loss and the regularization loss:

L = L_gen + λ_MLR · L_MLR

λ_MLR is empirically set to 1e-4 in our experiments for all datasets.
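A minimal sketch of the MLR term and the combined Rationalizer objective; the function names are assumptions, and `per_step_probs` stands for the decoder's next-token distributions computed on the label-masked input:

```python
import math

def token_entropy(probs):
    """Shannon entropy of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mlr_loss(per_step_probs):
    """Negative mean per-token entropy under the label-masked input.
    Minimizing this loss maximizes the model's uncertainty, as MLR requires."""
    entropies = [token_entropy(p) for p in per_step_probs]
    return -sum(entropies) / len(entropies)

def rationalizer_loss(gen_loss, mlr, lam=1e-4):
    """Overall Rationalizer objective: generation loss plus the weighted
    MLR term, with lam corresponding to lambda_MLR above."""
    return gen_loss + lam * mlr
```

A uniform next-token distribution gives maximal entropy, so a well-regularized model pushed toward uniform predictions under the masked input yields a more negative MLR term.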

Learning from Multiple Teachers: Distilling a Joint from the Conditionals
Knowledge distillation is an effective learning paradigm to train a lighter student model with rich supervision signals from better-performing teacher model(s). To alleviate the limitation of scarce labeled data for learning a good self-rationalization model, we leverage the unlabeled data and collect task and rationale pseudo-labels sequentially from the trained Predictor and Rationalizer teacher models.
The final pseudo-labeled dataset is then combined with the few-shot labeled data and a joint model is trained on this set.This allows the knowledge from both the Predictor and Rationalizer models to be distilled into the student Joint model through pseudo labels and the teachers' confidence weights.
The joint model is trained to maximize the likelihood of the concatenated sequence of task output and explanation, as illustrated in Figure 1. The detailed training algorithm is described in Algorithm 1.
Loss Re-weighting: As in most sequence-to-sequence models, in WT5 (Narang et al., 2020) all output tokens in the generated sequence have uniform weight in the loss. However, in the joint task setup, the number of tokens from the task label is substantially smaller than those in the explanation. To balance this, we re-weight the token-level losses between the output and the explanation. For a tuple (i_j, o_j, e_j), the loss is computed as:

L(i_j, o_j, e_j) = λ · L(o_j) + (1 − λ) · L(e_j)

where λ ∈ [0.5, 1) is a weight coefficient and L(·) denotes the token-level negative log likelihood summed over the corresponding subsequence.
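The re-weighting can be sketched as follows; this assumes per-token losses have already been split into label and explanation subsequences, and the λ value shown is illustrative:

```python
def reweighted_joint_loss(label_token_losses, expl_token_losses, lam=0.75):
    """Token-level loss with weight lam on label tokens and (1 - lam) on
    explanation tokens, balancing the short label against the long rationale.
    lam must lie in [0.5, 1); 0.75 here is only an example value."""
    return (lam * sum(label_token_losses)
            + (1.0 - lam) * sum(expl_token_losses))
```

With lam = 0.5 this reduces to a plain (half-scaled) sum over all tokens, while larger lam shifts emphasis toward getting the task label right.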

Results and Discussion
We evaluate on public datasets for three different tasks. Table 1 shows statistics of the datasets. e-SNLI (Camburu et al., 2018) extends the popular SNLI dataset (Bowman et al., 2015) by adding human-annotated explanations to the NLI labels.
The task requires generation of a task label which describes the relationship between a premise and a hypothesis as entailment/contradiction/neutral, and a free text explanation for the prediction.
ComVE (Wang et al., 2020) aims to evaluate whether a model can distinguish between sensible and nonsensical statements based on common knowledge. We combine the data from SubTask A (Validation) and SubTask C (Generation) for our experiments.
ECQA (Aggarwal et al., 2021) augments the Commonsense QA dataset (Talmor et al., 2019) with free-text explanations that support the correct answer choice and refute the incorrect ones. We utilize the explanations for the correct output (Positive Property) as the explanation.
For few-shot settings we sample 100 examples per class for each dataset. The self-training setup leverages the few-shot labeled dataset (D_l) and the rest of the training set as the unlabeled dataset (D_u).

Implementation Details
We use the base variant of T5 (Raffel et al., 2019) as the backbone model for the Predictor, Rationalizer and Joint models. Following Narang et al. (2020), we measure task performance using accuracy, and rationalization using SacreBLEU (Post, 2018). Label smoothing was set to 0.1 and early stopping was used.

Main Results
Table 2 presents the main results. Our experiments show that self-training is a promising direction in bridging the performance gap, improving accuracy and BLEU across all tasks over the few-shot counterparts. We observe that re-weighting the pseudo-labels with the confidence of the teacher models provides small improvements in overall performance, in alignment with previous findings (Bhat et al., 2021).
Stronger Results with the Dual Teacher Self-Training Framework. Finally, we observe a further improvement from our proposed method of performing self-training on the Predictor and Rationalizer models and subsequently distilling the knowledge into a joint student model through pseudo labels.
The improvement in aggregate scores shows that the accuracy is within 8% of a fully supervised model, and 5% higher than the few-shot baseline.
The improvements from the proposed model are most prominent for the rationale generation task: the BLEU scores are improved by a large margin compared to learning both tasks jointly in a self-training setup. Impressively, the dual-teacher approach achieves an aggregated result of 20.71 BLEU, which is close to the aggregate performance of the fully supervised model (21.5 BLEU). We even obtained a higher BLEU score than the supervised model on the two smaller datasets, ComVE and ECQA.

Discussion
Next we conduct several deeper analyses of the models and provide detailed insight into the overall results presented in Section 5.2.
RQ1: Does breaking the joint into conditionals improve performance for task label prediction and explanation quality?
We first want to analyze the effectiveness of breaking the joint model into conditionals and learning two separate models for task prediction and rationalization. From the results in Table 3, it is evident that by breaking the joint distribution into conditionals, we obtain significantly higher performance across all datasets, especially for explanation generation. This validates our hypothesis that with limited labels, it is much harder for the model to learn the joint distribution of output and explanation than to learn the conditionals separately. With self-training, the gap in performance between the joint model and the conditionals decreases, but the individual models still outperform the joint model. These results align with the improvement observed from the Dual Teacher framework over the Joint model in Table 2. Training the Predictor and Rationalizer models in their own self-training loops creates two strong teacher models and provides better pseudo labels. This allows us to train a stronger self-rationalizing model through distillation than by training a joint model directly through self-training.
RQ2: Does the Masked Label Regularization help to generate more faithful explanations?
While our method achieves better BLEU scores compared to different baselines, it is also important to evaluate whether the generated explanations are faithful to the predictions, i.e., provide reasoning that supports the predicted label. During creation of the datasets, the annotators were instructed to assign a label and then explain the assignment with a natural language explanation. Therefore, it is desirable for the models to preserve this faithfulness property in generated explanations.
We perform two tests to analyze (1) whether the explanations are dependent on the output and (2) whether they reflect the intended label. Through these experiments we also conduct an ablation study to estimate the effect of the proposed Masked Label Regularization (MLR) constraint in improving the faithfulness of explanations.
Label-Explanation Association. We first conduct a simple analysis to check if the explanations are dependent on the model predictions. As a necessary condition for generating faithful explanations, different predicted labels have to produce different explanations. We measure this association as the number of test instances for which the model generates a distinct explanation for every label.
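The association metric just described can be sketched as follows; the data structure is an assumption, with one label-to-explanation mapping per test instance:

```python
def label_explanation_association(explanations_per_label):
    """Fraction of test instances whose generated explanations are distinct
    for every candidate label. `explanations_per_label` is a list where each
    item maps a label to the explanation generated under that label."""
    distinct = sum(
        1 for per_label in explanations_per_label
        if len(set(per_label.values())) == len(per_label)
    )
    return distinct / len(explanations_per_label)
```

An instance counts toward the score only if no two labels share an explanation, which operationalizes the "distinct explanation for all labels" condition.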
We vary the task label and ask the model to generate an explanation. For joint models, we replace the generated label with the other possible labels and ask the decoder to continue generating an explanation. For the Rationalizer model, we simply generate predictions while providing different labels in the input. We study the effect of MLR by removing the entropy regularization loss while training the Rationalizer. We denote this variant as Rationalizer − MLR. Dual Teacher − MLR refers to the Joint model trained using Rationalizer − MLR.

Results in Table 4 show that for the Joint model, only 72% of the examples have unique explanations per output on average across datasets. This implies that the label-explanation association is not inherently captured in the decoder, and for 28% of instances the generated explanation is constant and has no association with the labels. Adding the MLR loss encourages the model to condition on labels, and thereby provides a substantial improvement of over 10% for the Dual Teacher model. This indicates a strong association between the generated label and explanation, where the explanations are unique to the label in over 88% of cases.

Table 5: Simulatability score of the explanations from different methods. The higher the score, the more aligned the explanation is with the predicted label.

Simulatability of Explanations. We utilize the Simulatability metric as defined in prior work (Chan et al., 2022; Hase and Bansal, 2020) to evaluate how well an external system, human or AI, is able to simulate the prediction made by a black-box, self-rationalizing model using the explanation it generates. As simulators, two models are trained to predict the task label: (1) a control model, P(O|I), which predicts the output given the input, and (2) a treatment model, P(O|I, E), which predicts the output given the input and an explanation. The simulators are used to measure how much the explanations generated by the self-rationalizing model help in 'guessing' its predicted label. The simulatability score is defined as

SIM = acc(y_T = ŷ) − acc(y_C = ŷ)

where ŷ refers to the predicted label from the self-rationalizing model, and y_C and y_T refer to predictions from the control and treatment simulators, respectively. The higher the faithfulness of a model, the better aligned its explanations are with its predicted labels, relative to the control simulator, which does not consider explanations.
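The score can be computed as below; a minimal sketch, assuming label lists of equal length for the model, the treatment simulator, and the control simulator:

```python
def simulatability(y_hat, y_treatment, y_control):
    """Simulatability: how much better a simulator that sees the explanation
    (treatment) recovers the model's predicted labels than an input-only
    simulator (control), i.e. acc(y_T = y_hat) - acc(y_C = y_hat)."""
    n = len(y_hat)
    acc_treatment = sum(t == h for t, h in zip(y_treatment, y_hat)) / n
    acc_control = sum(c == h for c, h in zip(y_control, y_hat)) / n
    return acc_treatment - acc_control
```

A positive score means the explanations carry label-relevant information beyond the input alone; a score near zero or below suggests the explanations do not help simulate the model.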
Table 5 shows the simulatability scores of the various self-rationalizing models under consideration. We observe a similar trend as in Table 4 when comparing the different models, with the exception of e-SNLI. For e-SNLI the control simulator was notably stronger than the treatment simulator, potentially due to overlap with the pre-training tasks of T5. We note that overall there is a significant gap in simulatability between our models and the fully supervised model, indicating large room for improvement in the faithfulness of explanations for weakly supervised models.
RQ3: How does the performance change as self-training progresses?
Figure 3 shows the performance of different models over self-training iterations. We observe that the two teacher models consistently outperform the joint model over iterations on both datasets. On the ECQA dataset there is a large jump in accuracy in the first iteration, and the algorithm converges quickly. A similar trend is observed for BLEU scores, with a slight improvement for the Rationalizer in the first iteration and the score plateauing, or even declining in the case of the Joint model. For the e-SNLI dataset, accuracy continues to improve until five iterations for the Predictor, and three for the Joint model. The rationalization performance also converges after nearly five iterations for both models. Convergence of the algorithm could be explained by the poor separability of the class labels in the datasets, causing more erroneous pseudo-labels and a plateauing of performance as training progresses.
RQ4: How does the performance change with increase in labeled dataset size?
We study the performance of our model by conducting experiments with different dataset sizes. We only vary the labeled dataset size and keep the remaining training set as unlabeled data. For example, for ECQA the total size (D) is 7.5K, and we conduct experiments with labeled data (D_l) ranging from 50 to 2.5K examples, with the remaining data (D − D_l) as unlabeled data.

Conclusion
We study the self-rationalization problem with few-shot labels and demonstrate that self-training is an effective learning paradigm which can significantly reduce the gap between few-shot and fully supervised model performance. We present a novel dual teacher learning framework that learns two models for task label prediction and rationale generation through self-training and efficiently distills their knowledge into a single self-rationalizing joint model. With a masking-based loss formulation we enforce label-explanation association in the Rationalizer, leading to the generation of more faithful explanations. We conduct experiments on three public benchmark datasets for free-text explanations, and show that the proposed methods are effective in improving task performance while generating accurate and faithful explanations.

Limitations
Despite strong performance compared to few-shot learning, our self-training methods still leave significant room for improvement compared to the fully supervised benchmarks. It would be interesting to try larger language models to see if it is possible to close this gap with more knowledge embedded in the pre-trained models. Our evaluation of free-text rationales is limited by the automatic metrics, which are necessary but not sufficient to analyze the quality of an explanation of the model's decision making. From example explanations (a few of which are shown in the Appendix), it is evident that we still lack understanding along multiple dimensions, such as whether a factually wrong explanation arises from the model holding incorrect knowledge or from failing to retrieve the correct knowledge. Works that probe a language model with various prompts could be useful for investigating these directions.

Appendix
We include some qualitative analysis of the different design choices in our method.
Impact of moving the task label from the output sequence to the input

We observed substantial improvement in rationale performance with the Rationalizer teacher model compared to the Joint model. This can be attributed to the prediction being passed as an input to the encoder of the Seq2Seq model, generating a better representation of the prediction and yielding better quality rationales. Table 7 shows a sample of explanations generated by the Joint and Rationalizer models for cases where the label was predicted correctly. We see that the Rationalizer generally produces higher quality explanations, while the Joint model often generates nonsensical explanations with frequently repeated words. The better quality rationales obtained from the Rationalizer teacher help generate better pseudo labels, and the final model is able to capture those through distillation.

Effect of the Masked Label Regularization on faithfulness
Table 8 shows a few examples of rationales generated by the Dual Teacher model with and without the MLR loss. In the first example, on the e-SNLI dataset, we see that without the MLR constraint the model generates the same explanation for the neutral and contradiction labels, and the explanation for the neutral label indicates a contradiction. In contrast, when trained with MLR, it outputs an explanation which is in alignment with the assigned label. In the second example, from ComVE, the model without MLR outputs the same explanation for both labels, showing that it ignores the label assigned. With the MLR constraint the model is able to generate explanations sensitive to the assigned label. Although the reasoning for the incorrect label is wrong, this behavior is still desired for an interpretable system elucidating why a prediction was made.

Error Analysis
Table 9 shows a snapshot of the qualitative analysis of the errors from our model. From the explanations generated for the predictions, we see that the model is unaware of situations which require additional background information, such as the existence of hair on eyes, or subtle differences between words, such as paws and feet. We believe a better pretrained language model could help alleviate some of these issues.

Figure 1: Input and output formats for the Predictor, Rationalizer and Joint models.

Figure 2: Dual Teacher training framework. The Predictor and Rationalizer models are trained in their own self-training loops. Pseudo labels generated from the trained Predictor and Rationalizer models are used for training the Joint model.

Figure 3: Performance across self-training iterations on the ECQA and e-SNLI datasets of the Confidence Weighted Joint, Predictor and Rationalizer models. Dashed lines show the performance of the Few-Shot Joint model.

Table 1: Dataset statistics. Token-level statistics were generated using the T5-base tokenizer.

Table 3: Performance of the Joint model compared to the Predictor and Rationalizer models in the fully supervised, few-shot and self-training setups.

Table 4: Label-explanation association measured as the % of inputs with distinct explanations for each task label.
As can be seen from the table, the Rationalizer teacher achieves significantly better label-explanation association than the Joint counterparts. The MLR constraint further improves the results, especially on the ComVE dataset, where explanations are much longer on average.

Table 6: Effect of the size of the labeled set on the final performance.
Table 6 reports the accuracy and BLEU score of our proposed model for labeled dataset sizes ranging from 50 to 2500 samples. We see that there is an improvement in test accuracy and BLEU score as the labeled data size increases. With as few as 500 examples per label, the model is able to achieve accuracy within 6% of the fully supervised model.