ER-Test: Evaluating Explanation Regularization Methods for Language Models

By explaining how humans would solve a given task, human rationales can provide a strong learning signal for neural language models (LMs). Explanation regularization (ER) aims to improve LM generalization by pushing the LM's machine rationales (Which input tokens did the LM focus on?) to align with human rationales (Which input tokens would humans focus on?). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER's effect on OOD generalization has been underexplored. In this paper, we introduce ER-Test, a framework for evaluating ER models' OOD generalization along three dimensions: unseen dataset tests, contrast set tests, and functional tests. Using ER-Test, we extensively analyze how ER models' OOD generalization varies with different ER design choices. Across two tasks and six datasets, ER-Test shows that ER has little impact on ID performance but can yield large OOD performance gains. Also, we find that ER can improve OOD performance even with limited rationale supervision. ER-Test's results help demonstrate ER's utility and establish best practices for using ER effectively.

Though prior works primarily evaluate ER models' in-distribution (ID) generalization, the results are mixed, and it is unclear when ER is actually helpful. Furthermore, out-of-distribution (OOD) generalization is often more crucial in real-world settings (Chrysostomou and Aletras, 2022; Ruder, 2021), yet ER's impact on OOD generalization has been underexplored (Ross et al., 2017; Kennedy et al., 2020). In particular, due to the lack of unified comparison across works using ER, little is understood about how OOD performance is affected by major design choices in building an ER pipeline, such as the rationale alignment criterion (i.e., loss function), human rationale type (instance-level vs. task-level), number and choice of rationale-annotated instances, and time budget for rationale annotation. In light of this, we propose ER-TEST (Fig. 2; code available at https://github.com/INK-USC/er-test), a framework for evaluating ER methods' OOD generalization via: (1) unseen dataset tests, (2) contrast set tests, and (3) functional tests. For (1), ER-TEST tests ER models' performance on datasets beyond their training distribution. For (2), ER-TEST tests ER models' performance on real-world data instances that are semantically perturbed. For (3), ER-TEST tests ER models' performance on synthetic data instances created to capture specific linguistic capabilities.
Using ER-TEST, we study four questions: (A) Which rationale alignment criteria are most effective? (B) Is ER effective with task-level human rationales? (C) How is ER affected by the number/choice of rationale-annotated instances? (D) How does ER performance vary with the rationale annotation time budget? For two text classification tasks, we show that ER has little impact on ID performance but yields large gains on OOD performance, with the best ER criteria being task-dependent (Sec. 5.2). Furthermore, ER can improve OOD performance even with distantly-supervised (Sec. 5.3) or few (Sec. 5.4) human rationales. Finally, we find that rationale annotation yields more improvements than label annotation, specifically under a limited annotation time budget (Sec. 5.5). ER-TEST's results further demonstrate ER's utility and establish best practices for using ER effectively.

Explanation Regularization (ER)
Given an NLM for an NLP task, the goal of ER is to improve NLM generalization on the task by pushing the NLM's (extractive) machine rationales (Which input tokens did the NLM focus on?) to align with human rationales (Which input tokens would humans focus on?). The hope is that this inductive bias encourages the NLM to solve the task in a manner that follows humans' reasoning process.
Let $F$ be an NLM for $M$-class text classification. $F$ usually has a BERT-style architecture (Devlin et al., 2018), consisting of a Transformer encoder (Vaswani et al., 2017) followed by a linear layer with softmax classifier. Let $x_i = [x_i^t]_{t=1}^{n}$ be the $n$-token input sequence (e.g., a sentence) for task instance $i$. Let $y_i$ denote $F$'s predicted class for $x_i$. Given $F$, $x_i$, and $y_i$, the goal of rationale extraction is to output machine rationale $r_i = [r_i^t]_{t=1}^{n}$, such that each $0 \le r_i^t \le 1$ is an importance score indicating how strongly token $x_i^t$ influenced $F$ to predict class $y_i$ (Luo et al., 2021). Let $G$ denote a rationale extractor, such that $r_i = G(F, x_i, y_i)$.
$G$ can also be used to compute machine rationales w.r.t. other classes besides $y_i$ (e.g., target class $\dot{y}_i$). Let $\hat{r}_i$ denote the machine rationale for $x_i$ w.r.t. $\dot{y}_i$. Given $\hat{r}_i$ obtained via $G$ and $F$, many works have explored ER, in which $F$ is regularized such that $\hat{r}_i$ aligns with human rationale $\dot{r}_i$ (Zaidan and Eisner, 2008; Lin et al., 2020; Rieger et al., 2020; Ross et al., 2017). $\dot{r}_i$ can either be human-annotated for individual instances, or generated via human-annotated lexicons for a given task. Typically, $\dot{r}_i$ is a binary vector, where ones and zeros indicate positive (important) and negative (unimportant) tokens, respectively.
We formalize the ER loss as $\mathcal{L}_{\text{ER}} = \Phi(\hat{r}_i, \dot{r}_i)$, where $\Phi$ is an ER criterion measuring alignment between $\hat{r}_i$ and $\dot{r}_i$. Thus, the full learning objective is $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{ER}} \mathcal{L}_{\text{ER}}$, where $\mathcal{L}_{\text{task}}$ is the task loss (e.g., cross-entropy loss) and $\lambda_{\text{ER}} \in \mathbb{R}$ is the ER strength (i.e., loss weight) for $\mathcal{L}_{\text{ER}}$. While there are many choices for $\Phi$, it is unclear how $\Phi$ impacts training and when certain $\Phi$ should be preferred. Also, as a baseline, let $F_{\text{No-ER}}$ denote an NLM trained without ER, such that $\mathcal{L} = \mathcal{L}_{\text{task}}$.
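As a concrete illustration, here is a minimal PyTorch-style sketch of this objective, assuming a HuggingFace-style classifier and a hypothetical `extract_rationales` helper (not part of the released code) that returns target-class importance scores in $[0, 1]$:

```python
import torch
import torch.nn.functional as F_nn

def er_training_step(model, extract_rationales, batch, phi, lambda_er=1.0):
    """One ER training step: task loss plus rationale-alignment loss.

    phi: an ER criterion Phi comparing machine vs. human rationales.
    extract_rationales: hypothetical helper returning per-token importance
        scores in [0, 1] for the target class (e.g., via IxG + sigmoid).
    """
    logits = model(batch["input_ids"],
                   attention_mask=batch["attention_mask"]).logits
    task_loss = F_nn.cross_entropy(logits, batch["labels"])        # L_task

    machine_rat = extract_rationales(model, batch)                 # r_hat_i, shape (B, n)
    er_loss = phi(machine_rat, batch["human_rationales"].float())  # L_ER = Phi(r_hat, r_dot)

    return task_loss + lambda_er * er_loss                         # L = L_task + lambda_ER * L_ER
```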

Unseen Dataset Tests
First, we evaluate OOD generalization w.r.t. unseen datasets (Fig. 2A). Besides $D$, suppose we have datasets $\{D^{(1)}, D^{(2)}, \dots\}$ for the same task as $D$. Each $D^{(i)}$ has its own train/dev/test sets and its own distribution shift from $D$. After training $F$ with ER on $D_{\text{train}}$ and hyperparameter tuning on $D_{\text{dev}}$, we measure $F$'s performance on each OOD test set $D^{(i)}_{\text{test}}$. This tests ER's ability to help $F$ learn general (i.e., task-level) knowledge representations that can (zero-shot) transfer across datasets.
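In pseudocode, the protocol is simply zero-shot evaluation of the ER-trained model on every OOD test set; a minimal sketch, where loader and metric objects are placeholders:

```python
def unseen_dataset_test(model, ood_test_loaders, metric):
    """Evaluate an ER-trained model, unchanged, on each OOD test set."""
    scores = {}
    for name, loader in ood_test_loaders.items():  # e.g., {"yelp": ..., "amazon": ..., "movies": ...}
        scores[name] = metric(model, loader)       # no further fine-tuning on OOD data
    return scores
```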

Contrast Set Tests
Second, we evaluate OOD generalization w.r.t. contrast set tests (Fig. 2B). Dataset annotation artifacts (Gururangan et al., 2018) can cause NLMs to learn spurious decision rules that work on the test set but do not capture the linguistic abilities that the dataset was designed to assess. Thus, we test $F$ on contrast sets (Gardner et al., 2020), which are constructed by manually perturbing the test instances of real-world datasets to express counterfactual meanings. Contrast set tests reveal the dataset's intended decision boundaries and whether $F$ has learned undesirable dataset-specific shortcuts. Given $D^{(i)}_{\text{test}}$, we can convert $D^{(i)}_{\text{test}}$ to contrast set $C^{(i)}_{\text{test}}$ using various types of semantic perturbation, such as inversion (e.g., "big dog" → "small dog"), numerical modification (e.g., "one dog" → "three dogs"), and entity replacement (e.g., "good dog" → "good cat"). However, since contrast sets are built from real-world datasets, they provide less flexibility in testing linguistic abilities, as a given perturbation type may not apply to all instances in the dataset. Note that, unlike adversarial examples (Gao and Oates, 2019), contrast sets are not conditioned on $F$ specifically to attack $F$.

Functional Tests
Third, we evaluate OOD generalization w.r.t. functional tests (Fig. 2C). Whereas contrast sets are created by perturbing real-world datasets, functional tests evaluate $F$ on synthetic datasets, which are manually created via templates to assess specific linguistic abilities (Ribeiro et al., 2020; Li et al., 2020). While contrast set tests focus on semantic abilities, functional tests consider both semantic (e.g., perception of word/phrase sentiment, sensitivity to negation) and syntactic (e.g., robustness to typos or punctuation addition/removal) abilities. Therefore, functional tests trade off data realism for evaluation flexibility. If ER improves $F$'s functional test performance for a given ability, then ER may be a useful inductive bias for OOD generalization w.r.t. that ability. Across all tasks, ER-TEST contains four general categories of functional tests: Vocabulary, Robustness, Logic, and Entity (Ribeiro et al., 2020). See Sec. A.2.3 for more details.

ER-TEST Design Choices
An ER model consists of three key components: rationale alignment criterion, type of human rationales, and instance selection strategy.ER-TEST evaluates the design choices for each component.

Rationale Alignment Criteria
Compared to existing works, ER-TEST uses a wider range of rationale alignment criteria to evaluate ER model generalization. This provides a more comprehensive picture of ER's impact on both ID and OOD generalization, helping us understand why and when certain criteria work well. To demonstrate ER-TEST's utility, we consider six representative rationale alignment criteria (i.e., choices of $\Phi$), described below.

Mean Squared Error (MSE) is used by Liu and Avci (2019) and Kennedy et al. (2020). Huber Loss (Huber, 1992) is a hybrid of MSE and MAE, but is still unexplored for ER. Our experiments use the default $\delta = 1$ (Paszke et al., 2019).
Order Loss. Recall that the human rationale $\dot{r}_i$ labels each token as positive (one) or negative (zero). Whereas other criteria generally push positive/negative tokens' importance scores to be as high/low as possible, order loss (Huang et al., 2021) relaxes MSE to merely enforce that all positive tokens' importance scores are higher than all negative tokens' importance scores. This is especially useful if $\dot{r}_i$ is somewhat noisy (e.g., some positive-labeled tokens should not really be positive).
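A minimal sketch of four of these criteria is below; MSE, MAE, and Huber map directly onto standard PyTorch losses, while the order-loss function is our margin-based reading of Huang et al. (2021)'s idea, not their exact formulation:

```python
import torch
import torch.nn.functional as F_nn

def mse(machine, human):                # Liu and Avci (2019); Kennedy et al. (2020)
    return F_nn.mse_loss(machine, human)

def mae(machine, human):                # Rieger et al. (2020)
    return F_nn.l1_loss(machine, human)

def huber(machine, human, delta=1.0):   # Huber (1992); PyTorch default delta=1
    return F_nn.huber_loss(machine, human, delta=delta)

def order_loss(machine, human, margin=0.0):
    """Margin-based reading of order loss: every positive token's score
    should exceed every negative token's score, per instance."""
    losses = []
    for m, h in zip(machine, human):    # iterate over instances in the batch
        pos, neg = m[h == 1], m[h == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # pairwise hinge: penalize any negative score >= a positive score
        diff = neg.unsqueeze(0) - pos.unsqueeze(1) + margin
        losses.append(torch.clamp(diff, min=0).mean())
    return torch.stack(losses).mean() if losses else machine.sum() * 0.0
```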

Types of Human Rationales
Unlike prior works, ER-TEST considers both instance-level and task-level human rationales.
Instance-Level Rationales. Human rationales are often created by annotating each training instance individually (Lin et al., 2020; Camburu et al., 2018; Rajani et al., 2019). For each instance, humans are asked to mark tokens that support the gold label as positive, with the remaining tokens counted as negative. Here, each human rationale is specifically conditioned on the input and gold label for the given instance. However, instance-level rationales are expensive to obtain, given the high manual effort required per instance.
Task-Level Rationales. Some works construct distantly-supervised human rationales by applying task-level human priors across all training instances (Kennedy et al., 2020; Rieger et al., 2020; Ross et al., 2017; Liu and Avci, 2019). Given a task-level token lexicon, each instance's rationale is created by marking input tokens present in the lexicon as positive and the rest as negative, or vice versa. Here, rationales are not as fine-grained or tailored for the given dataset, but may provide a more general learning signal for solving the task.

Instance Selection Strategies
In real-world applications, it is often infeasible to annotate instance-level human rationales $\dot{r}_i$ for all training instances (Chiang and Lee, 2022; Kaushik et al., 2019). Besides task-level rationales, another approach to this issue is to annotate only a subset $S_{\text{train}} \subset D_{\text{train}}$ of training instances. Given a constant budget of $|S_{\text{train}}| = \frac{k}{100}|D_{\text{train}}|$ instances, where $0 < k < 100$, our goal is to select $S_{\text{train}}$ such that ER with $S_{\text{train}}$ maximizes $F$'s task performance. While instance selection via active learning is well-studied for general classification (Schröder and Niekler, 2020), it is underexplored for ER. We consider the following selection strategies.
Lowest Confidence (LC) selects the $|S_{\text{train}}|$ instances for which $F_{\text{No-ER}}$ yields the lowest target class confidence probability $F_{\text{No-ER}}(\dot{y}_i | x_i)$.
Highest Confidence (HC) selects the $|S_{\text{train}}|$ instances for which $F_{\text{No-ER}}$ yields the highest target class confidence probability $F_{\text{No-ER}}(\dot{y}_i | x_i)$. This is the opposite of LC.
Lowest Importance Scores (LIS). Given machine rationale $\hat{r}_i$ for $F_{\text{No-ER}}$ and $0 < k' < 100$, let $\hat{r}_i^{(k')}$ denote the vector of the top-$k'$% highest importance scores in $\hat{r}_i$. With $\bar{r}_S = (1/|\hat{r}_i^{(k')}|)\sum_t [\hat{r}_i^{(k')}]_t$ as the mean score in $\hat{r}_i^{(k')}$, LIS selects the $|S_{\text{train}}|$ instances for which $\bar{r}_S$ is lowest. This is similar to selecting instances with the highest $\hat{r}_i$ entropy.
Highest Importance Scores (HIS). Given $\bar{r}_S$, HIS selects the $|S_{\text{train}}|$ instances for which $\bar{r}_S$ is highest. This is the opposite of LIS.
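A sketch of the four strategies, assuming we have already computed the No-ER model's target-class confidences and per-token importance scores for every training instance (all variable names are ours):

```python
import numpy as np

def select_instances(confidences, importance_scores, k_pct, strategy, top_kp=10):
    """Pick the k% of training instances to annotate with rationales.

    confidences: (N,) No-ER target-class probabilities F_No-ER(y_dot | x).
    importance_scores: list of N arrays of per-token scores r_hat_i.
    """
    n_select = int(len(confidences) * k_pct / 100)
    if strategy in ("LC", "HC"):
        order = np.argsort(confidences)  # ascending confidence
        idx = order[:n_select] if strategy == "LC" else order[-n_select:]
    elif strategy in ("LIS", "HIS"):
        # mean of the top-k'% highest importance scores per instance
        r_bar = np.array([
            np.sort(r)[-max(1, int(len(r) * top_kp / 100)):].mean()
            for r in importance_scores
        ])
        order = np.argsort(r_bar)
        idx = order[:n_select] if strategy == "LIS" else order[-n_select:]
    else:  # Random baseline
        idx = np.random.choice(len(confidences), n_select, replace=False)
    return idx
```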

Experiments
Now that ER-TEST lays a foundation for evaluation and design choices, we conduct a systematic study of the ER pipeline through four research questions (Fig. 3).
First, we compare different rationale alignment criteria and analyze which performs better for which task (RQ1: Sec. 5.2). Second, we compare ER pipelines with different types of available human rationales: dense, instance-level rationales vs. distantly-supervised, task-level rationales (RQ2: Sec. 5.3). Third, we look into strategies for selecting the instances on which ER should be applied, given resource constraints on the number of rationale-annotated samples (RQ3: Sec. 5.4). Lastly, we investigate whether ER is worth doing given a time budget for obtaining rationale-annotated instances, comparing it to methods without ER that use the same time budget to obtain more labelled data (RQ4: Sec. 5.5).

Tasks and Datasets
ER-TEST uses a diverse set of text classification tasks. We mainly focus on sentiment analysis and natural language inference (NLI), but also consider named entity recognition (NER) and hate speech detection in the Appendix (Secs. A.4 and A.5). First, for sentiment analysis, we use SST (short movie reviews) (Socher et al., 2013; Carton et al., 2020) as the ID dataset.

RQ1: Which rationale alignment criteria are most effective?
Here, we study the effectiveness of different rationale alignment criteria specified in ER-TEST for two tasks -sentiment analysis and NLI.
Setup. The rationale alignment criteria described in Section 4.1 are used to align instance-level rationales on the train set (ID datasets SST and e-SNLI for the sentiment analysis and NLI tasks, respectively).
For the NLM architecture, we use BigBird-Base (Zaheer et al., 2020) in order to handle input sequences of up to 4096 tokens. For all results, we report the mean over three seeds, as well as the standard deviation. We use a learning rate of 2e-5 and an effective batch size of 32. Further implementation details are in Appendix A.3.3.
Results. Tables 1 and 2 and Figure 4 summarize the results for this research question.

ID Generalization. We observe (Table 1) that ID task performance alone cannot distinguish between rationale alignment criteria, as all of them yield about the same task performance as the None baseline (for both SST and e-SNLI).

Unseen Datasets. For sentiment analysis, MAE yields significant gains over all other rationale alignment criteria.

Contrast Set Tests. We observe (Table 2) the drop in performance (∆) for sentiment analysis and NLI when using a contrast set designed for the given dataset. MAE leads to the smallest drop in performance, and all methods apart from Order yield smaller drops than None. All of them also achieve higher performance on both the original and contrast sets. For sentiment analysis, Order has the highest variance, and for NLI it has the largest performance drop. We believe some of this can be attributed to the soft ranking imposed by Order, which may be indifferent to the minor label-changing edits found in contrast sets.

Functional Tests. Figure 4 shows failure rates on functional tests. We observe that, apart from the entity-based tests, rationale alignment criteria generally have lower failure rates than None. Generally, all methods perform well on robustness-based tests, with Order loss having the lowest failure rate. Notably, Order loss improves significantly over None on vocabulary-based tests, even though all methods are exposed to the same training set instances. We hypothesize that the inductive biases introduced by ER alleviate the shortcuts learnt by None, as also reflected in the rationale alignment criteria's lower overall failure rates.

RQ2: Is ER effective with distantly-supervised human rationales?
As described in Section 4.2, instance-level rationales are expensive to obtain. In this research question, we use ER-TEST to compare and contrast instance-level and task-level human rationales on the sentiment analysis task.
Results. We show the ID and OOD performance of instance-level and task-level rationales in Table 3.

RQ3: How is ER affected by the number/choice of rationale-annotated instances?

Results. Certain instance selection criteria (like LC) perform significantly worse as the instance budget is increased. Zooming in on LIS, we compare it to random sampling and No-ER settings in Figure 5. We observe that, for low-resource cases (5-50%), LIS leads to OOD performance similar to $k = 100\%$ (using all samples for ER), and is always greater than Random/No-ER. This also shows that carefully selecting a small subset of samples for rationale annotation can yield the same benefits as annotating all samples, at a lower annotation cost and with significant improvements over Random/No-ER.
RQ4: How is ER affected by the time taken to annotate human rationales?
So far, our research questions assumed that rationale-annotated instances are easy to obtain. However, obtaining rationales for ER is not only tedious but also time-consuming. In this RQ, we benchmark the time efficiency of ER through the lens of the time taken to collect such data, compared to collecting labelled data without rationales.
Setup. Our setup comprises two steps: first, estimating the time taken to annotate one instance using Amazon Mechanical Turk (MTurk), then using these estimates to create training sets with varying time budgets. On MTurk (details in Appendix A.7), we devise three tasks: one where annotators first select a sentiment for an instance and then highlight rationale tokens that support their selected sentiment (Label + Expl), one where they highlight the rationales given a ground-truth sentiment (Only Expl), and one baseline task where they only label an instance with a sentiment (Only Label).

Results. As we observe in Figure 6, with a lower annotation time budget (≤ 5 hours), annotating rationales for existing instances in the training set (Only Expl) yields improvements over adding new instances with labels (and rationales), on all of the OOD datasets. However, its performance saturates over time. With a higher time budget, adding new instances with both labels and rationales (Label + Expl) is better than only adding labelled data without rationales (Only Label). This holds even though Label + Expl takes the most time to annotate, so fewer such instances can be added within a given time budget. In general, we observe that about 24 hours of Only Label annotation yields the same OOD performance as just 30 minutes of Only Expl annotation. This validates that ER not only provides improvements in generalization, but does so in a time- (and cost-) efficient manner.

Related Work
Explanation-Based Learning. Many methods have been proposed for explanation-based learning (Hase and Bansal, 2021; Hartmann and Sonntag, 2022), especially using human explanations (Tan, 2022). ER, which is based on machine-human rationale alignment, is a common paradigm for learning from human explanations. In ER, the human rationale can be obtained by annotating each instance individually (Zaidan and Eisner, 2008; Lin et al., 2020; Camburu et al., 2018; Rajani et al., 2019; DeYoung et al., 2019) or by applying domain-level lexicons across all instances (Rieger et al., 2020; Ross et al., 2017; Ghaeini et al., 2019; Kennedy et al., 2020; Liu and Avci, 2019). Existing choices of rationale alignment criteria include MSE (Liu and Avci, 2019; Kennedy et al., 2020; Ross et al., 2017), MAE (Rieger et al., 2020), BCE (Chan et al., 2021), order loss (Huang et al., 2021), and KL divergence (Chan et al., 2021). Beyond ER, there are other ways to learn from explanations. Lu et al. (2022) used human-in-the-loop feedback on machine rationales for data augmentation. Meanwhile, Ye and Durrett (2022) used machine rationales to calibrate black-box models and improve their performance on low-resource domains.
Evaluating ER. Existing works have primarily evaluated ER models via ID generalization (Zaidan and Eisner, 2008; Lin et al., 2020; Huang et al., 2021), which captures only one aspect of ER's impact. Meanwhile, a few works have considered auxiliary evaluations, e.g., machine-human rationale alignment (Huang et al., 2021; Ghaeini et al., 2019), task performance on unseen datasets (Ross et al., 2017; Kennedy et al., 2020), and social group fairness (Rieger et al., 2020; Liu and Avci, 2019). Carton et al. (2022) showed that maximizing machine-human rationale alignment does not always improve task performance, and that human rationales vary in their ability to provide useful information for task prediction.

Conclusion and Future Work
In this work, we study explanation regularization (ER), which aligns machine rationales with human rationales, in detail. We propose ER-TEST, which evaluates ER's OOD generalization along three pillars (unseen datasets, contrast set tests, and functional tests), and use it to investigate four research questions concerning the choice of rationale alignment criterion, the type of human rationale, and the choice of and time taken to obtain rationale-annotated instances. Although ER has only a minor impact on ID task performance, its improvements on OOD datasets are significant. Furthermore, ER works well not only with dense, instance-level human rationales, but also with distantly-supervised task-level rationales. Lastly, ER is shown to provide benefits even with a limited number of rationale-annotated instances, or within time constraints for rationale annotation.
In future work, we aim to study ER as a tool for improving human-in-the-loop (HITL) debugging of NLMs. Furthermore, ER-TEST is currently defined only for extractive rationales; human feedback on free-text machine rationales is a promising extension of ER-TEST.

Acknowledgments
This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, the DARPA MCS program under Contract No. N660011924033, NSF IIS 2048211, and gift awards from Google and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We would also like to thank all of our collaborators at the USC INK Research Lab for their constructive feedback on this work.

Limitations
Some tests defined by ER-TEST may not be applicable to all NLP tasks; for example, it is difficult to define contrast set tests for tasks like NER. Furthermore, the tests we use are taken from their respective releases; such tests are extremely tedious to design and generate for new tasks and datasets.
Current work simulates human rationales in the ER pipeline. ER is meant to align machine-generated rationales with human rationales. However, in our current work, we use human rationales that were pre-annotated as part of the datasets we use; these pre-annotated rationales stand in for live human feedback in the rationale alignment criterion. We believe this limitation can be easily addressed by collecting human-in-the-loop rationale annotations.
Current work assumes ER pipelines to be offline in nature. Fine-tuning strategies have been shown to distort the underlying data distribution (Kumar et al., 2022); therefore, once $F$ undergoes ER, its machine rationales can differ from before. Currently, ER is studied in an offline manner: once human rationales are collected, they are used to update model weights. It would be more effective to study ER applied incrementally in an online manner, thus improving rationale alignment.

Ethics Statement
Data. All the datasets that we use in our work are released publicly for usage and have been duly attributed to their original authors.
User Study. As part of the user study conducted in Section 5.5, we collected information about the time taken to annotate rationales and labels for instances. We provide the instructions given to MTurkers in Appendix A.7, along with screenshots of the UI displayed to them. Further details about the task setup and results are provided in Section 5.5. Each task is set up in a manner that ensures that annotators receive compensation above minimum wage.

A Appendix
A.1 Section 2 Appendix

$G$ first computes raw importance scores $s_i \in \mathbb{R}^n$, then normalizes $s_i$ as probabilities $r_i$ using the sigmoid function.
Broadly, heuristic $G$'s can be gradient-based, assigning importance scores based on gradients in $F$ (Sundararajan et al., 2017; Sanyal and Ren, 2021; Shrikumar et al., 2017); sampling-based, assigning importance scores based on the neighbors/context of a given token (Zeiler and Fergus, 2013; Jin et al., 2019); or attention-based, using attention scores or a function of them to assign importance scores (Ding and Koehn, 2021).
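For example, the gradient-based IxG (input × gradient) extractor used in our experiments can be sketched as follows (a simplification of the actual implementation; for ER training, the gradient would be taken with create_graph=True so the ER loss stays differentiable):

```python
import torch

def ixg_rationales(model, input_ids, attention_mask, target_class, gamma_er=1.0):
    """Input x Gradient: raw score s_i^t = embedding . d(logit_y)/d(embedding),
    summed over the hidden dimension, then sigmoid-normalized to [0, 1]."""
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=attention_mask).logits
    # sum of per-example target logits; per-example gradients stay independent
    target_logit = logits.gather(1, target_class.unsqueeze(1)).sum()
    (grads,) = torch.autograd.grad(target_logit, embeds)
    raw_scores = (embeds * grads).sum(dim=-1)    # s_i, shape (B, n)
    return torch.sigmoid(gamma_er * raw_scores)  # r_i, scaled by gamma_ER
```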

A.2.1 ID Generalization
While ER-TEST's main focus is on evaluating OOD generalization, ER-TEST also considers ID generalization as a baseline evaluation. Let $D = \{(x_i, \dot{y}_i)\}_{i=1}^{N}$ be an $M$-class text classification dataset, where $\mathcal{X} = \{x_i\}_{i=1}^{N}$ are the text inputs, $\mathcal{Y} = \{\dot{y}_i\}_{i=1}^{N}$ are the target classes, and $N$ is the number of instances $(x_i, \dot{y}_i)$ in $D$. We call $D$ the ID dataset. Assume $D$ can be partitioned into train set $D_{\text{train}}$, dev set $D_{\text{dev}}$, and test set $D_{\text{test}}$, where $D_{\text{test}}$ is an ID test set for $D$. After using ER to train $F$ on $D_{\text{train}}$, we measure $F$'s task performance on the ID test set $D_{\text{test}}$. Note that this is a standard protocol used by existing works to evaluate ER models (Zaidan and Eisner, 2008; Rieger et al., 2020; Liu and Avci, 2019; Ross et al., 2017; Huang et al., 2021; Ghaeini et al., 2019; Kennedy et al., 2020).

A.2.2 Contrast Sets
Given $D^{(i)}_{\text{test}}(j)$ (the $j$th instance of an OOD test set $D^{(i)}_{\text{test}}$), a perturbation $p$ is applied to that instance, where $p$ denotes the kind of perturbation taking effect; it often changes the target label for that instance. For example, $p$ can be semantic (e.g., changing "tall" to "short"), numerical (e.g., changing "one dog" to "three dogs"), or entity-based (e.g., changing "dogs" to "cats"). Each perturbation type is specific to the dataset it is being created for, so that instance labels are changed in a meaningful manner. The resulting set of instances $\{C^{(i)}_{\text{test}}(j, p)\}$ for all $j, p$ is termed a contrast set for that dataset. Based on the way they are created, contrast sets are a property of the dataset and are not created to explicitly challenge $F$ (unlike adversarial examples (Gao and Oates, 2019)).

A.2.3 Functional Tests
Vocabulary Tests. Vocabulary tests evaluate $F$'s capability to handle changes in the vocabulary of the text, and are particularly diverse w.r.t. the parts of speech they cater to. For example, certain vocabulary tests evaluate the relationship (taxonomy) between different nouns in a sentence, whereas others swap the modifiers or verbs in a sentence in a meaningful, task-dependent manner, to capture $F$'s targeted performance on such changes (Ribeiro et al., 2020).
Robustness Tests. Robustness tests evaluate $F$'s behavior under character-level edits to words in a sentence, keeping the rest of the context the same so as not to change the overall prediction. They include testing against typos and word contractions, as well as the addition of tokens irrelevant to the downstream task (like URLs or gibberish such as Twitter handles) (Jones et al., 2020; Wang et al., 2020).

Logic Tests. Testing $F$'s reasoning about logical changes in a sentence is also important for evaluating its reliance on shortcut patterns. These tests perturb sentences in a logical manner (by adding or removing negations, or purposefully inducing contradictions) that also changes the target label accordingly (Talman and Chatzikyriakidis, 2018; McCoy et al., 2019).

Entity Tests. For certain tasks, named entities like numbers, locations, and proper nouns are not relevant for predicting the target label, and are often a source of gender or demographic biases (Mishra et al.; Mehrabi et al., 2020). Entity tests measure $F$'s sensitivity to changes in named entities such that the overall context as well as the task label remains the same (Ribeiro et al., 2020).

A.3.1 Tasks and Datasets
To evaluate ER models, ER-TEST considers a diverse set of sequence and token classification tasks. For each task, ER-TEST provides one ID dataset (annotated with human rationales) and multiple OOD datasets. Compared to prior works, ER-TEST's task/dataset diversity enables more extensive analysis of ER model generalization.
First, we have sentiment analysis, using SST (movie reviews) (Socher et al., 2013; Carton et al., 2020) as the ID dataset. For OOD datasets, we use Yelp (restaurant reviews) (Zhang et al., 2015), Amazon (product reviews) (McAuley and Leskovec, 2013), and Movies (movie reviews) (Zaidan and Eisner, 2008; DeYoung et al., 2019). Movies' inputs are much longer than the other three datasets'. For contrast set tests, we use an OOD contrast set for sentiment analysis released by Gardner et al. (2020), created for the Movies dataset. Furthermore, for functional tests, we use an OOD test suite (flight reviews) from CheckList (Ribeiro et al., 2020), which contains both template-based instances to test linguistic capabilities and real-world data (tweets).
Second, we have natural language inference (NLI), using e-SNLI (Camburu et al., 2018; DeYoung et al., 2019) as the ID dataset. For the OOD dataset, we use MNLI (Williams et al., 2017). e-SNLI contains only image captions, while MNLI contains both written and spoken text, covering various topics, styles, and formality levels. For NLI, we also use an OOD contrast set created for the MNLI dataset (Li et al., 2020). Functional tests for NLI are generated from the AllenNLP test suite (Gardner et al., 2017) for textual entailment.

A.3.2 Intrinsic Evaluation of ER
ER is sensitive to certain hyperparameters; without care, it fails to yield meaningful training curves or to actually attain alignment between machine and human rationales. Due to the large set of tunable hyperparameters, running all configurations of ER is not feasible. Therefore, we intrinsically evaluate hyperparameter configurations by assessing the loss curves (which model alignment between machine and human rationales) w.r.t. different hyperparameter values. We observe that the acceptable band of learning rates for ER is very narrow, and we use 2e-5 in all of our experiments. Furthermore, we observe that setting $\lambda_{\text{ER}} = 1$ and $\gamma_{\text{ER}} = 100$ yields the largest drop in the loss curves during training, so we use these hyperparameters for the rest of our experiments. We detail these experiments in Appendix A.3.3.

A.3.3 Intrinsic Evaluation: evaluating ER's sensitivity to hyperparameters
When using ER to train $F$, it is important to assess whether ER exhibits the expected training behavior, orthogonally to task performance. If ER improves task performance, this kind of analysis can help us better understand ER's effectiveness. Conversely, if ER does not improve task performance, such analysis can help us identify the problem. Let $\gamma_{\text{ER}} > 0$ be the rationale scaling factor, used to scale $\hat{s}_i$ prior to sigmoid normalization. If the magnitudes of the $\hat{s}_i$ scores are low, then the $\hat{r}_i$ scores will be close to 0.5 (i.e., low confidence). However, scaling $\hat{s}_i$ by $\gamma_{\text{ER}} > 1$ increases the magnitude of $\hat{s}_i$, yielding $\hat{r}_i$ scores closer to 0 or 1 (i.e., higher confidence).
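A small numerical illustration of this effect:

```python
import torch

s = torch.tensor([-0.02, 0.01, 0.03])  # small-magnitude raw scores s_hat
print(torch.sigmoid(s))                # ~[0.495, 0.502, 0.507]: all near 0.5
print(torch.sigmoid(100 * s))          # ~[0.119, 0.731, 0.953]: confident scores
```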
Motivated by this, ER-TEST's intrinsic evaluation is based on machine-human rationale alignment, captured by the ER loss $\mathcal{L}_{\text{ER}} = \Phi(\hat{r}_i, \dot{r}_i)$. When using ER, we should generally expect the ER loss to decrease as $F$ is trained. In practice, this may not always be the case, even when ER leads to slightly higher task performance (which is likely a mirage caused by lucky random seeds)! That is, by definition, non-decreasing ER loss signals ineffective ER usage, since the machine rationales are not becoming more similar to the human rationales. This can stem from a number of issues: e.g., poor choice of ER criterion $\Phi$, improper ER strength $\lambda_{\text{ER}}$, improper rationale scaling factor $\gamma_{\text{ER}}$, noisy human rationales $\dot{r}_i$, or insufficient $F$ capacity. Thus, we measure machine-human rationale alignment as the first step in diagnosing such issues (Huang et al., 2021; Ghaeini et al., 2019).
For intrinsic evaluation, we use ER strength $\lambda_{\text{ER}} = 1$, rationale scaling factor $\gamma_{\text{ER}} = 1$, and learning rate $\alpha = 2e{-}5$, unless otherwise specified. As a proof of concept, we focus on SST here, but plan to add other datasets in future work.

A.3.4 Misc. Details
All models are trained on GeForce GTX 1080 Ti and Quadro RTX 6000 GPUs. All implementations use the HuggingFace API (Wolf et al., 2019).

For various ER strengths $\lambda_{\text{ER}} \in [0.5, 1, 10, 100, 300]$ on SST using MAE, we see that $\lambda_{\text{ER}} = 1$ yields ER loss curves with the greatest decrease (Table 5), signaling good ER optimization. Similarly, sweeping the rationale scaling factor on SST, we see that among the four $\gamma_{\text{ER}}$ values, $\gamma_{\text{ER}} = 100$ yields ER loss curves with the greatest decrease (Table 6). Meanwhile, although ER works use $\gamma_{\text{ER}} = 1$ by default, we see that $\gamma_{\text{ER}} = 1$ yields nearly flat ER loss curves for all five $\Phi$ choices. This suggests poor ER optimization. Based on these results, we fix $\gamma_{\text{ER}} = 100$ for all experiments (Sec. 5), thus greatly reducing the hyperparameter search space.
A.3.8 ER performance with different hyperparameters

ER Strength vs. Task Performance. To measure ER's impact on task performance, we plot $F$'s task performance as a function of ER strength $\lambda_{\text{ER}}$, conducted for both ID and OOD test sets. For each sentiment analysis dataset, Fig. 8 shows task performance for ER strengths $\lambda_{\text{ER}} \in [0, 0.5, 1, 10, 100, 300]$, using MAE. Note that $\lambda_{\text{ER}} = 0$ is equivalent to training the NLM without ER (i.e., None in Table 1). For the ID dataset (SST), we see that all ER strengths yield very similar task performance, suggesting that ER has little effect on ID task performance. However, for the OOD datasets (Amazon, Yelp, Movies), task performance generally increases as $\lambda_{\text{ER}}$ increases, showing ER's positive impact on NLM generalization. Overall, based on OOD task performance, we find that $\lambda_{\text{ER}} \in \{1, 100\}$ are the best ER strengths. This aligns with the results of Sec. A.3.5.

ER Loss vs. Task Performance. To measure ER's impact on task performance, we plot $F$'s task performance as a function of ER loss $\mathcal{L}_{\text{ER}}$, conducted for both ID and OOD test sets. Fig. 9 displays the SST results for ID task performance (accuracy) vs. ER loss. For a given ER criterion, each point in the corresponding scatter plot represents the checkpoint at some training epoch of the ER-trained model, evaluated on either the dev set or the test set (yielding two point sets). Fitting each point set with linear regression, we find an inverse relationship between task performance and ER loss. In other words, higher machine-human rationale alignment (i.e., lower ER loss) corresponds to higher task performance, which validates the usage of ER to improve generalization. Table 7 displays the slopes and $R^2$ scores of the lines in Fig. 9. The slope indicates the strength of the relationship between machine-human rationale alignment and task performance (lower is better), while the $R^2$ score indicates how accurately each line fits its corresponding data points. Among the five ER criteria, across dev and test, we find that MAE has the lowest slopes and highest $R^2$ scores overall, suggesting that using ER with MAE is most effective.

Change in Target Class Confidence. We consider ER with the MAE criterion, trained/evaluated on SST (via dev ID task performance). Fig. 12 visualizes how ER changes each dev instance's target class confidence, color-coding each point w.r.t. how ER changes the model's predicted class for that point. Among instances for which $F_{\text{No-ER}}$'s target class confidence is low, there is a higher percentage of instances that are predicted incorrectly without ER and correctly with ER (i.e., incor→cor). This suggests that, for $F_{\text{No-ER}}$, instances with low target class confidence are more likely to benefit from ER (Table 10). Also, based on a T-test, target class confidence scores are significantly higher ($p < 0.005$) with ER than without.

ER Opportunity Cost

An ER-trained NLM $F_{\text{task,ER}}$ and a non-ER-trained NLM $F_{\text{task,No-ER}}$ are likely to yield different outputs given the same inputs. Let $D^+_{\text{ER}} \subseteq D$ and $D^+_{\text{No-ER}} \subseteq D$ denote the sets of instances predicted correctly by $F_{\text{task,ER}}$ and $F_{\text{task,No-ER}}$, respectively. Ideally, we would have $D^+_{\text{No-ER}} \subseteq D^+_{\text{ER}}$. This would mean there is no opportunity cost in using ER, as ER increases the number of correct instances without turning any previously-correct instance incorrect. However, this may not necessarily be the case, so we measure ER's opportunity cost as follows. Let $n^+_{\text{ER}} = |D^+_{\text{ER}} \setminus (D^+_{\text{ER}} \cap D^+_{\text{No-ER}})|$ be the number of instances predicted correctly by $F_{\text{task,ER}}$ but not by $F_{\text{task,No-ER}}$, and let $n^+_{\text{No-ER}} = |D^+_{\text{No-ER}} \setminus (D^+_{\text{No-ER}} \cap D^+_{\text{ER}})|$ be the number of instances predicted correctly by $F_{\text{task,No-ER}}$ but not by $F_{\text{task,ER}}$. Then, the opportunity cost of using ER is defined as: $o_{\text{ER}} = \frac{n^+_{\text{No-ER}} - n^+_{\text{ER}}}{|D|}$ (3). In practice, instead of defining $o_{\text{ER}}$ for all of $D$, we only consider the ID and OOD test sets.

Table 9 displays the opportunity cost results for sentiment analysis. Generally, the opportunity cost results mirror the task performance results in Table 1, such that the methods with the highest task performance tend to have the lowest opportunity cost. However, using opportunity cost, the variance is very high for OOD datasets, making it difficult to compare methods. In future work, we plan to modify the opportunity cost metrics to better accommodate OOD settings.
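A minimal sketch of computing Eq. 3 from sets of correctly-predicted instance ids (function and variable names are ours):

```python
def opportunity_cost(correct_er: set, correct_no_er: set, num_instances: int) -> float:
    """o_ER = (n+_No-ER - n+_ER) / |D|, computed from the sets of
    instance ids predicted correctly with and without ER."""
    n_er_only = len(correct_er - correct_no_er)      # correct only with ER
    n_no_er_only = len(correct_no_er - correct_er)   # correct only without ER
    return (n_no_er_only - n_er_only) / num_instances
```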

A.3.9 Efficient hyperparameter tuning with ER-TEST
In intrinsic evaluation (Sec. A.3.2), we used ER loss curves as priors for selecting three key ER hyperparameters (i.e., ER strength $\lambda_{\text{ER}}$, rationale scaling factor $\gamma_{\text{ER}}$, and learning rate $\alpha$). In Sec. 5, we assumed a tuning budget that allows only one value for each of $\lambda_{\text{ER}}$, $\gamma_{\text{ER}}$, and $\alpha$. By not tuning these hyperparameters, we greatly reduced our hyperparameter search space. Since ER has little effect on ID task performance, tuning based on ID task performance is unlikely to have helped anyway. ER works better on OOD data, but it also does not make sense to tune based on OOD task performance (otherwise, it would not be OOD). Though the ER hyperparameters chosen via intrinsic evaluation generally improved OOD task performance, we seek to verify their effectiveness compared to other possible hyperparameter values.
In Table 8, we report sentiment analysis OOD (Amazon, Yelp, Movies) task performance while varying each of the three hyperparameters. We include a Mean column, which averages the Amazon/Yelp/Movies columns. Our hyperparameters chosen via ER loss curves are highlighted in blue. For $\lambda_{\text{ER}}$, 1 (ours) and 100 yield very similar Mean results, while considerably beating the other three values. For $\gamma_{\text{ER}}$, we see the same trend for 100 (ours) and 10. For $\alpha$, 2e-5 (ours) vastly outperforms other values in all columns. These results validate the utility of ER-TEST's intrinsic evaluation for low-resource ER hyperparameter tuning.

A.4 RQ1 NER Results
We also consider the named entity recognition (NER) task, using CoNLL-2003 (Sang and De Meulder, 2003; Lin et al., 2020) as the ID dataset. For the OOD dataset, we use OntoNotes v5.0 (Pradhan et al., 2013). CoNLL-2003 contains only Reuters news stories, while OntoNotes v5.0 contains text from newswires, magazines, telephone conversations, websites, and other sources. In Table 15, we display the ID and OOD results for NER. In ID, we see more variance in task performance among ER criteria, although the variance is still quite small among the best methods (MSE, MAE, Huber). Here, MAE yields the highest task performance, while BCE yields the lowest by far. In OOD, MAE still performs best, while MSE and Huber are competitive.

Functional Tests
We provide details for the different functional tests listed in Section 3.3. We break down each subcategory of functional tests and show the performance of different ER criteria on these individual tests. For functional tests on the sentiment analysis task, refer to Table 11. NLI functional tests are listed in Table 14.

A.5.1 Lexicon-matching
Let $L_D$ be a lexicon list curated by human annotators, specific to a given dataset $D$. Let $l(\cdot)$ be an indicator function that searches for the given lexicon list in all the tokens of an instance and returns a binary vector of the same size as the instance, with 1s in places with lexicon matches (0 otherwise). Therefore, we can obtain distantly-supervised human rationales $\dot{r}_i = l(L_D, x_i)$ and apply the rationale alignment criteria described in Section 4.1.
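A minimal sketch of this matching at the single-token level (the actual setup also matches bi-/tri-grams, as described below):

```python
def lexicon_rationale(tokens, lexicon):
    """Distantly-supervised rationale: 1 where a token matches the
    task-level lexicon L_D, 0 otherwise (r_dot_i = l(L_D, x_i))."""
    lexicon = {w.lower() for w in lexicon}
    return [1 if tok.lower() in lexicon else 0 for tok in tokens]

# e.g., lexicon_rationale("Nolan is a great director".split(), {"great", "terrible"})
# -> [0, 0, 0, 1, 0]
```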
Each lexicon entry is matched to n-grams (uni-/bi-/tri-grams), which leads to 93% of the train set instances being matched. Additionally, since we combined two lexicons as resources, some words appear as both positive and negative; we keep such overlapping lexicon entries with different sentiment polarities when matching tokens. For equal comparison, we use instance-based rationales on the same train subset. We also run contrast set tests and functional tests for both the lexicon-based and instance-based methods. The results are shown in Table 12 and Table 13.

A.5.2 Hate Speech Detection Tests

Task-Level Rationales. For example, Kennedy et al. (2020) used a "blacklist" lexicon to distantly supervise human rationales for the hate speech detection task. In the past, hate speech detection models were largely oversensitive to certain group identifier words (e.g., "black", "Muslim", "gay"), almost always predicting hate speech for text containing these words. To address this, they first manually annotated a lexicon of group identifiers that should be ignored for hate speech detection. Then, for all training instances, they automatically marked only tokens belonging to the lexicon as negative (and the rest as positive). By using these human rationales for ER, they trained the NLM to be less biased w.r.t. these group identifiers. For the purpose of our study, we use the lexicons from Jin et al. (2021) to generate distantly-supervised rationales for the Stormfront (Stf) dataset (de Gibert et al., 2018). Each instance in the Stf dataset is matched to one or more lexicon entries by simple character-level matching, and the rationales are generated as described above. We train $F$ on the Stf dataset and report all accuracies in Table 16. As observed in Section 5.2, ER does not lead to a significant improvement in performance on the Stf test set. However, it is important to note that "blacklisting" group identifier lexicons does not lead to a drop in ID performance either. The benefits of "blacklisting" are instead observed out-of-distribution: HatEval instances are tweets, and GHC instances are taken from the Gab forum. Table 16 shows that while the improvements on HatEval are not significant, there are significant accuracy improvements on the GHC test set, due to the Order ER criterion.
Fairness Tests. In addition to generic performance metrics like accuracy, we also measure group identifier bias (against the groups listed in the group identifier lexicons) by evaluating the False Positive Rate Difference (FPRD), as in Jin et al. (2021). FPRD is computed as $\sum_z |\text{FPR}_z - \text{FPR}_{\text{overall}}|$, where $\text{FPR}_z$ is the false positive rate over all test instances mentioning group identifier $z$, and $\text{FPR}_{\text{overall}}$ is the false positive rate over all test instances. Essentially, FPRD evaluates whether $F$ is more biased against a given group identifier $z$ than against all groups overall. A lower FPRD value indicates that $F$ is less biased against the listed group identifiers.
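A sketch of the FPRD computation as defined above (assuming binary labels with 1 = hate speech; the per-group mention masks are assumed precomputed from the group identifier lexicon):

```python
import numpy as np

def fpr(preds, labels):
    """False positive rate: P(pred = 1 | label = 0)."""
    neg = labels == 0
    return (preds[neg] == 1).mean() if neg.any() else 0.0

def fprd(preds, labels, mentions):
    """FPRD = sum_z |FPR_z - FPR_overall|, where mentions[z] is a boolean
    mask over test instances mentioning group identifier z."""
    overall = fpr(preds, labels)
    return sum(abs(fpr(preds[m], labels[m]) - overall) for m in mentions.values())
```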
Table 16 lists the FPRD values of all ER criteria on the ID and OOD datasets. While all other criteria suffer from higher bias than None, we observe that the Order criterion consistently leads to the least bias, both in-distribution and out-of-distribution. Furthermore, its reduction in bias is significant when compared to None. Interestingly, the Order ER criterion was initially conceived for distantly-supervised rationales (Huang et al., 2021), and the authors of the original paper also demonstrated experiments with rationales generated from lexicons, where the Order criterion leads to improvements. Our observations are in line with theirs, and we additionally show its benefit in reducing bias in $F$.

A.6.1 Details for Instance Prioritisation Experiments
In this section, we provide further implementation details for the confidence-based instance prioritisation experiments described in Section 4.3. Given the 3-seed runs for the None model in Table 1, we extract the confidence scores for the given metric (LC/HC/LIS/HIS), then average these confidence/importance scores across the 3 seed runs to obtain a single score for every instance. This is done for training set instances only. We then rank instances by the aggregated confidence metric and select the top k% of samples from this ranking. For experiments with random-sampling-based prioritisation, we generate 3 random subsets selected uniformly.
While training in this setting, we ensure that within each batch, a certain set of instances (one third, to be specific) has available rationales. For these instances, we compute the ER loss $\mathcal{L}_{\text{ER}}$, whereas for the rest of the instances in the batch, we compute the task loss $\mathcal{L}_{\text{task}}$. All prioritisation settings are trained with 3 different model seeds, and the aggregated results for the ID and OOD datasets are shown in Table 4.

A.7.2 Task Setup
Each task is timed and has the same set of 200 instances to be annotated. Each instance is annotated by three annotators.
Using the annotations we receive, we aggregate the time taken across all annotators and instances to obtain a rough estimate of the time taken to annotate one instance for a given task.

Figure 1 :
Figure 1: Explanation Regularization (ER). ER improves model generalization on NLP tasks by pushing the model's machine rationales (Which input tokens did the model focus on?) to align with human rationales (Which input tokens would humans focus on?) (Sec. 2).
Figure 2: ER-TEST evaluates ER models' OOD generalization via: (1) unseen dataset tests; (2) contrast set tests; and (3) functional tests. Let $D$ be an $M$-class text classification dataset, which we call the ID dataset. Assume $D$ can be partitioned into training set $D_{\text{train}}$, development set $D_{\text{dev}}$, and test set $D_{\text{test}}$, where $D_{\text{test}}$ is the ID test set for $D$. After training $F$ on $D_{\text{train}}$ with ER, we measure $F$'s ID generalization via task performance on $D_{\text{test}}$ and $F$'s OOD generalization via (1)-(3).

Figure 5 :
Figure 5: RQ3: How is ER affected by the number/choice of training instances with human rationales?: Task Performance (Accuracy) vs. % of rationale-annotated data for different sample selection criteria on four sentiment analysis datasets.

Figure 6 :
Figure 6: RQ4: How is ER affected by the time taken to annotate human rationales?: Task performance (accuracy) vs. time budget for rationale annotation, for each annotation strategy, on each of the four sentiment analysis datasets. There are 1000 instances in the baseline training set, and 1 hr of annotation corresponds to 42, 98, and 77 instances for the Label + Expl, Only Expl, and Only Label annotation tasks, respectively. Annotation is done on the ID dataset (SST) only.

The annotations showed high inter-annotator agreement. For Only Label and Label+Expl, the Fleiss' kappa scores were 0.74 and 0.70. For Only Expl and Label+Expl, the rationale overlap rates (Zaidan and Eisner, 2008) were 0.78 and 0.66. We replicated this experiment in a small-scale study with nine CS students and observed similar trends. Using these time estimates, we devise three experiment settings. Given a baseline labelled training set $D_{\text{base}}$ of 1000 instances and a time budget $T$, we can: 1) add human rationales for a subset $S^T_{\text{expl}}$ of $D_{\text{base}}$; 2) add new instances $D^T_{\text{label}}$ with only labels to $D_{\text{base}}$; or 3) add new instances $D^T_{\text{label+expl}}$ with labels and rationales to $D_{\text{base}}$. Note that the number of new instances added in each setting depends on the time taken to annotate the Only Expl, Only Label, and Label + Expl tasks, respectively.

Figure 9 :
Figure 9: Task Performance vs. ER Loss. Here we use IxG as the rationale extractor.

Figure 11 :
Figure 11: ER Loss Curves (Learning Rate). Here we use IxG as the rationale extractor.

Figure 12 :
Figure 12: Change in Target Class Confidence

Figure 13 :
Figure 13: Functional Tests' Failure Rates (lower is better): We plot the failure rates of the four functional tests (vocab., robust., logic, entity) described in Section 3.3, as well as the overall failure rate on all tests combined (mean). All values are out of 100, but scaled for visible comparison. Here we use IxG as the rationale extractor.

Figure 14 :
Figure 14: Label + Expl: Instructions and setup for Label + Expl annotation.

Table 3 :
RQ2: Instance-level vs. Task-level rationales (Sec. 5.3).

Table 4 :
Sample Selection Methods: Settings marked in blue are significantly better than the None settings. (See Table 17 for more details.)

Table 5 :
Relative Decrease in ER Loss. For various ER strengths, we report the percentage decrease in ER train loss (on SST), from max point to min point.

Table 6 :
Relative Decrease in ER Loss. For various ER rationale scaling factors, we report the percentage decrease in ER train loss (on SST), from max point to min point.

Table 7 :
ER Loss vs. Task Performance. We summarize the line plots in Fig. 9 (ER Loss vs. Task Performance) using slope and $R^2$ score (Sec. A.3.8). Ideally, Fig. 9's lines would have low slope and high $R^2$, indicating that ER helps improve task performance. We see that MAE yields the best ER results.

Table 9 :
ID/OOD Opportunity Cost.Lower values are better.

Table 10 :
Change in Target Class Confidence. For bins where $F_{\text{No-ER}}$'s target class confidence is low, there is a higher percentage of instances that are predicted incorrectly without ER and correctly with ER. This suggests that instances with low target class confidence are more likely to benefit from ER.


Table 11 :
Functional Tests: Sentiment Analysis

Table 12 :
Contrast Set Tests: Lexicon-based vs. Instance-based. Here we use the MAE criterion and IxG as the rationale extractor.

Table 13 :
Functional Tests: Lexicon-based vs. Instance-based. Here we use the MAE criterion and IxG as the rationale extractor.

Table 16 :
ID/OOD Task Performance (Distantly-supervised Human Rationales): Higher accuracy values and lower FPRD values are better. All models displayed are trained on the ID dataset (Stf), with distantly-supervised rationales (for ER criteria) or no rationales (for None), and evaluated on the ID and OOD test splits.

Table 17 :
Instance Prioritisation Methods (with ID/OOD Performance): All values are accuracy (higher is better) on sentiment analysis. None corresponds to models trained without ER; k = 100% corresponds to no annotation budget. Each of k = [5, 15, 50]% has 3 instance prioritisation methods. □ marks cases where HC and Random are significantly similar and greater than LC. * marks cases where HC is significantly greater than Random and greater than LC. • marks cases where all three methods are significantly similar. ⋄ and ⋆ mark cases where the 100% ER setup is significantly similar to and greater than None, respectively. All tests are conducted with p < 0.05.

In this section, we demonstrate the MTurk experiment setup. Each MTurk annotator is paid minimum wage. Figures 14, 15, and 16 show the UIs used by MTurk annotators for the time-budget experiments.