Discovering Highly Influential Shortcut Reasoning: An Automated Template-Free Approach

Shortcut reasoning is an irrational process of inference that degrades the robustness of an NLP model. While a number of previous studies have tackled the identification of shortcut reasoning, two major limitations remain: (i) no method is provided for quantifying the severity of the discovered shortcut reasoning; (ii) certain types of shortcut reasoning may be missed. To address these issues, we propose a novel method for identifying shortcut reasoning. The proposed method quantifies the severity of the shortcut reasoning by leveraging out-of-distribution data and makes no assumptions about the type of tokens that trigger it. Our experiments on Natural Language Inference and Sentiment Analysis demonstrate that our framework successfully discovers both shortcut reasoning known from previous work and previously unknown shortcut reasoning.


Introduction
While Transformer-based large language models have remarkably improved various NLP tasks, shortcut reasoning has been identified as a severe problem (Schlegel et al., 2020; Wang et al., 2022b; Ho et al., 2022). Shortcut reasoning usually refers to the irrational inference of a model, which is derived from spurious correlations in the training data (Gururangan et al., 2018; Poliak et al., 2018; McCoy et al., 2019). For example, sentiment analysis models could learn to classify any sentence containing the word Spielberg as POSITIVE, given a training dataset with many positive movie reviews containing Spielberg (e.g., Spielberg is a great director!).
Shortcut reasoning makes models brittle against Out-of-Distribution (OOD) data (i.e., data from a distribution different from the training data) relative to Independent and Identically Distributed (IID) data (i.e., data from the same distribution as the training data) (Geirhos et al., 2020). In the aforementioned example, the reasoning learned from movie reviews would not be valid for OOD data (e.g., news articles), because the sentiment of a news article containing Spielberg could be arbitrary.
Although many studies have explored the detection of spurious correlations or shortcut reasoning (Ribeiro et al., 2020; Pezeshkpour et al., 2022; Han et al., 2020), several challenges persist. Wang et al. (2022a) propose a state-of-the-art method for discovering shortcut reasoning: an automated framework that discovers shortcuts without predefining shortcut templates. Still, their approach suffers from two major limitations.
Firstly, their framework lacks a method for quantifying the severity of the discovered shortcut reasoning on OOD data. Even if shortcuts are identified, we do not necessarily have to be concerned about them as long as they have little negative impact on the model's robustness. Secondly, their approach assumes that genuine tokens, i.e., tokens useful for predicting labels across different datasets (e.g., "good", "bad"), do not lead to shortcut reasoning. While this assumption seems reasonable, Joshi et al. (2022) argue that such tokens are still prevalent among spurious correlations. This is because such tokens are indeed necessary for predicting the label, but alone may not provide sufficient information to predict it accurately. For example, the genuine token "good" in the sentence This movie is not good can be spurious, since "good" is necessary but insufficient for determining the sentiment label. Therefore, genuine tokens cannot be ignored when identifying shortcut reasoning.
To address these problems, we propose a new method for discovering shortcut reasoning. Our contributions can be summarized as follows:
• We present an automated method for identifying shortcut reasoning.
• By applying the less subjective definition of shortcut reasoning of Geirhos et al. (2020), our method does not require laborious human evaluation of detected shortcuts.
• Our method quantifies the severity of shortcut reasoning by leveraging OOD data and does not make any assumptions about the type of tokens triggering the shortcut reasoning.
• We demonstrate that our method successfully discovers previously unknown shortcut reasoning as well as ones reported in previous research.

Discovering Shortcut Reasoning
Fig. 1 shows the overall procedure of the proposed method. Given (i) a target model f, (ii) IID data D_IID, and (iii) OOD data D_OOD as inputs, shortcut reasoning is extracted as the output. The procedure consists of the following three steps.
Step 1 extracts inference patterns, abstract representations that characterize the inferential process of a given model (§2.1). To extract inference patterns, we use input reduction, an algorithm that derives them automatically (§2.2).
Step 2 estimates the generality of the extracted inference patterns. Generality measures the strength of an inference pattern, indicating its degree of regularity (§2.3). Step 3 identifies shortcut reasoning. We automatically determine whether an inference pattern exhibits shortcut reasoning, without human intervention, by comparing the effectiveness of inference patterns on D_IID with that on D_OOD and leveraging the estimated generality as a proxy for the severity of the identified shortcut reasoning (§2.4).
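The three steps can be sketched as a small pipeline. This is a hypothetical skeleton, not the paper's implementation: the helper names (extract_patterns, generality, is_shortcut) are ours, and each stands in for one step and is supplied by the caller.

```python
from typing import Callable, List, Tuple

Pattern = Tuple[Tuple[str, ...], str]  # (trigger tokens w, induced label l)

def discover_shortcuts(model, iid_data, ood_data,
                       extract_patterns: Callable,
                       generality: Callable,
                       is_shortcut: Callable) -> List[Pattern]:
    # Step 1: extract candidate inference patterns from IID data (input reduction).
    candidates = extract_patterns(model, iid_data)
    # Step 2: estimate each pattern's generality on OOD data.
    scored = [(p, generality(p, model, ood_data)) for p in candidates]
    # Step 3: keep only patterns whose IID/OOD behaviour marks them as shortcuts.
    return [p for p, g in scored
            if is_shortcut(p, g, model, iid_data, ood_data)]
```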

Inference Pattern
An inference pattern is a crucial pattern that activates a label during a model's inference process. Given a target model f, we formally define an inference pattern p as

p = t → l,

where t denotes a trigger that induces a certain label and l is the induced label. Pezeshkpour et al. (2022) classified spurious correlations into two types: (i) granular features, namely discrete units such as an individual token "Spielberg", and (ii) abstract features, namely high-level patterns such as lexical overlap.
This paper focuses on granular features and leaves the detection of shortcut reasoning with abstract features for future work. We thus adopt the following definition of an inference pattern:

p = w → l,

where w is a sequence of tokens. Although we limit ourselves to granular features, this definition still enables us to detect shortcut reasoning in a variety of forms, such as combinations of tokens as well as single tokens. For example, a sentiment analysis model f may have inference patterns such as ["not", "bad"] → NEUTRAL or ["Spielberg"] → POSITIVE (possibly shortcut reasoning).
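This definition maps naturally onto a small data structure. Below is a minimal sketch of our own, not from the paper: the class and its `matches` helper are illustrative, and since the text does not prescribe whether a trigger must appear contiguously, we check simple token containment.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class InferencePattern:
    """A granular inference pattern w -> l: when the trigger tokens w are
    present, model f is induced to predict label l."""
    trigger: Tuple[str, ...]  # the token sequence w
    label: str                # the induced label l

    def matches(self, tokens: Sequence[str]) -> bool:
        # Applies to inputs that contain every trigger token (order ignored).
        return all(t in tokens for t in self.trigger)

# The two examples from the text: a token combination and a single token.
neutral = InferencePattern(("not", "bad"), "NEUTRAL")
spielberg = InferencePattern(("Spielberg",), "POSITIVE")  # possibly a shortcut
```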

Extracting Inference Patterns
Given a target model f and an IID dataset D_IID = {(x_i, y_i)}_{i=1}^{N}, we extract a set C of inference patterns by applying input reduction (IR) to each input x_i.
IR gradually reduces the input x_i by masking its tokens one by one, incrementally increasing the number of masked tokens at each step. At each step, IR feeds the masked x_i into f and obtains a predicted label ŷ_i. IR stops when the predicted label ŷ_i flips. Finally, IR extracts the sequence w_i of tokens left unmasked in x_i, together with the last predicted label ŷ_i before the flip, as an inference pattern, namely w_i → ŷ_i. To prioritize which tokens should be masked, we employ Integrated Gradients (IG; Sundararajan et al., 2017), which computes the importance of each token for the prediction. IR sorts the tokens in x_i by their IG scores and masks them incrementally in ascending order of importance, leaving each token masked as it proceeds to the next.
Fig. 2 shows an example of the extraction process by IR. The tokens are replaced with [MASK] in ascending order of their IG scores (values in orange in the figure). When the predicted label flips from NEGATIVE to POSITIVE, the tokens remaining at the previous step and the NEGATIVE label are extracted as the inference pattern ["don't", "like"] → NEGATIVE. See Appendix A for the pseudocode of IR.
The IR-based extraction algorithm ensures that the trigger w i of the extracted inference patterns is concise and not redundant.
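A minimal sketch of IR under these definitions may help. The `predict` and `importance` callables are stand-ins of ours for the model f and the IG attribution step; the actual implementation recomputes predictions with a fine-tuned Transformer.

```python
from typing import Callable, List, Optional, Tuple

MASK = "[MASK]"

def input_reduction(tokens: List[str],
                    predict: Callable[[List[str]], str],
                    importance: Callable[[List[str]], List[float]],
                    ) -> Optional[Tuple[List[str], str]]:
    """Mask tokens in ascending order of importance until the prediction
    flips; return the surviving tokens and the pre-flip label."""
    scores = importance(tokens)                        # IG score per token
    order = sorted(range(len(tokens)), key=scores.__getitem__)
    masked = list(tokens)
    label = predict(tokens)                            # original prediction
    for i in order:
        masked[i] = MASK
        if predict(masked) != label:                   # prediction flipped:
            masked[i] = tokens[i]                      # undo the last mask
            trigger = [t for t in masked if t != MASK]
            return trigger, label                      # pattern: trigger -> label
    return None  # prediction never flipped; see Limitations for the fallback
```

Running this on the Fig. 2 example with a toy model that needs both "don't" and "like" to predict NEGATIVE recovers the pattern ["don't", "like"] → NEGATIVE.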

Calculating Generality
To verify the validity of an inference pattern in C as a universal pattern, we assess its generality on D_OOD. This measurement determines the degree to which the pattern exhibits regularity on the OOD dataset.
To estimate the generality of an inference pattern p_i = w_i → l_i ∈ C, we collect the set E_OOD(w_i) of examples from D_OOD whose input contains w_i. For example, given p_i = ["Spielberg"] → POSITIVE, E_OOD(w_i) may contain sentences such as I grew up with Steven Spielberg's films. His films are always great!! and Spielberg is overrated. We then estimate the generality g of the inference pattern p_i as

g(p_i) = |{x ∈ E_OOD(w_i) : f(x) = l_i}| / |E_OOD(w_i)|,

i.e., the proportion of examples in E_OOD(w_i) for which f predicts l_i. Intuitively, g(p_i) expresses how dominant the inference pattern is on the OOD dataset.
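In code, generality reduces to a ratio over the matching OOD examples. The following is a minimal sketch under our reading of the definition; the function and helper names are ours, and `predict` stands in for f.

```python
from typing import Callable, List, Sequence

def matching_inputs(trigger: Sequence[str],
                    ood_inputs: List[List[str]]) -> List[List[str]]:
    """E_OOD(w_i): OOD inputs that contain every trigger token."""
    return [x for x in ood_inputs if all(t in x for t in trigger)]

def generality(trigger: Sequence[str], label: str,
               predict: Callable[[List[str]], str],
               ood_inputs: List[List[str]]) -> float:
    """Share of matching OOD inputs on which the model predicts the
    pattern's label, i.e. how dominant the pattern is on OOD data."""
    matches = matching_inputs(trigger, ood_inputs)
    if not matches:
        return 0.0
    return sum(1 for x in matches if predict(x) == label) / len(matches)
```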

Identifying Shortcut Reasoning
In this section, we define shortcut reasoning and describe the method for its detection. According to Geirhos et al. (2020), shortcut reasoning satisfies both of the following conditions: (i) it performs well on D_IID, and (ii) it underperforms on D_OOD. We apply these conditions to inference patterns. Given p_i = w_i → l_i extracted from D_IID by IR, condition (i) is satisfied when p_i works well on D_IID, in other words, when the model performs well on the IID examples that contain w_i (denoted E_IID(w_i)). Thus, we evaluate the performance of each inference pattern on E_IID(w_i). Specifically, we define a metric iid_acc_i as

iid_acc_i = |{(x, y) ∈ E_IID(w_i) : f(x) = y}| / |E_IID(w_i)|,

which counts the correct predictions for inputs that contain the trigger w_i.
Condition (ii) is satisfied when p_i does not deliver accurate results on D_OOD, i.e., when the model operates poorly on the OOD inputs that contain w_i (i.e., E_OOD(w_i)). As a metric of how much p_i underperforms on E_OOD(w_i), we define ∆_i as

∆_i = F1(E_OOD(w_i), f) − F1(D_OOD, f),

which compares the F1 score on E_OOD(w_i) to that on the whole of D_OOD, employed as a baseline for comparison.
To sum up, shortcut reasoning is defined as a pattern p_i = w_i → l_i whose g(p_i) is sufficiently large, whose iid_acc_i is large enough, and whose ∆_i is small enough. The set P of shortcut reasoning is defined as

P = {p_i ∈ C : g(p_i) ≥ λ_1, iid_acc_i ≥ λ_2, ∆_i ≤ λ_3},

where λ_1, λ_2, and λ_3 are pre-defined thresholds. Note that λ_2 must at least be an above-chance score and λ_3 must be below 0. This definition enables us to automatically identify shortcut reasoning that has a substantial impact on OOD data, unlike previous studies (Pezeshkpour et al., 2022; Wang et al., 2022a).
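Putting the three conditions together, a pattern is flagged only when all thresholds are met. A hedged sketch of this check follows; the names are ours, `predict` stands in for f, and `f1` is supplied by the caller (e.g., a macro-F1 routine).

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[List[str], str]  # (input tokens, gold label)

def contains(tokens: Sequence[str], trigger: Sequence[str]) -> bool:
    return all(t in tokens for t in trigger)

def is_shortcut(trigger: Sequence[str], label: str,
                predict: Callable[[List[str]], str],
                iid: List[Example], ood: List[Example],
                f1: Callable[[List[Example]], float],
                lam1: float, lam2: float, lam3: float) -> bool:
    """p_i is a shortcut iff g(p_i) >= lam1, iid_acc_i >= lam2, delta_i <= lam3."""
    e_iid = [(x, y) for x, y in iid if contains(x, trigger)]
    e_ood = [(x, y) for x, y in ood if contains(x, trigger)]
    if not e_iid or not e_ood:
        return False
    g = sum(1 for x, _ in e_ood if predict(x) == label) / len(e_ood)
    iid_acc = sum(1 for x, y in e_iid if predict(x) == y) / len(e_iid)
    delta = f1(e_ood) - f1(ood)   # negative when the pattern hurts OOD F1
    return g >= lam1 and iid_acc >= lam2 and delta <= lam3
```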

Setup
One straightforward approach for assessing our method would be to annotate NLP models with their ground-truth shortcut reasoning. However, recent NLP models are known to be hard to interpret, which makes it difficult to create such a reference dataset. We thus resort to existing datasets for Natural Language Inference (NLI) and Sentiment Analysis (SA) that have been shown to contain spurious features, and check whether our inference patterns can reveal such features (and unknown ones).

Datasets For NLI, we adopt MNLI (Williams et al., 2018) as D_IID and ANLI (Nie et al., 2020) as D_OOD. ANLI is an NLI dataset based on MNLI but adversarially redesigned to be harder to answer. For SA, we use the sentiment subset of TweetEval (Barbieri et al., 2020) as D_IID and MARC (the Multilingual Amazon Reviews Corpus) (Keung et al., 2020) as D_OOD. For all OOD datasets, we use the training split. See Appendix B.1 for dataset details. Note that all the datasets are three-way classification tasks, so the chance accuracy is 0.33.

Models We apply our method to RoBERTa (Liu et al., 2019) fine-tuned on the D_IID mentioned above, available on the Hugging Face Hub. See Appendix B.2 for details.

Configuration The inference patterns of a model are acquired by learning from the training data. Thus, in addition to the test (or validation) set, we extract C from the training set of D_IID, expecting to better simulate the model's reasoning process. In addition, we randomly choose 1,000 examples as input to IR in consideration of its runtime. We set the hyperparameters λ_1 = 50, λ_2 = 70, and λ_3 = −0.05. To obtain g(p_i) reliably, we filter out any p_i with |E_OOD(w_i)| < 100 from C.

Results
We show samples of the results in Table 1. We selected representative p with large g, iid_acc, and |∆|. The column train/test denotes whether the shortcut reasoning was discovered in the training or the test split of the IID dataset (i.e., the input for IR).
NLI On the OOD data, the model achieved F1(D_OOD, f) = 77.8. "/s" denotes the separation token between premise and hypothesis. We observed that most triggers t identified as shortcut reasoning belonged to the hypothesis, while only a small proportion appeared in the premise. This observation suggests that the model heavily relies on the hypothesis to predict labels, corroborating the findings of Poliak et al. (2018). Furthermore, we found that negation expressions in t, such as "not" or "never", often led the model to predict CONTRADICTION. This phenomenon manifests itself even when the gold label indicates otherwise, as shown by the values of ∆ for the first and fourth p in Table 1. This finding aligns with the results reported by Gururangan et al. (2018). Given this consistency with previous work, our method appears to be effective at accurately identifying shortcuts.

SA
The model achieved F1(D_OOD, f) = 60.3. We found that sentiment words, such as "worst" or "Excellent", emerged in almost all t classified as shortcut reasoning. Further analysis showed that reviews with neutral labels in MARC frequently contained both positive and negative sentiments (e.g., I hate the wrapping, but it works pretty well.). Considering that a sufficient number of p with small ∆ have original inputs annotated with the neutral label, we conjecture that the model relies on one of the multiple sentiments in the input and ignores the rest. It is therefore possible to say that these inference patterns are shortcuts whose triggers t are necessary but insufficient.

Train/Test No significant difference was observed between the train and test experiments. Although the shortcut reasoning extracted in the two experiments differed, it was essentially similar in its characteristics (such as negations in NLI or sentiment words in SA).

Unknown Shortcut Reasoning In the NLI experiment, it is interesting to note that we revealed several previously unknown instances of shortcut reasoning, such as ["soon"] → NEUTRAL and ["is", "always"] → NEUTRAL. Both have a sufficiently small ∆ and sufficiently large iid_acc and g to be considered as p.

Related work
Numerous studies have tackled the problem of detecting spurious correlations or shortcut reasoning (Ribeiro et al., 2020; Han et al., 2020). One major limitation of earlier studies is that they predefine a specific format or structure for the shortcuts (e.g., a single token or a predefined set of tokens). This can hinder the discovery of new and unexplored shortcuts, which may manifest in diverse forms.
Recently, Pezeshkpour et al. (2022) addressed this problem by combining multiple interpretability techniques, such as influence functions (Koh and Liang, 2017) and feature attribution methods (e.g., Integrated Gradients). However, they rely on human assessment to identify shortcut reasoning, which can result in misjudgment between rational and irrational reasoning. Besides, human evaluation is laborious and time-consuming. Wang et al. (2022a) solve this issue by automatically identifying genuine tokens, important tokens that appear across different datasets, and spurious tokens, important tokens that appear only in an in-domain dataset. Still, as discussed in §1, a major limitation of their approach is that they fail to consider the influence of the identified shortcut reasoning on OOD data. Our work attempts to address this issue by estimating the generality of inference patterns. Besides, our definition of shortcut reasoning aligns better with practical scenarios. While they rely on a subjective definition of shortcut reasoning (i.e., whether reasoning is irrational from a human point of view), our work targets shortcut reasoning that performs well on IID data but underperforms on OOD data (Geirhos et al., 2020), namely reasoning that by definition clearly hurts the robustness of NLP models.

Conclusion
We introduced a method to automatically discover shortcut reasoning. With minimal predefinition, our method successfully identified known and previously unknown examples of shortcut reasoning. For future research, we plan to adapt our method to large language models and to other tasks such as machine reading comprehension. Overall, we hope that our study provides a promising approach towards understanding the behavior of deep learning models and improving their trustworthiness.

Limitations
Firstly, we have yet to develop an evaluation process for validating the discovered shortcut reasoning. Even though we provide metrics for measuring shortcut reasoning, knowing the actual reasoning process is impossible when using black-box models. Unfortunately, solving this problem would require significant effort.
Secondly, as our method is not compatible with abstract inference patterns, it cannot cover kinds of shortcut reasoning other than granular ones.
Thirdly, preparing the two datasets, i.e., IID and OOD, is challenging for low-resource languages or for some tasks, which limits further studies or applications of this method. Fortunately, we now have access to large language models that have surprising linguistic capabilities and are well aligned with users' instructions. Examples generated by LLMs follow a certain distribution that can be treated as OOD for target models, or we can prompt the LLMs to generate examples from a specific distribution.
The fourth limitation concerns IR. If the prediction for the masked input does not flip during the reduction, we alternatively output the last token left in the input. Therefore, in some cases, we cannot guarantee that the extracted inference pattern is genuine.

Figure 1: Our method to discover shortcut reasoning with 3 steps.

Figure 2: An example of inference pattern extracted by Input Reduction.

Table 1: Samples of identified shortcut reasoning on NLI (top) and SA (bottom). We show several interesting results: p with large g and iid_acc and small ∆.
References
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.