On the Interaction of Belief Bias and Explanations

A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans' prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.


Introduction
Machine learning has become an integral part of our lives, from everyday use (e.g., search, translation, recommendations) to high-stakes applications in healthcare, law, or transportation. However, its impact is controversial: neural models have been shown to make confident predictions relying on artifacts (McCoy et al., 2019; Wallace et al., 2019) and to encode and amplify negative social biases (Manzini et al., 2019; Caliskan et al., 2017; May et al., 2019; Tan and Celis, 2019; González et al., 2020; Rudinger et al., 2018).
Explainability aims to make model decisions transparent and predictable to humans; it serves as a tool for model diagnosis, detecting failure modes and biases, and more generally, to increase trust by providing transparency (Amershi et al., 2019). While automatic metrics have been proposed to evaluate explanations (Atanasova et al., 2020; Robnik-Sikonja and Bohanec, 2018; DeYoung et al., 2020), these metrics do not inform us about human interaction with explanations.
Doshi-Velez and Kim (2017) suggested human forward prediction, a simulation task in which humans are given an input and an explanation, and their task is to predict the expected model output, regardless of the gold answer. Recent studies include Nguyen (2018); Lage et al. (2019); Hase and Bansal (2020); Poursabzi-Sangdeh et al. (2021). Such protocols are widely used and can provide valuable insight into human understanding of explanations. However, prior work has not accounted for how humans' prior beliefs (belief biases) interact with the evaluation; simulating model decisions becomes an easier task when the model being evaluated makes predictions which align with human expectations. We argue that not considering belief bias in such protocols may lead to misleading conclusions about which explainability methods perform best.
Other protocols have evaluated participants' ability to select the best model based on explanations offered by different interpretability methods (e.g. deciding which model would generalize 'in the wild') (Ribeiro et al., 2016a). However, comparisons have been made between a model which is clearly in line with human beliefs, and another which exploits spurious correlations diverging from human expectations. When differences are less obvious, humans may not be able to leverage their belief biases, and conclusions may change.
This paper, which includes evaluations for both of the previously mentioned tasks, closes an important gap: to the best of our knowledge, no prior work in NLP addresses the interaction of belief bias with current human evaluations of explainability.
Contributions. We provide an overview of belief bias meant to highlight its role in human evaluation and provide some preliminary ideas for NLP practitioners on how to handle such cases. Using human forward prediction and best model selection (Figure 1), we present a case study where we compare two gradient-based explainability methods in the context of reading comprehension (RC), introducing conditions to take into account belief bias. We find that both explainability methods are helpful to participants in the standard settings (in line with most previous work), but the conclusions about the best performing models change when incorporating additional control conditions, reinforcing the importance of accounting for such biases.

Belief Bias
Belief bias is a type of cognitive bias, defined in psychology as the systematic (non-logical) tendency to evaluate a statement on the basis of prior belief rather than its logical strength (Evans et al., 1983; Klauer et al., 2000; Barston, 1986). Cognitive biases are not necessarily bad; they help us filter and process a great deal of information (Bierema et al., 2020), and have been widely studied in real human decision making (Tversky and Kahneman, 1974; Kahneman, 2003; Furnham and Boo, 2011). However, in evaluations involving human participants, such biases may alter results and affect conclusions (Anderson and Hartzler, 2014; Wall et al., 2017).
Classic psychology studies of belief bias have assessed how prior beliefs affect syllogistic reasoning (Newstead et al., 1992; Klauer et al., 2000; Evans et al., 1983; Markovits and Nantel, 1989). In syllogistic reasoning, the task for humans is to assess the logical validity of arguments while ignoring their believability. Given two logically valid arguments, (a) with a less believable premise and (b) with a believable one, most work converges on the finding that humans will rate argument (a) as invalid more often than (b), biased by the fact that the premise in (a) is less believable.
In psychology, belief bias has been tied to the dual-processing theory, which assumes that reasoning is performed by two competing cognitive systems: (1) system 1, which takes care of fast, heuristic processes, and (2) system 2, which handles slower, more analytical processes (Evans, 2003; Trippas and Handley, 2018; Evans and Curtis-Holmes, 2005; Croskerry, 2009). Generally, humans tend to have a cognitive preference for relying on fast, intuitive system 1 processes, rather than engaging in the slow and more analytical system 2 processes. Belief bias is attributed to system 1 (Evans and Curtis-Holmes, 2005; Evans, 2008; Evans and Frankish, 2009; Stanovich and West, 2008) due to several factors, reviewed in detail by Evans (2003) and Caravona et al. (2019).
For the purposes of NLP studies relying on crowd workers, one relevant finding is that time pressures exacerbate reliance on previous beliefs (Evans and Curtis-Holmes, 2005). Since crowd workers generally are incentivized to work as quickly as possible to maximize their hourly pay, reliance on belief bias is to be expected.
Another relevant finding for NLP is that threatening or negatively charged arguments (e.g. content violating political correctness and social norms) lead to greater engagement of system 2, whereas neutral content leads to increased reliance on belief bias (Goel and Vartanian, 2011; Klaczynski et al., 1997). Since NLP studies tend to be performed on neutral content such as passages from Wikipedia (content which may not sufficiently engage participants' system 2 processes), belief bias is more likely to play a role in human performance.
This study aims to highlight the phenomenon of belief bias to encourage NLP practitioners to assess the role it plays in their evaluations, and introduce mechanisms to account for belief bias effects. We illustrate how belief bias effects can significantly affect the results of human evaluation of explainability for two paradigms: human forward prediction and best model selection.

Related Work
Human forward prediction. Human forward prediction experiments have recently been presented in the context of synthetic data (Poursabzi-Sangdeh et al., 2021; Lage et al., 2019; Slack et al., 2019) to evaluate explainability methods for their ability to make model decisions predictable to humans. In this paradigm, humans are presented with explanations and tasked with predicting the model's decision regardless of the ground truth (Doshi-Velez and Kim, 2017). 1 In NLP, Nguyen (2018) introduced human forward prediction for LIME explanations (Ribeiro et al., 2016b) of sentiment analysis of product reviews and correlated the results with automatic evaluations. Unlike with synthetic data, participants have prior beliefs about what the true outcome is. Since participants in Nguyen (2018) had no training phase to learn how explanations correlate with predictions and the model being evaluated sufficiently matched human behavior, humans likely relied exclusively on their prior knowledge and beliefs to complete the task at hand. Hase and Bansal (2020) improved on this protocol by adding a training phase. This is something we also do in our experiments (section 5), but it is unlikely to solve the belief bias problem because even after training, humans will naturally opt for fast, heuristic mechanisms (e.g. belief bias) in order to simplify tasks; this is particularly true if the model is high performing (i.e. likely aligns with human beliefs).
1 Using synthetic data from fictitious domains effectively controls for belief bias (Lage et al., 2019; Slack et al., 2019). Slack et al. (2019), for example, evaluate explanations in the domain of recommending recipes and medicines to aliens.
The protocol by Hase and Bansal (2020) had another key feature: they leave out the explanations for the test data points. This would seem like an advantage for evaluating explainability methods in the context of reading comprehension where explanations can, in theory, simply highlight the answer span, making it easy to guess the model output from the explanations. However, it is easy to control for the amount of explanation provided by the explanation methods we compare; in our experiments below, we highlight the top 10 tokens with highest attribution scores. This key feature in their protocol is problematic for two reasons:
• It makes the human learning problem much harder, and we argue it is infeasible to expose participants to enough examples to make human forward prediction learnable (unless the task is made very easy on purpose; again by only evaluating high performing models). If it is not learnable, participants fall back on belief bias.
• It introduces a systematic bias between the training and test scenarios.
The protocol in Hase and Bansal (2020) also does not randomize the order in which participants are exposed to problems with or without explanations.
We improve on the above protocol by introducing a condition which can help account for belief bias effects: evaluating explainability methods on low-quality models, the predictions of which substantially differ from human beliefs. This means that in order to succeed in the task, humans cannot simply rely on their previous beliefs, which helps us assess whether explanations help humans realign their expectations of model behavior. The predictions of reading comprehension models can also be made different from human answers by introducing distractor sentences that fool machine reading models, but not humans (Jia and Liang, 2017). If, in human forward prediction, participants predict the true answer rather than spans in the distractor sentences, this suggests participants may be relying on their belief biases.
Best model selection. Ribeiro et al. (2016b) presented an evaluation of explainability methods for text classification, where explanations for decisions of two different models on the same instance are presented side by side, and humans decide which model is likely to generalize better. With some exceptions (Lertvittayakumjorn and Toni, 2019), there has not been much follow-up work on this task, but this scenario is important: it mimics the decision about which model is safer for deployment. Ribeiro et al. (2016b) and Lertvittayakumjorn and Toni (2019) both make a single comparison between a model which clearly diverges from human intuition, and a model that generalizes and aligns with humans' beliefs. Accounting for the extent to which belief biases are leveraged (e.g. by introducing additional model comparisons where differences are not so obvious or where models are of low quality) is important in such paradigms, and can allow us to better evaluate where explanation methods may fail.
In the following sections, we show that introducing conditions which take into account belief biases can have an effect on the conclusions for both human forward prediction and best model selection. We emphasize that many other potential strategies can be introduced and this is largely dependent on the goals of the evaluation protocol; we merely provide one example case with the following strategies:
(1) Introducing low-quality models which considerably diverge from humans' prior beliefs (human forward prediction).
(2) Introducing evaluation problems with distractor sentences (human forward prediction).
(3) Introducing model comparisons where relying on belief bias is not enough to obtain high performance (best model selection).

Experimental Setup
This section introduces the general setup of the experiments, with details specific to each experimental paradigm described in section 5 and section 6. We use three models: (a) a high performing model (HIGH), fine-tuned to solve SQuAD 2.0. (b) a medium performing model (MEDIUM): tinyBERT, a 6-layer distilled version of BERT (Jiao et al., 2020), fine-tuned on SQuAD 2.0. It performs about 20 F1 points below HIGH. This model somewhat aligns with human intuition, but performs significantly lower. (c) a low performing model (LOW): BERT-base, fine-tuned to always choose the first occurrence of the last word of the question. This system mimics a rule-based system; however, we evaluate gradient-based methods requiring a neural model. This model diverges significantly from human beliefs.
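The target behavior of the LOW model can be made concrete with a small rule-based sketch. This is only an illustration of the rule the model is fine-tuned to imitate, not the fine-tuned BERT model itself; the function name, the word tokenization, and the case-insensitive matching are assumptions introduced here.

```python
import re
from typing import Optional, Tuple


def low_model_rule(question: str, context: str) -> Optional[Tuple[int, int]]:
    """Character span of the first occurrence in `context` of the last
    word of `question`, or None if that word never appears.

    Hypothetical helper: tokenization and case handling are assumptions.
    """
    words = re.findall(r"\w+", question)
    if not words:
        return None
    target = words[-1]
    # Whole-word, case-insensitive search for the first occurrence.
    match = re.search(rf"\b{re.escape(target)}\b", context, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None


context = "Rollo ruled Normandy. Later, Normandy grew."
span = low_model_rule("Who ruled Normandy?", context)
print(span, context[span[0]:span[1]])
```

Because the rule ignores what the question asks, its answers rarely coincide with what a human reader would pick, which is the property the LOW condition relies on.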

Data
We use SQuAD 2.0 (Rajpurkar et al., 2018), an RC dataset consisting of 150k factoid question-answer pairs, with texts coming from Wikipedia articles.
We opt for this data as it contains short passages that can be read by humans in a short time. In the human forward prediction experiments, we refer to experiments using this data as ORIG. As described in section 2, Wikipedia texts could by themselves induce people to rely on their belief bias, but this particular dataset allows us to also introduce controls for the bias: the adversarial version of the data (Jia and Liang, 2017) has been shown to distract models but not humans. This means that in order to perform the task successfully, humans need to disregard their belief biases, and in some cases align with distractor sentences. We refer to this data in our simulation experiments as ADV.

Explainability Methods
We focus on gradient-based approaches, as they require no modifications to the original network and are considerably faster than perturbation-based methods. We compare two explainability methods:
Gradients. Computing the gradient of the prediction output with respect to the features of the input is a common way to interpret deep neural networks (Simonyan et al., 2013) and capture relevant information regarding the underlying model.
Integrated gradients. The integrated gradients approach (IG) (Sundararajan et al., 2017) attributes an importance score to each input feature by approximating the integral of the gradients of the model's output with respect to the inputs along a straight path from a reference (baseline) input to the actual input. IG was introduced to satisfy two axioms, sensitivity and implementation invariance; vanilla gradients violate sensitivity.
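To illustrate the difference between the two attribution methods, the following is a minimal sketch on a toy differentiable function rather than on a neural reading comprehension model. The toy model, its weights, the all-zero baseline, and the midpoint Riemann approximation with 50 steps are all assumptions made for illustration; they are not the configuration used in our experiments.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def integrated_gradients(
    grad_fn: Callable[[Vector], List[float]],
    x: Vector,
    baseline: Vector,
    steps: int = 50,
) -> List[float]:
    """Approximate IG with a midpoint Riemann sum along the straight
    path from `baseline` to `x`."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the input difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy model (an assumption for illustration): f(x) = sum_i w_i * x_i^2.
WEIGHTS = [1.0, -2.0, 0.5]


def toy_model(x: Vector) -> float:
    return sum(w * xi ** 2 for w, xi in zip(WEIGHTS, x))


def toy_grad(x: Vector) -> List[float]:
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
vanilla = toy_grad(x)  # vanilla gradients: gradient evaluated at x only
ig = integrated_gradients(toy_grad, x, baseline)
print("vanilla gradients:", vanilla)
print("integrated gradients:", ig)
print("sum of IG:", sum(ig), "f(x) - f(baseline):", toy_model(x) - toy_model(baseline))
```

The final line checks the completeness property of IG: the attributions sum to the difference between the model output at the input and at the baseline, which vanilla gradients do not guarantee.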

Experiment 1: Human Forward Prediction
Human forward prediction for evaluating explainability was proposed by Doshi-Velez and Kim (2017). They argue that if a human is able to simulate the model's behavior, they understand why the model predicts in that manner. For the reasons previously outlined, we suspect that belief biases may be affecting performance and the conclusions one can draw from this task. We investigate this by asking the following: Can humans predict model decisions, if model behavior considerably diverges from their own beliefs?
Stimuli presentation. We include: (i) HIGH, which is fine-tuned to solve SQuAD 2.0, and (ii) LOW, which is fine-tuned to select the first appearance in the context of the last word in the question. We evaluate each of the two models twice: with and without adversarial data. We contrast vanilla gradients and IG with a baseline condition in which no explanations are shown (BASELINE). We highlight the top-10 tokens 3 with the highest attribution scores with respect to the start and end positions of the predicted span, and zero out the rest. 4 The two sets of tokens often overlap.
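The token selection step can be sketched as follows. This is a hedged illustration, not our actual preprocessing code: the function names are hypothetical, scores are ranked by their raw value, and the start-position and end-position selections are combined by taking their union, which is an assumption about how the two sets are merged for display.

```python
from typing import List, Sequence, Set


def top_k_indices(scores: Sequence[float], k: int = 10) -> Set[int]:
    """Indices of the k highest attribution scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def zero_out_rest(scores: Sequence[float], k: int = 10) -> List[float]:
    """Keep the top-k scores and set every other score to zero."""
    keep = top_k_indices(scores, k)
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def highlighted_tokens(
    tokens: Sequence[str],
    start_scores: Sequence[float],
    end_scores: Sequence[float],
    k: int = 10,
) -> List[str]:
    """Tokens in the top-k for the start or the end position (assumed union)."""
    keep = top_k_indices(start_scores, k) | top_k_indices(end_scores, k)
    return [tok for i, tok in enumerate(tokens) if i in keep]


tokens = ["a", "b", "c", "d"]
start = [0.1, 0.9, 0.2, 0.0]
end = [0.0, 0.8, 0.1, 0.7]
print(zero_out_rest(start, k=2))
print(highlighted_tokens(tokens, start, end, k=2))
```

Fixing k for every method keeps the amount of highlighted text comparable across explanation conditions.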
Participants were provided with a question and a passage (with or without explanations) and were told to pick the shortest span of text which matched the model prediction. They saw the actual model answers before the next example (done for both baseline and explanation conditions), which was an important part of training to infer model behavior. Before the model prediction was shown, their answers were locked to prevent any further changes. An example of our interface can be found in Fig
We ran these experiments on Amazon Mechanical Turk, recruiting participants with approval ratings greater than 95% 5 and ensuring different groups of participants per condition by specifying that participation is only allowed once, otherwise risking rejection 6 . We paid participants $5.25 for about 20 minutes of work (to ensure at least a $15 hourly pay) and obtained at least three annotations per example. The data included 120 unique questions divided into small fixed batches (the same questions across conditions). The HIGH model answers about 75% of the questions correctly, and the LOW model around 15%. In total, we obtained 4,300 data points across 123 participants (35 data points per participant).
Results. As humans often did not select the exact span that was provided as ground truth, we manually labeled the spans as correct or incorrect. We also inspected the impact of training in human forward prediction, e.g., the learning effect of multiple exposures on annotator accuracy. Both with vanilla gradients and integrated gradients, we observe an increase in the participants' accuracy at around 15 examples. In contrast, in our baseline condition, performance either stays constant or drops slightly.
3 Explanations should be selective (Mittelstadt et al., 2019).
4 Ribeiro et al. (2016a) use the top 6 attributes; we opt for 10 given that our texts are slightly longer.
5 Previous research has shown that proper filtering and selection of participants on Mechanical Turk can be enough to ensure high quality data (Peer et al., 2014).
6 We also remove such (few) repetitions at analysis.
To reduce the noise introduced due to the training period, we remove the first 15 examples of each participant. The results without this preprocessing (Appendix A) suggest that the effect of training differed across explainability methods, as will be discussed later in the section. Using the average human accuracies per example, we run a one-way ANOVA to test for significant differences across the groups. As we obtained statistically significant results, we then ran the Tukey honest significant difference (HSD) test (Tukey, 1949), comparing the means of every condition to the means of every other condition. The results are presented in Table 1.
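The preprocessing and the first step of the significance test can be sketched as below. The record layout, the function names, and the 0-based presentation index are assumptions for illustration. The sketch computes only the one-way ANOVA F statistic in pure Python; p-values and the Tukey HSD post-hoc comparison would normally come from a statistics library (for instance `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd`), and we do not claim this mirrors the exact tooling used.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Sequence, Tuple

# (participant_id, presentation_index, example_id, is_correct); hypothetical layout
Record = Tuple[str, int, str, bool]


def per_example_accuracy(records: Iterable[Record], n_training: int = 15) -> Dict[str, float]:
    """Average accuracy per example after dropping each participant's
    first `n_training` examples (0-based presentation index)."""
    by_example: Dict[str, List[float]] = defaultdict(list)
    for _participant, index, example, correct in records:
        if index >= n_training:
            by_example[example].append(1.0 if correct else 0.0)
    return {ex: mean(vals) for ex, vals in by_example.items()}


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


records = [("p1", 0, "q1", True), ("p1", 15, "q2", True), ("p2", 16, "q2", False)]
print(per_example_accuracy(records))
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
```

In this setup each group passed to the ANOVA would hold the per-example accuracies of one condition (e.g. BASELINE, vanilla gradients, IG).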
As expected, in the absence of explanations (BASELINE), humans rely on belief bias and predict the gold standard answer more often than the model prediction (y in Table 1). Even with training (seeing the true model prediction), humans fail to catch onto the simple rule used by the LOW model, when no explanations are presented.
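The two quantities contrasted here, agreement with the model prediction (ŷ) and agreement with the gold answer (y), can be computed side by side as in the sketch below. Note the assumption: our spans were labeled manually as correct or incorrect, whereas this illustration substitutes a normalized exact-match comparison; the dictionary keys and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def normalize(span: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for manual labeling)."""
    return " ".join(span.lower().split())


def agreement_rates(responses: List[Dict[str, str]]) -> Tuple[float, float]:
    """Return (agreement with model prediction, agreement with gold answer)."""
    if not responses:
        raise ValueError("no responses to score")
    n = len(responses)
    with_model = sum(normalize(r["human"]) == normalize(r["model"]) for r in responses)
    with_gold = sum(normalize(r["human"]) == normalize(r["gold"]) for r in responses)
    return with_model / n, with_gold / n


responses = [
    {"human": "Rollo", "model": "Normandy", "gold": "Rollo"},
    {"human": "Normandy", "model": "Normandy", "gold": "Rollo"},
    {"human": "10th century", "model": "century", "gold": "10th  Century"},
    {"human": "France", "model": "France", "gold": "France"},
]
print(agreement_rates(responses))
```

When agreement with the gold answer exceeds agreement with the model prediction, as in the BASELINE condition for the LOW model, participants are likely answering from their own beliefs rather than simulating the model.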
Overall, explanations derived from both of the gradient-based approaches lead to statistically significant improvements over the baseline. This indicates that the explanations allow humans to realign their expectations of the model behavior, better than with no explanations.
For HIGH-ORIG, the standard setting explored in previous evaluations, both IG and vanilla gradients perform well, with IG performing better. Given these results and the theoretical advantages of IG over vanilla gradients, one could arrive at the conclusion that IG is better for simulatability. However, the differences between the two gradient-based methods are reversed in the conditions where humans cannot rely on their previous beliefs (LOW). The gap between gradients and IG is as large as 0.11 and is statistically significant. This finding is surprising and points again to the importance of not drawing incorrect conclusions about the best performing method using the standard paradigm.
Table 1: Human forward prediction results (HUMAN(ŷ)) for LOW and HIGH models, compared to no explanations (BASELINE). Each experiment is run on vanilla SQuAD 2.0 data (ORIG) and adversarial SQuAD 2.0 data (ADV). HUMAN(y) is the dataset ground truth and an indicator of belief bias. Statistically significant results are indicated with an asterisk. Time is the average time per question. The best ŷ results in each condition are bolded.
Finally, in the HIGH conditions, model performance decreases by about 13% F1 in the presence of adversarial examples, meaning that the model we used is affected by adversarial inputs. We observe that human performance is considerably lower in HIGH-ADV than in HIGH-ORIG. With vanilla gradients, performance is more aligned with the ground truth labels than with model behavior, showing that in this condition humans are also relying on their prior beliefs. With IG, where performance is less aligned with prior beliefs (ground truth), the end performance increases, but it seems that this condition is considerably more difficult for humans.
Effect of training. In BASELINE, training does not affect either the LOW or HIGH conditions (see Table 3 in Appendix A for the raw results). For the LOW model, multiple factors can be taking place (possibly at the same time): (1) the task is too far from the humans' beliefs and there is no mechanism to help participants realign their expectations, (2) participants may not be incentivized to seriously engage and look for patterns, (3) participants opt for a mixed strategy, where for some questions they go with their prior beliefs and for others, choice is random (as seen in their performance in y).
For HIGH conditions in BASELINE, performance remains higher than LOW but this is likely due to belief bias and not training, given that performance remains constant after removing the training data points. We hypothesize that for HIGH, instances where the model does not align with human intuition might be more detrimental than in explanation conditions. More specifically, if humans are aware that the model aligns with their beliefs after some examples but encounter instances where it doesn't (the model is not 100% accurate), they will likely develop an expectation that the model is bound to make some errors, without any indication of when.
In addition, our raw results suggest IG required longer training. While this does not mean IG is a worse method than vanilla gradients, explanations derived from IG may have confused participants due to containing information which was irrelevant to them. It may be that experts (e.g. system engineers knowledgeable about neural networks) can take better advantage of such explanations; however, we leave this exploration of the interaction of human expertise with explanations as a direction for future work.

Experiment 2: Best Model Selection
This section presents the setup and results of our model selection experiments; a task where humans select the model that is more likely to succeed in the wild. We present the participants with the explanations from two models (HIGH vs LOW and HIGH vs MEDIUM), and ask them to decide which model is likely to perform better. As a follow-up, we also experimented with soliciting explanations about what leads the worse model to fail. Intuitively, comparative evaluation difficulty depends on how clear the difference is between the compared objects. Explanations should at least show the difference between a high-performing model and a low-performing one, enabling human participants to predict which is better (standard setting).
Stimuli presentation. We presented participants with saliency information from both models (a high performing model + one of the lower performing models), and their task was to determine which model performs best in the wild. We shuffled the order at random so that the best model would not remain in a fixed position. We obtain 120 samples (question-context pairs), and show the explanations next to each other as seen in Figure 1. The participants are told that the highlighted attributes are the words the model found important in making its decision. A screenshot of the UI is shown in Figure 4 in section B and the instructions provided to the participants are also shown in section B. These experiments were also run on Amazon Mechanical Turk with the same general procedures and pay. The same subset of 120 examples is used in all conditions. We obtained at least three annotations per example and ended up with a total of 1440 data points across 48 participants (30 examples each).
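The position randomization can be sketched as follows. This is an illustrative assumption of how such trials might be assembled; the field names, the seeded generator, and the left/right layout are hypothetical and not taken from our experiment code.

```python
import random
from typing import Dict, Iterable, List


def build_trials(
    example_ids: Iterable[int],
    better: str = "HIGH",
    worse: str = "LOW",
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Assign the two models to screen positions at random for each example,
    recording which side holds the better model for later scoring."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    trials = []
    for example_id in example_ids:
        pair = [better, worse]
        rng.shuffle(pair)
        trials.append({
            "example": example_id,
            "left": pair[0],
            "right": pair[1],
            "better_side": "left" if pair[0] == better else "right",
        })
    return trials


trials = build_trials(range(120), better="HIGH", worse="MEDIUM")
print(trials[0])
print(sum(t["better_side"] == "left" for t in trials), "of", len(trials), "on the left")
```

Storing the correct side with each trial lets participant choices be scored without re-deriving the shuffled order.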

Results. For each example shown to annotators, we obtained the average accuracy scores and performed a standard t-test to compare the performance of the two methods. The results are shown in Table 2. Using explanations from both methods, when shown the HIGH and LOW model, humans are clearly able to correctly select the better one. With IG, humans achieve 0.95 accuracy on average, while with vanilla gradients they achieve 0.89. The difference is not statistically significant. The fact that users are consistently able to discriminate between HIGH and LOW models is expected, and serves as a sanity check that these explanations are meaningful for humans.
When the same experiment was repeated in the HIGH vs MEDIUM condition, we found clear and statistically significant differences between the two explainability methods. Using IG, participants reach only 0.52 accuracy, while with vanilla gradients their performance is 0.85. This is surprising, given that the difference in performance between the two models is still quite large (about 20% F1); the expectation is that both methods would capture this difference relatively well. It appears that when both models more or less align with human beliefs, the task is much more difficult. Humans now need to engage in more analytical thinking and cannot simply rely on belief biases to solve the task. We further investigate these differences through qualitative coding.
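The comparison between the two methods can be sketched with a two-sample t statistic over per-example accuracies. The text above specifies only a "standard" t-test, so the independent-samples, pooled-variance form used here is an assumption, as are the function name and the toy numbers; a p-value would be obtained from the t distribution with n_a + n_b - 2 degrees of freedom using a statistics library.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Student's two-sample t statistic with pooled variance
    (assumes independent samples with equal variances)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))


# Toy per-example accuracies for two explanation methods (made-up numbers).
method_a = [1.0, 2.0, 3.0]
method_b = [2.0, 3.0, 4.0]
print(pooled_t_statistic(method_a, method_b))
```

Since the same 120 examples are used in every condition, a paired test over per-example differences would be a reasonable alternative design.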
Qualitative analysis. After each instance, we asked participants to describe how the worse model will fail. We do not provide detailed guidelines in order to not further bias the participants by introducing specific criteria. The instructions given to the participants are shown in Appendix B.
We collected 1440 responses, which were all inspected manually to uncover categories (codes). After multiple iterations, we tagged each response with one code (categories are mutually exclusive; no response can be placed in two). A description of the categories and their distribution are shown in Figure 3, and examples of feedback per category are provided in Appendix B.
In the HIGH vs LOW condition, feedback for both methods was generic (about 70-80% of the time), e.g., model B is likely incorrect so it is worse. This was expected: this task should be easy when model differences are large and humans can rely on their system 1 processes to get through the task without thinking deeply about the explanations.
In the HIGH vs MEDIUM condition, the distribution of the feedback categories is very different. For IG, 50% of the time participants felt the highlighted tokens were irrelevant. This is not the case for gradients, where only about 15% of responses fell in that category. Additionally, for vanilla gradients, 50% of feedback is generic, signaling that in this condition, it may have been an easy task as well; explanations are making model behavior clear enough. It remains an open question whether IG explanations may in fact be more faithful to the model reasoning. In that case, expert users (e.g. a system engineer debugging a system) may not find IG attributions irrelevant and would be able to take better advantage of the information provided. For this reason, other kinds of human participants may show different results. Nevertheless, as evaluating on non-experts (crowdsourced workers for example) is common, this preliminary result is important: it shows that conclusions can shift dramatically when introducing additional model comparisons which reduce the participants' ability to rely on prior knowledge.

Discussion: Mitigating Belief Bias
This study introduced additional conditions in which the human participants could not rely on their belief biases to facilitate the task at hand. We presented a case study on evaluating reading comprehension models in model selection and human forward prediction paradigms, and we showed that this simple addition led to different conclusions in the evaluation and a better understanding of how humans interacted with explanations. Other tasks and paradigms might call for different setups, but generally including conditions with models of varying quality would be helpful both for the purposes of bias control, and for simulation of real-life use of explainability techniques to support decisions about which model is safer to deploy.
To conclude, we will briefly mention other directions for mitigating belief biases that can also be explored in future work and which should be kept in mind when developing evaluation protocols for explainability.
Reducing ambiguity. Ambiguity of task instructions leads humans to align interpretations to their own prior beliefs (Heath and Tversky, 1991); this may lead to misinterpretation and results which do not reflect the intended interaction with explanations. Ambiguity may also be present in other parts of the evaluation setup. For example, Lamm et al. (2020) evaluate the effectiveness of explanations in helping humans detect model errors for open-domain QA, but the data they use contains questions where multiple answers can be true. Users may deem an answer to be correct or incorrect based on their understanding of the question, which makes the effect of explanations blurry. Removing ambiguous instances from the data can be a way of reducing such confounds.
Removing time constraints. Time constraints exacerbate reliance on system 1 processes, which leads to humans relying on belief biases. In crowdsourced evaluations, it is common practice to provide workers with enough time to perform tasks, but workers may have intrinsic motivations for performing tasks quickly. A major challenge for evaluation research with crowd workers is creating better incentives for engaging in system 2 processes, e.g. pay schemes which encourage workers to be more analytical and accurate (Bansal et al., 2019).
Include fictitious domains. Using data from domains about which subjects have no prior beliefs, e.g. fictitious domains, may be an efficient way of controlling for belief bias in some tasks. This strategy has been used outside of NLP (Poursabzi-Sangdeh et al., 2021; Lage et al., 2019; Slack et al., 2019), where subjects are asked to imagine alternative worlds such as scenarios involving aliens. In QA for example, one could introduce context-question pairs that describe facts about fictitious scenarios that sufficiently differ from human reality.

Conclusion
The main contribution of this paper is bringing the discussion of belief bias from psychology into the context of evaluating explainability methods in NLP. Belief bias is a phenomenon which plays a role in human decision making and which interacts with previous evaluations in a way which may affect the conclusions we draw from these paradigms. We provide an overview of belief bias, making a connection between findings in psychology and the field of NLP, and present a case study of evaluating explanations for BERT-based reading comprehension models. We show that introducing models of various quality and adversarial examples can help to account for belief bias, and that introducing such conditions affects the conclusions about which explainability method works better. Finally, we provide additional insights and ideas for how to account for belief bias effects in human evaluation.

Broader Impact Statement
The work presented here makes strides towards a better understanding of the interaction of humans with explanations of model decisions. We have highlighted a phenomenon studied in psychology in the hope that this opens the door to more NLP research involving a wider and more interdisciplinary understanding of humans, and the effect of explainability.
This study involved human participants recruited on the Mechanical Turk platform. No personally identifiable data was collected from the participants; they were made aware that the data would only be used for research, and they were not exposed to any emotionally traumatizing or offensive stimuli. We ensured a minimum $15 hourly wage.

A Experiment 1: Human Forward Prediction
Below we show the instructions provided to the participants, as well as an example of the saliency maps presented to participants for adversarial examples.
Instructions. Question-answering systems are a particular form of artificial intelligence. The task here is for you to learn to predict how the system answers questions. In other words, when in a bit, you are presented with questions, the task is not to provide the right answer, but to guess the answer the system provided. For each question, you will also see a context paragraph. The answer is a span of text in this paragraph. Instead of writing out the answer, you can simply mark the relevant span. If you want to select a new answer, please click reset answer, if you are ready to see the model answer, please click show answer. Note that your answer will lock at that time.
Raw Results. In our evaluation, we treat the first 15 data points of each participant as training; we therefore discard them from the main evaluation but show them in this section. Overall, we see that training mostly has either a positive effect or little effect. These scores can be seen in Table 3.

B Experiment 2: Best Model Selection
Below we show the instructions given to the participants, and more details about the qualitative analysis of the feedback we obtained.
Instructions. Question-answering (QA) systems are a particular form of artificial intelligence. We have trained two QA systems and have extracted the most important words the model uses to make its final decision. Based on these highlighted words, your task is to select the model that you think is more likely to perform best. Additionally, please write how the low-performing model fails and/or how it could be better (try to be detailed)

User Interface. An example instance, as shown to the participants, can be seen in Figure 4.
Qualitative analysis of feedback. In Table 4, we include a few examples of the sentences that were categorized using the qualitative codes. Unsurprisingly, once participants found a strategy for giving feedback, they mostly stuck to it.
After categorizing all the feedback into each category, we visualize the distribution per condition. This can be found in Figure 3. We find that for the HIGH vs LOW conditions, the distribution is very similar between gradients and integrated gradients. Many participants gave very generic feedback, for example by simply saying that "model A is better because it is correct, and model B is wrong". This was not surprising, as here the differences were supposed to be clear and it is likely most participants did not have to think too hard before making