Benchmarking Machine Reading Comprehension: A Psychological Perspective

Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond the model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on psychology as well as psychometrics, and summarizes it in terms of the prerequisites for benchmarking MRC. We conclude that future datasets should (i) evaluate the capability of the model for constructing a coherent and grounded representation to understand context-dependent situations and (ii) ensure substantive validity by shortcut-proof questions and explanation as a part of the task design.


Introduction
Evaluation of natural language understanding (NLU) is a long-standing goal in the field of artificial intelligence. Machine reading comprehension (MRC) is a task that tests the ability of a machine to read and understand unstructured text and could be the most suitable task for evaluating NLU because of its generic formulation (Chen, 2018). Recently, many large-scale datasets have been proposed, and deep learning systems have achieved human-level performance for some of these datasets.
However, analytical studies have shown that MRC models do not necessarily achieve human-level understanding. For example, Jia and Liang (2017) use manually crafted adversarial examples to show that successful systems are easily distracted. Sugawara et al. (2020) show that a significant proportion of already-solved questions remain solvable even after shuffling the words in a sentence or dropping content words. These studies demonstrate that we cannot explain what type of understanding is required by the datasets and is actually acquired by models. Although benchmarking MRC concerns the intent behind questions and is critical for testing hypotheses from a top-down viewpoint (Bender and Koller, 2020), its theoretical foundation is poorly investigated in the literature.
In this position paper, we examine the prerequisites for benchmarking MRC based on the following two questions: (i) What does reading comprehension involve? (ii) How can we evaluate it? Our motivation is to provide a theoretical basis for the creation of MRC datasets. As Gilpin et al. (2018) indicate, interpreting the internals of a system pertains only to the system's architecture and is insufficient for explaining how the task is accomplished. That is, even if the internals of models can be interpreted, we cannot explain what is measured by the datasets. Therefore, our study focuses on the explainability of the task rather than the interpretability of models.
We first overview MRC and review the analytical literature indicating that existing datasets might fail to correctly evaluate their intended behavior (Section 2). Subsequently, we present a psychological study of human reading comprehension in Section 3 to answer the what question. We argue that the concept of representation levels can serve as a conceptual hierarchy for organizing the technologies in MRC. Section 4 focuses on answering the how question. Here, we implement psychometrics to analyze the prerequisites for the task design of MRC. Furthermore, we introduce the concept of construct validity, which emphasizes validating the interpretation of the task's outcome. Finally, in Section 5, we explain how the proposed concepts translate into practical approaches, highlighting potential future directions for the advancement of MRC. Regarding the what question, we indicate that datasets should evaluate the capability of the situation model, which refers to the construction of a coherent and grounded representation of text based on human understanding. Regarding the how question, we argue that among the important aspects of construct validity, substantive validity must be ensured, which requires verification of the internal mechanism of comprehension. Table 1 provides an overview of the perspectives taken in this paper. Our answers and suggestions to the what and how questions are summarized as follows: (1) Reading comprehension is the process of creating a situation model that best explains the given texts and the reader's background knowledge. The situation model should be the next focal point in future datasets for benchmarking human-level reading comprehension. (2) To evaluate reading comprehension correctly, the task needs to provide a rubric (scoring guide) that sufficiently covers the aspects of construct validity. In particular, substantive validity should be ensured by creating shortcut-proof questions and by designing a task formulation that is explanatory in itself.

Table 1: Perspectives taken in this paper. What does reading comprehension involve? Representation levels in human reading comprehension: (A) surface structure, (B) textbase, and (C) situation model; respectively, (A) linguistic-level sentence understanding, (B) comprehensiveness of skills for inter-sentence understanding, and (C) evaluation of coherent representation grounded to non-textual information, with dependence of context on defeasibility and novelty, and grounding to non-textual information with a long passage. How can we evaluate reading comprehension? (1) Providing a rubric that covers the aspects of construct validity, and (2) creating shortcut-proof questions by filtering and ablation, and designing a task for validating the internal process.
Task Overview

Task Variations and Existing Datasets
MRC is a task in which a machine is given a document (context) and answers questions based on that context. Burges (2013) provides a general definition of MRC: a machine comprehends a passage of text if, for any question regarding that text that can be answered correctly by a majority of native speakers, that machine can provide a string which those speakers would agree both answers that question and does not contain information irrelevant to it. We overview various aspects of the task along with representative datasets as follows. Existing datasets are listed in Appendix A.
Context Styles A context can be given in various forms with different lengths such as a single passage (MCTest (Richardson et al., 2013)), a set of passages (HotpotQA (Yang et al., 2018)), a longer document (CBT (Hill et al., 2016)), or open domain (Chen et al., 2017). In some datasets, a context includes non-textual information such as images (RecipeQA (Yagcioglu et al., 2018)).

Benchmarking Issues
In some datasets, machines have already reached human-level performance. However, Jia and Liang (2017) indicate that models can easily be fooled by the manual injection of distracting sentences. Their study revealed that questions simply gathered by crowdsourcing, without careful guidelines or constraints, are insufficient to evaluate precise language understanding. This argument is supported by further studies across a variety of datasets. For example, Min et al. (2018) find that more than 90% of the questions in SQuAD (Rajpurkar et al., 2016) require obtaining an answer from a single sentence despite being provided with a passage. Sugawara et al. (2018) show that large parts of twelve datasets are easily solved only by looking at the first few question tokens or attending to the similarity between the given questions and the context. Similarly, Feng et al. (2018) and Mudrakarta et al. (2018) demonstrate that models trained on SQuAD do not change their predictions even when the question tokens are partly dropped. Kaushik and Lipton (2018) also observe that question- and passage-only models perform well on some popular datasets. Min et al. (2019) and Chen and Durrett (2019) concurrently indicate that for multi-hop reasoning datasets, the questions are solvable with only a single paragraph and thus do not require multi-hop reasoning over multiple paragraphs. Zellers et al. (2019b) report that their dataset unintentionally contains stylistic biases in the answer options, embedded by a language-based model.
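These shortcut findings can be illustrated with a toy experiment (a hypothetical sketch, not code from any of the cited studies): a naive lexical-overlap heuristic keeps selecting the same sentence even after word order is destroyed, showing that such a solver never needed word order at all.

```python
import random

def overlap_answer(question, sentences):
    """Pick the sentence sharing the most words with the question."""
    q = set(question.lower().split())
    return max(sentences, key=lambda s: len(q & set(s.lower().split())))

def shuffle_words(sentence, rng):
    """Destroy word order while keeping the bag of words intact."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

sentences = [
    "The princess climbed down the south wall.",
    "Her mother was sleeping in the tower.",
    "She went into the forest at night.",
]
question = "Where did the princess climb down?"

rng = random.Random(0)
original = overlap_answer(question, sentences)
shuffled = [shuffle_words(s, rng) for s in sentences]
perturbed = overlap_answer(question, shuffled)

# The heuristic selects the same (now shuffled) sentence: word order never mattered.
print(sentences.index(original) == shuffled.index(perturbed))  # True
```

A dataset dominated by questions that such a heuristic answers correctly cannot, by itself, attest to comprehension beyond lexical matching.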
Overall, these investigations highlight a grave issue of the task design, i.e., even if the models achieve human-level accuracies, we cannot prove that they successfully perform reading comprehension. This issue may be attributed to the low interpretability of black-box neural network models. However, the problem is that we cannot explain what is measured by the datasets even if we can interpret the internals of models. We speculate that this benchmarking issue in MRC can be attributed to the following two points: (i) we do not have a comprehensive theoretical basis of reading comprehension for specifying what we should ask (Section 3) and (ii) we do not have a well-established methodology for creating a dataset and for analyzing a model based on it (Section 4).[1] In the remainder of this paper, we argue that these issues can be addressed by using insights from the psychological study of reading comprehension and by implementing psychometric means of validation.

Reading Comprehension from Psychology to MRC

Computational Model in Psychology
Human text comprehension has been studied in psychology for a long time (Kintsch and Rawson, 2005; Graesser et al., 1994; Kintsch, 1988). Connectionist and computational architectures have been proposed for such comprehension, including mechanisms pertinent to knowledge activation and memory storage. Among the computational models, the construction-integration (CI) model is the most influential and provides a strong foundation for the field (McNamara and Magliano, 2009). The CI model assumes three different representation levels as follows:
• Surface structure is the linguistic information of particular words, phrases, and syntax obtained by decoding the raw textual input.
• Textbase is a set of propositions in the text, where the propositions are locally connected by inferences (microstructure).
• Situation model is a situational and coherent mental representation in which the propositions are globally connected (macrostructure), and it is often grounded to not only texts but also to sounds, images, and background information.
The CI model first decodes textual information (i.e., the surface structure) from the raw textual input, then creates the propositions (i.e., the textbase) and their local connections, occasionally using the reader's knowledge (construction), and finally constructs a coherent representation (i.e., the situation model) that is organized along five dimensions (time, space, causation, intentionality, and objects; Zwaan and Radvansky, 1998) and provides a global description of the events (integration). These steps are not exclusive; propositions are iteratively updated in accordance with the surrounding ones with which they are linked. Although the definition of successful text comprehension can vary, Hernández-Orallo (2017) indicates that comprehension implies the process of creating (or searching for) a situation model that best explains the given text and the reader's background knowledge (Zwaan and Radvansky, 1998). We use this definition to highlight that the creation of a situation model plays a vital role in human reading comprehension.

[1] These two issues loosely correspond to the plausibility and faithfulness of explanation (Jacovi and Goldberg, 2020). Plausibility is linked to what we expect as an explanation, whereas faithfulness refers to how accurately we explain a model's reasoning process.
Our aim in this section is to provide a basis for explaining what reading comprehension is, which requires terms for explanation. In the computational model above, the representation levels appear to be useful for organizing such terms. We ground existing NLP technologies and tasks to different representation levels in the next section.

Skill Hierarchy for MRC
Here, we associate the existing NLP tasks with the three representation levels introduced above. The biggest advantage of MRC is its general formulation, which makes it the most general task for evaluating NLU. This emphasizes the importance of the requirement of various skills in MRC, which can serve as the units for the explanation of reading comprehension. Therefore, our motivation is to provide an overview of the skills as a hierarchical taxonomy and to highlight the missing aspects in existing MRC datasets that are required for comprehensively covering the representation levels.
Existing Taxonomies We first provide a brief overview of the existing taxonomies of skills in NLU tasks. For recognizing textual entailment (Dagan et al., 2006), several studies present classifications of reasoning and commonsense knowledge (Bentivogli et al., 2010; Sammons et al., 2010; LoBue and Yates, 2011), and a similar classification has been proposed for MRC (Clark et al., 2018). A limitation of the latter is that the proposed set of knowledge and inference is limited to the domain of elementary-level science. Although some existing datasets for MRC have their own classifications of skills, they are coarse and cover only a limited range of typical NLP tasks (e.g., word matching and paraphrasing). In contrast, for a more generalizable definition, Sugawara et al. (2017) propose a set of 13 skills for MRC. Rogers et al. (2020) pursue this direction by proposing a set of questions with eight question types. In addition, Schlegel et al. (2020) propose an annotation schema to investigate requisite knowledge and reasoning. Dunietz et al. (2020) propose a template of understanding that consists of spatial, temporal, causal, and motivational questions to evaluate precise understanding of narratives with reference to human text comprehension.
In what follows, we describe the three representation levels that basically follow the three representations of the CI model but are modified for MRC. The three levels are shown in Figure 1. We emphasize that we do not intend to create exhaustive and rigid definitions of skills. Rather, we aim to place them in a hierarchical organization, which can serve as a foundation to highlight the missing aspects in the current MRC.
Surface Structure This level broadly covers the linguistic information and its semantic meaning, which can be derived from the raw textual input. Although these features form a proposition according to psychology, this level should be viewed as sentence-level semantic representation in computational linguistics. It includes part-of-speech (POS) tagging, syntactic parsing, dependency parsing, punctuation recognition, named entity recognition (NER), and semantic role labeling (SRL). Although these basic tasks can be accomplished by some recent pretraining-based neural language models (Liu et al., 2019), they are hardly required in NLU tasks including MRC. In the natural language inference task, McCoy et al. (2019) indicate that existing datasets (e.g., Bowman et al. (2015)) may fail to elucidate the syntactic understanding of given sentences. Although it is not obvious that these basic tasks should be included in MRC, and it is not easy to circumscribe linguistic knowledge from concrete and abstract knowledge (Zaenen et al., 2005; Manning, 2006), we should always care about the capabilities of basic tasks (e.g., the use of checklists).

Figure 1: Skill hierarchy for MRC. (C) Situation model: construct the global structure of propositions; skills include creating a coherent representation and grounding it to other media. (B) Textbase: construct the local relations of propositions; skills include recognizing relations between sentences, such as coreference resolution, knowledge reasoning, and understanding discourse relations. (A) Surface structure: create propositions from the textual input; skills include syntactic and dependency parsing, POS tagging, SRL, and NER.

Textbase This level covers the local relations of propositions in the computational model of reading comprehension. In the context of NLP, it refers to various types of relations linked between sentences. These relations include not only the typical relations between sentences (discourse relations) but also the links between entities. Consequently, this level includes coreference resolution, causality, temporal relations, spatial relations, text structuring relations, logical reasoning, knowledge reasoning, commonsense reasoning, and mathematical reasoning. We also include multi-hop reasoning (Welbl et al., 2018) at this level because it does not necessarily require a coherent global representation over a given context. For studying the generalizability of MRC, Fisch et al. (2019) propose a shared task featuring training and testing on multiple domains, and Khashabi et al. (2020) also find that training on multiple datasets leads to robust generalization. However, unless we ensure that datasets require various skills with sufficient coverage, it remains unclear whether we evaluate a model's transferability of the reading comprehension ability.
Situation Model This level targets the global structure of propositions in human reading comprehension. It includes a coherent and situational representation of a given context and its grounding to non-textual information. A coherent representation has well-organized sentence-to-sentence transitions (Barzilay and Lapata, 2008), which are vital for using procedural and script knowledge (Schank and Abelson, 1977). This level also includes characters' goals and plans, a meta perspective including the author's intent and attitude, thematic understanding, and grounding to other media. Most existing MRC datasets seem to struggle to target the situation model. We discuss this further in Section 5.1.
Figure 2: An example passage with questions Q1 to Q3. Passage: The princess climbed out the window of the high tower and climbed down the south wall when her mother was sleeping. She wandered out a good way. Finally, she went into the forest where there are no electric poles.

Example The representation levels in the example shown in Figure 2 are described as follows. Q1 is at the surface-structure level, where a reader only needs to understand the subject of the first event. We expect that Q2 requires understanding of relations among the described entities and events at the textbase level; the reader may need to understand whom she refers to using coreference resolution. Escaping in Q2 also requires the reader's commonsense to associate it with the first event. However, the reader might be able to answer this question only by looking for a place (specified by where) described in the passage, which necessitates validating that the question correctly evaluates the understanding of the described events. Q3 is an example that requires imagining a different situation at the situation-model level, which could be further associated with a grounding question such as which figure best depicts the given passage?
In summary, we indicate that the following features might be missing in existing datasets:
• Considering the capability to acquire basic understanding of the linguistic-level information.
• Ensuring that the questions comprehensively specify and evaluate textbase-level skills.
• Evaluating the capability of the situation model in which propositions are coherently organized and are grounded to non-textual information.
Should MRC models mimic human text comprehension? In this paper, we do not argue that MRC models should mimic human text comprehension. However, when we design an NLU task and create datasets for testing human-like linguistic generalization, we can refer to the aforementioned features to frame the intended behavior to evaluate in the task. As Linzen (2020) discusses, the task design is orthogonal to how the intended behavior is realized at the implementation level (Marr, 1982).

MRC on Psychometrics
In this section, we provide a theoretical foundation for the evaluation of MRC models. When MRC measures the capability of reading comprehension, validation of the measurement is crucial to obtain a reliable and useful explanation. Therefore, we focus on psychometrics-a field of study concerned with the assessment of the quality of psychological measurement (Furr, 2018). We expect that the insights obtained from psychometrics can facilitate a better task design. In Section 4.1, we first review the concept of validity in psychometrics. Subsequently, in Section 4.2, we examine the aspects that correspond to construct validity in MRC and then indicate the prerequisites for verifying the intended explanation of MRC in its task design.

Construct Validity in Psychometrics
According to psychometrics, construct validity is necessary to validate the interpretation of the outcomes of psychological experiments. Messick (1995) reports that construct validity consists of the six aspects shown in Table 2.
In the design of educational and psychological measurement, these aspects collectively provide verification questions that need to be answered to justify the interpretation and use of test scores. In this sense, construct validation can be viewed as an empirical evaluation of the meaning and consequence of measurement. Given that MRC is intended to capture the reading comprehension ability, task designers need to be aware of these validity aspects. Otherwise, users of the task cannot justify the score interpretation, i.e., it cannot be confirmed that successful systems actually perform the intended reading comprehension. Table 2 also lists the MRC features corresponding to the six aspects of construct validity. In what follows, we elaborate on these correspondences and discuss the missing aspects that are needed to achieve the construct validity of current MRC.

Construct Validity in MRC
Content Aspect As discussed in Section 3, sufficiently covering the skills across all the representation levels is an important requirement of MRC. It may be desirable that an MRC model is simultaneously evaluated on various skill-oriented examples.
Substantive Aspect This aspect appraises the evidence for the consistency of model behavior. We consider that this is the most important aspect for explaining reading comprehension, a process that subsumes various implicit and complex steps. To obtain a consistent response from an MRC system, it is necessary to ensure that the questions correctly assess the internal steps in the process of reading comprehension. However, as stated in Section 2.2, most existing datasets fail to verify that a question is solved by using an intended skill, which implies that it cannot be proved that a successful system can actually perform intended comprehension.
Structural Aspect Another issue with current MRC datasets is that they provide only simple accuracy as a metric. Given that the substantive aspect necessitates the evaluation of the internal process of reading comprehension, the structure of the metrics needs to reflect it. Nonetheless, a few studies have attempted to provide a dataset with multiple metrics. For example, Yang et al. (2018) not only ask for the answers to questions but also provide sentence-level supporting facts. This metric can also evaluate the process of multi-hop reasoning whenever the supporting sentences need to be understood for answering a question. Therefore, we need to consider both the substantive and structural aspects.
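As a sketch of what a structurally richer metric could look like (a simplified exact-match variant in the spirit of HotpotQA's joint evaluation, not its actual implementation, which also reports F1), a question can be credited only when both the answer and the supporting facts are correct:

```python
def joint_em(pred_answer, gold_answer, pred_facts, gold_facts):
    """Simplified joint exact match: credit only when both the answer
    string and the set of supporting sentence ids are exactly right."""
    answer_ok = pred_answer.strip().lower() == gold_answer.strip().lower()
    facts_ok = set(pred_facts) == set(gold_facts)
    return float(answer_ok and facts_ok)

# A correct answer found without the right evidence earns no credit.
print(joint_em("Paris", "paris", [2, 4], [2, 4]))  # 1.0
print(joint_em("Paris", "paris", [2], [2, 4]))     # 0.0
```

Such a metric ties the score structure to the intermediate reasoning steps rather than to the final answer alone.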
Generalizability Aspect The generalizability of MRC can be understood in terms of the reliability of metrics and the reproducibility of findings. For the reliability of metrics, we need to take care of the reliability of both gold answers and model predictions. Regarding the accuracy of answers, the performance of the model becomes unreliable when the answers are unintentionally ambiguous or impractical. Because the gold answers in most datasets are decided only by the majority vote of crowd workers, the ambiguity of the answers is not considered.

Table 2: Aspects of construct validity (Messick, 1995) and their correspondence in MRC.

1. Content. Definition in psychometrics: evidence of content relevance, representativeness, and technical quality. Correspondence in MRC: questions require reading comprehension skills with sufficient coverage and representativeness over the representation levels.

2. Substantive. Definition: theoretical rationales for the observed consistencies in the test responses, including the task performance of models. MRC: questions correctly evaluate the intended intermediate process of reading comprehension and provide rationales to the interpreters.

3. Structural. Definition: fidelity of the scoring structure to the structure of the construct domain at issue. MRC: correspondence between the task structure and the score structure.

4. Generalizability. Definition: extent to which score properties and interpretations can be generalized to and across population groups, settings, and tasks. MRC: reliability of test scores in correct answers and model predictions, and applicability to other datasets and models.

5. External. Definition: extent to which the assessment scores' relationships with other measures and non-assessment behaviors reflect the expected relations. MRC: comparison of the performance of MRC with that of other NLU tasks and measurements.

6. Consequential. Definition: value implications of score interpretation as a basis for the consequences of test use, especially regarding the sources of invalidity related to issues of bias, fairness, and distributive justice. MRC: considering model vulnerabilities to adversarial attacks and the social biases of models and datasets to ensure the fairness of model outputs.

Summary: Design of Rubric Given the validity aspects, our suggestion is to design a rubric (a scoring guide used in education) of what reading comprehension we expect a dataset to evaluate; this helps to inspect detailed strengths and weaknesses of models that cannot be obtained by simple accuracy alone. The rubric should not only cover various linguistic phenomena (the content aspect) but also involve different levels of intermediate evaluation in the reading comprehension process (the substantive and structural aspects) as well as stress testing with adversarial attacks (the consequential aspect). The rubric has a similar motivation to dataset statements (Bender and Friedman, 2018; Gebru et al., 2018); however, taking the validity aspects into account would improve its substance.
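As a minimal sketch of how such a rubric could be operationalized (the skill tags and the aggregation scheme here are illustrative assumptions, not a proposal from the psychometrics literature), per-question skill annotations can be rolled up into per-skill accuracies instead of a single number:

```python
from collections import defaultdict

def rubric_report(results):
    """Aggregate per-question correctness into per-skill accuracies.
    `results` is a list of (skill_tags, correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for skills, correct in results:
        for skill in skills:
            totals[skill] += 1
            hits[skill] += int(correct)
    return {s: hits[s] / totals[s] for s in totals}

results = [
    (["coreference", "textbase"], True),
    (["coreference"], False),
    (["situation-model"], True),
]
print(rubric_report(results))
# {'coreference': 0.5, 'textbase': 1.0, 'situation-model': 1.0}
```

A report of this shape exposes, for example, a model that scores well overall while failing every coreference-tagged question, which a single accuracy figure would hide.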

Future Directions
This section discusses future potential directions toward answering the what and how questions in Sections 3 and 4. In particular, we infer that the situation model and substantive validity are critical for benchmarking human-level MRC.

What Question: Situation Model
As mentioned in Section 3, existing datasets fail to fully assess the ability to create the situation model. As a future direction, we suggest that the task should deal with two features of the situation model: context dependency and grounding.

Context-dependent Situations
A vital feature of the situation model is that it is conditioned on a given text, i.e., a representation is constructed distinctively depending on the given context. We elaborate on this by discussing two key features: defeasibility and novelty.
Defeasibility The defeasibility of a constructed representation implies that a reader can modify and revise it according to newly acquired information (Davis and Marcus, 2015; Schubert, 2015). The defeasibility of NLU has been tackled in the tasks of if-then reasoning (Sap et al., 2019a), abductive reasoning (Bhagavatula et al., 2020), and counterfactual reasoning (Qin et al., 2019), as well as with contrast sets (Gardner et al., 2020). A possible approach in MRC is to ask questions against a set of modified passages describing slightly different situations, where the same question can lead to different conclusions.
Novelty An example showing the importance of contextual novelty is Could a crocodile run a steeplechase? by Levesque (2014). This question poses a novel situation in which the solver needs to combine multiple pieces of commonsense knowledge to derive the correct answer. If only non-fiction documents, such as newspaper and Wikipedia articles, are used, some questions require only reasoning over facts already available in web-based corpora. Fictional narratives may be a better source for creating questions about novel situations.

Grounding to Other Media
In MRC, grounding texts to non-textual information is not yet fully explored, with a few exceptions such as the dataset of Kembhavi et al. We might also need to account for the scope of grounding (Bisk et al., 2020), i.e., ultimately understanding human language in a social context beyond simply associating texts with perceptual information.

How Question: Substantive Validity
Substantive validity requires us to ensure that the questions correctly assess the internal steps of reading comprehension. We discuss two approaches for this challenge: creating shortcut-proof questions and ensuring the explanation by design.

Shortcut-proof Questions
Gururangan et al. (2018) report that crowdsourced datasets contain annotation artifacts that allow partial-input models to perform well. Zellers et al. (2018) propose a model-based adversarial filtering method that iteratively trains an ensemble of stylistic classifiers and uses them to filter out questions. Sakaguchi et al. (2020) also propose filtering methods based on both machines and humans to alleviate dataset-specific and word-association biases. However, a major issue is the inability to discern knowledge from bias in a closed domain. When the domain is equal to a dataset, patterns that are valid only in that domain are called dataset-specific biases (or annotation artifacts in the labeled data). When the domain covers larger corpora, such patterns (e.g., frequency) are called word-association biases. When the domain includes everyday experience, the patterns are called commonsense. However, as mentioned in Section 5.1, commonsense knowledge can be defeasible, which implies that the knowledge can be false in unusual situations. In contrast, when the domain is our real world, indefeasible patterns are called factual knowledge. Therefore, the distinction between bias and knowledge depends on where the pattern is recognized. This means that a dataset should be created such that it evaluates reasoning on the intended knowledge. For example, to test defeasible reasoning, we must filter out questions that are solvable by usual commonsense alone. If we want to investigate the reading comprehension ability without depending on factual knowledge, we can consider counterfactual or fictional situations.
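A minimal sketch of such filtering for multiple-choice questions (a deliberately simplified stand-in for the iterative, ensemble-based methods cited above; the length cue and the data are hypothetical):

```python
def answer_only_cue(options):
    """A deliberately shallow answer-only heuristic: guess the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))

def filter_shortcuts(dataset):
    """Keep only questions the answer-only heuristic gets wrong, so that
    surviving items cannot be solved by this cue alone."""
    return [ex for ex in dataset if answer_only_cue(ex["options"]) != ex["gold"]]

dataset = [
    {"options": ["maybe", "no"], "gold": 1},                   # cue fails -> keep
    {"options": ["a long detailed answer", "no"], "gold": 0},  # cue succeeds -> drop
]
kept = filter_shortcuts(dataset)
print(len(kept))  # 1
```

Real filtering methods replace the hand-written cue with trained classifiers and iterate, but the principle is the same: an item survives only if the shortcut solver fails on it.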

Identifying Requisite Skills by Ablating Input Features

Another approach is to verify shortcut-proof questions by analyzing the human answerability of questions with respect to their key features. We speculate that if a question is still answerable by humans even after removing the intended features, the question does not require understanding of the ablated features (e.g., checking the necessity of resolving pronoun coreference after replacing pronouns with dummy nouns). Even if we cannot accurately identify all such necessary features, by identifying partial features in a sufficient number of questions, we can expect that the questions evaluate the corresponding intended skill. In a similar vein, Geirhos et al. (2020) argue that a dataset is useful only if it is a good proxy for the underlying ability one is actually interested in.
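The pronoun-replacement check mentioned above could be implemented as a simple transform (a hedged sketch; the pronoun list and dummy token are illustrative): if annotators can still answer the question on the ablated passage, the question did not require coreference resolution.

```python
import re

# Illustrative, non-exhaustive pronoun list.
PRONOUNS = ["she", "he", "her", "him", "they", "them", "it"]

def ablate_pronouns(text, dummy="somebody"):
    """Replace pronouns with a dummy noun to test whether a question
    can still be answered without resolving coreference."""
    pattern = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)
    return pattern.sub(dummy, text)

passage = "The princess escaped. She wandered into the forest."
print(ablate_pronouns(passage))
# The princess escaped. somebody wandered into the forest.
```

The same pattern extends to other features, e.g., masking discourse connectives to test whether discourse relations are actually required.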

Explanation by Design
Another approach for ensuring the substantive validity is to include explicit explanation in the task formulation. Although gathering human explanations is costly, the following approaches can facilitate the explicit verification of a model's understanding using a few test examples.
Generating Introspective Explanation Inoue et al. (2020) classify two types of explanation in text comprehension: justification explanation and introspective explanation. The justification explanation only provides a collection of supporting facts for making a certain decision, whereas the introspective explanation provides the derivation of the answer for making the decision, which can cover linguistic phenomena and commonsense knowledge not explicitly mentioned in the text. They annotate multi-hop reasoning questions with introspective explanation and propose a task that requires the derivation of the correct answer of a given question to improve the explainability. Rajani et al. (2019) collect human explanations for commonsense reasoning and improve the system's performance by modeling the generation of the explanation. Although we must take into account the faithfulness of explanation, asking for introspective explanations could be useful in inspecting the internal reasoning process, e.g., by extending the task formulation so that it includes auxiliary questions that consider the intermediate facts in a reasoning process. For example, before answering Q2 in Figure 2, a reader should be able to answer who escaped? and where did she escape from? at the surface-structure level.
Creating Dependency Between Questions Another approach for improving the substantive validity is to create dependency between questions by which answering them correctly involves answering some other questions correctly. For example, Dalvi et al. (2018) propose a dataset that requires a procedural understanding of scientific facts.
In their dataset, a set of questions corresponds to the steps of the entire process of a scientific phenomenon. Therefore, this set can be viewed as a single question that requires a complete understanding of the phenomenon. In CoQA (Reddy et al., 2019), questions often contain pronouns that refer back to nouns appearing in previous questions. Such mutually dependent questions could facilitate the explicit validation of models' understanding of given texts.
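A minimal sketch of scoring such dependent question sets (the all-or-nothing aggregation is an illustrative choice, not a scheme proposed by the cited datasets): a group of mutually dependent questions counts as understood only if every question in it is answered correctly.

```python
def consistency_score(groups):
    """Score mutually dependent question groups: a group earns credit
    only if every question in it is answered correctly."""
    solved = sum(all(correct for correct in group) for group in groups)
    return solved / len(groups)

# Three dependent groups; only the first is fully answered.
groups = [[True, True], [True, False], [False, False]]
print(consistency_score(groups))  # 0.3333...
```

Compared with per-question accuracy (here 3/6 = 0.5), the group score (1/3) penalizes a model that answers a step correctly while failing the steps it depends on.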

Conclusion
In this paper, we outlined current issues and future directions for benchmarking machine reading comprehension. We drew on the psychological study of reading comprehension to analyze what we should ask of reading comprehension, and on construct validity in psychometrics to analyze how we should correctly evaluate it. We concluded that future datasets should evaluate the capability of the situation model for understanding context-dependent situations and for grounding to non-textual information, and should ensure substantive validity by creating shortcut-proof questions and designing an explanatory task formulation.