Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations based only on end task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end performance, they struggle to support their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.


Introduction
Recent years have seen a surge of research activities toward commonsense reasoning in natural language understanding. Dozens of relevant, large-scale benchmark datasets have been developed, and online leaderboards encourage broad participation in solving them. In the last few years, extraordinary performance gains on these benchmarks have come from large-scale language models (LMs) pre-trained on massive amounts of online text (Peters et al., 2018; Radford et al., 2018a,b; Raffel et al., 2020; Brown et al., 2020). Today's best models can achieve impressive performance and have surpassed human performance in challenging language understanding tasks, including benchmarks for commonsense inference (Bowman et al., 2015; Zellers et al., 2018; Bhagavatula et al., 2020). This rapid period of growth and progress has been an undoubtedly exciting time for NLP.
Despite these exciting results, it is a subject of scrutiny whether these models have a deep understanding of the tasks they are applied to (Bender and Koller, 2020; Linzen, 2020). A key concern is widespread bias in language benchmarks leading to superficial correlations between context and class labels (Schwartz et al., 2017; Gururangan et al., 2018; Poliak et al., 2018), allowing systems to bypass reasoning and achieve artificially high performance (Niven and Kao, 2019; McCoy et al., 2019). Consequently, it remains unclear whether the problems are truly solved, and whether machines can perform verifiable reasoning as humans do.
In this work, we first introduce Tiered Reasoning for Intuitive Physics (TRIP), a benchmark targeting physical commonsense reasoning. TRIP poses a high-level end task for story plausibility classification, a common proxy task for commonsense reasoning problems (Roemmele et al., 2011; Mostafazadeh et al., 2016; Sap et al., 2019b; Bisk et al., 2020b). Notably, however, it includes dense annotations for each story capturing multiple tiers of reasoning beyond the end task. From these annotations, we propose a tiered evaluation, where given a pair of highly similar stories (differing only by one sentence which makes one of the stories implausible), systems must jointly identify (1) the plausible story, (2) a pair of conflicting sentences in the implausible story, and (3) the underlying physical states in those sentences causing the conflict. The goal of TRIP is to enable a systematic evaluation of machine coherence toward the end task prediction of plausibility. In particular, we evaluate whether a high-level plausibility prediction can be verified based on lower-level understanding, for example, physical state changes that would support the prediction.
We further present several baseline systems powered by large LMs. Our empirical results show that while large LMs can achieve high end task performance (up to 78% accuracy), they struggle to jointly support their predictions with the proper evidence (only up to 11% of examples supported with correct physical states and conflicting sentences). Consequently, the predictions from these powerful systems are overwhelmingly not accountable to their understanding of how the world works. The contributions of this work are the first-of-its-kind dataset TRIP and new metrics that facilitate quantitative evaluation of coherent reasoning in commonsense language understanding. Our detailed analysis of large LMs applied to this dataset demonstrates key disconnections between low-level and high-level predictions in the reasoning process. This dataset and our baseline results motivate future work to develop systems that are capable of verifiable language understanding and reasoning.

Tiered Reasoning for Intuitive Physics
Physical commonsense reasoning, also referred to as naïve physics (Davis and Marcus, 2015) or intuitive physics (Lake et al., 2017), has recently gained attention in the NLP community (Gao et al., 2016; Forbes and Choi, 2017; Mishra et al., 2018; Bosselut et al., 2018; Forbes et al., 2019; Bisk et al., 2020b). From a young age, humans possess commonsense knowledge and reasoning skills about a wide variety of physical phenomena, such as movement, rigidity, and balance (Bliss, 2008). This problem is consequently thought to be especially challenging for machines because physical commonsense is considered obvious to most humans, and suffers from reporting bias (Forbes and Choi, 2017). As NLP systems are typically trained only on written communications, it remains unclear whether they can learn this (Bisk et al., 2020a). We have developed a dataset in English to target this domain and shed more light on this question.

TRIP Dataset
Tiered Reasoning for Intuitive Physics (TRIP) is a benchmark for physical commonsense reasoning that provides traces of reasoning for an end task of plausibility prediction. The dataset consists of human-authored stories, such as those in Figure 1, describing sequences of concrete physical actions. Given two stories composed of individually plausible sentences and differing only by one sentence (i.e., Sentence 5), the proposed task is to determine which story is more plausible. To understand stories like these and make such a prediction, one must have knowledge of verb causality and preconditions, as well as rules of intuitive physics.

Plausible stories were crowd-sourced from Amazon Mechanical Turk. To convert each story into several implausible stories, we hired separate workers to each write a new sentence to replace a sentence in the original story, such that the new story after replacement is no longer realistic in the physical world. To ensure quality, these workers flagged stories which were incoherent or did not describe realistic actions. We eliminated those stories and performed a manual round of validation to remove any remaining bad stories and correct typos.

Controlled Data Curation
TRIP was carefully curated and restricted to support probing of reasoning abilities possessed by text classifiers. Compared to current benchmark trends, this dataset has the following unique properties.
Objectivity in physical commonsense. As commonsense knowledge differs between humans based on region, culture, and other factors (Davis, 2017), plausible reasoning tasks can become ambiguous and subjective, for example, in open-domain commonsense reasoning problems (Zhang et al., 2017; Bhagavatula et al., 2020). To address this issue, we directed story authors to write sentences involving concrete actions, which can be unambiguously visualized in the physical world, while avoiding mental actions such as to think or like. We limit stories to typical household happenings by directing annotators to write stories in one of six possible "rooms" seen in everyday life.
To further reduce subjectivity and block other confounding factors that may result from complex use of language, we encourage crowd workers to write sentences in a simple declarative form, typically starting with the agent of the story, followed by a verb, a direct object, and an optional indirect object. The simplicity of language use would additionally allow us to focus less on linguistic processing and semantic phenomena, and more on investigating machines' reasoning ability.
Plausibility in longer context. Many benchmarks for plausible reasoning only (or most frequently) provide one sentence of context, with similarly short choices to complete the context (Roemmele et al., 2011; Zellers et al., 2018; Bisk et al., 2020a). In TRIP, we imposed several restrictions to require reasoning over multiple sentences with associated physical state changes. First, we required annotators to write stories at least five sentences long. Further, when collecting new sentences to convert plausible stories into implausible stories, we required that the new sentence should be plausible in isolation, and only become implausible when considering the world state implied by other sentences in the story. This constraint encourages stories to be rich in interesting action dynamics rather than nonsense sentences such as "Mary fried eggs on the printer" or "Tom ate the spoon," which may be easier to recognize through distributional biases. As this new sentence can conflict with any other sentence(s) in the story, solving the task requires reasoning over the entire context.
Multi-tier annotation. To enable a systematic investigation of a system's reasoning process, we manually provided three levels of annotation. As shown in Figure 1, the first level is the end task label to indicate which of the two story choices are more plausible. By design, most implausible story choices have exactly one pair of conflicting sentences, e.g., Sentences 2 and 5 in the example. The second level of annotation identifies these sentences in each story. On a random set of 100 implausible stories from the training data, a second annotator labeled these pairs of sentences, reaching a near-perfect Cohen's κ (Cohen, 1960) of 0.929, supporting the objectivity of these labels. The third level justifies the implausibility with labels for the underlying physical states, giving a detailed account of the physical changes associated with each sentence. In our example, unplugging the phone in Sentence 2 causes it to lose power, while Sentence 5 requires that the phone is powered in order to ring.
In order to generate these rich annotations, we defined a space of 20 physical attributes (5 for humans, 15 for objects) which capture most conflicts found in the stories. This was collected in part from related attribute spaces proposed by Gao et al. (2016) and Bosselut et al. (2018), and chosen based on a random set of implausible training stories, specifically the nature of their conflicts and physical changes objects underwent during the stories. For each entity in each sentence in the dataset, we annotate the implied values of these attributes before (precondition) and after (effect) the events of the sentence take place. This step of the annotation was a substantial effort. Note that while relevant entities in each sentence are provided in the data for convenient evaluation, these can be fairly reliably extracted using the noun chunk parser from spaCy. To verify the quality of annotations, we measured inter-annotator agreement on a representative subset of 157 sentences from 31 stories in the training data, finding a substantial Cohen's κ of 0.7917. A detailed description of this annotation process can be found in Appendix A.

Table 1 lists the overall statistics of the resulting dataset. While this dataset is small by today's standards, our goal is depth, not breadth. Rather than training models on a surplus of data to simply achieve high accuracy on the end task, we aim to use our deep, multi-tiered annotations to probe the capability of NLP models to perform coherent reasoning toward the end task.
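As a side note, agreement statistics like the Cohen's κ values reported above are straightforward to compute; the sketch below is a minimal pure-Python implementation over a toy pair of binary label sequences (illustrative data, not drawn from TRIP).

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators label four items; they agree on three.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # -> 0.5
```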

Proposed Tasks
From the TRIP dataset, we propose several tiered tasks as shown in Figure 1. Together, these tasks form a human-interpretable reasoning process supported by a chain of evidence.
Physical state classification. From our physical state annotations, we propose two tasks for each sentence-entity pair in each story choice: precondition and effect state classification. For example, consider the entity potato in the sentence "John cut the cooked potato in half." First, we should predict that the potato was solid in order to be cut, i.e., the precondition label for the solidity attribute is true. Second, we should predict that the potato was in pieces as a result of being cut, i.e., the effect label for the in pieces attribute is true.
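One way to picture such annotations (the dictionary schema below is our own illustrative simplification, not the dataset's exact format):

```python
# Hypothetical encoding of one sentence-entity annotation: each physical
# attribute maps to a (precondition, effect) pair, where each label is
# True, False, or None (unknown / unchanged).
annotation = {
    "sentence": "John cut the cooked potato in half.",
    "entity": "potato",
    "states": {
        "solid":     (True, None),   # precondition: must be solid to be cut
        "in pieces": (None, True),   # effect: it is in pieces afterwards
    },
}

precondition_solid = annotation["states"]["solid"][0]
effect_in_pieces = annotation["states"]["in pieces"][1]
```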
Conflict detection. Next, we define the task of conflict detection as identifying a pair of sentences in the form S i → S j . S j is a breakpoint, i.e., the point where the story first becomes implausible given the context so far, while S i serves as evidence that explains the breakpoint, usually causing a conflicting world state. For example, in Figure 1, Sentence 5 is a breakpoint, while Sentence 2 is the evidence that explains why the story becomes implausible after Sentence 5. Note that it is possible that a story may have multiple pairs of conflicting sentences beyond the breakpoint and evidence pair. However, across the dataset, the average number of conflicting sentence pairs is only 1.2, so one conflicting sentence pair is a sufficient and simpler explanation for the conflict (albeit not exhaustive).
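As a toy illustration of this S_i → S_j structure, the sketch below detects the simplest kind of conflict: an earlier sentence's effect contradicting a later sentence's precondition for the same entity and attribute. The actual baseline learns conflict detection with a transformer rather than a hand-written rule.

```python
def find_conflict(states):
    """states: list (one entry per sentence, in story order) of dicts
    mapping (entity, attribute) -> (precondition, effect), with values
    in {True, False, None}. Returns the first (evidence, breakpoint)
    pair of 1-based sentence indices whose states contradict, else None."""
    for j, later in enumerate(states):
        for i in range(j):
            earlier = states[i]
            for key, (pre_j, _) in later.items():
                if key in earlier:
                    _, eff_i = earlier[key]
                    # Effect of S_i contradicts precondition of S_j.
                    if eff_i is not None and pre_j is not None and eff_i != pre_j:
                        return (i + 1, j + 1)
    return None

# Phone example: Sentence 2 unplugs the phone, Sentence 5 needs it powered.
story = [
    {},                                   # S1
    {("phone", "power"): (None, False)},  # S2: effect power = False
    {},                                   # S3
    {},                                   # S4
    {("phone", "power"): (True, None)},   # S5: precondition power = True
]
conflict = find_conflict(story)  # -> (2, 5)
```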
Story classification. Lastly, the end task is to determine which of two stories is the plausible one. This should be determined based on any conflicts detected within the two stories.

Benchmark Goals
It is important to note that while one can treat these tasks separately, the goal of this benchmark is to solve them jointly to form a coherent reasoning chain: physical state classification explains conflict detection, which further explains story classification. Unlike most existing benchmarks in this area, which assess language understanding ability through some high-level end tasks, the goal of our benchmark is to enable development of systems for interpretable and consistent reasoning toward language understanding. Our baseline models (Section 3) and evaluation metrics (Section 4.1) are developed to serve this purpose.
It is also worth noting that although data bias is an issue for high-level benchmark tasks where systems are not required to justify their predictions, we are not directly targeting this issue. Recent work has attempted to remove biases from benchmark data and thus prevent exploitation of them in performing high-level tasks (Zellers et al., 2018; Nie et al., 2020). In contrast, our framing of language understanding as being built from the ground up (i.e., from low-level to high-level tasks) provides systems with the proper supporting evidence toward high-level tasks, and thus can potentially mitigate some of the problems around data bias.

Figure 2 displays a high-level view of our proposed baseline system to solve TRIP. It individually embeds each sentence-entity pair in each story, classifies physical precondition and effect states, then identifies conflicting sentences from these. Given a pair of stories, it aggregates conflict predictions for each story to decide which is more plausible.

Module Implementations
Each module is implemented as a neural network. Here, we describe key details of the implementations.
Contextual Embedding. The Contextual Embedding module is implemented as a pre-trained, transformer-based language model. Generally, this module takes as input a sentence and the name of an entity from a story, following an entity-first input formulation (Gupta and Durrett, 2019), and outputs a dense, contextualized numerical representation.
Precondition and Effect Classifiers. The Precondition and Effect Classifiers are implemented as typical feedforward classification heads for contextual embeddings, with one precondition classifier and one effect classifier for each of the 20 physical attributes tracked in the dataset. Softmax is applied to the output for classification. Altogether, the predictions from these classifiers label physical states of each entity in each sentence of the story.
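A minimal sketch of these parallel classification heads, using NumPy as a stand-in for the actual trained feedforward layers (the 768-dimensional embedding size and the three-way true/false/unknown label space are assumptions; see Appendix A.1 for the actual label space):

```python
import numpy as np

rng = np.random.default_rng(0)
N_ATTRIBUTES, N_CLASSES, HIDDEN = 20, 3, 768  # classes: true/false/unknown

# One (weight, bias) pair per attribute for the precondition heads;
# the effect heads would be a second, identical set of parameters.
heads = [(rng.normal(size=(HIDDEN, N_CLASSES)) * 0.02,
          np.zeros(N_CLASSES)) for _ in range(N_ATTRIBUTES)]

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_states(embedding):
    """embedding: contextual embedding of one sentence-entity pair."""
    return np.stack([softmax(embedding @ W + b) for W, b in heads])

probs = classify_states(rng.normal(size=HIDDEN))  # shape (20, 3)
```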
Conflict Detector. For each entity and its predicted physical states over all sentences in a story, the Conflict Detector predicts whether there is some conflict in the entity's physical states, specifically flagging a pair of conflicting sentences through multi-label classification. We use another transformer for this module, but model the high-level sequence of sentences in a story rather than the low-level sequence of tokens in a sentence. For each sentence-entity pair, we input the contextual embedding, as well as the classification logits behind all physical state predictions. We apply an additional feedforward classification layer and sigmoid function to the generated hidden states in order to model the belief probability of each sentence conflicting with another sentence in the story.

Story choice prediction. Given any detected conflicts, we lastly select which of the two given stories is plausible. As each Conflict Detector output represents a belief that the physical states of an entity in a particular sentence conflict with that of another sentence, we can simply sum the negative outputs for each story and apply softmax to determine which story is least likely to have a conflict.
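This aggregation step can be sketched in a few lines of pure Python (the conflict beliefs below are made-up numbers for illustration):

```python
import math

def choose_plausible(conflict_scores_a, conflict_scores_b):
    """Each argument: per-(sentence, entity) conflict beliefs in [0, 1]
    for one story. Softmax over the negated sums yields a distribution
    over the two stories; the story least likely to contain a conflict
    is chosen as plausible."""
    sums = [sum(conflict_scores_a), sum(conflict_scores_b)]
    exps = [math.exp(-s) for s in sums]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Story A has weak conflict evidence, story B strong: A is chosen.
choice, probs = choose_plausible([0.1, 0.2], [0.9, 0.8])  # choice == 0
```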

Model Training
We train the architecture's parameters through gradient descent on the overall loss L = λ_p L_p + λ_f L_f + λ_c L_c + λ_s L_s, which combines individual cross-entropy losses L_p for precondition classification, L_f for effect classification, L_c for conflict detection, and L_s for story choice classification, each balanced by a respective weight, with λ_p + λ_f + λ_c + λ_s = 1.

Experiments
Using TRIP, we evaluate several variations of the proposed tiered system built on different pre-trained LMs. These models offer a range in design choices such as model complexity and size of pre-training data. We begin with an evaluation from the perspective of the end task, then take a detailed look at the lower-level tasks.

Evaluation Metrics
To enable a better understanding of machines' ability in coherent reasoning toward end task performance, we apply the following evaluation metrics.

Accuracy. The traditional metric of end task accuracy, i.e., the proportion of testing examples where plausible stories are correctly identified.

Consistency. The proportion of testing examples where not only the plausible story is correctly identified, but also the conflicting sentence pair for the implausible story. This demonstrates consistency with identified conflicts when reasoning about plausibility.

Verifiability. The proportion of testing examples where not only the plausible story and the conflicting sentence pair for the implausible story are correctly identified, but also the underlying physical states (i.e., preconditions and effects) that contribute to the conflict. This demonstrates that the detected conflict can be verified by a correct understanding of the underlying implausible change of physical states.
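Given per-example correctness flags for each tier, these metrics reduce to a few lines of code (the flags below are illustrative, not real system outputs):

```python
def tiered_metrics(examples):
    """examples: list of dicts with boolean fields 'story', 'conflict',
    and 'states' indicating whether each tier was predicted correctly.
    Returns (accuracy, consistency, verifiability)."""
    n = len(examples)
    accuracy = sum(e["story"] for e in examples) / n
    consistency = sum(e["story"] and e["conflict"] for e in examples) / n
    verifiability = sum(e["story"] and e["conflict"] and e["states"]
                        for e in examples) / n
    return accuracy, consistency, verifiability

flags = [
    {"story": True,  "conflict": True,  "states": True},
    {"story": True,  "conflict": True,  "states": False},
    {"story": True,  "conflict": False, "states": False},
    {"story": False, "conflict": False, "states": False},
]
a, b, c = tiered_metrics(flags)  # 0.75, 0.5, 0.25; note a >= b >= c
```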
It is worth noting that this notion of verifiability, although different, is motivated by the notion of verification in software engineering (Pierce, 1996). This term refers to determining whether a given software solution satisfies its architectural and design requirements, and is built from the correct sub-components. Along this line, our notion of verifiability can be seen as a method to evaluate whether a language understanding system's reasoning process is built up from the correct components.
Each successive metric probes deeper into the coherence of reasoning that supports the end task prediction. Consequently, if accuracy is a, consistency is b, and verifiability is c, then a ≥ b ≥ c. A system that reliably produces a coherent chain of reasoning will exhibit a ≈ b ≈ c.

Results
Recall that we consider four loss functions for training the tiered system: L_p for precondition classification, L_f for effect classification, L_c for conflicting sentence detection, and L_s for story choice classification. To investigate how each loss affects model performance, we train instances using several combinations of them. The results of this study on the validation set are listed in Table 2.
The role of end task supervision. In the first section of Table 2, we train the system jointly on all four loss functions. Here, we see low verifiability and consistency for all three LMs, while the end task accuracy is relatively high, reaching 78.3% when using BERT. When we omit the story classification loss in the second section, however, we see sharp gains in verifiability and consistency for all models, with ROBERTA jumping from 0.9% verifiability and 6.8% consistency to 10.6% and 22.4%, respectively. This comes at a slight cost of end task accuracy for BERT and ROBERTA.
This suggests that while fine-tuning systems based on a high-level classification loss targeting the end task can improve end task accuracy, it drastically reduces the interpretability of the underlying reasoning process. One potential explanation is that this loss drives the system to exploit spurious statistical cues in order to further increase end task accuracy. This motivates us to move away from using over-simplified end tasks to train and evaluate language understanding. In fact, if we fine-tune ROBERTA's contextual embedding directly on the end task of TRIP without intermediate classification layers, we can achieve up to 97% accuracy, but have no insight into the verifiability or consistency of the system. This raises questions about the validity of such a result.
Natural emergence of intermediate predictions. In the third and fourth sections of Table 2, we respectively omit the conflict detection loss and the state classification losses to explore whether conflicting sentences or physical states emerge naturally in the reasoning process. When omitting the conflict detection loss, all metrics degrade to near or below random performance. Clearly, conflict detection is not implicitly learned from the downstream story classification loss, and since story choice classification directly depends on the conflict detection output, end task accuracy drops as well.
Meanwhile, when omitting the physical state classification loss, verifiability unsurprisingly drops to zero, but high accuracy on the end task can still be achieved by all models (up to 75.2%). Notably, this suggests that reasonable supporting evidence is not required in order to achieve high accuracy on the end task. This casts further doubt on whether existing state-of-the-art results on other commonsense language understanding benchmarks reflect any kind of coherent reasoning beyond end classification tasks which over-simplify the problem.
In Table 3, we present the testing results for the best loss function configuration of the system, i.e., omitting story choice classification loss. Compared to the validation set results in Table 2, we see slight drops in consistency and verifiability, further demonstrating the difficulty of this problem.

Analysis
Given the poor performance along our proposed metrics, we next consider the connections between the tiered tasks, and what goes wrong in unverifiable end task instances. We focus our analysis on the systems achieving the highest verifiability on the validation set in Section 4.2.
Failure mode distribution. Figure 3 provides a detailed breakdown of the combinations of failure modes on the validation set. Of the 73.6% of validation instances that are classified correctly on the end task, almost half of these (31.4% overall) are entirely unverified, with incorrect physical states and conflicts predicted by the system. Similarly,  of the 26.4% of instances with incorrect end task predictions, about half (13% overall) have incorrect physical state and conflict predictions. Meanwhile, a combined 31.1% of instances correctly predict physical states in the conflicting sentences of the implausible story, but fail to detect a conflict in those sentences (19.9% are correct at the end task, while 11.2% are not). These instances, represented by orange wedges in the graph, are a significant disconnect in the reasoning process.
Low-level task performance. To further investigate this disconnect, we examined system performance from the perspective of physical state classification and conflict detection. First, Table 4 lists the validation metrics for our best baselines on the tasks of precondition and effect classification (by sentence-entity pair), as well as conflicting sentence detection (by end task instance). Across the board, we find reasonable performance on all tasks. The best performing baseline from Table 2 is trained using loss functions for both physical state classification and conflict detection. Given this configuration, we further examined how each task is learned. Figure 4 shows training curves for the loss functions of physical state classification (averaged for precondition and effect), conflicting sentence detection, and story choice classification. Notably, though story choice classification is not used as a training objective, this end task is learned fairly well (albeit with overfitting), with training and validation losses generally decreasing through training. This shows that learning to reason from the lower-level tasks is successful to some degree. However, the lower-level tasks appear challenging to learn. For physical state classification, losses decrease steadily, but slowly. For conflict detection, the losses also decrease slowly, and the model begins overfitting the training data, perhaps indicating a need for more training data at this challenging step. Future work may consider automatic data augmentation techniques to resolve this.

Figure 5: Contribution of correct ROBERTA-predicted physical states to consistency evaluation for selected attributes. The macro-F1 score of precondition and effect predictions is shown by blue stars. Among all correctly predicted states (for both effects and preconditions), the bar regions indicate whether these states appear in successfully detected conflicting sentences.
Connecting states to conflicts. To dig deeper into the connection between physical states and plausibility conflicts, we next examined correct physical state predictions by attribute in Figure 5.
In the graph, we indicate the percentage of predictions supporting a successfully detected conflict, which may be interpreted as a utility measure of each attribute toward conflict detection. We find that some attributes, like whether an electrical object is running, rarely contribute to successful conflict detections (only 26.1%) despite having reasonably high F1 score (0.69). Other attributes, like wet, are more likely to appear in successful conflict detections when predicted correctly, even though their overall classification performance is lower. This provides strong insights for targeted improvement, for example, to better take advantage of lower-level predictions toward high-level tasks.
Sample system outputs. Figure 6 presents sample outputs from the tiered ROBERTA system. In Example (a), the prediction is entirely verifiable.

The system correctly chooses the plausible story, identifies Sentences 4 and 5 as the conflicting sentences in the implausible story, and even predicts that the box is in pieces after Sentence 4, and thus cannot become open in Sentence 5. In Example (b), the prediction is consistent but unverifiable, as the system identifies a conflict between Sentences 1 and 2, but cannot support the conflict with correct underlying physical states in either sentence. Although some relevant attributes are identified for the breakpoint sentence, e.g., power and running, they are not quite right. Meanwhile, no states are predicted for the evidence sentence.


Related Work

Zhang et al., 2019) in order to enable knowledge-supported language understanding and on-the-fly explanation. Different from these efforts, this paper enables direct training and evaluation of consistent and verifiable language inference by providing a dataset that makes explicit the underlying evidence chains behind a high-level text classification task.

Conclusion and Discussion
In this work, we proposed TRIP, a tiered benchmark dataset for physical commonsense reasoning posing a new challenge of jointly solving low-level to high-level tasks to form a coherent reasoning process. We experimented with several variations of tiered systems to solve the tasks. Our results show that in many cases, supervising large LMs based on high-level classification tasks in order to learn commonsense language understanding leads to inconsistent and unverifiable reasoning, and inability to capture intermediate evidence toward the end task. Instead, we should train systems to jointly incorporate multiple types of lower-level evidence to solve reasoning tasks coherently. Our detailed analysis of results offers strong intuition for future progress toward this goal. As such, TRIP and our baselines provide an important first step toward verifiable, human-aligned commonsense language understanding, and a direction for development of AI systems in this area.

Broader impact. We use physical commonsense reasoning as an example in this work, but expect that a similar approach can apply to many aspects of language understanding. Our results have shown that a new challenge for the future will be to build machines that can reason logically and coherently, similar to what we expect from human reasoning. As these machines ultimately will work with humans, such alignment in reasoning is critical, as it will improve accountability and transparency in human-machine enterprise.

A Physical State Annotations
To collect our physical state annotations, we defined a space of 20 physical attributes (5 for humans, 15 for objects) which capture most conflicts found in the stories, collected in part from related attribute spaces proposed by Gao et al. (2016) and Bosselut et al. (2018). For humans, we track location, hygiene, and whether a human is conscious, dressed, or wet. For objects, we consider location and whether or not an object exists, is clean, connected to power, functional, in pieces, wet, open, hot, solid, occupied (i.e., containing another object), running (i.e., turned on), movable, mixed, or edible.
The values of these attributes each represent directions of physical state change (e.g., attribute became true or attribute became false), as listed in Appendix A.1. In the training data, we manually labeled each entity in the sentence with these attributes and values. For the other partitions, we used a semi-automatic approach described in Appendix A.2.

A.1 Physical Annotation Label Space
When labeling entities for directions of physical state changes in sentences, we adopted the label space in Table 5. For predicting preconditions and effects of non-location attributes as done in this work, it is straightforward to collapse this space into true, false, or unknown for each. For human location labels, we use the full label space for predicting both preconditions and effects for simplicity. Meanwhile, for object location labels, we simplify the problem by mapping them to smaller precondition and effect label spaces. While this does not significantly affect verifiability, this should be expanded in a full solution for better interpretability. For more detailed explanations, future work may consider tracking spans of text describing entity locations along the lines of Amini et al. (2020).

A.2 Completing Physical State Annotations
To expand our manual physical state annotations to the validation and testing data, we used the existing annotations to train classifiers to predict values for each attribute given a sentence-noun pair. First, each story was broken down into all possible sentence-noun pairs, using spaCy (https://spacy.io/) to identify noun phrases. These sentence-noun pairs were passed into the physical state classifier, implemented as 20 parallel branches of ROBERTA, one for each physical attribute, as shown in Figure 7. For efficiency, we use the pre-trained DISTILROBERTA BASE parameters (82M), distilled from the ROBERTA BASE of Liu et al. (2019) with a small performance reduction (Sanh et al., 2019). Using this module, we generated candidate physical state annotations for the remaining data, then manually revised them. As this work was completed by a different annotator from the one who annotated the training data, we measured inter-annotator agreement on a representative subset of 157 sentences from 31 stories in the training data, finding a substantial Cohen's κ (Cohen, 1960) of 0.7917.
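The pair enumeration step can be sketched as follows, assuming the noun phrases have already been extracted (spaCy performs that extraction in our pipeline); the function and variable names are illustrative:

```python
from itertools import product

def sentence_noun_pairs(sentences, noun_phrases):
    """Enumerate all sentence-noun pairs for one story.

    `noun_phrases` stands in for the noun phrases spaCy would extract;
    the extraction itself is omitted here for brevity.
    """
    return list(product(sentences, noun_phrases))

story = ["Ann unplugged the toaster.", "Ann toasted the bread."]
nouns = ["Ann", "toaster", "bread"]
pairs = sentence_noun_pairs(story, nouns)  # 2 sentences x 3 nouns = 6 pairs
```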

B Model Implementation Details
Each module in our tiered systems is implemented as some kind of neural network architecture. Here, we describe low-level details of the implementations.
Contextual Embedding. The Contextual Embedding module is implemented as a pre-trained transformer language model. Generally, this module takes as input a sentence and the name of an entity from a story, and outputs a dense numerical representation. We follow Gupta and Durrett (2019) in using an entity-first input to the language model to generate entity-centric embeddings. While there are some model-specific variations in special tokens, given an entity e and a sentence t_1, t_2, ..., t_n, we structure the input sequence as "[CLS] e [SEP] t_1 t_2 ... t_n [SEP]", where [CLS] is a special token meant for input to classification layers, and [SEP] is a special separator token for multi-text inputs.
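A minimal sketch of this input construction follows; in practice the model's tokenizer inserts the special tokens itself, and their exact form is model-specific:

```python
def entity_first_input(entity, sentence):
    """Build the entity-first input string for the contextual
    embedding module, making the [CLS]/[SEP] structure explicit."""
    return f"[CLS] {entity} [SEP] {sentence} [SEP]"

seq = entity_first_input("toaster", "Ann unplugged the toaster.")
# "[CLS] toaster [SEP] Ann unplugged the toaster. [SEP]"
```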
Precondition and Effect Classifiers. The Precondition and Effect Classifiers are implemented like typical classification heads for contextual embeddings, with one precondition classifier and one effect classifier for each of the 20 physical attributes tracked in the dataset. Specifically, each classifier is made up of two feedforward layers, each preceded by a dropout layer (using model-specific defaults for dropout probability), with tanh activation in between them. The first layer performs a linear transformation on an input contextual embedding, while the second layer projects the hidden state to the size of the label space for the corresponding attribute. Argmax is applied to the output for classification. Altogether, the predictions from these classifiers label the physical states of each entity in each sentence of the story.
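In inference form (dropout omitted), each classifier head reduces to the following sketch; the toy dimensions and the NumPy implementation are illustrative, not the actual PyTorch modules:

```python
import numpy as np

def classifier_head(embedding, W1, b1, W2, b2):
    """Two-layer classification head: linear -> tanh -> linear -> argmax.
    (Dropout, used during training, is omitted in this inference sketch.)"""
    hidden = np.tanh(embedding @ W1 + b1)   # first feedforward layer
    logits = hidden @ W2 + b2               # project to label space
    return int(np.argmax(logits)), logits

# Toy dimensions: embedding size 8, hidden size 8, 3 labels
# (e.g., true / false / unknown for one attribute).
rng = np.random.default_rng(0)
emb = rng.normal(size=8)
W1, b1 = rng.normal(size=(8, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
label, logits = classifier_head(emb, W1, b1, W2, b2)
```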
Conflict Detector. For each entity and its predicted physical states over all sentences in a story, the Conflict Detector predicts whether there is some conflict in the entity's physical states, specifically flagging a pair of conflicting sentences through multi-label classification. Again, we use a transformer (6 additional layers with 8 attention heads) for this module, but model the high-level sequence of sentences in a story rather than the low-level sequence of tokens in a sentence. For each sentence-entity pair, we consider the contextual embedding generated earlier, as well as the logits for all predicted precondition and effect states. We project both representations through linear layers to the same size, then concatenate them to form an entity dynamics representation. This representation for each sentence is input to the transformer, and the resulting hidden states are concatenated. Lastly, we use a feedforward layer followed by sigmoid activation to transform the hidden state to a belief probability of each sentence conflicting with another sentence in the story.
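The construction of the entity dynamics representation can be sketched as follows (the sentence-level transformer and the final sigmoid layer are omitted); dimensions are illustrative:

```python
import numpy as np

def entity_dynamics(embedding, state_logits, W_emb, W_state):
    """Project the contextual embedding and the physical state logits
    to a common size, then concatenate them into the entity dynamics
    representation fed (per sentence) to the Conflict Detector."""
    e = embedding @ W_emb        # (d,) -> (h,)
    s = state_logits @ W_state   # (k,) -> (h,)
    return np.concatenate([e, s])  # (2h,)

# Toy sizes: embedding dim 16, 60 state logits, shared projection dim 8.
rng = np.random.default_rng(0)
rep = entity_dynamics(rng.normal(size=16), rng.normal(size=60),
                      rng.normal(size=(16, 8)), rng.normal(size=(60, 8)))
```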
Story choice prediction. Given the output from the Conflict Detector, we lastly need to select which of the two given stories is plausible. As each Conflict Detector output represents the belief that a particular sentence conflicts with another sentence, we can simply sum the negative outputs for each story and apply softmax to determine which story is least likely to have a conflict.
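A minimal sketch of this story choice rule, with made-up conflict beliefs:

```python
import math

def choose_plausible(conflict_beliefs_a, conflict_beliefs_b):
    """Sum the negated per-sentence conflict beliefs for each story and
    apply softmax; the story least likely to contain a conflict wins."""
    score_a = -sum(conflict_beliefs_a)
    score_b = -sum(conflict_beliefs_b)
    z = math.exp(score_a) + math.exp(score_b)
    probs = (math.exp(score_a) / z, math.exp(score_b) / z)
    return ("A" if probs[0] > probs[1] else "B"), probs

# Story A's sentences look conflict-free; story B has a likely conflict.
choice, probs = choose_plausible([0.1, 0.05, 0.1], [0.1, 0.9, 0.2])
# choice == "A"
```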
Loss function details. To jointly train these various modules, we must balance several loss functions. The loss functions are weighted by corresponding scalar weights λ_p, λ_f, λ_c, and λ_s. In preliminary experiments, we found the best balance between state classification and the other tasks with the following assignment of weights: where |A| is the number of attributes tracked, i.e., 20. When omitting different loss functions, we rebalance the weights by ensuring λ_c + λ_s = 0.2, or λ_c = λ_s where the state classification losses are omitted.
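The combined objective is a weighted sum of the four losses, sketched below; the weight values here are placeholders chosen only to satisfy the λ_c + λ_s = 0.2 rebalancing rule, not the paper's tuned assignment:

```python
NUM_ATTRIBUTES = 20  # |A|

# Placeholder weights for illustration only; the paper tunes these values.
lambda_p = lambda_f = 0.8 / NUM_ATTRIBUTES  # per-attribute state losses
lambda_c, lambda_s = 0.1, 0.1               # conflict + story choice losses

def total_loss(loss_p, loss_f, loss_c, loss_s):
    """Weighted sum of precondition, effect, conflict, and story losses."""
    return (lambda_p * loss_p + lambda_f * loss_f
            + lambda_c * loss_c + lambda_s * loss_s)
```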

C Model Training Details
The ROBERTA, BERT, and DEBERTA models are built from HuggingFace's Transformers library (Wolf et al., 2020), particularly its implementation for multiple-choice classification, using the pre-trained BERT LARGE parameters (336M), ROBERTA LARGE parameters (355M), and DEBERTA BASE parameters (140M), respectively. For all models, we use the AdamW optimizer (Loshchilov and Hutter, 2018). Batch size is fixed at 1 story pair for all models, the maximum allowed by our GPU memory. To select the optimizer learning rate and number of training epochs, all models are trained by grid search over these two hyperparameters, maximizing the validation set verifiability as defined in Section 4.1. The learning rate is selected from the set {1 × 10^-6, 5 × 10^-6, 1 × 10^-5, 5 × 10^-5, 1 × 10^-4}, while the maximum number of epochs is fixed at 10. Ties are broken first by validation accuracy on the end plausibility classification task, then by selecting the model instance trained for fewer epochs (to avoid overfitting). The selected learning rate and number of epochs for each model presented in the main paper are listed in Table 6.
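The selection procedure can be sketched as an exhaustive grid search with lexicographic tie-breaking; `evaluate`, which would train a model instance and return its validation (verifiability, accuracy), is a hypothetical stand-in:

```python
from itertools import product

LEARNING_RATES = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4]
MAX_EPOCHS = 10

def grid_search(evaluate):
    """Select (lr, epochs) maximizing validation verifiability,
    breaking ties by validation accuracy, then by fewer epochs."""
    best, best_key = None, None
    for lr, epochs in product(LEARNING_RATES, range(1, MAX_EPOCHS + 1)):
        verifiability, accuracy = evaluate(lr, epochs)
        key = (verifiability, accuracy, -epochs)  # lexicographic ranking
        if best_key is None or key > best_key:
            best, best_key = (lr, epochs), key
    return best
```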

D Supplementary Results
Lastly, we provide additional results that were omitted from the main paper.

D.1 Conflict Detector Ablations
The Conflict Detector module takes in two types of inputs: 1) contextual embeddings of sentence-entity pairs, and 2) physical state logits from the Precondition and Effect Classifiers. To determine the impact of each, we present ablations omitting them for the best-performing instances from the previous section, i.e., those not considering story choice classification loss. Table 7 presents these results for the validation set, while Table 8 presents these results for the test set. (Note that the results in this appendix use a slightly simpler label space for location state classification, and thus are not directly comparable to the results presented in the main paper.) Without including the physical state inputs, we see a slight drop in consistency and verifiability of some models. For example, ROBERTA drops from 9.7% verifiability and 23.4% consistency to 4.6% and 17.7%, respectively. Meanwhile, DEBERTA increases from 8.0% verifiability and 20.2% consistency to 11.4% and 24.5%. While ROBERTA seems to depend slightly on the predicted physical states in performing conflict detection, DEBERTA favors the contextual embedding.
Without including the contextual embeddings, we see a drastic drop across the board to below-random performance, with ROBERTA dropping to 0% verifiability and consistency, and DEBERTA to 2.3% and 6.6%, respectively. This suggests that while forcing the model to track physical states enables greater explainability, the physical states alone are not sufficient for models to learn conflict detection, or are not incorporated successfully into the higher-level predictions. The contextual embedding, which is fine-tuned jointly on physical state classification and conflict detection, seems to be most powerful for solving the end task. Future work should further explore how to harness the rich information provided by the physical states to improve system performance and interpretability.

D.2 State Classification Performance by Attribute
Figure 8 breaks down the F1 score for predicting precondition and effect states by attribute across the TRIP dataset. We find that for preconditions, openness and whether objects are running (i.e., activated) are best captured, while for effects, existence and consciousness are best captured. Meanwhile, wetness and temperature are challenging for both precondition and effect prediction.