RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms

Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communication with humans, is debated. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference using Commonsense Axioms, that evaluates robust commonsense inference despite textual perturbations. To generate data for this challenge, we develop a systematic and scalable procedure using commonsense knowledge bases and probe PTLMs across two different evaluation settings. Extensive experiments on our generated probe sets with more than 10k statements show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks. We also find that fine-tuning on similar statements offers limited gains, as PTLMs still fail to generalize to unseen inferences. Our new large-scale benchmark exposes a significant gap between PTLMs and human-level language understanding and offers a new challenge for PTLMs to demonstrate commonsense.


Introduction
Smooth and effective communication requires the ability to make various forms of commonsense inferences (Clark and Brennan, 1991). When a friend texts, "I'm going to perform in front of thousands tomorrow," you may reply reassuringly, "Deep breaths, you'll do great!" Implicit to this communication is a commonsense logical inference that a person performing in front of a crowd may feel anxious, and that a reassuring remark helps ease anxiety (Figure 1). A growing body of literature (Bosselut et al., 2019; Petroni et al., 2019) shows that pre-trained language models (PTLMs) are able to catalog the types of commonsense relationships necessary for fluid communication. However, as we show in this paper, PTLMs have a shocking inability to leverage such commonsense knowledge to make robust inferences.
Here we focus on two specific characteristics crucial to human-AI communications: (1) combining commonsense knowledge with information expressed in natural language to make inferences and (2) producing consistent inferences amidst logically-equivalent yet linguistically-varied paraphrases. We focus on commonsense axioms, such as "Performing in front of people can cause anxiety", and exploit the flexibility of language to express the same axiom in many forms -e.g., "Performing in front of people makes it hard to stay calm." We test these characteristics by generating self-contained commonsense statements involving novel entities ("Prindag is going to perform in front of a crowd, so prindag is more likely to feel nervous.") and adapt them to two evaluation settings.
To fill this gap, we introduce RICA, a challenge to evaluate a model's Robust Inference using Commonsense Axioms in English. RICA draws on linguistic and cognitive science research (Schank and Abelson, 1977;Alshawi and van Eijck, 1989) suggesting humans translate language to logical representations and reason using these abstract representations. RICA consists of a set of natural language statements in the "premise-conclusion" format that require reasoning using latent (implicit) commonsense relationships. We formulate these abstract commonsense relations between entities in first-order logic and refer to them as commonsense axioms (see Fig. 1). To insulate from PTLM biases and test human-like acquisition ability on new words (Carey and Bartlett, 1978), RICA uses novel entities, which are unseen strings used to ground axioms into natural language. Finally, we introduce a set of linguistic perturbations that paraphrase a commonsense axiom into natural language in various forms.
Each component of RICA is generalizable, providing a systematic procedure to generate myriad commonsense statements. In this paper, we generate 257k commonsense statements capturing 43k axioms comprising different types of commonsense, such as physical, material, and social properties. To demonstrate the quality of RICA, we create a manually-curated set of 1.6k probes based on commonsense axioms, and also undertake a large-scale, crowdsourced verification of 10k generated statements with multiple human annotators. RICA is built by leveraging existing commonsense knowledge bases such as ConceptNet (Liu and Singh, 2004) and ATOMIC (Sap et al., 2019a) to support easy expansion. Furthermore, RICA's statements can be posed as popular PTLM tasks such as masked word prediction or sentence probability, making our benchmark widely applicable. RICA provides an extensible platform for evaluating commonsense reasoning in a variety of PTLMs.

Figure 2 (example probe). Masked Word Prediction: "A prindag is lighter than a fluberg, so a prindag should float [MASK] than a fluberg." Answer: [more] or [less]. Novel entity pair: prindag and fluberg.
When evaluating state-of-the-art transformer-based PTLMs on the RICA probes in a zero-shot setting (e.g., predicting "more" vs. "less" in the first example in Fig. 2), we consistently discover that their performance is on par with random guessing. Even after fine-tuning with large amounts of labeled examples, PTLMs exhibit a significant gap relative to human performance. We drill down into this finding through (1) zero-shot, (2) low-resource, (3) high-resource, and (4) noisy training settings and find that even with appreciable performance gains on automatically generated probes in high-resource settings, PTLMs still remain on par with random guessing on difficult, human-curated RICA probes. To better understand these results, we identify a pervasive intrinsic bias in PTLMs that reflects the positivity bias found in human languages (Dodds et al., 2015).

The RICA Challenge
The RICA challenge is posed as a set of textual statements (sentences), each expressing a latent commonsense relationship in the "premise-conclusion" format (see Stage 5 in Fig. 3 for examples). These statements use generated novel entities such as "prindag" and "fluberg" instead of real-world entities such as "thimble" and "elephant" to separate factual recall from reasoning. Each statement can be viewed as an instantiation of a commonsense principle, such as "smaller objects cannot contain larger objects." We express these commonsense principles in first-order logic, further generalizing statements through the use of general predicates for object properties (e.g., size) and object-object relations (e.g., containment). We turn these logical formulae into the associated textual statements using a set of perturbation operators and a conversion module, which together produce a logically-equivalent set of commonsense statements. In the rest of this section, we first provide a formal definition of the RICA challenge, then give a detailed description of the statement construction process.

Figure 3: Overview of the workflow of our statement construction process. The output is a set of linguistically-diverse masked sentences that follow the same reasoning template. Example statement set: "A is B's lawyer, so A is more knowledgeable about law than B"; "B is A's lawyer, so A is not more knowledgeable about law than B"; "A is B's lawyer, so A is less clueless about law than B"; "A is B's lawyer, so B is less informed on the law than A"; A and B are then replaced with novel entities (A → prindag, B → fluberg).

Table 1: Terminology used in this paper.
- Logical Template (LT): general FOL formula constructed from predicates and logical connectives
- Arguments: specific entities and relations to fill predicates in LTs
- Axiom: commonsense relationship expressed in FOL by filling a LT with arguments
- Commonsense Statement: natural language sentence after converting an axiom using a TT
- Statement Set: statements that instantiate the same axiom after applying perturbations
- Evaluation Instances/Probe: a set of statements after adapting to an evaluation task

Challenge Formulation
Formally, we define a commonsense axiom a_i, expressed via a first-order-logic (FOL) formula, as a relationship between entities that can be inferred using commonsense knowledge (see Stage 4 in Fig. 3). To test whether PTLMs understand an axiom a_i, as well as examine their robustness to linguistic variations, we instantiate the axiom a_i with a set of m syntactically-different commonsense statements s_i1, s_i2, ..., s_im, each expressing the same logic as the axiom. Each statement takes the form of an inferential implication with a premise and conclusion. Finally, depending on the PTLM, we select an appropriate task (Section 3), transform each statement in the set into its task-specific probe, and evaluate how well the PTLM can leverage the logic of a_i to solve each of a_i's corresponding probes. We deem a model "successful" on the challenge (i.e., it understands the axioms) only if it performs comparably to humans on all probes of the axioms.

Statement Set Construction Process
This subsection introduces our proposed procedure for constructing the commonsense inference statement sets for the challenge. The terminology and descriptions can be found in Table 1, and an overview of our workflow is shown in Figure 3.
Stage 1. Define Predicates. In FOL, predicates denote a property of objects or a relation between objects, and every predicate symbol comes with an arity greater than or equal to 1. We define three general high-level predicates that serve as the backbone for the logical formulations of our axioms: Property, Relation, and Comparator.
(1) PROP(A, p) represents that entity A has a certain property p. "PROP(A, glass)" indicates that A is made of glass. (2) REL(A, B, r) represents that A and B have a certain relation r. "REL(A, B, lawyer)" indicates that A is B's lawyer. (3) COMP(x, y) represents a comparative relationship between values x and y, where "COMP" will be replaced with comparison words like "better," "more," or "easier." We will later define multiple sub-types of these predicates to crawl from Knowledge Bases (KBs) to ensure a wide coverage of common knowledge.
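These three predicates can be pictured as simple structured objects. The sketch below is our own illustration (the class and method names are hypothetical, not the authors' implementation) of how PROP, REL, and COMP compose into a FOL formula:

```python
# Hypothetical sketch of RICA's three high-level predicates as plain
# Python data structures; the names Prop/Rel/Comp and fol() are ours.
from dataclasses import dataclass

@dataclass(frozen=True)
class Prop:
    entity: str   # e.g., "A"
    prop: str     # e.g., "glass" or "knowledge of law"
    def fol(self) -> str:
        return f"PROP({self.entity}, {self.prop})"

@dataclass(frozen=True)
class Rel:
    a: str
    b: str
    rel: str      # e.g., "lawyer": A is B's lawyer
    def fol(self) -> str:
        return f"REL({self.a}, {self.b}, {self.rel})"

@dataclass(frozen=True)
class Comp:
    x: Prop
    y: Prop
    word: str = "more"   # comparison word: "more", "less", "better", ...
    def fol(self) -> str:
        return f"COMP({self.x.fol()}, {self.y.fol()})"

# "A is B's lawyer" implies a comparison of their knowledge of law:
premise = Rel("A", "B", "lawyer")
conclusion = Comp(Prop("A", "knowledge of law"), Prop("B", "knowledge of law"))
print(f"{premise.fol()} -> {conclusion.fol()}")
# REL(A, B, lawyer) -> COMP(PROP(A, knowledge of law), PROP(B, knowledge of law))
```

The COMP `word` slot is exactly what the probing tasks later mask or flip.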
Stage 2. Compose Logical Templates. We manually create first-order logical formulae, referred to as logical templates (LTs), using the predicates defined in Stage 1. Each formula takes the form of an implication, expressing an inference based on commonsense knowledge. For example, REL(A, B, r) → COMP(PROP(A, p), PROP(B, p)) expresses a logical inference comparing a property of two entities, A and B, based on a relationship between them. An instantiated version of this template can be REL(A, B, lawyer) → COMP(PROP(A, knowledge of law), PROP(B, knowledge of law)).
Stage 3. Populating Knowledge Tables. Materializing the abstract relationships in a logical template requires connecting abstract logic to commonsense knowledge. We define a structure called a knowledge table (KT) that contains valid arguments to populate a specific LT and form a FOL representation of the axiom. KTs are generated by crawling commonsense KBs such as ConceptNet (Liu and Singh, 2004) and ATOMIC (Sap et al., 2019a). The first step of the crawling process is to narrow down the predicates to specific types. For example, PROP is general enough to capture an entity's capabilities (e.g., knowledge of law) or its intrinsic properties (e.g., hardness). We pre-define several type constraints for both properties (PROP) and relations (REL). For PROP, we consider Capability, Attribute, and Condition. For REL, we consider Role and Action. Note that these categories can be extended for wider coverage of knowledge and allow our LTs to be adapted to a broader range of KB schemas. After specifying type constraints, we specify steps for crawling the arguments either from commonsense KBs such as ConceptNet and ATOMIC or from general web resources such as Wikipedia. In our example in Fig. 3, we can crawl occupations from Wikipedia, and then query ConceptNet for triples with the occupation as the subject and CapableOf as the relation, creating a KT with professions and capabilities. We show all crawling steps for KTs in Appendix A.
Stage 4. Creating Axioms. Combining knowledge tables and logical templates allows us to generate commonsense axioms at scale, which are partially-filled LT formulae. For example in Fig. 3 Stage 3, the arguments of predicates REL, PROP, and COMP are set in order to reflect the commonsense relationship between lawyer and knowledge of law, while leaving the entities A and B ungrounded. Once the predicates are instantiated, we call this partially-filled LT a commonsense axiom.
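Stages 3 and 4 amount to substituting one knowledge-table row into a logical template while leaving the entity slots open. A minimal sketch (with our own hypothetical names, not the authors' code):

```python
# Hypothetical sketch: an axiom is a logical template whose predicate
# arguments (r, p) are filled from a knowledge-table row, while the
# entity slots A and B stay ungrounded.
TEMPLATE = "REL(A, B, {r}) -> COMP(PROP(A, {p}), PROP(B, {p}))"

# Rows as they might come from a crawl of ConceptNet/Wikipedia:
knowledge_table = [
    {"r": "lawyer", "p": "knowledge of law"},
    {"r": "doctor", "p": "knowledge of medicine"},
]

def make_axiom(template: str, row: dict) -> str:
    """Partially fill a logical template to obtain a commonsense axiom."""
    return template.format(**row)

axioms = [make_axiom(TEMPLATE, row) for row in knowledge_table]
print(axioms[0])
# REL(A, B, lawyer) -> COMP(PROP(A, knowledge of law), PROP(B, knowledge of law))
```

Because A and B remain free variables, the same axiom can later be grounded with any pair of novel entities.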
Stage 5. Generate Statement Sets. After filling the logical templates, each partially-filled LT represents one commonsense axiom. To comprehensively challenge models' understanding of an abstract axiom, we construct a statement set expressing the same axiom with different phrasings, i.e., logically-equivalent yet linguistically-varied statements. We define several perturbations to apply to the arguments from the knowledge tables.
(1) Linguistic Operators. We define seven types of linguistic operators to facilitate and formalize perturbations, shown in Table 2. We construct the last four operators by combining some of the single operators listed in the first three rows. Note that for the NEGATION, ANTONYM, PARAPHRASE INVERSION, and NEGATION PARAPHRASE types, the logic of the original phrase is changed, so words in the statements have to be changed accordingly. For example, if we apply ANTONYM to "fit into" in the probe "A is smaller than B, so A is more likely to fit into B," we get "A is smaller than B, so A is less likely to contain B."
(2) Asymmetry Operator. Most of our logical templates use strongly-ordered comparisons and relationships, allowing us to introduce asymmetries that preserve meaning. Using this invariant, we can swap the positions of the two entities in these predicates and the logic is negated; we denote this perturbation as ASYM(P(A, B)) → P(B, A) = ¬P(A, B).
We apply the defined operators to the arguments in the predicates to first form a set of partially-filled LTs (axioms), then use a conversion module to convert the axioms into statements with diverse perturbations. In practice, this module can be a sequence-to-sequence (seq2seq) model (that takes in FOL and outputs natural language text) or a set of human-written templates. Finally, commonsense axioms are general logical relationships that hold for all entities. To formulate specific commonsense statements, we generate specific novel entities. These entities are randomly generated character strings of length 3 to 12 that are not seen in the training data of the PTLMs. Using novel entities enables us to avoid conflating fact-based recall with commonsense reasoning when evaluating PTLMs.

Table 3: Example first-order logical templates we construct for our probes, with an example for each template.
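Two pieces of Stage 5 are mechanical enough to sketch directly: the asymmetry operator (swap the two entities; the conclusion is then negated) and novel-entity generation (random strings of length 3 to 12). The helper names below are ours, not the released code:

```python
# Hypothetical sketch of the asymmetry operator and novel-entity
# generation from Stage 5. Function names are our own.
import random
import string

def asymmetry(statement: str, a: str = "A", b: str = "B") -> str:
    """Swap the two entity placeholders. Per ASYM(P(A, B)) = ¬P(A, B),
    the caller pairs the swapped statement with a negated conclusion."""
    # Swap via a sentinel so the two replacements do not clobber each other.
    return statement.replace(a, "\0").replace(b, a).replace("\0", b)

def novel_entity(rng: random.Random) -> str:
    """A random lowercase string of length 3-12, unseen in pre-training."""
    n = rng.randint(3, 12)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))

rng = random.Random(0)
s = "A is B's lawyer, so A is more knowledgeable about law than B"
print(asymmetry(s))
# B is A's lawyer, so B is more knowledgeable about law than A
e1, e2 = novel_entity(rng), novel_entity(rng)
print(s.replace("A", e1).replace("B", e2))  # grounded with novel entities
```

Grounding the placeholders with fresh random strings is the final step that turns an abstract axiom into a concrete RICA statement.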

Probing Tasks
To examine transformer-based PTLMs' performance on the RICA challenge, we draw conclusions from evaluation results on two distinct probing tasks, shown in Figure 2 and described as follows.
Masked Word Prediction (MWP). Inspired by the masked word prediction objective in BERT (Devlin et al., 2019), we examine whether models can recover masked-out keywords in a statement given the remaining context. Since RICA's statements take the form of implications, we mask words in the consequent to evaluate inference performance given the premise. Specifically, we choose to mask the comparative words (from COMP) such as "more/less" and "better/worse" as shown in Figure 2, since they not only capture the commonsense relationship, but also occupy positions where only a few options are appropriate logically and syntactically.
Sentence Probability (SP) evaluates whether PTLMs assign higher probability to statements that express commonsense axioms than to contradictory statements. RICA statements are input to PTLMs, and SP is computed as the product of each word's probability conditioned on the previous words, i.e., the left-to-right language modeling loss. For each RICA statement, we pair it with an incorrect (non-commonsense) statement by swapping the comparative word (i.e., the masked word in MWP) with its opposite, as shown in Figure 2.
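The SP evaluation can be sketched as follows. Here `score_fn` stands in for a left-to-right LM's sentence log-probability (e.g., the negated GPT-2 language-modeling loss); the toy scorer at the bottom is only a stand-in so the sketch runs without downloading a model, and all names are our own:

```python
# Hypothetical sketch of the Sentence Probability (SP) evaluation:
# a model "passes" a probe if the commonsense statement scores higher
# than its comparative-flipped, non-commonsense counterpart.
OPPOSITES = {"more": "less", "less": "more",
             "better": "worse", "worse": "better",
             "easier": "harder", "harder": "easier"}

def flip_comparative(statement: str) -> str:
    """Build the contradictory statement by swapping the comparative word."""
    return " ".join(OPPOSITES.get(w, w) for w in statement.split())

def sp_correct(score_fn, statement: str) -> bool:
    """True if the LM prefers the commonsense statement over its flip."""
    return score_fn(statement) > score_fn(flip_comparative(statement))

# Toy scorer that happens to prefer "more" in this probe:
toy_score = lambda s: 1.0 if "more" in s.split() else 0.0
probe = "A is B's lawyer, so A is more knowledgeable about law than B"
print(sp_correct(toy_score, probe))  # True
```

With a real PTLM, `score_fn` would sum the token log-probabilities of the sentence under the left-to-right model, exactly the quantity the paper describes.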

Probing Data Details
Raw Set Following the process in Section 2, we use the three high-level predicates to generate five LTs as shown in Table 3. Then we construct knowledge tables to fill in each template by crawling from two commonsense KBs: ConceptNet (Liu and Singh, 2004) and ATOMIC (Sap et al., 2019a). Specifically, for each LT, we design 1 to 4 crawling strategies based on the type constraints we impose on the predicates, so that they cover multiple aspects of commonsense knowledge (for all strategies see Table 4 in Appendix A). For example, the example shown for LT1 in Table 3 concerns inferring physical properties based on the materials of two objects, as we constrain PROP in the premise to be materials. However, we can also constrain PROP in the premise to be animals, so that the same template examines inference of properties based on the animal types of A and B, e.g., "A is a fish, B is a horse, so A is more likely to be in the bottom of the sea than B." We have 11 type-constrained LTs and we populate the KTs using 11 human-designed crawling strategies shown in Appendix A, resulting in around 43k axioms. Then we apply the perturbation operators described above to form a set of 257k perturbed axioms. For this large set, we apply the negation and asymmetry operators automatically by adding negation and switching the order of entities. To convert FOL axioms to text, we train a seq2seq model based on BART (Lewis et al., 2020a) on 200 manually converted axiom-text pairs covering each type-constrained LT and each perturbation type. Finally, we replace the entities with novel entities to form a set of 257k commonsense statements.
Quality Check. To check the language quality of the probes generated by BART, we randomly sample 5% of the 10k set and ask a native English speaker to check for naturalness. We found that only 4 out of 500 (0.8%) probes contain grammar or fluency issues. Since all probes follow a premise-conclusion format, we find that 200 pairs of first-order logic (FOL) and aligned text are sufficient to fine-tune BART to convert FOL into text, based both on our manual inspection and on the crowdsourced verification of the generated probes. We tried increasing the training set size and did not observe a clear difference in quality.

Human-Verified Set
To ensure the quality of crawled data, we conduct human evaluation using Amazon Mechanical Turk (AMT) on 10k of our collected 257k statements, covering 1.7k different commonsense axioms. We present a pair of statements by flipping the comparative term in the original statement to its opposite, and ask two annotators to choose the one that follows commonsense. If they disagree, we subject the pairs to a second round of turking with three annotators, and use majority voting to validate the statement. When annotators prefer a flipped statement over the original, we replace the statement accordingly.

Figure 4: Performance of different transformer-based models on different settings of our data. BERT, RoBERTa, ERNIE, and BART are evaluated using masked word prediction and GPT-2 is evaluated using sentence probability. Zero-shot performance is no better than random guessing. More data helps greatly on the human-verified test set (10k), although noisy training hinders the improvement; increasing data does not help at all on our human-curated set. (c) shows the fine-tuning curve for RoBERTa-large.

Quality Check.
The Fleiss' kappa agreement (Fleiss, 1971) for the two rounds of turking is 0.72 and 0.52, respectively, indicating that some statements are difficult for humans to verify. Of the 10k statements in the verified set, we sample the 10% (1k statements covering 170 axioms) on which two annotators agree in the first or second round to form our Human-Verified Test Set.
Human-Curated Set To further challenge models on more flexible forms of text, we ask humans to write perturbed axioms. Specifically, given an axiom in FOL, a human annotator is asked to perturb the conclusion following all 7 types of linguistic perturbations shown in Table 2, including compositional combinations that are hard to generate using automated approaches. Then we apply the asymmetry operators to either the premises or the conclusions. Thus we have a total of 24 types of perturbations, including the unperturbed one. We focus on 80 axioms covering physical, social, and temporal commonsense knowledge and create 1.6k statements. We show examples of all perturbations for one probe in Appendix Table 7 and 60 sampled probes in Appendix Table 8.

Joint Test Set Combines the Human-Curated and Human-Verified sets, for a total of 2.6k statements.

Evaluation Settings
Using the collected probe data introduced above, we consider four evaluation settings to examine models' capabilities to perform robust inference on our dataset.

1. Zero-Shot: In the zero-shot setting, we test models without any exposure to training data.
2. Low-Resource: For the low-resource setting, we fine-tune the models on 1k (10%) of the verified 10k set to determine how a small amount of in-domain text influences PTLM performance.
3. High-Resource: We use 90% of the verified training set (8k for training, 1k for validation). We further increase the number of training instances by introducing 5 different novel entities for each statement, yielding 40k training instances that include 5 repetitions of each probe with different novel entities, providing models more opportunities to learn patterns in the training set.
4. Raw Large-Scale Training: Finally, we analyze the effects of training on an even larger but noisier set with a similar format. Starting from the raw set of 257k crawled statements, we sample 100k statements covering 17k axioms, ensuring no overlap with the test set.
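The high-resource augmentation in setting 3 (each statement repeated with 5 different novel entity pairs) can be sketched as follows; the helper names are our own illustration, not the authors' pipeline:

```python
# Hypothetical sketch of the high-resource augmentation: duplicating
# each verified statement with 5 distinct novel entity pairs, which
# turns 8k training statements into 40k training instances.
import random
import string

def novel_entity(rng):
    """Random lowercase string of length 3-12 (unseen in pre-training)."""
    n = rng.randint(3, 12)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(n))

def augment(statements, copies=5, seed=0):
    rng = random.Random(seed)
    out = []
    for s in statements:
        for _ in range(copies):
            e1, e2 = novel_entity(rng), novel_entity(rng)
            out.append(s.replace("A", e1).replace("B", e2))
    return out

train = ["A is B's lawyer, so A is more knowledgeable about law than B"] * 3
print(len(augment(train)))  # 15 = 3 statements x 5 entity pairs
```

Applied to the 8k verified training statements, this yields the 40k instances described above.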

Baseline Methods
We evaluate multiple state-of-the-art transformer-based PTLMs, covering both masked and generative language models. For the masked word prediction task, we consider BERT (Devlin et al., 2019), RoBERTa, ERNIE, and BART; for the sentence probability task, we use GPT-2.

Results and Analysis
We examine the performance of multiple language models on each evaluation setting on our probe data, including zero-shot and fine-tuning on various splits, and present ablation studies to analyze performance more thoroughly. All of our results are averages over 3 seeds.

Figure 5: Results of fine-tuning and the ablation study on novel entities. (a) Models are biased towards positive words, requiring fine-tuning to correct. (b) Poor performance persists after replacing novel entities with real names, indicating that the use of random strings is not hindering PTLMs' abilities. (c) Fine-tuning mitigates the bias towards positive words, but the inconsistency across linguistic variations becomes obvious.

Zero-Shot Performance
As shown in the first group of bars in Figures 4a and 4b, the average binary accuracies of all five models (we show the large versions) on both the MWP and SP tasks are around 0.5, regardless of the test data. A random baseline that chooses between the two comparative words would have an accuracy of 0.5. This shows that the tested models barely beat a random guessing baseline without training.
Are Knowledge-Augmented Models Better? To see if adding knowledge during training helps, we also test a knowledge-enhanced LM, ERNIE (Zhang et al., 2019). However, as shown in Figures 4a and 4b, ERNIE also performs on par with random guessing, demonstrating that simply adding more knowledge does not improve robust inference capability.
Human Performance To benchmark human performance, we sampled 5% of our joint test set, consisting of both human-verified and human-curated data, and gathered answers from 20 subjects (annotators) with diverse backgrounds who were not involved in the probe construction process. We consider this zero-shot testing for humans, as they had not seen the training set before. Humans obtained 91.7% accuracy, taking a majority vote for each probe, with a substantial inter-annotator agreement of 0.768 kappa (Cohen, 1960).

Fine-tuning Performance
To study whether the poor performance in §4.1 stems from a lack of exposure to RICA's probe sets, we conduct experiments to fine-tune the baseline language models. As in §3.3, we consider training on low-resource data by sampling a subset of the verified set, on high-resource data by filling multiple novel entities into the verified set, and on the noisy 100k data. We fine-tune BERT, RoBERTa, ERNIE, and BART using the same masking approach as the MWP evaluation, and fine-tune GPT-2 on the causal language modeling task. Details for training are in the appendix.
More Data Helps on Human-Verified Set Figure 4a shows that fine-tuning on our probe set helps on the human-verified set, especially for RoBERTa and ERNIE, where the high-resource setting almost reaches 90% accuracy. This demonstrates that, with enough data, PTLMs are able to reach near-human performance on generated axioms. The low-resource (except for ERNIE) and noisy training settings, however, pose an enduring challenge for most models.
Diversity of Curated Set Stumps All. Evaluating models fine-tuned on human-verified data on the human-curated set, where human editors provide greater diversity in probes, tells a different story. The model accuracy (Figure 4b) remains near 50%, on par with random guessing, for all models in all settings. This indicates that exposing these models to numerous linguistically similar sentences does not improve robust inference ability. Furthermore, we evaluate training data sensitivity for both the human-verified and human-curated sets (Figure 4c). We vary the training set size from 0 to 80k for RoBERTa-large. Our results show that performance on the human-verified set saturates around 80% accuracy after 10k instances, but human-curated accuracy remains close to 50% throughout. This casts doubt on the model's generalizability and suggests that the improved performance may be due to pattern-matching on the seq2seq-generated probes, not commonsense acquisition. The inability to improve on reasoning tasks after fine-tuning supports the challenging nature of RICA, which cannot be trivially solved by fine-tuning.

Performance Analysis
Positivity Bias in PTLMs. We find that when PTLMs are asked to infer a comparative relationship between the properties of two entities, the model is heavily biased towards predicting words that evoke positive emotions (positive valence), regardless of which commonsense axiom is embedded in the statement. Figure 5a shows that the accuracy for "positive valence" words such as "more" and "easier" is much higher than for "negative valence" words such as "less" and "harder". Fine-tuning on our probes, which have a balanced number of sentences containing positive and negative comparatives, helps mitigate this bias for RoBERTa (base and large) and GPT-2. We conjecture that this may be due to a frequency difference between positive and negative valence words, related to reporting bias in language (Gordon and Van Durme, 2013). Dodds et al. (2015) show a universal positivity bias in human languages; to check whether our comparators also exhibit it, we use the Google Ngram Viewer to find frequencies for the masked words, and confirm that the positive valence words are about 5 times more frequent than their negative counterparts. This correlation supports the claim that PTLMs do not reason as humans do, but are guided by statistical patterns. Our challenge clearly reveals this bias in PTLMs and suggests RICA as a potential mitigation.
Ablation of Novel Entities To ensure the novel entities used in RICA did not impair PTLM performance, we conducted an ablation study on 4,800 statements from our human-curated set (each statement is repeated 3 times). These probes involve social commonsense, where novel entities take the place of names. We conduct the ablation by choosing common names instead of novel entities, producing probes containing only previously-seen words. As Figure 5b shows, the performance of all models in all three settings did not change significantly, strongly suggesting that the novel entities are not responsible for the poor performance. We conclude that novel entities do not introduce helpful or distracting sub-words.

Impact of Linguistic Perturbations
Before fine-tuning, the heavy bias towards positive valence words interfered with the perturbation analysis, since each perturbation type has a balanced number of positive and negative valence words. After fine-tuning, however, the bias is mitigated and we find significant variations in performance across perturbation types (Figure 5c). This shows that language variation greatly affects a model's ability to make inferences on our commonsense probes, suggesting that models do not comprehend the axioms. Interestingly, composite perturbation types such as NEGATION ANTONYM are not necessarily harder for PTLMs, even though performance on ANTONYM is the lowest. We speculate that the model exploits some pattern in NEGATION ANTONYM that is not present for ANTONYM alone.
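Measuring consistency across perturbation types amounts to a group-by over per-statement correctness. A minimal sketch, with our own field names rather than the released evaluation code:

```python
# Hypothetical sketch: accuracy broken down by perturbation type, as in
# the analysis of Figure 5c. Each record marks whether the model solved
# one perturbed statement.
from collections import defaultdict

results = [
    {"perturbation": "ORIGINAL", "correct": True},
    {"perturbation": "ORIGINAL", "correct": True},
    {"perturbation": "ANTONYM",  "correct": False},
    {"perturbation": "ANTONYM",  "correct": True},
    {"perturbation": "NEGATION", "correct": True},
]

def accuracy_by_perturbation(records):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["perturbation"]] += 1
        hits[r["perturbation"]] += int(r["correct"])
    return {p: hits[p] / totals[p] for p in totals}

print(accuracy_by_perturbation(results))
# {'ORIGINAL': 1.0, 'ANTONYM': 0.5, 'NEGATION': 1.0}
```

Large spreads between the per-type accuracies are exactly the inconsistency the paragraph above describes.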

Related Work
Commonsense Reasoning has a long history in AI, with classical work primarily focusing on executing symbolic rules as hand-crafted programs for machines to learn (McCarthy, 1960). We differ from the second line of work by proposing a systematic procedure to generate probes and evaluate for robustness. Clark et al. (2020) show that PTLMs can emulate deductive reasoning given explicit rules, but we focus on unstated commonsense relations.

Conclusion
We design RICA as an AI challenge to test robust inference capabilities on linguistically-varied probes covering different commonsense axioms. RICA is built on a systematic process to construct probes using FOL formulae, perturbation operators, and novel entities. Following this approach, we generate and verify more than 10k statements from 1.7k axioms and test multiple PTLMs in various settings. We find that PTLMs perform on par with random guessing in the zero-shot setting, exhibit a strong positivity bias, and are not robust under linguistic perturbations.

Acknowledgments
We

Ethical Considerations
Our work aims to pose a new challenge to improve effective human-AI communication by collecting new data in English, which benefits English speakers more. We have conducted human evaluation using Amazon Mechanical Turk. We pay turkers around $11 per hour, above the national minimum wage, and engage in constructive discussions if they have concerns about the process. We also give each annotation instance enough time so that we do not pressure annotators. Our data construction process makes use of available public resources: Wikipedia, ConceptNet (Liu and Singh, 2004), and ATOMIC (Sap et al., 2019a), which could contain societal biases as shown by Mehrabi et al. (2021). Although our probes do not involve specific demographics, we acknowledge the possibility that biases in these knowledge resources are included in our data. We have provided detailed descriptions of our data construction process to minimize potential confusion.

A.1 Raw Set Collection
We define 1-4 combinations of type constraints on the predicates for each LT and design crawling strategies accordingly using three resources: ConceptNet, ATOMIC, and Wikipedia. Descriptions for each of the 11 strategies are included in Table 4. All data and code for the crawling strategies are included in the supplementary materials.

A.2 Turking Details for Human-Verified Set
We present a pair of statements by flipping the comparative term in the original statement to its opposite, and ask two annotators to choose the one that follows commonsense. The AMT annotation page is shown in Figure 8. If the two annotators disagree, we take the pairs and conduct a second round of turking with three annotators, using majority voting to decide which sentence in the pair is right. We replace the original statement with its opposite if the majority of annotators think the other one in the pair better follows commonsense. In total, around 2,500 pairs were sent to the second round and 300 pairs were flipped according to annotators. The estimated time for completing each instance is around 20 seconds and we pay $0.06 per instance, which translates to around $11 per hour.

A.3 Human-Curated Set Details
We show all perturbations for one probe in Table 7 and 60 of our human-curated set's unperturbed statements in Table 8 (for the temporal set, refer to the supplementary material). The full data is included in the supplementary material.

B Experimental Details
Model Details We test our probes on 10 models in total; the number of parameters and other details are given in Table 6. For RoBERTa-base, RoBERTa-large, RoBERTa-large-MNLI, and BART-large-MNLI, we use the fairseq implementation 3 . For BERT-base-uncased, BERT-large-uncased, ALBERT, and GPT-2, we use the huggingface transformers library 4 . For COMET trained on ConceptNet and ATOMIC, we follow their github repo 5 . We use ERNIE from its original github 6 .
Fine-tuning Details We fine-tune BERT-base-uncased, BERT-large-uncased, RoBERTa-base, and RoBERTa-large using the HappyTransformers framework 7 with a consistent learning rate of 1e-5. We fine-tune GPT-2 with the huggingface transformers library's example code 8 , using its default parameters. We train the models on one NVIDIA Quadro RTX 6000 GPU for 10 epochs; after each epoch we test the fine-tuned model on our validation set and save the model with the highest validation performance. Fine-tuning RoBERTa-base and GPT-2 takes around 30 minutes per epoch; RoBERTa-large takes around 1 hour. The best validation performance for RoBERTa-base is at the fourth epoch, with perplexity 1.338 and evaluation loss 0.291. For RoBERTa-large, the best is epoch 5, with perplexity 1.395 and evaluation loss 0.333. For GPT-2, the best is epoch 3, with perplexity 1.279.
Interpretation Details We use the AllenNLP Interpret demo 9 . Because the interpretations are not very consistent (the most important words change across repeated runs of the same sentence and across different entity names), we run the algorithm over the same probe 5 times, each with different entity names, and select the words that are ranked among the top 5 most important in at least 3 of the 5 trials.
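The per-epoch model-selection loop described above can be sketched framework-agnostically; the `train_epoch` and `evaluate` callables below stand in for the HappyTransformers / transformers calls, and the toy losses are made up to mirror RoBERTa-base's best epoch, not real measurements:

```python
def train_with_selection(train_epoch, evaluate, num_epochs=10):
    """Train for num_epochs; after each epoch, evaluate on the validation set
    and keep the checkpoint with the lowest validation loss (the selection
    rule used in our fine-tuning runs)."""
    best_loss, best_epoch, best_state = float("inf"), None, None
    for epoch in range(1, num_epochs + 1):
        state = train_epoch(epoch)   # stand-in: returns a checkpoint handle
        loss = evaluate(state)       # stand-in: validation loss for it
        if loss < best_loss:
            best_loss, best_epoch, best_state = loss, epoch, state
    return best_epoch, best_loss, best_state

# Toy run with made-up validation losses, lowest at epoch 4.
losses = [0.50, 0.40, 0.35, 0.29, 0.33, 0.36, 0.34, 0.35, 0.37, 0.38]
best_epoch, best_loss, _ = train_with_selection(
    train_epoch=lambda e: f"checkpoint-{e}",
    evaluate=lambda state: losses[int(state.split("-")[1]) - 1],
)
print(best_epoch, best_loss)  # 4 0.29
```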

C Additional Studies
Providing Background Knowledge We design two settings for this test: one where parroting the now-provided commonsense fact is all that is needed to correctly answer the probe, and another where a simple negation switch of the commonsense fact is needed to solve the probe: • A is made of glass, B is made of stone, and glass is more transparent than stone, so A is [MASK] transparent than stone. (parrot) • A is made of glass, B is made of stone, and glass is more transparent than stone, so A is not [MASK] transparent than stone. (negation switch) We do this to investigate whether RoBERTa is actually able to use the provided commonsense fact, or whether it is just pattern matching. We add this piece of background knowledge to the 60 original (unperturbed) statements, along with their corresponding negated statements, to form an "easier" setting of our task. As shown in Figure 6, we find two patterns that PTLMs exhibit. For RoBERTa, ALBERT, and GPT-2, there is a stark difference in performance between the two settings. When asked to parrot the commonsense fact, their performance jumps to near-perfect scores; however, when all they have to do is the equivalent of applying a negation operator to the fact, they fail even worse than when the fact is not provided. These results suggest that in the easier parrot setting, RoBERTa, ALBERT, and GPT-2 are likely just parroting the commonsense fact they see in the sentence rather than exercising any reasoning ability, as they fail when asked to perform the simplest of logical operations. The other pattern we notice is that providing background knowledge neither helps nor hurts the performance of COMET and the models tested on the textual entailment task.
For the COMET models, this may be because COMET is trained on knowledge-base triplets (given a head entity and a relation, predict the tail entity), so it is not accustomed to taking auxiliary knowledge in its input. As for the models fine-tuned on MNLI, the performance stays unchanged because they still judge most of our probes' sentence pairs as neutral, failing to grasp the embedded logical inference step.
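The construction of the two easier-setting probes above is purely templatic and can be sketched as follows (the function name and template wording are ours, for illustration):

```python
def build_easier_probes(mat_a, mat_b, comparative):
    """Build the parrot and negation-switch probes for one commonsense fact,
    e.g. ('glass', 'stone', 'transparent'). The gold fill is 'more' for the
    parrot probe; under the negation switch it flips (e.g. to 'less')."""
    fact = (f"A is made of {mat_a}, B is made of {mat_b}, "
            f"and {mat_a} is more {comparative} than {mat_b}, ")
    parrot = fact + f"so A is [MASK] {comparative} than {mat_b}."
    negation_switch = fact + f"so A is not [MASK] {comparative} than {mat_b}."
    return parrot, negation_switch

parrot, neg = build_easier_probes("glass", "stone", "transparent")
print(parrot)
print(neg)
```

Because only the single token "not" separates the two probes while the gold answer flips, a model that merely copies the comparative from the stated fact will ace the first probe and fail the second, which is exactly the pattern observed for RoBERTa, ALBERT, and GPT-2.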

Case Study on Contextual Clues
[Figure 6 caption: Results when we provide background knowledge in our probes. For RoBERTa, ALBERT, and GPT-2, accuracy increases sharply when the knowledge is provided, but they are merely parroting what appears in the context: applying a negation in the probe, which should change the prediction, leaves their predictions unchanged and performance drops. For the COMET models and the models tested in the NLI setting, we do not observe the same pattern; adding knowledge neither helps nor hurts.]
To gain a better understanding of model behaviors, we conduct an analysis to identify context words that the model relies on when solving our probes. We use
the SmoothGrad (Smilkov et al., 2017) algorithm from AllenNLP Interpret (Wallace et al., 2019) for masked word prediction on our probes with real people's names (the same set as our ablation study) using BERT. Aggregated across all probe sets, we find that the three words BERT finds most important are: "than", "not", and "so", which make sense as they are indicators for comparison, negation, and causality, respectively. "Not" and "so" are also the textual forms of the logical connectives ¬ and →, which we use to construct LTs.
Furthermore, we find that BERT also regards argument words (inputs into LTs' predicates via a knowledge table, such as "lawyer" or "knowledge of law") as important. The model finds on average 3.4 words as contextual clues, and 1.5 of them are knowledge-specific argument words. This finding shows that a PTLM is able to recognize words specific to the commonsense axiom tested. However, noticing all these clues does not necessarily aid a PTLM's ability to understand their logical implications, as evidenced by its performance. In other words, a PTLM, in this case BERT, knows that these words are important when making a decision, but it does not know how to properly answer RICA's questions based on these lexical signals.

Table 7: All perturbations for one probe.

linguistic perturbation | asymmetric perturbation | probe
original | original | A is wider than B, so A finds it harder to slip through cracks than B
original | asymmetric premise | B is wider than A, so A finds it easier to slip through cracks than B
original | asymmetric conclusion | A is wider than B, so B finds it easier to slip through cracks than A
negation | original | A is wider than B, so A does not find it easier to slip through cracks than B
negation | asymmetric premise | B is wider than A, so A does not find it harder to slip through cracks than B
negation | asymmetric conclusion | A is wider than B, so B does not find it harder to slip through cracks than A
antonym | original | A is wider than B, so A finds it easier to be blocked by cracks than B
antonym | asymmetric premise | B is wider than A, so A finds it harder to be blocked by cracks than B
antonym | asymmetric conclusion | A is wider than B, so B finds it harder to be blocked by cracks than A
paraphrase | original | A is wider than B, so A is worse at fitting into openings than B
paraphrase | asymmetric premise | B is wider than A, so A is better at fitting into openings than B
paraphrase | asymmetric conclusion | A is wider than B, so B is better at fitting into openings than A
paraphrase inversion | original | A is wider than B, so A is more impeded by small openings than B
paraphrase inversion | asymmetric premise | B is wider than A, so A is less impeded by small openings than B
paraphrase inversion | asymmetric conclusion | A is wider than B, so B is less impeded by small openings than A
negation antonym | original | A is wider than B, so A does not find it harder to be blocked by cracks than B
negation antonym | asymmetric premise | B is wider than A, so A does not find it easier to be blocked by cracks than B
negation antonym | asymmetric conclusion | A is wider than B, so B does not find it easier to be blocked by cracks than A
negation paraphrase | original | A is wider than B, so A is not better at fitting into openings than B
negation paraphrase | asymmetric premise | B is wider than A, so A is not worse at fitting into openings than B
negation paraphrase | asymmetric conclusion | A is wider than B, so B is not worse at fitting into openings than A
negation paraphrase inversion | original | A is wider than B, so A is not less impeded by small openings than B
negation paraphrase inversion | asymmetric premise | B is wider than A, so A is not more impeded by small openings than B
negation paraphrase inversion | asymmetric conclusion | A is wider than B, so B is not more impeded by small openings than A
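The asymmetric perturbations listed above follow a regular pattern: "asymmetric premise" swaps the entities in the premise and flips the conclusion's comparative, while "asymmetric conclusion" keeps the premise but swaps the entities (and flips the comparative) in the conclusion. A minimal sketch of that pattern, with function and parameter names of our own choosing:

```python
def asymmetric_variants(x, y, prem_cmp, concl_cmp, concl_cmp_flipped, conclusion):
    """Generate the three asymmetric variants of one probe.

    conclusion(subject, obj, cmp) is a template callable that renders the
    conclusion clause; prem_cmp is the premise comparative, and concl_cmp /
    concl_cmp_flipped are the conclusion comparative and its opposite."""
    return {
        "original":
            f"{x} is {prem_cmp} than {y}, so {conclusion(x, y, concl_cmp)}",
        "asymmetric premise":   # swap premise entities, flip comparative
            f"{y} is {prem_cmp} than {x}, so {conclusion(x, y, concl_cmp_flipped)}",
        "asymmetric conclusion":  # swap conclusion entities, flip comparative
            f"{x} is {prem_cmp} than {y}, so {conclusion(y, x, concl_cmp_flipped)}",
    }

crack = lambda subj, obj, cmp: f"{subj} finds it {cmp} to slip through cracks than {obj}"
variants = asymmetric_variants("A", "B", "wider", "harder", "easier", crack)
print(variants["asymmetric premise"])
# B is wider than A, so A finds it easier to slip through cracks than B
```

Composing this with the linguistic perturbations (negation, antonym, paraphrase, paraphrase inversion) yields the full set of probe variants.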