“I’m Not Mad”: Commonsense Implications of Negation and Contradiction

Natural language inference requires reasoning about contradictions, negations, and their commonsense implications. Given a simple premise (e.g., “I’m mad at you”), humans can reason about the varying shades of contradictory statements ranging from straightforward negations (“I’m not mad at you”) to commonsense contradictions (“I’m happy”). Moreover, these negated or contradictory statements shift the commonsense implications of the original premise in interesting and nontrivial ways. For example, while “I’m mad” implies “I’m unhappy about something,” negating the premise does not necessarily negate the corresponding commonsense implications. In this paper, we present the first comprehensive study focusing on commonsense implications of negated statements and contradictions. We introduce ANION, a new commonsense knowledge graph with 624K if-then rules focusing on negated and contradictory events. We then present joint generative and discriminative inference models for this new resource, providing novel empirical insights on how logical negations and commonsense contradictions reshape the commonsense implications of their original premises.


Introduction
Humans reason about underlying causes and effects of events described in text. For example, in Figure 1, the event "X wears a mask" is associated with many causal inferences, such as "X is seen as responsible" or "Others get protected." Hypothesizing and reasoning about commonsense inferences is used for understanding complex situations encountered in everyday life (Sap et al., 2019; Bisk et al., 2020; Bhagavatula et al., 2020; Sakaguchi et al., 2020). This ability eludes AI systems, and has motivated the design of a wealth of commonsense knowledge resources and benchmarks.1

1 Data and code available at https://github.com/liweijiang/anion

Figure 1: Commonsense inferences for the event "X wears a mask," its logical negation and commonsense contradiction events, and their associated inferences.
However, reasoning about negated observations remains a challenge (Hossain et al., 2020). While negation is often considered a poorer form of meaning than affirmation (Ackrill, 1963; Horn and Wansing, 2020), negated statements can still imply expressive commonsense inferences. In Figure 1, the negated event "X doesn't wear a mask" is connected to rich commonsense inferences, despite describing the absence of action. However, negated observations are rarely found in commonsense knowledge resources. For example, negated examples make up only ∼3% of examples in the ConceptNet knowledge graph (Li et al., 2016).
This scarcity poses downstream issues for systems that must understand negated situations. Commonsense knowledge models (Bosselut et al., 2019; Hwang et al., 2020) trained on resources of largely affirmative instances struggle particularly with negation examples. Their ability to hypothesize inferences for negated events is 35% lower than for affirmative events (§4.2). Furthermore, since negated statements are asymmetrically mentioned in text compared to affirmative statements (Jowett et al., 1892; Horn and Wansing, 2020), large-scale pretraining does not implicitly learn negation scoping (Kim et al., 2019). As a result, when presented with negated concepts, pretrained neural language models (PLMs) often exhibit the same associations as affirmative statements (Kassner et al., 2020). Motivated by these observations, our work focuses on improving the ability of knowledge models to make commonsense inferences about events that convey denial, rejection, or contradiction of actions.
We define our contributions as follows. First, we crowdsource a new large-scale resource, Array of commonseNse Inferences for Oppositions and Negations (ANION), which contains inferences for different types of negated events. This new resource can be used to train knowledge models on commonsense inferences associated with the absence of actions. Second, we propose a new class of negation discriminators that can be applied to generated commonsense inferences. These discriminators partition inferences based on logical consistency, thereby mitigating the effects of common affirmative associations that violate negation constraints. Discriminators are trained using contrastive samples from paired affirmative and negated events in ANION. Finally, we conduct an empirical study of both of these techniques and show that using training- and discriminator-based approaches for modeling negation cuts the performance difference between affirmative and negated events by 73%–85%, depending on the negation variety.

Commonsense Negation
Negation in Language In Categories and De Interpretatione, Aristotle classifies declarative statements into affirmations and negations, which respectively affirm or deny observations about an event (Ackrill, 1963). Despite this seeming simplicity, natural language often expresses negation in complex and subtle ways, using diverse syntactic, semantic, and pragmatic formulations (Horn and Wansing, 2020). For example, syntactically, different negation determiners (i.e., negation cues) such as no, few, and only result in distinct explicit and implicit negative perceptions (Xiang et al., 2016).
Despite their diversity, however, negated language expressions are much less likely to appear in text than affirmative statements (Reitan et al., 2015). Consequently, PLMs, which rely on large-scale textual corpora as training data, are prone to decreased performance when confronted with negated constructions. In machine translation, for example, the presence of negation may heavily affect the quality of produced translations (Fancellu and Webber, 2015; Hossain et al., 2020). In factual knowledge understanding tasks, PLMs memorize positive and negative sentences seen during training, but generalize more poorly to unseen negated instances (Kassner and Schütze, 2020).
Negation in Commonsense Reasoning Understanding negation and oppositional expressions is critical for reasoning about commonsense knowledge, particularly in counterfactual scenarios (Qin et al., 2019). However, negation is rarely explicitly modeled in NLP studies on commonsense reasoning. As a result, in many NLP tasks, models experience a performance drop when presented with examples exhibiting negated characteristics.
As a case study, the ATOMIC (Sap et al., 2020) knowledge graph encodes social commonsense knowledge about event pre-conditions, event post-conditions, and static attributes in the form of natural language if-then rules. However, despite the fact that ATOMIC provides a rich set of seed events, it comprises an unbalanced set of affirmative events (97.9%) and negated events (2.1%). As a result, when systems link to ATOMIC to retrieve relevant social commonsense inferences, they are likely to recover inferences of affirmative events even when searching for negated instances. Furthermore, knowledge models that use this resource (e.g., COMET; Bosselut et al., 2019) are unlikely to learn implicit differences between inferences of affirmative and negated events. When given negated events, these models often produce associations of counterpart affirmative events. For example, for the negated event "X opposes racism," COMET infers "X intends to be a racist," an association of the affirmative statement, "X supports racism."

Table 1: Example negation cues and sentences for each type of semi-logical negation.

Types | Example Negation Cues | Example Sentences
Affixes | un-, ir-, non-, il-, im-, -less, etc. | X addresses an irrelevant point; X is unlikely to be a spy; X unsaddles the horse
Single-word | not, no, nothing, nobody, few, little, without, never, hardly, rarely, barely, seldomly, etc. | X does not tell the truth to the public; X never eats ice cream; X went to a movie without his friends
Multi-word | no longer, barely/hardly ever, not at all, a lack of, be deprived of, in the absence of, on no condition, by no means, not by any means, under no circumstances, make no attempt to, etc. | X no longer wants to buy a car; X is not at all impressed by Y's ideas; X under no circumstances smokes; X is by no means cheating on Y
Negative verbs | oppose, refuse, resist, avoid, disapprove, lack, discontinue, stop, cease, halt, prohibit, forbid, prevent, reject, fail, etc. | X denies the existence of god; X restrains himself from eating with Y; X refuses to be in a relationship

At the heart of this problem is that inferring commonsense knowledge about negations often requires implicit reasoning. In factual knowledge reasoning, applying logical rules over statements can be effective for handling negative queries (Asai and Hajishirzi, 2020; Ren and Leskovec, 2020). However, directly manipulating affirmative forms with logic-guided rules may fail for commonsense reasoning: the boundary of commonsense inferences between affirmative and negated statements is not always wholly contrastive. Many inferences can be relevant to both forms. The events "X puts the potato in the oven" and "X doesn't put the potato in the oven" could both have an associated inference: "X wants to make dinner." The affirmative event clearly implies this inference. For the negated event to be worth mentioning on its own (Grice et al., 1975), an implicit complementary event (e.g., "X puts the potato in the microwave") would likely hold, which might validate the inference w.r.t. the negated event. To model the defeasibility of commonsense reasoning (Pratt, 1994; Rudinger et al., 2020), modeling both common and contrastive inferences of negated forms is necessary.

ANION: Commonsense Inferences of Oppositions and Negations
To provide a rich resource of commonsense inferences for opposition and negation events, we design ANION. Using the same schema as the ATOMIC knowledge graph (Sap et al., 2020), we initialize 22,483 negated forms paired to original ATOMIC events and crowdsource 627,042 new inferences for these negated events. Consistent with ATOMIC, ANION is constructed using English formulations of events and inferences. We briefly recap ATOMIC and describe the construction of ANION below.

ATOMIC Background
The ATOMIC knowledge graph contains ∼24K base events (e.g., "X plays the piano") with 877K accompanying social commonsense inferences (e.g., "Before, X needs to buy a piano.") along nine dimensions (e.g., xNeed). The full description of ATOMIC relation types can be found in Table 12 in the Appendix.

Overview of ANION Construction
Our knowledge construction pipeline consists of two steps. First, we collect negated and contradictive events by deriving oppositions of events in ATOMIC. Inspired by the distinction made between negation contributed by semantic assertion (explicit negation) and non-asserted content (implicit negation) (Xiang et al., 2016), we define three varieties of negated events: logical negations, semi-logical negations, and commonsense contradictions, which we describe in detail below. Logical and semi-logical negations were heuristically formulated from ATOMIC events. Commonsense contradiction events were crowdsourced from Amazon Mechanical Turk (MTurk). Negated events in ANION are assigned to the same data split as the corresponding affirmative event from which they are derived (e.g., negated events for ATOMIC training set events are found in the ANION training set). Once a list of negated events is compiled, we crowdsource inferences of these new events on MTurk using similar annotation templates as Sap et al. (2020). We design qualifying tasks to filter out unreliable workers and screen their answers manually for quality control purposes.

Logical Negation We define logical negation events as events with the negation cue not added to their original formulation (e.g., "X does not play the piano"). However, different positions of the not modifier in a clause can result in different negation scopes, which can alter the semantics of the event (Councill et al., 2010). To be consistent, we systematically insert not after the subject of the event clause. If necessary, we change verb forms and add auxiliary words (e.g., do, does, did, is, was, can, could, would, should, may, might). For quality control, we have human workers validate each logically negated event form and exclude events that annotators identify as uninterpretable or awkwardly worded. For each created event, we then collect the same nine dimensions of inferences as defined in ATOMIC. In total, we collected 8,285 logically negated events with 225K corresponding inferences (as shown in Table 2). Appendix A.1 provides further details on the compilation of logical negation events.
Semi-logical Negation We define semi-logical negation using explicit cues other than not. We categorize these negation cues (words or phrases) into four subtypes: affixes (e.g., legal/illegal), single-word cues (e.g., never), multi-word cues (e.g., no longer), and negative verbs (e.g., refuse). See Table 1 for examples. We create semi-logical negation events by heuristically adding these cues at different positions of ATOMIC events. As with logically negated events, we avoid grammatically incorrect or semantically awkward events by removing auto-generated instances of low quality. The final set of data includes 5,019 semi-logical negation events. We then crowdsource a total of 138K inferences for these new events. Appendix A.1 provides further details on the compilation of semi-logical negation events.

Table 3: Example commonsense contradiction events crowdsourced for original ATOMIC events.

Event | Commonsense Contradictions
X buys a bicycle | X buys a car; X donates a bicycle
X walks in the door | X stops at the door; X walks out of the building
X works hard all day | X plays games all day; X puts in minimal effort all day
X finishes the story | X starts the story; X stops halfway through the story
X turns the air blue | X secretly curses; X speaks appropriately

Commonsense Contradiction We formulate commonsense contradictions as contradictory statements without negation cues. Commonsense contradiction events are not identifiable as negations on their own, but demonstrate reversed semantic or pragmatic meaning when paired with their affirmative counterparts (e.g., "X eats a hamburger" vs. "X eats a salad"). To obtain commonsense contradictions, we crowdsource two oppositional events for each ATOMIC event, excluding events with blank placeholders representing generic objects, resulting in 40K new commonsense contradiction events. For 9,179 of these events, we crowdsource an additional 262K commonsense inferences. Appendix A.1 provides further details on the crowdsourcing of commonsense contradiction events.

Knowledge Models of Negated Events
ANION can be used as training data for commonsense models to make inferences about negated events. Here, we recap COMET (Bosselut et al., 2019), a commonsense knowledge model, and evaluate how training knowledge models on ANION affects their ability to hypothesize commonsense knowledge for negated and oppositional events.
In ATOMIC and ANION, knowledge is represented as tuples {h, r, t}, where h corresponds to events, such as "X has a nightmare," t corresponds to commonsense inferences about those events, such as "X wakes up," and r corresponds to commonsense inference types, such as "As a result, X does...". Following Bosselut et al. (2019) and Sap et al. (2020), for each event and relation type in ATOMIC, 10 candidate inferences are decoded from COMET using beam search with b=10.
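As a sketch of how such tuples are serialized for a left-to-right knowledge model, the head and relation form a conditioning prompt and the tail is the generation target (the exact token format follows Bosselut et al., 2019; the [GEN]/[EOS] markers below are illustrative placeholders, not the actual COMET vocabulary):

```python
def serialize(h, r, t=None):
    """Format a knowledge tuple {h, r, t} as a conditional generation example.

    During training the model learns to produce t given "h r [GEN]";
    at inference time t is omitted and candidates are decoded with
    beam search. Marker tokens here are illustrative only.
    """
    prompt = f"{h} {r} [GEN]"
    return prompt if t is None else f"{prompt} {t} [EOS]"
```

A usage example: `serialize("X has a nightmare", "xEffect")` yields the inference-time prompt, while passing the tail as well yields a full training string.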

Experiments
As oppositional instances remain challenging to knowledge models such as COMET, we evaluate how ANION can be used to augment the type of examples seen by COMET during training.
Evaluation Metrics Following Bosselut et al. (2019), we evaluate the quality of generated inferences using BLEU-2 (Papineni et al., 2002) as an automatic evaluation. We also compute the perplexity of models on their reference generations.
For the human evaluation, we employ human judges from MTurk to identify whether generated commonsense inferences are plausible. We randomly sample 100 events from the original ATOMIC test set along with their negated counterparts from ANION. For each event, we present every decoded inference to five crowdworkers and ask them to identify whether the inference is plausible given the event. For each model trained on a different combination of ATOMIC and ANION (i.e., ANION-L, ANION-S, ANION-C), we evaluate the same events for comparison. We calculate Precision @ 10 (P@10) across these human ratings, i.e., the average number of correct options per event-relation prompt. Specifically, we average the results from 45K ratings to compute the final human score (100 events × 9 relations × 10 options × 5 annotators). The pairwise agreement score of the human evaluation is 63.6, which is on par with other similar commonsense reasoning annotation tasks (Rashkin et al., 2016).
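The P@10 computation described above can be sketched as follows (a minimal reimplementation under our reading of the metric; the function name and data layout are our own):

```python
def precision_at_10(ratings):
    """Average number of plausible options per event-relation prompt.

    ratings: a list of prompts; each prompt holds 10 candidate
    inferences; each candidate holds 5 binary annotator judgments
    (1 = plausible). Averaging all individual ratings and scaling by
    10 gives the mean number of correct options per prompt.
    """
    all_judgments = [j for prompt in ratings
                     for candidate in prompt
                     for j in candidate]
    return 10 * sum(all_judgments) / len(all_judgments)
```

For instance, one prompt where all five annotators accept 7 of the 10 candidates and reject the other 3 scores P@10 = 7.0.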

Does negated event training improve commonsense inference for negated situations?
We train a COMET model on the events from ATOMIC (i.e., COMET-ATOMIC), and another on the examples from both ATOMIC and ANION (i.e., COMET-FULL). The combined dataset is shuffled so that the original and negated examples are uniformly mixed during training.
We report our comparison of these two models in Table 4. The performance of the original COMET model trained only on the ATOMIC knowledge graph drops significantly across all types of oppositional instances. Most surprisingly, a drop in performance is also observed on commonsense contradictions (ANION-C), which have no explicit negation cues. However, commonsense contradiction events can often be richer in content (see Table 3), making them more challenging for knowledge models. Meanwhile, training on all negated examples in the ANION knowledge graph produces significant improvements across all negation categories (ANION-{L,S,C}), though we do observe a slight drop in human ratings on the examples from the original ATOMIC test set.

Does negated event training deteriorate commonsense inference of affirmative situations?
We note in Table 4 that training on ATOMIC + ANION hurts inference performance on the original ATOMIC evaluation set. To analyze why COMET-FULL does not improve on this set of examples, we perform a case study on inferences generated by COMET-ATOMIC and COMET-FULL under the same event and relation prompt, and note two qualitative patterns.
First, we observe that COMET-FULL tends to generate inferences that are less generic, but that may require additional implicit context. For example, for the event "X is really sad" and the relation xEffect (i.e., the effect of the event on X), COMET-ATOMIC generates inferences such as "cries," "gets depressed," and "takes medication." Conversely, COMET-FULL generates context-specific inferences such as "thinks about the past" and "thinks about what they did," which, while plausible in some contexts, may be less straightforward when evaluated broadly (not all feelings of sadness lead to reflection on the past or one's own actions).
Second, we find an overall improvement for certain compositional events in ATOMIC that contain the conjunction words "and" or "but." On these examples, COMET-FULL outperforms COMET-ATOMIC, with BLEU-2 scores of 12.41 and 12.22, respectively. For example, for the event "X is hot and humid" and the relation xEffect, COMET-ATOMIC's generations include correct inferences, such as "to take a shower," "to cool down," "to drink some water," and "to go outside," and incorrect inferences, such as "to turn on the heat" and "to drink a hot tea." COMET-FULL generates all of COMET-ATOMIC's correct inferences, but none of the incorrect ones, demonstrating that training COMET jointly on ATOMIC and ANION can help avoid incorrect inferences involving commonsense mismatches in more compositional situations.
In summary, the ability to generate richer, contextual inferences for COMET-FULL is beneficial when handling complex events, but may not be necessary for many of the simple events in ATOMIC, and may backfire when subtler inferences are made.
Which variety of negated events is most crucial to include in training sets? As ablations, we train additional models using different subsets of ANION: logical negations (ATOMIC + ANION-L), semi-logical negations (ATOMIC + ANION-S), and commonsense contradictions (ATOMIC + ANION-C). These ablations evaluate whether knowledge models can adapt to certain types of negation more efficiently with additional data.

Discriminating Inconsistent Inferences
While training on examples of negated events helps knowledge models generate commonsense inferences for these event types, there is still a large gap compared to their performance on affirmative events. To address this discrepancy, we introduce a discriminator-based approach for distinguishing inconsistent inferences of negated events. Our inference discriminator learns to identify plausible and invalid inferences of events by learning from contrastive samples from ATOMIC and ANION.

Experimental Setup
We fine-tune the RoBERTa-base model (Liu et al., 2019) as a binary classifier to identify whether a given knowledge tuple {h, r, t} is logically valid. The model is trained on paired original and negated events as described below. Such training examples inject the implicit commonsense nuances that differ between oppositional events, teaching the discriminator to identify logical pitfalls. Training details for discriminators can be found in Appendix A.3.
Data The paired events used to train the negation discriminator are automatically constructed from the ATOMIC and ANION knowledge graphs.
Positive examples can be constructed by sampling tuples from each knowledge graph.To construct negative training samples, we introduce the concept of common and contrast sets among inferences of events and their oppositions.
Common and contrast sets capture the fact that commonsense inferences are not necessarily negated in the same manner as their corresponding events. While certain inferences of an event are in opposition to those of its negated counterpart, others may be common to both. For the events "X eats a cheeseburger" and "X eats a salad," an inference such as "X is hungry" might be common to both events, while inferences such as "X is unhealthy" or "X is healthy" would be viewed as contrastive.
Specifically, we take two paired head events in ATOMIC and ANION and their respective sets of tail inferences for a common relation type. We define the common set of these inferences as the intersection of the two sets of tail inferences connected to each head event, computed by exact match of string forms. The contrast set is formed by the distinct tail inferences connected to the two head events. Logically valid (i.e., positive) training examples consist of knowledge tuples from ATOMIC and ANION. Logically invalid (i.e., negative) training examples are formed by swapping the contrast set inferences between paired original and negated events.3 To balance the training set, we sample the same number of positive and negative tuples for original and negation events. Statistics of the resulting training sets are in Table 6.
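The construction above can be sketched directly with set operations (a minimal sketch; function and variable names are our own):

```python
def build_examples(h_orig, h_opposed, rel, tails_orig, tails_opposed):
    """Construct discriminator training tuples from a paired event.

    The common set is the exact-string intersection of the two tail
    sets; each event's contrast set holds its distinct tails. Negative
    examples swap contrast-set tails across the pair.
    """
    orig, opposed = set(tails_orig), set(tails_opposed)
    contrast_orig = orig - opposed        # tails unique to the original event
    contrast_opposed = opposed - orig     # tails unique to the opposed event

    # All annotated tuples are logically valid (label 1).
    positives = [(h_orig, rel, t, 1) for t in orig] + \
                [(h_opposed, rel, t, 1) for t in opposed]
    # Swapping contrast tails across the pair yields invalid tuples (label 0);
    # common tails are excluded, since they hold for both events.
    negatives = [(h_orig, rel, t, 0) for t in contrast_opposed] + \
                [(h_opposed, rel, t, 0) for t in contrast_orig]
    return positives, negatives
```

With the cheeseburger/salad example from above, "X is hungry" lands in the common set and produces only positive tuples, while "X is unhealthy" and "X is healthy" are swapped across the pair as negatives.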

Experiments
Using different portions of ANION for training yields four unique discriminators (L, S, C, and LSC)3 that we apply to commonsense inferences generated by COMET. The discriminators classify each option as either logically valid or invalid, partitioning the candidates into two sets, which we evaluate with human judgments. As a baseline, we also record the precision of not using a discriminator, which assumes all generated inferences are valid candidates (i.e., the all set).

3 We note that annotations in ATOMIC and ANION are finite (i.e., they do not cover the full space of possible commonsense inferences about events). As a result, it is possible that in a more expansive annotation, elements of the contrast sets would in fact be part of the common set of an event and its negation. For the purposes of this work, however, contrast sets were an efficient way of acquiring high-quality semantically negative examples for training discriminators.

Table 7: Evaluation of the all, valid, and invalid sets of inferences generated by COMET-ATOMIC, as partitioned by the LSC discriminator. P@k corresponds to the human-rated precision of a set, where k is the number of elements in the all, valid, or invalid set. For the valid set, higher P@k is better (i.e., more valid inferences are being partitioned in); for the invalid set, lower P@k is better (i.e., fewer valid inferences are being included).

Metrics We evaluate and compare the quality of the all, valid, and invalid sets using BLEU-2 and the same human evaluation as in §4. The all set contains the full set of 10 candidates, while the valid and invalid sets have varying numbers of elements depending on how the discriminators classify them, summing to 10. To compute statistical significance between the valid and all sets, we use a permutation test with 100K permutations. Details are provided in Appendix A.4.

Do discriminators effectively distinguish inconsistent inferences? The results in Table 7 show that discriminators partition generated inferences into a subset (i.e., the valid set) that is more logically consistent with the seed event. This observation holds across all evaluation subsets of ANION, as well as the original ATOMIC evaluation set. The LSC discriminator is notably good at identifying invalid inferences wrongly associated with corresponding affirmative events (e.g., "athletic" and "careless" for the event "X does not skate around" under the relation xAttr).
However, this analysis leaves open the possibility that we are generating too many inferences for each event, but that the decoder could rank correct inferences higher among the full set of generated candidates. To evaluate this possibility, we count the number of elements in the valid set for each example and keep only the same number of top-scoring elements from the all set (scored using generation perplexity). In Table 9, we see that the average precision score for the pruned all sets (P@{# valid}) still underperforms the precision of their corresponding valid sets.
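The pruning step amounts to keeping the k candidates the decoder itself ranks highest, where k is the size of the discriminator's valid set (a minimal sketch; names are our own):

```python
def prune_all_set(candidates, perplexities, k):
    """Keep the k candidates with the lowest generation perplexity.

    Lower perplexity means the decoder ranks the candidate higher, so
    this is the decoder's own best-k subset, compared against the
    discriminator's valid set of the same size k.
    """
    ranked = sorted(zip(candidates, perplexities), key=lambda pair: pair[1])
    return [candidate for candidate, _ in ranked[:k]]
```

Comparing the human-rated precision of this pruned set against the valid set of equal size isolates whether the discriminator adds information beyond the decoder's own ranking.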
Which negation categories are most important to provide a discriminator for? To examine the generalization effects of each negation type, we also train discriminators on a single negation subset of ANION (i.e., L, S, or C) and compare the P@{# valid} scores of the all and valid sets. Results in Table 9 indicate that each discriminator is best at identifying valid inferences for the types of events on which it was trained. The L, S, and C discriminators all achieve improvements when partitioning events similar to their training. However, the LSC discriminator trained on all negation forms shows the largest valid set improvement across all discriminators on ATOMIC, ANION-S, and ANION-C. On ANION-L, the LSC discriminator still yields a significantly improved valid set.

Discussion
Are learning-based and discriminator-based approaches complementary? We apply our discriminators to the generations of the COMET model trained on ANION. In Table 10, we see that the LSC discriminator, when applied to generations of COMET trained on ANION, achieves significant improvements over all evaluation sets, including the original events. The full evaluation of the P@{# valid} and P@3 scores of applying different discriminators to generations of COMET trained on different data over all evaluation sets is shown in Tables 13 and 14 in Appendix A.
Can discriminators be used to more aggressively generate inferences? While applying discriminators to generated inferences yields a valid subset with higher accuracy, we are left with fewer correct inferences in total. Thus, we investigate the efficacy of using discriminators to expand the number of inferences generated. We decode inferences from COMET with beam size 25, and then apply the discriminator to this larger candidate set.
Table 11 shows that for logical negation, the valid set of beam 25 has higher accuracy and more correct options than the all set of beam 10. Thus, when we have a larger and potentially noisier set of candidates, applying the negation discriminator yields a set of options with higher quality than using all the candidates from a smaller set of initial generations.

Conclusion
We present the first comprehensive study on commonsense implications of negations and contradictions. To expand commonsense resources for the challenge of negation modeling, we introduce ANION, a large-scale commonsense knowledge graph for negated and contradicted events. We use ANION to train commonsense knowledge models and demonstrate that it effectively enriches machine commonsense inference capabilities around negation. Lastly, we propose a negation discriminator capable of identifying logical flaws in commonsense inferences. By combining the model trained on ANION with the negation discriminator, we achieve a further performance boost.

Ethical Considerations ANION Language Choice and Implications
We select English as the base language of ANION so that our resource may be directly linked with the original ATOMIC knowledge graph. We acknowledge, however, that resources in English are more likely to reflect the mindsets and behaviors of English speakers. Furthermore, and in our case specifically, our annotators were primarily from the US. Consequently, this language choice biases the content of the knowledge graph toward North American perspectives, which affects what models trained on these resources would learn about social norms (Acharya et al., 2021). Future work may include other languages and cultures to make the ANION resource more culturally and ideologically inclusive.

Crowdworker Recruitment, Quality Control and Remuneration
We recruit crowdworkers on MTurk who are located within the US and have HIT approval rates higher than 98%. To ensure high-quality task completions, we post pilot batches and manually examine tens of thousands of responses to identify users who provide high-quality annotations. We select 834 qualified users for the formal data collection and human evaluation tasks. Since the entire study spans multiple months, we regularly sample responses to re-examine their quality during the formal study, and remove HITs from crowdworkers whose response quality decreases over time. We are particularly cautious about the human evaluation tasks, so even with qualified users, we comprehensively examine tens of thousands of human evaluation tasks by grouping HITs per user and looking at their responses together to identify potential spamming behaviors and inconsistencies.
For the data collection and human evaluation tasks, we aimed to compensate crowdworkers at an average of $15 per hour. To ensure fair payment, we first post a pilot task to estimate the average time cost of a specific task, paying users at a high rate in this round to avoid underpayment during the pilot study. We then calculate a new payment from the pilot task such that approximately 75% of the pilot HITs would have been paid more than $15 per hour at the adjusted rate. We adopt this new rate for the formal study, and repeat this procedure periodically to ensure that crowdworkers remain consistently well paid.
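Under our reading, the adjusted per-HIT payment follows from the distribution of pilot completion times: paying $15/hour at the 75th-percentile completion time means the roughly 75% of workers who finish faster earn more than $15/hour. A sketch (an illustrative reconstruction of the procedure, not the authors' exact formula; names and data are our own):

```python
import math

def adjusted_payment(pilot_minutes, target_hourly=15.0, quantile=0.75):
    """Per-HIT payment so ~`quantile` of pilot HITs beat `target_hourly`.

    Uses the nearest-rank completion time at the requested quantile;
    workers finishing faster than that time earn a higher effective
    hourly rate.
    """
    times = sorted(pilot_minutes)
    idx = max(0, math.ceil(quantile * len(times)) - 1)
    t_q = times[idx]                  # quantile completion time (minutes)
    return target_hourly * t_q / 60.0  # dollars per HIT
```

For example, with pilot completion times of 4, 5, 6, and 8 minutes, the 75th-percentile time is 6 minutes, giving a $1.50 per-HIT payment.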

A Appendices
A.1 ANION Data Collection Details

Heuristics for Creating Logical and Semi-logical Negation Events For logical negation, since the majority of the original events are simple sentences with one predicate, our general rule of thumb is to negate the original event at the sentence level. Specifically, for each original event, we first identify each token's part-of-speech (POS) tag via the NLTK toolkit. Then, we insert the negation cue not after the subject of each sentence, in the majority of cases the entity "PersonX," with few exceptions such as "PersonX's" and "PersonX and PersonY." To ensure the grammatical correctness of the heuristically generated logical negation events, we add appropriate auxiliary verbs (e.g., do, does, did, is, was, can, could, would, should, may, might) in accordance with the tenses (e.g., present, past, future) of the original events. Since NLTK's POS parser fails to recognize some verbs that have both noun and verb usages (e.g., "waters" the plant, "supports" her argument), we curate a list of such dual-use words and map them manually. Also, while converting the original events to their logical negation counterparts, we revise grammar mistakes from ATOMIC and exclude awkward expressions as much as possible. In addition, to make the negated forms sound more natural, we replace the modifier "some" by "any" during conversion (e.g., "PersonX buys some shoes" is converted to "PersonX doesn't buy any shoes"). We disregard the minority of compound events with clauses or complex sentence structures to ensure data quality.
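The insertion rule can be sketched as follows (a deliberately simplified toy version: it handles only single-predicate, present-tense events with a one-token subject, and substitutes naive suffix rules for NLTK's POS tagging, tense handling, and the manual dual-use word list described above):

```python
AUXILIARIES = {"is", "was", "are", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "may", "might"}

def negate(event):
    """Insert 'not' after the subject, adding 'does' for bare verbs."""
    subject, verb, *rest = event.split()
    if verb in AUXILIARIES:
        # "PersonX is happy" -> "PersonX is not happy"
        tokens = [subject, verb, "not", *rest]
    else:
        # De-inflect a third-person singular present verb (naive suffix rules).
        if verb.endswith("ies"):
            base = verb[:-3] + "y"
        elif verb.endswith(("ches", "shes", "sses", "xes", "zes")):
            base = verb[:-2]
        elif verb.endswith("s"):
            base = verb[:-1]
        else:
            base = verb
        tokens = [subject, "does", "not", base, *rest]
    # Negated contexts read more naturally with "any" than "some".
    tokens = ["any" if t == "some" else t for t in tokens]
    return " ".join(tokens)
```

For example, "PersonX buys some shoes" becomes "PersonX does not buy any shoes"; the real pipeline additionally corrects grammar, matches tense, and filters awkward outputs by hand.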
For semi-logical negation events, we curate a list of semi-logical negation cues other than not from various sources5 (Councill et al., 2010; Hossain et al., 2020; Kim et al., 2019) and categorize them into four types: affixes, single-word cues, multi-word cues, and negative verbs (Table 1). We identify appropriate rules for inserting each semi-logical negation cue into simple base events from ATOMIC consisting of a subject and a predicate. We apply these rules to original ATOMIC events and randomly select at least 200 automatically generated semi-logical negation events per negation cue for manual screening by the first author, to catch misplaced negation cues and awkward expressions. In the end, we identified 5,019 high-quality semi-logical negation events originating from ATOMIC.
As a final quality-control step for the constructed logical and semi-logical events, after obtaining the crowdsourced inferences for each event, we remove all events that annotators flag as "unclear," "doesn't make sense," or "grammatically wrong."
Crowdsourcing of Commonsense Contradiction Events To collect commonsense contradiction events, we present an original ATOMIC event to annotators and ask them to formulate corresponding opposite events. We exclude ATOMIC events with placeholders representing generic objects, to capture semantic and pragmatic subtlety. In the MTurk task, we present annotators with detailed instructions for formulating opposite events (e.g., avoid negative words as much as possible, use complete sentences, follow grammar rules) and concrete examples as references. Figure 2 shows details of the MTurk task. Although we explicitly instruct annotators to avoid negation cues, some exceptions remain; therefore, after compiling all commonsense contradiction events, we remove any that contain explicit negation cues to keep the categorization clean.

Crowdsourcing of ANION Event Inferences
For the collection of ANION event inferences, we adopt the MTurk templates used for the original ATOMIC data collection6. As with logical and semi-logical events, we remove all inferences of events that annotators flag as "unclear," "doesn't make sense," or "grammatically wrong."

A.2 Training Details of COMET Models
Input A knowledge tuple {h, r, t} is represented as a concatenated sequence of the tokens of each element in the tuple: X = {X^h, X^r, X^t}, where X^h = {x^h_0, ..., x^h_{|h|}} are the tokens comprising the event, X^r = {x^r_0, ..., x^r_{|r|}} are the tokens comprising the relation, and X^t = {x^t_0, ..., x^t_{|t|}} are the tokens comprising the commonsense inference.
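The input flattening can be sketched as follows. This is an illustrative sketch only: a whitespace splitter stands in for the GPT2 BPE tokenizer actually used, and the relation token string is hypothetical.

```python
# Sketch of flattening a knowledge tuple {h, r, t} into one token
# sequence X = {X^h, X^r, X^t}. A toy whitespace tokenizer stands in
# for the model's real subword tokenizer.

def encode_tuple(head: str, relation: str, tail: str) -> list[str]:
    x_h = head.split()    # X^h: tokens of the event h
    x_r = [relation]      # X^r: the relation as a single special token
    x_t = tail.split()    # X^t: tokens of the commonsense inference t
    return x_h + x_r + x_t

seq = encode_tuple("PersonX wears a mask", "<xEffect>", "others get protected")
# -> ['PersonX', 'wears', 'a', 'mask', '<xEffect>', 'others', 'get', 'protected']
```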
Initialization Similar to Bosselut et al. (2020), we initialize COMET's parameters with the 345M-parameter GPT2 model (GPT2-M) from Radford et al. (2019). Special tokens representing relation types (e.g., xIntent) are added to the vocabulary and initialized by sampling from the normal distribution.
Hyperparameters Following Bosselut et al. (2019), we use a dropout rate of 0.1 and GeLU units (Hendrycks and Gimpel, 2020) as activation functions. During training, we use the Adam optimizer (Kingma and Ba, 2017) with a batch size of 64. For COMET models trained on different subsets of the ATOMIC and ANION datasets, we adopt a maximum learning rate of 6.25e-5 with a warmup period of 0.002 times the total number of minibatches (customized for each model), after which the learning rate decays linearly until the end of training.
We train separate COMET models on the original data (ATOMIC), original plus logical negation data (ATOMIC + ANION-L), original plus semi-logical negation data (ATOMIC + ANION-S), original plus commonsense contradiction data (ATOMIC + ANION-C), and the overall dataset (ATOMIC + ANION) for 21K, 25K, 24K, 24K, and 29K minibatches respectively, applying early stopping for all models. The remaining hyperparameters are the same as those of GPT2-M in Radford et al. (2019), implemented via the publicly available HuggingFace API7.
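The learning-rate schedule described above can be sketched directly. This is a minimal sketch under the stated hyperparameters (peak 6.25e-5, warmup for 0.002 of total minibatches, here using the 21K-step ATOMIC configuration); the actual training uses HuggingFace's scheduler utilities rather than this hand-rolled function.

```python
# Sketch of linear warmup followed by linear decay to zero, using the
# hyperparameters reported in the text (ATOMIC model: 21K minibatches).

PEAK_LR = 6.25e-5
TOTAL_STEPS = 21_000
WARMUP_STEPS = max(1, int(0.002 * TOTAL_STEPS))  # 0.002 of total minibatches

def lr_at(step: int) -> float:
    """Learning rate at a given minibatch index."""
    if step < WARMUP_STEPS:
        # Linear warmup from ~0 up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Linear decay from the peak down to zero at the final step.
    remaining = TOTAL_STEPS - step
    return PEAK_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))
```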
All models are fine-tuned and evaluated on a single NVIDIA QUADRO RTX 8000 GPU for six to twelve hours depending on the complexity of the experimental setup.

A.3 Training Details of Negation Discriminator
Input As input to the discriminator model, we design sentence patterns that express relation types in natural language, and fill the patterned sentences with events and conditions before encoding them (e.g., "PersonX addresses a talk. As a result, PersonX wants to convince others."). Relations and their corresponding patterned sentences are listed in Table 12. In a pilot study, using patterned sentences proved more effective than simply concatenating the components of knowledge tuples.
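The patterned-sentence construction can be sketched as a template lookup. The patterns below are illustrative examples in the style of Table 12, not necessarily the exact strings used.

```python
# Sketch of converting a knowledge tuple into a patterned
# natural-language sentence for the negation discriminator.
# These template strings are assumptions modeled on Table 12.

PATTERNS = {
    "xWant":   "{h} As a result, PersonX wants {t}.",
    "xEffect": "{h} As a result, {t}.",
    "xIntent": "{h} Before that, PersonX wanted {t}.",
}

def to_sentence(head: str, relation: str, tail: str) -> str:
    """Fill the relation's template with the event and the condition."""
    return PATTERNS[relation].format(h=head, t=tail)

s = to_sentence("PersonX addresses a talk.", "xWant", "to convince others")
# -> "PersonX addresses a talk. As a result, PersonX wants to convince others."
```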

Loss Function The negation discriminator is trained to minimize the binary cross-entropy loss:

L = -[y log ŷ + (1 - y) log(1 - ŷ)]

where y is the label for an input (i.e., logically valid or invalid) and ŷ is the predicted probability that the input is valid.
Hyperparameters Parameters are initialized with the trained weights of the RoBERTa-base model from Liu et al. (2019). During training, we use the Adam optimizer (Kingma and Ba, 2017) with a batch size of 64. We adopt a maximum learning rate of 4.5e-5 with a warmup period of 10 minibatches. We train the L, S, C, and LSC discriminators for 25K, 14K, 21K, and 6K minibatches respectively, applying early stopping for all models. We use a probability threshold of 0.7, chosen based on a pilot study on the development sets, to determine whether an input knowledge tuple is plausible. The remaining hyperparameters are the same as those of RoBERTa-base (Liu et al., 2019), implemented via the publicly available HuggingFace API8. All models are fine-tuned and evaluated on a single NVIDIA QUADRO RTX 8000 GPU for four to six hours depending on the experimental setup.
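The plausibility decision with the 0.7 threshold can be sketched as follows; this is a minimal illustration assuming the discriminator emits a single logit that is squashed with a sigmoid.

```python
# Sketch of the thresholded plausibility decision: the discriminator's
# score is mapped to a probability and compared against the 0.7
# threshold chosen on the development sets.

import math

THRESHOLD = 0.7

def is_plausible(logit: float) -> bool:
    """Return True if the predicted validity probability clears 0.7."""
    prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid over the model logit
    return prob >= THRESHOLD

# e.g. a logit of 2.0 gives prob ~= 0.88, so the tuple is kept as valid
```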

A.4 Statistical Significance Testing
To compare P@{# valid} for the all and valid sets, we use a permutation test9 with 1,000 permutations to test for statistical significance. For multiple comparisons, we use the Bonferroni method (Haynes, 2013) to correct significance thresholds.
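A paired permutation test of this kind can be sketched as below. This is a generic, hedged illustration of the technique, not the paper's exact implementation (which uses an external package); the function name and interface are hypothetical.

```python
# Sketch of a two-sided paired permutation test: randomly swap paired
# scores between the two systems and count how often the permuted mean
# difference is at least as extreme as the observed one.

import random

def permutation_test(scores_a, scores_b, n_perm=1000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(n_perm):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a          # swap this pair's labels
            diff += a - b
        if abs(diff) / n >= observed:
            hits += 1
    return hits / n_perm             # estimated two-sided p-value
```

With Bonferroni correction for k comparisons, each p-value would then be compared against alpha / k rather than alpha.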

A.5 Quality Check for the Human Evaluation
We conduct comprehensive pre- and post-evaluation screening of the workers and the completed tasks to ensure the objectivity and high quality of the evaluations. Besides qualifying workers during pilot batches, we double-check and remove evaluation tasks that were not carefully conducted (e.g., tasks done by workers who selected all or no options across the hundreds of tasks they performed). Figure 3 shows a snippet of the human evaluation MTurk task.

Figure 2: Snippet of the annotation task used to collect commonsense contradiction events.

Figure 3: Snippet of the human evaluation task used to evaluate model-generated tail inferences.

Table 1: Negation cues and examples from ANION.

Table 2: Statistics of ATOMIC and different subsets of ANION (ANION-L + ANION-S + ANION-C).

Table 3: Contradictions of events from ATOMIC.

Table 4: Evaluations of COMET models trained on the ATOMIC and ANION KGs. Training on examples of negated events leads to large improvements in the quality of generated inferences with minimal dropoff in the quality of inferences for affirmative events. Single (*) and double asterisks (**) indicate significance at p<0.05 and p<0.01, respectively.
4.1 Setup
Commonsense transformers (COMET) are generative knowledge models that learn to hypothesize commonsense inferences by training on examples from a knowledge graph. Specifically, COMET receives knowledge tuples in {h, r, t} form during training, where h is a head entity, r is a relation type, and t is a tail entity. The model is trained to maximize the conditional log-likelihood of predicting the tokens of the tail entity t given the tokens of the head entity h and the relation r:

L = - Σ_i log P(x^t_i | x^t_{<i}, X^h, X^r)

In Table 5, we show that training with examples of each negation type improves performance

Table 5 :
Ablation results of models trained and evaluated on different portions of ANION.The best result on each subset of ANION comes from training on similar examples.The model trained on negated events from ANION-L performs the best at generating inferences for the original ATOMIC events.Double asterisks (**) indicate significance at p<0.01.
on the evaluation set related to that negation type. Interestingly, though, training on certain types of negation examples can also yield benefits downstream on other negation types. For example, training on commonsense contradictions (ANION-C) provides a clear benefit when evaluating on semi-logically negated events (ANION-S), as opposed to merely training on ATOMIC. Notably, the knowledge model trained with logically negated examples (ATOMIC + ANION-L) outperforms the model trained only on ATOMIC on all test sets.

Table 6: Statistics of data used to train negation discriminators.

Results in Table 7 demonstrate that the discriminator trained on all subsets of ANION (LSC) can select valid subsets of inferences.

Table 8: Inferences of randomly selected ANION events generated by COMET-ATOMIC. The top 5 options are classified as valid or invalid by the LSC discriminator. V indicates whether an option is classified as valid by the LSC discriminator. P indicates whether an option is judged plausible by humans.

Table 9: P@{# valid} scores of the all and valid sets determined by the L, S, C, and LSC discriminators. Generations are from COMET-ATOMIC. Double asterisks (**) indicate significance at p<0.01. iprv% is the improvement of the valid set over the all set. Underlines show the highest iprv% across discriminators.
Table 8 shows examples of valid and invalid candidates for negated and contradicted events from ANION, as specified by the LSC discriminator.

Table 10: P@{# valid} scores of the all and valid sets determined by the L, S, C, and LSC discriminators. Generations are from COMET-FULL. Single (*) and double asterisks (**) indicate significance at p<0.05 and p<0.01, respectively. iprv% is the improvement of the valid set over the all set. Underlines indicate the highest iprv% across discriminators.
Table 11: Number of correct generations from applying the LSC discriminator to generations of COMET-ATOMIC for beam sizes of 10 and 25 for logical negation events. # is the number of correct options. #total is the number of options in each set.

Table 12: Patterned sentences representing relation types in ATOMIC, used to construct inputs for training negation discriminators.

Table 13: For generations of COMET models trained on different subsets of ATOMIC and ANION, the Precision @ {# valid} scores of the all and valid sets determined by the L, S, C, and LSC discriminators with respect to the original and negation evaluation sets. Single (*) and double asterisks (**) indicate significance at p<0.05 and p<0.01, respectively. iprv% is the percentage improvement of the valid set over the all set.

Table 14: For generations of COMET models trained on different subsets of ATOMIC and ANION, the Precision @ 3 scores of the all and valid sets determined by the L, S, C, and LSC discriminators with respect to the original and negation evaluation sets. Single (*) and double asterisks (**) indicate significance at p<0.05 and p<0.01, respectively. iprv% is the percentage improvement of the valid set over the all set.

Table 15: Randomly selected generations of the original COMET model for logical negation events in ANION-L. The top 5 options are classified as either valid or invalid by the LSC discriminator. V indicates whether an option is classified as valid by the LSC discriminator. P indicates whether an option is judged plausible by humans.

Table 16: Randomly selected generations of the original COMET model for semi-logical negation events from ANION-S. The top 5 options are classified as either valid or invalid by the LSC discriminator. V indicates whether an option is classified as valid by the LSC discriminator. P indicates whether an option is judged plausible by humans.

Table 17: Randomly selected generations of the original COMET model for commonsense contradiction events from ANION-C. The top 5 options are classified as either valid or invalid by the LSC discriminator. V indicates whether an option is classified as valid by the LSC discriminator. P indicates whether an option is judged plausible by humans.