SemEval-2020 Task 4: Commonsense Validation and Explanation

In this paper, we present SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), which includes three subtasks, aiming to evaluate whether a system can distinguish a natural language statement that makes sense to humans from one that does not, and provide the reasons. Specifically, in our first subtask, the participating systems are required to choose from two natural language statements of similar wording the one that makes sense and the one that does not. The second subtask additionally asks a system to select, from three options, the key reason why a given statement does not make sense. In the third subtask, a participating system needs to generate the reason automatically. 39 teams submitted valid systems to at least one subtask. For Subtask A and Subtask B, top-performing teams have achieved results close to human performance. However, for Subtask C, there is still a considerable gap between system and human performance. The dataset used in our task can be found at https://github.com/wangcunxiang/SemEval2020-Task4-Commonsense-Validation-and-Explanation.


Introduction
In the past decades, computers' ability to process natural language has significantly improved. However, their intelligence in understanding common sense expressed in language is still limited. For example, it is straightforward for humans to judge that the following sentence is plausible, or makes sense: "John put a turkey into a fridge", while "John put an elephant into the fridge" does not, but it is non-trivial for a computer to tell the difference. Arguably, commonsense reasoning plays a central role in a natural language understanding system (Davis, 2017). It is essential to gauge how well computers can understand whether a given statement makes sense. In our task, we take an operational definition of making sense by asking human subjects to generate natural language statements that obey or violate their commonsense knowledge about the world. Many existing tasks embed the evaluation of commonsense understanding in other tasks such as coreference resolution (Levesque et al., 2012; Morgenstern and Ortiz, 2015), subsequent event prediction (Roemmele et al., 2011), ordinal common-sense inference (Zhang et al., 2017), situations with adversarial generations (Zellers et al., 2018), event validation (Wang et al., 2018), reading comprehension (Mostafazadeh et al., 2016; Ostermann et al., 2018b; Ostermann et al., 2018a), dialogue (Cui et al., 2020) and QA (Davis, 2016; Talmor et al., 2018; Mihaylov et al., 2018). They verify whether a system is equipped with common sense by testing whether it can give a correct answer when the input does not contain such knowledge. However, these tasks do not directly evaluate commonsense validation, and they do not explicitly identify the key factor required in a commonsense validation process.
The SemEval-2020 Task 4 includes three subtasks on testing whether a system can distinguish natural language statements that make sense from those that do not, and probe the reasons. In the first subtask, a system needs to choose the against-common-sense statement from two natural language statements of similar wording, e.g., "John put an elephant into the fridge" and "John put a turkey into the fridge". The second subtask aims to find, from three provided options, the key reason why a given nonsensical statement does not make sense. For example, for the nonsensical statement "John put an elephant into the fridge", the three options are "An elephant is much bigger than a fridge", "Elephants are usually gray while fridges are usually white", and "An elephant cannot eat a fridge", and a system needs to identify the correct reason. In addition, the third subtask requires the participating systems to generate the reason automatically. We hope that the task and datasets can facilitate studies on commonsense validation, its interpretability, and the related natural language understanding and generation problems.
In total, 39 teams submitted valid systems to at least one subtask. In Subtask A and Subtask B, top-performing systems achieve performance close to that of human subjects. However, for Subtask C, there is still a relatively large gap between system and human performance.

Task Definition
Formally, each instance in our dataset is composed of eight sentences: {s_1, s_2, o_1, o_2, o_3, r_1, r_2, r_3}. s_1 and s_2 are two similar statements that differ by only a few words; one of them makes sense (i.e., conforms to common sense) while the other does not. They are used in our Subtask A, the Validation subtask, which requires a model to identify which one makes sense. For the statement that does not make sense, we have three candidate reasons, i.e., three options o_1, o_2, and o_3; one of them explains why the statement does not make sense. So, in our Subtask B, the Explanation (Multi-Choice) subtask, a model is required to find the correct reason from the three options. For the same nonsensical statement, in Subtask C, the Explanation (Generation) subtask, a participating system needs to generate the reason why it does not make sense. Three references, r_1, r_2, and r_3, are used for evaluating Subtask C. Below we give an example for each subtask, in which we introduce some notations we will use in the paper.

• Subtask A: Validation
Task: Select the statement of the two that does not make sense.
s_1: John put a turkey into a fridge.
s_2: John put an elephant into the fridge.
In this example, s_1 is the sensical statement, also denoted as s_c, while s_2 is the nonsensical statement, also denoted as s_n.
• Subtask B: Explanation (Multi-Choice)
Task: Select the best reason that explains why the given statement does not make sense.
Nonsensical statement (s_n): John put an elephant into the fridge.
o_1: An elephant is much bigger than a fridge.
o_2: Elephants are usually gray while fridges are usually white.
o_3: An elephant cannot eat a fridge.
In this example, option o_1 is the correct reason, also denoted as o_c, while o_2 and o_3 are the incorrect reasons, also denoted as o_n1 and o_n2.
• Subtask C: Explanation (Generation)
Task: Generate the reason why this statement does not make sense.
Nonsensical statement (s_n): John put an elephant into the fridge.
Reference reasons (used for calculating the BLEU score):
r_1: An elephant is much bigger than a fridge.
r_2: A fridge is much smaller than an elephant.
r_3: Most of the fridges aren't large enough to contain an elephant.
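To make the tuple structure concrete, the example above can be sketched as a small data structure (a hypothetical in-memory representation; the released files use their own layout and field names):

```python
# One ComVE instance: two statements, three explanation options, and three
# reference reasons. Field names are illustrative, not the official schema.
instance = {
    "s1": "John put a turkey into a fridge.",        # sensical statement (s_c)
    "s2": "John put an elephant into the fridge.",   # nonsensical statement (s_n)
    "options": [                                     # Subtask B choices
        "An elephant is much bigger than a fridge.", # correct reason (o_c)
        "Elephants are usually gray while fridges are usually white.",
        "An elephant cannot eat a fridge.",
    ],
    "references": [                                  # Subtask C references
        "An elephant is much bigger than a fridge.", # o_c doubles as r_1
        "A fridge is much smaller than an elephant.",
        "Most of the fridges aren't large enough to contain an elephant.",
    ],
}
```

Note that the correct option doubles as one of the generation references, which is why each instance contains eight sentences but only seven distinct ones.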

0: The reason is not grammatically correct, or not comprehensible at all, or not related to the statement at all.
1: The reason is just the negation of the statement or a simple paraphrase. Obviously, a better explanation can be made.
2: The reason is relevant and appropriate, though it may contain a few grammatical errors or unnecessary parts. Or like case 1, but it is hard to write a proper reason.
3: The reason is appropriate and is a solid explanation of why the statement does not make sense.

Table 1: Rubrics used in human evaluation in Subtask C.

Evaluation Metrics
The Subtasks A and B are evaluated using accuracy. Subtask C is evaluated with the BLEU score (Papineni et al., 2002). In addition, for Subtask C, we further perform human evaluation. We randomly select 100 instances from the test set and evaluate system outputs on Amazon Mechanical Turk. We ask three different crowd-sourcing workers to score each generated reason on a scale from 0 to 3, inclusive, according to the rubrics listed in Table 1.
We then average the three annotator scores per instance, and average over instances, to obtain the final human evaluation score. Formally, the human evaluation score of system k is

score_k = (1/100) * sum_{i=1}^{100} (1/3) * sum_{j=1}^{3} score_{ijk},

where score_{ijk} is the score from the j-th annotator for system k on the i-th instance.
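This two-level averaging can be written out directly; a minimal sketch (the function and variable names are ours):

```python
def human_eval_score(scores):
    """Average 0-3 ratings over annotators, then over instances.

    scores[i][j] is the rating from annotator j for the system's output
    on instance i (100 instances x 3 annotators in the task setup).
    """
    per_instance = [sum(row) / len(row) for row in scores]
    return sum(per_instance) / len(per_instance)

# Toy example with two instances and three annotators each.
toy = [[3, 3, 3], [1, 2, 3]]
```

For the toy scores above, the per-instance averages are 3.0 and 2.0, so the system-level score is 2.5.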

Data Construction
Our data construction is mainly performed on Amazon Mechanical Turk and consists of two steps:

• Step 1: In this step, we construct the datasets for Subtask A and Subtask B. Specifically, we ask a crowd-sourcing worker to write a sensical statement s_c and a nonsensical statement s_n. For the nonsensical statement s_n, the worker further writes three sentences, o_1, o_2, o_3; one of them, denoted as o_c, explains why the nonsensical statement does not make sense, while the other two, denoted as o_n1 and o_n2, serve as the confusing choices. (Refer to Section 3.1 for details.)

• Step 2: We then create three reference reasons, r_1, r_2, r_3, for Subtask C. We use o_c as one of the three references and collect two more references in this step, each written by a different crowd-sourcing worker. Instead of letting the worker from Step 1 write these two references, we ask two additional workers, in order to encourage diversity among the references. (Refer to Section 3.2 for details.)

Finally, each instance of the dataset has 8 sentences: {s_1, s_2, o_1, o_2, o_3, r_1, r_2, r_3}. Note that one sentence among o_1, o_2, o_3 is repeated in r_1, r_2, r_3, but for convenience of description, we denote it differently.

3.1
Step 1: Collecting Data for Subtasks A and B

Annotation Guidelines. When writing instances, workers were asked to follow several principles: (1) Try to avoid complex knowledge and focus on daily common sense. Make the questions as understandable as possible, so that a literate person is able to give the right answers.
(2) The confusing reason options, o_n1 and o_n2, should preferably contain the content words or information, such as entities and activities, from the nonsensical statement s_n. For example, the confusing reasons for "John put an elephant into the fridge" should preferably contain both "elephant" and "fridge".
(3) The confusing reasons, o_n1 and o_n2, should be related to the statement s_n and the correct reason o_c and should not deviate from the context; otherwise they may be easily identified by pretrained models like BERT (Talmor et al., 2018). (4) o_1, o_2, and o_3 should only be related to the nonsensical statement s_n rather than the correct statement s_c, because we want further studies to be able to evaluate nonsensical statements s_n without the correct statement s_c. (5) The confusing reasons, o_n1 and o_n2, should make sense themselves. Otherwise, the models may simply ignore the incorrect options o_n1, o_n2 without considering the causal semantics. This concern is motivated by the fact that models can achieve high performance on the ROC Story Cloze Task when only looking at the alternative endings and ignoring the story content (Schwartz et al., 2017). (6) We ask the annotators to make the nonsensical statement s_n contain about the same number of words as the sensical statement s_c, and to give the correct reason o_c a length similar to the other two options. We drop the instances that do not meet these requirements.
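Guideline (6) amounts to a simple length-balance filter. A rough sketch, where the 1.5 tolerance ratio is our illustrative choice, not the organizers' exact threshold:

```python
def length_balanced(s_c, s_n, options, max_ratio=1.5):
    """Accept an instance only if the two statements have similar word
    counts and the three reason options have similar word counts."""
    def wc(sentence):
        return len(sentence.split())

    def similar(lengths):
        return min(lengths) * max_ratio >= max(lengths)

    return similar([wc(s_c), wc(s_n)]) and similar([wc(o) for o in options])
```

For the running example, both statements have seven words and the options range from six to nine words, so the instance passes the check.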
Use of Inspirational Materials. It is not easy for all crowd-sourcing workers to write instances from scratch. To address this issue, we also provide them with external reading materials to stimulate inspiration, such as the sentences of the Open Mind Common Sense (OMCS) project (Havasi et al., 2010). For example, "he was sent to a (restaurant)/(hospital) for treatment after a car crash" can be inspired by the two sentences "restaurants provide food" and "hospitals provide medical care".
Quality Control. To ensure the quality of the data, we manually check the instances and drop or request a rewrite of the low-quality ones. If a worker writes too many low-quality instances, we remove her or him from our annotator pool. With this process, we finally accept around 30% of the submitted instances.

3.2
Step 2: Collecting Data for Subtask C

Annotation Guidelines. To collect data for Subtask C, each worker is given a nonsensical statement s_n and a sensical statement s_c and asked to write a reason explaining why the nonsensical statement s_n does not make sense. They shall follow these rules: (1) Do not explain why the sensical statement s_c makes sense.
(2) Avoid mentioning the sensical statement s c .
(3) Write the reason, rather than simply adding the word "not" or "can't" to the nonsensical statement s_n to form an explanation. (4) Write the reason; do not use patterns like "XXX is not for YYY" to create an explanation. (5) Do not try to justify why the nonsensical statement s_n makes sense. (6) Write only one sentence and do not be overly formal. (7) Refrain from using "because" at the beginning of the sentence. (8) Do not try to correct the statement s_n; just give the reason. Quality Control. As in Step 1, after the annotators write the reasons, the first two authors of the paper perform the checking process again. We reject low-quality reasons (those that violate the rules significantly) and remove low-quality annotators (those who write more low-quality reasons than a threshold).

Data Summary and Analysis
For SemEval-2020, we created 11,997 instances (i.e., 11,997 8-sentence tuples). We further split the instances into three subsets with 10,000 (the training set), 997 (the development set), and 1,000 (the test set) instances, respectively. We randomly assign the labels of the correct options in Subtasks A and B to avoid unbalanced correct labels. We conduct three more data analysis experiments to evaluate data quality, covering sentence length, common words, and repetition.

Average Length. In Table 2, we present the average length of each type of sentence in the training/dev/test sets. The sentences in the development and test sets are shorter than those in the training set. This is because we check the development and test sets more carefully and more strictly, removing longer and more incomprehensible instances, which lowers the average lengths of the dev/test sets. The sensical statements and nonsensical statements have almost the same average lengths in the three sets (the differences are equal to or smaller than 1%), which is balanced. However, there is an obvious gap between the correct reasons and the confusing reasons in terms of average length (roughly 4% in the training set and 10% in the dev/test sets).

Common Word Analysis. The most common words are important for showing the differences between sentence types. We only present those words which have obviously different frequencies between sensical statements and nonsensical statements, or between correct/referential reasons and confusing reasons. So, we skip most uninformative words, including 'a', 'an', 'the', 'to', 'in', 'on', 'of', 'for', 'and', 'is', 'are' and 'be'. After removing those words, we can list the top-5 common words in each type of sentence in the training/dev+test sets.

Table 3: Top-5 common words and their frequencies in different types of sentences in the training and dev+test sets. 1.000‰ means the word appears once in every 1,000 words.
For sensical statements s_c and nonsensical statements s_n, there are no significant differences between the training, dev, and test sets. However, there is an obvious gap between the correct reasons o_c and the confusing reasons o_n in negative words such as "not", "no", and "cannot". In the training data, negative words are about 3 times more common in the correct option o_c than in the confusing options o_n. In the dev+test data, the gap is about 40%, which indicates that the dev+test data has a higher quality than the training data. However, as discussed in (Niven and Kao, 2019), spurious statistical cues can affect BERT's results. We conjecture that the negative words are also spurious but effective clues, which make Subtask B potentially easier.

Repetition. The dev+test set has 12 instances (0.6%) that repeat nonsensical statements from the training data and 36 instances (1.8%) that repeat correct reasons from the training data.
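The per-mille figures in this analysis can be reproduced with a short frequency count; a minimal sketch, in which the naive whitespace tokenization is our assumption:

```python
from collections import Counter

# Uninformative words skipped in the common-word analysis above.
SKIP = {"a", "an", "the", "to", "in", "on", "of", "for", "and", "is", "are", "be"}

def top_words_per_mille(sentences, top_n=5):
    """Return the top_n content words with their frequency per 1000 tokens."""
    tokens = [w.lower().strip(".,!?") for s in sentences for w in s.split()]
    total = len(tokens)  # per-mille rates are computed over all tokens
    counts = Counter(t for t in tokens if t and t not in SKIP)
    return [(w, round(c / total * 1000, 3)) for w, c in counts.most_common(top_n)]
```

On a toy corpus of two sentences, a word occurring 3 times among 9 tokens yields 333.333‰, matching the notation used in Table 3.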

Cautions of using the data
The following advice is given to all task participants and future users: (1) Feel free to use whatever additional data you deem appropriate for the tasks to train your model. (2) Do not use the input of Subtask B/C to help Subtask A, and do not use the options o of Subtask B to help Subtask C; otherwise the task becomes artificially easy. There are two reasons: a) the nonsensical statement s_n of Subtasks B and C is exactly the nonsensical statement s_n of Subtask A, so participants could use the input of Subtask B/C to directly obtain the answer to Subtask A, and the option answers o of Subtask B would also reduce the difficulty of Subtask A; b) the correct reason o_c of Subtask B is also one of the reference reasons in Subtask C.

Systems and Results
In this section, we show the evaluation results of all the submitted systems for the three subtasks. Since most systems share a similar model architecture for Subtasks A and B, we discuss the two subtasks together.

Subtask A and Subtask B
The formal evaluation results of Subtasks A and B are shown in Tables 4 and 5. There are in total 39 valid submissions for Subtask A and 27 valid submissions for Subtask B. Most top-performing submissions adopted pretrained language models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019c), XLNET (Yang et al., 2019) and ALBERT (Lan et al., 2019) as the encoder of the model, and then finetuned them on the training set of the task. See Figure 1 for the most commonly used model architectures for Subtasks A and B. Also, the top-performing systems take advantage of external knowledge graphs such as ConceptNet (Speer et al., 2017), or unstructured text containing commonsense knowledge. Below we introduce in detail several top-performing systems and their main features.

Figure 1: The most commonly used model architectures in the three subtasks. This figure is mostly based on Team Solomon's system. For Subtasks B and C, the connector can simply be "No, ", which helps constrain the model to learn a choice that explains the unreasonability of the statement. For Subtasks A and B, the pretrained models are finetuned on the task-specific data with the MLM objective, and then trained as a binary classification task to score each input. For Subtask C, the cross-entropy loss of next-token prediction is used to train the model, and beam search is used at inference.
• CN-HIT-IT.NLP  ranks top in Subtask A. They use a variant of K-BERT (Liu et al., 2019a) as the encoder to enhance language representations through knowledge graphs. K-BERT is a Transformer-based model, which enhances the language representations of the text by injecting relevant triples from a knowledge graph to form a knowledge-rich sentence tree, and then uses a mask-Transformer to make the triples visible only to the corresponding entity. They use ConceptNet as the commonsense repository to extract the triples for the statements.
• ECNU-SenseMaker (Zhao et al., 2020) also explores several prompt templates to construct the inputs to the model. UoR (Markchom et al., 2020) and Masked Reasoner (Lu, 2020) have similar model architectures, with RoBERTa as the encoder. In addition, UoR finetunes the pretrained language model on NLI and STS datasets, and UI finetunes on MNLI data. TR combines RoBERTa features with additional features from text-to-image generation using a Gradient Boosted Decision Tree, and gives better results in post-evaluation.
• BUT-FIT (Jon et al., 2020), LMVE , Lijunyi  use ALBERT as the encoder. BUT-FIT uses back-translation from Czech for data augmentation, and LMVE uses hint sentences, back-translation from French and intra-subtask transfer learning between Subtasks A and B to enhance their system.
It can be seen from the results that pretrained language models such as RoBERTa can achieve rather high performance; e.g., the team Solomon achieves 96.0% and 94.0% on Subtask A and Subtask B, respectively, without using further resources. This shows that large-scale pretrained language models do contain commonsense knowledge relevant to Subtasks A and B in this challenge. Additionally finetuning the pretrained language models on commonsense-related text such as OMCS, which we used as inspirational material, can push the results even higher, close to human performance. The best-performing teams on Subtask A and Subtask B both adopt K-BERT, which incorporates an external knowledge base (i.e., ConceptNet) to complement the pretrained language models with knowledge triples. This shows that knowledge-graph-enhanced approaches such as K-BERT can effectively incorporate external knowledge. However, the high numbers may also indicate data leakage to some extent, since in the data creation stage, both ConceptNet and OMCS are used as references for the annotators when writing the data instances.
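The dominant Subtask A recipe in Figure 1 reduces to scoring each statement and labeling the lower-scoring one as nonsensical. The sketch below shows only that decision logic; the toy word-frequency scorer is a stand-in for a finetuned encoder such as RoBERTa, and its frequencies are invented for illustration:

```python
# Stand-in "plausibility model": in the submitted systems this is a pretrained
# encoder with a classification head producing a scalar score per statement.
TOY_FREQ = {"john": 3, "put": 9, "a": 2, "an": 2, "the": 2,
            "turkey": 5, "elephant": 1, "fridge": 8, "into": 2}

def plausibility(statement):
    words = [w.strip(".,").lower() for w in statement.split()]
    return sum(TOY_FREQ.get(w, 1) for w in words) / len(words)

def pick_nonsensical(s1, s2):
    """Return 0 or 1: the index of the statement judged less plausible."""
    return 0 if plausibility(s1) < plausibility(s2) else 1
```

On the running example, the elephant statement receives the lower toy score, so index 1 is returned; a real system makes the same comparison, but with learned scores.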

Subtask C
The results for Subtask C are shown in Table 6. There are in total 17 valid submissions for Subtask C. There are generally two approaches: (1) sequence-to-sequence approach, where the source side is the non-sensical statement, and the reason is the target sequence.
(2) the language model generation approach, which uses large-scale pretrained auto-regressive language models such as GPT-2 (Radford et al., 2019) for reason generation, where the nonsensical sentence acts as the prompt. The language model generation approach, which is the most commonly used and achieves relatively good results, is illustrated in Figure 1. Below we describe in detail the systems and their main features.
• BUT-FIT (Jon et al., 2020) experiments with both the sequence-to-sequence approach and the language generation approach. For the sequence-to-sequence approach, they use BART (Lewis et al., 2019) with beam-search decoding, achieving the highest BLEU among all the teams. For the language generation approach, the nonsensical statement is used as a prompt. At the training stage, the statement and the explanation are concatenated, and a GPT-2 model is trained on these sequences with a next-token prediction objective. At test time, given the statement, the model generates the reason tokens until the end-of-sentence token is produced.
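The data formatting behind this GPT-2 recipe is simple to sketch; the "No, " connector follows Figure 1, while the exact separator handling is our assumption:

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-sequence token

def make_training_sequence(statement, reason):
    """Concatenate statement and explanation for next-token-prediction training."""
    return f"{statement} No, {reason} {EOS}"

def make_prompt(statement):
    """At test time the model is prompted with the statement plus the
    connector, and generates tokens until EOS is produced."""
    return f"{statement} No, "
```

Training then minimizes cross-entropy over these concatenated sequences, so the model learns to continue a statement with an explanation of why it does not make sense.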
• KaLM (Wan and Huang, 2020) uses the sequence-to-sequence architecture BART. To enhance the source side statement, they extract keywords from the statement and search for evidence from Wiktionary. 2 After that, they concatenate the evidence along with the original statement as the source sentence for the generation. This approach proves effective and makes their system second-best for human evaluations.
• ANA (Konar et al., 2020) achieves the highest human evaluation score with a multitask learning framework. Specifically, they use a decoder-only transformer based on GPT-2 as the backbone model, and train the model with two self-attention heads: one for language modeling and another for classification. They then use data from both Subtask B and Subtask C to calculate the language model loss and the classification loss. Furthermore, they use OMCS at the pretraining stage and use CoS-E (Rajani et al., 2019) and OpenBookQA (Mihaylov et al., 2018) at the task-specific training stage.
• Solomon (Srivastava et al., 2020), JUSTers (Fadel et al., 2020), SWAGex (Rim and Okazaki, 2020), UI (Doxolodeo and Mahendra, 2020) and CUHK use GPT or GPT-2 finetuned on the task training data. JBNU (Na and Lee, 2020) uses UniLM, which incorporates three LM tasks: unidirectional LM, bidirectional LM and sequence-to-sequence prediction LM, and uses only one of the reference correct reasons. UI does not use the training data and treats the generation as a Cloze task. SSN-NLP (S, 2020) uses a seq2seq NMT framework without a pretrained LM.
Large-scale pretrained language models such as BART and GPT-2 dominate the submissions. The two systems with the highest human evaluation scores, namely ANA and KaLM, use additional resources such as Wiktionary, OMCS, and other commonsense datasets. This again shows that additional knowledge from structured databases can help with the generation of the reasons. From Table 6 we can see that BLEU does not correlate well with human evaluation, especially for the top-performing systems. According to a further experiment by BUT-FIT, the naive baseline of copying the source sentence as the reason gives a BLEU of 17.23, which would rank No. 4 among all the submissions. This indicates that BLEU, which focuses on surface token overlap, has difficulty evaluating the generated text reliably. The top-performing system achieves a human evaluation score of 2.10, showing the power of pretrained language models, but considering the human performance of 2.58, we still have a long way to go to generate human-acceptable reasons.
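The copy-baseline result is easy to appreciate with a toy overlap measure. The sketch below computes clipped unigram precision, only one ingredient of BLEU and not the official metric, for a copied statement against two reference reasons:

```python
from collections import Counter

def unigram_precision(hypothesis, references):
    """Clipped unigram precision: each hypothesis word counts at most as
    often as it appears in the most generous reference."""
    hyp = hypothesis.lower().rstrip(".").split()
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.lower().rstrip(".").split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in Counter(hyp).items())
    return clipped / len(hyp)

refs = ["An elephant is much bigger than a fridge.",
        "A fridge is much smaller than an elephant."]
copied = "John put an elephant into the fridge."
```

Here the copied statement already matches three of its seven words ("an", "elephant", "fridge") against the references, illustrating how surface-overlap metrics reward copying without any actual explanation.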

Related Work
Commonsense reasoning in natural language has been studied in different forms of tasks and has recently attracted extensive attention. In the Winograd Schema Challenge (WSC) (Levesque et al., 2012; Morgenstern and Ortiz, 2015), a model needs to solve hard co-reference resolution problems based on commonsense knowledge. For example, "The trophy would not fit in the brown suitcase because it was too big. What was too big (trophy or suitcase)?" The Choice of Plausible Alternatives (COPA) (Roemmele et al., 2011) emphasizes events and consequences. Each question in COPA aims to find the suitable cause or result of the premise from two given alternatives. All premises and alternatives are simple sentences. For example, the premise can be "The man broke his toe. What was the CAUSE of this?" and the two candidate answers are "(1) He got a hole in his sock." and "(2) He dropped a hammer on his foot." Several subsequent datasets are inspired by COPA. The JHU Ordinal Common-sense Inference (JOCI) (Zhang et al., 2017) aims to label the plausibility, from 5 (very likely) to 1 (impossible), of a human response after a particular situation. Situations with Adversarial Generations (SWAG) (Zellers et al., 2018) requires a system to choose the most likely-to-happen alternative after a specific situation. Those datasets emphasize the pre-situations and/or the after-situations of certain situations, but not the reasons why they occur or are caused. Besides, our dataset is not limited to events or situations; it concerns a broader commonsense setting, which includes events, descriptions, assertions, etc. Some datasets are inspired by reading comprehension. The Story Cloze Test and ROCStories Corpora (Mostafazadeh et al., 2016; Sharma et al., 2018) aim to figure out the right ending from two candidate sentences after a four-sentence story. For a narrative text, MCScript (Ostermann et al., 2018a) gives various types of questions and pairs of answer candidates for each question.
Most questions require knowledge beyond the facts mentioned in the text. Compared to those reading comprehension tasks, our benchmark encourages people to use any external resources they want.
Some other datasets evolve from QA problems and care more about factual commonsense knowledge. SQUABU (Davis, 2016) provides a small hand-constructed test of commonsense and scientific questions. CommonsenseQA (Talmor et al., 2018) asks crowd workers to create questions from ConceptNet (Speer et al., 2017), a large graph of commonsense knowledge, where each question discriminates between three target answer candidates that all share the same relationship to a single source concept drawn from ConceptNet. OpenBookQA (Mihaylov et al., 2018) provides questions and answer candidates, as well as thousands of diverse facts about elementary-level science that are related to the questions. The AI2 Reasoning Challenge (ARC) gives thousands of questions with different knowledge types, as well as a relevant 14M-sentence corpus mixed with science facts and other narrative sentences. MuTual provides a dataset for multi-turn dialogue reasoning in the commonsense area (Cui et al., 2020). Those questions are not easy to answer without specialized domain knowledge, while our questions are based on daily common sense.
Some datasets focus on non-sentential event plausibility (Wang et al., 2018; Porada et al., 2019), such as "gorilla-ride-camel". In contrast, our dataset is based on statements that include events, descriptions, assertions, etc., not merely events, such as "China's territory is larger than Japan's". Some datasets concentrate on limited attributes or actions of world knowledge, such as physics (Forbes and Choi, 2017). Our dataset concerns general commonsense knowledge beyond just physical common sense; for instance, the sentence in our task "Tom's mom becomes (happy)/(upset) when Tom gets high grades in the exam" is about social and emotional common sense. For our first subtask, statements that conform to common sense can also be phrased as being plausible. Thus our first subtask is similar to plausibility tests, although plausibility has a broader scope while our focus is on common sense only.
More importantly, compared with our work, the above tasks do not directly estimate general common sense or ask the logical reasons behind the correct answers and questions. In recent years, some large-scale commonsense inference knowledge resources have been developed, which may be helpful in commonsense reasoning tasks. Atomic  presents a large-scale everyday commonsense knowledge graph, which has nine if-then relations with variables, including causes, effects, and so on. Event2Mind (Rashkin et al., 2018) proposes a new corpus and task, aiming to find out the mentioned/unmentioned people's intents and reactions under various daily circumstances. These datasets are not directly useful for our benchmark since they focus only on a small domain. ConceptNet is a seminal knowledge graph that has been upgraded over time (Liu and Singh, 2004;Havasi et al., 2007;Speer and Havasi, 2013;Speer et al., 2017). ConceptNet constructs triples using labeled edges as relations and various words and/or phrases as entities. It also has the sentences describing the corresponding triples. In contrast to these datasets, we investigate the evaluation of common sense, rather than building a resource.
Before organizing this shared task, a pilot study (Wang et al., 2019) was performed, showing that there is still a significant gap between human and machine performance when no training data is provided, even though the models had already been pretrained on over 100 million natural language sentences. In our task here, we also provide training data with human annotations.

Summary
This paper summarizes SemEval-2020 Task 4: Commonsense Validation and Explanation. In this task, we construct a dataset that consists of 11,997 instances and 83,986 sentences. The task attracted around 40 participating teams, out of which 31 teams submitted system papers. Pretrained models are shown to be very effective in Subtask A and Subtask B, but there is still much room to improve system performance in Subtask C. Contextualized models such as RoBERTa and BART play a central role in the success of the top-performing systems, demonstrating that such methods contain commonsense information to a good extent.
We attribute the high performance on Subtasks A and B to several main reasons: 1) Subtask A is a relatively easy question by definition: a model only needs to detect the less plausible of the two candidate sentences. 2) Pretrained models are trained on billion-word corpora such as Wikipedia, which helps them acquire commonsense knowledge (Zhou et al., 2019) and thus achieve considerably better performance. 3) As described in the annotation process, we use sentences from OMCS to inspire crowd-sourcing workers. The top-3 systems also use OMCS, which potentially helps them attain better performance. 4) For Subtask B, as discussed in our data analysis section, the data has some flaws in average length and common words, which reduces the difficulty. 5) Some instances have obvious patterns. For example, there are tens of instances containing "put XXX into YYY" and "XXX is bigger than YYY", making the problems simpler. 6) Hundreds of crowd-sourcing workers write the instances, and workers are likely to draw on the same shared commonsense knowledge, such as "XXX is bigger/shorter/quicker/slower than YYY".
We consider future work in four directions: 1) We observe that there is still a gap between machine and human performance on Subtask C, and the reason generation task needs further investigation. 2) The artifacts or spurious correlations in the dataset can be further removed, e.g., by making the lengths of the different candidate options in Subtask B the same, removing instances that rely on widely shared commonsense templates, removing artifacts in common words, and filtering out common patterns. 3) Subtask A can be turned into a more difficult form: instead of comparing which of two statements makes more sense, it can be formed as a classification task, validating whether a single statement makes sense or not. 4) We notice that the BLEU score does not align closely with human evaluation for high-performing systems, and it is desirable to develop an automatic metric that compares the semantic correlation between two reasons.