CUHK at SemEval-2020 Task 4: CommonSense Explanation, Reasoning and Prediction with Multi-task Learning

This paper describes our system submitted to Task 4 of SemEval-2020: Commonsense Validation and Explanation (ComVE), which consists of three subtasks. The task is to directly validate whether a given sentence makes sense and to require the model to explain why. Based on the BERT architecture with a multi-task setting, we propose an effective and interpretable "Explain, Reason and Predict" (ERP) system to solve the three commonsense subtasks: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason or understanding of the sentences and then chooses which statement makes sense, which is achieved by multi-task learning. During the post-evaluation, our system reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and a BLEU score of 12.9 in subtask C (rank 8).


Introduction
Introducing common sense to natural language understanding systems is attracting more and more attention. Common sense, as ordinarily conceived, presents itself as the aspect of the grammar of expressions and sentences on which their semantic properties and relations depend (Asher and Vieu, 1995). One important difference between human and machine text understanding is that humans can access commonsense knowledge while processing text, which helps them draw inferences about facts that are not mentioned in the text. Thus, a fundamental question is how to validate whether a system has commonsense capability and, more importantly, how to let the system explain how it makes inferences using hidden facts. Existing benchmarks measure commonsense knowledge indirectly and without explanation: existing datasets test common sense through tasks that require extra knowledge, such as co-reference resolution or reading comprehension. They verify whether a system is equipped with common sense by testing whether it can give a correct answer when the input does not contain such knowledge. However, such benchmarks have some limitations. First, they do not give a direct quantitative standard for measuring sense-making capability. Second, they do not explicitly identify the key factors required in a sense-making process. Third, they do not require the model to explain why it makes a given prediction.
Commonsense reasoning tasks are intended to require the model to go beyond pattern recognition. Instead, the model should use common sense or world knowledge to make inferences. Some empirical analysis has previously been done for commonsense reasoning, mainly in the form of question answering (QA) (Talmor et al., 2019). However, question answering makes it hard to directly evaluate the common sense in contextualized representations, and there has been little work investigating common sense in pre-trained language models (Zhou et al., 2019) such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). Introduced by Wang et al. (2019a), sense-making is a task that tests whether a model can differentiate sense-making statements from non-sense-making ones. Specifically, the statements typically differ in only one keyword, which may be a noun, verb, adjective, or adverb. There are two existing approaches to this problem. One simple approach relies on the observation that more commonsense knowledge can be learned from larger training sets (Wang et al., 2019b). On the other hand, some works (Lin et al., 2019) focus on effectively utilizing external, structured commonsense knowledge graphs, such as ConceptNet (Speer et al., 2016) and COMET (Bosselut et al., 2019). Inspired by these works, more researchers are trying to fuse commonsense knowledge with language models (Forbes and Choi, 2017) and apply them to downstream tasks (Zhong et al., 2019). Recently, a new hybrid approach has been proposed for commonsense reasoning (He et al., 2019). The core idea behind it is multi-task learning (Liu et al., 2019b), which has been widely applied in natural language tasks (Liu et al., 2019a).
However, progress in this field has on the whole been frustratingly slow, and much of the work is purely theoretical. There may be no single perfect set of benchmark problems, but as yet there is essentially none at all, nor anything like an agreed-upon evaluation metric; benchmarks and evaluation metrics would serve to move the field forward. The field might well benefit if commonsense reasoning were systematically described and evaluated. To this end, our system focuses on a benchmark that directly tests whether a system can differentiate natural language statements that make sense from those that do not. Our results indicate that pre-trained models alone do not perform well on the benchmark, and the remaining error cases show that human-level performance has not yet been achieved. Thus, we design a new procedure, inspired by human cognition, to handle the commonsense challenge. It first explains its understanding of the given sentences with a language model and induces the hidden commonsense fact. The explanation is then used as a supplementary input to the prediction module. We believe that our approach can also be applied to more challenging datasets.
The organization of this paper is as follows. In Section 2, we introduce the task definition and basic information about pre-trained language models. We then describe the framework of our model in Section 3. Empirical results are given and discussed in Section 4. In Section 5, we provide a more detailed analysis of some bad cases that appeared in our experiments. Finally, we conclude the paper in Section 6.

Task Definition
Formally, each instance in the dataset is composed of 8 sentences: two statements s_1, s_2, three options o_1, o_2, o_3, and three references r_1, r_2, r_3. s_1 and s_2 are two similar statements that share the same syntactic structure and differ by only a few words, but only one of them makes sense while the other does not. They are used in subtask A, called Validation, which requires the model to identify which one makes sense. For the against-common-sense statement (s_1 or s_2), we have three optional sentences o_1, o_2 and o_3 that try to explain why the statement does not make sense. Subtask B, named Explanation (Multi-Choice), requires the only correct reason to be identified among distractors. For the same against-common-sense statement, subtask C, named Explanation (Generation), asks the participants to generate the reason why it does not make sense. The three reference reasons r_1, r_2 and r_3 are used for evaluating subtask C.
Subtask A: Unlike other classification problems, subtask A gives us two statements s_1, s_2 with similar wording. Their dependency trees or semantic structures are extremely similar, which requires us to build a model that can recognize these subtle differences and reason about whether each sentence makes sense.
Subtask B: Subtask B gives us one false sentence s_f (either s_1 or s_2), that is, a sentence that does not make sense, and three options o_1, o_2, o_3. We need to choose the one correct option that explains why the given sentence does not make sense.
Subtask C: Subtask C provides one false sentence s_f, the same as in subtask B, and three references r_1, r_2, r_3. All three references explain why the false sentence does not make sense. This task requires us to build a model that automatically generates the correct reason given one false sentence.
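To make the instance layout concrete, the following is a minimal sketch of one dataset instance as a Python data structure. The class and field names (`ComVEInstance`, `sensible_index`, `correct_option`) are our own illustrative choices and not part of the official data format; the statements and options come from the umbrella example discussed later in the paper, while the three references are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ComVEInstance:
    """One instance: 2 statements, 3 options, 3 references (8 sentences)."""

    statements: Tuple[str, str]       # (s_1, s_2)
    sensible_index: int               # 0 or 1: which statement makes sense (subtask A)
    options: Tuple[str, str, str]     # (o_1, o_2, o_3), candidates for subtask B
    correct_option: int               # 0, 1 or 2: the only correct reason (subtask B)
    references: Tuple[str, str, str]  # (r_1, r_2, r_3), used to score subtask C

    @property
    def false_statement(self) -> str:
        """The against-common-sense statement s_f shared by subtasks B and C."""
        return self.statements[1 - self.sensible_index]

    def sentence_count(self) -> int:
        return len(self.statements) + len(self.options) + len(self.references)


example = ComVEInstance(
    statements=(
        "a thicker cloth can help you keep warm in snowy days",
        "an umbrella can help you keep warm in snowy days",
    ),
    sensible_index=0,
    options=(
        "we don't wear umbrellas",
        "umbrellas can keep you dry in snowy days",
        "going outside is very crazy in snowy days",
    ),
    correct_option=0,
    # Hypothetical placeholder references, for illustration only.
    references=("<reference 1>", "<reference 2>", "<reference 3>"),
)

print(example.false_statement)
print(example.sentence_count())
```

Subtask A uses only `statements`, subtask B uses `false_statement` together with `options`, and subtask C uses `false_statement` as input and `references` for scoring.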

Pretrained Language Model
BERT is a state-of-the-art bidirectional pre-trained language model that has recently shown excellent performance in a wide range of NLP tasks (Devlin et al., 2019). It is a Transformer encoder built from multi-head self-attention and fully connected layers. The input representation of BERT is constructed by summing the corresponding token, segment, and position embeddings. As an autoencoding (AE) model, it can see the context in both the forward and backward directions. BERT is pre-trained with two unsupervised tasks: 1) Masked LM and 2) Next Sentence Prediction (NSP). By optimizing for both tasks, BERT learns not only semantic and syntactic knowledge but also world knowledge (Rogers et al., 2020). This helps explain BERT's strong performance.
RoBERTa is a replication study of BERT which showed that carefully tuning hyper-parameters and increasing the training data size lead to significantly improved results on language understanding. More specifically, Liu et al. (2019c) proposed four modifications to improve BERT: 1) training the model longer, with bigger batches, over more data; 2) removing the next sentence prediction objective; 3) training on longer sequences; and 4) dynamically changing the masking pattern applied to the training data. As in other NLP tasks, RoBERTa achieves higher accuracy than BERT here.

Models
Our proposed ERP system first generates an explanation of its understanding of the given sentences with a language model, and then the explanation is used as a supplementary input to the prediction module. For subtask A, the input is the sentence pair s_1 and s_2; for subtask B, the input is the against-common-sense statement s_f. Subtask C is an explanation generation task, which allows us to explore commonsense reasoning in two settings, 1) explain-and-then-predict and 2) predict-and-then-explain, to evaluate the effectiveness of our ERP system. We describe the ERP system for the different subtasks in turn in Sections 3.1 and 3.2.
The architecture of our model is shown in Figure 1. The input x represents a sequence (either one sentence or several stacked sentences), and each token in the sequence is represented by summing the corresponding token, segment, and position embeddings. The semantic encoder then maps each input token to a vector at the word (token) level, and the Transformer encoder captures contextual information at the sentence level via the self-attention mechanism. After obtaining the contextual embedding vectors, we apply a task-specific layer for each downstream task; here we use a text classification layer for both subtask A and subtask B. We introduce subtasks A and B first, followed by subtask C, for ease of understanding. In the competition, however, the organizers released the datasets of subtask A, subtask C, and subtask B in that order, which supports our ERP system.

Sub-task A and Sub-task B
We cast both subtask A and subtask B as text classification problems. First, the training set of subtask A is Ω = {(s_1^1, s_1^2, y_1), (s_2^1, s_2^2, y_2), ..., (s_N^1, s_N^2, y_N)}, in which s_i^1 and s_i^2 are the two similar sentences of the i-th sample and y_i is its label. To fine-tune our model, we format the input sequence as x = "[CLS] + s^1 + [SEP] + s^2", where the [CLS] token is used for the final classification and the [SEP] token separates the sentences.
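As an illustration of this input format, the sketch below assembles the subtask A sequence as a plain string. The function name `build_subtask_a_input` is hypothetical, and real BERT-style tokenizers usually insert the special tokens themselves (and typically append a final [SEP]), so this only mirrors the template given in the text rather than any particular library's behavior.

```python
CLS = "[CLS]"
SEP = "[SEP]"


def build_subtask_a_input(s1: str, s2: str) -> str:
    """Arrange a statement pair following the template "[CLS] + s1 + [SEP] + s2".

    The representation at the [CLS] position is what the classification
    layer would read to decide which statement makes sense.
    """
    return f"{CLS} {s1} {SEP} {s2}"


x = build_subtask_a_input("Hair is already dead", "Hair screams when you cut it")
print(x)
# [CLS] Hair is already dead [SEP] Hair screams when you cut it
```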
Second, for subtask B, the organizers had already released the correct explanations for each sentence in the training and validation data. We use these data to train a generator that produces explanations for the test data of subtask B; Section 3.2 gives more details. The generated explanations can then be used to improve the performance of our model. As shown in Figure 1, the input sequence consists of one false sentence, three options, and some explanations (either ground-truth or generated, depending on the phase). The training and validation samples can be written as Ω = {(s_1, a_1, b_1, c_1, e_1^1, e_1^2, e_1^3), (s_2, a_2, b_2, c_2, e_2^1, e_2^2, e_2^3), ...}, and the test samples as Ω = {(s_1, a_1, b_1, c_1, e_1^g), (s_2, a_2, b_2, c_2, e_2^g), ..., (s_N, a_N, b_N, c_N, e_N^g)}, where e^g denotes a generated explanation. We use the same structure as in subtask A to arrange the input, but add special tokens to distinguish the different roles of the sentences: [OPTION] marks the three options and [EXP] marks the explanations. The objective of both subtask A and subtask B is to maximize:
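Assuming the standard log-likelihood objective for classification (this form is our assumption; the symbol θ for the model parameters and the sum over the N training samples follow the notation of the training sets above), the quantity being maximized can be written as:

```latex
\max_{\theta} \; \sum_{i=1}^{N} \log P\left(y_i \mid x_i; \theta\right)
```

Here x_i is the formatted input sequence of the i-th sample and y_i is its gold label (the sensible statement in subtask A, the correct option in subtask B).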

Sub-task C
Here, we employ Commonsense Auto-Generated Explanations (Wang et al., 2019a), which are generated by a language model. Subtask C provides one incorrect sentence and three reference explanations, all of which explain why the incorrect sentence does not make sense. Our LM is the large, pre-trained OpenAI GPT (Radford et al., 2018), a multi-layer Transformer (Vaswani et al., 2017) decoder, which we fine-tune on the task datasets. The input context during fine-tuning can be described as follows: where the special token CUZ means "is wrong may because". The input context during testing is defined as follows: The model is trained to generate the explanation e with a conditional language modeling objective, which is to maximize: where k is the size of the context window (in our case k is always greater than the length of e, so that the entire explanation is within the context). The conditional probability P is modeled by a neural network with parameters Θ, conditioned on C_ans and the previous explanation tokens.
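Based on the description above (context window k, parameters Θ, conditioning on C_ans and the previous explanation tokens), the conditional language modeling objective plausibly takes the following form. This is a reconstruction under those stated definitions, with e_i denoting the i-th token of the explanation e, and should be read as a sketch rather than the exact formula used.

```latex
\max_{\Theta} \; \sum_{i} \log P\left(e_i \mid e_{i-k}, \ldots, e_{i-1}, C_{ans}; \Theta\right)
```

Because k always exceeds the length of e, every explanation token is predicted with the full preceding explanation and the context C_ans in view.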

Experiment
It is important to make clear that all our experiments were conducted in accordance with the requirements of the competition. We could not use any dataset that had not yet been released at the corresponding stage of the formal competition; for example, we could not use subtask B data for subtask A or subtask C, because subtask B was released last.

Baseline of Sub-task A and Subtask B
As described before, the task consists of three subtasks. Subtask A is to choose, from two natural language statements with similar wording, which one makes sense and which one does not. Subtask B is to find the key reason why a given statement does not make sense. Subtask C asks the machine to generate reasons. Subtasks A and B are evaluated by accuracy, and subtask C is evaluated using BLEU. To improve the reliability of the evaluation of subtask C, a random subset of the test set is used for a human evaluation that further assesses the systems with a relatively high BLEU score (this was not conducted in the post-evaluation period).
First, we use BERT and RoBERTa as our baselines, since both show impressive performance in many downstream NLP tasks. Table 1 compares BERT with RoBERTa using different corpora for each task. For subtask A, the RoBERTa model reaches the highest accuracy, 86.2% on the test set and 88.5% on the dev set. For subtask B, adding data from subtask A raises performance to a peak of 82.3% accuracy. It is notable, however, that when we additionally use subtask C data, the dev accuracy is extremely high while the test accuracy is clearly lower. We assume that 1) the data from subtask C has great potential for solving subtask B, and 2) the model relies too heavily on subtask C data, resulting in very low performance without it. After we use generated explanations at test time, the model improves considerably, which supports our assumption.

Explain and Predict
To better understand this anomaly, Table 2 presents results for different sampling ratios, where we randomly choose whether or not to use the subtask C data. Specifically, under a 7:3 sampling ratio, each time we draw a sample from subtask B we inject additional explanations with 30% probability and leave the sample unchanged with 70% probability. If we decide to inject additional knowledge, we sample one, two, or three explanations with equal probability. We observe that the dev accuracy becomes slightly lower, but the test accuracy improves. This approach forces the model to learn more from limited external knowledge, and the result further supports our earlier assumption. It motivates our "Explain, Reason and Predict" (ERP) system and provides an interpretable foundation. What remains is to improve our baseline, for which we try different approaches such as knowledge injection and multi-task learning.
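The sampling scheme described above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `sample_explanations` and its parameters are hypothetical, and we assume the injected explanations are drawn without replacement from the up-to-three explanations available for a sample.

```python
import random
from typing import List, Optional, Sequence


def sample_explanations(
    explanations: Sequence[str],
    inject_prob: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Decide which explanations (if any) to attach to one subtask B sample.

    With probability 1 - inject_prob (70% under the 7:3 ratio) nothing is
    injected. Otherwise one, two, or three explanations are chosen, each
    count being equally likely.
    """
    rng = rng or random.Random()
    if rng.random() >= inject_prob:
        return []  # no additional knowledge for this sample
    max_count = min(3, len(explanations))
    if max_count == 0:
        return []
    count = rng.randint(1, max_count)  # 1, 2 or 3 with equal probability
    return rng.sample(list(explanations), count)


rng = random.Random(0)
pool = ["explanation 1", "explanation 2", "explanation 3"]
injected = sum(bool(sample_explanations(pool, rng=rng)) for _ in range(10000))
print(f"fraction of samples with injected explanations: {injected / 10000:.2f}")
```

Changing `inject_prob` reproduces the other sampling ratios; at test time the generated explanation would be supplied directly instead of being sampled.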

Multi-task
During the experiments, we found an interesting case illustrating that multi-task learning may help considerably in subtasks A and B. Given the false sentence s: "an umbrella can help you keep warm in snowy days" and three options, A: "we don't wear umbrellas", B: "umbrellas can keep you dry in snowy days", C: "going outside is very crazy in snowy days", the ground truth is A, but the model outputs B, which seems to explain the sentence better. Only after checking the corresponding true sentence in the subtask A dataset, "a thicker cloth can help you keep warm in snowy days", do we see why the ground truth is A. Clearly, some knowledge from subtask A is needed to solve subtask B, so we consider multi-task learning a direction worth trying (Liu et al., 2019b).
Rather than enriching the semantic embeddings with a knowledge graph, we leverage existing datasets from different domains that also require commonsense reasoning, such as ARC and CommonsenseQA. We believe multi-task learning can produce more robust and universal embeddings, which in turn helps our model perform better and improve on our baseline. In the following experiments, we examine whether including additional datasets as external input information boosts the performance of our ERP system.
Table 3: Multi-task results. 93.5%* denotes the Explain and Predict setting described above.
Table 3 shows the results obtained by our final multi-task ERP model. We report our two best models, which are ensembles of models trained with different dropout rates; see the following section for details. On subtask A, our ensemble model reaches 92.9% accuracy on the test set and 95.1% on the dev set, while on subtask B it reaches 89.7% accuracy on the test set and 93.5% on the dev set. The highest accuracy on the dev set of subtask B indicates the great potential of our ERP system. Training jointly on additional datasets provides a marginal improvement over single-task training, which we attribute to better model generalization under the multi-task setting.

Implementation details
Our implementation of MT-DNN is based on Liu et al. (2019b). We used Adamax as our optimizer with a learning rate of 5e-5 and a batch size of 4. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used unless stated otherwise. We set the dropout rate of all the task-specific layers to 0.1, except for the ensemble models, for which we used different dropout rates to obtain different models. Following Liu et al. (2019a), we chose the dropout rate from {0.1, 0.2, 0.3}. To avoid the exploding gradient problem, we clipped the gradient norm to within 1. All texts were tokenized into wordpieces and truncated to spans of no more than 512 tokens. We set the mixture ratio to 0.4 to re-weight the different tasks (Xu et al., 2018).
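For reference, the hyper-parameters above can be collected in a small configuration, together with one common reading of "linear decay with warm-up over 0.1". The dictionary keys and the function `linear_warmup_decay` are our own names, and the exact schedule shape (a linear ramp over the first 10% of steps, then linear decay to zero) is an assumption rather than a description of the MT-DNN code.

```python
import math

# Hyper-parameters as stated in the text; the key names are illustrative.
CONFIG = {
    "optimizer": "Adamax",
    "learning_rate": 5e-5,
    "batch_size": 4,
    "max_epochs": 5,
    "warmup_proportion": 0.1,
    "task_layer_dropout": 0.1,
    "ensemble_dropouts": (0.1, 0.2, 0.3),
    "max_grad_norm": 1.0,
    "max_seq_length": 512,
    "mixture_ratio": 0.4,
}


def linear_warmup_decay(step: int, total_steps: int,
                        base_lr: float = CONFIG["learning_rate"],
                        warmup: float = CONFIG["warmup_proportion"]) -> float:
    """Learning rate at a given step under the assumed schedule."""
    progress = step / total_steps
    if progress < warmup:
        # Linear ramp from 0 up to base_lr during warm-up.
        return base_lr * progress / warmup
    # Linear decay from base_lr down to 0 at the final step.
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000))
```

With 1000 total steps, the rate peaks at 5e-5 at step 100 and falls back to zero at step 1000.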

Subtask C
Since this is a text generation problem, we use the GPT model as our baseline. Because some of the samples rely on knowledge from subtask A, we conducted contrast experiments using data from subtask A and from CoS-E (Rajani et al., 2019). We observed that adding explanations led to a very small decrease in test performance compared to the baseline, whereas adding data from subtask A improved the test BLEU score by about 0.3.

Model   Corpus      Test    Dev
GPT     Task C      12.65   5.96
        + Task A    12.94   5.99
        + Aug       12.31   6.54

Compared with the original paper (Wang et al., 2019a), our model achieves much higher accuracy in both subtask A and subtask B. Our system ranked 10th on 29 April 2019, with 92.9% accuracy on subtask A (rank 11), 89.7% on subtask B (rank 9), and a BLEU score of 12.9 on subtask C (rank 8).

Analysis
Despite the strong performance of our model, it still fails on some samples in subtask A and subtask B, and a few sentences generated by our model do not explain well why the given sentence does not make sense. An in-depth analysis shows that these samples can be clustered into several classes.

Error Analysis at Subtask A
• Basic commonsense knowledge, which can be addressed by introducing an external knowledge graph such as ConceptNet (e.g., s_1: The moon sets at night, s_2: The sun sets at night, label: 1, prediction: 2).
• Implicit commonsense knowledge. Current knowledge graphs do not contain all commonsense knowledge because of memory limitations and the sheer volume of common sense, so better solutions are still needed, for example more comprehensive knowledge representations, transfer learning, or other methods (e.g., s_1: Cats have got seven lives, s_2: Cats have got one life, label: 1, prediction: 2).
• Specific domain knowledge required to make a correct judgment (e.g., s_1: Hair is already dead, s_2: Hair screams when you cut it, label: 2, prediction: 1). Since even humans may not know that hair consists of dead protein cells, it is a big challenge for the model to learn such rare domain knowledge from large corpora and datasets.

Error Analysis at Subtask B
For subtask B, there are two different ways to address the problem: the conventional method and the explain-and-predict (ERP) method. This leads to three different cases: 1) both methods are wrong; 2) the ERP system is correct but the conventional method is wrong; 3) the conventional method is correct but the ERP system is wrong. According to the analysis below, we find that our ERP system can reason and make more persuasive decisions than the conventional one.
1. Both methods output the wrong judgment.
(a) Explanations from different perspectives or at different levels (e.g., s_f: Everyone loves reading horror novels. o_1: Horror novels are scary. o_2: Reading novels can be a good way to relax. o_3: Not everyone likes to read horror novels. label: C, prediction: A). There can be multiple explanations, at different levels, for why the given s_f does not make sense. Here, to explain why "Everyone loves reading horror novels" does not make sense, both o_1 and o_3 are correct from our point of view if we assume the given s_f is already known to be false, since either can complete "s_f is wrong because o_1 or o_3". We think o_1 gives an explanation at a more subtle and deeper level than o_3.
(b) Implicit commonsense knowledge, as in subtask A (e.g., s_f: drama plays are often performed before cows, o_1: this rural drama tells the story of a cow, o_2: the cow is a kind of animal while drama isn't, o_3: a cow is unable to appreciate and understand the drama, label: C, prediction: B). In this example, we need to know that drama plays are appreciated and understood by people.
2. Examples classified wrongly by the conventional method but correctly by our ERP system.
(a) Lack of the reasoning capability that our ERP system is equipped with (e.g., s_f: shoes can fly, o_1: There are many creatures that can fly, o_2: Shoes do not have wings, o_3: People cannot fly, label: B, prediction: C). The conventional method cannot reason that wings are needed to fly here.
(b) Basic commonsense knowledge. The model still needs external knowledge to support the right classification.
3. About 1.10% of the samples are classified correctly by the conventional method but not by our model. We think this is mostly attributable to noise introduced by the multi-task setting.
(a) Capturing plausible knowledge (e.g., s_f: the lava was warm and soft, o_1: lava can destroy the warm and soft cake, o_2: lava is too hard to be soft, o_3: lava is too hot to be warm or soft, label: C, prediction: B). The model captured that something can be too hard to be soft, but it ignored the attributes of lava.
(b) Others (e.g., s_f: it is said that Santa comes on Thanksgiving Days, o_1: Santa comes on Christmas day, not Thanksgiving Day, o_2: Santa is a figure in legend, not reality, o_3: Santa is a figure in western culture, not eastern culture, label: A, prediction: B). We first thought this was caused by explanations from different perspectives or levels, as described above, but after checking the whole dataset we found an example (s_f: Santa Claus sent Jim a Christmas present, o_1: There aren't Santa Claus in the world, o_2: Santa Claus is very busy, o_3: Santa Claus is old, label: A) in the training data of subtask B. This indicates that our model can learn more robust and universal embeddings than the conventional method.

Error Analysis at Subtask C
Although most of the results make sense, there are still some generated reasons that do not explain well why the given sentence does not make sense. Most such cases, as we found, involve:
• Wrong explanation direction (e.g., s_f: The inverter was able to power the continent, e_g: inverter is not a living thing).
• Repetition (e.g., s_f: sugar is used to make coffee sour, e_g: sugar is used to make coffee). As in this example, some outputs simply repeat words from the input.