LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation Using Pretraining Language Model

This paper introduces our system for commonsense validation and explanation. For the Sen-Making task, we use a novel pretraining language model based architecture to pick out which of two given statements is against common sense. For the Explanation task, we use a hint sentence mechanism to greatly improve performance. In addition, we propose subtask level transfer learning to share information between subtasks.


Introduction
Common sense verification and explanation is an important and challenging task in artificial intelligence and natural language processing. It is a simple task for human beings, because they can make full use of the external knowledge accumulated in their daily lives. For machines, however, common sense verification and reasoning is difficult. According to Wang (2019), even state-of-the-art language models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) perform very poorly, so it is crucial to integrate commonsense awareness into natural language understanding models (Davis, 2017).
SemEval-2020 Task 4 (Wang et al., 2020) aims to improve models' ability of common sense judgment, and we participated in two subtasks of this task. The dataset of SemEval-2020 Task 4 is named ComVE. Each instance in ComVE is composed of five sentences s1, s2, o1, o2, o3: s1 and s2 are used for subtask a, and s1 or s2 together with o1, o2, o3 are used for subtask b.
Subtask a (also known as the Sen-Making task) aims to test a model's ability of commonsense validation. Specifically, given two statements s1 and s2 whose lexical and syntactic features are similar, the Sen-Making model must determine which statement makes sense (compared with the other one). For example, if s1 is put the elephant in the refrigerator and s2 is put the turkey in the refrigerator, a good model needs to judge that the latter makes more sense.
Subtask b (also known as the Explanation task) is a multiple choice task that aims to find the key reason why a given statement does not make sense. For example, given a statement s that violates common sense and three options o1, o2, o3, where s is he put an elephant into the fridge, o1 is an elephant is much bigger than a fridge, o2 is elephants are usually gray while fridges are usually white, and o3 is an elephant cannot eat a fridge, the model needs to judge that o1 is the correct option.
The official baseline for Sen-Making uses a pretraining language model (PLM) to dynamically encode the two input sentences, uses a simple fully connected neural network to calculate their perplexities respectively, and chooses the one with the lower score as the correct one. We believe that this baseline treats the two sentences independently and ignores the inner relationship between them, so we propose a novel model structure that fully considers the interaction between the statements.
The official baseline for Explanation treats the task as a BERT-like multiple choice task (Devlin et al., 2019). We think the baseline model does not make full use of the input data, so we design a structure that incorporates the other statement, the one that makes sense, into the existing model.
In addition, we believe that fine-tuning on a similar subtask can improve performance on the current subtask, because there are many commonalities between the two subtasks; we therefore propose a novel transfer learning mechanism between Sen-Making and Explanation.
Our proposed system, named LMVE, is a neural network model (with two sub-modules that solve subtasks a and b) based on a large scale pretraining language model. Our contributions are as follows:
• First, we propose subtask level transfer learning, which helps share information between subtasks.
• Second, we propose a novel structure for calculating the perplexity of a sentence, which takes into account the interaction between the sentences in a pair.
• Third, we propose the hint sentence mechanism, which helps improve the performance of the multiple choice task (subtask b).

System Description
We consider our model for both Sen-Making and Explanation as two parts: an encoder and a decoder. The encoder is mainly used to get contextual representations of the input sentence tokens. In recent years, pretraining language models including BERT (Devlin et al., 2019), RoBERTa and ALBERT (Lan et al., 2020) have been proven beneficial for many natural language processing (NLP) tasks (Rajpurkar et al., 2016; Bowman et al., 2015). These pretrained models have learned general-purpose language representations on large amounts of unlabeled data; adapting them to downstream tasks therefore brings a good parameter initialization and avoids training from scratch (Xu et al., 2020). So we tried several popular PLMs as encoders. The decoder consists of several simple linear layers whose parameter count is far smaller than the encoder's; its role is to fuse the output of the encoder and predict the answer.

LMVE for Sen-Making Task
Figure 1(a) shows the official baseline (Wang et al., 2019), which regards the two sentences as independent individuals. But in ComVE there are certain similarities (lexical and grammatical) between the two statements, so we think the interaction between the two sentences helps improve model performance. Figure 1(b) gives an overview of our model for the Sen-Making task, which is mainly composed of three modules: token encoding, feature fusion, and answer prediction.

Encoding: Let {x^1_0, ..., x^1_p} and {x^2_0, ..., x^2_q} represent the one-hot vectors of the first and second sentences in an instance. We first concatenate them and add some special tokens as in Figure 1(b), obtaining two sequences {x_[CLS], x^1_0, ..., x^1_p, x_[SEP], x^2_0, ..., x^2_q, x_[SEP]} and {x_[CLS], x^2_0, ..., x^2_q, x_[SEP], x^1_0, ..., x^1_p, x_[SEP]}. The two sequences are fed into ALBERT respectively. We use U_i ∈ R^(d×n) and V_i ∈ R^(d×n) to denote the outputs of the i-th transformer block of ALBERT for the two sequences, where d is the hidden size of the model, n is the sequence length, and i ∈ {0, ..., L−1}.

Fusion: Pretraining language models (BERT et al.) usually take the first token (corresponding to [CLS]) of the last transformer block's output as the representation of a sequence; instead, we use a weighted sum of the first-token representations from all transformer block outputs as the final representation. The following equations describe the process of fusion:

α_i = softmax_i(ω^T U_i[0]),  x^1 = Σ_{i=0}^{L−1} α_i U_i[0]

(and analogously x^2 from the V_i), where ω ∈ R^d is a trainable parameter and i ∈ [0, L−1]. We can regard x^k as the representation of the k-th statement (k ∈ {1, 2}).

Answer Prediction: This module maps the output of the fusion layer to a probability distribution over answers. Given learnable parameters w ∈ R^d and b ∈ R, it calculates the answer probability as

p = softmax(w^T x^1 + b, w^T x^2 + b)

We define the training loss as the cross-entropy loss

L = −(1/N) Σ_{i=1}^{N} [y_i log p_i + (1 − y_i) log(1 − p_i)]

where N is the number of samples in the dataset and y_i ∈ {0, 1}.
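To make the fusion and scoring steps concrete, here is a minimal plain-Python sketch of the weighted-sum fusion over per-layer [CLS] vectors and the linear scoring head. This is our own illustration, not the authors' code; the exact attention form (dot product of a trainable vector ω with each layer's [CLS] representation) is an assumption based on the description above.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def fuse(cls_vectors, omega):
    """Weighted sum of per-layer [CLS] vectors.

    cls_vectors: list of L vectors, one per transformer block output
    omega: trainable vector of hidden size d (assumed scoring vector)
    """
    # one scalar score per layer: omega^T U_i[0]
    scores = [sum(u_j * w_j for u_j, w_j in zip(u, omega)) for u in cls_vectors]
    alphas = softmax(scores)
    d = len(cls_vectors[0])
    # weighted sum across layers, dimension by dimension
    return [sum(a * u[j] for a, u in zip(alphas, cls_vectors)) for j in range(d)]

def score(x, w, b):
    # linear scoring head: w^T x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# toy example: 3 transformer blocks, hidden size 2
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
omega = [0.5, 0.5]
x = fuse(layers, omega)
s = score(x, [1.0, -1.0], 0.0)
```

In the real model the two statements' scores would then go through a softmax to form the answer distribution; here the toy vectors merely show the shape of the computation.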

Hint Sentence mechanism
Before formally introducing our model for the Explanation task, let us first introduce the hint sentence mechanism.
In the official baseline (Wang et al., 2019), the against-common-sense statement is concatenated with the three options respectively and fed into the model. We believe this form of input does not make full use of the data in ComVE. Specifically, for an instance s1, s2, o1, o2, o3 in ComVE, suppose s1 does not conform to common sense; then s1, o1, o2, o3 are used to train the baseline model or to predict the answer, and in this process s2 is abandoned. We believe that the other statement in the pair (s2), the one that makes sense, contains useful information and should be incorporated into our model.
So we propose the hint sentence mechanism. A hint sentence is a statement that makes sense, whose lexical and syntactic features are similar to those of the given against-common-sense statement, differing from it by only a few words. In other words, we call the other sentence in the sentence pair the hint sentence.
How the hint sentence is integrated into the existing model is described in the next section and Figure 2. The results of the ablation experiment (Sec 3.6) show that the hint sentence mechanism greatly improves the performance of our model on the Explanation task. Figure 2 gives an overview of our model for the Explanation task; it also has three modules.

LMVE for Explanation Task
Let {s_0, ..., s_p}, {h_0, ..., h_q} and {o^i_0, ..., o^i_{r_i}} represent the one-hot vectors of the input statement, the hint sentence, and the i-th option in an instance, where i ∈ {0, 1, 2} and p, q, r_i are their lengths. We first concatenate them and add some special tokens as in Figure 2 (which shows the model architecture for the Explanation task), obtaining three sequences {x_[CLS], s_0, ..., s_p, x_[SEP], h_0, ..., h_q, x_[SEP], o^i_0, ..., o^i_{r_i}, x_[SEP]}. The three sequences are then fed into ALBERT respectively. As in the last subsection, each sequence yields a representation vector after fusion, and the three representation vectors pass through a linear layer, as in the answer prediction module above, to calculate the probability distribution over answers.
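Concretely, the input construction with the hint sentence can be sketched as plain strings (a simplification: real inputs go through the ALBERT tokenizer, and the bracketed tokens stand in for its special tokens; the hint sentence shown is a hypothetical example, not from the dataset):

```python
def build_explanation_inputs(statement, hint, options):
    """Pair the against-common-sense statement and its hint sentence
    with each candidate reason, yielding one sequence per option."""
    return [
        f"[CLS] {statement} [SEP] {hint} [SEP] {opt} [SEP]"
        for opt in options
    ]

seqs = build_explanation_inputs(
    "he put an elephant into the fridge",   # statement s, against common sense
    "he put a turkey into the fridge",      # hint sentence (hypothetical)
    ["an elephant is much bigger than a fridge",
     "elephants are usually gray while fridges are usually white",
     "an elephant cannot eat a fridge"],
)
```

Each of the three sequences is encoded independently; only the option segment differs between them.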
We define the training loss as the cross-entropy loss

L = −(1/N) Σ_{i=1}^{N} log p_{i, y_i}

where N is the number of samples in the dataset, y_i ∈ {0, 1, 2} is the true label, and p_{i, y_i} is the predicted probability of the true option.

Subtask Level Transfer Learning
Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Pretraining language models are a typical example, which we call task level transfer learning. The Sen-Making and Explanation tasks are both generalized multiple choice tasks, and there is an association between their input data, so we believe that in SemEval-2020 Task 4, fine-tuning on one subtask can improve performance on the other.
Subtask level transfer learning refers to using the encoder fine-tuned on subtask a (Sen-Making) to train subtask b (Explanation), and vice versa. The process of subtask level transfer learning is shown in Figure 3.
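The weight hand-off can be sketched as follows, using plain dicts to stand in for model state; the "encoder."/"decoder." key prefixes are our own illustrative convention, not the actual checkpoint layout:

```python
def transfer_encoder(src_state, dst_state, encoder_prefix="encoder."):
    """Copy fine-tuned encoder weights from the source subtask's model
    into the target subtask's model, leaving its decoder untouched."""
    return {
        k: (src_state[k] if k.startswith(encoder_prefix) and k in src_state else v)
        for k, v in dst_state.items()
    }

# toy states: encoder weights are shared, decoders are task-specific
model_a = {"encoder.w": 0.9, "decoder.w": 0.1}   # fine-tuned on Sen-Making
model_b = {"encoder.w": 0.0, "decoder.w": 0.5}   # fresh Explanation model
model_b = transfer_encoder(model_a, model_b)
```

Only the encoder is transferred because the two subtasks' decoders have different output shapes (binary choice vs. three-way choice).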

Dataset
ComVE includes 10,000 samples in the train set and 1,000 samples each in the dev and test sets, for both the Sen-Making and Explanation tasks. The average lengths of the two statements in the Sen-Making task are both 8.26, exactly the same. The average length of the true reasons in the Explanation task is 7.63.
It should be noted that in SemEval-2020 Task 4, the test set of the Explanation task is released only after the Sen-Making task is completed, so it is impossible to use the Explanation test set to reverse-deduce the answers for the subtask a test set.

Baseline
To verify the effectiveness of our model, we use ALBERT to replace BERT in the official baselines, leaving the rest unchanged. We do not perform subtask level transfer learning (Sec 2.4) on them.

Preprocessing
Data Augmentation: To enhance the robustness of our model, we use Google Sheets to perform back translation on the original texts to obtain augmented texts. Specifically, given a training sample s1, s2, o1, o2, o3, we first translate the original statements s1 and s2 to French and then translate them back to English (denoted as ŝ1 and ŝ2). ŝ1, ŝ2, o1, o2, o3 is added to the training dataset as a new sample; the size of the dataset doubles after augmentation.

Tokenization: We employ the tokenizer that comes with the HuggingFace (Wolf et al., 2019) PyTorch implementation of ALBERT. The tokenizer lowercases the input and applies SentencePiece encoding (Kudo, 2018) to split input words into the most frequent subwords present in the pre-training corpus. Non-English characters are removed.
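The augmentation loop can be sketched as follows. The translate function below is an identity stub standing in for the English→French→English round trip (performed via Google Sheets in practice), so only the bookkeeping is shown:

```python
def translate(text, src, tgt):
    # Placeholder for the machine-translation round trip; in practice
    # this is done via Google Sheets, not a Python API.
    return text

def back_translate_augment(dataset):
    """Double the dataset by back-translating the two statements of
    each sample; the three options are kept unchanged."""
    augmented = list(dataset)
    for s1, s2, o1, o2, o3 in dataset:
        s1_bt = translate(translate(s1, "en", "fr"), "fr", "en")
        s2_bt = translate(translate(s2, "en", "fr"), "fr", "en")
        augmented.append((s1_bt, s2_bt, o1, o2, o3))
    return augmented

data = [("put the elephant in the refrigerator",
         "put the turkey in the refrigerator",
         "o1", "o2", "o3")]
data = back_translate_augment(data)
```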

Implementation Details
We use the Transformers toolkit to implement our model and tune the hyper-parameters according to performance on the development set. The hidden size is equal to that of the corresponding PLM. To train our model, we employ the AdamW algorithm (Loshchilov and Hutter, 2019) with an initial learning rate of 2e-5 and a mini-batch size of 48.
We also prepared ensemble models consisting of 7 models for the Sen-Making task and 19 for the Explanation task, trained with different hyperparameter settings and random seeds. We used a majority voting strategy to fuse the candidate predictions of the different models.
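The voting step itself is simple; a sketch (ties break in favor of the label seen first, one of several reasonable conventions):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model, aligned by sample.
    Returns the most common label for each sample."""
    fused = []
    for sample_preds in zip(*predictions):
        label, _ = Counter(sample_preds).most_common(1)[0]
        fused.append(label)
    return fused

# three models' predictions over four samples
preds = [[0, 1, 2, 1],
         [0, 1, 0, 1],
         [1, 1, 2, 2]]
```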


Main Result
The results of our model for subtasks a and b are summarized in Table 1. We tried different pretraining language models as our encoder and found that the ALBERT-based model achieves the best performance among them. Figure 4 shows a learning curve computed over the provided training data, evaluated on the development set; we can see that in the low-resource case (using only 10%-20% of the target task's training data), introducing subtask level transfer learning performs significantly better than the original implementation.

Table 2: Ablation study on model components. † means we use the model structure of Figure 1(a), and Baseline means the model in Sec 3.2.

Ablation Study
To get better insight into our model architecture, we conduct an ablation study on dev set of ComVE, and the results are shown in Table 2.
From the results we can see that subtask level transfer learning makes a relatively large contribution to both subtasks a and b, which confirms our hypothesis that fine-tuning on a similar task can improve performance on the current task. Data augmentation and weighted sum fusion also make minor contributions, owing to a more robust dataset and a more robust model respectively.
For subtask a, we can see that compared with the baseline method (Figure 1(a)), concatenating the other sentence into the input achieves higher performance. We speculate the reason is that the traditional method treats the two statements as independent individuals, while our method takes into account the inherent connection between them.
For subtask b, we can see from Table 2 that the hint sentence makes a great contribution to the overall improvement. We think the reason is that a common sense statement with similar grammar and syntax helps the model determine why the input sentence is against common sense.

Conclusions
This paper introduced our system for commonsense validation and explanation. For the Sen-Making task, we use a novel pretraining language model based architecture to pick out which of two given statements is against common sense. For the Explanation task, we use a hint sentence mechanism to greatly improve performance. In addition, we propose subtask level transfer learning to share information between subtasks.
As future work, we plan to integrate external knowledge bases (such as ConceptNet) into commonsense inference.