JBNU at SemEval-2020 Task 4: BERT and UniLM for Commonsense Validation and Explanation

This paper presents our contributions to SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), and reports experimental results for Subtasks B and C. Our systems rely on pretrained language models, i.e., BERT (including its variants) and UniLM, and rank 10th and 7th among 27 and 17 systems on Subtasks B and C, respectively. We analyze the commonsense ability of existing pretrained language models by testing them on the ComVE dataset, specifically on Subtasks B and C, the explanation subtasks with multi-choice and sentence generation, respectively.


Introduction
SemEval-2020 Task 4 aims to evaluate whether a system can identify whether a given natural language statement makes sense under commonsense knowledge and rationalize that judgment (Wang et al., 2020). Starting from the pilot study, SemEval-2020 Task 4 consists of three subtasks: 1) Subtask A: differentiating statements that make sense from those that do not, and 2) Subtasks B and C: selecting a reason (Subtask B) or generating an explanation (Subtask C) for why the statement does not make sense. This paper presents an overview of our systems that were examined for Subtasks B and C, as well as the final results.
Given the era of pretrained language models, the main design goal of our system on ComVE is to explore the effect of using pretrained language models on Subtasks B and C, both for the understanding and generation tasks. In the top-level design, BERT is selected for Subtask B, and UniLM, a generalized pretrained language model that supports both understanding and generation capabilities, is employed for Subtask C. Our system architectures for Subtasks B and C are summarized as follows.
1. BERT-style models (Subtask B): For Subtask B, we first apply BERT (Devlin et al., 2019) or its variants to the [SEP]-based concatenation of the statement and the i-th optional statement. Given k optional statements, BERT+FNN applies a feedforward neural network (FNN) to the last-layer embedding of the [CLS] token in each of the k BERT-encoded representations to compute k scores. The probabilities of the k optional statements being the reason are then computed with the softmax function. BERT+BiLSTM is a simple extension that further transforms the BERT-encoded representations using a bidirectional long short-term memory (BiLSTM) network before applying the FNN on the [CLS] embedding. To compare the performance of the BERT variants on Subtask B, we replace the BERT module in BERT+FNN and BERT+BiLSTM with RoBERTa or ALBERT (Lan et al., 2019). In this paper, all models derived from BERT, RoBERTa, and ALBERT are referred to as BERT-style models.

2. UniLM (Subtask C): For Subtask C, we need a generation-capable pretrained language model beyond BERT, because it is difficult to apply BERT to natural language generation (NLG) tasks owing to its bidirectional nature (Wang and Cho, 2019). Hence, we employ UniLM (Dong et al., 2019), which was recently found to be successful in NLG. UniLM is pretrained with three language model (LM) tasks: a unidirectional LM (Peters et al., 2018), a bidirectional LM (Devlin et al., 2019), and a sequence-to-sequence prediction LM, which allows it to be fine-tuned on both natural language understanding (NLU) and NLG tasks. We examine the effect of UniLM on commonsense reasoning for Subtask C.
The remainder of this paper is organized as follows. Section 2 briefly describes the data for the SemEval-2020 ComVE task. Section 3 presents the details of our system architecture, and Section 4 provides the preliminary and official experimental results. Our concluding remarks and a description of future work are presented in Section 5.

Data Description for SemEval-2020 ComVE
Each instance in the dataset for the explanation subtasks of ComVE is composed of seven sentences $\langle s, o_1, o_2, o_3, r_1, r_2, r_3 \rangle$. The statement $s$, which does not make sense, is the input sentence common to Subtasks B and C. For Subtask B, $o_1$, $o_2$, and $o_3$ are the three optional sentences that explain why the statement does not make sense; only a single sentence $o_i$ is marked as the correct (positive) reason, and the others are negative ones. For Subtask C, $r_1$, $r_2$, and $r_3$ are the additional reference reasons, which are used for training and evaluation. During the preliminary experiment for Subtask B, we found that some of the topics included only two optional sentences. Thus, we excluded those topics from the dataset. In addition, we attached a punctuation mark to the sentences that did not originally end with any punctuation mark.
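The two preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the code used for the submission: the dictionary layout (`statement`, `options`, `label`), the function names, and the choice of a period as the attached mark are assumptions, since the text does not state which punctuation mark was attached.

```python
import string
from typing import Dict, List


def ensure_final_punctuation(sentence: str, mark: str = ".") -> str:
    """Append `mark` if the sentence does not already end with punctuation."""
    text = sentence.strip()
    if text and text[-1] not in string.punctuation:
        text += mark
    return text


def preprocess(instances: List[Dict]) -> List[Dict]:
    """Keep instances with exactly three options and normalize punctuation."""
    cleaned = []
    for inst in instances:
        # Skip instances that do not provide three optional sentences.
        if len(inst["options"]) != 3:
            continue
        cleaned.append({
            "statement": ensure_final_punctuation(inst["statement"]),
            "options": [ensure_final_punctuation(o) for o in inst["options"]],
            "label": inst["label"],  # index of the correct option (assumed field)
        })
    return cleaned
```

Because only complete three-option instances are kept, the label index never needs to be remapped after filtering.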

System Description
This section presents detailed descriptions of our systems and the methods used for Subtasks B and C at SemEval-2020 ComVE.

Model for Subtask B
Let us recall the definition of Subtask B: given statement s, the system is required to select the correct reason from three optional reasons by examining why statement s does not make sense.
To address Subtask B, we propose two types of BERT-style models, BERT+FNN and BERT+BiLSTM, whose architectures are depicted in Figures 1 and 2, respectively, where the inputs for BERT are provided at the bottom and the output from BERT is presented at the top right. Formally, given an optional sentence $o_i$, let $W_i = [w_{i,0}, w_{i,1}, w_{i,2}, \cdots, w_{i,n}]$ be the sequence of output embeddings of the BERT-style model, where $w_{i,0}$ denotes the [CLS] token embedding for the i-th sentence, and $w_{i,t} \in \mathbb{R}^{n}$ denotes each output embedding.
In BERT+FNN, we use only the [CLS] token embedding, which is the first token embedding acquired from the BERT-style model for classification. The [CLS] token embedding is usually used to represent the whole meaning of a given sentence. BERT+FNN is composed of an FNN and the softmax function: the FNN maps the [CLS] embedding $w_{i,0}$ of each optional sentence to a score, and the softmax over these scores indicates the probability that $o_i$ is the correct reason for why $s$ does not make sense.
In BERT+BiLSTM, we use all token embeddings acquired from the BERT-style models. To obtain sentence representations, we employ a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) architecture. Using the LSTM, given the i-th optional sentence, the hidden state at time $t$, denoted by $h_{i,t} \in \mathbb{R}^{m}$, is computed from the input embedding $w_{i,t}$ and the previous hidden and cell states, where $c_{i,t}$ denotes the cell state of the LSTM.
To exploit the contextual information in a bidirectional manner, we process the input embeddings using a bidirectional LSTM, which reads an input in both forward and backward orders. We then combine the two directions by mean pooling instead of concatenation, $\mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}$, which yields the mean vector of both representations. In particular, the hidden state at time $t$, $h_{i,t} \in \mathbb{R}^{m}$, for an input embedding of length $T$ is the mean of the forward and backward hidden states at that position. The rest of the architecture is the same as that of BERT+FNN, where the feedforward neural network is applied to $h_{i,T}$ instead of the [CLS] embedding.
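The scoring and pooling steps described above can be written compactly as follows. This is a sketch consistent with the prose, not necessarily the authors' exact formulation: the score symbol $z_i$ and the arrow notation for the two LSTM directions are assumptions introduced here, and $k = 3$ is the number of optional sentences.

```latex
% BERT+FNN: one scalar score per optional sentence, normalized by softmax
z_i = \mathrm{FNN}(w_{i,0}), \qquad
P(o_i \mid s) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)}

% BERT+BiLSTM: forward and backward recurrences over the token embeddings
(\overrightarrow{h}_{i,t}, \overrightarrow{c}_{i,t})
  = \overrightarrow{\mathrm{LSTM}}\bigl(w_{i,t}, \overrightarrow{h}_{i,t-1}, \overrightarrow{c}_{i,t-1}\bigr), \qquad
(\overleftarrow{h}_{i,t}, \overleftarrow{c}_{i,t})
  = \overleftarrow{\mathrm{LSTM}}\bigl(w_{i,t}, \overleftarrow{h}_{i,t+1}, \overleftarrow{c}_{i,t+1}\bigr)

% Mean pooling of the two directions instead of concatenation
h_{i,t} = \tfrac{1}{2}\bigl(\overrightarrow{h}_{i,t} + \overleftarrow{h}_{i,t}\bigr), \qquad
z_i = \mathrm{FNN}(h_{i,T})
```

Mean pooling keeps the pooled state at the same dimension as each directional state, so the FNN input size does not double as it would with concatenation.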

Input representation for BERT encoder
The remaining part is determining the input to be used for the BERT encoder. BERT takes a pair of sequences packed in the form "[CLS] S1 [SEP] S2 [SEP]," wherein S1 and S2 denote the first and second sequences, respectively. Table 1 presents how an optional sentence is packed as an input with an original sentence $s$.
To make the input more natural, we optionally apply a simple preprocessing step to the dataset, inserting extra words such as "because" between $s$ and $o_i$, as illustrated in Table 2.
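The packing of a statement with one optional sentence can be sketched as below. The function name `pack_input` is hypothetical, and placing the extra word at the start of the second segment is an assumption made for illustration; the exact placement follows Table 2, which is not reproduced here. In practice a tokenizer inserts the special tokens, so the literal string only shows the layout.

```python
from typing import Optional


def pack_input(statement: str, option: str, connective: Optional[str] = None) -> str:
    """Pack a statement and one optional reason as '[CLS] S1 [SEP] S2 [SEP]'.

    If `connective` is given (e.g. "because"), it is prepended to the second
    segment. This placement is an assumption, not taken from the source.
    """
    second = f"{connective} {option}" if connective else option
    return f"[CLS] {statement} [SEP] {second} [SEP]"


# Illustrative sentences, not drawn from the dataset.
default_input = pack_input("He put an elephant into the fridge.",
                           "An elephant is much bigger than a fridge.")
extended_input = pack_input("He put an elephant into the fridge.",
                            "An elephant is much bigger than a fridge.",
                            connective="because")
```

Each of the three optional sentences is packed with the same statement, giving three inputs per instance whose scores are then compared.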

Extension using BERT variants
Under the architectures of Figures 1 and 2, we compare three pretrained language models, i.e., BERT, RoBERTa and ALBERT, to examine the effectiveness of the BERT-style models on the commonsense reasoning ability in the setting of Subtask B.

Model for Subtask C
It should be noted that Subtask C is a type of NLG task, where the goal is to generate reasons for determining why input sentence s does not make sense. To address Subtask C, we use UniLM, as mentioned in the introduction, inspired by the recent works wherein UniLM demonstrates promising performance on NLG tasks, such as abstractive summarization, question generation, and generative question answering. While BERT is used mainly for NLU tasks, UniLM provides various types of language models based on its encoders and decoders, which enable it to be fine-tuned for both NLU and NLG tasks, including our addressed Subtask C.
To train UniLM for Subtask C, let a training (or test) instance be an entry consisting of four sentences $\langle s, r_1, r_2, r_3 \rangle$. The input sentence $s$ is encoded by UniLM, and the generation model based on UniLM is fine-tuned for the loss function of Subtask C, such that it generates a correct reason for why the input sentence $s$ does not make sense. For Subtask C, we select only one reason, $r_1$, from the three possible references.
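Selecting a single reference per instance amounts to building one source-target pair for sequence-to-sequence fine-tuning. The sketch below assumes a hypothetical instance layout with `statement` and `references` fields and is not tied to any particular UniLM training script.

```python
from typing import Dict, List, Tuple


def build_seq2seq_pairs(instances: List[Dict]) -> List[Tuple[str, str]]:
    """Pair each statement s with its first reference reason r_1."""
    pairs = []
    for inst in instances:
        references = inst["references"]
        if not references:
            continue  # nothing to train on for this instance
        pairs.append((inst["statement"], references[0]))
    return pairs
```

The remaining references $r_2$ and $r_3$ are not used as training targets under this choice, so each statement contributes exactly one training pair.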

Experimental Results
We use the officially released dataset of SemEval-2020 Task 4 for the experiments. The dataset is split into train/trial/dev/test sets, and we use the dev (development) set to select the model with the best performance.

Model training
In our submitted model for Subtask B, the hidden dimension of the LSTM was 768. We used the Adam optimizer for the BiLSTM and FNN with a learning rate of 1e-3. The number of LSTM layers and the dropout rate were 1 and 0.3, respectively. For fine-tuning BERT, we used the Adam optimizer with a learning rate of 5e-6.
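One way to realize the two learning rates above is to split the parameters into two groups. The sketch below builds a list of group dictionaries with `params` and `lr` keys, the layout that optimizers such as `torch.optim.Adam` accept for per-group options. The function name and the `bert.` name prefix are hypothetical, and whether the submission used one optimizer with two groups or two separate optimizers is not stated in the text.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    encoder_prefix: str = "bert.",   # hypothetical name prefix of the encoder
    encoder_lr: float = 5e-6,        # fine-tuning rate for the BERT-style encoder
    head_lr: float = 1e-3,           # rate for the BiLSTM and FNN layers
) -> List[Dict[str, Any]]:
    """Split parameters into encoder and head groups with separate learning rates."""
    encoder, head = [], []
    for name, param in named_params:
        if name.startswith(encoder_prefix):
            encoder.append(param)
        else:
            head.append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": head, "lr": head_lr},
    ]
```

With a PyTorch model, the argument would typically be `model.named_parameters()` and the returned list would be passed as the first argument of the optimizer.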
For Subtask C, the submitted model was trained with Adam using $\beta_1 = 0.9$ and $\beta_2 = 0.999$ for optimization. The learning rate and dropout rate were 3e-5 and 0.1, respectively.

Results for Subtask B
For Subtask B, we first evaluate BERT+FNN across several BERT-style models, i.e., BERT, RoBERTa, and ALBERT. Table 3 presents the results of BERT+FNN on the trial and dev datasets released by the SemEval-2020 organizers. Note that all models for Subtask B use exactly the same evaluation pipeline, which makes them directly comparable. As evident in Table 3, the RoBERTa-large-based model performs best on this task, outperforming the other models on the dev set. Our results partly suggest that some commonsense knowledge is captured by the BERT-style models.

Table columns: model name; accuracy (%) on the trial set and the dev set.
2. BERT+BiLSTM + extra words: The extended input representation obtained by adding the extra words in Table 2 is applied.
Table 4 presents the results. Interestingly, the extended input representation of Table 2 yields some improvement over the default one. This result motivates exploring which natural expression is most effective for the BERT input in the ComVE task. The run "BERT+BiLSTM + extra words" is our final submission for Subtask B.
Table 4: Comparison results using RoBERTa-large models under BERT+LSTM of Figure 2 for Subtask B.

Results on Subtask C
Table 5 presents the results of our model using UniLM for Subtask C, compared with the top three results on the leaderboard.
In Table 5, the model achieves a BLEU score of 70.20% on the trial set. On further analysis, we found that the trial set contained a large number of instances that also appear in the training set. Given this redundancy, for the trial set, where the input sentence is likely to be included in the training set, the model tends to generate correct sentences that explain why the input sentence does not make sense. For new test sentences, the model generates relatively low-quality sentences compared with those for the trial set.
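The redundancy between the trial and training sets can be quantified with a simple overlap check. The sketch below is one possible way to do so, not the analysis actually performed: the function names are hypothetical, and matching on lowercased, whitespace-normalized statements is an assumption.

```python
from typing import Iterable, List


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(sentence.lower().split())


def overlap_ratio(train_statements: Iterable[str], trial_statements: List[str]) -> float:
    """Fraction of trial statements that also occur in the training set."""
    if not trial_statements:
        return 0.0
    seen = {normalize(s) for s in train_statements}
    hits = sum(1 for s in trial_statements if normalize(s) in seen)
    return hits / len(trial_statements)
```

A ratio close to 1 would indicate that trial-set scores mostly reflect memorized training inputs rather than generalization to unseen statements.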
Finally, our submission ranks 7th out of the 17 valid submissions on Subtask C.

Conclusion
We participated in Subtasks B and C at the SemEval-2020 ComVE. Our system was based on BERT and UniLM for Subtasks B and C, respectively. For Subtask B, we explored the effects of three variants of BERT-style models to study their commonsense reasoning ability. Our submitted run for Subtask B ranked 10th out of the 27 submissions. Our results showed that the model capacity of BERT is highly related to the task accuracy, suggesting that BERT encodes commonsense knowledge, and more so in larger models. This implies that we should scale up the current models by exploring significantly larger models such as T5 (Raffel et al., 2019; Roberts et al., 2020) and GPT-3 (Brown et al., 2020), or retrieval-based commonsense reasoning motivated by Guu et al. (2020).
For Subtask C, the UniLM-based model ranked 7th out of the 17 submissions and 6th on the human score. We found that a large gap still exists between our results and human-level commonsense reasoning. Most of the sentences generated by the model were considerably different from the answers in the development dataset. Although we did not compare other models on Subtask C, we expect that model capacity would be important here as well.
In future work, we plan to explore large pretrained models to enrich the commonsense knowledge in neural models under closed-book (Roberts et al., 2020) or open-book (Guu et al., 2020) settings. Further, we would like to combine pretrained language models with external knowledge, such as ConceptNet (Speer et al., 2017).