DEEPYANG at SemEval-2020 Task 4: Using the Hidden Layer State of BERT Model for Differentiating Common Sense

Introducing common sense into natural language understanding systems has received increasing research attention. To facilitate research on commonsense reasoning, SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), was proposed. We participate in sub-task A and try various methods, including traditional machine learning methods, deep learning methods, and recent pre-trained language models. Finally, we concatenate the original output of BERT with the output vectors of BERT's hidden layer states to obtain richer semantic features, and obtain competitive results. Our model achieves an accuracy of 0.8510 on the final test data and ranks 25th among all teams.


Introduction
Humans can draw on a variety of knowledge and reasoning to understand the meanings of language. For example, it is easy for a human to know that someone can put a turkey into a refrigerator but can never put an elephant into it, yet telling the difference can be non-trivial for a system (Wang et al., 2019). What enables us to arrive at this conclusion is the knowledge we have about the world and the underlying reasoning process, often called commonsense thought (Minsky, 2000) or commonsense reasoning (Davis and Marcus, 2015), which allows us to connect pieces of knowledge to reach new conclusions. We know that a refrigerator is a household appliance and that turkeys are food we usually eat, while elephants are not; moreover, an elephant is larger than a refrigerator and cannot fit inside it. While this kind of knowledge and reasoning comes naturally to humans, it is extremely difficult for machines. Despite significant advances in natural language processing over the last several decades, machines are still far from having this type of natural language inference (NLI) ability (Storks et al., 2019). Therefore, we need an effective automatic method to differentiate natural language statements that make sense from those that do not.
In this paper, we present the methods used and the results obtained by our team, DEEPYANG, on the ComVE shared task organized at SemEval-2020 (Wang et al., 2020). The ComVE shared task includes three sub-tasks: Commonsense Validation, Commonsense Explanation (Multi-Choice), and Commonsense Explanation (Generation). We participate only in sub-task A, whose goal is to select, from two natural language statements with similar wordings, which one makes sense and which one does not. For this sub-task, we try traditional machine learning methods, deep learning methods, and pre-trained models. After comparison, we choose the BERT-base model and concatenate the original output of BERT with the output vectors of BERT's hidden layers to obtain richer semantic features. Finally, our model achieves an accuracy of 0.8510 on the test set and ranks 25th.

Related Work
In recent years, the NLP community has witnessed a variety of works that enhance machines' ability to perform deep language understanding that reads between the lines, relying on reasoning and knowledge of the world. Many benchmark tasks and datasets have been presented to support the development and evaluation of such natural language inference ability. Traditional machine learning approaches mainly include Support Vector Machines (SVMs) (Hearst et al., 1998). Nowadays, deep neural networks have become mainstream in NLP, such as Convolutional Neural Networks (Kim, 2014) and Long Short-Term Memory networks (LSTMs) (Wang et al., 2016; Wang et al., 2018).
However, in traditional word embedding models such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), or fastText (Joulin et al., 2016), the embedding vectors are context-independent: even if the target word appears in different contexts, its embedding vector remains the same after training. Consequently, these embeddings lack the capability to model different word senses in different contexts, although this phenomenon is prevalent in language.
To address this problem, recent works have developed contextual word representation models, e.g., Embeddings from Language Models (ELMo) (Peters et al., 2018) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). These models give a word different embedding vectors depending on the context in which it appears. The pre-trained representations can be used as features or fine-tuned for downstream tasks. For example, the Generative Pre-trained Transformer (GPT) (Radford et al., 2018) and BERT introduce minimal task-specific parameters and can be easily fine-tuned on downstream tasks with modified final layers and loss functions.

Methodology
In this section, we explain the methods our team applied to sub-task A. We try several approaches, including traditional machine learning methods (SVM), deep learning methods (LSTM, Attention), and pre-trained language models (BERT). As expected, since the out-of-the-box pre-trained and fine-tuned BERT model has shown great performance on many kinds of NLP problems, BERT also achieves the best result in our comparison, so we choose it as our base model. Next, we introduce the methods we use in this work.

Traditional Machine Learning Methods
Among traditional machine learning methods, SVM is one of the most robust and accurate of the well-known data mining algorithms. Therefore, in our experiments, we apply SVM classifiers, but they do not perform well on the validation set.
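A minimal sketch of such an SVM baseline, assuming TF-IDF features over the concatenated sentence pair (the feature choice and the toy examples are illustrative, not the exact pipeline used in the experiments):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def build_svm_baseline():
    # Word and bigram TF-IDF features, fed into a linear SVM classifier.
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LinearSVC(C=1.0),
    )

# Toy usage: each example is "sent0 <eops> sent1"; the label is the
# index of the statement that is against common sense.
train_x = [
    "he put an elephant into the fridge <eops> he put a turkey into the fridge",
    "she drank a glass of water <eops> she drank a glass of lava",
]
train_y = [0, 1]
clf = build_svm_baseline()
clf.fit(train_x, train_y)
preds = clf.predict(train_x)
```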

Word Embedding Model
Representing a word by a low-dimensional vector is the usual practice in natural language processing. Our system uses the fastText tool to obtain word representations of the sentences. fastText associates a low-dimensional vector with each word, and hidden representations can be shared across classifiers for different classes, so that textual information is used jointly. We use pre-trained fastText embeddings and try combinations of different models, such as LSTM, CNN, and Attention, but cannot obtain satisfactory results.

BERT with Hidden State
Training a model for this task must make full use of the whole sentence to extract useful linguistic, syntactic, and semantic features that help build a deeper understanding of the sentences, while remaining robust to noise, so we use BERT in sub-task A. Unlike most other methods, BERT uses a bidirectional representation, exploiting both left and right context to capture long-range dependencies between the parts of a sentence. In the classification setting, the original output of BERT (the pooler output) is obtained from the last-layer hidden state of the first token of the sequence (the [CLS] token), further processed by a linear layer and a tanh activation function. However, the pooler output is usually not a good summary of the semantic content of the input. It is therefore necessary to explore BERT's deep representations, which can help us understand BERT's limitations more clearly and improve it. Recent studies have shown that the top hidden layers of BERT learn richer semantic features (Jawahar et al., 2019). Therefore, to give the model richer semantic features, we propose the model shown in Figure 1. First, we get the pooler output (PO). Second, we take the H0 ([CLS] hidden state) of the last two hidden layers. Finally, we concatenate PO and H0 and feed the result into the classifier.
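The concatenation head described above can be sketched as follows. The layer choices (pooler output plus the [CLS] states of the last two hidden layers) follow the text; the class name, classifier size, and the use of a randomly initialized config instead of `from_pretrained` are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class BertWithHiddenState(nn.Module):
    """Concatenates BERT's pooler output (PO) with the [CLS] hidden
    states (H0) of the last two layers before the classifier."""
    def __init__(self, config, num_labels=2):
        super().__init__()
        config.output_hidden_states = True
        # For real use, swap in BertModel.from_pretrained("bert-base-uncased", ...)
        self.bert = BertModel(config)
        self.classifier = nn.Linear(config.hidden_size * 3, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        po = out.pooler_output                 # (batch, hidden): pooled [CLS]
        h0_last = out.hidden_states[-1][:, 0]  # [CLS] state, last layer
        h0_prev = out.hidden_states[-2][:, 0]  # [CLS] state, second-to-last layer
        return self.classifier(torch.cat([po, h0_last, h0_prev], dim=-1))

# Tiny randomly initialized config so the sketch runs without downloads.
cfg = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                 num_attention_heads=2, intermediate_size=64)
model = BertWithHiddenState(cfg)
logits = model(torch.randint(0, 100, (2, 8)))
```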

Pre-processing
For the traditional machine learning methods and the word embedding model, we combine sent0 and sent1, using < eops > as the separator between the two sentences, so the input is formed as: sent0 + < eops > + sent1. BERT uses the WordPiece tool for tokenization and inserts special tokens: [CLS], which marks the beginning of each sample, and [SEP], which separates the sentences within a sample, so the input is formed as: [CLS] + sent0 + [SEP] + sent1 + [SEP].
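The two input schemes can be illustrated with simple string templates (the helper names and example sentences are ours):

```python
def format_for_embedding_model(sent0, sent1):
    # "<eops>" separates the two candidate statements.
    return f"{sent0} <eops> {sent1}"

def format_for_bert(sent0, sent1):
    # [CLS] marks the start of the sample; [SEP] separates the sentences.
    return f"[CLS] {sent0} [SEP] {sent1} [SEP]"

pair = ("He put a turkey into the fridge.", "He put an elephant into the fridge.")
emb_input = format_for_embedding_model(*pair)
# -> 'He put a turkey into the fridge. <eops> He put an elephant into the fridge.'
bert_input = format_for_bert(*pair)
# -> '[CLS] He put a turkey into the fridge. [SEP] He put an elephant into the fridge. [SEP]'
```

In practice, the BERT tokenizer inserts these special tokens itself when given a sentence pair; the templates only show the resulting layout.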

Experimental settings
For the word embedding model, we use the crawl-300d-2M embedding: 2 million word vectors trained with subword information on Common Crawl (600B tokens). We use crawl-300d-2M as our word embedding and input the vectors to the LSTM and Attention modules.
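A sketch of such an embedding-plus-LSTM-plus-Attention classifier, assuming frozen 300-d pre-trained vectors and additive attention pooling (the hidden size and other hyperparameters are illustrative, not the exact configuration used):

```python
import torch
import torch.nn as nn

class LstmAttentionClassifier(nn.Module):
    """BiLSTM over frozen pre-trained embeddings with attention pooling."""
    def __init__(self, embedding_matrix, hidden=128, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.lstm = nn.LSTM(embedding_matrix.size(1), hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each time step
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))       # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        context = (weights * h).sum(dim=1)            # weighted sum of states
        return self.out(context)

# Toy usage with a random 50-word, 300-d embedding matrix in place of crawl-300d-2M.
matrix = torch.randn(50, 300)
model = LstmAttentionClassifier(matrix)
logits = model(torch.randint(0, 50, (2, 6)))
```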
For BERT, we use the BERT-Base pre-trained model, which contains 12 layers. We use stratified 5-fold cross-validation with random seed 42 on the training set; stratified sampling ensures that the proportion of samples in each category remains unchanged in each fold. We use the Adam optimizer with a learning rate of 1e-5 and cross-entropy loss. To save GPU memory, the batch size is set to 4 and the number of gradient accumulation steps to 4, so that gradients are accumulated over 4 batches before each back-propagation update. We extract the hidden layer states of BERT by setting output_hidden_states to true.
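The gradient accumulation scheme above (batch size 4, accumulated 4 times, for an effective batch of 16) can be sketched as follows; `model`, `loader`, and the hyperparameters are stand-ins, not the actual training script:

```python
import torch

def train_epoch(model, loader, optimizer, loss_fn, accum_steps=4):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(loader):
        logits = model(inputs)
        # Scale the loss so accumulated gradients average over micro-batches.
        loss = loss_fn(logits, labels) / accum_steps
        loss.backward()                 # gradients accumulate across calls
        if (step + 1) % accum_steps == 0:
            optimizer.step()            # one update per accum_steps batches
            optimizer.zero_grad()

# Toy usage with a linear model and synthetic micro-batches of size 4.
model = torch.nn.Linear(3, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
data = [(torch.randn(4, 3), torch.randint(0, 2, (4,))) for _ in range(8)]
before = model.weight.clone()
train_epoch(model, data, optimizer, torch.nn.CrossEntropyLoss())
```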

Results and analysis
As shown in Table 1, the data set is balanced. In this task, we use accuracy to measure the performance of the proposed method. Accuracy is the most intuitive performance measure: the ratio of correctly predicted observations to the total number of observations.

    set     size    label 0   label 1
    Train   10000   4979      5021

Table 1: Distribution of labels in the train dataset.

Table 2 shows the performance of the various methods on the test set. BERT has a significant advantage over the other methods, likely because the original BERT models, trained on BookCorpus (Zhu et al., 2015) and English Wikipedia, contain sufficient common knowledge. We also attribute this to BERT's training on Next Sentence Prediction, which helps in handling the logical relationship between two sentences. Jawahar et al. (2019) point out that the high hidden layers of BERT learn rich semantic features. Therefore, to obtain richer semantic features, we also take advantage of the top hidden layer states of BERT. Finally, we achieve competitive results with an accuracy of 85.1%, ranking 25th in sub-task A.
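For completeness, the accuracy metric used in the evaluation is simply:

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(labels)
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
```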

Conclusion
In this paper, we address the challenge of automatically differentiating natural language statements that make sense from those that do not. We conduct experiments with SVM, a word embedding model, and BERT. After comparison, we obtain richer semantic features by concatenating the original output of BERT with the output vectors of BERT's hidden layer states. The results show that extracting the hidden states of BERT to obtain richer semantic features helps improve BERT's performance. The code is available online.