ECNU-SenseMaker at SemEval-2020 Task 4: Leveraging Heterogeneous Knowledge Resources for Commonsense Validation and Explanation

This paper describes our system for SemEval-2020 Task 4: Commonsense Validation and Explanation (Wang et al., 2020). We propose a novel Knowledge-enhanced Graph Attention Network (KEGAT) architecture for this task, leveraging heterogeneous knowledge from both a structured knowledge base (i.e., ConceptNet) and unstructured text to improve a machine's ability at commonsense understanding. The model gains strong commonsense inference capability through suitable knowledge incorporation methods and upgraded data augmentation techniques. In addition, an internal sharing mechanism is incorporated to keep the model from both insufficient and excessive commonsense reasoning. As a result, the model performs well in both validation and explanation; for instance, it achieves state-of-the-art accuracy on the Commonsense Explanation (Multi-Choice) subtask. We officially name the system ECNU-SenseMaker. Code is publicly available at https://github.com/ECNU-ICA/ECNU-SenseMaker.


Introduction
Commonsense reasoning is the process of making decisions by combining facts and beliefs with basic knowledge that appears in daily life, and it is fundamental to many Natural Language Understanding (NLU) tasks (Johnson-Laird, 1980). However, most existing models are quite weak at commonsense acquisition and understanding compared with humans, since commonsense is often expressed implicitly and is constantly evolving (Wang et al., 2019a). Thus, the commonsense problem is considered an important bottleneck of modern NLU systems (Davis and Marcus, 2015).
Many attempts have been made to empower machines with human abilities in commonsense understanding, and several benchmarks have emerged to verify commonsense reasoning capability, such as SQUABU (Davis, 2016) and CommonsenseQA (Talmor et al., 2019; Lin et al., 2019). However, these benchmarks estimate commonsense indirectly and do not include corresponding explanations. Therefore, the Commonsense Validation and Explanation (ComVE) benchmark was proposed, which directly asks for the real reasons behind decision making (Wang et al., 2020). It consists of three subtasks, and we propose an entire system (ECNU-SenseMaker) to solve the first two, namely Commonsense Validation and Commonsense Explanation (Multi-Choice). In fact, the two subtasks are correlated: Subtask A (Validation) requires the model to select which of two given statements is invalid, while Subtask B (Explanation) aims to identify, among two other confusing options, the real reason why the statement is invalid.
Compared with previous benchmarks, these subtasks are non-trivial because machines do not have the ability of children and adults to accumulate commonsense knowledge in daily life. Since most questions in these subtasks require knowledge beyond the facts mentioned in the text, even models trained over millions of sentences still lag far behind human performance in commonsense reasoning, not to mention explanation (Wang et al., 2019a). Besides, many previous methods for commonsense reasoning have important drawbacks that remain unsolved. For example, the ways of combining knowledge with deep learning architectures are far from satisfactory, and it remains an open issue to balance the tradeoff between noise and the amount of commonsense incorporated from knowledge bases such as ConceptNet (Speer et al., 2017). In addition, researchers still consider this commonsense knowledge base incomplete despite its long evolution: the large majority of commonsense is generally recognized to be implicitly expressed in unstructured text or everyday human interactions. Furthermore, pilot experiments have shown that inference remains a challenging problem in the above subtasks (Wang et al., 2019a).
To overcome these difficulties, we propose a novel Knowledge-enhanced Graph Attention Network (KEGAT) architecture for both Commonsense Validation and Explanation. Different from ordinary Graph Attention Networks (Veličković et al., 2018), this model can leverage heterogeneous knowledge resources to improve its commonsense reasoning ability. On one hand, commonsense knowledge from a structured knowledge base (i.e., ConceptNet) is incorporated and utilized by the system through an elegantly designed knowledge-enhanced embedding module. On the other hand, an upgraded data augmentation technique is put forward to empower the model with commonsense understanding capability based on the essentially unlimited commonsense knowledge learnable from large amounts of unstructured text. In addition, an internal sharing mechanism is incorporated to keep our model from insufficient and excessive commonsense reasoning. In this way, we build a novel Graph Neural Network based architecture for both validation and explanation, making more accurate inferences for commonsense understanding. We officially name it the ECNU-SenseMaker system. In summary, the main contributions of our system are as follows:

• We propose the Knowledge-enhanced Graph Attention Network to solve Commonsense Validation and Explanation, leveraging heterogeneous knowledge resources for better commonsense reasoning.

• Our system uses an elegant embedding module to incorporate commonsense from structured knowledge bases, and designs novel approaches to alleviate the noise caused by external knowledge.

• Our system gains valuable commonsense knowledge from a large amount of unstructured text via a novel data augmentation technique, which also improves the robustness of our model.

• Last but not least, the system uses an internal sharing mechanism to indirectly guide the reasoning process, which largely avoids insufficient and excessive commonsense inference.

Problem Definition
Formally, each instance in the ComVE task consists of five sentences {S1, S2, R1, R2, R3}, where S1 and S2 are two statements with similar surface forms but only one of them conforms to commonsense, and R1, R2, R3 stand for three optional explanations. Subtask A, called Commonsense Validation, requires the model to identify the against-commonsense statement Ŝ⁻ from the two statements S1 and S2, where Ŝ⁻ ∈ {S1, S2}. Subtask B, called Commonsense Explanation (Multi-Choice), requires the model to pick the most reasonable explanation R̂ from the three given options R1, R2, R3 for the against-commonsense statement Ŝ⁻, where R̂ ∈ {R1, R2, R3}. Subtask C, called Commonsense Explanation (Generation), is similar to Subtask B, but requires the model to generate a reasonable explanation R̂.
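The instance structure above can be sketched as a small data container; the field names and the exact way Subtask B combines the invalid statement with each option are our illustrative assumptions, not the official ComVE data format.

```python
# Hypothetical sketch of a ComVE instance and the two classification framings.
from dataclasses import dataclass
from typing import List

@dataclass
class ComVEInstance:
    statements: List[str]     # [S1, S2]: only one conforms to commonsense
    explanations: List[str]   # [R1, R2, R3]: candidate reasons
    invalid_idx: int          # gold label for Subtask A (0 or 1)
    reason_idx: int           # gold label for Subtask B (0, 1, or 2)

def subtask_a_options(inst: ComVEInstance) -> List[str]:
    # Subtask A: pick the against-commonsense statement (A = |S| = 2 classes).
    return inst.statements

def subtask_b_options(inst: ComVEInstance) -> List[str]:
    # Subtask B: pick the true reason for the invalid statement (A = |R| = 3 classes).
    invalid = inst.statements[inst.invalid_idx]
    return [f'Why is "{invalid}" against commonsense? {r}' for r in inst.explanations]

inst = ComVEInstance(
    statements=["he put an elephant into the fridge.",
                "he put a turkey into the fridge."],
    explanations=["an elephant is much bigger than a fridge.",
                  "elephants are usually gray while fridges are usually white.",
                  "an elephant cannot eat a fridge."],
    invalid_idx=0, reason_idx=0)
```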

Methodology
In this section, we describe the framework of our proposed ECNU-SenseMaker system. An overview of the architecture is depicted in Figure 1. We introduce the basic components of the architecture and present several novel strategies that enhance the commonsense reasoning ability of our system.

Input Module
We cast both Subtask A and B as classification problems over A candidate options, where A ∈ {|S|, |R|}, and we assume that y* ∈ {1, 2, ..., A} represents the label of the instance; each option, together with its context, forms a converted input U_i.

Figure 1: The overview of ECNU-SenseMaker.

Then, we can adopt various methods to encode each converted input U_i. For instance, a basic way is to first obtain the one-hot vector and positional encoding for all tokens, apply a separate linear transformation to each, and finally add these vectors to obtain a new embedding E_Ui for every U_i. Furthermore, we propose a more advanced approach that embeds both the converted inputs and structured knowledge from external resources, which will be introduced in the next paragraph.
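The "basic" embedding described above can be sketched in a few lines of numpy; the vocabulary size, model dimension, and random projection matrices are illustrative assumptions.

```python
# Minimal sketch: one-hot token vectors and one-hot positional vectors are
# linearly transformed separately and then summed into one embedding.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 100, 16, 32

W_tok = rng.normal(size=(vocab_size, d_model))  # projects one-hot tokens
W_pos = rng.normal(size=(max_len, d_model))     # projects one-hot positions

def embed(token_ids):
    one_hot = np.eye(vocab_size)[token_ids]               # (T, vocab_size)
    pos_hot = np.eye(max_len)[np.arange(len(token_ids))]  # (T, max_len)
    # separate linear transformations, then element-wise addition
    return one_hot @ W_tok + pos_hot @ W_pos              # (T, d_model)

E = embed([5, 7, 42])
```

Because the one-hot matrices just select rows, this is equivalent to a learned token-embedding table plus a learned positional-embedding table.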
Structured Knowledge Incorporation Inspired by the work of K-BERT (Liu et al., 2020), we propose a novel Knowledge-enhanced Embedding (KEmb) module to incorporate structured knowledge from ConceptNet (Speer et al., 2017) and improve the commonsense understanding ability of the model. For all entities in every converted input U_i, we extract their adjacent entities from ConceptNet and add the corresponding relations to form a tree-structured input, as depicted in the upper part of Figure 2. We use soft-position encoding instead of ordinary positional encoding, i.e., the relative distance between each node and the root node. We also use an attention mask matrix to keep different branches of this tree-structured input invisible to each other, and propose the following strategies to embed structured knowledge into the pre-trained model. Firstly, we use edge weight values in ConceptNet as prior knowledge to select the most relevant adjacent entities and avoid unnecessary noise, since these weight values represent how believable the information is (Speer et al., 2017). Secondly, we manually design several text templates to describe structured entities and their relations in ConceptNet with natural language, and insert these pieces of unstructured text into the original input. An example is shown in Figure 2 to illustrate the detailed operations. For the two entities "sugar" and "coffee" in the statement, we extract their adjacent entities from ConceptNet and select the most relevant ones based on the corresponding edge weight values. Thus, for the central entity "sugar", we only choose the highly correlated entities "sweetening coffee" and "sweet food", but discard "carbohydrate", which has a low weight value. For the triple (sugar, UsedFor, sweetening coffee), we use our template to convert it to "Sugar is used to sweetening coffee" and insert this piece into the original statement with the soft-position and attention mask matrix operations (Liu et al., 2020).
Finally, the flattened input can be handled by the pre-trained model described next.
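The soft-position and attention-mask bookkeeping can be sketched as follows, following the K-BERT idea the paper builds on: tokens of an inserted knowledge branch reuse positions counted from their anchor entity, and the mask lets a branch see only itself and its anchor. The data layout (a dict of branches keyed by anchor index) is our simplifying assumption.

```python
# Simplified KEmb bookkeeping: soft positions + visibility (attention mask).
import numpy as np

def build_soft_positions_and_mask(base_tokens, branches):
    """branches: {anchor_index: [branch tokens]} inserted after that base token."""
    tokens, soft_pos, owner, base_flat = [], [], [], {}
    for i, tok in enumerate(base_tokens):
        base_flat[i] = len(tokens)          # flat index of each trunk token
        tokens.append(tok); soft_pos.append(i); owner.append(-1)
        for j, btok in enumerate(branches.get(i, [])):
            tokens.append(btok)
            soft_pos.append(i + 1 + j)      # position relative to the anchor
            owner.append(i)                 # which trunk token owns this branch
    n = len(tokens)
    mask = np.zeros((n, n), dtype=bool)
    for a in range(n):
        for b in range(n):
            if owner[a] == -1 and owner[b] == -1:
                mask[a, b] = True           # trunk tokens see each other
            elif owner[a] == owner[b]:
                mask[a, b] = True           # tokens of the same branch
            elif owner[a] != -1 and base_flat[owner[a]] == b:
                mask[a, b] = True           # branch token sees its anchor
            elif owner[b] != -1 and base_flat[owner[b]] == a:
                mask[a, b] = True           # anchor sees its own branch
    return tokens, soft_pos, mask

tokens, soft_pos, mask = build_soft_positions_and_mask(
    ["sugar", "is", "used", "to", "make", "coffee", "sweet"],
    {0: ["is", "used", "for", "sweetening", "coffee"]})
```

Note how the branch reuses soft positions 1-5 while the trunk continues 1-6, matching the duplicated position indices visible in the paper's Figure 2.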

Reasoning Module
To improve the reasoning ability of the model, we propose to model the input sentences from a multi-level perspective that combines entities with the entire statements or candidate explanations. First of all, we utilize a pre-trained model to process the embedding E_Ui obtained in the previous step and produce a high-level representation Ê^base_Ui. This is not only because pre-trained architectures have achieved state-of-the-art performance on a variety of NLP tasks (Devlin et al., 2019; Yang et al., 2019; Lan et al., 2020), but also because the language-modeling objective can serve as a metric for estimating how commonsensical a sentence is. Here, we choose a context-sensitive pre-trained model, RoBERTa (Liu et al., 2019), which is composed of an N-layer transformer encoder (Vaswani et al., 2017).

Figure 2: An example of the Knowledge-enhanced Embedding for "Sugar is used to make coffee sweet", showing the inserted ConceptNet branches, soft-position indices, and the attention mask.

Knowledge-enhanced Graph Attention Network (KEGAT) As shown in Figure 1, our KEGAT mainly consists of a Graph Attention Network, a self-attention submodule, and several multi-layer perceptrons (MLPs), which together enable multi-level reasoning from entities to sentences. At the entity level, we still utilize structured knowledge from ConceptNet but take a different incorporation approach, which aims to conduct commonsense inference over newly constructed subgraphs. To achieve this, we adopt the N-gram method to extract all entities from the converted input U_i, and use edge weights as sampling probabilities to sample at most k adjacent nodes from ConceptNet, forming a subgraph for every extracted entity. Supposing the number of entities is n, we obtain n constructed subgraphs in total, which may be connected with edges.
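The subgraph construction step can be sketched as weighted sampling without replacement; the toy adjacency triples below are invented for illustration and do not come from the actual ConceptNet dump.

```python
# Sketch: sample up to k neighbours per entity, with probability proportional
# to the ConceptNet edge weight.
import random

def sample_subgraph(entity, neighbours, k, rng):
    """neighbours: list of (adjacent_entity, relation, weight) triples."""
    pool = list(neighbours)
    if len(pool) <= k:
        return pool                       # fewer than k neighbours: keep all
    picked = []
    for _ in range(k):
        total = sum(w for _, _, w in pool)
        r = rng.random() * total          # roulette-wheel selection
        acc = 0.0
        for item in pool:
            acc += item[2]
            if r <= acc:
                picked.append(item)
                pool.remove(item)         # without replacement
                break
    return picked

rng = random.Random(0)
neighbours = [("sweetening coffee", "UsedFor", 4.0),
              ("sweet food", "IsA", 3.0),
              ("carbohydrate", "IsA", 0.5)]
subgraph = sample_subgraph("sugar", neighbours, k=2, rng=rng)
```

High-weight neighbours such as "sweetening coffee" are far more likely to survive sampling than the low-weight "carbohydrate", which mirrors the noise-filtering intent described in the text.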
After that, we use ConceptNet-Numberbatch* to obtain the i-th entity embedding as the initial representation h_i^(0), which is then refined by an L-layer Graph Attention Network (GAT) (Veličković et al., 2018). During refinement, the GAT module automatically learns an optimal edge weight between two entities in these subgraphs, driven by the Commonsense Validation or Explanation subtask, indicating the relevance of adjacent entities to each central entity. In other words, for a central entity, the GAT attempts to assign higher weights only to edges connected with the most commonsensical adjacent entities in the constructed subgraph, and to discard irrelevant edges. Thus, the commonsense inference ability of our model is greatly improved by the knowledge incorporated through the refined subgraphs. The working principles of our GAT are given in Eq. 1-3.
We update every entity node based on Eq. 1, where σ(·) stands for an ELU function (Clevert et al., 2016), W is a network parameter, h_i^(l) is the representation from the l-th layer of the GAT, and N_i represents all nodes adjacent to the i-th entity. M is the number of independent attention mechanisms in Eq. 2, and a_ij^(l) represents the relevance degree of the j-th adjacent entity with respect to the i-th entity. Besides, f(·) is a projection function converting a vector to a real number, and [;] stands for the concatenation operation. Finally, we define Ê to be the final representation of the entity subgraphs obtained from the GAT.

* ConceptNet-Numberbatch: https://github.com/commonsense/conceptnet-numberbatch
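Eq. 1-3 are not reproduced in this copy of the paper. Based on the description above and the standard multi-head GAT formulation (Veličković et al., 2018), they plausibly take a form similar to the following; this is our reconstruction, not the authors' exact equations:

```latex
% Eq. 1 (node update, averaging over M attention heads)
h_i^{(l+1)} = \sigma\!\left( \frac{1}{M} \sum_{m=1}^{M} \sum_{j \in \mathcal{N}_i}
              a_{ij}^{(l,m)} \, W^{(l,m)} h_j^{(l)} \right)

% Eq. 2 (attention coefficients normalized over the neighbourhood N_i)
a_{ij}^{(l,m)} = \frac{\exp\!\big( e_{ij}^{(l,m)} \big)}
                      {\sum_{k \in \mathcal{N}_i} \exp\!\big( e_{ik}^{(l,m)} \big)}

% Eq. 3 (unnormalized scores: projection f over concatenated features)
e_{ij}^{(l,m)} = f\!\big( \big[\, W^{(l,m)} h_i^{(l)} \,;\, W^{(l,m)} h_j^{(l)} \,\big] \big)
```

Here σ is the ELU activation, f maps the concatenated pair to a scalar, and the M heads are combined by averaging, consistent with the symbols defined in the surrounding text.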
From the sentence level, we adopt a self-attention submodule and several MLPs to encourage the model to reason over both entities and input sentences. We first utilize an MLP to fuse the symbolic and semantic representations, and then apply a self-attention operation for refinement. Thus, the entity-level representation can be further refined using the statements and candidate explanations as a reference. In short, valuable dimensions can be highlighted to retain the most commonsensical information from the fused representations Ê^all_Ui and improve reasoning ability. We formulate these operations in Eq. 4-5.
where G_Ui is the refined representation, SelfAttn(·) stands for a self-attention operation, and σ(·) is the activation function. Finally, we concatenate G_Ui and Ê^base_Ui to obtain the entire reasoning representation.
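Since Eq. 4-5 are not reproduced in this copy, the fuse-then-refine step can only be sketched; the dimensions, ReLU activation, and single-head dot-product self-attention below are our assumptions about a plausible instantiation.

```python
# Illustrative sketch: MLP fusion of graph and pre-trained representations,
# followed by one self-attention pass, then concatenation with the base
# representation (the "entire reasoning representation").
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_fuse = rng.normal(size=(2 * d, d)) / np.sqrt(2 * d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(X):
    scores = X @ X.T / np.sqrt(X.shape[-1])   # scaled dot-product
    return softmax(scores) @ X

def refine(E_graph, E_base):
    fused = np.maximum(np.concatenate([E_graph, E_base], axis=-1) @ W_fuse, 0.0)
    G = self_attn(fused)                            # refined representation G_Ui
    return np.concatenate([G, E_base], axis=-1)     # concat with E^base_Ui

out = refine(rng.normal(size=(5, d)), rng.normal(size=(5, d)))
```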
Unstructured Knowledge Augmentation Since commonsense knowledge may be scattered across a large volume of unstructured text, we introduce three novel types of unstructured resources to teach our ECNU-SenseMaker system, aiming to improve the model's intelligence in commonsense understanding. The first type comes from the training set of CommonsenseQA (Talmor et al., 2019), and the second originates from ConceptNet. In detail, we manually design templates to express the relations extracted from ConceptNet in natural language, thereby generating a large amount of unstructured text from this structured commonsense knowledge base. These unstructured data share a similar distribution with the text in the ComVE dataset (Wang et al., 2020). Therefore, we can use them to train our system before the official process starts, which generally leads to better initial weights for our model. In addition, we integrate some data from Subtask C as the third type of unstructured resource to increase the generalization ability of our system, which is allowed by the competition regulations.
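The template-based verbalization of ConceptNet triples can be sketched as below; the templates are our illustrative guesses, not the exact ones used by the authors.

```python
# Sketch: turn ConceptNet (head, relation, tail) triples into pseudo training
# sentences for the augmentation step.
TEMPLATES = {
    "UsedFor":    "{h} is used for {t}.",
    "IsA":        "{h} is a kind of {t}.",
    "AtLocation": "{h} can be found at {t}.",
    "CapableOf":  "{h} can {t}.",
}

def verbalize(head, relation, tail):
    # Fall back to a generic template for relations we have no pattern for.
    template = TEMPLATES.get(relation, "{h} is related to {t}.")
    return template.format(h=head, t=tail).capitalize()

sent = verbalize("sugar", "UsedFor", "sweetening coffee")
# -> "Sugar is used for sweetening coffee."
```

Emitting one sentence per sampled triple yields a large synthetic corpus whose surface form resembles the short declarative statements in ComVE.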

Prediction Module
After the above multi-level reasoning, for every training instance we obtain the representations {Ê_Ui} for i = 1, ..., A of the converted inputs. In the prediction module, we use a multilayer perceptron (MLP) to solve the downstream commonsense validation and explanation tasks, based on Eq. 7-8.
where P ∈ R^(A×1) is the output of the MLP, ŷ stands for the prediction result, and P_i represents the probability of selecting the i-th statement or explanation. The training objective, given in Eq. 8, minimizes the negative log-likelihood, where y* stands for the one-hot vector of the gold label.
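Since Eq. 7-8 are not reproduced in this copy, the prediction module can be sketched as follows; the hidden size and ReLU activation are our assumptions, while the argmax prediction and negative log-likelihood objective follow the description above.

```python
# Sketch of the prediction module: an MLP scores the A candidate
# representations, prediction is argmax over softmax(P), training minimizes
# the negative log-likelihood of the gold label.
import numpy as np

rng = np.random.default_rng(2)
d, hidden = 8, 16
W1 = rng.normal(size=(d, hidden))
W2 = rng.normal(size=(hidden, 1))

def predict_scores(reps):
    """reps: (A, d), one row per candidate option; returns P in R^A."""
    return (np.maximum(reps @ W1, 0.0) @ W2).ravel()

def log_softmax(x):
    x = x - x.max()                      # numerically stable
    return x - np.log(np.exp(x).sum())

def nll_loss(scores, gold):
    return -log_softmax(scores)[gold]    # cross-entropy with one-hot y*

scores = predict_scores(rng.normal(size=(3, d)))   # Subtask B: A = 3
pred = int(np.argmax(scores))
loss = nll_loss(scores, gold=0)
```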

Adaptive Strategies
Noise Alleviation Strategy Previous knowledge incorporation approaches often introduce inevitable noise (Zhong et al., 2019; Wang et al., 2019b), and it is still an open research question how to balance the tradeoff between noise and the amount of commonsense incorporated from knowledge bases containing all types of entity nodes (Weissenborn et al., 2017; Khashabi et al., 2017). The proposed KEmb and KEGAT provide possible solutions that alleviate, to some extent, the noise caused by incorporated structured knowledge. These two modules share the same goal of identifying the most commonsensical external entities and discarding irrelevant ones. Taking KEGAT for instance, we achieve this via both entity-level and sentence-level inference, as described in the Reasoning Module above. Furthermore, several unimportant types of edges, such as "/r/ExternalURL" and "/r/DistinctFrom", are removed to avoid unnecessary noise.

Table 1: Task Examples

Subtask A: Which statement of the two is against commonsense?
S1: he put an elephant into the fridge. ×
S2: he put a turkey into the fridge.

Subtask B: Why is "he put an elephant into the fridge" against commonsense?
A. an elephant is much bigger than a fridge. √
B. elephants are usually gray while fridges are usually white. ×
C. an elephant cannot eat a fridge. ×
Internal Sharing Mechanism To further improve the commonsense inference ability of our system, we propose an internal sharing mechanism that utilizes processed outputs of the pre-trained model to guide the KEGAT module's reasoning in the right direction, aiming to avoid insufficient or excessive inference. First, we add a new multilayer perceptron to process the output of the RoBERTa pooler layer, Ê^base_Ui, the representation from the pre-trained model. Meanwhile, we also obtain the embedding of every input token and project it into a vocabulary-size vector with this network. In this way, we can treat each token as a classification target and compute a per-token loss that measures the difference between the projected embedding and the original input. We then sum these up to obtain the loss of the whole sentence, using cross-entropy, which gives the pre-trained module loss L1. We follow an approach similar to Kendall et al. (2018) to combine L1 with the KEGAT loss L2, obtaining the overall loss L = L1(W)/(2σ1²) + L2(W)/(2σ2²) + log σ1σ2, where σ1 and σ2 are two learnable parameters that dynamically adjust the weighting of the two objectives according to the difficulty of fitting the data. With this internal sharing mechanism, the cross-entropy loss from the pre-trained model part guides the KEGAT module in deciding whether or not to remove a relevant evidence chain. Experiments show that this approach has a positive impact on the performance of our model in the commonsense validation and explanation tasks.
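The uncertainty-based loss combination can be written out directly from the formula above; in the real model σ1 and σ2 are learnable parameters, whereas here they are plain floats for illustration.

```python
# Homoscedastic-uncertainty loss combination (Kendall et al., 2018):
#   L = L1(W)/(2*sigma1^2) + L2(W)/(2*sigma2^2) + log(sigma1 * sigma2)
import math

def combined_loss(l1, l2, sigma1, sigma2):
    return (l1 / (2.0 * sigma1 ** 2)
            + l2 / (2.0 * sigma2 ** 2)
            + math.log(sigma1 * sigma2))

# Increasing sigma1 down-weights the L1 term but pays a log-penalty,
# which keeps the sigmas from growing without bound during training.
base = combined_loss(1.0, 2.0, sigma1=1.0, sigma2=1.0)  # 0.5 + 1.0 + 0 = 1.5
```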

Datasets and Metric
In the ComVE benchmark, Subtask A (Validation) requires the model to find the against-commonsense statement, while Subtask B (Explanation: Multi-Choice) aims to select the most reasonable explanation for the invalid statement. Examples of both subtasks are listed in Table 1. For both Subtask A and B, we use Accuracy as the metric to evaluate model performance (Wang et al., 2020).

Experimental Settings
In our experiments, we set the batch size to 2 and the maximum sentence length to 128. During training, we freeze all layers except the final classification layer and train for 4 epochs with a learning rate of 0.001. In the fine-tuning phase, we unfreeze all layers and train for 8 epochs with a learning rate of 0.000005. As in the training phase, using the weights of the pre-trained language model itself to classify the downstream tasks is beneficial for correcting the randomly initialized classification layer; therefore, all layers of the model are fine-tuned with the same low learning rate in this phase. In each phase, we save the model parameters at the point of highest test accuracy and load them at the beginning of the next phase. In addition, we use the Adam optimizer (Kingma and Ba, 2015) with epsilon set to 0.000001 for gradient descent. Table 2 shows the results of the top five teams on the leaderboard for Subtask A & B (as of March 17).
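The two-phase schedule described above can be collected into a single config for reference; the dict layout and key names are ours, while the values come from the text.

```python
# Illustrative config for the two-phase training schedule.
TRAINING_CONFIG = {
    "batch_size": 2,
    "max_seq_length": 128,
    "optimizer": {"name": "Adam", "eps": 1e-6},
    "phases": [
        # Phase 1: only the classification head is trainable.
        {"name": "train", "trainable": "classification layer only",
         "epochs": 4, "lr": 1e-3},
        # Phase 2: everything is unfrozen and tuned at a much lower rate,
        # starting from the best checkpoint of phase 1.
        {"name": "fine-tune", "trainable": "all layers",
         "epochs": 8, "lr": 5e-6},
    ],
}
```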

Results
Our system achieves state-of-the-art accuracy on Subtask B. It outperforms the baseline (BERT base) model (Wang et al., 2019a) with a relative improvement of 52.98% and achieves a relative improvement of 15.43% over the fine-tuned BERT base. Meanwhile, the experimental results also show that our system performs well on Subtask A. We therefore conclude from Table 2 that our system is capable of solving both the commonsense validation and explanation tasks. Besides, we test the performance of our system with the strategies mentioned in Section 3. Here, "+KEmb" stands for our system with the Knowledge-enhanced Embedding, "+KEGAT" for the Knowledge-enhanced Graph Attention Network, "+LM" for the Internal Sharing Mechanism, and "+CommonsenseQA pre-trained" for the system with the data augmentation technique. In addition, Dev Acc. and Test Acc. stand for accuracy on the dev and test sets, respectively. Table 3 shows the experimental results of our system on Subtask A. From this table, we conclude that, on the test set, our system achieves a relative improvement of 7.63% when adding both the KEmb and KEGAT submodules compared with the fine-tuned BERT base. Moreover, we also test the performance of two ensemble models, shown at the bottom of Table 3; the "RoBERTa-large LM + RoBERTa-large + ALBERT-xxlarge" ensemble obtains the best performance on the test set, outperforming the fine-tuned BERT base model with a relative improvement of 8.53%.
Here, this ensemble is a combination of three models, and "RoBERTa-large LM" stands for the RoBERTa-large model with the Internal Sharing Mechanism. Meanwhile, the results on Subtask B are shown in Table 4, from which we conclude that our system with the LM strategy outperforms the fine-tuned BERT base with a relative improvement of 13.85%. Furthermore, our ensemble model achieves a relative improvement of 15.43% over the fine-tuned BERT base. It can therefore be concluded that ensemble models with the Internal Sharing Mechanism greatly improve the commonsense reasoning ability of our system, and that the single system with multiple strategies performs well in most cases. Besides, data augmentation proves to be an effective way to improve machine performance in commonsense validation and explanation.

Error Analysis
To further improve the performance in the future, it is helpful to study the failure cases of our model. In particular, we have categorized the observed failure cases in the ComVE into the following categories.
1. Decisions or interpretations from different perspectives. For example, "S1: A war is fought for solution. S2: There is peace during war.", where S2 violates commonsense; and "S1: Humans wrote the bible about god. S2: God lives physically on earth.", where S2 violates commonsense. However, there may exist various decision-making standards under which both statements make sense from a human point of view.

2. Ambiguous explanations that are difficult to understand. For example, "Ŝ⁻: My son had us write an essay on The National Monument. R1: My son can write the alphabet instead of teach how to write an essay. R2: My son isn't smart enough to assign an essay. R3: My son cannot read and write an essay." Here, all three options, which seem loosely related and ambiguous, are expressed in a neutral way, making it more difficult for the machine to understand and decide.

Conclusion
In this paper, we propose the ECNU-SenseMaker system, which utilizes a novel Knowledge-enhanced Graph Attention Network architecture to solve both the commonsense validation and explanation tasks. It incorporates heterogeneous knowledge from both unstructured text and structured resources such as ConceptNet, and relies on the graph attention module with an internal sharing mechanism to improve the commonsense reasoning ability of the model. Our model achieves good results on the newest SemEval-2020 leaderboard, solving the two subtasks with a single model. We hope our work can shed some light on how to directly empower machines with human abilities in commonsense understanding and reasoning.