ECNU_ICA_1 SemEval-2021 Task 4: Leveraging Knowledge-enhanced Graph Attention Networks for Reading Comprehension of Abstract Meaning

This paper describes our system for SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. To accomplish this task, we utilize the Knowledge-Enhanced Graph Attention Network (KEGAT) architecture with a novel semantic space transformation strategy. It leverages heterogeneous knowledge to learn adequate evidence, and seeks an effective semantic space of abstract concepts to improve the ability of machines to understand the abstract meaning of natural language. Experimental results show that our system achieves strong performance on this task in terms of both imperceptibility and nonspecificity.


Introduction
Recent years have witnessed the remarkable success of pre-trained language models in machine reading comprehension (MRC). Nevertheless, recent research points out that these dominant approaches rely heavily on superficial text pattern-matching heuristics and achieve only a shallow comprehension of natural language (Zhang et al., 2020). For humans, the basic ability to represent abstract concepts guarantees an in-depth understanding of natural language. Consequently, teaching machines to better comprehend abstract meaning is a significant and urgent step toward pushing the frontier of MRC forward.
If computers can understand passages as humans do, we expect them to accurately predict abstract words that people would use in summaries of the given passages. Thus, researchers have recently proposed the reading comprehension of abstract meaning (ReCAM) task in SemEval-2021. Unlike some previous datasets such as CNN/Daily Mail (Hermann et al., 2015) that ask computers to predict concrete concepts, e.g., named entities, ReCAM requires machines to fill in abstract words removed from human-written summaries. In ReCAM, subtask 1 and subtask 2 respectively evaluate the performance of machines on imperceptibility and nonspecificity, two formal definitions of abstractness in natural language understanding (Spreen and Schulz, 1966; Changizi, 2008). Specifically, concrete words refer to things, events, and properties that we can perceive directly with our senses (Spreen and Schulz, 1966; Coltheart, 1981; Turney et al., 2011), e.g., donut, trees, and red. In contrast, abstract words refer to ideas and concepts that are distant from immediate perception; examples include objective, culture, and economy. Subtask 1 requires machines to perform reading comprehension of abstract meaning for imperceptible concepts, while subtask 2 concentrates on hypernyms, which are more abstract than and distinct from concrete concepts (Changizi, 2008).

* Equal corresponding authors.
To better understand abstract meaning, we utilize the Knowledge-Enhanced Graph Attention Network (KEGAT) architecture with a novel semantic space transformation strategy for ReCAM. It incorporates structured knowledge bases such as ConceptNet (Speer et al., 2017) and exploits a novel representation transformation strategy to improve the ability of machines in natural language understanding. The main contributions of our system are as follows:
• We utilize the KEGAT architecture to accomplish the two subtasks of Reading Comprehension of Abstract Meaning, leveraging heterogeneous knowledge resources to provide adequate evidence and relying on Graph Attention Networks for better reasoning.
• The proposed semantic space transformation strategy seeks an effective representation mapping from concrete objects to abstract concepts, enabling machines to better understand the abstract meanings of natural language.
• Extensive experiments show that our system achieves strong performance on this task in terms of both imperceptibility and nonspecificity.

Methodology
In this section, we describe the framework of our system and propose some strategies to enhance the reasoning ability of the model. An overview of the architecture is depicted in Figure 1.

Input Module
We cast the ReCAM task as a classification problem. For each instance, we assume that P is the passage, Q is the question, and that the instance provides A candidate options; each option is combined with P and Q to form a converted input U_i (i = 1, ..., A), and the model selects the most plausible option.
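The conversion above can be sketched as follows. The exact concatenation format (separator token, truncation policy) is not specified in this paper, so the `[SEP]`-joined form here is an assumption for illustration.

```python
# Sketch of the input conversion: each of the A candidate options is paired
# with the passage P and question Q to form one converted input U_i.
# NOTE: the "[SEP]" separator and plain concatenation are assumptions, not
# details taken from the paper.
def build_inputs(passage, question, options, sep=" [SEP] "):
    """Return one classification input U_i per candidate option."""
    return [sep.join([passage, question, opt]) for opt in options]
```

Each U_i is then embedded and scored independently, and the A scores are compared to pick the answer.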

Reasoning Module
Since pre-trained language models have achieved state-of-the-art performance in various NLP tasks (Devlin et al., 2019; Yang et al., 2019; Lan et al., 2020), we adopt a pre-trained architecture to process the embedding E_{U_i} obtained from the previous step into the high-level representation Ê^base_{U_i}. Specifically, we use Electra (Clark et al., 2020), a word-sensitive pre-trained language model composed of N-layer transformer encoders (Vaswani et al., 2017), depicted in the middle of Figure 1. Then, we utilize a Knowledge-Enhanced Graph Attention Network (KEGAT) component to accomplish the reasoning process based on all relevant entities and the high-level representation of the entire question-answer pair from the pre-trained model. The working principle of our KEGAT model is introduced below.
As shown in Figure 1, our KEGAT model mainly consists of a Graph Attention Network, a self-attention submodule, and a multi-layer perceptron (MLP). It enables a multi-level reasoning process from entities to sentences. At the entity level, we utilize structured knowledge from ConceptNet with a different integration approach so as to conduct inference over newly constructed subgraphs. Here, we adopt an N-gram method to extract all entities from the converted input U_i, and use the edge weight as the probability to select a maximum of k adjacent nodes from ConceptNet for subgraph construction. Supposing the number of entities is n, we construct n subgraphs in total, and these subgraphs may be connected by edges. Next, we use ConceptNet-Numberbatch* to obtain the i-th entity embedding as the initial representation h^(0)_i, which is subsequently refined by the L-layer Graph Attention Network (GAT). In the refinement process, the GAT module automatically learns an optimal edge weight between two entities in these subgraphs based on the ReCAM task, indicating the relevance of adjacent entities to every central entity. In other words, for a central entity, the GAT tries to assign higher weights only to those edges connected with the most reasonable adjacent entities in the constructed subgraph, and discards irrelevant edges. Thus, the abstract semantic inference ability of our model is greatly improved by the knowledge incorporated through the refined subgraphs. The working principle of our GAT is given in Eq. 1-3.
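The entity extraction and subgraph construction steps can be sketched as follows. The toy knowledge graph, the "top-k by weight" simplification of the probabilistic neighbour selection, and the entity vocabulary are all stand-ins for the real ConceptNet pipeline.

```python
import re

# Toy sketch of the subgraph construction step: extract candidate entities
# from the converted input with an n-gram scan, then keep at most k
# neighbours per entity, using edge weight for selection (simplified here
# to "top-k by weight"). TOY_CONCEPTNET is an illustrative stand-in for
# the real ConceptNet graph.
TOY_CONCEPTNET = {
    "economy": [("market", 0.9), ("money", 0.8), ("trade", 0.6), ("bank", 0.3)],
    "culture": [("art", 0.9), ("tradition", 0.7)],
}

def extract_entities(text, vocab, max_n=3):
    """Collect every n-gram (n <= max_n) of the text that is a known entity."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocab and gram not in found:
                found.append(gram)
    return found

def build_subgraph(entity, kg, k=2):
    """Keep the k highest-weight neighbours of the entity."""
    neighbours = sorted(kg.get(entity, []), key=lambda e: -e[1])[:k]
    return {entity: neighbours}
```

One subgraph is built per extracted entity, giving the n (possibly interconnected) subgraphs described above.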
We update each entity node based on Eq. 1, where σ(·) is an ELU function (Clevert et al., 2016), W is a network parameter, h^(l)_i is the representation from the l-th GAT layer, and N_i stands for all nodes adjacent to the i-th entity. M is the number of independent attention mechanisms in Eq. 2, and a^(l)_ij is the relevance degree of the j-th adjacent entity with respect to the i-th entity. Besides, f(·) is a projection function converting a vector to a real number, and [;] stands for the concatenation operation. Finally, we take the output of the last GAT layer as the final representation of the entity subgraphs.

At the sentence level, we adopt a self-attention submodule and several MLPs to encourage the model to reason over both entities and input sentences. We first utilize an MLP to fuse the symbolic and semantic representations and then apply a self-attention operation for refinement. Thus, the entity-level representation can be further refined by taking the question-answer pair as a reference. In this way, the most valuable dimensions of the fused representations Ê^all_{U_i} are highlighted, retaining the most reasonable information and improving the reasoning ability. We formulate these steps as Eq. 4 and Eq. 5.

* ConceptNet-Numberbatch: https://github.com/commonsense/conceptnet-numberbatch
where G_{U_i} is the refined representation, SelfAttn(·) is a self-attention operation, and σ(·) is the activation function. Finally, we concatenate G_{U_i} and Ê^base_{U_i} to obtain the entire reasoning representation Ê_{U_i}.
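Since Eq. 1-3 are not reproduced in this text, the entity-level GAT update can be illustrated with a minimal single-head sketch that follows the standard GAT formulation the description matches; the M-head concatenation and learned parameters of the actual model are omitted.

```python
import numpy as np

# Minimal single-head sketch of the GAT update described above. a_ij is the
# normalised relevance of adjacent entity j to central entity i, f([.;.]) is
# a projection of the concatenated projected pair to a scalar, and sigma is
# an ELU. This is an illustrative reconstruction, not the paper's exact code.
def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1)

def gat_layer(h, adj, W, a):
    """h: (n, d) entity embeddings; adj: (n, n) 0/1 adjacency with
    self-loops; W: (d, d') projection; a: (2*d',) attention vector f."""
    z = h @ W                                    # project entities: W h_i
    n = z.shape[0]
    # f([W h_i ; W h_j]) -> unnormalised attention logits
    logits = np.array([[np.concatenate([z[i], z[j]]) @ a for j in range(n)]
                       for i in range(n)])
    logits = np.where(adj > 0, logits, -np.inf)  # restrict to neighbours N_i
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return elu(alpha @ z)                        # sigma(sum_j a_ij W h_j)
```

Stacking L such layers (with multi-head concatenation) yields the refined entity representations fed into the sentence-level fusion.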

Prediction Module
With the preceding multi-level reasoning process, we obtain the representations of the converted inputs as {Ê_{U_i}}^A_{i=1} for each instance. In the prediction module, we use a multi-layer perceptron to solve the downstream ReCAM tasks based on Eq. 7-9.
where y represents the prediction result, and P_i stands for the probability of selecting the i-th option label. P is the output of the MLP, where P ∈ R^{A×1}. L is the training objective, which minimizes the negative log-likelihood, and y* stands for the one-hot vector of the optimal label.
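Since Eq. 7-9 are not reproduced in this text, the prediction module can be sketched as follows; the single linear scorer stands in for the actual MLP, whose depth and sizes are not given here.

```python
import numpy as np

# Sketch of the prediction module: score each of the A reasoning
# representations, softmax over the scores to get option probabilities P_i,
# and train by minimising the negative log-likelihood of the gold option.
# The one-layer scorer is a simplification of the paper's MLP.
def predict(reps, w, b):
    """reps: (A, d) reasoning representations, one per option."""
    scores = reps @ w + b                       # (A,) raw option scores
    scores = scores - scores.max()              # stabilise the softmax
    probs = np.exp(scores) / np.exp(scores).sum()
    return int(np.argmax(probs)), probs

def nll_loss(probs, gold):
    """Negative log-likelihood of the gold option label."""
    return -float(np.log(probs[gold]))
```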

Adaptive Strategies
Noise Reduction Strategy. Previous methods of knowledge integration often introduce inevitable noise (Zhong et al., 2019), and it remains an open research problem to balance the impact of noise against the amount of incorporated knowledge (Weissenborn et al., 2018; Khashabi et al., 2017). Our KEGAT can alleviate, to a certain extent, the noise introduced by the incorporated structured knowledge. This module identifies the most reasonable external entities and discards the irrelevant ones. Specifically, we rely on both the entity-level and sentence-level inference discussed in the Reasoning Module above to achieve this goal. Furthermore, we remove several unimportant types of edges, such as "/r/DistinctFrom" and "/r/ExternalURL", to avoid unnecessary noise.
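The edge-type filtering can be sketched as follows. Only /r/DistinctFrom and /r/ExternalURL are named above, so the drop list here is partial by construction.

```python
# Sketch of the edge-type filter used for noise reduction: edges with
# relation types deemed unimportant are removed before subgraphs are built.
# Only the two relations below are named in the text; the full drop list
# is not specified.
DROPPED_RELATIONS = {"/r/DistinctFrom", "/r/ExternalURL"}

def filter_edges(edges):
    """edges: iterable of (head, relation, tail) triples from ConceptNet."""
    return [e for e in edges if e[1] not in DROPPED_RELATIONS]
```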
Semantic Space Transformation Strategy. Unlike some previous MRC tasks that ask computers to predict concrete concepts, the ReCAM task asks models to fill in abstract words removed from human-written summaries. Thus, we utilize a semantic space transformation strategy to convert the ordinary semantic representation into an abstract representation for classification. Specifically, for the final answer prediction, this approach transforms the hidden vector representation V obtained just before the prediction module. One method is to extend the dimension (ED) of V; for instance, we use an MLP to expand V by 500 dimensions and then perform the downstream classification. A second attempt is to transform V directly with a nonlinear activation function, such as ReLU. A third method is to transform V through a simple deep neural network (DNN), as depicted on the right of Figure 1.
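The three variants can be sketched as follows. The paper specifies only that ED expands V by 500 dimensions, so the activation in the ED sketch and the layer sizes of the DNN sketch are assumptions.

```python
import numpy as np

# Sketches of the three semantic space transformations applied to the
# hidden vector V before prediction. Weight shapes and the tanh in the ED
# variant are illustrative assumptions.
def transform_ed(v, W):
    """Extend-dimension (ED): an MLP maps V to len(v) + 500 dimensions."""
    return np.tanh(v @ W)                     # W: (d, d + 500)

def transform_relu(v):
    """Direct nonlinear transformation of V with ReLU."""
    return np.maximum(v, 0.0)

def transform_dnn(v, W1, W2):
    """A simple two-layer DNN transformation of V."""
    return np.maximum(v @ W1, 0.0) @ W2
```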

Datasets and Metric
The ReCAM task requires the model to fill in abstract words removed from human-written summaries; each question offers five candidate abstract words. We use Accuracy as the metric to evaluate model performance.
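Concretely, Accuracy here is the fraction of questions whose predicted option matches the gold option:

```python
# Accuracy over the five-way option selection used for evaluation.
def accuracy(preds, golds):
    """preds, golds: equal-length sequences of option indices."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```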

Experimental Settings
In our experiments, we set the maximum sentence length to 210 and the batch size to 16. During training, we freeze all layers except the final classification layer and train for 2 epochs with a learning rate of 0.001. In the fine-tuning phase, we unfreeze all layers and train for 10 epochs with a learning rate of 0.000005. As in the training phase, it is beneficial to use the weights of the pre-trained language model to correct the randomly initialized classification layer; with this low learning rate, all layers of the model are adapted to the downstream classification task. For each phase, we save the model parameters when it reaches the highest accuracy on the dev set, and load them at the beginning of the next phase. In addition, we adopt the Adam optimizer (Kingma and Ba, 2015) with epsilon set to 0.000001 for gradient descent. We train our model on Titan XP GPUs. Table 1 shows the results of the top five teams on the ReCAM leaderboard (as of February 10).
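The two-phase schedule described above can be sketched with a toy parameter table in place of a real model; the layer names are illustrative, not the actual module names.

```python
# Sketch of the two-phase schedule: phase 1 trains only the classification
# layer (lr 1e-3, 2 epochs); phase 2 unfreezes all layers (lr 5e-6,
# 10 epochs). The layer names below are hypothetical placeholders.
PHASES = [
    {"trainable": {"classifier"}, "lr": 1e-3, "epochs": 2},
    {"trainable": None, "lr": 5e-6, "epochs": 10},  # None = all layers
]

def set_trainable(layers, trainable):
    """Return a {layer_name: requires_grad} map for one phase."""
    return {name: trainable is None or name in trainable for name in layers}
```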

Results
Our system achieves 3rd place in Subtask 1 in terms of Accuracy, and it can be concluded from Table 2 that our system is able to solve the ReCAM task. Besides, we test the performance of our system with the strategies described in Section 2.4. Here, "+KEGAT" denotes our model with Knowledge-Enhanced Graph Attention Networks, while "+ED", "+RELU", and "+DNN" refer to our system with the different semantic space transformation strategies. In addition, Dev Acc. and Test Acc. stand for the accuracy on the dev set and test set, respectively. Table 2 shows the experimental results of our system on the ReCAM task. In this table, the baseline GA Reader model provided by the competition organizers is not ideal; its performance is only slightly higher than 20% in our own testing. On the dev set, our system achieves relative improvements of 6.69% and 4.24% on subtask 1 and subtask 2, respectively, when adding the KEGAT submodule, compared with the fine-tuned RoBERTa-large. Moreover, we test the performance of the three ensemble models shown at the bottom of Table 2, and the "Electra-large ED + Electra-large KEGAT-RELU" ensemble obtains the best performance on the dev set, outperforming the fine-tuned RoBERTa-large model with relative improvements of 7.41% and 5.29% on subtask 1 and subtask 2, respectively. Here, this ensemble framework refers to the combination of two models. Therefore, it can be concluded that ensemble models with the semantic space transformation strategy greatly improve the reasoning ability of our system, and that a single system with multiple strategies performs well in most cases.

Further Discussion
To further investigate this task, we additionally assess the impact of data bias on model performance. By our statistics, the average lengths of passages in the dev sets of subtask 1 and subtask 2 are 268.8 and 434.6, respectively. In general, longer passages often contain more noise, which greatly influences the model's answer reasoning process. For this assessment, we select only a portion of the given passage instead of the whole passage. Specifically, we take a fixed length of 210 as the content interval and intercept it at two different positions, namely token IDs 0 ∼ 210 and token IDs 211 ∼ 420. We then fine-tune the Electra-large model for each subtask on its own training set and compare the performance of the fine-tuned model on the two passage intervals; that is, we run the experiment twice with different passage contents. Table 3 reports the results of our system on these passage intervals. Compared with the experiment that uses the passage content at positions 0 to 210, intercepting the content at positions 211 to 420 causes performance to drop by about 1 ∼ 2% on both subtasks. Thus, we conclude that positional bias indeed affects model performance to some extent.
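The interval slicing itself is straightforward; a sketch, assuming the passage is already tokenised and the interval bounds are inclusive token IDs:

```python
# Sketch of the passage-interval experiment: cut the tokenised passage to a
# fixed window, either token IDs 0-210 or 211-420, and evaluate the model
# on each window separately.
def passage_window(tokens, interval):
    """interval: (start, end) token IDs, both inclusive."""
    start, end = interval
    return tokens[start:end + 1]
```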

Conclusion
We utilize a Knowledge-Enhanced Graph Attention Network architecture with semantic transformation strategies to help machines better comprehend the abstract meanings of natural language. It incorporates heterogeneous knowledge and relies on Graph Attention Networks to learn adequate evidence. The subsequent semantic transformation enables an effective representation mapping from concrete objects to abstract concepts. Our system achieves strong performance on this comprehension task in terms of both imperceptibility and nonspecificity. We hope this work can shed some light on the study of in-depth reading comprehension.