SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning

This paper introduces the SemEval-2021 shared task 4: Reading Comprehension of Abstract Meaning (ReCAM). This shared task is designed to help evaluate the ability of machines in representing and understanding abstract concepts. Given a passage and the corresponding question, a participating system is expected to choose the correct answer from five candidates of abstract concepts in cloze-style machine reading comprehension tasks. Based on two typical definitions of abstractness, i.e., imperceptibility and nonspecificity, our task provides three subtasks to evaluate models’ ability in comprehending the two types of abstract meaning and the models’ generalizability. Specifically, Subtask 1 aims to evaluate how well a participating system models concepts that cannot be directly perceived in the physical world. Subtask 2 focuses on models’ ability in comprehending nonspecific concepts located high in a hypernym hierarchy given the context of a passage. Subtask 3 aims to provide some insights into models’ generalizability over the two types of abstractness. During the SemEval-2021 official evaluation period, we received 23 submissions to Subtask 1 and 28 to Subtask 2. The participating teams additionally made 29 submissions to Subtask 3. The leaderboard and competition website can be found at https://competitions.codalab.org/competitions/26153. The data and baseline code are available at https://github.com/boyuanzheng010/SemEval2021-Reading-Comprehension-of-Abstract-Meaning.


Introduction
Humans use words with abstract meaning in their daily life. Considerable research effort has gone into better understanding and modelling abstract meaning (Turney et al., 2011; Theijssen et al., 2011; Changizi, 2008; Spreen and Schulz, 1966). Modelling abstract meaning is closely related to many other NLP tasks such as reading comprehension, metaphor modelling, sentiment analysis, summarization, and word sense disambiguation.
In the past decade, significant advancement has been seen in developing computational models for semantics, based on deep neural networks. In this shared task, we aim to help assess the capability of the state-of-the-art deep learning models on representing and modelling abstract concepts in a specific reading comprehension setup.
We introduce SemEval-2021 Task 4, Reading Comprehension of Abstract Meaning (ReCAM). Specifically, we design this shared task by following the machine reading comprehension framework (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2016), in which computers are given a passage $D_i$ as well as a human summary $S_i$ to comprehend. If a model can digest the passage as humans do, we expect it to predict the abstract word used in the summary when that word is masked. Unlike previous work that requires computers to predict concrete concepts, e.g., named entities, in our task we ask models to fill in abstract words removed from human summaries. During the SemEval-2021 official evaluation period, we received 23 submissions to Subtask 1 and 28 submissions to Subtask 2. The participating teams additionally made 29 submissions to Subtask 3. In this paper, we introduce the shared task and provide a summary of the evaluation.
Passage: ... Observers have even named it after him, "Abenomics". It is based on three key pillars of monetary policy to ensure long-term sustainable growth in the world's third-largest economy, with fiscal stimulus and structural reforms. In this weekend's upper house elections, ....
Question: Abenomics: The @placeholder and the risk.
Answer: (A) chance (B) prospective (C) government (D) objective (E) threat
Table 1: An example for Subtask 1. The correct answer to the question is objective.

Subtask 1: ReCAM-Imperceptibility
In one definition (Turney et al., 2011; Theijssen et al., 2011; Spreen and Schulz, 1966), concrete words refer to things, events, and properties that humans can directly perceive with their senses, e.g., trees and flowers. In contrast, abstract words refer to "ideas and concepts that are distant from immediate perception", e.g., objective, culture, and economy. In Subtask 1, named ReCAM-Imperceptibility, we perform reading comprehension on imperceptible abstract concepts. Table 1 shows an example.

Subtask 2: ReCAM-Nonspecificity
The second typical definition of abstractness is based on nonspecific concepts (Theijssen et al., 2011; Spreen and Schulz, 1966). Compared to specific concepts such as groundhog and whale, words such as vertebrate are regarded as more abstract.
Our Subtask 2, named ReCAM-Nonspecificity, is designed based on this viewpoint. We will discuss how the datasets are constructed in Section 3.

Subtask 3: ReCAM-Cross
In this subtask, participants are asked to submit their predictions on the test data of Subtask 2, using models trained on the training data of Subtask 1, and vice versa. This subtask aims to demonstrate models' generalizability between modelling the two typical definitions of abstractness.

Data Construction
We develop our multi-choice machine reading comprehension datasets based on the XSum summarization dataset (Narayan et al., 2018). We first locate words with abstract meaning using our abstractness scorers. Then we perform data filtering to select our target words to construct our datasets.

The XSum Data
By collecting online articles from the British Broadcasting Corporation (BBC), Narayan et al. (2018) developed a large-scale text summarization dataset, XSum, in which each article has a single sentence summary. We developed our ReCAM dataset based on XSum.

Finding Imperceptible Concepts
Abstractness Scorer for Imperceptibility. Following Turney et al. (2011), we use the MRC Psycholinguistic Database (Coltheart, 1981), which includes 4,295 words rated with a degree of abstractness by human subjects, to train our abstractness scorer for imperceptibility. The rating of the words in the MRC Psycholinguistic Database ranges from 158 (highly abstract) to 670 (highly concrete). We linearly scale the rating to the range of 0 (highly abstract) to 1 (highly concrete). The neural regression model accepts fixed GloVe embeddings (Pennington et al., 2014) as input and predicts the abstractness rating score between 0 and 1. Our regression model is a three-layer network that consists of two nonlinear hidden layers with the ReLU activation and a sigmoid output layer. The mean square error (MSE) is used as the training loss.
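The rescaling and the network shape described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the toy weights, and the two-dimensional stand-in for a GloVe vector are assumptions. Only the rating range (158 to 670), the two ReLU hidden layers, the sigmoid output, and the MSE loss come from the text.

```python
import math

MRC_MIN, MRC_MAX = 158.0, 670.0  # rating range reported for the MRC database


def scale_rating(rating):
    """Linearly map an MRC rating to [0, 1]: 0 = highly abstract, 1 = highly concrete."""
    return (rating - MRC_MIN) / (MRC_MAX - MRC_MIN)


def dense(vec, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, x) for x in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(embedding, params):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    h1 = relu(dense(embedding, params["W1"], params["b1"]))
    h2 = relu(dense(h1, params["W2"], params["b2"]))
    return sigmoid(dense(h2, params["W3"], params["b3"])[0])


def mse(predictions, targets):
    """Mean squared error between predicted and scaled ratings."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical toy parameters for a 2-dimensional "embedding" (not trained values).
toy_params = {
    "W1": [[0.5, -0.2], [0.1, 0.3]], "b1": [0.0, 0.0],
    "W2": [[0.4, 0.4], [-0.3, 0.2]], "b2": [0.0, 0.0],
    "W3": [[1.0, -1.0]], "b3": [0.0],
}
score = predict([1.0, 2.0], toy_params)  # a value strictly between 0 and 1
```

Because the output unit is a sigmoid, predictions live on the same 0-to-1 scale as the rescaled ratings, so the MSE can be computed directly against them.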
To test the regression model's performance, we randomly split the MRC Psycholinguistic Database into a training set and a test set of 2,148 and 1,877 words, respectively. Table 2 shows the final performance of the neural regression model on the MRC database. We use the Pearson correlation between the ratings predicted by the model and the original MRC ratings as the evaluation metric. The regression model achieves high correlation coefficients (the higher, the better), i.e., 0.934 and 0.835, on the training and test set. The correlations are significant (p-values are smaller than 10^-5), reflecting the quality of our models in finding abstract words. Note that Turney et al. (2011) report a correlation score of 0.81 on their MRC test set. Their training-test split is unavailable, so we run cross-validation here in our experiment. The scorer can then be used to assign an imperceptibility score to a word that is not in the MRC Psycholinguistic Database.
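For reference, the Pearson correlation used as the evaluation metric above can be computed as in the following sketch. The function name and the toy inputs are illustrative assumptions; the formula itself is the standard sample correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy example: predicted ratings that are a perfect linear function of the gold ratings.
r = pearson([0.1, 0.4, 0.9], [0.2, 0.8, 1.8])
```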
Using the abstractness scorer described above, we assign an abstractness value to each word in summaries and select words with a value lower than 0.35 as the candidates for our target words (words that will be removed from the summaries to construct questions). We only consider content words as potential target words, i.e., nouns, verbs, adjectives, and adverbs. For this purpose, we use the part-of-speech tagging model implemented in Stanza (Qi et al., 2020).
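The candidate selection step above amounts to a threshold on the score plus a part-of-speech check. A minimal sketch follows, assuming the tagger emits Universal POS labels (NOUN, VERB, ADJ, ADV) and using a hypothetical lookup table in place of the trained scorer.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed universal POS labels for content words
ABSTRACTNESS_THRESHOLD = 0.35  # threshold stated in the text


def select_candidates(tagged_summary, scorer, threshold=ABSTRACTNESS_THRESHOLD):
    """Keep content words whose abstractness score is below the threshold.

    tagged_summary: list of (word, pos) pairs.
    scorer: callable mapping a word to a score in [0, 1] (0 = highly abstract).
    """
    return [
        word
        for word, pos in tagged_summary
        if pos in CONTENT_POS and scorer(word) < threshold
    ]


# Hypothetical scores standing in for the neural regression model.
toy_scores = {"the": 0.20, "objective": 0.20, "tree": 0.90, "quickly": 0.30}
tagged = [("the", "DET"), ("objective", "NOUN"), ("tree", "NOUN"), ("quickly", "ADV")]
candidates = select_candidates(tagged, toy_scores.get)
```

Here "the" is rejected by the part-of-speech check even though its toy score is low, and "tree" is rejected by the threshold.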

Finding Nonspecific Concepts
Nonspecificity Scorer. Following the work of Changizi (2008), we assign a nonspecificity score to a word token based on the hypernym hierarchy of WordNet (Miller, 1998). Specifically, the root of the hierarchy is at level 0 and regarded as the most abstract. The abstractness of a node in the hierarchy is measured by the maximal length of its path to the root. The hypernym level in WordNet is between 0 and 17. For each word token in summaries, we use the Adapted Lesk algorithm (Banerjee and Pedersen, 2002) to label the sense, since the WordNet hypernym hierarchy works at the sense level. Since a summary sentence may be short, we concatenate each summary sentence with the corresponding passage for word sense disambiguation. Built on this, each token, which is labelled with a sense, receives an abstractness score based on the WordNet hierarchy.
Using the nonspecificity scorer, we assign a nonspecificity value to each word in summaries and select words with a value smaller than six as the candidate target words. The target words will be nouns and verbs since the hypernym hierarchy in WordNet consists of these two POS types.
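The level computation described above (maximal path length to the root) can be illustrated on a small hand-made hierarchy. The hypernym links below are hypothetical and are not real WordNet data; the function names are likewise assumptions.

```python
from functools import lru_cache

# Hypothetical hypernym links (node -> list of direct hypernyms); not real WordNet data.
HYPERNYMS = {
    "entity": [],
    "organism": ["entity"],
    "animal": ["organism"],
    "vertebrate": ["animal"],
    "mammal": ["vertebrate"],
    "swimmer": ["entity"],
    "whale": ["mammal", "swimmer"],  # two hypernym paths of different length
}


@lru_cache(maxsize=None)
def level(node):
    """Length of the longest hypernym path from `node` to the root (root = 0)."""
    parents = HYPERNYMS[node]
    if not parents:
        return 0
    return 1 + max(level(parent) for parent in parents)


def is_candidate(node, threshold=6):
    """A sense qualifies as a candidate target when its level is below the threshold."""
    return level(node) < threshold
```

In this toy hierarchy "whale" reaches the root through both "mammal" and "swimmer"; taking the maximum means the longer path determines its level.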

Filtering
We aim to avoid developing simple questions. For example, if a target word also appears in the passage, it is likely that a model can easily find the answer without the need to understand the passage in depth.
Filtering by Lemmas. We lemmatized passages and summaries. If a lemma appears both in a summary and the corresponding passage, the lexemes of the lemma will not be considered as target words. Note that a strict filter may exclude some good candidates for target words but helps avoid introducing many simple questions.
Filtering by Synonyms and Antonyms. For a word in a summary, if a synonym or antonym of the word appears in the corresponding passage, we will not consider this word to be our target word. We use WordNet to derive synonyms and antonyms. Instead of using word sense disambiguation (WSD), for a word $w_i$ in a summary, we use all senses of this word and add all synonyms and antonyms into a pool. Only if none of the words in the pool appear in the passage do we consider $w_i$ as a candidate target word. Otherwise, we will not use $w_i$ to construct a question for this passage-summary pair.
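The two lexical filters above can be combined into a single membership test. The sketch below assumes that lemmatization has already been done and that a table of synonyms and antonyms (pooled over all senses) is available; the table contents and the function name are hypothetical.

```python
def passes_lexical_filters(target_lemma, passage_lemmas, related_words):
    """Return True if the target survives the lemma and synonym/antonym filters.

    passage_lemmas: set of lemmas occurring in the passage.
    related_words: dict mapping a lemma to the pool of its synonyms and
        antonyms collected over all senses (no word sense disambiguation).
    """
    # Lemma filter: the target must not occur in the passage at all.
    if target_lemma in passage_lemmas:
        return False
    # Synonym/antonym filter: no word from the pool may occur in the passage.
    pool = related_words.get(target_lemma, set())
    return pool.isdisjoint(passage_lemmas)


# Hypothetical passage lemmas and synonym/antonym pools.
passage = {"policy", "growth", "economy", "danger"}
pools = {"risk": {"danger", "hazard", "safety"}, "objective": {"aim", "goal", "subjective"}}
```

With these toy inputs, "risk" is rejected because its synonym "danger" occurs in the passage, while "objective" is kept.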
Filtering by Similarity. We further filter words by similarity. For each candidate target word in a summary and each word in the passage, we calculate similarity and use that to perform further filtering.
For static embeddings, we use 300-dimensional GloVe word embeddings trained on 840 billion tokens (Pennington et al., 2014) and calculate the cosine similarity between a candidate target word and a passage word. For contextual embeddings, we embed each sentence in a passage as well as the summary into a context-aware representation matrix using the BERT-large uncased language model. Then, we calculate the cosine similarity between each passage token and question token. If the similarity is higher than 0.85, we will not consider the involved summary words as candidate target words.
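The similarity filter above reduces to a cosine comparison against every passage vector. A minimal sketch, assuming the embeddings are already available as plain lists of floats; the function names and toy vectors are illustrative only.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the text


def cosine(u, v):
    """Cosine similarity of two vectors; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def too_similar(candidate_vec, passage_vecs, threshold=SIMILARITY_THRESHOLD):
    """True if the candidate is closer than the threshold to any passage word."""
    return any(cosine(candidate_vec, vec) > threshold for vec in passage_vecs)
```

The same function applies to static and contextual vectors; only the source of the vectors differs.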

Constructing Multiple Choices
We train machine reading comprehension models using the data built so far to generate four choices for each question. Together with the ground truth (the target word identified above and removed from the human summary), we have five choices/options for each question. In our work, we propose to use three models, the Gated-Attention Reader (Dhingra et al., 2017), the Attentive Model, and the Attentive Model with Word Gloss, to generate the candidate options. Please find details of the models in Appendix B and Appendix C as well as the training details in Appendix D.
We adopt the idea of k-fold cross validation to train the three models mentioned above to generate candidate answer words. Specifically, we split the data into 4 folds. Each time, we train the baseline models on 3 folds of data and use the trained models to predict candidate words on the remaining fold. After iterating over the 4 folds, we obtain the predictions of each model on the entire data. The performance of the three baseline models is listed in Table 3 for Subtask 1 and Table 4 for Subtask 2, using several typical retrieval-based evaluation metrics: MRR (Craswell, 2009), Recall@1, Recall@5, and Recall@10.

Model      MRR    R@1    R@5    R@10
GAReader   0.245  0.175  0.314  0.378
AttReader  0.235  0.167  0.300  0.363
+gloss     0.179  0.123  0.227  0.276
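The 4-fold prediction scheme and the retrieval metrics mentioned above can be sketched as follows. The fold assignment (round-robin by index) and all function names are assumptions; the text only specifies that each fold is predicted by models trained on the other three.

```python
def k_fold_splits(n_items, k=4):
    """Yield (train_indices, heldout_indices) so that every item is held out exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, heldout in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, heldout


def mean_reciprocal_rank(rankings, golds):
    """Average of 1/rank of the gold word; a gold word missing from the ranking counts as 0."""
    total = 0.0
    for ranked, gold in zip(rankings, golds):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(golds)


def recall_at_k(rankings, golds, k):
    """Fraction of questions whose gold word appears among the top-k predictions."""
    hits = sum(1 for ranked, gold in zip(rankings, golds) if gold in ranked[:k])
    return hits / len(golds)


# Toy rankings for three questions (hypothetical model output).
rankings = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
golds = ["a", "y", "q"]
mrr = mean_reciprocal_rank(rankings, golds)  # (1 + 1/2 + 0) / 3
```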
For each target word that has been removed from the corresponding summary sentence (again, a question is a summary sentence containing a removed target word), we collect the top-10 words predicted by each of the three models. In this way, we can collect a candidate word pool of 30 predicted word tokens for each removed target word. To avoid including multiple correct choices for each question, we adopt the synonym and context similarity filtering methods described in Section 3.4. Specifically, we first calculate the similarity between the ground-truth target word and each word type in the pool. We exclude a word type from the multiple choices if its similarity to the ground truth is higher than 0.85. In addition, we also exclude synonyms of the ground-truth target word. From the remaining word tokens in the pool, we select the four most frequent word types (a word type may have multiple tokens in the pool). Together with the ground-truth word, we obtain five choices for each question.
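The pooling and selection of distractors described above can be sketched with a frequency counter. The similarity function and synonym set are passed in as hypothetical stand-ins for the embedding-based and WordNet-based checks; dropping exact copies of the ground truth is an added assumption.

```python
from collections import Counter


def choose_distractors(pool, gold, gold_synonyms, similarity,
                       n_choices=4, threshold=0.85):
    """Pick the most frequent word types from the pooled model predictions.

    pool: predicted word tokens from all models (duplicates are meaningful).
    similarity: callable (gold, word) -> similarity score.
    """
    kept = [
        word
        for word in pool
        if word != gold                      # assumption: never reuse the answer itself
        and word not in gold_synonyms        # drop synonyms of the ground truth
        and similarity(gold, word) <= threshold  # drop near-duplicates in embedding space
    ]
    return [word for word, _ in Counter(kept).most_common(n_choices)]


# Hypothetical pool for the Table 1 question; "goal" is treated as too similar.
toy_pool = ["chance", "threat", "chance", "goal", "government",
            "threat", "chance", "prospective", "aim", "government"]
toy_similarity = lambda gold, word: 0.9 if word == "goal" else 0.1
distractors = choose_distractors(toy_pool, "objective", {"aim"}, toy_similarity)
```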

Further Quality Control
We make the following further efforts to remove noise and improve the quality of the datasets. We observe that, up to now, there are mainly two kinds of noise in our dataset: 1) some target words cannot be inferred solely based on the corresponding passage; 2) more than one of the multiple choices is a correct answer.
The first issue is mainly related to the property of the XSum dataset, in which the first sentence of a passage is used as the summary. The second type of problem is often caused by our automatic generation method. Although we have applied strict rules in Section 3.4 to handle this, in a small portion of the resulting data, multiple potentially correct answers still exist among the candidate answers.
To further ensure the quality of our dataset, we invite workers on Amazon Mechanical Turk to perform further data selection. Each annotator needs to follow the procedure of Appendix A to answer the question and annotate relevant information, with which further data selection is applied. To ensure quality, we only include workers from English-speaking countries whose previous HITs' approval rates are above 90%. For more details about this process, please refer to Appendix E.

Systems and Results
Our shared task received 23 submissions to Subtask 1, 28 submissions to Subtask 2, and 29 submissions to Subtask 3. We use accuracy as the evaluation metric for the three subtasks.

Subtask 1: ReCAM-Imperceptibility
Table 6 shows all the official submissions, and most of them outperform the baseline model. The baseline used for Subtask 1 is the Gated-Attention (GA) Reader (Dhingra et al., 2017). The GA Reader uses a multi-layer iterated architecture with a gated attention mechanism to derive better query-aware passage representation. The motivation behind using the GA Reader is to have a simple comparison between our task and the CNN/Daily Mail reading comprehension dataset, since the GA Reader achieves reasonably good performance on that dataset.
Note that the last column of the table lists the accuracy (Acc. Cross) for models trained on the Subtask 2 training data and tested on the Subtask 1 test set. We will discuss those results later in Section 4.3.
The best result in Subtask 1 was achieved by team SRC-B-roc (Zhang et al., 2021) with an accuracy of 0.951.
The system was built on a pre-trained ELECTRA discriminator and further applied upper attention and an auto-denoising mechanism to process long sequences. The second-placed system, PINGAN Omini-Sinitic (Wang et al., 2021), adopted an ensemble of ELECTRA-based models with task-adaptive pre-training and a multi-head attention based multiple-choice classifier. ECNU-ICA-1 (Liu et al., 2021) ranked third in this subtask with a knowledge-enhanced Graph Attention Network and a semantic space transformation strategy.
Most participating systems performed intermediate task pre-training (Pruksachatkun et al., 2020) for their language models. For example, the CNN/Daily Mail dataset was selected by ZJUKLAB (Xie et al., 2021a) to further pre-train their language models. The CNN/Daily Mail dataset and the Newsroom dataset boosted model performance on both Subtask 1 and Subtask 2. Data augmentation methods were also popular among participants. ZJUKLAB (Xie et al., 2021a) performed negative data augmentation with a language model to leverage misleading words. IIE-NLP-Eyas (Xie et al., 2021b) adopted template-based input reconstruction methods to augment their dataset and further fine-tuned their language models on the augmented dataset.
Most teams also used an ensemble of multiple pre-trained language models to further enhance model performance. SRC-B-roc (Zhang et al., 2021) applied Wrong Answer Ensemble (Kim and Fung, 2020), training models to learn the correct and wrong answers separately and ensembling them to obtain the final predictions. Stochastic Weight Averaging (Izmailov et al., 2018) was also performed across multiple checkpoints in the same run to achieve better generalization.
In addition, some teams tackled the task from different perspectives.
PINGAN Omini-Sinitic (Wang et al., 2021) turned the original multi-choice task into a masked-sentence classification task by adding each option to the placeholder. Noise detection methods and auto-denoising methods were further proposed by adding a noise-tolerant loss. ZJUKLAB (Xie et al., 2021a) used label smoothing to encourage the activations of the penultimate layer. ECNU-ICA-1 (Liu et al., 2021) utilized a semantic space transformation strategy to convert ordinary semantic representations into abstract representations for classification.
Many teams used external knowledge resources to further improve model performance. WordNet (Fellbaum, 1998) was widely used to provide candidate word definitions. ECNU-ICA-1 (Liu et al., 2021) also used ConceptNet5 (Speer et al., 2016) and a Graph Neural Network in their system. To alleviate the noise induced by incorporating structured knowledge through unimportant edges, they proposed a noise reduction strategy. Team owlmx used the MRC Psycholinguistic Database to obtain a measurement of imperceptibility abstractness.
Different pre-processing techniques were proposed in multiple systems. ZJUKLAB (Xie et al., 2021a) used a sliding window to limit input length in training. PINGAN Omini-Sinitic (Wang et al., 2021) used the cycle noisy label detection algorithm to make models more robust.
Several system description papers presented interesting analyses of failure cases and data distribution. XRJL (Jiang et al., 2021) found that for a few questions, common sense knowledge was further needed to help find the answer. They also pointed out that there were still a few questions in which multiple candidate choices may serve as appropriate answers.

Subtask 2: ReCAM-Nonspecificity
In Subtask 2, we received 28 submissions. Table 7 shows the official leaderboard. The best result in Subtask 2 was achieved by team PINGAN Omini-Sinitic (Wang et al., 2021) with an accuracy of 0.953, using a model similar to the team's model in Subtask 1. The second-placed team SRC-B-roc (Zhang et al., 2021) also adopted the same model it used in Subtask 1, with a data augmentation method based on the hypernym hierarchy in WordNet.
In general, the participating teams in Subtask 2 used pre-trained language models and neural networks similar to those they used in Subtask 1. The main differences lie in how the participants performed data augmentation and leveraged external knowledge. For example, in addition to SRC-B-roc (Zhang et al., 2021), the IRG team (Sharma et al., 2021) also performed data augmentation using hypernyms from WordNet.

Subtask 3: Cross-task Performance
In this section, we explore models' performance across the two types of definitions of abstractness. Specifically, in this subtask, participants train their models on the training set of one subtask and test on the test set of the other subtask. We received 29 submissions in total from the participants.
Cross-task performance: Subtask 2-to-1 testing. We asked participants to test their models trained on the Subtask 2 training data on the Subtask 1 test data. The results are shown in the last column of Table 6.
The results we received show that the performance of all systems drops substantially. For some systems ranking among the top 10, the accuracy can decrease by 5 points (IIE-NLP-Eyas (Xie et al., 2021b) and XRJL (Jiang et al., 2021)), or even more (14 points for nxc). Some systems show good generalization ability in this Subtask 2-to-1 scenario; the performance of PINGAN Omini-Sinitic (Wang et al., 2021) is only 1.3 points lower, which may be due to the data augmentation and task-adaptive training used in the model.
Cross-task performance: Subtask 1-to-2 testing. Participants are asked to test their Subtask 1 systems on the Subtask 2 test set. Details of the results can be seen in the last column of Table 7. All systems' performances drop. For example, among the top-10 systems, the accuracy decreases by 5 points (IIE-NLP-Eyas (Xie et al., 2021b)) or 7 points (tt123).
However, ECNU-ICA-1 (Liu et al., 2021) shows a very good generalization ability in Subtask 1-to-2 testing. The systems of PINGAN Omini-Sinitic (Wang et al., 2021), SRC-B-roc (Zhang et al., 2021), and XRJL (Jiang et al., 2021) are rather consistent in this Subtask 1-to-2 cross testing. Some algorithms they used may explain the models' good generalization ability. ECNU-ICA-1's knowledge-enhanced Graph Attention Network can provide external knowledge to the model. The Wrong Answer Ensemble algorithm (Kim and Fung, 2020) used in PINGAN Omini-Sinitic (Wang et al., 2021) is a relatively simple but effective way of improving model performance and generalization ability. Also, the Stochastic Weight Averaging algorithm across multiple checkpoints is effective for better generalization. XRJL (Jiang et al., 2021) retrieves the definitions of candidate answers from WordNet and feeds them to the model as extra inputs. We also think data augmentation methods contribute to the generalization ability.

Related Work
Many tasks have been proposed to evaluate machines' ability on reading comprehension. They either require models to find an entity or text span from the source document as the answer (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016; Rajpurkar et al., 2016; Trischler et al., 2017), or further generate an answer (Nguyen et al., 2016; He et al., 2018; Kočiský et al., 2018). The cloze-style MRC tasks (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2016), in which the missing words in the cloze questions are entities appearing in source documents, are most similar to ours. Unlike previous work, ReCAM questions specifically focus on abstract words unseen in the corresponding source documents.
In general, multi-choice questions have been widely used as a tool for language examination to test both humans and machines. In this paper, we follow the multiple-choice framework for our proposed ReCAM task to evaluate computers' ability in comprehending abstract concepts, in which computers are asked to predict the missing abstract words in human-written summaries.

Conclusion
This shared task aims to study the ability of machines in representing and understanding abstract concepts, based on two definitions of abstractness, imperceptibility and nonspecificity, in a specific machine reading comprehension setup. We provide three subtasks to evaluate models' ability in comprehending the two types of abstract meaning as well as their generalizability. In Subtask 1, the top system achieves an accuracy of 0.951, and in Subtask 2, an accuracy of 0.953, suggesting that current systems perform well in the specific setup of our shared task. In Subtask 3, we found that in general the models' performances dropped in both Subtask 2-to-1 and Subtask 1-to-2 testing. However, some models generalize well, benefiting from techniques such as data augmentation and task-adaptive training. We hope the shared task can help shed some light on modelling abstract concepts and help design more challenging tasks in the future.

A Annotation Script

B Gated-Attention Reader
The Gated-Attention (GA) Reader (Dhingra et al., 2017), the state-of-the-art model on the CNN/Daily Mail reading comprehension dataset (Hermann et al., 2015), is adapted here in our experiments. The GA Reader uses a multi-layer iterated architecture with a gated attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader, to derive better query-aware passage representation. To apply the GA Reader to our ReCAM task, we input the news passage $p$ as the document and the processed summary $s$ as the query to the GA Reader.
Specifically, for an input passage $p = [p_1, p_2, \ldots, p_{l_p}]$ with $l_p$ words and its corresponding summary $s = [s_1, s_2, \ldots, s_{l_s}]$ with $l_s$ words, we first derive their corresponding word embedding sequences $P = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_{l_p}]$ and $S = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{l_s}]$, respectively. Then the GA Reader accepts $P$ and $S$ as inputs and returns the hidden states $H^p = [h^p_1, h^p_2, \ldots, h^p_{l_p}]$ and $H^s = [h^s_1, h^s_2, \ldots, h^s_{l_s}]$ as the sequential representations for passage $p$ and summary $s$, respectively. As for the final prediction process, we do not adopt the operations in Dhingra et al. (2017): in our task the answer words are unseen in the corresponding passage, whereas the GA Reader in Dhingra et al. (2017) selects an entity word in the passage as the final prediction, since its target answer word appears in the passage. We therefore redesign the prediction part.
First, the corresponding representation of "@placeholder" in $H^s$, denoted as $h^s_q$ ($q$ is the position index of @placeholder in summary $s$), is used as the final vector representation for summary $s$. The final vector representation $\mathbf{p}$ for passage $p$ is derived with a bilinear attention between $h^s_q$ and $H^p$. We set a token embedding $a^e_t$ for each candidate abstractive word $a_t$ ($t \in [1, \ldots, n_c]$, where $n_c$ is the size of the candidate set). We first concatenate $h^s_q$ and $\mathbf{p}$, then use the bilinear product and softmax to predict the probability distribution over all $n_c$ candidate abstractive words, in which $o_t$ represents the probability of predicting the candidate abstractive word $a_t$ as the final answer.
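The prediction head described above can be sketched as attention pooling followed by a bilinear scoring of candidates. The exact parameterization is not given in the text, so the following uses one common formulation as an assumption: attention weights proportional to exp(q^T W h_j), and candidate scores a_t^T W_out [q; p]. All names, shapes, and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def bilinear_attention_pool(query, states, W):
    """Weights proportional to exp(query^T W h_j); returns (weights, weighted sum of states)."""
    scores = [dot(query, matvec(W, h)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[i] for w, h in zip(weights, states)) for i in range(dim)]
    return weights, pooled


def candidate_distribution(query, pooled, candidate_embs, W_out):
    """Softmax over bilinear scores a_t^T W_out [query; pooled], one score per candidate."""
    summary_vec = query + pooled  # list concatenation of the two vectors
    projected = matvec(W_out, summary_vec)
    return softmax([dot(a, projected) for a in candidate_embs])


# Toy inputs: placeholder state, two passage states, three candidate embeddings.
h_q = [1.0, 0.0]
H_p = [[1.0, 0.0], [0.0, 1.0]]
weights, p_vec = bilinear_attention_pool(h_q, H_p, [[1.0, 0.0], [0.0, 1.0]])
W_out = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
probs = candidate_distribution(h_q, p_vec, [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], W_out)
```

In this toy setup the first passage state aligns with the placeholder state and receives the larger attention weight, and the first candidate receives the highest probability.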

C Attentive Model
The word gloss, which defines a word sense meaning, has been mainly used in the word sense disambiguation (WSD) task and its variants (Lesk, 1986; Moro et al., 2014). Since the goal of ReCAM is to predict a word that can summarize corresponding information from the source passage, which is an abstracting process, it may be helpful when the gloss, i.e., the interpretation of candidate abstractive words, is provided.
We design an attentive model with word gloss (AMWG) as Figure 1 shows. Specifically, all the encoders (the Passage Encoder, the Summary Encoder, and the WordGloss Encoder) are 1-layer bi-directional recurrent neural networks (RNNs) with Gated Recurrent Units (GRU) (Cho et al.).
Similar to Section B, the corresponding representation of "@placeholder", i.e., $h^s_q$, is used as the final vector representation for summary $s$. A bilinear attention $f^p_{att}(\cdot)$ is applied to $h^s_q$ and $H^p$. Then $\mathbf{p}$ is derived as the vector representation for passage $p$ by the weighted sum of $H^p$, which is further concatenated with $h^s_q$ to form the final summarization vector $v$. Another attention $f^g_{att}(\cdot)$ is applied to $v$ and $H^{g_t}$, and the resulting weighted sum of $H^{g_t}$, i.e., $a^g_t$, is derived as the final vector representation for the gloss of candidate word $a_t$. We also set a token embedding $a^e_t$ for each candidate word $a_t$ ($t \in [1, \ldots, n_c]$, where $n_c$ is the size of the candidate set), which is further concatenated with $a^g_t$ to build the final representation $\mathbf{a}_t$ for candidate word $a_t$. For the final prediction, we input the summarization vector $v$ and the candidate representation vector $\mathbf{a}_t$ to $f_{pred}(\cdot)$ and apply the softmax to derive the probability distribution over all $n_c$ candidate abstractive words, in which $o_t$ gives the probability of predicting the candidate word $a_t$ as the final answer.

D Training Details
We train all models using the negative log-likelihood as the objective function. The glosses of candidate words are derived from WordNet using the NLTK tools (Bird and Loper, 2004). Specifically, we first lemmatize the candidate word and use the lemmatized word as the query word for searching in WordNet. To cope with the semantic ambiguity of words, we simply concatenate the gloss of the first sense in each retrieved POS for the query word, with the corresponding POS tag as the delimiter. Models in our experiments are trained with the following hyperparameter settings: all word embeddings and token embeddings $a^e_t$ have 300 dimensions and are initialized with GloVe (Pennington et al., 2014). The passage $p$ and summary $s$ share one set of word embeddings, which are fixed during training. The glosses $\{g_t\}$ for candidate words $\{a_t\}$ keep their own word embeddings.
The hidden state vectors of all bi-directional GRU-RNNs in all models have 150 dimensions. The number of attention hops in the GA Reader is set to 3. The batch size is set to 32. Adam (Kingma and Ba, 2015) is adopted for optimization with an initial learning rate of 1e-3. A dropout with rate 0.3 is applied to the input layers of all GRU-RNN encoders and to the final summarization vector $v$.

E Annotation Selection
To ensure most of our annotation is valid, we select annotations satisfying the following criteria: a) the average accuracy is higher than 40%; b) both text spans should not be empty; c) if the difficulty level is rated as easy, then this data sample should be answered correctly.

Figure 1: The model architecture of the attentive model with word gloss (AMWG) implemented in this paper. All the encoders are 1-layer bi-directional GRU-RNNs. The figure uses one symbol to denote the concatenation of input vectors and another to denote the weighted sum of vectors.

Table 2: Fitting performance of the neural regression model on the MRC database.

Table 3: Three baseline models are used to generate candidate multiple choices for Subtask 1. The table shows their performance on the XSum dataset, evaluated with MRR (Craswell, 2009), Recall@1, Recall@5, and Recall@10.

Table 5 lists the size of our ReCAM datasets, i.e., numbers of questions. For example, in total Subtask 2 has 6,186 questions, which are split into training/development/test subsets.

Table 5: Size of the ReCAM Dataset.

Table 6: Official results of Subtask 1 and Subtask 3. Acc is the accuracy of the models trained on the Subtask 1 training data and tested on the Subtask 1 test set. Acc. Cross is the accuracy of models trained on the Subtask 2 training data and tested on the Subtask 1 test set.

Table 7: Official results of Subtask 2 and Subtask 3.
For an input news passage $p = [p_1, p_2, \ldots, p_{l_p}]$ with $l_p$ words, we can derive its hidden states $H^p = [h^p_1, h^p_2, \ldots, h^p_{l_p}]$ by sending its word embedding sequence $P = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_{l_p}]$ to the Passage Encoder. Similarly, we can derive hidden states $H^s = [h^s_1, h^s_2, \ldots, h^s_{l_s}]$ for summary $s$ by inputting its word embedding sequence $S = [\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{l_s}]$ into the Summary Encoder, and hidden states $H^{g_t} = [h^{g_t}_1, h^{g_t}_2, \ldots, h^{g_t}_{l_{g_t}}]$ for gloss $g_t$ of the candidate word $a_t$ by sending its word embedding sequence $G^t = [\mathbf{g}^t_1, \mathbf{g}^t_2, \ldots, \mathbf{g}^t_{l_{g_t}}]$ to the WordGloss Encoder.

Another attention $f^g_{att}(\cdot)$ is applied to $v$ and $H^{g_t}$:

$$e_j = \tanh(W^g_{att} v + b)^{\top} h^{g_t}_j, \quad \forall j \in [1, \ldots, l_{g_t}]$$

The final prediction is

$$o_t = \mathrm{softmax}_t(r_t), \quad \forall t \in [1, \ldots, n_c] \tag{24}$$

with $r_t$ being the output of $f_{pred}(\cdot)$ for candidate $a_t$, in which $o_t$ gives the probability of predicting the candidate word $a_t$ as the final answer.