TA-MAMC at SemEval-2021 Task 4: Task-adaptive Pretraining and Multi-head Attention for Abstract Meaning Reading Comprehension

This paper describes our system for SemEval-2021 Task 4, Reading Comprehension of Abstract Meaning, which ranked 1st on subtask 1 and 2nd on subtask 2 on the leaderboard. We propose an ensemble of ELECTRA-based models with task-adaptive pretraining and a multi-head attention multiple-choice classifier on top of the pre-trained model. The main contributions of our system are 1) revealing the performance discrepancy among different transformer-based pre-trained models on the downstream task, and 2) presenting an efficient method for generating large task-adaptive corpora for pretraining. We also investigate several pretraining strategies and contrastive learning objectives. Our system achieves test accuracies of 95.11 and 94.89 on subtask 1 and subtask 2, respectively.


Introduction
Machine reading comprehension (MRC) is one of the key tasks for measuring a machine's ability to understand and reason over human language, and it is broadly applicable in real-world systems such as Q&A and dialogue systems. MRC usually comes as a {passage, question, answer} triplet: given a context passage, questions related to the passage are asked, and the machine is expected to give the answers. The question-answer form can be a question-answer pair, where the machine must provide the answer text, or a statement, where the answer is filled in as a cloze or selected from multiple choices. By the way the answer is formed, MRC can be divided into extractive and generative MRC: the former takes segments of the passage as the answer, while the latter requires generating answer text based on an understanding of the passage.
Generative MRC is harder than extractive MRC, since it requires information integration and reasoning in addition to focusing on the relevant information.
One classic MRC approach focuses on matching networks, and various network structures have been proposed to capture the semantic interactions among passages, questions and answers. In recent years, pre-trained language models (LMs) have brought non-trivial progress in MRC performance, and the use of complex matching networks has declined (Zhang et al., 2020). Plugging matching networks on top of pre-trained LMs can either improve or degrade performance (Zhang et al., 2020; Zhu et al., 2020). Multiple-choice MRC (MMRC) often lacks abundant training data for deep neural networks (possibly because of the high cost of human labelling), which limits how fully the pre-trained LMs can be exploited. SemEval-2021 Task 4, Reading Comprehension of Abstract Meaning (Zheng et al., 2021), is a multiple-choice English MRC task that investigates a machine's ability to understand abstract concepts in two respects: subtask 1, non-concrete concepts, e.g. service/economy as opposed to trees/red; and subtask 2, generalized/summarized concepts, e.g. vertebrate as opposed to monkey.
We propose an approach based on the pre-trained LM ELECTRA (Clark et al., 2020), with an ensemble of multi-head attention (Vaswani et al., 2017) multiple-choice classifiers and WAE (Kim and Fung, 2020) to get the final prediction. First, we conduct task-adaptive pretraining, i.e. transfer learning on the ELECTRA model using in-domain data. Then we fine-tune on the ReCAM task using a multi-head attention multiple-choice classifier (MAMC) on top of the ELECTRA model. Finally, we enhance the system with WAE and ensemble all the models to get the best generalization capability. In addition, we investigate transfer learning from natural language inference (NLI) tasks and contrastive learning objectives.
Figure 1 illustrates the overall architecture of our system. Each option is substituted into the query to form a complete statement, rather than keeping separate query/option segments, in order to get a less semantically ambiguous representation of the query and option. The option-filled query and the context tokens are concatenated as in Figure 1 and wrapped with the [CLS] and [SEP] tokens. Token embeddings are summed with segment embeddings and positional encodings to form the input to the pre-trained encoder. The representations from the encoder are then passed through a multi-head attention multiple-choice classifier, which consists of 1) a 2-layer multi-head attention feed-forward network that further captures task-specific query-context interactions, and 2) a pooler and a linear transformation, from which the final cross-entropy loss is computed. We first conduct task-adaptive pretraining on the system and then fine-tune on the ReCAM dataset; the final model is an ensemble built with several generalization techniques, including wrong answer ensemble.
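The input construction described above can be illustrated with a minimal sketch. The function name, the `@placeholder` marker and the ordering of the filled query before the context are assumptions made for illustration; the actual tokenization and segment handling are done by the encoder's tokenizer and are not shown.

```python
from typing import List


def build_inputs(context: str, query: str, options: List[str],
                 placeholder: str = "@placeholder") -> List[str]:
    """Return one '[CLS] filled-query [SEP] context [SEP]' string per option.

    Hypothetical helper: `placeholder` marks the blank in the query and is an
    assumption, as is placing the filled query before the context.
    """
    sequences = []
    for option in options:
        # Substitute the option into the query instead of keeping it separate.
        filled_query = query.replace(placeholder, option)
        sequences.append(f"[CLS] {filled_query} [SEP] {context} [SEP]")
    return sequences
```

Each returned string corresponds to one candidate pair; the classifier scores all candidates of a sample and the highest-scoring one is the prediction.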

Task-adaptive Pretraining
Pre-trained LMs and their downstream applications have demonstrated the power of transfer learning. The precondition for transfer learning is that the pretraining tasks share underlying statistical features with the downstream tasks. In-domain data usually brings more improvement on downstream tasks than out-of-domain data (Sun et al., 2019; Gururangan et al., 2020).
The genre of the ReCAM dataset is news (confirmed by manual random checks), so we argue that news abstractive summarization provides a high-quality dataset for further pretraining for ReCAM. Such a dataset comes in {article, summary} pairs: the articles are crawled from formal online news publishers, and the summaries are written by humans and contain the key information of the articles in abstractive form. News abstractive summarization aims to teach machines to grasp the key information of the whole context by having them generate the summary text.
We regenerate a ReCAM-style multiple-choice dataset from the original news abstractive summarization dataset. Taking the article/summary as the passage/question, the regeneration strategy has 2 main steps: 1) identify the abstract concepts in the news dataset, and 2) generate gold and pseudo options. In step 1, we count the part-of-speech (POS) tags of all gold labels in the ReCAM training data, as shown in Figure 2 (nouns, adjectives and adverbs are the most frequent option tags). Following a similar POS tag distribution, we randomly sample a word in the summary text that does not appear in the corresponding news article as the gold option. In step 2, the gold option in the summary is replaced by the mask token and the summary is fed into the pre-trained LM. The LM predicts the masked token, and we select some of the top-ranking predictions as pseudo options. A tight ranking threshold (e.g. top 5) yields words too similar to the gold option, which introduces extra ambiguity for the model; relaxing the ranking threshold somewhat eases the problem. This method is automatic and cheap to apply to large datasets. The abstract concept approximation in step 1 does introduce some noise, for example person names and geolocations are sometimes selected, but our experimental results show that the overall pretraining performance is not hurt, so the noisy samples should account for only a small fraction.
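Step 1 above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the summary has already been POS-tagged by some external tagger, and the function name, the tag labels and the `pos_weights` dictionary (standing in for the distribution of Figure 2) are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


def sample_gold_option(tagged_summary: List[Tuple[str, str]],
                       article_tokens: List[str],
                       pos_weights: Dict[str, float],
                       rng: random.Random) -> Optional[str]:
    """Sample a summary word that is absent from the article as gold option.

    `tagged_summary` holds (word, POS tag) pairs from any tagger (assumed
    input); `pos_weights` approximates the POS distribution of gold labels.
    """
    article_vocab = {tok.lower() for tok in article_tokens}
    # Group candidate words by tag, keeping only words not in the article.
    by_tag: Dict[str, List[str]] = {}
    for word, tag in tagged_summary:
        if pos_weights.get(tag, 0.0) > 0 and word.lower() not in article_vocab:
            by_tag.setdefault(tag, []).append(word)
    if not by_tag:
        return None  # No usable candidate in this summary.
    # Pick a tag following the target distribution, then a word with that tag.
    tags = sorted(by_tag)
    tag = rng.choices(tags, weights=[pos_weights[t] for t in tags], k=1)[0]
    return rng.choice(by_tag[tag])
```

Returning `None` lets the caller skip summaries in which every candidate word also occurs in the article.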
In addition, it has been reported that transfer learning from NLI tasks can benefit MMRC (Jin et al., 2020). We therefore also explored transfer learning from the MNLI (Williams et al., 2017) and RTE (Wang et al., 2018) tasks for the ReCAM task, but this degraded performance. This suggests that NLI tasks are not generally suitable for further pretraining of pre-trained LMs for MMRC.

Multi-head Attention Multiple Choice Classifier
The classifier takes the last-layer hidden representations from the pre-trained encoder and applies multi-head attention and a feed-forward non-linearity, each followed by layer normalization (Vaswani et al., 2017). The last token is then pooled, i.e. the hidden vector at the index of the last [SEP] token in the input is selected from the hidden representations, and linearly transformed to get the probability of each {query_{option-filled}, context} candidate pair.
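The pooling and scoring step can be sketched in plain Python. The multi-head attention and feed-forward layers are omitted here; the function names are hypothetical, and normalizing the per-candidate logits with a softmax across the candidates of one sample is an assumption (the usual choice for multiple-choice classification), not something the text states explicitly.

```python
import math
from typing import List, Sequence


def pool_last_sep(hidden: Sequence[Sequence[float]],
                  tokens: Sequence[str]) -> Sequence[float]:
    """Select the hidden vector at the position of the last [SEP] token."""
    last_sep = max(i for i, tok in enumerate(tokens) if tok == "[SEP]")
    return hidden[last_sep]


def candidate_probabilities(pooled: Sequence[Sequence[float]],
                            weight: Sequence[float],
                            bias: float = 0.0) -> List[float]:
    """Map one pooled vector per candidate to a probability per candidate."""
    # Linear transformation: one scalar logit per candidate pair.
    logits = [sum(w * h for w, h in zip(weight, vec)) + bias for vec in pooled]
    # Numerically stable softmax across the candidates of one sample.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]
```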
In addition, we explored contrastive learning objectives. When humans do MMRC, they usually compare the options against the passage, exclude the wrong ones and then analyze the indeterminate ones further. Inspired by this, we experimented with triplet loss (Weinberger et al., 2006) (among {input_{non-filled}, input_{gold}, input_{pseudo}}) and n-tuplet loss (Sohn, 2016) over all option-filled query and context pairs within one sample. However, the contrastive learning objectives degrade performance, suggesting that they are not as suitable for the ReCAM task as the MLE loss.
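For reference, the two objectives can be written in their commonly used forms. The mapping onto this task is an assumption: the anchor a is taken to be the representation of the non-filled input, the positive p that of the gold-filled input and the negative n that of a pseudo-filled input; the distance d, the margin m and the embeddings f are generic symbols rather than quantities specified in the text.

```latex
% Margin-based triplet loss (assumed mapping: a = non-filled, p = gold, n = pseudo)
\mathcal{L}_{\text{triplet}} = \max\bigl(0,\; d(a, p) - d(a, n) + m\bigr)

% (N+1)-tuplet loss (Sohn, 2016): one positive f^{+} and N-1 negatives f_i^{-}
\mathcal{L}_{\text{tuplet}} = \log\Bigl(1 + \sum_{i=1}^{N-1} \exp\bigl(f^{\top} f_i^{-} - f^{\top} f^{+}\bigr)\Bigr)
```

Under this reading, the tuplet loss contrasts the gold-filled input against all pseudo-filled inputs of a sample at once, whereas the triplet loss handles one negative at a time.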

Wrong Answer Ensemble
Wrong Answer Ensemble (Kim and Fung, 2020) is a relatively simple yet effective method (Zhu et al., 2020). Kim and Fung proposed training the model to learn the correct and the wrong answers separately and ensembling the two to get the final prediction. In Section 2.2 (the multi-head attention multiple-choice classifier), the correct answer is labelled 1 and the wrong answers 0 for correct-answer training. Wrong-answer training uses the opposite labelling (correct/wrong answers as 0/1) and fine-tunes the model with a binary cross-entropy loss. The outputs of the two models, p_c and p_w, are linearly combined to give the final prediction. A simple linear regression is used to find the best value of the weight w.

Experimental Setup

Dataset
We leverage external news abstractive summarization datasets for transfer learning, and then fine-tune our model on the ReCAM dataset.
ReCAM. The dataset for SemEval-2021 Task 4, consisting of news articles (verified by manual random checks) and multiple-choice questions.
XSUM. XSUM (Narayan et al., 2018) consists of 227k BBC articles from 2010 to 2017 covering a wide variety of subjects along with professionally written single-sentence summaries.
NEWSROOM. NEWSROOM (Grusky et al., 2018) is a dataset of 1.3 million news articles and summaries written by authors and editors in the newsrooms of 38 major news publications between 1998 and 2017. After a coarse selection (filtering out overly long articles/summaries, summaries that duplicate text from the news articles, and articles with unqualified pseudo options), about 229k article/summary pairs are used.
The data statistics are listed in Table 1. The 3 news datasets share similar article and query lengths.

Training Details
We compare the baseline performance of 3 Transformer-based models, BERT, ALBERT and ELECTRA, and select ELECTRA as our encoder. We adopt most hyperparameter settings from the ELECTRA large model; specifically, our learning rate is 1e-5, the batch size is 32 and the gradient clip norm threshold is set to 1. In the task-adaptive data generation process, we set the threshold to top 10 for pseudo option selection, filter out word piece predictions (word pieces all start with a "#" in the vocabulary) and randomly select 4 words as pseudo options. See the appendix for hyperparameter details. Training was done on NVIDIA V100 GPUs. All reported performance is on the dev set.

Table 2: Baseline performance of different pre-trained models.
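The pseudo option selection with these settings (top 10, word pieces removed, 4 options) can be sketched as below. The function name and the input list of ranked mask predictions are hypothetical; excluding the gold word itself and returning an empty list for samples with too few candidates are assumptions made for the sketch.

```python
import random
from typing import List


def select_pseudo_options(ranked_predictions: List[str], gold: str,
                          rng: random.Random, top_k: int = 10,
                          num_options: int = 4) -> List[str]:
    """Pick pseudo options from the LM's ranked predictions for the mask."""
    candidates: List[str] = []
    for token in ranked_predictions[:top_k]:
        if token.startswith("#"):
            continue  # Drop word piece predictions.
        if token.lower() == gold.lower():
            continue  # Assumption: the gold word cannot be a pseudo option.
        if token not in candidates:
            candidates.append(token)
    if len(candidates) < num_options:
        return []  # Too few candidates; the caller can skip this sample.
    return rng.sample(candidates, num_options)
```

Passing an explicit `random.Random` instance keeps the generated dataset reproducible across runs.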

Pre-trained LM Selection and Task-adaptive Pretraining
The baseline performance of BERT, ALBERT and ELECTRA is tested by directly fine-tuning the pre-trained LMs on the ReCAM data. The results are shown in Table 2. ELECTRA outperforms the other two models by large margins. This may be caused by the difference in learning objectives among the models. The BERT/ALBERT models learn to predict the masked word from the vocabulary, while the ELECTRA model learns to predict whether each token in the input has been replaced or not. ELECTRA thus learns about unreasonable co-occurrences in addition to reasonable ones, which may help in uncovering deeper implicit semantic relations for ReCAM. Therefore the ELECTRA large model is selected as the encoder for further experiments. The regenerated XSUM/NEWSROOM data (denoted as XN) is used for in-domain pretraining of the encoder, and the model is fine-tuned on subtask 1 after pretraining. The prediction accuracy grows as more data is fed, as shown in Figure 3. At the end of task-adaptive pretraining, subtask 1 achieves a dev accuracy of 92.73, 2.80% higher than directly fine-tuning the encoder, and subtask 2 reaches 92.95, an increase of 3.13%.
Besides task-adaptive pretraining and fine-tuning, we also tried multitask learning with the XSUM/NEWSROOM and ReCAM data together (upsampling the ReCAM data to a 3:7 ratio with the news dataset). The results in Table 3 show that this approach outperforms the encoder baseline but is slightly worse than the model pre-trained on the full news data; this model is used in the ensemble. Using MNLI/RTE for further pretraining hurts the ReCAM fine-tuning performance; in particular, MNLI pretraining causes an accuracy decrease of about 10% compared with the baseline.

On-top Classifier and WAE
Adding MAMC on top of the encoder helps increase accuracy on both ReCAM subtask 1 and subtask 2; the results are shown in Table 4. We further applied WAE to squeeze out marginal gains in prediction accuracy. Option contrastive learning (OCL), however, does not improve performance and is worse than directly fine-tuning the encoder with a multiple-choice classifier.
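The way WAE combines the two models at prediction time can be sketched as follows. The exact combination formula is not reproduced in the text, so the form used here, w * p_c + (1 - w) * (1 - p_w), is an assumption chosen only because it is linear in both outputs; the function names are hypothetical, and fitting the weight w is not shown.

```python
from typing import List


def wae_scores(p_correct: List[float], p_wrong: List[float],
               w: float) -> List[float]:
    """Linearly combine correct-answer and wrong-answer model outputs.

    Assumed form: a high p_correct and a low p_wrong both raise the score.
    """
    return [w * pc + (1.0 - w) * (1.0 - pw)
            for pc, pw in zip(p_correct, p_wrong)]


def predict(p_correct: List[float], p_wrong: List[float], w: float) -> int:
    """Return the index of the option with the highest combined score."""
    scores = wae_scores(p_correct, p_wrong, w)
    return max(range(len(scores)), key=scores.__getitem__)
```

With w = 1 the prediction reduces to that of the correct-answer model alone; smaller values of w let the wrong-answer model veto options it is confident are wrong.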

Improving Generalization
We mainly applied the 3 procedures below for better generalization, and the ensemble of all the models achieved a test accuracy of 95.11 on subtask 1 and 94.89 on subtask 2 on the ReCAM leaderboard. 1) Data repartitioning (mixing the train/dev sets and randomly splitting them into new train/dev sets at 8:2 or 9:1) aims to smooth out the distribution differences among train/dev partitions. As shown in Table 5, the accuracy differs across the sets, with some higher than on the original partition.
2) Augmenting the task data itself for fine-tuning by masking a word other than the original gold option (if one exists), using the method in Section 2.1. The accuracy remains almost the same after adding the augmented task data. This suggests that our automatic augmentation method produces lower-quality samples than the human-labelled data, but the samples are not too noisy, so they can still contribute to the robustness of the model.
3) We also applied Stochastic Weight Averaging (Izmailov et al., 2018) across multiple checkpoints in the same run to get better generalization (SWA does not improve the dev error but does improve the test error, so it is not listed in Table 5).
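Procedures 1) and 3) above can be sketched in plain Python. Both functions are hypothetical illustrations: checkpoints are represented as dictionaries of flat parameter lists rather than real model state, and equal-weight averaging of the checkpoints is assumed.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def repartition(train: Sequence[T], dev: Sequence[T],
                train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[T], List[T]]:
    """Mix the train/dev sets and split them again, e.g. at 8:2 or 9:1."""
    pooled = list(train) + list(dev)
    random.Random(seed).shuffle(pooled)
    cut = int(round(len(pooled) * train_fraction))
    return pooled[:cut], pooled[cut:]


def average_checkpoints(
        checkpoints: Sequence[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Average each parameter element-wise over checkpoints of one run."""
    count = len(checkpoints)
    averaged: Dict[str, List[float]] = {}
    for name in checkpoints[0]:
        # Walk the same parameter in every checkpoint in lockstep.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged
```

Different seeds in `repartition` give different partitions, each of which can train one member of the ensemble.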

Fail Cases Analysis
We manually checked the fail cases on subtask 1 and subtask 2 and categorized them into 5 classes (given roughly 850 dev cases, the total number of fail cases is around 50 for both subtask 1 and subtask 2). Detailed examples for each class can be found in the appendix.
• EC0, easy case. In these cases, the answer can be inferred from the query/context, but the model fails to give the correct prediction.
• EC1, complicated coreference. Such cases have complicated coreference relations; although the answer can be inferred, the coreferences hinder the model from understanding correctly.
• EC2, complex reasoning. In these cases, either the information related to the answer is sparse in the query/context, or the facets related to the answer are separated by dense, unrelated, noisy information.
• EC3, external knowledge dependency. A correct answer can be given only with external knowledge.
• EC4, ambiguity in sample cases. This category includes cases for which we think humans are not able to select the correct answer: either the information is not enough to make a decision, or there is more than one reasonable answer.
Figure 4: Subtask 1/2 fail case distribution.
Figure 4 shows the ratio of each fail case class. EC4 is the major class, at 48.5% for subtask 1 and 75.0% for subtask 2. EC3 follows, at 36.4% for subtask 1 and 6.3% for subtask 2. EC0 and EC1 are minor classes. Since the system backbone is a pre-trained LM with a matching network, it is no surprise to see EC1 and EC3 failures, while the few EC0 and EC2 failures show that our system learns well to capture abstract concepts within the query/article pair.

Conclusion
Our system takes the large pre-trained LM ELECTRA and enhances it with in-domain transfer learning and a multi-head multiple-choice classifier on top. We compared the benchmark performance of different pre-trained LMs (BERT, ALBERT and ELECTRA) on SemEval-2021 Task 4; the results show that different pretraining objectives/datasets can lead to different inclinations in model knowledge and to large performance discrepancies on the downstream task. Task-adaptive pretraining contributes the main improvement, while the multi-head multiple-choice classifier and WAE bring marginal improvements. We also investigated option contrastive learning and multitask learning; the degradation in performance suggests that triplet and n-tuplet contrastive losses are not suitable for this task and that NLI is not generally beneficial for MMRC tasks.