ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning

This paper presents our systems for the three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity between the ReCAM task and language model pre-training, we propose a simple yet effective technique, namely negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models rank 4th on the official test sets of both Subtask 1 and Subtask 2, with accuracies of 87.9% and 92.8%, respectively. We further conduct a comprehensive model analysis and observe interesting error cases, which may promote future research. The code and dataset used in our paper can be found at https://github.com/CheaSim/SemEval2021. The leaderboard can be found at https://competitions.codalab.org/competitions/26153.


Introduction
Past decades have witnessed huge progress in representation learning for Natural Language Processing (NLP). With pre-trained language models, machine reading comprehension (MRC) models can extract answers from given documents and even outperform humans on benchmark datasets such as SQuAD (Rajpurkar et al., 2016). However, these successes sometimes lead to hype in which these models are described as "understanding" language or capturing "meaning" (Bender and Koller, 2020). Note that the intention of MRC is to let systems read a text like human beings: extracting information, understanding the meaning of the text, and then answering questions. This means that a system should not only capture the semantics of the text but also comprehend abstract concepts under the constraint of general knowledge about the world (Wang and Jiang, 2016). Nevertheless, few works and benchmarks focus on this direction.
* Equal contribution and shared co-first authorship.
1 Our implementation is publicly available at https://github.com/zjunlp/SemEval2021Task4
Unlike previous MRC datasets such as CNN/Daily Mail (Hermann et al., 2015), SQuAD (Rajpurkar et al., 2018), and CoQA (Reddy et al., 2019), which ask computers to predict concrete concepts such as named entities, this task challenges the model's ability to fill in abstract words removed from human-written summaries of English passages.
Note that this task's input format is similar to the masked language model (MLM) pre-training task of BERT (Devlin et al., 2019), which aims to predict the masked tokens. Pre-trained language models (PLMs) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021) have achieved success on MRC tasks. Inspired by this, we introduce a simple yet effective method, namely Negative Augmentation with Language model (NAL), in SemEval 2021 Task 4. Specifically, we augment the answer candidates with an additional negative candidate taken from the masked language model's prediction. Previous work (Petroni et al., 2019; Zhou et al., 2020) indicates that pre-trained language models have already captured much world knowledge. Thus, we argue that this knowledge can help guide model training and identify ambiguous abstract meanings. Further, we introduce other techniques such as label smoothing and domain-adaptive pre-training in our system. We describe the detailed approaches used for the subtasks in Section 3.
We conduct comprehensive experiments in Section 3, and our system ranks 4th on the leaderboard for both Subtask 1 (ReCAM-Imperceptibility) and Subtask 2 (ReCAM-Nonspecificity). In our experiments, we observe that PLMs without fine-tuning can easily reach over 60% accuracy on both Subtask 1 and Subtask 2, demonstrating that pre-trained language models already capture some abstract meanings. We further find that our negative augmentation with language model improves performance by 2.6% in Subtask 1 and 4.6% in Subtask 2. Finally, we conduct an error analysis to promote future research.

Background
Machine reading comprehension (MRC) is a challenging task that has received increasing attention recently. According to the type of answer, reading comprehension tasks can be divided into four categories (Chen, 2018): 1) Cloze-style: the question contains a "@placeholder," and the system must choose a word or entity from the set of candidate answers to fill in the "@placeholder" and complete the sentence. 2) Multiple choice: the system chooses a suitable answer from K given candidates; the answer can be a single word or a sentence. 3) Span prediction: also called extractive question answering, this task requires the system to extract a suitable text span from the given passage as the answer to the question. 4) Free-form answer: the answer can be any type of text, so the system must mine deep contextual semantic information from the question and a collection of candidate documents, and may even need to combine multiple articles to give the best answer.
In SemEval 2021 Task 4, the system needs strong reading comprehension ability, not only because the task uses the cloze-style format described above but also because the answers are abstract words. There are two definitions of abstract words: imperceptibility and nonspecificity. Concrete words refer to things, events, and properties that we can perceive directly with our senses (Spreen and Schulz, 1966; Turney et al., 2011). Compared to concrete words like "trees" and "red," abstract words in the imperceptibility sense are created by humans rather than pointing to things in the natural world. For example, as shown in Table 1, "want" and "achieve" denote a person's attitude towards something and a person's accomplishment of something. Meanwhile, abstract words in the nonspecificity sense can be described as upper-level words. By determining whether one word can generalize another, we obtain a hierarchy of words at different levels; the words at higher levels are the nonspecific ones. Compared to concrete concepts like groundhog and whale, hypernyms such as vertebrate are regarded as more abstract (Changizi, 2008). The only difference between Subtask 1 and Subtask 2 is the definition of abstract words, so the input format of the two subtasks is the same.
Table 1 (excerpt). P: Briton Davies won F42 shot put gold with a Games record at Rio 2016, but was unable to defend his 2012 discus title as it did not feature in Brazil. "I don't normally say what I'm going for," said the Welshman, 25. "But this time I'm definitely going for the two golds in both disciplines and nothing will be better than being in front of a home crowd." ...
The input of these tasks is shown in Table 1. It can be represented as a triple $\langle P, Q, A \rangle$, where $P = s_1, s_2, \dots, s_m$ is the passage from CNN/Daily Mail (Hermann et al., 2015), $Q$ is a human-written summary of the passage with one abstract word replaced by "@placeholder", and $A$ is a set of candidate abstract words for filling in the "@placeholder" in the question.

Model Design
Recently, large pre-trained language models (PLMs) such as GPT (Radford et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and DeBERTa (He et al., 2021) have overwhelmed the NLP community (Zhang et al., 2020c). Thanks to the powerful semantic feature extraction capabilities of PLMs, we only need to make better use of the BERT-like model itself for downstream tasks instead of adding extra layers to it. As in a standard multiple-choice task, each sample has five candidates, one passage, and one question. We leverage PLMs as encoders to capture the global context representation of the passage, question, and answer; a decoder then determines the score of each $\langle P, Q, A \rangle$ triple. Given $n$ candidate answers $A_1, \dots, A_n$ for a passage, we construct $n$ input samples $[Q - A_i; P]$, each the concatenation of $Q - A_i$ and $P$. Because the question is a summary with one abstract word removed, we construct $Q - A_i$ by replacing "@placeholder" with the option $A_i$ from the candidate set instead of concatenating $Q$ and $A_i$. After encoding all $n$ inputs for a single passage, we obtain the global representations $T_i$ for the different options in the candidate set. During fine-tuning, the first special token [CLS] represents the global meaning of the whole input. We use a dense decoder layer to compute a score for each $T_i$ (Equation 1), where $[Q - A; P]$ is the input constructed according to the conventions of PLMs and MRC tasks, and $T^*$ is the final hidden state of the first token [CLS]. The candidate answer with the highest score is taken as the final prediction.
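The input construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function names `build_inputs` and `predict` and the plain-text separator string are hypothetical, and a real tokenizer would insert its own special tokens between the two segments.

```python
from typing import List

PLACEHOLDER = "@placeholder"


def build_inputs(passage: str, question: str, options: List[str],
                 sep: str = " [SEP] ") -> List[str]:
    """Return one "[Q - A_i ; P]" string per candidate option.

    Q - A_i is the question with "@placeholder" replaced by option A_i.
    The separator is an illustrative stand-in for tokenizer special tokens.
    """
    if PLACEHOLDER not in question:
        raise ValueError("question must contain '@placeholder'")
    inputs = []
    for option in options:
        filled_question = question.replace(PLACEHOLDER, option)
        inputs.append(filled_question + sep + passage)
    return inputs


def predict(scores: List[float]) -> int:
    """Pick the index of the candidate with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Each returned string would then be encoded by the PLM, and `predict` applied to the per-option scores from the dense layer.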
Previous research (Gao et al., 2020) demonstrates that there is a gap between language model pre-training and fine-tuning on downstream tasks. Motivated by this gap and by the similarity between this task's definition and MLM, we introduce the negative augmentation with language model mechanism (Section 3.2). Note that the additional label enhances the discriminability of abstract meanings in a contrastive manner. In other words, the model is encouraged NOT to generate the abstract tokens proposed by the language model, but rather the golden candidates supported by the given documents. We further introduce label smoothing (Section 3.3), which can enhance model performance. Finally, we leverage task-adaptive pre-training (Section 3.4), inspired by Gururangan et al. (2020), to obtain better performance.

Negative Augmentation with Language Model
Because MLM and this task share the same format, we first conduct a toy experiment to test whether a PLM can find the right answer without any supervised signal. We first replace the "@placeholder" with [MASK] to reconstruct the input and ask the BERT model with an MLM head to predict the token at the [MASK] position. We then calculate the similarity between the word the model predicts and each option in the set of candidate answers, and take the option with the highest similarity score as the model's choice. We find that the BERT model without any fine-tuning reaches over 60% accuracy in both Subtask 1 and Subtask 2. This result shows that PLMs are able to predict abstract words, and that the predicted words can be leveraged as negative candidates during fine-tuning.
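The selection step of this toy experiment can be sketched as follows. The text does not say which similarity measure is used, so cosine similarity over word vectors is an assumption here, and the `embeddings` lookup table and function names are hypothetical stand-ins for whatever word representations are available.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def choose_option(predicted_word: str, options: List[str],
                  embeddings: Dict[str, Sequence[float]]) -> int:
    """Return the index of the option most similar to the MLM prediction.

    `embeddings` is a hypothetical word -> vector table (an assumption).
    """
    predicted_vector = embeddings[predicted_word]
    similarities = [cosine(predicted_vector, embeddings[option])
                    for option in options]
    return max(range(len(options)), key=similarities.__getitem__)
```

With toy vectors in which "won" lies close to "achieve" and far from "tree", the function selects "achieve" for the prediction "won".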
Note that large language models have huge numbers of parameters, so PLMs are able to store much knowledge through their pre-training tasks. However, [MASK] is not used when fine-tuning the model for downstream tasks, and how to use the knowledge stored during pre-training more explicitly in downstream tasks has become a hot research topic. Motivated by this, we try to bridge the gap between pre-training and downstream tasks. Inspired by contrastive learning (Robinson et al., 2020), where stronger negative samples help the model learn better, we introduce our negative augmentation with language model method. Specifically, we replace the "@placeholder" with the [MASK] token and let the PLM predict it to generate negative candidates. We can then leverage these negative words, which may mislead the models, to help train them. Formally, the model yields a distribution $P$ over the vocabulary for the masked position, where $m_i$ is a token in the vocabulary and $|V|$ is the vocabulary size. We can use this distribution to obtain the top confusing words to augment our models, as illustrated in Figure 2. Due to GPU memory limits, we add only the single most probable word to augment our models.
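A minimal sketch of the augmentation step, assuming the masked language model's scores for the [MASK] position are available as a word-to-logit mapping (the `mask_logits` argument and both function names are hypothetical). Skipping the gold answer follows the case study, where another word is chosen when the prediction is the right answer; also skipping words already among the candidates is an extra assumption made here.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn raw scores over the vocabulary into probabilities."""
    max_logit = max(logits.values())
    exps = {word: math.exp(value - max_logit) for word, value in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


def augment_with_negative(options: List[str], gold_index: int,
                          mask_logits: Dict[str, float]) -> Tuple[List[str], int]:
    """Append the most probable MLM word that is not already a candidate.

    Returns the extended option list and the unchanged gold index, so the
    added word always acts as a negative candidate.
    """
    probabilities = softmax(mask_logits)
    existing = {option.lower() for option in options}
    for word in sorted(probabilities, key=probabilities.get, reverse=True):
        if word.lower() not in existing:
            return options + [word], gold_index
    # No usable word found: leave the candidates unchanged.
    return list(options), gold_index
```

Only one word is appended, matching the single-negative setting described above; taking the top k instead would need a loop that collects k words.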

Label Smoothing
Label smoothing is a well-known "trick" that effectively improves model performance. It encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant from the templates of the incorrect classes (Müller et al., 2019). Because the approach in Section 3.2 yields more options than the original dataset, label smoothing magnifies the effect of our method during fine-tuning. Suppose the output of the final layer followed by the softmax is
$$p_k = \frac{e^{x^\top w_k}}{\sum_{l=1}^{K} e^{x^\top w_l}},$$
where $p_k$ is the likelihood the model assigns to the $k$-th class, $K$ is the number of classes, $w_k$ represents the weights and biases of the last layer, and $x$ is the vector containing the activations of the penultimate layer of a neural network concatenated with "1" to account for the bias. The cross-entropy loss is
$$H(y, p) = -\sum_{k=1}^{K} y_k \log p_k,$$
where $y_k$ is 1 for the correct class and 0 otherwise.
Without label smoothing, the cross-entropy only focuses on whether the positive example is predicted correctly and ignores the relationship among the negative examples. We therefore use the soft label
$$y_k^{LS} = y_k (1 - \epsilon) + \frac{\epsilon}{K}.$$
We set $\epsilon$ to 0.1 in our models.
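As a small worked example, the sketch below builds the smoothed target in the standard form of Müller et al. (2019), (1 - ε) · y_k + ε / K, and evaluates the cross-entropy against it. The function names are illustrative, and using six classes (five candidates plus one augmented negative) is an assumption about how the two techniques combine.

```python
import math
from typing import List


def smooth_labels(gold_index: int, num_classes: int,
                  epsilon: float = 0.1) -> List[float]:
    """Soft targets: (1 - epsilon) * one_hot + epsilon / num_classes."""
    return [(1.0 - epsilon) * (1.0 if k == gold_index else 0.0)
            + epsilon / num_classes
            for k in range(num_classes)]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return [v - log_z for v in logits]


def cross_entropy(logits: List[float], targets: List[float]) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(targets, log_softmax(logits)))


# Five original candidates plus one augmented negative gives six classes.
targets = smooth_labels(gold_index=2, num_classes=6, epsilon=0.1)
loss = cross_entropy([0.2, -1.0, 3.0, 0.5, 0.0, 1.5], targets)
```

Because every incorrect class receives the same small mass ε / K, the loss no longer pushes the gold logit arbitrarily far above the others.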

Task-Adaptive Pre-training
The BERT-like model is pre-trained on general-domain corpora such as Wikipedia. Since the passages mainly come from CNN/Daily Mail, the data distribution may be quite different from that of the pre-training data. Therefore, we apply task-adaptive pre-training to BERT with the masked language model and next sentence prediction tasks on the domain-specific data. Task-adaptive pre-training not only makes the model better fit the in-domain distribution but also helps the model predict good negative words to augment the original dataset, as described in Section 3.2. We take two different approaches to task-adaptive pre-training:
1) In-domain pre-training: we use the source data, CNN/Daily Mail, to further pre-train our base models.
2) Within-task pre-training: we replace the "@placeholder" with the correct answer and use the same input format as in fine-tuning, namely $[Q - A; P]$ (Gururangan et al., 2020).

Pre-processing
For data pre-processing, we use byte-level BPE encoding (Sennrich et al., 2016); the official vocabulary contains more than fifty thousand byte-level tokens. All tokens are stored in merges.txt, while vocab.json is a byte-to-index mapping. Generally speaking, the higher the frequency, the smaller the byte index. Since the average passage lengths in Subtask 1 and Subtask 2 are 262 and 418, respectively, we split long passages. We limit the maximum number of tokens in an input sample $[Q - A; P]$ to 256 for our system. Statistically, 60% of the passages exceed 256 tokens (including special tokens such as [CLS] and [SEP]). We divide these input samples into new input samples with at most 256 tokens each. More specifically, we split the passage across different inputs that share the same question and answer.
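The splitting rule can be sketched as below. This is an illustration under stated assumptions: tokens are given as plain lists, the chunks do not overlap, and `num_special` (the number of special tokens added per input) defaults to 3 here but depends on the tokenizer in practice. The function name is hypothetical.

```python
from typing import List


def split_passage(question_tokens: List[str], passage_tokens: List[str],
                  max_len: int = 256, num_special: int = 3) -> List[List[str]]:
    """Split a passage into chunks so each [Q - A; chunk] fits in max_len.

    Every chunk is meant to be paired with the same filled-in question.
    """
    budget = max_len - len(question_tokens) - num_special
    if budget <= 0:
        raise ValueError("question leaves no room for passage tokens")
    chunks = [passage_tokens[start:start + budget]
              for start in range(0, len(passage_tokens), budget)]
    # An empty passage still yields one (empty) chunk.
    return chunks or [[]]
```

For a 10-token question and a 600-token passage, the budget is 243 passage tokens per input, which gives three chunks of 243, 243, and 114 tokens.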

Hyper-parameter Setting
Our system is implemented with PyTorch (Paszke et al., 2019), and we use the PyTorch version of the pre-trained language models. We employ the RoBERTa, ALBERT, and DeBERTa large models as our PLM encoders. We use the AdamW optimizer (Loshchilov and Hutter, 2018) to fine-tune the models. We set the batch size to 1 and the maximum input length to 256 for RoBERTa and 128 for ALBERT. The batch size usually has a significant influence on BERT-like models; because GPU memory is limited, we use gradient accumulation during training. We set the number of gradient accumulation steps to 32, so the effective batch size is 32. We pick the best learning rate on the dev set and fine-tune RoBERTa, ALBERT, and DeBERTa with learning rates of $9 \times 10^{-6}$, $1 \times 10^{-5}$, and $1 \times 10^{-5}$, respectively. We train for 8 epochs for ALBERT and 12 epochs for RoBERTa and DeBERTa. Furthermore, during training we save the model that performs best on the validation set and use it for testing. Because the formats of Subtask 1 and Subtask 2 are the same, we use the same batch size and maximum input length for both.
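To make the relation between a batch size of 1, 32 accumulation steps, and an effective batch size of 32 concrete, here is a toy sketch in plain Python. It fits a one-parameter model y = w * x with squared error and uses plain gradient descent rather than AdamW, so it illustrates only the accumulation logic, not the actual training setup; all names are hypothetical.

```python
from typing import List, Tuple


def grad_single(w: float, x: float, y: float) -> float:
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def train_with_accumulation(data: List[Tuple[float, float]], w: float = 0.0,
                            lr: float = 0.01,
                            accumulation_steps: int = 32) -> float:
    """Process one example at a time, update once per accumulation window.

    Dividing each gradient by accumulation_steps makes the update equal to
    one step on the mean gradient of the window (the effective batch).
    Trailing examples that do not fill a window are ignored in this sketch.
    """
    accumulated = 0.0
    for step, (x, y) in enumerate(data, start=1):
        accumulated += grad_single(w, x, y) / accumulation_steps
        if step % accumulation_steps == 0:
            w -= lr * accumulated
            accumulated = 0.0
    return w
```

After 32 single-example steps the parameter has moved exactly as it would after one step on a batch of 32, which is why the peak memory cost stays that of batch size 1.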

Subtask 1 Results
On Subtask 1, the ReCAM-Imperceptibility task, the evaluation results are shown in Table 3. We use three baseline models: RoBERTa Large, DeBERTa Large, and ALBERT xxLarge. RoBERTa Large + NAL, DeBERTa Large + NAL, and ALBERT xxLarge + NAL denote the language models with our proposed negative augmentation with language model. Ensemble refers to the ensemble of the three models mentioned above with all strategies. We find that ALBERT achieves better performance in Subtask 1 but fails to perform well in Subtask 2, while DeBERTa and RoBERTa perform better in Subtask 2. Compared with the original RoBERTa, DeBERTa, and ALBERT models, each model improves by about 2.1% accuracy with NAL. We further observe that DeBERTa and RoBERTa, which have the same architecture, obtain better performance than ALBERT on the dev and test sets. We think a possible reason is that ALBERT shares weights across layers, which reduces the model's generalization ability in reading comprehension, especially for the meaning of abstract words. Finally, the ensemble of the best RoBERTa, DeBERTa, and ALBERT models leads to a significant improvement (4.3% accuracy) over the baselines, and it is also our final submission to the leaderboard.

Subtask 2 Results
On Subtask 2, the ReCAM-Nonspecificity task, the experimental results are shown in Table 4. As in Subtask 1, we choose RoBERTa, DeBERTa, and ALBERT as our baseline models. RoBERTa Large + NAL, ALBERT xxLarge + NAL, and DeBERTa Large + NAL are the models with negative augmentation with language model. Ensemble refers to the ensemble of RoBERTa, DeBERTa, and ALBERT with all strategies. We notice that our proposed mechanism brings a significant improvement (4.3% accuracy on average) over the baselines, demonstrating the effectiveness of our proposed strategies: negative augmentation with language model, label smoothing, and task-adaptive pre-training. We observe that the ensemble of the three enhanced models (RoBERTa Large + NAL, ALBERT xxLarge + NAL, and DeBERTa Large + NAL) obtains the best accuracy of 92.8% on the test set, and it is also our final submission to the leaderboard.

Subtask 3 Results
Subtask 3 focuses on the model's transferability. During the evaluation period, we use the data of Subtask 2 to evaluate the models trained on Subtask 1, and vice versa. The model trained on Subtask 1 and evaluated on Subtask 2 obtains 82% accuracy on the dev set.
During experiments for all tasks, we tried different decoders, such as an MLP and other network architectures. Eventually, we found that they do not improve the system's performance. One explanation is that the pre-trained language models (PLMs) have already captured the global contextual meaning of the input at the [CLS] token.

Analysis of Negative Augmentation with Language Model
During our experiments, we conduct case studies to figure out how NAL helps the model boost performance. From Table 5, we notice that the original PLM prefers "all" or "half" over "parts". Even after fine-tuning on the downstream task, the baseline model still chooses "half". In our NAL method, we add misleading negative words to help models correct the knowledge learned from the pre-training task.

Analysis of Passage Length
In usual MRC tasks, the length of the passage is a key factor in how well models solve the problems. We conduct experiments to analyze performance with respect to passage length. Contrary to the common assumption, Figure 3 and Figure 4 show that instances with long passages obtain better performance. We think that understanding abstract meaning may need comprehensive context information from long passages, and we will conduct further analysis in future work.

Case Study
We select four types of error cases to promote further research. We classify the examples according to the main cause of the error (pre-training, fine-tuning, and so on). We think this will help us better understand what the model learns from pre-training and fine-tuning.
Case 1 - Influenced by the original pre-training task
(Because it is the right answer, we choose another word, "played", as the augmented choice.)
• Right Option: (A) scored
• Wrong Option: (E) beaten
• Potential Causes: It is surprising that the original PLM can predict the right answer but fails to do so after fine-tuning. We suppose that, during fine-tuning, the inconsistency of abstract vocabulary prediction and the interference of other vocabulary cause the model's performance to drop in some cases.
• How to help models? We could use our NAL approach to increase the weight of the knowledge learned in the pre-training task, or leverage external knowledge (Zhang et al., 2019, 2020b; Yu et al., 2020; Zhang et al., 2020a).
Case 3 - Obscure abstract word meaning
• Passage: " ...Mr Habgood said: "We're pretty sure it will be popular because it was when East Street was closed for other reasons and we want to make it a friendlier place to be. "It does fit with our larger objectives to improve the town and make it safer for cyclists and pedestrians." ..."
• Question: Three busy town center streets are to be pedestrianised in a bid to improve @placeholder for shoppers and cyclists .
• Answer: (A) opportunities (B) services (C) quality (D) disruption (E) safety
... focus of Subtask 2, the model may consider "families" as the upper level of the "people" occurring in the passage and choose "(A) families" instead of the right answer "(B) properties".
• How to help models? We try to use the proposed NAL to add more abstract words learned from pre-training to mitigate this issue.

Conclusion
This paper presents our system design for SemEval 2021 Task 4. We propose a simple yet effective method called negative augmentation with language model. Comprehensive experiments demonstrate the effectiveness of our proposed approach.
We also conduct case studies and investigate why the model fails to obtain the correct prediction. Note that language models are pre-trained on huge corpora; recently, researchers have identified bias in language models, which may mislead model predictions. Our proposed negative augmentation with language model can help the model better discriminate between candidates during fine-tuning and thus boost performance. From another perspective, as described in Section 3.2, the language model without any fine-tuning reaches over 60% accuracy in both Subtask 1 and Subtask 2. This indicates that bias exists in the datasets (part of the abstract meaning can be obtained from the language model alone). Stronger benchmarks should be constructed in the future.