To Answer or Not to Answer? Improving Machine Reading Comprehension Model with Span-based Contrastive Learning

Machine Reading Comprehension with Unanswerable Questions is a difficult NLP task, challenged by questions that cannot be answered from the given passage. It is observed that subtle literal changes often turn an answerable question into an unanswerable one, yet most MRC models fail to recognize such changes. To address this problem, we propose a span-based method of Contrastive Learning (spanCL) which explicitly contrasts answerable questions with their answerable and unanswerable counterparts at the answer span level. With spanCL, MRC models are forced to perceive crucial semantic changes behind slight literal differences. Experiments on the SQuAD 2.0 dataset show that spanCL improves baselines significantly, yielding 0.86-2.14 absolute EM improvements. Additional experiments also show that spanCL is an effective way to utilize generated questions.


Introduction
Machine Reading Comprehension (MRC) is an important task in Natural Language Understanding (NLU), aiming to answer specific questions by scanning a given passage (Hermann et al., 2015; Cui et al., 2016; Rajpurkar et al., 2018). As a fundamental NLU task, MRC also plays an essential role in many applications such as question answering and dialogue (Chen et al., 2017; Gupta et al., 2020; Reddy et al., 2019). With the rapid development of pre-trained language models (PLMs), there is also a paradigm shift (Schick and Schütze, 2020; Dai et al., 2020; Sun et al., 2021) reformulating other NLP tasks (e.g. information extraction) into MRC format, especially for open-domain scenarios (Yan et al., 2021a).
In most application scenarios, there exists an implicit hypothesis that only answerable questions will be asked, which is unrealistic and unreasonable. Thus, a model capable of distinguishing unanswerable questions is preferable to one that can only give plausible answers (Rajpurkar et al., 2018). However, a slight literal change may turn an answerable question into an unanswerable one, which makes it hard for MRC models to gain such capability (Rajpurkar et al., 2018). For example, in Figure 1, the original answerable question becomes unanswerable merely by replacing Twilight with Australian, while a small literal modification toward paraphrasing does not change the answer. Recent MRC models, which predict answers using context-learning techniques and type-matching heuristics, do not easily perceive such subtle but crucial literal changes (Weissenborn et al., 2017; Jia and Liang, 2017). If different questions share many words in common, these models are likely to give them all the same answer, i.e., 2005 may be predicted for all three questions in Figure 1.
To address the aforementioned challenge, we propose a span-based method of Contrastive Learning (spanCL) in this paper. By explicitly contrasting answerable questions with their paraphrases and distortions, MRC models are forced to recognize subtle but crucial literal changes. With a pre-trained language model (PLM) as the encoder, most contrastive learning methods adopt [CLS] as the sentence representation (Luo et al., 2020; Gao et al., 2021; Yan et al., 2021b; Wang et al., 2021). However, in this problem, the differences between contrastive questions are very subtle, and [CLS] is inadequate to capture such small changes. To solve this challenge, we propose a novel learning method, which incorporates the comparative knowledge between answerable and unanswerable questions and exploits the semantic information of answer spans to improve the sentence representation. Overall, our contributions are twofold: • To improve MRC models' capability of distinguishing unanswerable questions, we propose a simple yet effective method called spanCL, which teaches the model to recognize crucial semantic changes behind slight literal differences.
• Comprehensive experiments show that spanCL can yield substantial performance improvements of baselines. We also show that spanCL is an effective way to utilize generated questions.
Related Work
Unanswerable Questions. Knowing what you do not know is a crucial aspect of model intelligence (Rajpurkar et al., 2018). In the field of MRC, a model should abstain from answering when no answer to the question is available. To deal with unanswerable questions, previous researchers mostly focused on designing a powerful answer verification module (Clark and Gardner, 2017; Liu et al., 2018; Kundu and Ng, 2018; Hu et al., 2019). Recently, a double-checking strategy has been proposed, in which an extra verifier is adopted to rectify the predicted answer (Hu et al., 2019; Back et al., 2019; Zhang et al., 2020a,b). Besides designing verification modules, some other studies try to solve the problem through data augmentation, namely synthesizing more QA pairs (Yang et al., 2019b; Alberti et al., 2019; Zhu et al., 2019b).
Contrastive Learning. To obtain rich representations of texts for downstream NLP tasks, there have been numerous investigations of using contrastive objectives to strengthen supervised learning (Khosla et al., 2020; Gunel et al., 2020) and unsupervised learning (Gao et al., 2021) in various domains (He et al., 2020; Lin et al., 2020; Iter et al., 2020; Kipf et al., 2019). The main idea of contrastive learning (CL) is to learn textual representations by contrasting positive and negative examples, pulling the positives together and pushing the negatives apart. In NLP tasks, CL is usually devoted to learning rich sentence representations (Luo et al., 2020), and the main difference between these methods is the approach to finding positive and negative examples. Wang et al. (2021) argued that using hard negative examples in CL helps improve the semantic robustness and sensitivity of pre-trained language models. Enlightened by the promising effects of CL, Kant et al. (2021) proposed to use CL in visual question answering.
Another line of work applied CL to MRC by comparing multiple answer candidates, but neglected the fact that not all questions can be answered from a given paragraph.

Approach
In this section, we first introduce the task of Machine Reading Comprehension with Unanswerable Questions (MRC-U). Then, a baseline MRC model based on a PLM is described. Finally, we propose a span-based contrastive learning method for MRC-U, named spanCL. In this paper, the terms question paraphrase and positive question, and question distortion and negative question, are used interchangeably.

[Figure 2: (a) the base MRC model; (b) span-based contrastive learning. A PLM encodes the passage paired with the original, positive, and negative questions; the answer-span representations are pulled close for the positive pair and pushed apart for the negative pair to form the spanCL loss.]

Task Description
In this paper, we focus on extractive MRC, in which the expected answer to a question is a span of words in a given passage. Thus, given a textual question Q and a textual passage P, our goal is to find the answer span (y_s, y_e) to Q in P, where y_s is the answer start position in P and y_e is the answer end position in P.

Basic MRC Model
We use the same model as Devlin et al. (2018) as the basic model for the MRC-U task. Given a question and a passage as input, if the question is answerable, the model is expected to give a legal answer span (y_s, y_e) in the passage; if the question is unanswerable, the model is expected to output the [CLS] span (0, 0), which indicates that no answer can be found in the passage. The overall structure of the network is presented in Figure 2. We denote the output of the PLM's last layer as the sequence representation H ∈ R^(l×d), where l is the sequence length and d is the hidden dimension. Accordingly, the hidden representation of the i-th token in the sequence is denoted h_i ∈ H. To find the start position of an answer, a start weight vector w_s ∈ R^d is introduced to score each position. Formally, the probability that the answer starts at the i-th token is defined as

P_s(i) = exp(h_i · w_s) / Σ_j exp(h_j · w_s). (1)

Similarly, with an end weight vector w_e ∈ R^d, the probability that the answer ends at the i-th token is defined as

P_e(i) = exp(h_i · w_e) / Σ_j exp(h_j · w_e). (2)

For learning, the cross-entropy loss on identifying the answer start and end positions is taken as the training objective:

L_span = -log P_s(y_s) - log P_e(y_e), (3)

where y_s and y_e are the start and end positions of the true answer span. With the learnt model, the output answer span (ŷ_s, ŷ_e) is predicted according to

(ŷ_s, ŷ_e) = argmax_{(i,j)|i≤j} h_i · w_s + h_j · w_e. (4)
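The span head above (Equations 1-4) can be sketched in a few lines of PyTorch. This is an illustrative toy with random tensors, not the paper's code; `hidden` stands for H, and `w_s`, `w_e` are the start/end weight vectors.

```python
import torch
import torch.nn.functional as F

l, d = 8, 16                      # sequence length l, hidden size d
torch.manual_seed(0)
hidden = torch.randn(l, d)        # H: PLM last-layer token representations
w_s = torch.randn(d)              # start weight vector w_s
w_e = torch.randn(d)              # end weight vector w_e

# Equations 1-2: position-wise start/end probabilities via softmax
p_start = F.softmax(hidden @ w_s, dim=-1)
p_end = F.softmax(hidden @ w_e, dim=-1)

# Equation 3: cross-entropy on the gold span (y_s, y_e); (0, 0) marks NoAns
y_s, y_e = 2, 4
span_loss = -(torch.log(p_start[y_s]) + torch.log(p_end[y_e]))

# Equation 4: decode the highest-scoring legal span with i <= j
scores = (hidden @ w_s).unsqueeze(1) + (hidden @ w_e).unsqueeze(0)
illegal = torch.ones(l, l).tril(-1).bool()     # positions with i > j
scores = scores.masked_fill(illegal, float("-inf"))
pred_start, pred_end = divmod(scores.argmax().item(), l)
```

In a real model the `(0, 0)` prediction falls on the [CLS] token, which is why an unanswerable question is decoded as the empty span.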

Span-based Contrastive Learning
In this section, spanCL is introduced from two aspects. First, following the contrastive idea of CL, we describe how the positive and negative examples are generated. Second, the training objective of spanCL is presented. Positive Examples.
In our method, we define positive examples as questions that differ slightly in wording from their original questions but share the same answers. To generate them, we use Back Translation, an effective data augmentation method (Xie et al., 2019; Zhang et al., 2017; Zhu et al., 2019a), in which a text is first translated from its source language (e.g. English) into a target language (e.g. French) and then translated back into the source language; the back-translated text is taken as the augmented example. Negative Examples. We generate unanswerable counterparts of answerable questions with three strategies: • Negation. A negation word is inserted into or removed from the original question.
• Antonym. First, spaCy 1 is utilized to perform segmentation and POS tagging on the original question. Then, one of its words (a verb, noun, adjective, or adverb) is randomly replaced with its antonym.
• Entity Replacement. Given an answerable question, one of its entity words is randomly replaced with another entity word that has the same entity type but does not appear in any question.
Table 1 shows several negative examples derived by these strategies. Note that question generation is not the main topic of this paper.
Span-based Contrastive Learning. With a PLM as the encoder, [CLS] usually serves as the sentence representation in CL (Gao et al., 2021; Wang et al., 2021; Yan et al., 2021b). However, when the difference between the original question and its paraphrase or distortion is very subtle, a single [CLS] token is inadequate to capture it, making such questions hard for the model to answer. Therefore, we propose to improve MRC models by contrasting these questions according to their answer-span representations. Specifically, given an original question Q_org and its answer span (y_s, y_e), we synthesize one positive question Q_pos and one negative question Q_neg through the augmentation methods above. The hidden representation of the answer span is used as the answer-span representation of Q_org, denoted z_Qorg; similarly, the answer-span representations of Q_pos and Q_neg are denoted z_Qpos and z_Qneg, respectively. Then, our span-based contrastive loss is calculated as

L_spanCL = -log [ exp(Φ(z_Qorg, z_Qpos)/τ) / ( exp(Φ(z_Qorg, z_Qpos)/τ) + exp(Φ(z_Qorg, z_Qneg)/τ) ) ], (5)

where Φ(u, v) = u · v / (‖u‖ ‖v‖) computes the cosine similarity between u and v, and τ > 0 is a scalar temperature parameter. With this definition, the final objective of our method is

L = λ_1 L_span + λ_2 L_spanCL, (6)

where λ_1 and λ_2 weight the span loss and the spanCL loss.

1 https://github.com/explosion/spaCy

Experiments

Datasets and Metrics
We evaluate our method on the well-known SQuAD 2.0 dataset (Rajpurkar et al., 2018), which combines the questions of SQuAD 1.1 (Rajpurkar et al., 2016) with new unanswerable questions written adversarially by crowdworkers to imitate the answerable ones. Moreover, for each unanswerable question, a plausible answer span is annotated, indicating the incorrect answer that would be obtained by type-matching heuristics. The training set contains 87k answerable and 43k unanswerable questions, and half of the examples in the development set are unanswerable. Two official metrics are used to evaluate model performance on SQuAD 2.0: Exact Match (EM) and F1. EM computes the percentage of predictions that match a ground-truth answer exactly. F1 is a softer metric, measuring the average token-level overlap between the prediction and the ground-truth answer.
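The two metrics can be sketched as follows. This is a simplified version of what the official SQuAD evaluation script computes; the real script additionally normalizes articles and punctuation before comparing.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 if prediction equals the gold answer (after lowercasing)."""
    return int(pred.strip().lower() == gold.strip().lower())

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over tokens."""
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)   # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting "in 2005" against the gold answer "2005" gives EM 0 but a nonzero F1, which is why F1 is the softer of the two metrics.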

Experimental Setup
MRC Model. We adopt the model introduced in Section 3.2 with various PLM encoders for the MRC-U task. BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019) are selected in our experiments; we download the pre-trained weights from Hugging Face. Training Data Construction. For each original answerable question, we use Back Translation to generate its paraphrase. In SQuAD 2.0, we can find matching negative questions for 18,541 answerable questions in the original dataset; for the remaining 68,280 answerable questions, we use our augmentation strategies to generate negative questions.
During training, the span loss is calculated on Q_org and Q_neg; in Section 4.3, we explain why Q_pos is discarded when calculating the span loss. Hyper-parameters. We use the default hyper-parameter settings for the SQuAD 2.0 task. Specifically, we set the maximum sequence length, document stride, maximum query length and maximum answer length to 512, 128, 64 and 30, respectively. For fine-tuning, we set the learning rate, batch size, number of training epochs and warm-up rate to 2e-5, 12, 2 and 0.1. The temperature in spanCL is set to 0.05, and the weights of the span loss and spanCL loss are λ_1 = λ_2 = 0.5. We fix the random seed in every run to ensure our results are reproducible. We run our experiments on two Tesla A100 40G GPUs, and training a model takes about 5 GPU hours.
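With these hyper-parameters, the combined objective (Equations 5-6) can be sketched as below. The variable names are ours and the span loss is a placeholder scalar; only the spanCL term is computed in full, using cosine similarity and τ = 0.05.

```python
import torch
import torch.nn.functional as F

def span_cl_loss(z_org: torch.Tensor,
                 z_pos: torch.Tensor,
                 z_neg: torch.Tensor,
                 tau: float = 0.05) -> torch.Tensor:
    """Equation 5: pull z_org toward z_pos and away from z_neg."""
    sim_pos = F.cosine_similarity(z_org, z_pos, dim=-1) / tau
    sim_neg = F.cosine_similarity(z_org, z_neg, dim=-1) / tau
    # -log( exp(sim_pos) / (exp(sim_pos) + exp(sim_neg)) ), computed stably
    return torch.logsumexp(torch.stack([sim_pos, sim_neg]), dim=0) - sim_pos

torch.manual_seed(0)
z_org, z_pos, z_neg = torch.randn(3, 768)   # toy answer-span representations
span_loss = torch.tensor(1.2)               # placeholder for Equation 3
lam1, lam2 = 0.5, 0.5                       # loss weights from the paper
total = lam1 * span_loss + lam2 * span_cl_loss(z_org, z_pos, z_neg)
```

A sanity check on the loss: if the positive representation coincides with the original and the negative points the opposite way, the loss is near zero, as a contrastive objective should behave.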

Main Results
Across all encoders, spanCL yields 0.86~2.14 absolute EM improvement and 0.76~2.0 absolute F1 improvement, demonstrating that spanCL is model-agnostic and effective. Since additional training data (i.e. the extra positive and negative questions) is used, it is necessary to analyze whether the improvements are merely brought by this additional data. We conduct experiments training with different datasets and display the results in Table 3. BERT base means training BERT base on the original SQuAD 2.0 training set; "+pos" and "+neg" mean expanding the original training set with generated positive questions and generated negative questions, respectively. Surprisingly, simply expanding the training set cannot guarantee a performance improvement. We find that adding positive examples to the training set does not improve the MRC model. One possible reason is that the positive questions make the model insensitive to slight literal changes, which is inappropriate for the MRC-U task. Comparing BERT base with "+neg", we find that when trained with more negative examples, the model tends to predict NoAns more often and achieves high performance on NoAns, while its performance on HasAns drops considerably and the overall EM improvement is much smaller than with "+spanCL". From the results in Table 3, we conclude that spanCL is an effective way to utilize the generated questions.

Influence of Negative Examples
The unanswerable questions generated by our strategies are rather plain. We believe spanCL can be further boosted by high-quality unanswerable questions. A context-relevant generation method called CRQDA has been proposed, which generates more delicate negative questions 3 . In Table 4, "+CRQDA" denotes training the baseline model on the dataset expanded with the delicate negative questions generated by CRQDA; "+spanCL with simple negatives" denotes applying spanCL with negative questions generated by our three strategies; "+spanCL with CRQDA" denotes applying spanCL with negative questions generated by CRQDA. Comparing "+spanCL with simple negatives" with "+spanCL with CRQDA", we find that spanCL can be further boosted by delicate negative questions.

Influence of Temperature
3 https://github.com/dayihengliu/CRQDA

The temperature τ in the spanCL loss (Equation 5) controls the smoothness of the distribution normalized by the softmax operation: a large temperature smooths the distribution, while a small temperature sharpens it. As shown in Figure 3, spanCL is sensitive to the temperature value. In general, a small temperature results in better performance, and a practical temperature can be found within a small range (from about 0.02 to 0.1). We select 0.05 as the temperature in our experiments.
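The effect of τ is easy to see numerically. In the illustrative example below (the similarity values are made up), dividing the similarities by a small τ before the softmax pushes almost all probability mass onto the higher-similarity (positive) pair, while τ = 1 leaves the distribution nearly uniform.

```python
import math

def softmax_with_temperature(xs, tau):
    """Softmax over xs / tau; smaller tau yields a sharper distribution."""
    exps = [math.exp(x / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.9, 0.7]                                  # sim to positive vs. negative
sharp = softmax_with_temperature(sims, tau=0.05)   # heavily favors the positive
smooth = softmax_with_temperature(sims, tau=1.0)   # close to uniform
```

With a sharp distribution, the contrastive gradient concentrates on hard pairs, which is consistent with the observation that small temperatures work better here.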

Selection of Question Representations
In this paper, we argue that the answer-span representation is better than [CLS]. We conduct experiments with different question representations in this section. When applying CL with [CLS] representations, we add a classification layer on top of [CLS] to determine whether a question is answerable (Zhang et al., 2020b), so that the [CLS] representation acquires information about the question's answerability. We also apply CL with both [CLS] and answer-span representations, in which the two CL losses are computed together. From Table 5, we can see that CL with [CLS] representations improves the model, but the improvement is smaller than that from spanCL, and combining the two CL losses can confuse the model, resulting in only a marginal improvement.

Comparison between Different Training Schemes
There are three training schemes to combine the span loss and spanCL loss: 1) joint training, in which the two losses are used together in each training step; 2) alternate training, in which the model is updated with the spanCL loss after every M updates with the span loss; 3) pre-train and fine-tune, in which we first pre-train the model with the spanCL loss and then fine-tune it with the span loss. For alternate training, we select M from {1, 2, 3} and find M = 2 gives the best results. From Table 6, we conclude that joint training gives the best performance and alternate training performs slightly worse. Surprisingly, with the pre-train and fine-tune scheme, the model performs worse than the baseline. We conjecture that without the supervision of answer-span knowledge, it is hard to learn useful question representations.
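The alternate scheme above can be sketched as a simple schedule. The function below (our own illustration, not the paper's code) yields which loss each step optimizes: M span-loss updates followed by one spanCL update, with M = 2 as the best-performing value.

```python
def alternate_schedule(num_steps: int, M: int = 2) -> list:
    """Return the loss used at each of num_steps training steps."""
    schedule = []
    for step in range(1, num_steps + 1):
        if step % (M + 1) == 0:
            schedule.append("spanCL")   # one spanCL update...
        else:
            schedule.append("span")     # ...after every M span-loss updates
    return schedule
```

Joint training corresponds to collapsing this schedule into a single step that optimizes λ_1 L_span + λ_2 L_spanCL at once, which is what performed best in Table 6.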

Qualitative Analysis
We qualitatively analyze two representative unanswerable questions in Figure 4. The baseline model predicts a plausible answer for each question, while the baseline model trained with spanCL abstains from answering. To correctly handle the first question, the model must learn the question's semantics at the sentence level; to correctly handle the second, it must recognize the literal change at the word level. SpanCL helps the model perceive such crucial differences between question and passage from both semantic and lexical aspects, and thus enables the baseline model to abstain from answering both questions.

Conclusion
In this paper, we propose a span-based method of Contrastive Learning (spanCL) to address the MRC task with Unanswerable Questions. SpanCL is devised based on the observation that an answerable question can become unanswerable through slight literal changes. By explicitly contrasting an answerable question with its paraphrase and distortion at the answer span level, MRC models can be taught to perceive subtle but crucial literal changes. Experimental results demonstrate that spanCL is model-agnostic and can improve MRC models significantly. Additional experiments show that spanCL utilizes generated questions more effectively than other methods. Finally, it should be noted that how to generate high-quality question examples is not fully investigated in this paper, which may introduce a performance bottleneck for spanCL; we therefore encourage future work on question generation compatible with spanCL.