REPT: Bridging Language Models and Machine Reading Comprehension via Retrieval-Based Pre-training

Pre-trained Language Models (PLMs) have achieved great success on Machine Reading Comprehension (MRC) over the past few years. Although the general language representation learned from large-scale corpora does benefit MRC, their poor support for evidence extraction, which requires reasoning across multiple sentences, hinders PLMs from further advancing MRC. To bridge the gap between general PLMs and MRC, we present REPT, a REtrieval-based Pre-Training approach. In particular, we introduce two self-supervised tasks to strengthen evidence extraction during pre-training, which is further inherited by downstream MRC tasks through a consistent retrieval operation and model architecture. To evaluate our proposed method, we conduct extensive experiments on five MRC datasets that require collecting evidence from and reasoning across multiple sentences. Experimental results demonstrate the effectiveness of our pre-training approach. Moreover, further analysis shows that our approach is able to enhance the capacity of evidence extraction without explicit supervision.


Introduction
Machine Reading Comprehension (MRC) is an important task for evaluating machine understanding of natural language. Given a set of documents and a question (with possible options), an MRC system is required to provide the correct answer by either retrieving a meaningful span (Rajpurkar et al., 2018a) or selecting the correct option from a few candidates (Lai et al., 2017; Guo et al., 2019, 2021). Recently, with the development of self-supervised learning, pre-trained language models (Yang et al., 2019b) fine-tuned on several machine reading comprehension benchmarks (Reddy et al., 2019; Kwiatkowski et al., 2019) have achieved superior performance. The dominant reason lies in the strong and general contextual representation learned from large-scale natural language corpora. Nevertheless, PLMs focus more on general language representation and semantics to benefit various downstream tasks, while MRC demands the capability of extracting evidence across one or multiple documents and performing reasoning over the collected clues (Yang et al., 2018). Put differently, there exists an obvious gap, indicating insufficient exploitation of PLMs for MRC.

* Work done during an internship at Alibaba Group.
† Corresponding author: Liqiang Nie.
1 Our code and pre-trained models will be released at github.com/SparkJiao/retrieval-based-mrc-pretraining.
Some efforts have been made to bridge the gap between PLMs and downstream tasks, which can be roughly divided into two categories: knowledge enhancement and task-oriented pre-training (Qiu et al., 2020). The former introduces commonsense or world knowledge into pre-training (Zhang et al., 2019; Varkel and Globerson, 2020; Ye et al., 2020) or fine-tuning (Yang et al., 2019a) for better performance on knowledge-driven tasks. The latter includes delicately designed pre-training tasks, e.g., the contrastive approach of learning discourse knowledge for the textual entailment task (Iter et al., 2020). Although these approaches have achieved improvements on certain tasks, few of them are specifically designed for evidence extraction, which is indispensable to MRC.
In fact, equipping PLMs with the capability of evidence extraction in MRC is challenging due to the following two factors. 1) The process of collecting clues from a document is difficult to integrate into PLMs without designing specific model architectures or pre-training tasks (Qiu et al., 2020; Zhao et al., 2020). And 2) a large-scale pre-training process would make PLMs overfit to the pre-training tasks (Chung et al., 2021; Tamkin et al., 2020). In other words, it is difficult to take full advantage of the pre-training merits if the training objectives of pre-training and downstream MRC are greatly separated.

Figure 1: A running example obtained from our method. The query sentences (sentences 1 and 2) are extracted from the original document with some crucial information randomly masked. The model is required to predict the preceding and following sentence of each query in the original document and to recover the masked clues, i.e., infer the original order from the input order and fill each [MASK] with the initial token. The phrases in boxes are the possible clues for recovering the masked tokens and the correct order.
To deal with the aforementioned challenges, we propose a novel retrieval-based pre-training approach, REPT, to bridge the gap between PLMs and MRC. Firstly, to unify the training objective, we design a novel pre-training task, namely Surrounding Sentences Prediction (SSP), as illustrated in Figure 1. Given a document, several sentences are first selected as queries, and the others are jointly treated as a passage 2. Thereafter, for each query, the model should predict its preceding and following sentences in the original document by collecting clues from each sentence, which is compatible with evidence extraction in MRC tasks. It is worth emphasizing that the repeated occurrence of entities or nouns across different sentences often leads to an information short-cut, from which the order of sentences can be easily recovered. In view of this, we propose to mask such explicit clues. As a result, the model is forced to infer the correct positions of queries by gathering evidence from incomplete information. Secondly, to preserve the effectiveness of the contextual representation, the masked clues are also required to be recovered by retrieving relevant information from other parts of the document, which is implemented via our Retrieval-based Masked Language Modeling (RMLM) task.
In this way, the pre-training stage can be properly aligned with MRC: 1) the training objectives are connected through the two pre-training tasks, which are inherited by downstream MRC tasks through a consistent retrieval operation; and 2) the capability of evidence extraction from documents or sentences is enhanced during pre-training and smoothly transferred to MRC. Our contributions in this paper are summarized as follows:

1. We present REPT, a novel pre-training approach, to bridge the gap between PLMs and MRC through retrieval-based pre-training.

2. We design two self-supervised pre-training tasks, i.e., SSP and RMLM, to augment PLMs with the ability of evidence extraction with the help of a retrieval operation and the elimination of information short-cuts, which can be smoothly transferred to downstream MRC tasks.

3. We evaluate our method on five reading comprehension benchmarks of two different task forms: Multiple Choice QA (MCQA) and Span Extraction (SE). The substantial improvements over strong baselines demonstrate the effectiveness of our pre-training approach. We further conduct an empirical study to verify that our method is able to enhance evidence extraction as expected.

Related Work
MRC has received increasing attention in recent years. Many challenging benchmarks have been established to examine various forms of reasoning abilities, e.g., multi-hop (Yang et al., 2018), discrete (Dua et al., 2019), and logical reasoning. To solve the problem, a typical design is to gather possible clues through entity linking (Zhao et al., 2020) or a self-constructed graph (Ran et al., 2019), and then perform multi-step reasoning. It is worth noting that gathering clues is vital but challenging, especially for long-document understanding. Some efforts have been dedicated to improving evidence extraction via direct (Wang et al., 2018) or distant supervision (Niu et al., 2020).
Generally, fine-tuned PLMs (Yang et al., 2019b) can obtain superior performance in MRC due to their strong and general language representation. However, there still exist gaps between PLMs and various downstream tasks, since certain abilities required by the downstream tasks cannot be learned through the existing pre-training tasks (Qiu et al., 2020). In order to take full advantage of PLMs, a few studies attempt to align the pre-training and fine-tuning stages. For example, Tamborrino et al. (2020) reformulated the commonsense question answering task as scoring by leveraging the predicted probabilities for Masked Language Modeling (MLM) in RoBERTa. With the help of the commonsense knowledge learned through MLM, the method achieves results comparable with supervised approaches in the zero-shot setting, indicating that bridging the gap between these two stages yields considerable improvement. Chung et al. (2021) tried to address the overfitting problem during pre-training by decoupling the input and output embedding weights and enlarging the embedding size during decoding. The resultant model is therefore more transferable across tasks and languages. In addition, some task-oriented pre-training methods have also been developed. For instance, one study proposed a novel pre-training method for sentence representation learning, where the masked tokens in a sentence are forced to be recovered from other sentences through sentence-level attention. Based on this, the attention weights can be directly fine-tuned to rank the candidates in answer selection or information retrieval. Another tried to learn dense document representations for information retrieval by minimizing the distance between the representation of a query sentence and its context. A third designed an augmented MLM task to jointly train a neural retriever and a language model for open-domain QA.
Different from these methods, which rank documents for open-domain QA, our approach focuses on enhancing the ability of evidence extraction in MRC, where an MLM-based task alone is insufficient.

Method
In this section, we present the details of the proposed method, REPT. We first describe data pre-processing (§3.1), then illustrate the two pre-training tasks, i.e., SSP and RMLM (§3.3), and the training objectives (§3.4). Finally, we detail how to fine-tune our pre-trained model for downstream tasks through retrieval-based evidence extraction (§3.5).

Data Pre-processing
For pre-training, we use the English Wikipedia as our training data. We divide each Wikipedia article into non-overlapping segments, each containing up to 500 tokens. We treat each segment as a document and split it into sentences.
To increase the difficulty and efficiency of pre-training, for each document, we select the 30% most important sentences as queries and keep the rest, in their original order, as a passage. Specifically, the importance of each sentence in a document is measured as the sum of the importance of the entities and nouns it contains, where the importance of an entity/noun is defined as the number of sentences it occurs in. Hereafter, masking is applied to the entities and nouns in queries according to pre-defined ratios to eliminate information short-cuts. More details about the masking strategy are described in Appendix A, and an example after pre-processing can be found in Figure 1.
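The selection procedure above can be sketched as follows. This is a minimal illustration, assuming the per-sentence entity/noun lists (`mentions`) come from an off-the-shelf NER/POS tagger; the 30% ratio and the tie-breaking are illustrative rather than the paper's exact implementation:

```python
from collections import defaultdict

def select_queries(sentences, mentions, ratio=0.3):
    """Select the most 'important' sentences of a document as queries.

    `sentences` is a list of sentence strings; `mentions[i]` is the list of
    entities/nouns detected in sentence i.  The importance of a mention is
    the number of sentences it occurs in, and the importance of a sentence
    is the sum over its mentions, as described above.
    """
    # Document frequency of each entity/noun.
    df = defaultdict(int)
    for ms in mentions:
        for m in set(ms):
            df[m] += 1
    # Sentence importance = sum of mention document frequencies.
    scores = [sum(df[m] for m in set(ms)) for ms in mentions]
    k = max(1, int(round(len(sentences) * ratio)))
    query_idx = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:k])
    passage_idx = [i for i in range(len(sentences)) if i not in set(query_idx)]
    return query_idx, passage_idx
```

The remaining `passage_idx` sentences keep their original order, while the selected queries are subsequently masked according to the pre-defined ratios.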

Task Definition
We treat a document as a sequence of n sentences with m tokens in total. Supposing that t sentences are selected as queries following §3.1, the rearranged sequence is defined as S = [s_1, s_2, ..., s_t, ..., s_n], and the index set of queries is Q = {1, 2, ..., t}. Besides, we define a mapping function r that maps each rearranged sentence to its original position. Taking Figure 1 as an example, the mapping r(s_1) = 1, r(s_2) = 4, r(s_3) = 2 and r(s_4) = 3 indicates that the original order is {s_1, s_3, s_4, s_2, ...}.

Figure 2: b) The attention-based sentence-level retrieval for evidence extraction for each sentence, which is further adopted by SSP during pre-training and MCQA during fine-tuning. c) The attention-based document-level retrieval for evidence extraction over the input sequence, which is employed for RMLM. For SE, the similarity function is directly fine-tuned.
Taking S as input, the Surrounding Sentences Prediction task should predict the correct sentence indices a and b for each query s_q with q ∈ Q:

r(s_a) = r(s_q) − 1,  r(s_b) = r(s_q) + 1.  (1)

As for the Retrieval-based Masked Language Modeling (RMLM) task, the model should recover all the masked tokens in each query s_q.
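Given the rearranged order, the SSP targets of Equation 1 can be constructed as in the following sketch (1-based indices as in the paper's notation; `None` marks a query at a document boundary, for which the corresponding prediction is removed):

```python
def ssp_targets(original_order, query_positions):
    """Build SSP targets for a rearranged sequence.

    `original_order[j]` = r(s_{j+1}): the original position (1-based) of the
    j-th sentence in the rearranged sequence S.  For each query index q in
    `query_positions`, the target pair (a, b) consists of the indices in S
    of the sentences whose original positions are r(s_q) - 1 and r(s_q) + 1.
    """
    # Invert r: original position -> index in the rearranged sequence S.
    pos = {p: j + 1 for j, p in enumerate(original_order)}
    targets = {}
    for q in query_positions:
        rq = original_order[q - 1]
        targets[q] = (pos.get(rq - 1), pos.get(rq + 1))
    return targets
```

On the Figure 1 example, where r(s_1) = 1, r(s_2) = 4, r(s_3) = 2 and r(s_4) = 3, query s_1 has no preceding sentence and its following sentence is s_3, while query s_2 is the last sentence and its preceding sentence is s_4.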

Model
First of all, we leverage a pre-trained Transformer (Vaswani et al., 2017), such as BERT, as our encoder to obtain the contextual representation of sentences. The output of the Transformer is formulated as:

H = Encoder(S),  (2)

where H ∈ R^{d×(m+3)} and d is the hidden size. For better illustration, we use H_i ∈ R^{d×l_i} to represent the hidden-state matrix of the tokens belonging to sentence s_i, where l_i is the length of sentence s_i and m = Σ_i l_i. (Note that for r(s_q) = 1 or r(s_q) = n, the corresponding prediction task in Equation 1 is removed, since the preceding or following sentence does not exist.) Since the process for each query is exactly the same, we use q ∈ Q as a representative to introduce the calculation with respect to each query below.

Query Representation
In order to gather potential clues from a document or sentences, we adopt the multi-head attention mechanism of Vaswani et al. (2017) to obtain a sentence-level representation for each query. Formally, the attention mechanism is defined as MHA(Q, K, V), where Q, K, and V are the query, key, and value matrices, respectively. To consider global information, we leverage h_cls as the query vector and H_q as K and V:

v_q^0 = MHA(h_cls, H_q, H_q).  (3)

During pre-training, we reuse the layer defined in Equation 3, with Q = v_q^0 and K = V = H_q, to generate the task-specific query representation v_q, which is designed to alleviate the overfitting problem (He et al., 2021).
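As a toy illustration of this pooling step, the sketch below attends from the global [CLS] vector over the query sentence's token states. The random projection matrices are placeholders standing in for the learned per-head parameters, not the paper's actual weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h_cls, H_q, num_heads=2):
    """Multi-head attention pooling: h_cls attends over the query tokens H_q.

    h_cls: (d,)   global [CLS] vector used as the attention query.
    H_q:   (d, l) hidden states of the query sentence's tokens (K and V).
    """
    d, l = H_q.shape
    dh = d // num_heads
    rng = np.random.default_rng(0)
    out = []
    for _ in range(num_heads):
        # Per-head projections (random placeholders for learned weights).
        Wq, Wk, Wv = (rng.standard_normal((dh, d)) / np.sqrt(d) for _ in range(3))
        q, K, V = Wq @ h_cls, Wk @ H_q, Wv @ H_q      # (dh,), (dh, l), (dh, l)
        w = softmax(q @ K / np.sqrt(dh))              # attention over l tokens
        out.append(V @ w)                             # head output, (dh,)
    return np.concatenate(out)                        # v_q^0, shape (d,)
```

A second application of the same layer, with this output as the attention query, would yield the task-specific representation v_q described above.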

Surrounding Sentence Prediction
To enhance the capability of pre-trained models for evidence extraction, we carefully design the SSP task, where the model should predict the preceding and following sentences of a given query by extracting the relevant evidence from each sentence. To this end, we introduce a retrieval operation, implemented via a single-head attention mechanism (detailed in Appendix B.1):

u_q^i = Attn(v_q, H_i, H_i),  (4)

where u_q^i is the representation of sentence s_i, highlighting the evidence information pertaining to query s_q. Finally, the score of each sentence in the document with regard to s_q is obtained through a similarity function between v_q and u_q^i.
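A minimal sketch of this retrieval step follows. The dot-product similarity used for the final sentence score is an assumption of this sketch; the paper's exact scoring function may differ:

```python
import numpy as np

def ssp_scores(v_q, sentence_states):
    """Retrieval step of SSP: for each sentence s_i, a single-head attention
    with query v_q summarises its tokens into u_q^i, and the sentence score
    is taken as the similarity <v_q, u_q^i>."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    scores = []
    for H_i in sentence_states:                      # H_i: (d, l_i) token states
        w = softmax(v_q @ H_i / np.sqrt(len(v_q)))   # attend over the tokens of s_i
        u_i = H_i @ w                                # u_q^i: evidence summary of s_i
        scores.append(float(v_q @ u_i))              # similarity-based sentence score
    return softmax(np.array(scores))                 # distribution over sentences
```

The resulting distribution over sentences is what the surrounding-sentence targets of Equation 1 supervise.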

Retrieval based MLM
Since the masking noise introduced when constructing queries could also bring inconsistency between pre-training and fine-tuning, we further design a retrieval-based MLM task to alleviate this problem.
In the RMLM task, the model should predict the masked entities or nouns by retrieving relevant information from the document. More specifically, a query-aware evidence representation of the input sequence is obtained via document-level attention over H. Denoting the index of a masked token in query s_q as z, the representation ŝ_q^z of the masked token used for recovery is computed by combining its hidden state with the retrieved evidence through the function f(·,·), which is implemented as a normalized 2-layer feed-forward network; the details are illustrated in Appendix B.2.

Optimization
As defined in Equation 1, given a and b as the indices of the original preceding and following sentences of query s_q in S, the corresponding probabilities p_ssp(a|q, S) and p_ssp(b|q, S) of the surrounding sentences are obtained by normalizing the sentence scores with a softmax.

The objective of SSP is subsequently defined as:

L_ssp = − Σ_{q∈Q} (log p_ssp(a|q, S) + log p_ssp(b|q, S)).

As for RMLM, supposing the index set of masked tokens in query s_q is Z_q and the set of corresponding original tokens is X_q, the probability of recovering a masked token is:

p_rmlm(x_z|z, q, S) = exp(e(x_z)^T ŝ_q^z) / Σ_x exp(e(x)^T ŝ_q^z),

where z ∈ Z_q, x_z ∈ X_q, x is a token in the vocabulary, and e(x) denotes the word embedding of x.

Then the objective of RMLM is:

L_rmlm = − Σ_{q∈Q} Σ_{z∈Z_q} log p_rmlm(x_z|z, q, S).  (11)

During pre-training, the model optimizes the two objectives jointly:

L = L_ssp + L_rmlm.
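The RMLM recovery distribution can be sketched as a standard embedding-tied softmax head. Here `s_hat` stands for the evidence-enhanced representation ŝ_q^z, and tying the output weights to the input embedding matrix is an assumption carried over from standard MLM heads:

```python
import numpy as np

def rmlm_token_probs(s_hat, embedding_matrix):
    """RMLM recovery distribution: the probability of each vocabulary token x
    is proportional to exp(e(x)^T s_hat), where e(x) is the word embedding
    of x (rows of `embedding_matrix`, shape (|V|, d))."""
    logits = embedding_matrix @ s_hat   # (|V|, d) @ (d,) -> (|V|,)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

Summing the negative log-probability of each original token over all masked positions yields L_rmlm, which is added to L_ssp for the joint objective.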

Fine-tuning
During fine-tuning, the input contains a query sentence and a passage. For multiple choice QA tasks, we concatenate a question with an option to form a question-option pair and use it as a whole query.
In this section, we use q = 0 to represent the index of the query, and the sentences of the passage are kept in their original order. The input sequence can thus be denoted as S = [s_q, s_1, s_2, ..., s_n].
To inherit the evidence extraction ability augmented during pre-training, we incorporate the same retrieval operation into fine-tuning to collect clues from the passage. Firstly, we reuse the attention mechanism defined in Equation 3 to obtain the query representation v q . As for the evidence extraction process, we formulate it differently for Multiple Choice QA and Span Extraction.

Multiple Choice QA
Similar to Equation 4, we adopt an attention mechanism whereby the query-aware sentence representation u_q^i is obtained by gathering evidence from each sentence:

u_q^i = Attn(v_q, H_i, H_i),

and the final passage representation highlighting the evidence is obtained via sentence-level evidence extraction:

v_p = Attn(v_q, U, U),

where U = [u_q^1, ..., u_q^n], U ∈ R^{d×n}, and Attn denotes the single-head attention defined in Appendix B.1. Finally, we compute the probability of each option c using both the query representation v_q and the passage representation v_p. Specifically, for Multi-RC, since the number of correct answer options for each question is not fixed, the task is treated as a binary classification problem for each option. As a result, we adopt an MLP to obtain the probability that an option c is correct:

p(c) = σ(MLP([v_q; v_p])),

where σ is the sigmoid function.

Span Extraction
Since answer spans are often consistent with the corresponding evidence, we directly leverage the query to extract the relevant spans; the probabilities of the start position s and end position e of an answer span are computed from the query-aware token representations.

Datasets

Multiple Choice QA

Multi-RC is a multiple choice QA dataset in which the number of correct answer options per question is not pre-specified and the correct answer(s) is not required to be a span in the text. Moreover, the dataset provides annotated evidence sentences.
ReClor is extracted from logical reasoning questions of standardized graduate admission examinations. Existing studies show that state-of-the-art models perform poorly on ReClor, indicating the deficiency of current PLMs in logical reasoning.

Span Extraction
Hotpot QA (Yang et al., 2018) is a question answering dataset involving natural and multi-hop questions. The challenge contains two settings, the distractor setting and the full-wiki setting. In this paper, we focus on the full-wiki setting, where the system should retrieve the relevant paragraphs from Wikipedia and then predict the answer. SQuAD2.0 (Rajpurkar et al., 2018b) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

Implementation Detail
We leave the details of the implementation and pre-training corpora to Appendix A due to space limitations.

Baseline
Since our method is used for further pre-training, we mainly compared our model with BERT/RoBERTa and their variants. For Hotpot QA, we integrated our models into an open-sourced and well-accepted system (Asai et al., 2020) and evaluated the performance. The details of the baselines are summarized as follows:

Multiple Choice QA
BERT is the BERT-base model with a 2-layer MLP as the task-specific module.
BERT-Q & RoBERTa-Q refer to the models with our designed architecture but without further pre-training, which include an extra multi-head attention for generating the query representation via Equation 3, and our retrieval operation for evidence extraction as in §3.5.1 and §3.5.2.
BERT-Q w. R/S & RoBERTa-Q w. R/S refer to the designed models further trained with our proposed SSP and RMLM tasks (denoted as S and R, respectively).

Hotpot QA
For Hotpot QA, we constructed the system based on the Graph-based Recurrent Retriever (Asai et al., 2020), which includes a retriever and a reader based on BERT. We simply replaced the reader with our models and evaluated their performance in comparison with several published strong baselines on the leaderboard (https://hotpotqa.github.io/).

Performance on Multiple Choice QA

Table 1 shows the results of the baselines and our method on multiple choice question answering. From Table 1, we can observe that: 1) Compared with BERT-Q and BERT, our method significantly improves the performance over all the datasets, which validates the effectiveness of our proposed pre-training method. 2) As for the model structure, BERT-Q obtains similar or worse results compared with BERT, which suggests that the retrieval operation can hardly improve the performance without specialised pre-training. 3) Comparing the rows of BERT, BERT-Q, BERT w. M, and BERT-Q w. M, the models further pre-trained with MLM achieve similar or only slightly higher performance, showing that further training BERT using MLM on the same corpus yields very limited improvements. 4) Regarding the two pre-training tasks, BERT-Q w. R/S leads to similar performance on the development sets compared with BERT-Q w. S, but much higher accuracy on the test sets, which suggests that RMLM helps maintain the effectiveness of the contextual language representation. However, there is significant degradation over all datasets for BERT-Q w. R. The main reason is possibly that the model cannot tolerate the sentence-shuffling noise, which leads to a discrepancy between pre-training and MRC and thus needs to be alleviated through SSP. And 5) considering the experiments over RoBERTa-based models, RoBERTa-Q w. R/S outperforms RoBERTa-Q and RoBERTa-base with considerable improvements over Multi-RC and the test set of DREAM, which indicates that our method can also benefit stronger PLMs.

Performance on Span Extraction QA
The results of span extraction on Hotpot QA are shown in Table 2. We constructed the system using the Graph Recurrent Retriever (GRR) proposed by Asai et al. (2020) and different readers. (Following Asai et al. (2020), GRR + BERT-base denotes the system whose retriever is GRR and whose reader is built on BERT-base; results marked * are reported by Asai et al. (2020).) As shown in the table, GRR + BERT-Q w. R/S outperforms GRR + BERT-base by more than 2.5 absolute points on both EM and F1, and GRR + RoBERTa-Q w. R/S also achieves a significant improvement over GRR + RoBERTa-base. During the test stage, our best system, GRR + RoBERTa-Q w. R/S, performs better than the strong baselines and gets closer to GRR + BERT-wwm-large. The above results strongly demonstrate the effectiveness of our pre-training method on a task requiring multi-hop evidence extraction and reasoning.
Besides, we also conducted experiments on the most common benchmark, SQuAD2.0. The results on the development set, shown in Table 3, further verify the effectiveness of our proposed pre-training method.

Evaluation of Evidence Extraction
To evaluate the performance of our method on evidence extraction under implicit supervision (with only answers), we ranked the sentences in a passage using their attention weights obtained in Equation 4 and chose the sentences with the highest weights as the evidence.
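The ranking protocol above can be sketched as follows; the `top_k` cutoff and the input format are illustrative choices of this sketch:

```python
import numpy as np

def rank_evidence(attn_weights, sent_ids, top_k=2):
    """Rank passage sentences as evidence by their retrieval attention mass.

    `attn_weights[i]` is the attention weight the query assigns to sentence i
    in the sentence-level retrieval step; the sentences with the largest
    weights are returned as the predicted evidence -- no evidence labels
    are used, matching the implicit-supervision setting described above."""
    order = np.argsort(-np.asarray(attn_weights))
    return [sent_ids[i] for i in order[:top_k]]
```

Precision and recall are then computed against the annotated evidence sentences, e.g., those provided by Multi-RC.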
As shown in Table 4, the models with our proposed pre-training tasks obtain considerable improvements in the precision and recall of evidence extraction, which verifies that our pre-training method effectively equips PLMs with the capability of gathering evidence without explicit supervision. For better illustration, we provide two examples in Appendix C.

Effect of Different Masking Ratios During Pre-training

Table 5 shows the results of our model pre-trained with different masking ratios. Due to the small number of entities contained in each document, we only considered the masking ratio of nouns as the variable. Formally, we considered three ratios, 30%, 60%, and 90%, together with an extra setting in which the entities and nouns are all kept and the RMLM task is removed during pre-training. As shown in the table, as more possible clues are masked, the model tends to obtain better results on the downstream tasks. For example, BERT-Q w. R/S (90%) achieves the best accuracy on RACE, and BERT-Q w. R/S (60%) obtains the highest performance on Multi-RC. All models that employ masking outperform BERT-Q w. S (no masking). The main reason may be that with more explicit information short-cuts eliminated, it is more difficult for models to collect potential clues, so PLMs are enhanced with a stronger ability of evidence extraction. However, there also exists a trade-off: a higher masking ratio introduces more noise, which could worsen the mismatch between pre-training and fine-tuning and cause performance degradation; e.g., BERT-Q w. R/S (90%) performs the worst on Multi-RC.

Effect of Training Data Size

We also fine-tuned our model with different fractions of the training data, generated using different random seeds, and the corresponding accuracies are plotted in the figure. It is observed that with 70% of the training data, our model outperforms the baseline BERT-Q, which was initialized from BERT and not further pre-trained. The results indicate that our method can help reduce the amount of annotated training data needed for downstream MRC tasks, which is especially useful in low-resource scenarios.

Conclusion and Future Work
In this paper, we present a novel pre-training approach, REPT, to bridge the gap between pre-trained language models and machine reading comprehension through retrieval-based pre-training. Specifically, we design two retrieval-based, self-supervised pre-training tasks, namely Surrounding Sentences Prediction (SSP) and Retrieval-based Masked Language Modeling (RMLM), to enhance PLMs with the capability of evidence extraction for MRC. Experiments over five different datasets validate the effectiveness of our proposed method. In the future, we plan to extend the proposed pre-training approach to more challenging open-domain settings.

A Implementation Detail
We built our model on Huggingface's PyTorch transformer repository (Wolf et al., 2019) and used AdamW (Loshchilov and Hutter, 2019) as the optimizer. We used the pre-trained BERT-base-uncased and RoBERTa-base checkpoints to initialize our encoder, and performed pre-training on 16 P100 GPUs. The pre-training processes last around 16 hours for BERT and 4 days for RoBERTa, taking 20,000 and 80,000 steps respectively with a batch size of 512. All hyper-parameters can be found in Table 6 for pre-training and Table 7 for fine-tuning.
When constructing the training samples for pre-training, we controlled the masking ratios for entities and nouns in queries. For BERT, we masked 90% of entities and 30% of nouns. For RoBERTa, we constructed two datasets, where the masking ratios for entities and nouns are set to (90%, 30%) and (90%, 90%), respectively, and mixed the two for joint training. We also explored the effect of different masking ratios; the analysis is detailed in §5.
As for the fine-tuning stage, for multiple choice QA, we ran all experiments using four different random seeds (i.e., 33, 42, 57 and 67) and reported the average performance, except for ReClor, for which we only submitted the results of the model performing best on the development set to the leaderboard because of the limitation on submission times. For Hotpot QA, we mainly followed the hyper-parameters of Asai et al. (2020) and thus did not repeat the experiments with different random seeds. Due to the submission limitation, we only submitted our best model on the development set to the leaderboard and reported its performance on the test set.

B.1 Single-head Attention
To reduce the number of extra parameters introduced, we define a single-head attention mechanism as a lightweight alternative to the multi-head one. Given the query matrix Q, key matrix K, and value matrix V, the single-head attention is formulated as:

Attn(Q, K, V) = V · softmax(K^T (WQ + b)),

where W and b are the learnable parameters.

C Case Study About Evidence Extraction
In §5.3, the results show that our pre-training method can augment the ability to extract the correct evidence. To give an intuitive illustration, we select two cases, shown in Figure 4. As we can see, BERT-Q w. R/S and RoBERTa-Q w. R/S select the correct evidence sentences, while the baseline models attend to the wrong sentences. Besides, Figure 5 shows the attention maps of the two groups of comparison. It can be observed that our pre-training approach helps the model learn a uniform attention distribution over the possible evidence sentences.

D Analysis of Extra Parameters Introduced
For a fair comparison, we try to introduce as few additional parameters as possible. Since the output layer is highly task-specific and the single-head attention defined in Appendix B.1 is simple, we mainly analyze the extra parameters introduced for query representation learning defined in §3.3.1. A single layer of Transformer comprises a multi-head attention module and a feed-forward network. As a result, the multi-head attention module generating the query representation introduces 2.8% extra parameters compared with a 12-layer Transformer, without considering the parameters in the embedding layer and layer normalization.

Table 7: Hyper-parameters for fine-tuning. ♣: Hyper-parameters for BERT-based models. ♠: Hyper-parameters for RoBERTa-based models.
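The 2.8% figure can be checked with a quick back-of-the-envelope parameter count, assuming the standard Transformer layer sizes (4d² parameters for a multi-head attention module and 2 · d · 4d = 8d² for the feed-forward network, with biases, embeddings, and layer normalization ignored):

```python
def extra_param_ratio(d=768, layers=12):
    """Ratio of one extra multi-head attention module's parameters to those
    of a `layers`-deep Transformer encoder.  Each layer holds an MHA module
    (4 d*d projection matrices) plus a feed-forward network (8 d*d)."""
    mha = 4 * d * d
    per_layer = 4 * d * d + 8 * d * d
    return mha / (layers * per_layer)
```

For the BERT-base configuration (d = 768, 12 layers), this gives 4/144 ≈ 2.8%, matching the figure quoted above.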