DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications

Machine reading comprehension (MRC) is a crucial task in natural language processing and has achieved remarkable advancements. However, most neural MRC models are still far from robust and fail to generalize well in real-world applications. In order to comprehensively verify the robustness and generalization of MRC models, we introduce a real-world Chinese dataset, DuReader_robust. It is designed to evaluate MRC models from three aspects: over-sensitivity, over-stability and generalization. Compared to previous work, the instances in DuReader_robust are natural texts rather than altered, unnatural texts, and thus reflect the challenges of applying MRC models to real-world applications. The experimental results show that MRC models do not perform well on the challenge test set. Moreover, we analyze the behavior of existing models on the challenge test set, which may provide suggestions for future model development. The dataset and code are publicly available at https://github.com/baidu/DuReader.


Introduction
Machine reading comprehension (MRC) requires machines to comprehend text and answer questions about it. With the development of deep learning, recent studies of MRC have achieved remarkable advancements (Seo et al., 2017; Wang and Jiang, 2017; Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020). However, previous studies show that most neural models are not robust enough (Jia and Liang, 2017; Ribeiro et al., 2018b; Talmor and Berant, 2019a; Welbl et al., 2020) and fail to generalize well (Talmor and Berant, 2019b).

* This work was done while the first author was doing an internship at Baidu Inc.
† Corresponding authors.
To further promote the studies of robust and well-generalized MRC, we construct a Chinese dataset, DuReader_robust, which comprises natural questions and documents. In this paper, we focus on evaluating robustness and generalization from the following aspects, where robustness consists of over-sensitivity and over-stability: (1) Over-sensitivity denotes that MRC models provide different answers to paraphrased questions, i.e., the models are overly sensitive to the difference between an original question and its paraphrase. We provide an example in Table 1a.
(2) Over-stability means that the models may fall into a trap span that has many words in common with the question and extract an incorrect answer from it, because the models overly rely on spurious lexical patterns without real language understanding. We provide an example in Table 1b. (3) Generalization. Well-generalized MRC models perform well on both in-domain and out-of-domain data; otherwise, they are poorly generalized. We provide an example in Table 1c.
In previous work, the above issues have been studied separately. In this paper, we aim to create a dataset, namely DuReader_robust, to comprehensively evaluate these three issues for neural MRC models. Previous work mainly studies these issues by altering the questions or the documents. Ribeiro et al. (2018b) and Gan and Ng (2019) evaluate the over-sensitivity issue via paraphrased questions generated by rules or generative models. Jia and Liang (2017) evaluate the over-stability issue by adding distracting sentences to the documents.

Table 1 (b). An example illustrating the over-stability issue. The underlined span in the passage acts as a trap because it has many words in common with the question, and BERT_base falls into the trap.
Passage: 包粽子的线以前人们认为是来自麻叶子，其实是棕榈树，粽子的音就来自棕叶子。 (Many people argue that the zongzi (rice dumpling) leaves are made of hemp. Actually, it is the palm tree, the real origin, that endows zongzi with the special pronunciation.)
Question: 包粽子的线来自什么 (What is the raw material of zongzi leaves?)
Golden Answer: 棕榈树 (palm tree)
Predicted Answer (BERT_base): 麻叶子 (hemp)

Table 1 (c). An example illustrating the generalization issue. Although BERT_base is sufficiently trained on large-scale open-domain data, it fails to predict the answer to a math question.
Question: What is the derivative of cos2x?
Golden Answer: -2sin(2x)
Predicted Answer (BERT_base): -sin(2x)

However, it is unclear whether working on such unnatural texts can help the improvements of the neural models in real-world applications. By contrast, all the instances in DuReader_robust are natural texts collected from Baidu search.
We conduct extensive experiments on DuReader_robust. The experimental results show that models based on pre-trained language models (LMs) (Devlin et al., 2019; Sun et al., 2019; Liu et al., 2019) do not perform well on the challenge test set. Besides, we have the following findings on the behaviors of the models: (1) if a paraphrased question contains more words rephrased from the original question, MRC models are more likely to provide different answers; (2) trap spans that share more words with the questions mislead MRC models more easily; (3) domain knowledge is a key factor that affects the generalization ability of MRC models.

Dataset: DuReader_robust
DuReader_robust is built on DuReader, a large-scale Chinese MRC dataset (He et al., 2018). In DuReader, all questions are issued by real users of Baidu search, and the document-level contexts are collected from search results. In DuReader_robust, we select entity questions and paragraph-level contexts from DuReader. We further employ crowd-workers to annotate the answer span conditioned on the question and the paragraph-level context¹. Additionally, we use a mechanism to ensure data quality, where 10% of the annotated data is randomly selected and reviewed by linguistic experts. If the accuracy is lower than 95%, the crowd-workers need to revise all the answers until the accuracy on the randomly selected data is higher than 95%.

Eventually, we collect about 21K instances for DuReader_robust, each of which is a tuple ⟨q, p, A⟩, where q is a question and p is a paragraph-level context containing the reference answers A. Similar to existing MRC datasets, DuReader_robust consists of a training set, an in-domain development set and an in-domain test set, whose sizes are 15K, 1.4K and 1.3K respectively. Besides, DuReader_robust contains a challenge test set, in which 3.5K instances are created to evaluate the robustness and generalization of MRC models. The challenge test set can be divided into three subsets: the over-sensitivity set, the over-stability set and the generalization set. Table 2 shows the statistics of DuReader_robust. DuReader_robust also covers a wide range of answer types (e.g., dates, numbers, persons, etc.). The frequency distribution and examples of the answer types are shown in Table 3. Next, we present the way we construct the three subsets of the challenge test set.
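The quality-control mechanism can be summarized as a simple review-and-revise loop. The sketch below is only illustrative: `expert_review` and `revise` stand in for the manual expert review and crowd-worker revision steps, which are human processes rather than published code.

```python
import random

def quality_control(annotations, expert_review, revise,
                    sample_ratio=0.10, threshold=0.95):
    """Illustrative sketch of the annotation quality-control loop.

    expert_review(sample) -> fraction of correct annotations in the sample;
    revise(annotations)   -> annotations after crowd-workers revise them.
    Both are manual steps in practice; the names here are hypothetical.
    """
    while True:
        k = max(1, int(len(annotations) * sample_ratio))
        sample = random.sample(annotations, k)
        if expert_review(sample) >= threshold:
            return annotations
        # Accuracy below the threshold: all answers are revised and re-checked.
        annotations = revise(annotations)
```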

Over-sensitivity Subset
We build the over-sensitivity subset in the following way. First, we sample a subset of instances {⟨q, p, A⟩} from the in-domain test set of DuReader_robust. For each question q, we obtain its N paraphrases {q1, q2, ..., qN} using a paraphrase retrieval toolkit (see Appendix A for further details). To ensure the paraphrase quality, we employ crowd-workers to discard all false paraphrases. Then, we replace q with a paraphrased question qi, and keep the original context p and answers A unchanged. This leads to new instances {⟨qi, p, A⟩}, which are used as the model-independent instances in the over-sensitivity subset. Besides, we also employ a model-dependent way to collect instances. Specifically, we use paraphrased instances to attack the MRC models based on ERNIE (Sun et al., 2019) and RoBERTa (Liu et al., 2019). If one of the models gives a prediction that differs from its predicted answer for the original question, we adopt the instance; otherwise, we discard it. The instances collected in the above model-dependent and model-independent ways constitute the over-sensitivity subset, which consists of 1.2K instances. The number of model-independent instances is equal to that of model-dependent instances. Table 1a shows an example from the over-sensitivity subset.
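As a rough sketch of the model-dependent selection step (not the authors' released code), assume each entry of `models` maps a (question, context) pair to a predicted answer span and `paraphrases` maps an original question to its human-verified paraphrases:

```python
def select_oversensitive_instances(instances, paraphrases, models):
    """Keep a paraphrased instance if at least one MRC model answers the
    paraphrase differently from the original question (model-dependent
    attack); the context and gold answers are left unchanged."""
    selected = []
    for inst in instances:
        q, p, answers = inst["question"], inst["context"], inst["answers"]
        # Cache each model's prediction for the original question.
        original_preds = {name: predict(q, p) for name, predict in models.items()}
        for q_para in paraphrases.get(q, []):
            if any(predict(q_para, p) != original_preds[name]
                   for name, predict in models.items()):
                selected.append({"question": q_para,
                                 "context": p,
                                 "answers": answers})
    return selected
```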

Over-stability Subset
Intuitively, a trap span that has many words in common with the question may easily mislead MRC models. Following this intuition, the over-stability subset is constructed as follows. First, we randomly select a set of instances ⟨q, p, A⟩ from DuReader. In general, a trap span may contain non-answer named entities of the same type as the reference answers A, because over-stable models usually rely on spurious patterns that match the correct answer types. Thus, we use a named entity recognizer² to identify all named entities in p along with their entity types, and we keep an instance if p contains non-answer named entities of the same type as A. Then, we ask linguistic experts to annotate a new question q′ and answers A′ if they consider that p contains trap spans. A and A′ share the same named entity type, and the annotated question q′ has a high level of lexical overlap with a trap span that does not contain A′. The resulting ⟨q′, p, A′⟩ is considered a candidate instance. Each candidate instance is used to attack the MRC models based on ERNIE (Sun et al., 2019) and RoBERTa (Liu et al., 2019); the candidate instance is added to the over-stability subset if one of the models fails on it. Algorithm 1 shows the detailed procedure (see Appendix B for an illustration). As a result, we obtain 0.8K instances to evaluate over-stability. Table 1b shows an example from the over-stability subset.
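The core of this procedure can be sketched as below. This is not the actual Algorithm 1; it assumes `ner(text)` returns (entity, type) pairs and that each model is a callable returning an answer span, both of which are illustrative interfaces.

```python
def has_trap_entity(context, gold_answers, answer_type, ner):
    """A paragraph is a candidate if it contains a named entity of the same
    type as the gold answer but outside the gold answer set."""
    return any(ent_type == answer_type and entity not in gold_answers
               for entity, ent_type in ner(context))

def attack_succeeds(new_question, context, new_answers, models):
    """A human-written trap question enters the over-stability subset only
    if at least one baseline model is fooled, i.e. predicts a span outside
    the new gold answers."""
    return any(predict(new_question, context) not in new_answers
               for predict in models.values())
```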

Generalization Subset
The in-domain test set consists merely of in-domain data (i.e., the distribution is the same as the one in the training and development sets). In order to evaluate the generalization ability of MRC models, we construct a generalization subset which comprises out-of-domain data. The out-of-domain data is collected from two vertical domains. The details are as follows.
Education: We collect educational questions and documents from Baidu search, and ask crowd-workers to annotate 1.2K high-quality tuples ⟨q, p, A⟩. The topics include mathematics, physics, chemistry, and language and literature. Table 1c shows an example.

Finance: Following Fisch et al. (2019), we leverage a dataset that was originally designed for information extraction in the finance domain and repurpose it for MRC. We obtain 0.4K instances from financial reports in this way. The construction details are presented in Appendix C.

Baselines and Evaluation Metrics
We consider three baseline models in the experiments. They are based on different pre-trained language models, including BERT_base (Devlin et al., 2019), ERNIE (Sun et al., 2019) and RoBERTa_large (Liu et al., 2019). The hyperparameters of our baseline models are given in Appendix D. Following Rajpurkar et al. (2016), we use exact match (EM) and F1-score to evaluate the held-out accuracy of an MRC model. All the metrics are calculated at the Chinese character level, and we normalize both the predicted and reference answers by removing spaces and punctuation marks.

Table 4 shows the baseline results on the in-domain development set, in-domain test set, and challenge test set. The baseline performance is close to human performance on the in-domain test set, whereas the gap between baseline and human performance on the challenge test set is much larger. The method for calculating human performance is described in Appendix E.
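For reference, character-level EM and F1 can be computed as in the following sketch; the exact set of punctuation marks removed during normalization is an assumption, not taken from the paper.

```python
import string
from collections import Counter

# ASCII punctuation plus common full-width Chinese punctuation (assumed set).
PUNCT = set(string.punctuation) | set("，。！？；：“”‘’（）【】《》、")

def normalize(text):
    # Remove whitespace and punctuation before comparing answers.
    return "".join(ch for ch in text if not ch.isspace() and ch not in PUNCT)

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def char_f1(prediction, reference):
    pred_chars = list(normalize(prediction))
    ref_chars = list(normalize(reference))
    common = Counter(pred_chars) & Counter(ref_chars)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_chars)
    recall = num_same / len(ref_chars)
    return 2 * precision * recall / (precision + recall)
```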

Main Results
We further evaluate the baselines on the three challenge subsets for over-sensitivity, over-stability and generalization separately. Table 5 shows the results. We find that baseline performance declines significantly on the over-stability and generalization subsets (compared to the in-domain test set in Table 4). In contrast, the baseline performance degrades less on the over-sensitivity subset, although there is still a noticeable gap.

Discussion 1: Over-sensitivity
First, we calculate the different prediction rates (DPRs) of the baselines on the over-sensitivity subset. DPR measures the percentage of paraphrased questions that yield predictions different from those of the original questions; it is formally defined in Appendix F. Table 6 presents the DPRs of the baselines on the over-sensitivity subset. The baselines obtain DPRs of around 16% to 22%, which demonstrates that they are sensitive to part of the paraphrased questions.
Second, we examine a hypothesis: if a paraphrased question contains more words rephrased from the original question, the MRC model is more likely to produce a different answer. To measure how similar a paraphrased question is to the original question, we use the F1-score; a low F1-score means that many words in the original question have been rephrased. We divide the paraphrased questions into buckets based on their similarity to the original questions, and then examine whether there is a correlation between DPR and F1-score similarity. From Figure 1, we observe that the DPRs of all the baselines are negatively correlated with the F1-score similarity between the original and paraphrased questions. The results confirm the hypothesis.

Discussion 2: Over-stability
MRC models might be easily misled by trap spans that share many words with the questions. In this section, we examine whether there is a correlation between MRC performance (F1-score) and question-trap similarity. We divide the trap spans into buckets based on their similarity to the questions. According to Figure 2, the performance of the base models decreases as the similarity increases, and the large model (RoBERTa_large) is less over-stable than the base ones.


Discussion 3: Generalization

We evaluate the baselines on the two out-of-domain subsets, education and finance. The results show that the baselines perform poorly in both domains. Additionally, we examine how the baseline models behave in the education domain. Table 8 shows the performance of RoBERTa_large on the four topics in the education domain. The model performs much worse on math and chemistry, since these topics are rare in the training set. This analysis suggests that domain knowledge is a key factor affecting the generalization ability of MRC models. More discussion can be found in Appendix G.

Conclusion
In this paper, we create a Chinese dataset, DuReader_robust, and use it to evaluate both the robustness and generalization of MRC models. Its questions and documents are natural texts from Baidu search, and it thus reflects the robustness and generalization challenges found in real-world applications. Our experiments show that MRC models based on pre-trained LMs do not perform well on the DuReader_robust challenge test set. We also conduct extensive experiments to examine the behaviors of MRC models on the dataset and provide insights for future model development.

Ethical Considerations
We aim to provide researchers and developers with a dataset, DuReader_robust, to improve the robustness and generalization ability of MRC models. We also take potential ethical issues into account.
(1) All the instances in DuReader_robust have been desensitized.
(2) Regarding labor compensation, we make sure that all the crowdsourcing workers are fairly compensated.

A Paraphrase Retrieval Toolkit
We use a paraphrase retrieval toolkit to obtain paraphrase questions. The toolkit is used internally at Baidu, and our manual evaluations show that the accuracy of the retrieval results is around 98%. The paraphrase retrieval toolkit consists of two basic modules as follows.
• Paraphrase Candidate Retriever The retriever is a light-weight module. Given a question, the retriever will retrieve top-k paraphrase candidates from the search logs of Baidu Search.
• Paraphrase Candidate Re-ranker The re-ranker is a model fine-tuned from ERNIE (Sun et al., 2019) on a set of manually labeled paraphrase questions. Given a set of retrieved paraphrase candidates, the re-ranker estimates the semantic similarity between the original question and each candidate. If the semantic similarity is higher than a pre-defined threshold, the candidate is used as a paraphrased question. A minimal sketch of this two-stage pipeline is given below.
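The sketch assumes `retrieve(question, k)` queries the search logs for the top-k candidates and `score_similarity(q1, q2)` is the fine-tuned re-ranker; these interfaces and the threshold value are illustrative, not the toolkit's actual API.

```python
def get_paraphrases(question, retrieve, score_similarity, k=10, threshold=0.9):
    """Two-stage paraphrase retrieval: fetch top-k candidates from search
    logs, then keep only candidates whose re-ranker similarity to the
    original question exceeds a pre-defined threshold."""
    candidates = retrieve(question, k)
    return [c for c in candidates if score_similarity(question, c) >= threshold]
```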

B The Illustration of Annotating Over-stability Instances

Figure 3 illustrates the annotation of an over-stability instance. In this instance, the answer to the original question is "30-40 minutes". The entity "5-10 minutes" has the same type as "30-40 minutes". The annotator raises a new question by revising the original question, and the answer to the new question is "5-10 minutes". The sentence containing "30-40 minutes" has many words in common with the new question, so it is considered a trap sentence. This trap sentence may mislead the model into predicting "30-40 minutes" as the answer to the new question.

C The Construction of Finance Data
We leverage a dataset that was originally designed for information extraction in the finance domain. The original dataset contains the full texts of financial reports as documents, together with structured data extracted from these texts. We use templates to generate a question for each data field in the structured data, and then use the constructed instances for MRC. Each instance contains (1) a question generated from the template for a data field, (2) an answer, which is the value of the data field, and (3) the document from which the value (i.e., the answer) is extracted.
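The construction can be illustrated with the hypothetical sketch below; the field names and Chinese question templates are invented for illustration and are not the actual templates used to build the finance subset.

```python
# Hypothetical field-to-question templates (not the real ones).
TEMPLATES = {
    "pledgor": "本次股权质押的出质人是谁？",            # Who is the pledgor?
    "pledgee": "本次股权质押的质权人是谁？",            # Who is the pledgee?
    "pledge_date": "本次股权质押的质押日期是什么时候？",  # What is the pledge date?
}

def build_finance_instances(report_text, extracted_fields):
    """Turn one financial report and its structured extraction into MRC
    instances: (templated question, report as context, field value as answer)."""
    instances = []
    for field, value in extracted_fields.items():
        # Only keep fields with a template and whose value is an extractable span.
        if field in TEMPLATES and value in report_text:
            instances.append({"question": TEMPLATES[field],
                              "context": report_text,
                              "answers": [value]})
    return instances
```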
Context: An adult walks at a speed of about 5-7 kilometers per hour, and it takes about 30-40 minutes to walk three kilometers. Driving in the city is about 30-40 km/h, and it takes less than 5-10 minutes to drive 3 km.
Original Question: How long does it take for an adult to walk three kilometers
Answer: 30-40 minutes
Reformed Question: How long does it take for an adult to drive three kilometers
Answer: 5-10 minutes
Figure 3: The illustration of annotating an over-stability instance.

D Hyperparameters
We use a number of pre-trained language models in our baseline systems. When fine-tuning different pre-trained language models, we use the same hyperparameters. The settings of hyperparameters are as follows. The learning rate is set to 3e-5 and the batch size is 32. We set the number of epochs to 5. The maximal answer length and document length are set to 20 and 512, respectively. We set the length of document stride to 128. All experiments are conducted on 4 Tesla P40 GPUs.
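For convenience, the reported hyperparameters are collected in a single configuration below; the key names are illustrative and not tied to any particular training library.

```python
# Fine-tuning hyperparameters reported above, gathered in one place.
FINETUNE_CONFIG = {
    "learning_rate": 3e-5,
    "batch_size": 32,
    "num_epochs": 5,
    "max_answer_length": 20,
    "max_document_length": 512,
    "doc_stride": 128,
    "num_gpus": 4,  # Tesla P40
}
```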

E Human Performance
We evaluate human performance on both the in-domain test set and the challenge test set. We randomly sample two hundred instances from the in-domain test set and three hundred instances from the challenge test set, and ask crowd-workers to provide answers to the questions in the sampled instances. Then, we take the EM and F1-scores of these annotated answers as the human performance.

F Different Prediction Rate
Different prediction rate (DPR) measures the percentage of paraphrase questions whose predictions are different from the original questions. Formally, we define DPR of a neural model f (θ) on a dataset D as follows.
Q where, f (θ; q) denotes the prediction of the trained MRC model f (θ). Q represents a set of pairs of   original question q and paraphrased question q in dataset D, and 1[ * ] is an indicator function. A high DPR score means that the MRC model is overly sensitive to the paraphrased question q , otherwise insensitive.
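A direct translation of this definition into code might look like the following sketch, where `predict(question, context)` stands for the trained model f(θ) and `contexts[i]` is the passage shared by the question pair `pairs[i]`; the names are illustrative.

```python
def different_prediction_rate(pairs, contexts, predict):
    """pairs[i] is an (original_question, paraphrased_question) tuple and
    contexts[i] is the shared passage; predict(question, context) returns
    the model's answer span."""
    changed = sum(predict(q, p) != predict(q_para, p)
                  for (q, q_para), p in zip(pairs, contexts))
    return changed / len(pairs)
```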

G.1 Over-sensitivity Analysis
We further analyze the prediction results to figure out what kinds of paraphrases lead to different predictions. We find five types of paraphrasing phenomena: (1) word reordering (WR), (2) replacement of function words (RF), (3) substitution by synonyms (SS), (4) inserting or removing content words (AD), and (5) a combination of more than one of the above types in one paraphrase (CO). We randomly sample one hundred instances from the over-sensitivity subset and analyze the changes in the predictions of ERNIE (Sun et al., 2019). As shown in Table 9, most of the changed predictions come from AD and CO. This analysis suggests that the models are sensitive to changes of content words.

G.2 Generalization Analysis
In the previous section, we analyzed the behaviors of the baseline systems in the education domain. In this section, we analyze the finance domain. The finance-domain data covers two report types: management changes and equity pledges. The performance of RoBERTa_large on management changes and equity pledges is 68.63% and 49.15% respectively. The model generalizes well on management changes, since the training set contains relevant knowledge about questions asking for person names. By contrast, the model performs worse on equity pledges. We classify the equity-pledge instances into five sets according to their question types. Table 10 shows the performance of RoBERTa_large on the five question types. We observe that the model performs worst on questions about company abbreviations, the pledgee and the pledgor, since there is little such domain knowledge in the training set. By contrast, the model performs better on questions about amounts and dates, since the model has already learnt relevant knowledge from the training set.