A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets

Low-quality data can cause downstream problems in high-stakes applications. The data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs), as well as for domain-specific models, which are usually small in size because it is costly to engage a large number of domain experts for their creation. It is therefore vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets (code and datasets are available at https://github.com/IvaBojic/framework). We applied the proposed framework to four biomedical datasets and showed relative improvements of up to 33%/40% for fine-tuning of retrieval/reader models on the BioASQ dataset when using back translation to enhance the original dataset quality.


Introduction
The data-centric approach emphasizes the collection of high-quality data as a centrally important step in model development (Jarrahi et al., 2022). While model-centric approaches were more prominent in the past, data-centric approaches have recently been gaining importance (Xu et al., 2021; Liu et al., 2021). This trend has been especially pronounced since 2021, when Andrew Ng launched his campaign for a more data-centric approach to AI by starting a data-centric competition, which encouraged participants to increase accuracy by solely improving the datasets while keeping the model fixed.
Large Language Models (LLMs), such as Generative Pre-trained Transformer 3 (GPT-3) (Floridi and Chiriatti, 2020), generate text that is grammatically correct, fluent, and informative. However, there is little to no control over the data that were used for model training. Consequently, LLMs are prone to hallucinating and producing untruthful outputs (Evans et al., 2021). Ironically, this partly reflects how well LLMs learn the training distribution, including its falsehoods, and consequently follow an inverse scaling law (Lin et al., 2021). And while some recent research efforts focus on providing explanations of where an LLM's outputs came from (Menick et al., 2022), such research is in its infancy.
In this work, we focus on language models with a Transformer encoder architecture, such as BERT (Devlin et al., 2018), that extract relevant outputs from a domain-specific, evidence-based text corpus. Deep neural networks trained on domain-specific datasets, including those used in Natural Language Processing (NLP), are heavily dependent on the quality of the training dataset, which is usually small in size (Zarcone et al., 2021) as it is costly to engage a large number of domain experts for annotation. It is thus important to create high-quality training data for language models to perform better. In this paper, we propose a data-centric framework for Machine Reading Comprehension (MRC) datasets that increases the original dataset quality both by (i) enhancing it while keeping its size fixed, and (ii) augmenting it with new training samples.
MRC is a Natural Language Understanding (NLU) task. Its goal is to answer questions based on the information provided in a passage (Zhang et al., 2020). Training datasets for MRC models come in the form of triplets: passage (i.e., positive context), question, and answer. Typically, the MRC pipeline works in two phases, where a passage retriever is followed by a passage reader (Chen et al., 2017). For a given question, the retriever first extracts a set of relevant passages from a knowledge base (i.e., text corpus), and then the reader selects an answer (e.g., a text span) from one of the retrieved passages (Zhu et al., 2021).
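To make this two-phase setup concrete, the following is a toy sketch of a dense retriever followed by an extractive reader. The checkpoints ("facebook/dpr-*" and "deepset/roberta-base-squad2") are generic, publicly available placeholders used only for illustration; they are not the models fine-tuned in this paper.

```python
# Toy sketch of the retriever-then-reader MRC pipeline (illustrative checkpoints).
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
    pipeline,
)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")


def answer(question, corpus, top_k=1):
    # Phase 1: retrieve the passage most similar to the question (dot-product similarity).
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
    c_emb = c_enc(**c_tok(corpus, return_tensors="pt", padding=True, truncation=True)).pooler_output
    best = torch.matmul(q_emb, c_emb.T).squeeze(0).topk(top_k).indices.tolist()
    # Phase 2: the reader extracts an answer span from the top retrieved passage.
    return reader(question=question, context=corpus[best[0]])["answer"]
```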
Data-centric approaches can be divided into (i) data quality enhancement methods that keep the original size of the dataset fixed (e.g., data filtering or label consistency checking), and (ii) data augmentation methods that increase the original dataset size (i.e., adding more training samples). Results from the literature on using data-centric approaches to improve model performance in MRC are inconclusive.
Several studies have reported that data filtering can lead to significant model improvements (Dou et al., 2020; Sanyal et al., 2021; Mollá, 2022). However, this might not hold if data are filtered in a random way (Firsanova, 2021). Additionally, while increasing labelling consistency and excluding or cleaning noisy data points were shown to improve model performance on the BioASQ dataset (Yoon et al., 2022), shortening answers in AASDQA led to a 4% decrease in F1-score (Firsanova, 2021).
The adoption of data augmentation is still comparatively less explored in NLP (Feng et al., 2021), with a body of work presenting positive results (Kaushik et al., 2019; Khashabi et al., 2020; Qin et al., 2020; Pappas et al., 2022) as well as papers showing little or no improvement for the given task (Huang et al., 2020; Chopard et al., 2021; Okimura et al., 2022).
To the best of our knowledge, this paper is the first to propose a framework for improving domain-specific MRC datasets both by (i) enhancing data quality while keeping the original dataset size the same and (ii) increasing the original dataset size using augmentation methods. Our framework includes methods for (i) a better selection of negative passages for retriever training, and (ii) reformulating questions using paraphrasing, word substitution, and back translation.
Paraphrasing, word substitution, and back translation were previously used as data augmentation methods in various NLP tasks (Liu and Hulden, 2021; Pappas et al., 2022; Ishii et al., 2022). However, those papers did not look at how each of these methods enhances the original dataset without increasing its size. Keeping the size of the dataset fixed is necessary in resource-constrained scenarios, as the resources (e.g., time) needed for fine-tuning are proportional to the size of the training sets. Moreover, previous studies did not present a cost-benefit analysis weighing the resources needed to generate extended training sets and perform fine-tuning against the resulting performance increase.

A Data-centric Framework for MRC
In our framework, we first generate new training sets using four data quality enhancement methods and fine-tune retrieval and reader models on each new training set individually. Second, we fine-tune retrieval/reader models continually, starting from the best individual checkpoint and using the enhanced training sets that showed improvements in the first step. Finally, we create new augmented datasets by concatenating all training sets that showed fine-tuning improvements in the first step.
Labels in MRC datasets are triplets consisting of a passage, a question, and an answer. In MRC datasets, the answer is part of a passage, which is also called the positive context. To fine-tune a retrieval model as proposed in (Karpukhin et al., 2020), it is necessary to provide not only the positive context (i.e., the passage that contains the answer to a given question), but also negative contexts. Some previous work employed a method of randomly selecting negative contexts from a text corpus (Bojic et al., 2022). Here we propose a method that improves on this random selection of negative contexts.
One of the problems with manually collecting labels for MRC datasets is that questions can be too similar to their answers (Rajpurkar et al., 2018). To address this, we investigate the use of three different methods applied to the original set of questions: (i) paraphrasing, in which we use two different language models fine-tuned for paraphrasing; (ii) word substitution, in which we use two libraries, one to extract a keyword from a given question and another to obtain a list of synonyms for the chosen keyword; and (iii) back translation, in which we use 25 different machine translation language models to translate a source text into another language and back into the original language.

Negative Contexts
To enhance the quality of the negative contexts for each passage, we implemented the following procedure. For each positive context, passages were sorted in ascending order of BERTScore (Zhang et al., 2019) similarity with the positive context, and the ones with the lowest scores were kept to form the negative contexts. A global counting dictionary was maintained to prevent the replication of negative contexts across different training examples, ensuring that no negative context exceeded a threshold number of occurrences across the whole dataset.
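The snippet below is a minimal sketch of this selection procedure, assuming the publicly available bert-score package; the threshold of 30 occurrences and the number of negatives per example (five) are illustrative assumptions rather than values prescribed by the paper.

```python
# Hedged sketch: rank candidate passages by BERTScore similarity to the
# positive context (ascending) and keep the least similar ones, while a
# global counter caps how often any passage is reused as a negative.
from collections import defaultdict
from bert_score import score as bert_score


def select_negatives(positive_ctx, corpus, n_neg=5, max_uses=30, usage=None):
    usage = usage if usage is not None else defaultdict(int)
    candidates = [p for p in corpus if p != positive_ctx and usage[p] < max_uses]
    # BERTScore F1 between each candidate and the positive context.
    _, _, f1 = bert_score(candidates, [positive_ctx] * len(candidates), lang="en")
    ranked = sorted(zip(candidates, f1.tolist()), key=lambda x: x[1])  # least similar first
    negatives = [p for p, _ in ranked[:n_neg]]
    for p in negatives:
        usage[p] += 1  # update the global counting dictionary
    return negatives, usage
```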

Questions
In this section, we describe the various techniques used to augment the questions from MRC datasets.
For question paraphrasing, we used two models: T5 and Pegasus. To enhance the data quality of an original dataset, for each original question we used the two aforementioned models to generate up to five paraphrased questions. Subsequently, we created five different training sets in which we grouped the most, second most, up to the least similar paraphrase for each original question together. The word similarity was calculated using a word vector model from spaCy. We also generated a sixth set comprising a randomly selected question from the list of five unique paraphrases generated.
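As a rough illustration, the sketch below shows this generation step with the Hugging Face transformers and spaCy libraries. The paraphrasing checkpoint ("tuner007/pegasus_paraphrase"), beam settings, and maximum length are assumptions for the example and not necessarily those used in our experiments.

```python
# Hedged sketch: generate up to five unique paraphrases of a question and
# order them by spaCy word-vector similarity to the original (most similar first).
import spacy
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nlp = spacy.load("en_core_web_md")  # word-vector model used for similarity
MODEL = "tuner007/pegasus_paraphrase"  # illustrative paraphrasing checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)


def paraphrase(question, n=5):
    inputs = tok(question, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, num_beams=10, num_return_sequences=n, max_length=64)
    candidates = list(dict.fromkeys(tok.batch_decode(outputs, skip_special_tokens=True)))
    ref = nlp(question)
    return sorted(candidates, key=lambda q: ref.similarity(nlp(q)), reverse=True)
```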
In the word substitution process, we extracted a keyword from each question with the help of the spaCy library and obtained a list of synonyms for each keyword using the Natural Language Toolkit (NLTK)'s English dictionary, WordNet. The top five synonyms were extracted from this list in descending order of word similarity, calculated using the aforementioned word vector model from spaCy. We then generated five versions of the training data for each dataset such that in set 1, the keyword for each question was replaced by its most similar synonym; in set 2, the keyword for each question was replaced by its second most similar synonym; and so forth, with set 5 containing the questions with the least similar synonyms as substitutes. For keywords with n < 5 synonyms, we kept the question unchanged in the first (5 - n) versions and used the synonyms as substitutes in the remaining n versions. We also created a sixth set in which we randomly selected one of the top five (or n) synonyms to substitute the keyword for each question.
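A minimal sketch of this step is shown below, assuming spaCy for keyword extraction and similarity and NLTK's WordNet for synonyms; taking the first noun as the keyword is an illustrative simplification rather than the exact heuristic used.

```python
# Hedged sketch: pick a keyword, collect WordNet synonyms, rank them by
# word-vector similarity, and pad with the unchanged question when fewer
# than top_k synonyms exist (mirroring the (5 - n) rule described above).
import spacy
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") once

nlp = spacy.load("en_core_web_md")


def ranked_substitutions(question, top_k=5):
    doc = nlp(question)
    keyword = next((t for t in doc if t.pos_ == "NOUN"), None)  # assumed heuristic
    if keyword is None:
        return [question] * top_k
    synonyms = {l.name().replace("_", " ") for s in wn.synsets(keyword.text) for l in s.lemmas()}
    synonyms.discard(keyword.text)
    ranked = sorted(synonyms, key=lambda w: keyword.similarity(nlp(w)[0]), reverse=True)[:top_k]
    return [question] * (top_k - len(ranked)) + [
        question.replace(keyword.text, syn) for syn in ranked
    ]
```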
We used the Hugging Face Helsinki-NLP models for back translation. In total, we generated 25 different training sets, one per language. Languages were selected by taking the 25 most downloaded English-to-target translation checkpoints for which a translation model from the target language back to English was also available.
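The following sketch illustrates one round trip of back translation with the Helsinki-NLP opus-mt models; Catalan is used here only as an example, and the checkpoint names follow the public hub naming convention rather than being prescribed by the paper.

```python
# Hedged sketch: translate a question English -> Catalan -> English to obtain
# a reformulated (back-translated) version of the original question.
from transformers import pipeline

to_target = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ca")
to_english = pipeline("translation", model="Helsinki-NLP/opus-mt-ca-en")


def back_translate(question, max_length=128):
    intermediate = to_target(question, max_length=max_length)[0]["translation_text"]
    return to_english(intermediate, max_length=max_length)[0]["translation_text"]
```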
To understand how different the resulting questions obtained from each of the enhancement methods are, we performed pairwise comparisons between questions from each method using ROUGE-1. Results are shown in Appendix B.7. Back translation overall yields the questions most different from the baseline and from the other enhancement methods.
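A sketch of this pairwise comparison is given below, assuming the rouge-score package; the paper does not name the implementation used, so this is only one way to compute the ROUGE-1 similarity between aligned question sets.

```python
# Hedged sketch: average ROUGE-1 F1 between aligned question lists produced
# by two enhancement methods (or between a method and the original questions).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)


def avg_rouge1(questions_a, questions_b):
    scores = [
        scorer.score(a, b)["rouge1"].fmeasure for a, b in zip(questions_a, questions_b)
    ]
    return sum(scores) / len(scores)
```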

Answers
Since MRC relies on extracting the exact answer (i.e., a text span) from a passage, we could not apply any of the automatic data quality enhancement methods that we applied to questions (as explained in the previous section). However, we created new training datasets in which we manually shortened the original answers wherever appropriate. This is explained further in Appendix A.3.

Datasets
To test our framework, we made adjustments (see Appendix A) to four biomedical datasets: BioASQ (Lamurias et al., 2020), COVID-QA (Möller et al., 2020), cpgQA (Mahbub et al., 2023) and SleepQA (Bojic et al., 2022). We refer the reader to Table 1 for statistics on the final versions of the datasets used in all experiments: original/final size of the text corpus, original/final number of labels and, finally, the train/dev/test split.
The original BioASQ dataset contained over 3k manually annotated biomedical labels. Questions in this dataset came in four different flavours: factoid, list, yes/no, and summary. We extracted only factoid questions for which the exact answer can be found in the positive context. The original COVID-QA dataset was annotated by biomedical experts and contained 2k labels on COVID-19 pandemic-related topics. The original cpgQA dataset contained 1k manually annotated labels in the domain of clinical practice guidelines. The original SleepQA dataset was a manually annotated dataset in the sleep domain with 5k labels.

We evaluated our framework by performing fine-tuning of retrieval and reader models using BioLinkBERT (Yasunaga et al., 2022) and BioBERT-BioASQ, respectively. We used BioLinkBERT for retrieval model fine-tuning as it was recently shown to achieve state-of-the-art performance on low-resource bio-MRC tasks (Mahbub et al., 2023).
BioBERT-BioASQ was used for fine-tuning of the reader model, as proposed in (Bojic et al., 2022). Intrinsic evaluation of the fine-tuned models was done using automatic metrics on the test sets: recall@1 for retrieval models and Exact Match (EM) for reader models.
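For concreteness, a hedged sketch of these two metrics is given below; the answer normalization follows the common SQuAD-style convention (lowercasing, punctuation and whitespace stripping), which is an assumption rather than a detail specified in the paper.

```python
# Hedged sketch of the intrinsic metrics: recall@1 for the retriever and
# Exact Match (EM) for the reader, with simple SQuAD-style normalization.
import re
import string


def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def recall_at_1(top1_passage_ids, gold_passage_ids):
    # Fraction of questions whose top-ranked passage is the gold passage.
    hits = sum(1 for top1, gold in zip(top1_passage_ids, gold_passage_ids) if top1 == gold)
    return hits / len(gold_passage_ids)


def exact_match(predicted_spans, gold_answers):
    # Fraction of predicted spans matching the gold answer after normalization.
    hits = sum(1 for p, g in zip(predicted_spans, gold_answers) if normalize(p) == normalize(g))
    return hits / len(gold_answers)
```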

Fine-tuning on Enhanced Training Sets
Table 2 and Table 3 show recall@1/EM scores, respectively, for fine-tuned retrieval/reader models after enhancing the method of selecting negative contexts (i.e., using BERTScore) for the retrieval training datasets, as well as after reformulating questions using paraphrasing, word substitution and back translation, and after answer shortening, for the training datasets of both models. More specifically: • The first row (baseline) in each table shows the results of BioLinkBERT/BioBERT-BioASQ models fine-tuned on the original datasets.
• Each subsequent row shows the best results for each dataset using the four aforementioned methods for enhancing negative contexts (only for the retrieval models) and questions (for both models).
• The following row (answer shortening) shows recall@1/EM scores for fine-tuning models on the training datasets in which the original answers were manually shortened where needed.
• The following row (continual) shows the results of continual fine-tuning: starting from the best individual checkpoint, we fine-tune on the second-best training set, and so on. For example, for reader fine-tuning on the BioASQ dataset, we first took the checkpoint from fine-tuning on the training set created using paraphrasing and then continued fine-tuning on the training set created using back translation. Finally, we took the newest checkpoint and continued fine-tuning on the training set created using word substitution.
• The last row (augmentation) shows recall@1/EM scores for fine-tuning models on training datasets created by concatenating all enhanced training sets that showed fine-tuning improvements when used individually (i.e., rows 2-6 for retrieval and rows 2-5 for reader models).
For retrieval fine-tuning (Table 2), the most significant improvement over the baseline, +8.3 (+33%), was achieved for the BioASQ dataset when using back translation through Catalan. The enhanced method of selecting negative contexts and word substitution improved all four datasets, while paraphrasing and back translation caused a decrease in recall@1 scores for the SleepQA dataset. Continual retrieval fine-tuning showed improvements over the baselines for all datasets; however, only for the COVID-QA and cpgQA datasets was it better than the best results of individual fine-tuning.

For fine-tuned reader models (Table 3), the most significant improvement over the baseline, +2.1 (+40%), was achieved for the BioASQ dataset when using back translation through Dutch, as well as paraphrasing. Continual reader fine-tuning increased the EM score over the corresponding baseline only for the cpgQA dataset. Lastly, augmentation was better than the best results of individual fine-tuning only for the SleepQA dataset, with a total increase of 2.6 (+4%). The greater relative improvements with back translation compared to the other methods could be explained by this method creating more diverse questions (Appendix B.7). However, the gains from back translation are inconsistent from one dataset to another. Moreover, we noticed that back translation and paraphrasing with Pegasus produced questions noticeably more different from the originals than the other data enhancement techniques.

Cost-benefit Analysis
In total, the data-centric methods described above enabled us to generate 28 and 24 enhanced training sets for retrieval and reader fine-tuning, respectively. Subsequently, we fine-tuned all retrieval/reader models on a single NVIDIA A40 GPU with 46GB of GPU RAM. Table 4 and Table 5 show the time spent on fine-tuning. For example, we used one GPU for five hours to fine-tune the retriever model on the BioASQ dataset and achieve a 33% improvement in recall@1 score. Meanwhile, we used one GPU for 22 hours to fine-tune the retriever model on the SleepQA dataset, only to see a 2% decrease in recall@1 score.
The last two rows in the tables show the time needed for the continual/augmentation fine-tuning only. However, in order to determine the order in which to fine-tune for continual learning, or which datasets to use for concatenation, all individual checkpoints need to be created first. Hence, to obtain the total time for continual learning/augmentation, one needs to add the times from all previous rows as well.

Discussion and Conclusions
It is estimated that over 92% of data scientists working in the Artificial Intelligence field have encountered the "data cascades" phenomenon, which denotes downstream problems resulting from low-quality data (Sambasivan et al., 2021). One way to improve original dataset quality is to adopt a data-centric approach. In this paper, we showed that by enhancing the quality of original datasets, one can achieve fine-tuning performance improvements for small datasets (e.g., biomedical datasets). However, the results suggest that the effects of data quality enhancement methods on performance are small, and that performance on the test data deteriorates in many cases.
Despite the inconsistency with which the data-centric methods used in this paper yield improvements, two positive conclusions can be drawn. First, when taking into consideration the time needed to create data-enhanced training sets as well as the performance improvements in fine-tuning, the word substitution method is the best, supporting previous findings (Feng et al., 2019; Pappas et al., 2022). Unlike the other methods, word substitution is not model-based and thus runs in a few minutes rather than a few hours, as back translation and paraphrasing do. Second, the best relative improvements were achieved for the BioASQ dataset, which has the smallest number of labels, a finding similar to that presented in (Okimura et al., 2022).
In addition to the data-centric methods discussed in this work, there are other techniques such as pseudo-labelling (Abney, 2007; Ruder and Plank, 2018; Cui and Bollegala, 2019; Zhu and Goldberg, 2022), data selection (Axelrod et al., 2011; Plank and Van Noord, 2011; Ruder and Plank, 2017), and pre-training methods (Han and Eisenstein, 2019; Guo et al., 2020). In future work, we will investigate whether those techniques produce more reliable and consistent results across different datasets. Moreover, we will also consider approaches that combine aspects of multiple techniques, resulting in hybrid data-centric techniques as proposed in (Ramponi and Plank, 2020).

A.1 Dataset Construction
In this subsection, we describe how we built the final versions of the datasets from Table 1. Where necessary, we divided passages from the original text corpus into one or more parts, so that their length was less than 300 words. This step was done so that all passages were of a similar length across different datasets and the same model hyperparameters could be used for fine-tuning retrieval and reader models. We then removed those labels for which the answer could not be found in the corresponding positive context. Finally, we divided each original dataset into three parts (in the ratio of 80:10:10) to create training, development, and test sets. Table 1 shows the original number of passages in each text corpus, the original number of labels, and the final numbers after the aforementioned adjustments were made.
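A minimal sketch of this preparation, under the stated constraints (at most 300 words per passage, answer-in-context filtering, 80:10:10 split), is shown below; the word-boundary chunking, the case-insensitive containment check, and the fixed random seed are illustrative assumptions.

```python
# Hedged sketch of the dataset construction steps described above.
import random


def chunk_passage(passage, max_words=300):
    # Split a passage into consecutive chunks of at most max_words words.
    words = passage.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def filter_labels(labels):
    # Keep only (passage, question, answer) triplets whose answer occurs in the passage.
    return [(p, q, a) for p, q, a in labels if a.lower() in p.lower()]


def split_80_10_10(labels, seed=42):
    labels = list(labels)
    random.Random(seed).shuffle(labels)
    n = len(labels)
    return labels[: int(0.8 * n)], labels[int(0.8 * n): int(0.9 * n)], labels[int(0.9 * n):]
```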

A.2 Data Cleaning
BioASQ: The original dataset did not include positive passages, but instead contained links to the journal articles where the answers can be found. To obtain positive passages, we first retrieved them from the individual links provided in the dataset and then divided them into passages of no longer than 300 words. Only triplets that contain the exact answers in the retrieved passages were included in the final dataset. One challenge we encountered was that, of the 5,821 triplets of the factoid type identified, only 16% had exact answers that could be found in the provided passages.
cpgQA: To prepare the text corpus, we partitioned passages into segments of no more than 300 words, resulting in a corpus of 235 passages.
Unfortunately, this division caused some answers to be separated from their corresponding positive contexts due to issues such as inaccurate sentence tokenization and answer fragmentation between two adjacent passages.These discrepancies were addressed through manual intervention.It should be noted that no labels were excluded from the original dataset as a result of this cleaning procedure.
SleepQA: The original dataset already contained passages shorter than 300 words, and all answers were found in their provided passages. We eliminated leading and trailing spaces and changed all letters to lowercase.

A.3 Shortening Answers
BioASQ: The original answers varied from two to more than 120 words in length. Our focus was on shortening the answers which were excessively long, and thus all answers longer than 30 words were manually reviewed. The primary adjustments made to the answers involved isolating the main response to the corresponding question, thereby truncating lengthy sentences into shorter phrases. This approach effectively reduced answer length for both the test and training sets by a significant degree. The mean answer length for the training set decreased from 30.9 to 17.6 words (Figure 1), while the mean answer length for the test set decreased from 26.1 to 18.4 words (Figure 2).

COVID-QA: In the original dataset, the length of the answers was not more than 120 words. However, some answers contained incomplete words at the beginning and/or end of sentences. To improve the dataset's accuracy, these words were either manually removed or completed. Moreover, scientific abbreviations were eliminated manually to improve the accuracy of exact matches. Unfortunately, this had no significant effect on the mean length of answers for either the training or test set. This result can be attributed to the prevalence in the training set of sentences with only one or two abbreviations. In other cases, completing the incomplete words also had no effect on the mean word count.

cpgQA: In both the training and test sets, answers were shortened manually by removing extraneous phrases and articles (such as "a/an/the") from the beginning of the responses. After shortening, the mean answer length in the training set reduced from 12.7 words to 12.4 words, whereas for the test set, the mean answer length reduced from 12.1 words to 11.6 words. The minimal difference in the mean number of words is due to the fact that most answers in the original dataset were clear and concise.

SleepQA: The initial average answer lengths for the SleepQA dataset were 10.15 and 9.13 words for the training and test sets, respectively, making it the dataset with the shortest average answer length among all datasets studied. We focused on cutting down answers more than 15 words long, which ranged up to 40 words. This was done by extracting the main phrases of the answers that directly respond to the associated questions. The resulting cleaned answers are shorter, more concise phrases instead of wordy full sentences. The final average answer lengths after the cleaning process are 9.11 and 8.01 words for the training and test sets, respectively.

B Evaluation

B.1 Model Hyperparameters
Hyperparameters of retrieval model fine-tuning are shown in Table 6, and those of reader models in Table 7. When fine-tuning retrieval models on training sets in which the method of selecting the negative contexts for each passage was enhanced, we changed the other negatives hyperparameter to reflect the number of negative contexts in the corresponding training set (e.g., 1 to 5). Additionally, when fine-tuning reader models on different datasets, we set the eval step to 50 for the BioASQ, COVID-QA and cpgQA datasets and to 500 for the SleepQA dataset. The reason is that the SleepQA dataset has 4,000 labels in the train set, while the other datasets have fewer than 1,000 labels. For continual fine-tuning, we set the number of training epochs to 60 for the retrieval models and to 30 for the reader models. Other parameters were left the same.

B.2 Negative Contexts
Using the enhanced method of selecting negative contexts, we produced five different training sets for each dataset (see Table 8). Although this method generally produced enhanced training sets for each dataset, it is not possible to conclude which number of negatives improves the fine-tuning process the most, as this is very much dataset-specific. The last row in Table 8 shows the time (in hours) needed to generate all five training sets for each dataset using an A100 GPU (40GB). While for most of the datasets the generation process took around one hour, for SleepQA it took more than one day.

Table 8: Automatic evaluation of fine-tuned retrieval models using recall@1 scores when using the enhanced method of selecting negative contexts.

B.3 Paraphrasing
For question paraphrasing, we used T5 and Pegasus as they are based on the Transformer architecture and utilize transfer learning, in which resource-rich sources can be efficiently adapted to resource-poor target fields, such as domain-specific datasets (Yu et al., 2018). Previous research showed that the Pegasus method produces paraphrases that are semantically more different, while the T5 method is found to keep more of the original meaning (Martín Galván et al., 2023). We found that Pegasus consistently produces the same set of paraphrased questions, regardless of the number generated. For T5, we generated paraphrased questions up to 50 times, after which we took the first five unique paraphrases. For several questions (between 3% for the cpgQA dataset and 12% for the COVID-QA dataset), T5 failed to produce the required number of unique paraphrases, in which cases we added the original question to the set of five paraphrased questions. Although we used two different libraries, question paraphrasing failed to enhance training set quality for the cpgQA dataset altogether. Generating the training sets took around 15 hours for the SleepQA dataset and 3 hours for the other datasets on one NVIDIA TESLA P100 GPU 16GB (Kaggle).

B.4 Word Substitution
Word substitution is the process of substituting similar words (such as synonyms or words with similar embeddings) into the original data (Pappas et al., 2022). This method for enhancing the original training sets increased almost all recall@1/EM scores for all datasets for both retrieval and reader fine-tuning, except for the reader models on the cpgQA and COVID-QA datasets. In cases where applying word substitution to the original dataset did not increase the EM scores for reader fine-tuning, the scores stayed the same as the corresponding baselines (i.e., this method did not worsen them). Moreover, generating the training sets took only 11 minutes for the SleepQA dataset and around two minutes for the other datasets on one NVIDIA TESLA P100 GPU 16GB (Kaggle).

B.6 Mean and Standard Deviation
Table 17 shows the mean and standard deviation for different data quality enhancement methods for retrieval fine-tuning. Table 18 shows the mean and standard deviation for different data quality enhancement methods for reader fine-tuning.

B.7 Similarity Between Enhancement Methods
In the following tables, we show the average similarity, computed with the ROUGE-1 metric, between questions generated through each of the enhancement techniques, over all four datasets (BioASQ, COVID-QA, cpgQA, SleepQA), for retrieval (first four tables) and then reader (next four tables) fine-tuning.

Figure 1: Answer length (in number of words) before and after shortening answers for BioASQ training set.

Figure 2: Answer length (in number of words) before and after shortening answers for BioASQ test set.

Figure 3: Answer length (in number of words) before and after shortening answers for COVID-QA training set.

Figure 4: Answer length (in number of words) before and after shortening answers for COVID-QA test set.

Figure 5: Answer length (in number of words) before and after shortening answers for cpgQA training set.

Figure 7: Answer length (in number of words) before and after shortening answers for SleepQA training set.

Table 1: Dataset statistics for original and final versions.

Table 3: Results of fine-tuned reader models (EM). *Since no single enhancement method could improve the baseline on cpgQA, we discarded continual and augmentation fine-tuning on this dataset.

Table 4: Total time spent (in hours) vs. maximum relative recall@1 improvements of retrieval fine-tuning.

Table 5: Total time spent (in hours) vs. maximum relative EM improvements of reader fine-tuning.

Table 6: Hyperparameters of retrieval model fine-tuning.

Table 7: Hyperparameters of reader model fine-tuning.

Table 9: Average similarity index of each training set for each dataset, calculated using a word vector model from spaCy for paraphrasing.

Table 12: Average similarity index of each training set for each dataset, calculated using a word vector model from spaCy for word substitution.