Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfluencies are an under-studied topic in NLP, even though they are ubiquitous in human conversation. This is largely due to the lack of datasets containing disfluencies. In this paper, we present a new challenge question answering dataset, Disfl-QA, a derivative of SQuAD, where humans introduce contextual disfluencies into previously fluent questions. Disfl-QA contains a variety of challenging disfluencies that require a more comprehensive understanding of the text than was necessary in prior datasets. Experiments show that the performance of existing state-of-the-art question answering models degrades significantly when tested on Disfl-QA in a zero-shot setting. We show that data augmentation methods partially recover the loss in performance, and also demonstrate the efficacy of using gold data for fine-tuning. We argue that large-scale disfluency datasets are needed for NLP models to become robust to disfluencies. The dataset is publicly available at: https://github.com/google-research-datasets/disfl-qa.


Introduction
During conversations, humans do not always premeditate exactly what they are going to say; thus a natural conversation often includes interruptions like repetitions, restarts, or corrections. Together these phenomena are referred to as disfluencies (Shriberg, 1994). Figure 1a shows different types of conventional disfluencies in an utterance, as described by Shriberg (1994).
With the growing popularity of voice assistants, such disfluencies are of particular interest for goal-oriented or information-seeking dialogue agents, because an NLU system trained on fluent data can easily be misled by their presence.

* Work done during an internship at Google.

Repetition: When is Eas- ugh Easter this year?
Correction: When is Lent I meant Easter this year?
Restart: How much no wait when is Easter this year?

(a) Conventional categories of disfluencies (Shriberg, 1994). The reparandum (words intended to be corrected or ignored), interregnum (optional discourse cues), and repair are marked.

Passage: The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, . . .

(b) A passage and questions (q_i) from SQUAD, along with their disfluent versions (dq_i) and predictions from a T5-QA model.

Figure 1b shows how the presence of disfluencies in a question answering (QA) setting, namely SQUAD (Rajpurkar et al., 2018), affects the predictions of a state-of-the-art T5 model (Raffel et al., 2020). For example, the original question q_1 seeks the location of Normandy. In the disfluent version dq_1 (which is semantically equivalent to q_1), the user starts asking about Norse and then corrects themselves to ask about Normandy instead. The presence of this correctional disfluency confuses the QA model, which tends to rely on shallow textual cues from the question when making predictions.
Unfortunately, research in the NLP and speech communities has been impeded by the lack of curated datasets containing such disfluencies. The datasets available today are mostly conversational in nature and span a limited number of very specific domains (e.g., telephone conversations, court proceedings) (Godfrey et al., 1992; Zayats et al., 2014). Furthermore, only a small fraction of the utterances in these datasets contain disfluencies, with a limited and skewed distribution of disfluency types. In the most popular dataset in the literature, the SWITCHBOARD corpus (Godfrey et al., 1992), only 5.9% of the words are disfluencies (Charniak and Johnson, 2001), of which >50% are repetitions (Shriberg, 1996), which have been shown to be a relatively simpler form of disfluency (Zayats et al., 2014).
To fill this gap, we present DISFL-QA, the first dataset containing contextual disfluencies in an information-seeking setting, namely question answering over Wikipedia passages. DISFL-QA is constructed by asking human raters to insert disfluencies into questions from SQUAD-v2, a popular question answering dataset, using the passage and the remaining questions as context. These contextual disfluencies lend naturalness to DISFL-QA, and challenge models that rely on shallow matching between question and context to predict an answer. Some key properties of DISFL-QA are:
• DISFL-QA is a targeted dataset for disfluencies, in which all questions (≈12k) contain disfluencies, making for a much larger disfluent test set than prior datasets.
• Over 90% of the disfluencies in DISFL-QA are corrections or restarts, making it a much harder test set for disfluency correction ( §2.2).
• DISFL-QA contains a wider diversity of semantic distractors than earlier disfluency datasets, as well as newer phenomena such as coreference between the reparandum and the repair ( §2.3).
We experimentally reveal the brittleness of state-of-the-art LM-based QA models when tested on DISFL-QA in a zero-shot setting ( §4.1). Since collecting large supervised datasets containing disfluencies is expensive, we also evaluate different data augmentation methods for recovering the zero-shot performance drop ( §3.3). Finally, we demonstrate the efficacy of using the human-annotated data in varying fractions, for both end-to-end QA supervision and disfluency-generation-based data augmentation ( §4.2).
We argue that the creation of datasets such as DISFL-QA is vital for (1) improving the understanding of disfluencies, and (2) developing robust NLU models in general.
DISFL-QA: Adding Disfluencies to QA

DISFL-QA builds upon the existing SQUAD-v2 dataset, a question answering dataset containing curated paragraphs from Wikipedia and associated questions. Each question associated with a paragraph is sent to a human annotation task to add a contextual disfluency, using the paragraph as a source of distractors. Finally, to ensure the quality of the dataset, a subsequent round of human evaluation with an option to re-annotate is conducted.

Source of Questions
We sourced passages and questions from the SQUAD-v2 (Rajpurkar et al., 2018) development set. SQUAD-v2 is an extension of SQUAD-v1 (Rajpurkar et al., 2016) that contains unanswerable questions, written adversarially by crowd workers to look similar to answerable ones from SQUAD-v1. We use both answerable and unanswerable questions for each passage in the annotation task.

Annotation Task
To ensure a high-quality dataset, our annotation process consists of two rounds of annotation:

First Round of Annotation. Expert raters were shown the passage along with all the associated questions and their answers, with one of the question-answer pairs highlighted for annotation. The raters were instructed to use the provided context in crafting disfluencies, to make for a non-trivial dataset.
The rater had to provide a disfluent version of the question that (a) is semantically equivalent to the original question, and (b) is natural, i.e., a human could utter it in a dialogue setting. When writing the disfluent version of a question, we instructed raters not to include partial words or filled pauses (e.g., "um", "uh", "ah"), as these can be detected relatively easily. Raters were shown example disfluencies from each of the categories in Table 1. On average, raters spent 2.5 minutes per question. Introducing a disfluency increased the mean length of a question from 10.3 to 14.6 words.
Human Evaluation + Re-annotation. To assess and ensure the quality of the dataset, we asked another set of human raters the following yes/no questions: 1. Is the disfluent question consistent with the fluent question? I.e., is the disfluent question semantically equivalent to the original question, in that they share the same answer?
2. Is the disfluent question natural? Naturalness is defined in terms of human usage, grammatical errors, meaningful distractors, etc.
After the first round of annotation, the second pool of raters found the disfluent questions to be consistent and natural 96.0% and 88.5% of the time, with inter-annotator agreements of 97.0% and 93.0%, respectively. This suggests that the initial round of annotation resulted in a high-quality dataset. Furthermore, for the cases identified as either inconsistent or unnatural, we conducted a second round of re-annotation with updated guidelines to make the required corrections.

Categories of Disfluencies
To assess the distribution of different types of disfluencies, we sampled 500 questions from the training and development sets and manually annotated the type of disfluency introduced by the raters. Table 1 shows the distribution of these categories in the dataset. A notable difference between DISFL-QA and SWITCHBOARD (Godfrey et al., 1992) is that DISFL-QA contains a larger fraction of corrections and restarts, which have been shown to be the hardest disfluencies to detect and correct (Zayats et al., 2014). From Table 1, we can see that ≈30% and >65% of the disfluencies in DISFL-QA are restarts and corrections, respectively.
Modeling

We use two different modeling approaches to answer disfluent questions in DISFL-QA.
LMs for QA. We use BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020) as our QA models in the standard setup, which has been shown to achieve state-of-the-art performance on SQUAD. We fine-tune BERT for a span selection task, predicting start and end probabilities for all tokens in the context. T5 is fine-tuned under the standard text2text formulation: given (question, passage) as input, the model generates the answer as output. For predicting <no answer>, the model was trained to generate "unknown".
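As a concrete sketch, the text2text serialization for the T5-QA setup can be written as below. The "question:"/"context:" prefixes follow the common T5 convention for SQuAD and are our assumption, not necessarily the exact strings used here; only the "unknown" target is stated in the text.

```python
from typing import Optional

def format_t5_qa_input(question: str, passage: str) -> str:
    # Serialize the (question, passage) pair into a single text string,
    # following the common T5 text2text convention for SQuAD.
    return f"question: {question} context: {passage}"

def format_t5_qa_target(answer: Optional[str]) -> str:
    # Unanswerable questions are mapped to the literal token "unknown",
    # as described in the text.
    return answer if answer is not None else "unknown"
```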
LMs for Disfluency Correction. We also fine-tune the above LMs as disfluency correction models. Given a disfluent question as input, a correction model predicts the fluent question, which is then fed into a QA model. For BERT, we use the state-of-the-art BERT-based disfluency correction model of Jamshid Lou and Johnson (2020), trained on SWITCHBOARD. We also train T5 models on DISFL-QA to avoid the distribution skew between SWITCHBOARD and DISFL-QA, and to account for new phenomena like coreference.

Training Settings
We train the BERT and T5 variants on the following two data configurations: ALL where the model is trained on all of SQUAD-v2, including the non-answerable questions. Evaluation is done against the entire test set.
ANS where the model is trained only on answerable questions from SQUAD-v1, without the capability to handle non-answerable questions.

Datasets
Human Annotated Datasets. We use three datasets in our experiments: SQUAD-v1, SQUAD-v2, and DISFL-QA. We split the 11,825 annotated questions in DISFL-QA into train/dev/test sets containing 7182/1000/3643 questions, respectively. The split was done at the article level, such that questions belonging to the same passage fall in the same split. For zero-shot experiments, we use only the training set of SQUAD. Evaluation is done on the subset of the SQUAD-v2 development set that corresponds to the DISFL-QA test set, to ensure a fair comparison.
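A passage-grouped split of this kind can be sketched as follows; the `passage_id` field and the split fractions are illustrative placeholders, not the dataset's actual schema.

```python
import random
from collections import defaultdict

def article_level_split(examples, fractions=(0.6, 0.1, 0.3), seed=0):
    # Group examples by their source passage, then shuffle and split the
    # passages (not the individual questions) so that all questions from
    # one passage land in the same split.
    by_passage = defaultdict(list)
    for ex in examples:
        by_passage[ex["passage_id"]].append(ex)
    passage_ids = sorted(by_passage)
    random.Random(seed).shuffle(passage_ids)
    n = len(passage_ids)
    cut1 = int(fractions[0] * n)
    cut2 = cut1 + int(fractions[1] * n)
    buckets = (passage_ids[:cut1], passage_ids[cut1:cut2], passage_ids[cut2:])
    return tuple(
        [ex for pid in bucket for ex in by_passage[pid]] for bucket in buckets
    )
```

Splitting by passage rather than by question prevents near-duplicate contexts from leaking between train and test.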
Heuristically Generated Data. We also generate disfluencies heuristically, to validate the importance of human-annotated disfluencies. Inspired by the disfluency categories seen in our annotation task, we derive the following heuristics to augment our data with silver-standard disfluencies (silver because we cannot enforce the naturalness or semantic equivalence criteria of §2): (i) SWITCH-Q, which inserts the prefix of another question as a prefix to the original question, and (ii) SWITCH-X, where X can be a verb, adjective, adverb, or entity, inserted as a reparandum in the question.
To make the disfluencies contextual, we draw the reparanda from the context. For SWITCH-VERB/ADJ/ADV/ENT, this is done by picking tokens and phrases from the context passage. For SWITCH-Q, we use the other questions associated with the same passage. We use the spaCy (https://spacy.io/) NER and POS tagger to extract relevant entities and POS tags, and sample the interregnum from a list of fillers. Table 3 shows an example from each of the heuristics. Finally, we combine all the heuristics (ALL in Table 3) by uniformly sampling a single disfluent question from the set of possible transformations of each question.
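A minimal sketch of the SWITCH-ENT transformation is shown below. The filler list and function signature are illustrative assumptions; entity spans are passed in directly here, whereas our pipeline extracts them with spaCy.

```python
import random

FILLERS = ("no wait,", "I mean", "sorry,")  # illustrative interregnum list

def switch_ent_disfluency(question, target_entity, distractor_entity, rng=None):
    # Insert `distractor_entity` as a reparandum immediately before
    # `target_entity`, joined by a sampled filler (interregnum), e.g.
    # "When is Easter this year?" -> "When is Lent no wait, Easter this year?"
    if rng is None:
        rng = random.Random(0)
    idx = question.find(target_entity)
    if idx == -1:
        return question  # target not found; leave the question fluent
    filler = rng.choice(FILLERS)
    return question[:idx] + f"{distractor_entity} {filler} " + question[idx:]
```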

Evaluation Method
In all our experiments, we evaluate QA performance using the standard SQUAD-v2 evaluation script, which reports EM and F1 scores over the HasAns (answerable) and NoAns (non-answerable) slices, along with the overall scores. For brevity, we report only the F1 numbers, as we observed similar trends in EM and F1 across our experiments.
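For reference, the token-level F1 computed by the official SQuAD evaluation script can be sketched as follows (a re-implementation of the well-known metric, not the script itself):

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace --
    # the normalization applied by the official SQuAD evaluation script.
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, gold):
    # Token-overlap F1 between a predicted and a gold answer span.
    pred_toks = normalize(prediction).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        # Both empty (e.g. both <no answer>) scores 1, otherwise 0.
        return float(pred_toks == gold_toks)
    num_same = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```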

Experiments
We conduct experiments with DISFL-QA to answer the following questions: (a) Are state-of-the-art LM-based QA models robust to the introduction of disfluencies in questions under a zero-shot setting? (b) Can we use heuristically generated synthetic disfluencies to aid the training of QA models to handle disfluencies? (c) Given a small amount of labeled data, can we recover performance by fine-tuning the QA models, or by training a disfluency correction model to pre-process the disfluent questions into fluent ones before passing them to the QA models? (d) In the above setting, can we train a generative model to generate more disfluent training data? Table 4 shows the performance of the different variants, measuring their zero-shot capabilities.

Zero-Shot Performance
Performance of BERT-QA and T5-QA. We see from Table 4 that when tested directly on the heuristics and DISFL-QA test sets, both the BERT-QA and T5-QA models exhibit a significant performance drop compared to their performance on the fluent SQUAD benchmark. The performance drop is greater for the complete (ALL) models than for the answerable-only (ANS) models.

Disfluency Correction + T5-QA. We use the state-of-the-art BERT-based disfluency correction model (Jamshid Lou and Johnson, 2020) as a pre-processing step before feeding the input to our T5-QA model. The models trained on SWITCHBOARD are not able to close a significant portion of the performance gap, with the complete and answerable models recovering only 4.07 and 2.25 F1 points, respectively. We revisit this setting in the few-shot experiments.
DISFL-QA Test Set vs. Heuristics Test Set. Next, we compare performance on the heuristically generated disfluent questions against the human-annotated questions. In general, human-annotated disfluent questions cause a larger performance drop than the heuristics, across different models. A closer look at the T5-ALL model shows that DISFL-QA yields a bigger drop on HasAns cases and a smaller increase on NoAns cases, compared to the heuristics test set. For the T5-ANS model, DISFL-QA shows a larger drop in performance, attributable to the model picking the wrong answer span. Based on this, we hypothesize that the heuristics are able to confuse the models into over-predicting <no answer>, while DISFL-QA is superior at confusing the models into picking a different answer span altogether (as seen in Table 4 for models in the ANS setting). This demonstrates that collecting a dataset like DISFL-QA via human annotation holds value for contextual disfluencies.

We believe that the disfluencies cause the answerable questions to resemble the non-answerable ones, as seen for both the BERT and T5 models under the ALL setting. This results in a model that is overly conservative about answerability and resorts to over-predicting <no answer>, gaining non-answerable recall at the cost of precision. In contrast, for a comparable ANS model, the drop in F1 is smaller, primarily because the decision is relatively easier: the model is not required to decide whether to answer at all.
Fine-tuning on Heuristic Data. In this experiment, we fine-tune on the heuristically generated data from §3.3 and test directly on DISFL-QA. Table 6 compares the performance of the heuristics fine-tuned models on the DISFL-QA test set. The model trained on all heuristics (ALL) recovers a significant portion of the performance drop, from 61.64 to 82.27, an increase of 20.63 F1 points. However, this is still 7.32 F1 points short of the fluent performance.
Amongst the individual heuristics, we observe the following order of effectiveness w.r.t. performance on the HasAns cases: ENT > SQ > ADJ > VERB > ADV. One possible explanation for SWITCH-ENT and SWITCH-Q being more effective is that our original annotated dataset has a relatively high percentage of entity and interrogative corrections.

Few Shot Performance
Next, we evaluate the performance of the models when using part of the human-annotated gold disfluent data for training: (i) direct end-to-end supervision, (ii) generation-based data augmentation, and (iii) training disfluency correction models.
Direct Supervision (k-shot). In this setting, we take a SQUAD-v2 T5 model and perform a second round of fine-tuning with varying fractions of the DISFL-QA gold training data: 1, 5, 10, 25, 50, and 100 percent. Figure 2 shows the performance on the HasAns and NoAns cases as we increase the amount of training data. The HasAns performance increases gradually from 35.31 F1 points in the zero-shot setting to 86.40 F1 points with the complete training data. Interestingly, for the NoAns cases, the performance first drops from 90.06 F1 points in the zero-shot setting to 82.02 F1 with 5% of the data, and then increases monotonically to 86.53 F1 with the complete data. This can be attributed to the fact that the zero-shot models over-predicted <no answer> (high recall, low precision) due to their lack of robustness to disfluent inputs.

Furthermore, Table 7 compares the performance of using the gold training data of DISFL-QA against the heuristics data. Models trained with disfluent data from DISFL-QA close a major portion of the gap on the answerable slice, which was not possible with the heuristically generated data. Direct supervision brings an additional performance improvement of 4.19 F1 points over the heuristics.
Generation-Based Data Augmentation. We use the T5 model to synthetically generate disfluent questions from fluent questions in the text2text framework. We use the training set of DISFL-QA to train the following generative models: (i) context-free generation (Q → DQ), and (ii) context-dependent generation (CQ → DQ), which uses the passage as well. Table 8 shows example generations from the two models. We observe that CQ → DQ learns meaningful contextual disfluency generation, whereas Q → DQ can produce non-meaningful or inconsistent disfluencies due to the lack of context.

Table 8: Example disfluent questions (DQ) generated by the Q → DQ and CQ → DQ T5 generative models for data augmentation. CQ → DQ generates meaningful disfluencies, whereas context-free generation can lead to irrelevant or inconsistent questions in some cases.

We then pick 5k random (question, answer) pairs from the SQUAD training data and apply our generative model to produce disfluent training data for the QA models. Table 7 shows the performance of using data augmentation under two training-data settings: (1) 25% gold data, and (2) 100% gold data. Interestingly, for the models trained on 25% gold data plus generated data, we observe a gain of 1.81 F1 points (83.71 → 85.52) in overall performance, which is close to the absolute performance of using 50% gold data. However, for the setup with 100% gold data plus generated data, we did not observe a similar improvement in overall performance.
Pipelined: Disfluency Correction + QA. Unfortunately, existing disfluency correction models and datasets assume that the fluent text is a subsequence of the disfluent one; hence, these approaches cannot resolve the disfluencies in DISFL-QA that involve coreference. For a fair comparison, we train a T5 generation model as a DISFL-QA-specific disfluency correction model using the training set of DISFL-QA, with simple DQ → Q and CDQ → Q task formulations.
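The pipelined setup reduces to composing a correction step with a QA step; the sketch below uses stand-in callables for the two fine-tuned T5 models, which are our placeholders rather than the actual models.

```python
def pipelined_qa(disfluent_question, passage, correct_fn, answer_fn):
    # Stage 1: map the disfluent question to a fluent one (DQ -> Q).
    fluent_question = correct_fn(disfluent_question)
    # Stage 2: answer the corrected question with the QA model.
    return answer_fn(fluent_question, passage)
```

In practice, `correct_fn` and `answer_fn` would wrap the DISFL-QA-trained T5 correction model and the T5-QA model, respectively.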
With this pipelined approach, we obtain further improvements, with an overall F1 of 87.19 (Table 7); however, this is still ≈2.4 F1 points short of the fluent dataset. This shows that such complex cases require better modeling, preferably in an end-to-end setup.

Disfluency Correction
The most popular approach in the literature poses disfluency correction as a sequence tagging task, in which the fluent version of the utterance is obtained by identifying and removing the disfluent segments (Zayats et al., 2014; Ferguson et al., 2015; Zayats et al., 2016; Lou and Johnson, 2017; Jamshid Lou and Johnson, 2020; Wang et al., 2020). Traditional disfluency correction models use syntactic features (Honnibal and Johnson, 2014), language models (Zwarts and Johnson, 2011), discourse markers (Crible, 2017), or prosody-based features (Wang et al., 2017), while recent disfluency correction models largely rely on pre-trained neural representations. Most of these models depend on human-annotated data. As a result, data augmentation techniques have recently been proposed (McDougall and Duckworth, 2017) to alleviate the strong dependence on labeled data. However, the resulting augmented data, whether produced via heuristics (Wang et al., 2020) or generation models, is often limited in the types of disfluencies it covers and may not capture the natural disfluencies of daily conversations.

Question Answering Under Noise
In the QA literature, our work is related to two threads that aim to improve robustness of QA models: (i) QA under adversarial noise, and (ii) noise arising from speech phenomena.
Prior work on adversarial QA has predominantly generated adversaries automatically (Zhao et al., 2018), which are then verified by humans to ensure semantic equivalence (i.e., the answer remains the same after perturbation). For instance, Ribeiro et al. (2018) generated adversaries using paraphrasing, while Mudrakarta et al. (2018) perturbed questions based on attribution. The closest work to ours is Jia and Liang (2017), who modified SQUAD to contain automatically generated adversarial sentence insertions.
Our work is more closely related to prior work on making NLP models robust to noise arising from speech phenomena. Earlier work (Surdeanu et al., 2006; Leuski et al., 2006) built QA models robust to disfluency-like phenomena, but was limited in corpus complexity, domain, and scale. Recently, there has been renewed interest in constructing audio-enriched versions of existing NLP datasets, for example SPOKEN-SQUAD (Li et al., 2018) and SPOKEN-COQA (You et al., 2020), with the aim of showing the effect of speech recognition errors on the QA task. However, since collecting audio is challenging, another line of work tests the robustness of NLP models to ASR errors in transcribed texts containing synthetic noise, using a TTS → ASR technique (Peskov et al., 2019; Ravichander et al., 2021). Our work suggests a complementary approach to data collection that surfaces a specific speech phenomenon affecting NLP.

Conclusion
This work presented DISFL-QA, a new challenge set containing contextual semantic disfluencies in a QA setting. Unlike prior datasets, DISFL-QA contains a diverse set of disfluencies rooted in context, and in particular a large fraction of corrections and restarts. DISFL-QA allows one to directly quantify the effect of disfluencies on a downstream task, namely QA. We analyzed the performance of models when subjected to disfluencies under varying degrees of gold supervision: zero-shot, heuristics, and k-shot.
Large-scale LMs are not robust to disfluencies. Our experiments showed that state-of-the-art pre-trained models (BERT and T5) are not robust when directly tested on disfluent input from DISFL-QA. Although disfluencies are a naturally occurring phenomenon, the noise introduced by the disfluent transformation largely pushed the models toward non-answerable behavior.
Contextual heuristics partially recover performance. We derived heuristics, in an attempt to resemble the contextual nature of DISFL-QA, by introducing semantic distractors based on NER, POS, and other questions. In our experiments, we found the heuristics effective at: (1) confusing the models in the zero-shot setup, and (2) partially recovering the performance drop on DISFL-QA with fine-tuning. This indicates that the heuristics capture some key aspects of DISFL-QA.
Efficacy of gold training data. We used the gold data to supervise various models: (i) an end-to-end QA model, (ii) disfluency correction, and (iii) disfluency generation (for data augmentation). Across all experiments, gold supervision significantly outperforms heuristics supervision. Furthermore, we observed that in a low-resource setup, generation-based data augmentation can match the performance of a higher-resource modeling setup.