Disfluency Generation for More Robust Dialogue Systems

Disfluencies in user utterances can trigger a chain of errors impacting all the modules of a dialogue system: natural language understanding, dialogue state tracking, and response generation. In this work, we first analyze existing dialogue datasets commonly used in research and show that they contain only a marginal number of disfluent utterances. Due to this relative absence of disfluencies in their training data, dialogue systems may critically fail when exposed to disfluent utterances. Following this observation, we propose to augment existing datasets with disfluent user utterances by paraphrasing fluent utterances into disfluent ones. Relying on a pre-trained language model, our few-shot disfluent paraphraser, guided by a disfluency classifier, can generate useful disfluent utterances for training better dialogue systems. We report improvements for both dialogue state tracking and response generation when dialogue systems are trained on datasets augmented with our disfluent utterances.


Introduction
Disfluencies are common interruptions in the flow of speech. In English, it is estimated that disfluencies account for 20% of the words (Tree, 1995) and that there is a 50% probability that a sentence of 10-13 words will be disfluent (Shriberg, 1994), a probability that increases for longer sentences.
Since disfluencies are ubiquitous, they can have a significant impact on natural language processing (NLP) tasks. Previous work has largely addressed disfluency detection and studied the impact of disfluencies in various NLP tasks (Johnson and Charniak, 2004; Wang et al., 2010). Disfluency detection is a critical component of any NLP framework using speech transcriptions as input.
Disfluencies can mislead components of a dialogue system: natural language understanding (NLU), dialogue state tracking (DST), and response generation. However, disfluent utterances are usually absent from the publicly available dialogue datasets used for the research and development of dialogue systems. They are either removed after disfluency detection, or have never existed in the first place, for instance in dialogue datasets made from non-spoken texts. The datasets on which dialogue systems are trained and evaluated are thus often heavily curated. Dialogue systems trained on such datasets may then not be robust enough for real-world applications in which disfluent utterances are common.
In this paper, we propose to augment existing training datasets with disfluent paraphrases to train more robust dialogue systems. In contrast to previous work on disfluency generation, our disfluent paraphraser only requires a very limited amount of training data, which makes it applicable to a wide range of scenarios.
Our contributions are as follows:
• An analysis exposing the near absence of disfluent utterances in dialogue datasets and its impact on the robustness of dialogue systems.
• A framework to generate disfluent paraphrases.
• More accurate and more robust dialogue engines trained on our augmented datasets.

Disfluency in Dialogue
Disfluencies are usually categorized as in Table 1.
We can assume that, depending on its category, a disfluency will not have the same impact on dialogue systems. For instance, the "repair" and "restart" categories have more potential to mislead a system than "filled pauses", since they may impact a large portion of an utterance. The example in Table 2 illustrates how a "repair" disfluency can impact the main modules of a dialogue system, with an error made by the NLU module on the slot values that propagates to the response generation.
To verify our assumption that most dialogue datasets used for research contain few disfluencies, we created a disfluency classifier (Section 3.2) and applied it to publicly available dialogue datasets commonly used for training and evaluating dialogue systems. The classification results are presented in Table 3. We observe that disfluent utterances are much rarer than in a normal English speech flow. For instance, less than 4% of the utterances in SIMMC2, often used to train and evaluate multimodal dialogue systems, are disfluent.
To train more robust dialogue systems, we augment their training data with synthetic disfluent utterances. While disfluency correction is a common task, there are only a few attempts at disfluency generation in previous work. Yang et al. (2020) propose to generate disfluencies with a neural model that inserts n-grams at specific positions in fluent sentences. They focus on two disfluency categories: "repair" and "repetition". Their approach is able to generate natural disfluent sentences with a model trained on 29k disfluent sentences. In contrast, our approach, relying on a paraphraser, is able to generate any kind of disfluency and is not constrained to inserting tokens at specific positions, but it is not as conservative.
More recently, Gupta et al. (2021) and Passali et al. (2022) proposed to generate disfluent sentences using heuristics. While their approaches admittedly generate less natural disfluent sentences than a neural model, they do not require training and are able to generate disfluencies from any category covered by the heuristics.

Disfluency Generation
Our disfluent paraphraser is applied to fluent utterances from dialogue datasets, identified as such by a disfluency classifier. Then, the generated disfluent utterances are added to the dialogue datasets and used to train more robust dialogue systems following a standard training pipeline.

Disfluent Paraphraser
Pre-trained large language models (LLM) have demonstrated impressive results in most natural language generation tasks. Previous work proposed to use and evaluate LLM for disfluency correction (Saini et al., 2021; Gupta et al., 2021; Passali et al., 2022). We propose to also use LLM for disfluency generation. As training data for the paraphraser, we need disfluent dialogue utterances paired with their manually corrected fluent versions, so that the model can learn the sequence-to-sequence task of generating a disfluent utterance given a fluent one. Since we lack large training data for this task for most languages and domains, we propose to perform few-shot learning for disfluency generation. Concretely, we fine-tune the LLM on a few training examples. Since correcting a few disfluent utterances by hand is rather cheap, we assume this scenario to be realistic and applicable to most domains and languages.
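As a minimal, hypothetical sketch of how such few-shot training data can be assembled (function and field names are our own, not part of our released code), each manually corrected pair becomes a source/target example for sequence-to-sequence fine-tuning:

```python
import random

def make_fewshot_pairs(parallel, k, seed=0):
    """Subsample k (fluent, disfluent) utterance pairs and format them as
    seq2seq examples: the fluent side is the source, the disfluent side is
    the target the paraphraser learns to generate."""
    rng = random.Random(seed)  # fixed seed for reproducible subsamples
    sample = rng.sample(parallel, min(k, len(parallel)))
    return [{"source": fluent, "target": disfluent} for fluent, disfluent in sample]

# Two toy pairs (illustrative, not from the Fisher corpus):
pairs = make_fewshot_pairs(
    [("I want a ticket to Miami", "I want a ticket to uh Boston no sorry Miami"),
     ("Book a table for two", "Book a uh book a table for two")],
    k=2,
)
print(len(pairs))  # 2
```

The same subsampling step yields the 50, 500, and 5,000-example training sets used in our experiments.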
In preliminary experiments, we observed that beam search is very conservative at inference time with our paraphraser, i.e., it preserves the original structure and vocabulary of the fluent utterances. Since our goal is to augment datasets and generate diverse disfluencies, we propose to sample the search space during decoding to generate more diverse sequences with less overlap with the source utterance. This is particularly intuitive for generating disfluent utterances, for which more aggressive sampling will, to some extent, generate more disfluent utterances. We found nucleus sampling (Holtzman et al., 2020) to generate sufficiently diverse outputs when the top_p hyperparameter is appropriately optimized (see Section 3.2).
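To make the decoding behavior concrete, here is a self-contained sketch of the nucleus (top-p) truncation step itself, with toy probabilities; this is our own illustration, not the paraphraser's actual implementation:

```python
def nucleus(probs, top_p):
    """Keep the smallest set of tokens, taken in decreasing probability
    order, whose cumulative probability reaches top_p; sampling is then
    restricted to this set (after renormalization)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, p in ranked:
        kept.append(token)
        cum += p
        if cum >= top_p:
            break
    return kept

# A higher top_p keeps more low-probability tokens, allowing more
# diverse (and potentially more disfluent) continuations:
dist = {"Miami": 0.5, "uh": 0.25, "Boston": 0.125, "the": 0.125}
print(nucleus(dist, 0.5))   # ['Miami']
print(nucleus(dist, 0.85))  # ['Miami', 'uh', 'Boston']
```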

Disfluency Identification
Dialogue datasets often contain manual NLU and DST annotations for each user utterance. It is critical that these annotations remain valid for the generated disfluent utterances. If the paraphraser is too aggressive, the utterance may change meaning and no longer match the annotations.
We propose to use a disfluency classifier whose objective is to identify whether a user utterance is fluent or disfluent. If an utterance is classified as disfluent, our paraphraser is not applied to it. Moreover, we use the classifier decision to tune the aggressiveness of our paraphraser. For instance, if an utterance is identified as fluent but with a low probability according to the classifier, we may only need to introduce a few modifications to make it disfluent. If an utterance is clearly found fluent by the classifier, a more aggressive disfluent paraphrasing should be performed to ensure the output is disfluent enough.
In practice, this tunable aggressiveness is implemented in our paraphraser at inference time: the probability α yielded by the classifier for an utterance to be disfluent is combined with a constant β, with 0 < β < 1, to set the top_p hyperparameter of nucleus sampling. We found that β = 0.2 yields useful disfluent utterances, but we argue that this may not hold for all use cases, such as applying the paraphraser to datasets in a very different style and domain; consequently, β should be tuned.
As for the classifier itself, we propose to use BERT (Devlin et al., 2019) for binary classification. This is a simpler classification task than the one proposed by previous work (Yang et al., 2020), which uses BERT to classify disfluency directly at the token level. The training data for our classifier is then easier to create, since we only need native speakers to label whether a sentence is fluent or disfluent.
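As a purely illustrative, hypothetical mapping (our own stand-in; the paper defines its own formula over α and β), any function of α that decreases toward a floor of β reproduces the described behavior:

```python
def tuned_top_p(alpha, beta=0.2):
    """Hypothetical mapping, not the paper's exact formula: the more
    clearly fluent the utterance (small alpha, the classifier's probability
    that it is disfluent), the higher top_p, i.e., the more aggressive the
    sampling; beta is the lower bound."""
    assert 0.0 <= beta <= 1.0
    return max(beta, 1.0 - alpha)

# Clearly fluent utterance -> aggressive paraphrasing:
print(tuned_top_p(0.25))  # 0.75
# Utterance already close to disfluent -> stay conservative:
print(tuned_top_p(0.9))   # 0.2
```

With β = 0.2 as in our experiments, top_p never drops below 0.2, so some sampling diversity is always preserved.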

Datasets
We trained our paraphraser and classifier on the Fisher English corpus created by Post et al. (2013), which is a translation of the original Fisher Spanish corpus. We paired this corpus with its fluent version (Salesky et al., 2018), in which the disfluencies have been manually corrected. Statistics of the full parallel corpora used are given in Table 4.
We report on experiments with dialogue tasks using SIMMC2 augmented with disfluencies for DST and response generation.

Settings and Baseline Systems
We trained our model for disfluency generation using T5. We use the base version and acknowledge that we may get better results with a larger model, but at a greater computational cost. The base version is a Transformer (Vaswani et al., 2017) with 12 layers, 12 attention heads, a feed-forward dimension of 3072, and embeddings of 768 dimensions. Since we aim at few-shot learning, we fine-tuned T5 on subsamples of different sizes of the Fisher train fluent-disfluent parallel data, containing 50, 500, 5,000, or all the available parallel utterances, for 20 epochs with standard hyperparameters. 10 We select the best model according to BLEURT (Sellam et al., 2020) on the Fisher validation data.
One drawback of using a varying top_p is that it complicates the implementation of batch decoding, since utterances in the same batch would be paraphrased with different top_p values. Since we only paraphrase datasets for training, decoding time was not our main concern and we simply paraphrase utterances one by one.
We identified 36,873 fluent utterances in SIMMC2 using our BERT classifier, 11 trained on the same data as the paraphraser, and paraphrased them while keeping their DST annotations unchanged. The 1,254 remaining utterances identified as disfluent are not paraphrased. The generated disfluent utterances are added to the original SIMMC2, yielding a new total of exactly 75,000 utterances.
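This augmentation step can be sketched as follows, with the classifier and paraphraser as black-box callables (names and data layout are our own illustration):

```python
def augment(dialogue_data, is_disfluent, paraphrase):
    """Paraphrase only the utterances the classifier deems fluent, and copy
    the original DST annotations unchanged onto each generated disfluent
    utterance; `is_disfluent` and `paraphrase` stand in for the BERT
    classifier and the fine-tuned T5 paraphraser."""
    augmented = list(dialogue_data)  # keep all original utterances
    for example in dialogue_data:
        if not is_disfluent(example["utterance"]):
            augmented.append({
                "utterance": paraphrase(example["utterance"]),
                "annotations": example["annotations"],  # remain valid
            })
    return augmented

# Toy stubs for illustration:
data = [{"utterance": "book a flight to Miami",
         "annotations": {"destination": "Miami"}}]
out = augment(data,
              is_disfluent=lambda u: "uh" in u,
              paraphrase=lambda u: u.replace("Miami", "uh Boston no sorry Miami"))
print(len(out))  # 2: the original plus its disfluent paraphrase
```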
For evaluation in dialogue, we use the same pipeline proposed by Kottur et al. (2021): GPT-2 is fine-tuned on the augmented training data for 5 epochs and is prompted with user utterances.We denote this configuration Disfluent Paraphraser.
For DST, we use the same evaluation script provided by the SIMMC2 repository. For response generation, we use BLEURT. We compare our approach with the following systems.
Original: This is the same baseline system proposed by Kottur et al. (2021). GPT-2 is fine-tuned on the original SIMMC2 for 10 epochs.
10 Fine-tuning T5 on all these subsamples took less than a day on an NVIDIA RTX 3060 12 GB GPU.
11 Our classifier was trained using the Hugging Face Transformers default pipeline (Wolf et al., 2020). It reaches an F1 score of 81.4 on the Fisher test set. We released our model and code (links in the introduction).
LARD: We used the LARD heuristic-based framework, with default hyperparameters, to make the fluent utterances disfluent. LARD is not trainable and consequently cannot exploit the disfluent training examples.
Plan&Gen: We used the framework proposed by Yang et al. (2020) to insert disfluencies into the fluent utterances. This system can be considered our baseline.
General Paraphraser: We evaluate a standard paraphraser, i.e., one not trained to generate disfluencies: T5 fine-tuned on the "paranmt_filtered" dataset compiled by Krishna et al. (2020), containing 75k paraphrases in mixed domains.
The only difference between the LARD, Plan&Gen, General Paraphraser, and Disfluent Paraphraser configurations is the approach used to rewrite the same fluent utterances.

Results
We evaluated dialogue models on the entire devtest of SIMMC2, but also on the portions identified as fluent (8,321 utterances) or disfluent (288 utterances) to highlight where each model is the most effective. Our proposed approach for disfluency generation yields the most useful training data: our disfluent paraphraser outperforms all the other systems for both DST and response generation. While LARD and Plan&Gen both improve the joint accuracy and slot F1 on the disfluent part of SIMMC2, their scores remain similar to the baseline on the fluent part. Interestingly, we observe the reverse with the general paraphraser, which yields better results on the fluent part. Our disfluent paraphraser is the only system that improves the results on both fluent and disfluent utterances. Nonetheless, we also observe that our system requires at least 500 training examples to avoid a drop in BLEURT and joint accuracy on the fluent part. Indeed, we manually observed that when T5 is fine-tuned on only 50 fluent-disfluent utterance pairs, the generated disfluencies tend to be very noisy, with many meaningless utterances, e.g., empty ones or sequences of many symbols. Those could easily be filtered with heuristics to improve the quality of the generated data.
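Such a filter could be as simple as the following sketch; the rules and thresholds are illustrative assumptions, not a filter we actually applied:

```python
import re

def is_degenerate(utterance, min_words=2, max_symbol_ratio=0.3):
    """Flag obviously broken generations: empty or near-empty strings,
    and outputs dominated by non-alphabetic symbols."""
    text = utterance.strip()
    if len(text.split()) < min_words:
        return True
    symbols = len(re.findall(r"[^A-Za-z\s]", text))
    return symbols / len(text) > max_symbol_ratio

print(is_degenerate(""))                              # True (empty)
print(is_degenerate("$$$ ### !!!"))                   # True (symbol run)
print(is_degenerate("I want a uh ticket to Miami"))   # False
```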

Conclusion
We demonstrated that our disfluent paraphraser generates useful disfluent paraphrases to better train dialogue models and, especially, to improve their robustness to disfluent utterances. Our approach improves dialogue state tracking and response generation for both fluent and disfluent user utterances. As future work, we would like to address the limitations discussed in Section 6.

Limitations
The main limitation of our approach is that our paraphraser may generate meaningless utterances, as we observed when it is trained on very few examples. To quantify these instances, an intrinsic evaluation of our paraphraser should be performed. Previous work proposed automatic evaluation of generated disfluencies using BLEU. We argue that the number of valid disfluent paraphrases for a fluent utterance is so large that BLEU cannot be a fair metric for our approach, since it would only reward the specific utterances given as references. Only a thorough human evaluation can provide the necessary feedback on the naturalness, adequacy, and overall quality of the generated disfluencies. Then, heuristics could be designed to filter out generated utterances of poor quality.
The SIMMC2 evaluation data also contains a very small number of disfluent utterances, which only exhibit a few instances of some of the disfluency categories presented in Section 2. Our results may thus not be as representative of a real-world scenario as we would like. Since all the publicly available dialogue datasets annotated with intents and slot values are mainly fluent, more representative evaluation datasets with very diverse types of disfluencies should be created.
Finally, the parallel Fisher corpus is not ideal for training an English paraphraser since it is a translation from Spanish. We did observe some translation errors and artifacts in the dataset, such as Spanish characters like "¿", that may negatively affect the performance of our paraphraser.

Ethical Considerations
Language models are biased by the data used to train them. Our fine-tuning of BERT and T5 with the Fisher corpus potentially created biases or amplified some of the biases inherited from these two base models. We acknowledge that this work has the potential to be used to harm minorities, for instance, by unfairly classifying or amplifying disfluencies in utterances expressed by minority groups.
We decided to delay the public release of our models, datasets, and code used for disfluency generation until our work has gone through an entire peer-review cycle and been publicly presented, so as to receive as much feedback as possible.
On the other hand, we are releasing our disfluency classifier, in the form of fine-tuned BERT models and code for fine-tuning and evaluation, as we believe these resources can be useful for the research community while posing a much lower risk of harmful exploitation than our disfluent paraphraser.
Table 2 (example content):
Utterance: I would like to book a ticket for Boston uh no sorry for Miami
NLU: Intent: book_ticket, slots: {destination: Boston}
Response: I booked your flight for Boston

In Table 5, a/b/c are scores obtained on the SIMMC2 devtest with all the utterances (a), only the fluent utterances (b), and only the disfluent utterances (c), where fluent and disfluent utterances are identified by the classifier. The second column indicates the number of training examples from the Fisher parallel corpus exploited to train a disfluency generator. The highest numbers are in bold.

Table 1 :
Examples of different types of disfluencies. Tokens in bold are disfluent.

Table 2 :
Example of dialogue engine failure due to a disfluent utterance.

Table 4 :
Statistics of the parallel fluent-disfluent Fisher English corpus.We indicate between parentheses the original names of the datasets we used for dev and test.

Table 5 :
Results for DST and response generation on SIMMC2.