DISCO: A Large Scale Human Annotated Corpus for Disfluency Correction in Indo-European Languages

Disfluency correction (DC) is the process of removing disfluent elements like fillers, repetitions and corrections from spoken utterances to create readable and interpretable text. DC is a vital post-processing step applied to Automatic Speech Recognition (ASR) outputs before subsequent processing by downstream language understanding tasks. Existing DC research has primarily focused on English due to the unavailability of large-scale open-source datasets. Towards the goal of multilingual disfluency correction, we present a high-quality human-annotated DC corpus covering four important Indo-European languages: English, Hindi, German and French. We provide extensive analysis of the results of state-of-the-art DC models across all four languages, obtaining F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). To demonstrate the benefits of DC on downstream tasks, we show that DC leads to an average increase of 5.65 BLEU points when used in conjunction with a state-of-the-art Machine Translation (MT) system. We release the code to run our experiments along with our annotated dataset here.


Introduction
Humans often think and speak simultaneously in conversations, introducing erroneous words into utterances (Gupta et al., 2021). These words do not contribute to the semantics of a sentence and hence can be removed to create fluent and easy-to-interpret utterances. Disfluency Correction (DC) is defined as the removal of such disfluent elements from spoken utterances (Shriberg, 1994). Motivation: Apart from making sentences readable and interpretable, DC also helps downstream natural language processing tasks like Machine Translation (MT) (Rao et al., 2007; Wang et al., 2010). Removing disfluencies shortens sentences, making it easier for automatic MT systems to translate these utterances. Moreover, the removed erroneous words are not translated, which keeps the output translation fluent while preserving all semantics from the source sentence. Table 1 illustrates examples where Google MT produces disfluent and difficult-to-read English translations of disfluent sentences in three languages (Hindi, German and French), establishing the need for DC.

Table 1: English translations produced by Google MT for disfluent sentences in Hindi, French and German. All disfluent words are marked in red.

Disfluent sentence (HI): वाच में र न ग अह्ह्ह स्माटर् वॉच में रोका हुआ र न ग टाइमर रज्यू म करो
Google MT output: running in watch ahhh resume running paused timer in smart watch

Disfluent sentence (FR): je veux je veux euh enregistrer une une euh vidéo sur instagram
Google MT output: I want I want uh record a uh video on instagram

Disfluent sentence (DE): ich brauche eine fahrt äh eine fahrt zum bahnhof in einer stunde
Google MT output: I need a ride er a ride to the train station in an hour
Previous work in DC has leveraged a variety of machine learning models for removing disfluent elements from text (Ostendorf and Hahn, 2013b; Rasooli and Tetreault, 2015; Zayats et al., 2016b). However, data in DC is scarce, limiting the use of large transformer models. Switchboard (Godfrey et al., 1992), the most extensively available open-source DC corpus, contains English spoken utterances with only 5.9% disfluent words in the entire dataset (Charniak and Johnson, 2001). Synthetic Data Generation (SDG) has emerged as a viable solution to the data scarcity problem (Passali et al., 2022; Kundu et al., 2022). However, SDG can be challenging, as it needs expert grammatical knowledge, and the data created can often fail to mimic complex disfluencies encountered in real-life dialogues (Gupta et al., 2021).

Table 2: Types of sentences observed in the DISCO corpus. All disfluencies are marked in red; EN-English, DE-German, FR-French, HI-Hindi. Examples in languages other than English, with their corresponding gloss and transliteration, can be found in Appendix E.
Hence there is a dire need to develop DC datasets with utterances from real-life conversational situations. Existing datasets have focused on increasing the available data in English. This paper presents a high-quality DC corpus in English and other widely spoken languages: Hindi, German and French. Our dataset significantly expands the available data in English and Hindi. To the best of our knowledge, we are the first to create an open-source DC corpus for German and French.² Our contributions are:

1. A human-labeled dataset of 12K+ disfluent-fluent text utterance pairs in 4 languages: English, Hindi, German and French, with extensive data analysis (Section 3.4).

2. Experiments with various state-of-the-art techniques for DC, ranging from traditional ML models to large transformers (Table 5). Our best models (fine-tuned multilingual transformers) achieve F1 scores of 97.55 (English), 94.29 (Hindi), 95.89 (German) and 92.97 (French). Our results in English and Hindi are competitive with other approaches, but we do not claim direct improvement because the test datasets differ.

3. Improving the BLEU score of a state-of-the-art MT system by 5.65 points on average for the Hindi-English and German-English language pairs after automatic disfluency removal from source sentences (Table 10). Similar analyses for other language pairs are left to future work.

² Although Cho et al. (2014) annotated the KIT lecture corpus (Stüker et al., 2012) for disfluencies in German, their data is not shared publicly.

Related Work
The study of disfluencies as a spoken language phenomenon was first proposed in Shriberg (1994). DC has been established as a vital post-processing task for ASR transcripts (Rao et al., 2007; Wang et al., 2010). Noisy channel-based methods treat a disfluent utterance as a noisy version of its fluent counterpart (Jamshid Lou and Johnson, 2017; Johnson and Charniak, 2004; Zwarts and Johnson, 2011). Parsing-based methods use techniques such as dependency parsing to predict the syntactic structure of an utterance along with its disfluent elements (Rasooli and Tetreault, 2015; Honnibal and Johnson, 2014; Wu et al., 2015). Sequence tagging methods work well for disfluency removal from real-life spoken utterances, assigning a disfluent/fluent label to every word in the sentence (Hough and Schlangen, 2015a; Ostendorf and Hahn, 2013a; Zayats et al., 2016a; Chen et al., 2022). Systems based on language clues and part-of-speech tags have also been explored for DC (Bove, 2008; Christodoulides et al., 2014). There is a notable gap in the literature regarding real data annotation in DC, with Switchboard (Godfrey et al., 1992) and Salesky et al. (2018) being the most extensive open-source labeled datasets for English DC. Although Gupta et al. (2021) introduced a dataset for disfluencies in English question answering, its sentences are not annotated for disfluent words. Without labeled data, various zero-shot, few-shot, and multi-task learning techniques have been proposed, which train on multilingual data, creating and utilizing synthetically generated disfluent sentences (Wang et al., 2018; Passali et al., 2022; Kundu et al., 2022; Bhat et al., 2023). In this work, we experiment with sequence tagging methods for DC.

DISCO: A Dataset for Disfluency Correction
This section analyzes the DISCO corpus, created with the help of English, Hindi, German and French language experts. DISCO contains parallel disfluent-fluent sentence pairs in the above four languages and English translations of fluent sentences in Hindi and German, along with disfluency and domain labels. Shriberg (1994) defines disfluencies as a composition of Reparandum, Interregnum and Repair (Figure 1). Reparandum refers to words erroneously uttered by the speaker. The speaker acknowledges that a previous utterance might be incorrect using the interregnum, whereas the repair contains words that correct the mis-spoken words. Disfluent utterances might also contain an interruption point, a spoken phenomenon like a speech pause. DC removes the reparandum and interregnum while retaining the repair to make the output sentence more fluent. We study four types of disfluencies observed in our dataset: Filler, Repetition, Correction and False Start. Additionally, there are some fluent sentences present in our corpus. Table 2 describes each type of sentence with some real examples from the DISCO dataset. Goel et al. (2023) released an open-source dataset containing real-life utterances of humans interacting with AI agents for task-oriented dialogue parsing. We extract disfluent sentences and domain labels in English, Hindi, German and French from this corpus. These utterances consist of human dialogues like making notes, monitoring fitness, adding new contacts, opening apps, etc. All sentences were shared with the respective language experts for fluent sentence creation and disfluency-type annotation.
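The reparandum/interregnum/repair structure can be made concrete with a short sketch. The utterance and span boundaries below are hypothetical, chosen only to illustrate the three components:

```python
# Hypothetical utterance annotated with Shriberg's disfluency structure.
utterance = ["I", "want", "a", "ticket",   # fluent prefix
             "to", "Boston",               # reparandum: erroneously uttered
             "uh", "I", "mean",            # interregnum: speaker signals the error
             "to", "Denver"]               # repair: corrects the mis-spoken words
reparandum = (4, 6)
interregnum = (6, 9)
repair = (9, 11)

# DC removes the reparandum and interregnum and keeps the repair.
fluent = utterance[:reparandum[0]] + utterance[repair[0]:]
print(" ".join(fluent))  # I want a ticket to Denver
```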

Annotation Protocol and Challenges
For each language, we hired external annotators from reputed translation agencies with experience in data annotation. They were asked to create fluent sentences corresponding to disfluent utterances, along with disfluency type labels. Each annotator was paid competitively based on current market standards (approximately $0.018 per word). Since we follow a sequence tagging approach to DC, the annotators were asked to only remove disfluent words from utterances, without changing word order or correcting original words/phrases.
Due to budget constraints, we could not utilize the entire dataset in German and French from Goel et al. (2023). However, we carefully selected sentences in these languages to sufficiently cover all disfluency types with varied length and complexity of utterances. Since, for every language, only one annotator created fluent sentences and disfluency type labels, ensuring high-quality data was very important. We strongly encouraged the annotators to flag all dubious instances, after which the authors took a majority vote on retaining/removing doubtful disfluent words, using high-quality translation tools and subject knowledge wherever necessary. Flagged examples and our reasoning for specific annotations are discussed in Appendix A.

Key Statistics
The DISCO corpus is carefully created to ensure healthy representation of various disfluency types and sentence complexities. Table 4 describes the average length of disfluent and fluent sentences for each language. Our analysis shows that, in similar contexts, commands to AI agents are shorter in German and French than in English and Hindi. The standard deviation of the disfluent sentence lengths demonstrates that the dataset also contains longer utterances, more than ten words long, in each language, which are relatively difficult to correct. We showcase the distribution of disfluent sentences across disfluency types in Figure 2. Our corpus also contains a good distribution of sentences across various task domains. Readers are urged to refer to Appendix B for the domain-level distribution and other important plots pertaining to the corpus.

Helper Datasets
We also use some helper datasets, extracting unlabeled sentences to enable few-shot learning-based experiments on DISCO. LARD: Contains synthetically generated English disfluent sentences, created using rule-based disfluency injection into fluent sentences (Passali et al., 2022). Samanantar: Consists of 49.7 million parallel sentences between English and 11 Indic languages (Ramesh et al., 2021). Source sentences were collected across many domains such as newspapers, government public archives, Wikipedia, etc. The corpus consists of fluent sentences, and we only use the Hindi sentences for our experiments. GermEval 2014: Consists of 31K fluent German sentences collected from Wikipedia and various news corpora (Benikova et al., 2014). The dataset was originally used for Named Entity Recognition; we utilize unlabeled sentences from its train split.

DiaBLa:
Released by Bawden et al. (2021), this corpus consists of 5700+ sentence pairs for English-French MT. The dataset is curated from written and informal interactions between native speakers of both languages.

Dataset Evaluation
This section describes the experiments we perform to evaluate the DISCO corpus. Our evaluation strategy measures the efficacy of the corpus for robust disfluency correction in a wide variety of cases. Moreover, we also test the ability of our trained models to correct disfluencies for improving downstream machine translation.

Data Processing
All parallel sentence pairs are passed through a punctuation removal module to reduce the number of tokens for classification. As per the structure of disfluencies described in section 3.1, we consider fluent terms to always follow disfluent terms in an utterance. Disfluent words are marked with the positive label (1) and fluent words with the neutral label (0) (Kundu et al., 2022).
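Since annotators only delete words, the fluent sentence is a subsequence of the disfluent one, so per-token labels can be recovered by alignment. The helper below is a hypothetical sketch of such a step (not the released preprocessing code); it matches right-to-left so that, for repetitions, the later copy (the repair) is the one kept:

```python
def tag_disfluencies(disfluent_tokens, fluent_tokens):
    """Derive per-token labels (1 = disfluent, 0 = fluent) from a
    parallel pair, assuming the fluent side was produced only by
    deleting words. Right-to-left greedy matching keeps the later
    copy of a repeated span, following the disfluency structure."""
    labels = [1] * len(disfluent_tokens)
    j = len(fluent_tokens) - 1
    for i in range(len(disfluent_tokens) - 1, -1, -1):
        if j >= 0 and disfluent_tokens[i] == fluent_tokens[j]:
            labels[i] = 0
            j -= 1
    return labels

disfluent = "je veux je veux euh enregistrer une une euh vidéo".split()
fluent = "je veux enregistrer une vidéo".split()
print(tag_disfluencies(disfluent, fluent))  # [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
```

Keeping only the tokens labeled 0 reconstructs the fluent reference exactly, which makes the derived labels easy to sanity-check.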

Baseline Models
We use a combination of smaller ML models, larger transformer models and transformers with adversarial training.All models are trained on an 80:10:10 train:valid:test split for each language.

ML Baselines
Previous work has shown the efficacy of Conditional Random Fields (CRFs) and Recurrent Neural Network (RNN) based techniques for token classification in DC (Ostendorf and Hahn, 2013b; Hough and Schlangen, 2015b). These models require less labeled data and are ideal for low-resource, domain-specific training (Simpson et al., 2020). Token-level features from a powerful multilingual transformer, XLM-R (Conneau et al., 2020), were used for fine-tuning the CRF and RNN models.

Transformer Baselines
Transformers (Vaswani et al., 2017) are large and powerful neural networks capable of learning complex text representations for many downstream NLP tasks. We experiment with three multilingual transformers: mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and MuRIL (Khanuja et al., 2021). Fine-tuning for sequence tagging is performed by adding a classification head on top of these transformers that performs sub-word level binary prediction. A word is predicted to be disfluent/fluent according to the prediction for its first subword.
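The first-subword convention can be sketched as follows. We assume a `word_ids` mapping of the kind returned by HuggingFace fast tokenizers (one word index per subword position, `None` for special tokens); the helper name and the toy predictions are ours:

```python
def collapse_to_words(subword_preds, word_ids):
    """Collapse subword-level predictions to word-level labels:
    each word takes the prediction of its first subword, and
    special-token positions (word id None) are skipped."""
    word_preds, seen = [], set()
    for pred, wid in zip(subword_preds, word_ids):
        if wid is not None and wid not in seen:
            word_preds.append(pred)
            seen.add(wid)
    return word_preds

# Toy example: the third word splits into two subwords, so only the
# prediction at its first subword position counts.
word_ids      = [None, 0, 1, 2, 2, None]   # [CLS] w0 w1 w2a w2b [SEP]
subword_preds = [0,    0, 1, 0, 1, 0]
print(collapse_to_words(subword_preds, word_ids))  # [0, 1, 0]
```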

Transformer with Adversarial Training (Seq-GAN-BERT)
In low-resource settings, adversarial training helps transformers improve the representations they learn for downstream tasks. We use the Seq-GAN-BERT model (Bhat et al., 2023), which supports adversarial training for transformers, utilizing both labeled and unlabeled data for token classification-based DC. Unlabeled data is taken from the helper datasets specified in section 3.5. We obtain the best results using the MuRIL transformer as the base model in Seq-GAN-BERT.

Experimental Setup
CRF and RNN models are trained using the FlairNLP framework (Akbik et al., 2019) until the validation cross-entropy loss saturates. We start with a learning rate of 0.1 and halve it each time the model does not improve for three consecutive epochs. Transformer models are trained using the popular transformers package (Wolf et al., 2020). We use a learning rate of 2e-5 and a weight decay of 0.01. All transformer models are trained for 40 epochs using the Adam optimizer (Kingma and Ba, 2014).
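The annealing rule for the CRF/RNN training can be sketched as below. This is only an illustration of the schedule described above, not Flair's actual implementation, which provides this behavior via plateau-based annealing similar in spirit to PyTorch's ReduceLROnPlateau:

```python
def anneal_lr(val_losses, lr=0.1, patience=3):
    """Halve the learning rate whenever the validation loss fails to
    improve for `patience` consecutive epochs; returns the learning
    rate in effect at each epoch."""
    best, bad, history = float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, bad = loss, 0
        else:
            bad += 1
            if bad == patience:
                lr, bad = lr / 2, 0
        history.append(lr)
    return history

print(anneal_lr([1.00, 0.90, 0.95, 0.92, 0.91, 0.80]))
# [0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
```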

Hardware support
All ML baselines were trained with A100 GPUs provided by Google Colab. Transformers were trained with one NVIDIA GeForce RTX P8-11GB GPU per experiment.

Results and Analysis
We thoroughly analyse all experiments performed in DC. This section also discusses some case studies highlighting strengths and weaknesses of our best models. Our experiments analyzing the impact of DC on MT provide interesting linguistic insights into the phenomenon of disfluencies.

Disfluency Correction
All results are reported using the F1 score metric (Jamshid Lou and Johnson, 2017; Passali et al., 2022). Combined results across all four languages are described in Table 5. As the complexity of the models increases, the overall accuracy also increases. Transformer architectures consistently perform better than CRF and RNN-based models. In each language, the best models produce 90+ F1 scores on blind test sets, indicating that our corpus successfully addresses the data scarcity problem. As expected, the F1 scores of the multilingual transformers are close, due to the similar range of parameters fine-tuned for token classification-based DC.
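For reference, the reported metric can be reproduced with a minimal token-level F1 over the disfluent class (a sketch of the metric, not the exact evaluation script):

```python
def disfluent_f1(gold, pred):
    """F1 treating label 1 (disfluent) as the positive class, over
    flat per-token label sequences of equal length."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 1, 0]
print(round(disfluent_f1(gold, pred), 4))  # 0.6667
```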
Performance across disfluency types is described in Table 6. We observe that the model performs poorly on fluent sentences in English and French due to fewer samples in the test set. In Hindi and German, false starts are the most difficult disfluencies to correct. Further examination reveals that our model often under-corrects longer false starts, especially in the presence of other disfluencies like fillers. Table 7 discusses some important case studies containing inferences produced by our model on unseen test sentences. Our models accurately correct complex cases such as multiple disfluencies and changes in thought/speech plan. However, they also over-correct clarifications, treating them as corrections and thereby removing significant meaning. We observe that multi-words, a critical linguistic phenomenon in Indian languages, are often over-corrected to simplify the text. More case studies appear in Appendix E, along with gloss and English transliteration for non-Roman text.

Table 8: Effect of each disfluency type and its removal on downstream MT for Hindi-English (Hi-En) and German-English (De-En) language pairs. F-Filler, R-Repetition, C-Correction and FS-False Start.

Impact of Disfluency Correction on Downstream Machine Translation
We use a strong baseline NLLB MT system (Costa-jussà et al., 2022) to compare English translations produced with and without disfluency removal (Appendix F), in order to understand the impact of DC models on an important downstream NLP task.
The Ground Truth (GT) translations for Hindi-English and German-English were created by the respective language experts. We use the sacrebleu package (Post, 2018) to calculate BLEU scores between: T1 (translations without DC) and GT; T2 (translations with automatic DC) and GT; and T3 (translations with human DC) and GT. Table 10 summarises our results for both language pairs. DC improves downstream MT by 6.44 BLEU points for Hindi-English and by 4.85 points for German-English. We also observe that human DC outperforms automatic DC, highlighting the scope for further improvement of DC models.
Table 8 shows that the translation BLEU score improves for every disfluency type after DC. Moreover, for repetitions and false starts, automatic DC slightly outperforms human DC. The most significant improvement in BLEU score is observed for fillers, and the lowest for corrections. Our models also improve the scores across all domains, as described in Appendix F.
We also compared the downstream MT improvement produced by a large transformer (MuRIL) trained separately on DISCO and on Kundu et al. (2022) for Hindi DC, followed by downstream Hindi-English translation using NLLB.

Case Study: Examining a few translations with and without disfluency correction
Table 9 discusses some interesting cases where disfluencies impact the English translation of Hindi and German sentences. Although removing disfluencies helps MT in most cases, there are a few examples where DC leads to worse output.

Conclusion and Future Work
This paper introduces the DISCO dataset for disfluency correction in four widely spoken languages: English, Hindi, German and French. Our work highlights the importance of large-scale projects in NLP that scale the amount of labeled data available. Spoken interactions between humans and AI agents are riddled with disfluencies. Eliminating disfluencies not only improves the readability of utterances but also leads to better downstream translations. Our dataset, which consists of roughly 3000 parallel disfluent-fluent sentences in each language, significantly reduces the data scarcity problem in DC. This allows training of large transformer models to correct spoken disfluencies in written transcripts with high accuracy. The lack of conversational translation datasets has led to most MT systems being trained on fluent text. Our experiments show that such models, if used in conversational settings, do not perform well. By adding a DC model to the pipeline, which is often a smaller model with an incremental increase in latency, one can improve the downstream translations output by an MT system that does not adjust to conversational phenomena. Moreover, our datasets in German-English and Hindi-English can also be used to fine-tune conversational MT models.
Future work lies in experimenting with better ML models for sequence tagging-based DC supporting multilingual training. These should also incorporate linguistic features like the reparandum, interregnum and repair. Multimodal DC presents a promising direction, as it can use both speech and text features for correction (Zhang et al., 2022). Additionally, trained DC models must be evaluated using diverse samples from various domains and dialects. Special efforts must be made to collect disfluent speech transcripts to be annotated and used for training DC in other low-resource languages.

Acknowledgements
We would like to thank the anonymous reviewers and area chairs for their suggestions to strengthen the paper. This work was done as part of the Bahubhashak Pilot Project on Speech to Speech Machine Translation under the umbrella of the National Language Technology Mission of the Ministry of Electronics and IT, Govt. of India. We would also like to thank the project managers and the internal and external language translators at the Computation for Indian Language Technology (CFILT), IIT Bombay.

Limitations
Our work has two limitations. Firstly, since our annotation process involved only one annotator for each language, we could not report metrics such as inter-annotator agreement or Cohen's kappa to establish the validity of our dataset. However, since DC is a relatively straightforward task, consisting only of removing disfluent words from spoken utterances, the authors were able to verify many samples as part of their analysis. Moreover, the structure of disfluencies helps us recognize disfluency types easily. We have also provided a few flagged cases where annotators discussed their queries with us, and how we resolved them.
Secondly, we do not compare models trained on DISCO with those trained on other datasets, due to the varied domains of existing datasets. We found that existing datasets like Switchboard (Godfrey et al., 1992), LARD (Passali et al., 2022) and Kundu et al. (2022) all consist of utterances from very diverse data sources. However, we include experiments in Appendix D that highlight the robustness of models trained on DISCO.

Ethics Statement
This work publishes a large-scale human-annotated dataset for disfluency correction in 4 Indo-European languages: English, Hindi, German and French. We have taken all steps to ensure that the data is collected and annotated by ethical means. The source sentences of our dataset are extracted from Goel et al. (2023), which releases the data under the CC BY 4.0 license, allowing us to remix, transform, and build upon the material for any purpose. We also follow a stringent data annotation protocol, with consent from the annotators, ensuring they are aware of the risks associated with data creation. We also mention the compensation paid to them for their contribution in section 3.3. Since this project is not sponsored by a federal body, we did not seek IRB approval for our work. However, attention is paid to the quality of our dataset, with flagged cases discussed extensively with annotators to ensure appropriate resolution (Appendix A). A thorough and extensive analysis of our corpus is performed, details of which are provided in section 3. All datasets used in conjunction with our corpus are open-source and cited appropriately in section 3.5. We understand that the dataset might contain some mistakes, and we will continuously monitor and resolve such issues once the corpus is published for open-source research. Our robust results across domains and types of sentences ensure that changes to the dataset do not pose technical issues or risks to either developers or model users.

A Flagged Cases in Data Creation
Data in DC is expensive and challenging to annotate. Language experts must not only have complete knowledge of the grammar and semantics of a language, but must also be mindful of the disfluency structure and the rules for token classification. In our work, we could only assign one annotator per language. To ensure that the data created is of high quality, we had regular discussions with our annotators to resolve flagged examples. Table 12 showcases such examples with the reasoning behind our data annotation decisions. After verifying the quality of the data created, we proceeded with data analysis as described in section 3.4.

B Additional Dataset Analysis
In this section, we provide information about the origin of the utterances that are part of the DISCO dataset. We use domain type labels from Goel et al. (2023). To better understand our data, we describe each domain type under a broader categorization. In each example, disfluent utterances are marked in red. The number of sentences in each domain type for each language is specified in Table 13.
• Health Fitness: Sentence pairs belonging to this domain consist of utterances where the user wants to perform a fitness or health checkup task like recording his/her exercises, nutrition or blood sugar. Any interaction where the user discusses any fitness query can be tagged in this category. Domains such as Get health stats, Log exercise, Log nutrition, Start exercise, Stop exercise, Pause exercise and Resume exercise fall under this type.
Example - Go to Fitbit and show me my um my blood sugar reading

• Order Status: Sentence pairs belonging to this domain consist of utterances where the user wants to check the status of an already placed order. The Check order status domain falls under this type.
Example -Check the status of um of my Poshmark order with FedEx.
• Finance: Sentence pairs belonging to this domain consist of utterances where the user wants to perform a finance task like checking stock market prices or getting information from a finance app. The domain Get security price falls under this type.
Example -I want to um check stock prices.
• Bill Payment or Purchase: Sentence pairs belonging to this domain consist of utterances where the user wants to complete a bill payment or is instructing the AI agent to purchase something for him/her. Domains such as Get bill, Pay bill, Get product, BuyEventTickets, GetGenericBusinessType and Order menu item fall under this type.
Example -Pay my um my phone bill for this month.
• Internal Task: Sentence pairs belonging to this domain consist of utterances where the user wants the AI agent to perform a task that does not involve any extra application. Examples include sending a message/email to someone, cancelling some plan, taking notes, etc. If a third-party application is used in the utterance, it is an "External Task"; if not, it is an "Internal Task". Domains such as Get message content, Add contact, Create note, Open app, Take photo and Add item to list fall under this type.
Example -I want to e-mail Zane this photo and cc um and cc Zach.
• External Task: Sentence pairs belonging to this domain consist of utterances where the user wants the AI agent to perform a task with the help of a third-party application. In this domain, the user specifies which application the AI agent should use to complete the task. Domains such as Cancel ride, Order ride and Post message fall under this type.
Example -Use WhatsApp to to send location to Jim.
We also depict word clouds of disfluent sentences across the four languages, showing the most common disfluent words in each. Since the Filler class occupies a majority of the distribution in each language, filler words like um, uh, er, and umm occupy a considerable portion of the English cloud. Similarly, common fillers in Hindi, German and French are the biggest words in the respective word clouds (Figure 3).
Correlation analysis between original Hindi and German sentences and their respective English translations was also performed to ensure that the number of outliers was minimal and that the points followed a natural straight-line trend. Figure 4 depicts the straight-line scatter plots observed.
The DISCO corpus contains a good representation of shorter and longer disfluent sentences in each language, increasing the complexity of the corrections needed. Figure 5 depicts box plots of disfluent sentence lengths, indicating the average length of spoken utterances across the four languages. These plots summarize our analysis and motivate us to test this dataset across various ML models (section 4).

C Domain Level Analysis of DC Results
We show the domain-wise performance of our DC models in each language in Table 14. The best models reach 100 per cent accuracy for many domain types. However, there are still some domains, such as Log exercise, Get bill and Log nutrition, where performance varies significantly across languages. Our results show robust performance across domains as well as disfluency types (section 5.1).

D Evaluating models trained on DISCO using other DC datasets
Models trained on DISCO performed competitively on test sets from other DC datasets.

E More examples of DC inference
This section shows more examples of inferences from our best models across all four languages: English (Table 16), Hindi (Table 17), German (Table 18) and French (Table 19). Since Hindi sentences are written in the Devanagari script, we provide transliteration and gloss for every example discussed. The strong results of our models on DC motivate us to test their performance for downstream MT improvement (Section 5.2).

F Setup for downstream DC-MT experiments and domain level results
Disfluency correction has been studied predominantly as a post-editing task for downstream problems. This section discusses essential experiments in understanding the impact of disfluencies on downstream machine translation. We work with the Hindi-English and German-English language pairs and use the NLLB MT system (Costa-jussà et al., 2022). The following steps are followed:

• Disfluent sentences are passed through the MT system to create translations T1.

• Automatic disfluency-corrected sentences (using our best DC models) are passed through the MT system to create translations T2.

• Human-corrected sentences (as provided by our annotators) are passed through the MT system to create translations T3.

• We denote ground truth translations as GT. T1 is compared with GT to calculate BLEU_DIS. Similarly, BLEU_ADC is the score between T2 and GT, and BLEU_HDC is the score between T3 and GT.
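To make the comparison concrete, the sketch below implements a bare-bones sentence-level BLEU (the reported scores come from sacrebleu, which additionally handles tokenization, smoothing and corpus-level aggregation); the example sentences are illustrative, not from the dataset:

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions with a brevity
    penalty; a bare-bones illustration of BLEU, not sacrebleu."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hyp, n)
        clipped = sum((hyp_ngrams & ngram_counts(ref, n)).values())
        total = max(sum(hyp_ngrams.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)

gt = "resume the paused running timer in my smart watch"
t1 = simple_bleu("running in watch ahhh " + gt, gt)  # disfluent input translation
t2 = simple_bleu(gt, gt)                             # disfluency-corrected translation
assert t2 > t1  # removing disfluencies raises BLEU against the ground truth
```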

Figure 2: Distribution of sentences across disfluency types for all four languages in DISCO.
[Table 9 example rows: T1 ("Running in the Watch Ahh Smart Watch Restart the Running Time that has been blocked in the Watch") is difficult to interpret due to the translated false-start phrase, whereas T2 ("Resume the running timer that is blocked in the smartwatch") and T3 ("Resume stopped running timer in smart watch.") are fluent. In another example ("Stop running 5 miles now."), incomplete disfluency correction leaves the Hindi word बस, whose ambiguous translation ("bus" or "stop") causes an issue.]

Figure 3: Word cloud of disfluent sentences across each language in the DISCO corpus, showcasing the most common disfluent words observed in spoken utterances.

Figure 5: Box plot of disfluent sentence lengths across all languages in the DISCO corpus.

Table 3 summarizes the total amount of data created and the amount of disfluency present in the corpus.

Table 5: Results in DC for each language. For Seq-GAN-BERT, we report the best results with helper datasets (section 3.5): English (En): LARD, Hindi (Hi): Samanantar, German (De): GermEval 2014, French (Fr): DiaBLa. Since we are the first to create a DC corpus in German and French, and since existing English and Hindi datasets are vastly different in their properties and sources, we do not provide zero-shot metrics of our best models on other datasets.

Our model performs robustly across all domain types of utterances. Readers are strongly urged to refer to Appendix C for a domain-level analysis of DC results. Although existing DC datasets cover diverse domains, our experiments show that models trained on DISCO perform competitively on test sets from other DC datasets (Appendix D).
Table 6: F1 scores for every disfluency type in each language using our best DC model. *We report the F1 score of the fluent class here because, for the disfluent class, the number of true positives is zero.

Table 9: Examining some examples where disfluencies impact machine translation output for German-English (De-En) and Hindi-English (Hi-En) language pairs.

Table 11: Comparing the performance of DC systems trained on different datasets on the Hindi-English DISCO MT improvement task when used with a state-of-the-art MT system (NLLB).

Table 11 highlights that MuRIL trained on DISCO leads to a 3.66 BLEU score improvement relative to the baseline.

English sentences and the changes in BLEU score observed with and without DC are given in Table 20 and Table 21, respectively. Some important examples where DC impacts downstream MT are discussed in section 5.2.1.

Table 14: F1 scores for disfluency correction for every domain type in each language.

Table 15: Test F1 scores of a transformer-based model trained on the DISCO dataset compared to other open-source datasets. For LARD, our model performs comparably even on out-of-domain synthetic test sets.

Table 16: Inference examples from English DC models; En-English.

Table 17: Inference examples from Hindi DC models; Hi-Hindi.

Table 18: Inference examples from German DC models; De-German.