Disfluency Correction using Unsupervised and Semi-supervised Learning

Spoken language differs from written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. A disfluency correction system that converts disfluent text to fluent text is therefore of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text, drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable gains in performance when utilizing a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, which further improves to 85.28 with semi-supervision. Both are comparable to two competitive fully-supervised models.


Introduction
Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filler pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence "well we're actually uh we're getting ready" has the fluent form "we're getting ready". Here, "well", "uh", and "we're actually" are instances of discourse, filler, and restart disfluencies, respectively.
Disfluencies in text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pretrained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). Inspired by recent work on unsupervised machine translation and style-transfer models for text, we propose an unsupervised encoder-decoder based model to tackle the problem of disfluency correction. Our model does not require access to a parallel corpus of disfluent and fluent sentences. We also present a semi-supervised variant of our model that uses a small amount of parallel disfluent-fluent text and significantly improves performance. To our knowledge, this is the first work to use state-of-the-art unsupervised models for the task of disfluency correction. Our main contributions are as follows:
• We cast the problem of disfluency correction as one of translation from disfluent to fluent text, and we propose an unsupervised transformer-based encoder-decoder model for disfluency correction.
• We compare and contrast an unsupervised and semi-supervised approach for disfluency correction, where the latter has access to a small amount of parallel text. We also implement fully-supervised methods as a skyline and show how our models come very close in performance to these approaches, which are very resource-intensive and require large amounts of parallel text.
• We show detailed ablation analyses across disfluency types and present a qualitative study of the disfluency corrections that our model can achieve.

Figure 1: Illustration of (a) the style transfer model modified to use a type embedding drawn from a pretrained CNN classifier, and (b) conditioning on domain embeddings in the transformer's decoder. Pred(i) and Input(i) are the decoder's prediction and the input to the decoder at the i-th time-step, respectively.

Related work
Current literature has primarily focused on disfluency detection in both speech and text in fully supervised settings (Wang et al., 2016; Georgila et al., 2010; Zayats et al., 2014; Tran et al., 2019; Wang et al., 2018; Bach and Huang, 2019; Zayats et al., 2016; Lou and Johnson, 2020a). The grammatical error correction approach of Omelianchuk et al. (2020) does not perform well on disfluency correction tasks. In most cases, simply removing disfluencies from an utterance can render the sentence ill-formed. More meaningful and syntactically well-formed utterances are generated by performing automatic disfluency removal from speech (Kaushik et al., 2010; Lou and Johnson, 2020b) and text (Honal and Schultz, 2005; Hassan et al., 2014). With the popularity of end-to-end spoken translation systems, several works translate fluent utterances from disfluent speech (Salesky et al., 2018; Ansari et al., 2020; Fukuda et al., 2020) or disfluent text (Cho et al., 2013; Saini et al., 2020; Cho et al., 2016). Most of these approaches work in a supervised setting or mitigate the lack of parallel disfluent-fluent text via data augmentation, model design, incorporating domain knowledge of the language, or using multilingual NMT. Salesky et al. (2019) propose a system for conversational speech translation with joint removal of disfluencies.

Our Approach
We draw inspiration from unsupervised neural machine translation models (Lample et al., 2017) and style transfer models (He et al., 2020) to design the disfluency correction model illustrated in Figure 1a. It consists of a single encoder and a single decoder, used to translate in both directions, i.e., from disfluent to fluent text and vice versa. The decoder is additionally conditioned on a domain embedding that conveys the direction of translation, signifying whether the input to the encoder is a fluent or a disfluent sentence. More details about our framework are described below.

Figure 1a shows the two directions of translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. We employ a backtranslation-based objective, followed by reconstruction for both domains, i.e., disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 1a), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.

Borrowing from prior work on unsupervised style transfer (He et al., 2020), the decoder is conditioned on a domain embedding that specifies the direction of translation. In this work, we employ two types of embeddings: a vanilla binary domain embedding that takes a bit as input to indicate whether the input text is fluent or disfluent, and a classifier-based domain embedding. The latter is obtained from a trained standalone CNN-based classifier (Kim, 2014) that predicts the disfluency type of a disfluent input sentence. (Here, we assume that disfluency type labels are available for the disfluent sentences in our training data.)
The classifier's penultimate layer acts as our classifier embedding, which is further used to condition the decoder. We hypothesize that additional information about disfluency types via the classifier-based embedding might help guide the process of disfluency correction better.
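As a concrete illustration of the training objective described above, the following is a minimal NumPy sketch of the token-level cross-entropy reconstruction loss computed after backtranslation. The function name and toy vocabulary are ours for illustration; the actual implementation operates on batched decoder logits rather than explicit probability rows.

```python
import numpy as np

def reconstruction_loss(probs, target_ids):
    """Sum of token-level cross-entropy between the decoder's output
    distribution at each time-step and the original input tokens.

    probs:      (T, V) array, one probability distribution per time-step
    target_ids: length-T sequence of gold token indices (the input sentence)
    """
    eps = 1e-12  # numerical safety for log(0)
    return -sum(np.log(probs[t, tok] + eps) for t, tok in enumerate(target_ids))

# Toy vocabulary of size 3 and a two-token "sentence" [0, 2]:
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
loss = reconstruction_loss(probs, [0, 2])  # -(log 0.7 + log 0.8)
```

In training, `probs` would come from decoding the backtranslated sentence (x̄ or ȳ), so minimizing this loss forces the round trip disfluent → fluent → disfluent (and vice versa) to reproduce the input.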

Unsupervised Disfluency Correction
Furthermore, similar to the noise models adopted by He et al. (2020) and Lample et al. (2017), a randomly sampled noisy version of every sentence in the input mini-batch is fed to the model, forcing it to behave like a denoising autoencoder. We use noise perturbations (Lample et al., 2017) in the form of word-shuffle (α), word-blank (β), and word-dropout (γ) operations.
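The three perturbations can be sketched in a few lines of pure Python. This follows the noise model of Lample et al. (2017): a shuffle that displaces each word by at most α positions, blanking each word with probability β, and dropping each word with probability γ; the exact function names and the fallback for fully dropped sentences are our own choices.

```python
import random

def word_shuffle(tokens, alpha):
    """Shuffle tokens, displacing each by at most `alpha` positions
    (the bounded shuffle of Lample et al., 2017)."""
    keys = [i + random.uniform(0, alpha) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]

def word_blank(tokens, beta, blank="<blank>"):
    """Replace each token with a placeholder with probability `beta`."""
    return [blank if random.random() < beta else tok for tok in tokens]

def word_dropout(tokens, gamma):
    """Drop each token independently with probability `gamma`
    (keeping at least one token so the sentence never becomes empty)."""
    return [tok for tok in tokens if random.random() >= gamma] or tokens[:1]

sent = "well we're actually uh we're getting ready".split()
noisy = word_dropout(word_blank(word_shuffle(sent, 3), 0.2), 0.1)
```

With the transformer settings reported later (α=3, β=0.2, γ=0.1), each training sentence is lightly scrambled, masked, and thinned before encoding.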
We explore two choices to implement our encoder-decoder modules: 1) BiLSTM-based (Bahdanau et al., 2015) and 2) Transformer-based (Vaswani et al., 2017). For the BiLSTM model, as proposed by He et al. (2020), the BOS vector, i.e., the input to the decoder at the first time-step, is replaced by the domain embedding. In the Transformer model, this conditioning needs to be done more carefully. Figure 1b illustrates how we condition the transformer-based decoder: word embeddings (with their dimensionality reduced) are concatenated with the domain embedding (denoted by DE) at every time-step to form the input to the decoder.
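A minimal NumPy sketch of this per-time-step conditioning is given below. The specific split (reducing 1024-dim word embeddings to 512 so that concatenation with the 512-dim DE restores the model width) and the learned reduction matrix `W_reduce` are our assumptions; the paper only specifies the two embedding dimensionalities.

```python
import numpy as np

d_model, d_domain = 1024, 512      # word-embedding and DE sizes from the paper
rng = np.random.default_rng(0)

# Hypothetical learned projection reducing word embeddings to d_model - d_domain,
# so the concatenated decoder input keeps the original width.
W_reduce = rng.normal(size=(d_model, d_model - d_domain))
domain_emb = rng.normal(size=(d_domain,))  # DE: encodes fluent vs. disfluent direction

def decoder_input(word_emb):
    """Form one decoder time-step input: reduced word embedding ++ DE."""
    reduced = word_emb @ W_reduce           # (d_model - d_domain,)
    return np.concatenate([reduced, domain_emb])

x = rng.normal(size=(d_model,))             # one word embedding
step = decoder_input(x)                     # shape (d_model,)
```

Because the DE occupies a fixed slice of every decoder input, the direction of translation is visible to the decoder at every time-step, not just at BOS as in the BiLSTM variant.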

Semi-Supervised Disfluency Correction
Our unsupervised disfluency correction model can be easily fine-tuned using small amounts of parallel text, when available, lending itself to semi-supervised learning. The encoder-decoder modules are initialized using the unsupervised training described in the previous section and further fine-tuned with a supervised cross-entropy loss using small amounts of parallel disfluent-fluent text. We do not use domain embeddings during semi-supervised training; inference is done as in the unsupervised model, i.e., with domain embeddings.

Experiments and Results
In this work, we use the Switchboard corpus (Godfrey et al., 1992), which includes telephonic conversations and their disfluency annotations (Schriberg, 1994; Zayats et al., 2014).

Implementation Details
Our BiLSTM model uses a single layer of recurrent units of hidden size 750 with max-pooling over a window size of 5. The noise perturbation parameters α, β, γ were tuned on the validation set and set to 0. The model was trained for 15 epochs (with 10 for annealing) using mini-batches of size 32, with the Adam optimizer (Kingma and Ba, 2015) and a learning rate of 0.01, linearly scheduled with decrements of 0.5. Empirically, we also found it essential to allow gradients to pass through the backtranslations to generate meaningful sentences.

The transformer model uses 8 attention heads, with word embedding and domain embedding dimensionalities of 1024 and 512, respectively. The noise perturbation parameters α, β, γ are set to 3, 0.2, and 0.1, respectively. The Adam optimizer is used with an initial learning rate of 0.00001, a linear scheduler, and 10 warm-up steps. We used mini-batches of size 32. Dropout (Gal and Ghahramani, 2016) and label-smoothing (Szegedy et al., 2016) values were 0.3 and 0.1, respectively.

Table 1 shows BLEU and METEOR scores between the gold fluent text and the disfluency-corrected output from five different models. We train two fully supervised skylines, based on Seq2Seq (Sutskever et al., 2014) and BART (Lewis et al., 2019), to compare against our approaches. The BLEU score using the original disfluent text as the hypothesis is 71.53. The two supervised skylines use 55K pairs of parallel disfluent-fluent sentences during training and yield up to 90 BLEU. In comparison, the unsupervised approach yields up to 80 BLEU without any parallel data. Fine-tuning the unsupervised model with a small parallel corpus containing only 554 pairs (i.e., two orders of magnitude smaller than the complete set of 55K pairs) significantly bridges this gap and yields up to 85 BLEU. In terms of METEOR, the score using the original disfluent text as the hypothesis is 57.19.
The difference between the unsupervised and supervised approaches is much smaller here, indicating that with respect to adequacy or content preservation, these approaches perform on par. These results also show that the last few additional BLEU points (i.e., the difference between BART and SS) come at the high cost of creating a large parallel corpus. We obtain 77.34 and 77.97 BLEU on the dev and test sets, respectively, using binary embeddings, whereas the disfluency-type classifier embedding yields 78.72 and 76.90 on the dev and test sets. The classifier embeddings thus improve performance only marginally, and the BLEU scores obtained using the binary embeddings are almost comparable, which shows that our proposed model can effectively use non-parallel text without any disfluency type labels.

Results
Sentence Length: Figure 2 shows BLEU scores as a function of maximum sentence length on the test set. The BLEU score is highest for utterances shorter than ten tokens; on longer sentences, the BLEU scores drop. This trend is uniform across all models. Our transformer-based model significantly outperforms the BiLSTM-based model on utterances of all lengths. Interestingly, our semi-supervised approach performs very similarly to the fully supervised approach on shorter (<10 token) utterances.
Semi-supervised Learning: Table 2 shows the performance when our unsupervised model is fine-tuned with varying amounts of parallel text. With access to only 554 parallel pairs (i.e., 1% of all pairs), performance improves by an impressive 5.89 BLEU on the test set. While BLEU improvements increase monotonically with the amount of parallel text, we see a trend of diminishing returns soon after the 1% mark.
Performance Across Disfluency Types: Intuitively, certain types of disfluencies (e.g., fillers) are easier to correct than others (e.g., edits). Table 3 reports the BLEU scores from all our models across disfluency types. Conjunctions and discourse disfluencies mark the easy end of the disfluency correction spectrum, while edits and asides mark the challenging end. (Edits are also hard to correct because of the lack of training data.)
Qualitative Analysis: Table 4 shows example outputs from the five different models along with the corresponding disfluent and fluent sentences. All five models can remove simple disfluencies (e.g., fillers and discourse markers) in shorter sentences. Conjunctions and repetitions are removed by all models except the unsupervised BiLSTM model. The third example shows that the transformer model is much better than the BiLSTM model in terms of content retention and adequacy. It also highlights the better fluency of the semi-supervised model compared to the unsupervised one. The fourth example illustrates the increased complexity due to the presence of multiple disfluency types (conjunction, discourse, restart) within a single utterance. The fifth example illustrates a case of an aside, which is difficult for all models. It shows how even the supervised BART model fails to detect the disfluent phrase "i forgot sally's last name anyway". (Additional contextual information is required for the disfluent phrase to be correctly identified.)

Conclusion
We propose an unsupervised disfluency correction model, drawing motivation from prior work on unsupervised machine translation and style transfer. We investigate two kinds of domain embeddings for our model. We also present a semi-supervised disfluency correction approach: fine-tuning our model using only about 500 parallel sentences comes very close in performance (based on BLEU scores) to a state-of-the-art, fully supervised system. In future work, we intend to explore how these techniques can be integrated more closely with spoken translation.