MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup

Current disfluency detection models focus on individual utterances, each from a single speaker. However, many discontinuity phenomena in spoken conversational transcripts occur across multiple turns, hampering human readability and the performance of downstream NLP tasks. This study addresses these phenomena by proposing an innovative Multi-Turn Cleanup task for spoken conversational transcripts and collecting a new dataset, MultiTurnCleanup. We design a data labeling schema to collect a high-quality dataset and provide extensive data analysis. Furthermore, we leverage two modeling approaches for experimental evaluation as benchmarks for future research.


Introduction
Spontaneous spoken conversations contain interruptions such as filled pauses, self-repairs, etc. (Shriberg, 1994). These phenomena act as noise that hampers human readability (Adda-Decker et al., 2003) and the performance of downstream tasks such as question answering (Gupta et al., 2021) or machine translation (Hassan et al., 2014) on transcripts of human spoken conversations. State-of-the-art disfluency detection methods (Yang et al., 2020; Jamshid Lou and Johnson, 2020) identify and remove disfluencies in order to improve the readability of spoken conversational transcripts (Wang et al., 2022; Chen et al., 2022). For instance, Figure 1(a) shows that disfluency detection methods can remove self-repairs within single turns. However, these models focus on removing interruptions and errors that commonly occur within single-turn utterances and cannot handle discontinuities across multiple turns. For example, in Figure 1(b), speaker B is in the middle of a thought when speaker A interrupts to signal that they are following along ("A: Exactly"). B continues their train of thought ("B: Just in the last little while. Because...") by paraphrasing their own last sentence ("...just in the last generation."). The result is an exchange that is longer and more difficult to follow than necessary to understand what B is conveying.

Figure 1: A comparison of (a) the existing Disfluency Detection task (yellow highlights indicate disfluencies) with (b) the proposed Multi-Turn Cleanup task (red highlights indicate multi-turn cleanups) for spoken conversational transcripts.
This paper aims to "clean up" spoken conversation transcripts by detecting these types of multi-turn "discontinuities" inherent in spontaneous spoken conversations. Once detected, they can be removed to produce transcripts that look more like written conversations conducted over text messaging, social media, or e-mail, as shown in Figure 1(b). Given that this is a novel task, with no pre-existing labeled data or benchmarks, we first define a taxonomy of non-disfluency discontinuities (see Table 1). Then we collect a new dataset, MultiTurnCleanup, for the Multi-Turn spoken conversational transcript Cleanup task, based on the Switchboard Corpus (Godfrey et al., 1992), and label it according to the proposed taxonomy. Finally, we develop two baseline models to detect these discontinuities, which we evaluate as benchmarks for future Multi-Turn Cleanup studies. Our data analysis suggests that the MultiTurnCleanup dataset is of high quality. We believe it will help facilitate research in this under-investigated area.

Data Collection and Analysis
We propose an innovative Multi-Turn Cleanup task and collect a novel dataset for this task called MultiTurnCleanup. This section presents the task definition, data collection process, and analysis.

Task Definition
Compared with the existing disfluency detection task, which aims to detect disfluencies (e.g., self-repairs, repetitions, restarts, and filled pauses) that commonly occur within single-turn utterances (Rocholl et al., 2021; Chen et al., 2022), the Multi-Turn Cleanup task requires identifying discontinuities both within a single turn and across multiple turns in multi-party spoken conversational transcripts. To explicitly define the task and the discontinuity taxonomy, we conducted an in-depth analysis of the Switchboard corpus (Godfrey et al., 1992). Specifically, we randomly sampled a subset of Switchboard conversations, annotated the discontinuity spans beyond the existing disfluency types, and grouped the annotated discontinuities into five main categories. Note that we conducted the discontinuity annotation and category grouping process iteratively with all authors to reach consensus. We present the finalized taxonomy of discontinuities in Table 1.
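To make the task format concrete, the following minimal sketch illustrates one plausible per-token labeling scheme; the example transcript, labels, and helper function are our own illustration rather than the dataset's actual schema.

# Minimal illustration of the Multi-Turn Cleanup task format. The
# transcript and labels are invented for illustration only.
turns = [
    ("B", "Just in the last little while ."),
    ("A", "Exactly ."),
    ("B", "Just in the last generation ."),
]
# One binary label per token: 1 = part of a multi-turn discontinuity
# (to be removed), 0 = keep.
labels = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1],
    [0, 0, 0, 0, 0, 0],
]

def cleanup(turns, labels):
    """Drop labeled tokens and any turns left empty."""
    cleaned = []
    for (speaker, text), turn_labels in zip(turns, labels):
        kept = [t for t, y in zip(text.split(), turn_labels) if y == 0]
        if kept:
            cleaned.append((speaker, " ".join(kept)))
    return cleaned

print(cleanup(turns, labels))  # [('B', 'Just in the last generation .')]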

Data Preprocessing
We preprocessed the Switchboard corpus by automatically removing the single-turn disfluencies with pre-defined rules based on the Treebank-3 disfluency annotations.
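As a rough sketch of what such rule-based preprocessing could look like (the paper does not list the exact rules, so the patterns below are our assumption based on Switchboard's Treebank-style disfluency markup, in which "[ reparandum + repair ]" marks a self-repair and "{F ...}" marks a filler):

import re

def strip_disfluencies(utterance: str) -> str:
    # Hypothetical rule set; not the authors' exact implementation.
    # Drop the reparandum of a self-repair: "[ I was, + I am ]" -> "I am".
    utterance = re.sub(r"\[[^\[\]+]*\+([^\[\]]*)\]", r"\1", utterance)
    # Drop filled pauses / discourse markers: "{F uh, }", "{D you know}".
    utterance = re.sub(r"\{[A-Z]\s[^{}]*\}", " ", utterance)
    # Normalize leftover whitespace.
    return re.sub(r"\s+", " ", utterance).strip()

print(strip_disfluencies("[ I was, + {F uh, } I am ] going there ."))
# -> "I am going there ."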

Efficient Schema for High-quality Data Collection
We further propose a four-step data labeling schema to label both the multi-turn cleanups and their categories.

Labeling Procedure
Given the preprocessed data, we then conducted the human annotation process based on the data labeling schema shown in Figure 2. (Steps 3 and 4 lasted about one month; more annotation quality control details are available in Appendix A.3.)

Preparation and qualification selection. In steps 1 and 2, we prepared a suite of data preprocessing and user interface (UI) variations and conducted seven pilot studies to select the optimal task design. The final UI (see Appendix A.4) consists of: i) an introduction to the task, ii) an annotation example with highlighted discontinuities, and iii) the task workspace with affordances for annotation. In step 3, we recruited a set of qualified MTurk workers using a "Qualification HIT". We compared all 580 workers' submissions with the ground truth (the authors' consensus) and selected the 222 workers (38.3%) with an F1 ≥ 0.3 (a threshold we consider reasonable given the task's subjectivity) to participate in step 4.

Large-scale data labeling. Controlling annotation quality for large-scale data labeling is challenging on MTurk (Daniel et al., 2018). To address this, we employed batch-wise labeling with a quality checkpoint filter (Bragg and Weld, 2016), sketched below. Specifically, we split the dataset into small batches and posted them with "Quality Checkpoint HITs" (QCHs) mixed in. Overall, we posted 22 batches comprising 7277 HITs and 11 QCHs in total. We leverage these checkpoint HITs to exclude unqualified workers (F1 < 0.3).

Annotation filtering and aggregation. After finishing the final batch, we collected all annotated batches and excluded 72 unqualified workers along with all their HITs. We then reposted the 26% of assignments whose conversations had fewer than two annotations to the remaining qualified workers. Finally, we aggregated the annotations for each turn by keeping only the best worker's (highest F1 score) labels to compose the MultiTurnCleanup dataset. The average F1 for raters of labeled turns in MultiTurnCleanup is 0.57. We summarize the per-category statistics in Table 1 and the overall statistics of MultiTurnCleanup in Table 2. (We leave out the sw4[2-4]* subgroups in Switchboard as they are less commonly used, resulting in 1082 total conversations.)
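The checkpoint filter can be summarized with the following sketch (the data layout and names are hypothetical): each worker's annotations on the embedded checkpoint HITs are scored against the authors' gold labels, and workers whose average F1 falls below the 0.3 threshold are excluded together with their HITs.

F1_THRESHOLD = 0.3

def token_f1(pred: set, gold: set) -> float:
    """Per-token F1 between predicted and gold cleanup-token indices."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def filter_workers(checkpoint_answers, gold_labels):
    """checkpoint_answers: {worker_id: {hit_id: set of token indices}};
    gold_labels: {hit_id: set of token indices}. Returns qualified ids."""
    qualified = set()
    for worker, answers in checkpoint_answers.items():
        scores = [token_f1(pred, gold_labels[hit])
                  for hit, pred in answers.items()]
        if scores and sum(scores) / len(scores) >= F1_THRESHOLD:
            qualified.add(worker)
    return qualified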

Validating Human Annotation Accuracy
During the whole data labeling process, we consistently assessed human annotation accuracy and filtered out unqualified workers to control the data quality. We visualize the annotation quality in terms of the distribution of workers' F1 scores (see Figure 3(A)(B), left), as well as the correlation between each worker's F1 score and their finished HIT count and average elapsed time per HIT (see Figure 3(A)(B), right). These figures show how removing unqualified annotations at checkpoints can effectively control quality during the annotation process. In particular, we observe that at the start (A), even after passing our initial "Qualification HIT" in step 3, 23% of workers perform at F1 < 0.3 but complete over 80% of all assignments, leaving only a limited amount of data for more competent workers.

Figure 3: For each stage, we plot the participating workers' F1 score distribution (left) and the correlation between each worker's F1 score and finished HIT count (right); circle size indicates each worker's average elapsed time per HIT.

Turn-based Inter-Rater Reliability
We compute Inter-Rater Reliability using Fleiss' Kappa (Fleiss and Cohen, 1973) for each annotated turn and average the scores over all turns. Table 3 shows that the workers' Fleiss' Kappa scores are comparable to those of the authors.
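A minimal sketch of this computation using statsmodels, assuming binary per-token labels; the input layout here is our assumption.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def mean_turn_kappa(turns):
    """turns: list of (n_tokens x n_raters) arrays of 0/1 labels."""
    kappas = []
    for labels in turns:
        table, _ = aggregate_raters(labels)  # tokens x category counts
        kappas.append(fleiss_kappa(table, method="fleiss"))
    return float(np.mean(kappas))

# Example: one five-token turn annotated by three raters.
turn = np.array([[0, 0, 0], [1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
print(mean_turn_kappa([turn]))  # ~0.73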

Multi-turn Cleanup Models
Given the collected MultiTurnCleanup dataset, we leverage two different BERT-based modeling approaches, a two-stage model and a combined model, for the Multi-Turn Cleanup task to remove both single-turn disfluencies and multi-turn discontinuities.

The Two-Stage Model
The two-stage model is composed of a Single-Turn Detector (STD) that removes the traditional single-turn disfluencies and a successive Multi-Turn Detector (MTD) that removes the discontinuities occurring across multiple turns. We employ the BERT-based model presented in Rocholl et al. (2021) for both the STD and MTD stages. In particular, we fine-tune the STD on the traditional single-turn disfluency dataset (Godfrey et al., 1992), whereas the MTD is fine-tuned on our collected MultiTurnCleanup dataset. We concatenate the STD and MTD successively into the two-stage pipeline, so that both the single-turn disfluencies and the multi-turn discontinuities in a raw conversational transcript can be removed in one pass.
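A sketch of this pipeline is shown below, using Hugging Face token classifiers as stand-ins for the fine-tuned BERT detectors; the checkpoint paths are placeholders, and this is not the authors' released code.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
std = AutoModelForTokenClassification.from_pretrained("path/to/std")  # stage 1
mtd = AutoModelForTokenClassification.from_pretrained("path/to/mtd")  # stage 2

def remove_flagged(text: str, model, max_length: int) -> str:
    """Delete every token the classifier labels as a discontinuity."""
    enc = tokenizer(text, truncation=True, max_length=max_length,
                    return_tensors="pt")
    with torch.no_grad():
        preds = model(**enc).logits.argmax(-1)[0]
    kept = [tok for tok, y in zip(enc.tokens(), preds.tolist())
            if y == 0 and tok not in ("[CLS]", "[SEP]")]
    return tokenizer.convert_tokens_to_string(kept)

def two_stage_cleanup(turns):
    # Stage 1: per-slash-unit disfluency removal (max length 64).
    redacted = [remove_flagged(t, std, max_length=64) for t in turns]
    # Stage 2: multi-turn cleanup over the redacted transcript, with
    # turns joined by [SEP] (max length 512; chunking omitted here).
    return remove_flagged(" [SEP] ".join(redacted), mtd, max_length=512)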

The Combined Model
We design the combined model, using only one BERT-based detector (Rocholl et al., 2021), to simultaneously remove both single-turn disfluencies and multi-turn discontinuities. To this end, we create a UnionDiscontinuity dataset, which combines the single-turn disfluency labels from Godfrey et al. (1992) with the multi-turn discontinuity labels from our MultiTurnCleanup dataset. We then obtain the combined model by fine-tuning the detector on this UnionDiscontinuity dataset.
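Constructing the UnionDiscontinuity labels amounts to a per-token OR of the two label sources; a minimal sketch (variable names are ours):

def union_labels(disfluency_labels, multiturn_labels):
    """Per-token OR of single-turn disfluency and multi-turn cleanup
    labels over the same tokenized transcript, collapsed into one
    non-distinctive 'remove' class."""
    assert len(disfluency_labels) == len(multiturn_labels)
    return [int(a or b) for a, b in zip(disfluency_labels, multiturn_labels)]

print(union_labels([1, 0, 0, 1], [0, 0, 1, 1]))  # [1, 0, 1, 1]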

Experimental Setup
The Two-Stage Model. The STD and MTD are trained separately. We train the STD on the existing disfluency dataset, where the input is a single sentence (i.e., a slash unit), with a maximum sequence length of 64. In comparison, we train the MTD on the MultiTurnCleanup dataset, where the input consists of multiple slash units (demarcated with a [SEP] token between turns) with a maximum sequence length of 512. We feed full transcripts to the MTD in chunks with a 50% overlap to provide prediction context, and we predict a discontinuity wherever either of the overlapping predictions for a given token is positive (see the sketch below). During inference, the stage-2 MTD module loads the outputs from the stage-1 STD module, removes all tokens classified as disfluencies, and uses this redacted text as its own input.

The Combined Model. We train the combined model on the UnionDiscontinuity dataset using the same training settings as the aforementioned MTD module. During inference, we predict both single-turn and multi-turn discontinuities simultaneously, as non-distinctive labels.

Baseline. We employ the state-of-the-art BERT-based disfluency detection model (Rocholl et al., 2021), trained on the widely used disfluency dataset (Godfrey et al., 1992), as the Baseline.

Deployment. We train the models on Google's AutoML platform, which selects the optimal training settings: the Adam optimizer with a learning rate of 1e-5, a batch size of 8, and 1 epoch.
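The 50%-overlap chunking and prediction merging for the MTD can be sketched as follows (the window bookkeeping details are our assumption): a token is flagged if any overlapping window predicts it as positive.

def chunk_spans(n_tokens: int, window: int = 512):
    """Split token positions into windows with 50% overlap."""
    stride = window // 2
    return [(s, min(s + window, n_tokens))
            for s in range(0, max(n_tokens - stride, 1), stride)]

def merge_window_predictions(n_tokens: int, window_preds):
    """window_preds: list of ((start, end), per-token 0/1 predictions).
    A token is positive if any window's prediction for it is positive."""
    flags = [0] * n_tokens
    for (start, end), preds in window_preds:
        for i, y in zip(range(start, end), preds):
            flags[i] = flags[i] or y
    return flags

spans = chunk_spans(10, window=8)  # [(0, 8), (4, 10)]
preds = [(spans[0], [0, 1, 0, 0, 0, 0, 1, 0]), (spans[1], [0, 0, 1, 0, 0, 0])]
print(merge_window_predictions(10, preds))  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]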

Evaluation Metrics
We evaluate all models' performance with per-token Precision (P), Recall (R), and F1 score (F1) on predicting whether each token should be cleaned up as a single-turn disfluency (the STD of the two-stage model), a multi-turn discontinuity (the MTD of the two-stage model), or either (the combined model).
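For reference, the per-token metrics reduce to the standard definitions over the "remove" class:

def token_prf(pred, gold):
    """Per-token precision, recall, and F1 over the 'remove' class
    (label 1), given aligned 0/1 prediction and gold sequences."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(token_prf([1, 0, 1, 1], [1, 0, 0, 1]))  # (0.667, 1.0, 0.8)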

Results
Evaluation on two sub-tasks. The Multi-Turn Cleanup task inherently involves two different sub-tasks: single-turn disfluency detection (i.e., with the Disfluency dataset) and multi-turn discontinuity detection (i.e., with our collected MultiTurnCleanup dataset). We first validate that the presented models achieve state-of-the-art performance on the two sub-tasks (i.e., on the two different datasets), respectively. In particular, Table 4 shows the performance of the Baseline and the presented models on the two datasets. The STD module achieves cutting-edge performance (Chen et al., 2022) in detecting single-turn disfluencies. The MTD module also outperforms the Baseline in detecting multi-turn discontinuities on our proposed MultiTurnCleanup dataset. The significant disparity between the MTD and Baseline methods (e.g., 56.8 vs. 15.5 in F1) also indicates the difficulty of detecting multi-turn discontinuities in the MultiTurnCleanup dataset.
Evaluation on removing all discontinuities. Furthermore, we evaluate the overall model performance on jointly detecting single-turn disfluencies and multi-turn discontinuities in one pass, based on the UnionDiscontinuity dataset. As shown in Table 5, both the proposed Two-Stage Model and the Combined Model outperform the Baseline method. In addition, the Combined Model achieves a 6.7 higher F1 score than the Two-Stage Model on the Multi-Turn Cleanup task.

Related Work
Recent disfluency detection studies develop BERT-based models (Bach and Huang, 2019; Rocholl et al., 2021; Rohanian and Hough, 2021) and show significant improvements over LSTM-based models (Zayats et al., 2016; Wang et al., 2016; Hough and Schlangen, 2017) on disfluency detection tasks. Prior studies also show the importance of data augmentation methods that leverage extra transcript sources to improve disfluency detection performance (Jamshid Lou and Johnson, 2017, 2020). While most research has focused on improving single-turn disfluency detection accuracy, little exploration has been done on detecting multi-turn transcript discontinuities.
Obtaining reliably annotated datasets via crowdsourcing is challenging and expensive (Alonso et al., 2014; Wong et al., 2022; Northcutt et al., 2021). To collect a high-quality dataset for the Multi-Turn Cleanup task, this work designs a data labeling schema that efficiently gathers qualified annotations via MTurk.

Limitation and Conclusion
We are aware that, in some specific scenarios, it might be undesirable to remove some multi-turn discontinuities because they convey social meaning in human interactions (e.g., engagement). We address this issue by providing category labels. As a result, future research can flexibly select subsets of the MultiTurnCleanup labels to train the model and clean up multi-turn discontinuities.
This study defines an innovative Multi-Turn Cleanup task and collects a high-quality dataset for this task, named MultiTurnCleanup, using our presented data labeling schema. We further leverage two modeling approaches for experimental evaluation as benchmarks for future research.

Acknowledgments
We thank everyone who provided constructive feedback on this study. We thank the Amazon MTurk workers for their excellent annotations. We thank the reviewers for their thoughtful comments.

Ethics Statement
The collected MultiTurnCleanup dataset is built upon the published Switchboard Corpus (Godfrey et al., 1992). The dataset is sufficiently anonymized that it is impossible to identify individuals. In addition, we protect privacy during the data collection process through the MTurk platform: the posted dataset does not list any identifying information about the MTurk workers, and the data collection process does not access any demographic or confidential information (e.g., identification, gender, race) from them. In general, the dataset can be safely used, with low risk, in research and applications for cleaning up spoken conversations and speech transcripts.