FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

Stutter removal is an essential scenario in the field of speech editing. However, when the speech recording contains stutters, the existing text-based speech editing approaches still suffer from: 1) the over-smoothing problem in the edited speech; 2) a lack of robustness due to the noise introduced by stutters; 3) the need for users to manually determine the edited region before stutters can be removed. To tackle these challenges, we propose FluentSpeech, a stutter-oriented automatic speech editing model. Specifically, 1) we propose a context-aware diffusion model that iteratively refines the modified mel-spectrogram with the guidance of context features; 2) we introduce a stutter predictor module to inject the stutter information into the hidden sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE) dataset that contains spontaneous speech recordings with time-aligned stutter labels to train the automatic stutter localization model. Experimental results on the VCTK and LibriTTS datasets demonstrate that our model achieves state-of-the-art performance on speech editing. Further experiments on our SASE dataset show that FluentSpeech can effectively improve the fluency of stuttering speech in terms of both objective and subjective metrics. Code and audio samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.


Introduction
Recently, text-based speech editing (Jin et al., 2017, 2018; Morrison et al., 2021; Tan et al., 2021; Tae et al., 2021; Wang et al., 2022; Bai et al., 2022) has made rapid progress, and stutter removal is a critical sub-task in speech editing. There are various application scenarios for stutter removal, such as short-form videos, movies, podcasts, YouTube videos, and online lectures, since it provides great convenience for media producers.
Previous speech editing systems (Jin et al., 2017, 2018) successfully enable the user to edit a speech recording through operations in the text transcript. Some neural text-to-speech (TTS) based methods (Tan et al., 2021; Tae et al., 2021) achieve smooth transitions at the boundaries of the edited region. Most recently, mask prediction based methods (Wang et al., 2022; Bai et al., 2022) learn better contextual information from the input mel-spectrogram and outperform previous approaches in speech quality and prosody modeling. However, the existing approaches only aim at modifying reading-style speech, while removing stutters from spontaneous speech remains a considerable challenge.
When applied to the stutter removal task, previous efforts are still subject to the following limitations: 1) the generated mel-spectrogram is usually blurry and lacks frequency bin-wise details, resulting in unnatural sounds at the boundaries of the modified region; 2) when the speech recording is full of stutters, the edited speech is usually not robust due to the noise introduced by the discrepancy between the text and the stuttering speech content; 3) the stutter regions must be manually located one by one, which is costly and laborious for media producers.
To tackle these challenges, we propose FluentSpeech, the first generative model to solve the stutter removal task, which automatically detects the stutter regions, removes them, and generates fluent speech with natural details. Specifically, • Non-probabilistic models tend to generate over-smooth mel-spectrograms (Huang et al., 2022; Popov et al., 2021), while probabilistic models (e.g., GANs and diffusion models) generate mel-spectrograms with richer frequency details and natural sounds. Based on this observation, we adopt a context-aware diffusion model that utilizes rich contextual information to guide the diffusion and reverse processes, which helps FluentSpeech generate high-quality and expressive results.
• To improve the robustness against stuttering speech, we introduce a conditional stutter predictor that localizes the stutter region and injects the stutter information into the frame-level hidden sequence to reduce the discrepancy between the text and the stuttering speech. Moreover, the predicted stutter region can be utilized as the mask for automatic stutter removal.
• We propose a novel dataset called the stutter-oriented automatic speech editing (SASE) dataset, which contains spontaneous speech recordings with time-aligned stutter labels for automatic stutter removal.
Experiments on the VCTK (Yamagishi et al., 2019) and LibriTTS (Zen et al., 2019) datasets show that FluentSpeech outperforms state-of-the-art models on speech editing for reading-style speech with fewer model parameters. In experiments on our newly collected SASE dataset, FluentSpeech is considerably more robust against stuttering speech and significantly improves the fluency of stuttering speech. The main contributions of this work can be summarized as follows: • We analyze the characteristics of different speech editing approaches (e.g., algorithm, architecture, and alignment learning approaches) and propose a context-aware diffusion probabilistic model that achieves state-of-the-art performance on speech editing.
• We propose a stutter predictor module to improve the robustness against stuttering speech and localize the stutter region. The stutter predictor can also control the stutter representations by removing the stutters from spontaneous speech to improve its fluency, which solves the automatic stutter removal task for the first time.
• We contribute a novel SASE dataset which contains 40 hours of spontaneous speech crawled from online lectures or open courses given by 46 speakers. We will publish our model and dataset as a benchmark for the evaluation of future SASE algorithms.

Background
In this section, we describe the background of speech editing and the basics of diffusion models. We also review existing applications of diffusion models in speech tasks and analyze their advantages and disadvantages.

Speech Editing
Conventional speech editing methods (Derry, 2012; Whittaker and Amento, 2004) provide users with interfaces for cut, copy, paste, volume adjustment, time-stretching, pitch bending, de-noising, etc. Text-based speech editing systems (Jin et al., 2017, 2018) then allow the editor to perform select, cut, and paste operations in the text transcript of the speech and apply the changes to the waveform accordingly. However, they mainly face two problems. One is that the edited speech often sounds unnatural because the edited region does not match the prosody of the speech context (e.g., mismatches in intonation, stress, or rhythm) (Jin et al., 2017). The other is that the interfaces do not support synthesizing new words that do not appear in the transcript (Morrison et al., 2021). There is a series of studies on these problems. Jin et al. (2017) propose to insert a synthesized audio clip using a combination of a text-to-speech model and a voice conversion model (Sun et al., 2016), which leads to unnatural prosody near the boundaries of the edited regions. Tan et al. (2021) use a neural TTS model with auto-regressive partial inference to maintain a coherent prosody and speaking style. Most recently, mask prediction based methods (Wang et al., 2022; Bai et al., 2022) capture more contextual information from the input mel-spectrogram. Wang et al. (2022) propose to learn the relation between text and audio through cross-attention but suffer from an extremely slow convergence rate. Bai et al. (2022) introduce alignment embeddings into a Conformer-based (Gulati et al., 2020; Guo et al., 2021) backbone to improve speech quality. However, previous methods only focus on the modification of reading-style speech and are not stutter-oriented.

Diffusion Model
Basic knowledge of diffusion models Denoising diffusion probabilistic models (DDPMs) have achieved state-of-the-art performance in both image and audio synthesis (Dhariwal and Nichol, 2021; Kong et al., 2020b; Huang et al., 2022). DDPMs (Ho et al., 2020; Dhariwal and Nichol, 2021) are designed to learn a data distribution p(x) by gradually denoising a normally distributed variable through the reverse process of a fixed Markov chain of length T. Denote x_t as a noisy version of the clean input x_0. DDPMs choose to parameterize the denoising model θ by directly predicting ϵ with a neural network ϵ_θ. The corresponding objective can be simplified to
L_θ = E_{x_0, ϵ, t} [ ∥ϵ − ϵ_θ(x_t, t)∥² ],
with t uniformly sampled from {1, ..., T}.
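The simplified objective above can be sketched as a single training step. This is an illustrative NumPy sketch of the standard ε-prediction DDPM loss, not the authors' implementation; the function and argument names (`ddpm_training_loss`, `eps_model`, `alphas_cumprod`) are our own.

```python
import numpy as np

def ddpm_training_loss(eps_model, x0, alphas_cumprod, rng):
    """Simplified DDPM objective: sample t, diffuse x0, regress the injected noise."""
    T = len(alphas_cumprod)
    t = int(rng.integers(1, T + 1))                  # t ~ Uniform{1, ..., T}
    a_bar = alphas_cumprod[t - 1]                    # cumulative product of alphas
    eps = rng.standard_normal(x0.shape)              # noise used by the forward process
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps   # sample from q(x_t | x_0)
    eps_hat = eps_model(x_t, t)                      # network prediction eps_theta(x_t, t)
    return np.mean((eps_hat - eps) ** 2)             # simplified L2 objective
```

In practice this scalar would be minimized with stochastic gradient descent over random minibatches and random t.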

Applications of diffusion model in speech tasks
Applications of diffusion models in speech tasks mainly lie in speech synthesis. Diff-TTS (Jeong et al., 2021), Grad-TTS (Popov et al., 2021), and DiffSpeech (Liu et al., 2021) apply diffusion models to mel-spectrogram generation in TTS and achieve high sample quality, but their iterative reverse process makes sampling slow.

Alignment Modeling
Due to the modality gap between text and speech, alignment modeling is essential in text-based speech editing. There are three types of approaches to model the monotonic alignment between text and speech: 1) cross-attention: Wang et al. (2022) propose to learn the alignment information with the cross-attention module in the transformer decoder, which suffers from a slow convergence rate and is usually not robust; 2) alignment embedding: Bai et al. (2022) introduce alignment embeddings from external alignment tools into a self-attention based architecture to guide the alignment modeling; 3) length regulator (Ren et al., 2019; Tan et al., 2021): the length regulator expands text embeddings into frame-level embeddings according to the phoneme durations predicted by the duration predictor (Ren et al., 2019; Tan et al., 2021), which ensures hard alignments and is more robust than the above two methods. However, the duration predictor in Tan et al. (2021) does not consider the existing context durations. It only predicts the durations of the entire sentence from text representations and applies the durations of the edited words to the masked region, which results in unnatural prosody. Therefore, in FluentSpeech, we train the duration predictor with the mask prediction procedure to achieve a fluent duration transition at the edited region, which we call the masked duration predictor.

Context-Aware Spectrogram Denoiser
Context Conditioning As shown in Figure 1(c), in the context conditioning module, we adopt the frame-level text embedding e_t, acoustic feature sequence x, masked acoustic feature sequence x̃, speaker embedding e_spk, pitch embedding e_pitch, and stutter embedding e_stutter as the condition for our spectrogram denoiser. The phoneme embedding e_p is first expanded into the frame-level text embedding e_t by the length regulator with the duration information from the masked duration predictor. We add e_t to the context condition c. We also extract the speaker embeddings e_spk from audio samples using an open-source voice encoder (https://github.com/resemble-ai/Resemblyzer) and feed them into the context condition c following common practice (Min et al., 2021; Huang et al., 2022; Tan et al., 2021). Then we adopt a nonlinear feed-forward acoustic encoder to transform the speech features x and x̃ into the acoustic embeddings e_x and e_x̃ following Bai et al. (2022). The masked acoustic embedding e_x̃ is also added to the condition to provide more contextual information for mel-spectrogram reconstruction. Moreover, the masked pitch predictor utilizes e_t and the masked pitch embedding ê_pitch to predict the pitch F0 of each frame in the edited region. We further convert it into the pitch embedding vector and add it to the context condition c. To promote a natural transition at the edited boundaries, we train the duration predictor and pitch predictor with the mask prediction procedure:
L_d = ∥d − g_d(e_t)∥²₂, L_p = ∥p − g_p(e_t, ê_pitch)∥²₂,
where we use d and p to denote the target duration and pitch respectively, and g_d and g_p to denote the corresponding duration predictor and pitch predictor, which share the same architecture of 1D convolution with ReLU activation and layer normalization. The loss weights are all set to 0.1 and the reconstruction losses are also added to train the linguistic encoder. FluentSpeech follows generator-based diffusion models (Salimans and Ho, 2021; Liu et al., 2022; Huang et al., 2022) to significantly accelerate sampling from a complex
distribution. Specifically, in the generator-based diffusion models, p_θ(x_0|x_t) is the implicit distribution imposed by the neural network f_θ(x_t, t) that outputs x_0 given x_t. Then x_{t−1} is sampled using the posterior distribution q(x_{t−1}|x_t, x_0) given x_t and the predicted x_0. The training loss is defined as the mean absolute error (MAE) in the data x space:
L_MAE_θ = ∥x_0 − f_θ(x_t, t)∥₁,
and efficient training amounts to optimizing a random t term with stochastic gradient descent. Inspired by Ren et al. (2022), we also adopt a structural similarity index (SSIM) loss L_SSIM_θ in training to capture the structural information in the mel-spectrogram and improve the perceptual quality:
L_SSIM_θ = 1 − SSIM(f_θ(x_t, t), x_0).
The loss weights are all set to 0.5. Since the capability of our spectrogram denoiser is powerful enough, we do not adopt a convolutional Post-Net to refine the predicted spectrogram like previous works (Wang et al., 2022; Bai et al., 2022).
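The MAE and SSIM terms above can be illustrated in a few lines. This is a hedged sketch, not the paper's code: `ssim_global` computes a single-window SSIM as a simplified stand-in for the usual windowed SSIM, and only the 0.5 loss weights come from the text; all names are our own.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM; real SSIM averages over local windows."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def denoiser_loss(x0_pred, x0, lam_mae=0.5, lam_ssim=0.5):
    """Generator-based diffusion loss: MAE in data space plus an SSIM term."""
    l_mae = np.abs(x0_pred - x0).mean()       # L_MAE
    l_ssim = 1.0 - ssim_global(x0_pred, x0)   # L_SSIM
    return lam_mae * l_mae + lam_ssim * l_ssim
```

For a perfect prediction both terms vanish, so the loss is zero; any blur or shift in the predicted mel-spectrogram increases both terms.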

Stutter Predictor
The stutter predictor is introduced only when the speech corpus contains stuttering recordings. Stutters in the speech content introduce noise into the training pipeline because of the information gap between the text and the stuttering speech content. As shown in Figure 2, the stuttering word "to" in the speech content makes the speech editing model learn unintentional sounds in the pronunciation of the word "to". Therefore, we introduce the stutter embedding into the text hidden sequence to disentangle the stutter-related gradients from the speech content, which significantly improves the pronunciation robustness of our FluentSpeech. Let s = (s_1, ..., s_N) be a time-aligned stutter label that defines the stutter regions in the corresponding spontaneous speech, where s_i ∈ {0, 1} (0 for normal and 1 for stutter) for each frame (see Appendix C for further details about the stutter labels in our SASE dataset). In training, we feed the ground-truth stutter label into the hidden sequence to predict the target speech. At the same time, we use the ground-truth labels as targets to train the stutter predictor, which is used at inference to localize the stutter regions in the target speech.
The stutter predictor consists of 1) a 4-layer 1D conditional convolutional network with ReLU activation, each layer followed by layer normalization and dropout; 2) an extra linear layer and a softmax layer to predict the probability of the stutter tag. As shown in Figure 1(c), we propose a text-guided stutter predictor module, which takes the frame-level text embedding e_t and mel-spectrogram embedding e_x as input and seeks to locate the text-irrelevant stutter regions. The main objective function for stutter prediction is the binary cross-entropy loss L_BCE. The focal loss (Lin et al., 2017) L_Focal is also introduced since misclassification of fluent regions is tolerable while we want the stuttering regions to be accurately classified. α_0 and α_1 are set to 5e−3 and 1, and γ is set to 3.
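A per-frame class-weighted focal loss of this shape can be sketched as below. The function name and exact weighting scheme are our assumptions; only the hyperparameter values (α_0 = 5e−3, α_1 = 1, γ = 3) come from the text.

```python
import numpy as np

def stutter_focal_loss(p_stutter, labels, alpha0=5e-3, alpha1=1.0, gamma=3.0):
    """Class-weighted binary focal loss over frame-level stutter labels.

    p_stutter: predicted P(stutter) per frame; labels: 0 (fluent) or 1 (stutter).
    The tiny alpha0 makes misclassified fluent frames cheap, while the (1 - p_t)^gamma
    factor focuses training on hard, confidently wrong frames."""
    p_stutter = np.clip(p_stutter, 1e-7, 1 - 1e-7)
    p_t = np.where(labels == 1, p_stutter, 1 - p_stutter)  # prob of the true class
    alpha_t = np.where(labels == 1, alpha1, alpha0)        # per-class weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

With these weights, a missed stutter frame costs roughly 200x more than a false alarm on a fluent frame, matching the stated preference for accurate stutter classification.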

Training and Inference Procedures
Training The final training loss consists of the following terms: 1) the sample reconstruction loss L_MAE_θ; 2) the structural similarity index (SSIM) loss L_SSIM_θ; 3) the reconstruction losses for the pitch and duration predictors, L_p and L_d; 4) the classification losses for the stutter predictor, L_BCE and L_Focal. In the training stage, we randomly select 80% of the phoneme spans and mask their corresponding frames, since an 80% masking rate shows good performance on both seen and unseen cases. Then we add the stutter embedding to the context condition. The objective functions only take the masked region into consideration.
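The random phoneme-span masking described above might look like the following sketch, assuming per-phoneme frame counts from a forced aligner; `sample_masked_frames` and its interface are hypothetical, not the released code.

```python
import numpy as np

def sample_masked_frames(phoneme_durs, mask_rate=0.8, rng=None):
    """Randomly pick ~mask_rate of the phoneme spans and mask all of their frames.

    phoneme_durs: per-phoneme frame counts (e.g., from a forced aligner).
    Returns a boolean frame-level mask (True = masked / to be reconstructed)."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(phoneme_durs)
    k = max(1, int(round(mask_rate * n)))                 # number of spans to mask
    chosen = set(rng.choice(n, size=k, replace=False))    # which phoneme spans
    # expand the span-level choice to frames, one block per phoneme
    return np.concatenate(
        [np.full(d, i in chosen) for i, d in enumerate(phoneme_durs)]
    )
```

Masking whole phoneme spans (rather than independent frames) keeps the reconstruction targets aligned with linguistic units, which is what the span-based masking in the text implies.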
Inference for reading-style speech editing Given a speech spectrogram x, its original phonemes p, and the target phonemes p̃, denote the spectrogram region that needs to be modified as µ. When the speech recording is reading-style, we do not utilize the stutter predictor. We first use an external alignment tool to extract the spectrogram-to-phoneme alignments. x̃ is the spectrogram masked according to the region µ. FluentSpeech takes p̃, x, x̃, e_spk, ê_dur, and ê_pitch as inputs and generates the spectrogram of the masked region µ. Finally, we use a pre-trained vocoder to transform this spectrogram into the waveform.
Inference for stutter removal When the speech recording is spontaneous, the stutter predictor first predicts the stutter region µ′. Since the stutter region µ′ also influences the prosody (e.g., duration and pitch) of the neighboring words, we find all of the phoneme spans that overlap with or are adjacent to µ′ and denote them as µ̃. The spectrogram region that needs to be modified is then defined as µ = µ′ ∪ µ̃. To make the spontaneous speech fluent, the stutter embedding is not added to the hidden sequence. Following the masked spectrogram reconstruction process used for reading-style speech editing, FluentSpeech is able to perform automatic stutter removal.
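Merging the predicted stutter region µ′ with overlapping or adjacent phoneme spans can be sketched as a simple interval-expansion step. The helper below treats µ′ as a single frame interval and does a one-pass scan over time-sorted phoneme spans; the name `expand_edit_region` and the `adjacency` parameter are our assumptions about how "adjacent" might be operationalized, not the authors' exact rule.

```python
def expand_edit_region(stutter_region, phone_spans, adjacency=1):
    """Grow the predicted stutter interval mu' to cover every phoneme span that
    overlaps with, or lies within `adjacency` frames of, the current region.

    stutter_region: (start, end) frame interval mu'.
    phone_spans: time-sorted list of (start, end) frame intervals, one per phoneme.
    Returns the expanded interval mu = mu' union the absorbed phoneme spans."""
    lo, hi = stutter_region
    for ps, pe in phone_spans:
        # overlap-or-adjacency test against the (possibly already grown) region
        if ps <= hi + adjacency and pe >= lo - adjacency:
            lo, hi = min(lo, ps), max(hi, pe)
    return (lo, hi)
```

Expanding the edit region this way lets the denoiser re-generate the prosody of the words around the stutter instead of splicing at an abrupt boundary.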

Datasets
Reading-Style We evaluate FluentSpeech on two reading-style datasets: 1) VCTK (Yamagishi et al., 2019), an English speech corpus uttered by 110 English speakers with various accents; 2) LibriTTS (Zen et al., 2019), a large-scale multi-speaker English corpus of approximately 585 hours of speech. We evaluate the text-based speech editing performance of FluentSpeech and various baselines on these datasets.
Spontaneous We also evaluate FluentSpeech on the stutter-oriented automatic speech editing (SASE) dataset collected and annotated by us (see Appendix C for further details). The SASE dataset consists of approximately 40 hours of spontaneous speech recordings from 46 speakers with various accents. All the audio files are collected from online lectures and courses with accurate official transcripts. Each recording is sampled at 22050 Hz with 16-bit quantization. We evaluate the SASE performance of FluentSpeech and various baselines on this dataset.
For each of the three datasets, we randomly sample 400 samples for testing. We randomly choose 50 samples from the test set for subjective evaluations and use all testing samples for objective evaluations. The ground-truth mel-spectrograms are generated from the raw waveforms with a frame size of 1024 and a hop size of 256.

Experimental Setup
Model Configuration FluentSpeech consists of a linguistic encoder, an acoustic encoder, a masked variance adaptor, a spectrogram denoiser, and a stutter predictor. The linguistic and acoustic encoders consist of multiple feed-forward Transformer blocks (Ren et al., 2019) with relative position encoding (Shaw et al., 2018) following Glow-TTS (Kim et al., 2020). The hidden channel size is set to 256. In the spectrogram denoiser, we set N = 20 to stack 20 layers of convolution with kernel size 3, and we set the dilation factor to 1 (no dilation) at each layer following Huang et al. (2022). The number of diffusion steps T is set to 8. The stutter predictor is based on the non-causal WaveNet (Oord et al., 2016) architecture. More detailed information on the model configuration is attached in Appendix A.1.

Training and Evaluation
We train FluentSpeech with T = 8 diffusion steps. The FluentSpeech model is trained for 300,000 steps on 1 NVIDIA 3080 GPU with a batch size of 30 sentences. The Adam optimizer is used with β_1 = 0.9, β_2 = 0.98, ϵ = 10^−9. We utilize HiFi-GAN (Kong et al., 2020a) (V1) as the vocoder to synthesize waveforms from the generated mel-spectrograms in all our experiments. To measure perceptual quality, we conduct human evaluations with MOS (mean opinion score), CMOS (comparative mean opinion score), and average preference score on the testing set via Amazon Mechanical Turk (see Appendix A.3 for more details). We keep the text content and text modifications consistent among different models to exclude other interference factors, examining only the audio quality. We further measure objective evaluation metrics such as MCD (Kubichek, 1993), STOI (Taal et al., 2010), and PESQ (Rix et al., 2001). More information on evaluation is attached in Appendix A.4.

Results of Reading-Style Speech Editing
We compare the quality of audio samples generated by FluentSpeech with other baseline systems, including 1) EditSpeech (Tan et al., 2021); 2) CampNet (Wang et al., 2022); 3) A3T (Bai et al., 2022) (detailed descriptions can be found in Appendix A.2). For objective evaluation, we conduct a spectrogram reconstruction experiment to evaluate these systems. As shown in Table 1, FluentSpeech demonstrates superior performance on the MCD, PESQ, and STOI metrics.
For subjective evaluation, we manually define modification operations (i.e., insertion, replacement, and deletion) for 50 audio samples. We then conduct the experiments on the VCTK dataset. For each audio sample, we ask at least 10 English speakers to evaluate the generated audio's speech quality and speaker similarity. The results are presented in Table 2 and Table 3. For the seen case, each speaker's examples are split into train and test sets. For the unseen case, the test set contains 10 speakers' examples, and the other 99 speakers' examples are used for training, following Bai et al. (2022). FluentSpeech achieves the highest perceptual quality and speaker similarity in both seen and unseen settings compared to all baselines, which demonstrates the effectiveness of our proposed context-aware spectrogram denoiser.

Results of Stutter-Oriented Automatic Speech Editing
We evaluate the accuracy of FluentSpeech on the stutter localization task, and the results are shown in Table 4. Our FluentSpeech achieves 80.5% accuracy and 94.4% precision on the stutter localization task. We then compare the naturalness and fluency of the audio samples generated by FluentSpeech with the original spontaneous recordings. We conduct a subjective average preference score evaluation, where 50 sentences are randomly selected from the test set of our SASE dataset. The listeners are asked to judge which utterance in each pair has better naturalness (or fluency), or to express no preference, in the edited area. As shown in Figure 3, FluentSpeech achieves similar naturalness compared to the original audio. Moreover, the fluency of the speech generated by FluentSpeech is significantly improved, which further shows the effectiveness of our stutter-oriented automatic speech editing strategy.

Visualizations
As illustrated in Figure 4, we visualize the mel-spectrograms generated by FluentSpeech and the baseline systems. FluentSpeech generates mel-spectrograms with richer frequency details than the baselines, resulting in natural and expressive sounds. Moreover, when we substitute the masked duration predictor with the duration predictor utilized in Tan et al. (2021); Wang et al. (2022); Bai et al. (2022), an unnatural transition occurs at the left boundary of the edited region, which demonstrates the effectiveness of our proposed masked duration predictor.

Ablation Studies
We conduct ablation studies to demonstrate the effectiveness of several designs in FluentSpeech, including the stutter embedding and the masked predictors. We perform CMOS and MCD evaluations for these ablation studies. The results are shown in Table 5. We can see that CMOS drops
rapidly when we remove the stutter embedding, indicating that the noise introduced by the discrepancy in the text-speech pair greatly reduces the naturalness of the generated audio. Thus, the stutter embedding successfully improves the robustness of our FluentSpeech. Moreover, when we remove the MDP and MPP and use the DP following recent speech editing algorithms (Tan et al., 2021; Wang et al., 2022; Bai et al., 2022), the speech quality also drops significantly, demonstrating the effectiveness of our proposed masked predictors. It is worth mentioning that the pitch predictor without masked training also results in a performance drop in terms of voice quality.

Conclusion
In this work, we proposed FluentSpeech, a stutter-oriented automatic speech editing model for stutter removal. FluentSpeech adopts a context-aware spectrogram denoiser to generate high-quality and expressive speech with rich frequency details. To improve robustness against stuttering speech and perform automatic stutter removal, we propose a conditional stutter predictor that localizes the stutter region and injects the stutter embedding into the text hidden sequence to reduce the discrepancy between the text and the stuttering speech recording. We also contribute a novel stutter-oriented automatic speech editing dataset named SASE, which contains spontaneous speech recordings with time-aligned stutter labels. Experimental results demonstrate that FluentSpeech achieves state-of-the-art performance on speech editing for reading-style speech. Moreover, FluentSpeech is robust against stuttering speech and significantly improves the fluency of stuttering speech. To the best of our knowledge, FluentSpeech is the first stutter-oriented automatic speech editing model that solves the automatic stutter removal task. Our extensive ablation studies demonstrate that each design in FluentSpeech is effective. We hope that our work will serve as a basis for future stutter-oriented speech editing studies.

Limitations
We list the limitations of our work as follows.
Firstly, the model architecture we use to localize stuttering speech is simple. Future works could explore a more effective model to perform automatic stutter removal with the help of our SASE dataset. Secondly, we only test on English datasets; stutter-oriented speech editing for languages other than English and for multilingual settings remains future work. Finally, after being pre-trained on our SASE dataset, the stutter embedding in FluentSpeech could also be used to inject stutters into reading-style speech to change its speaking style, and we leave this for future work.

Ethics Statement
FluentSpeech improves the naturalness of edited speech and promotes the automatic removal of stutters from stuttered speech, which may cause unemployment for people with related occupations. Besides, the free manipulation of speech may bring potential social damage. Further efforts in automatic speaker verification should be made to lower the aforementioned risks. • For stutter removal evaluations, we perform average preference score tests for speech quality and fluency. For the speech quality AB test, each listener is asked to select their preferred audio according to audio quality. We tell listeners to answer "Which of the audio has better quality? Please focus on the audio quality and ignore other factors". For the speech fluency AB test, each listener is asked to select the audio they prefer according to audio fluency, and we tell listeners to answer "Which of the audio sounds more fluent? Please focus on speech fluency and ignore other factors. The stutter in the audio typically sounds like 'emm', 'uhhh', 'hmmm', or word repetition". The screenshots of instructions for stutter removal evaluations are shown in Figure 5(c) and Figure 5(d).

A.4 Details in Objective Evaluation
The effectiveness of our FluentSpeech is measured by MCD (Toda et al., 2007), STOI (Taal et al., 2011), and PESQ (Hu and Loizou, 2007). MCD is given by
MCD = (10 / ln 10) · sqrt( 2 · Σ_{l=1}^{L} (c_l − c′_l)² ),
where L is the order of the mel cepstrum, c_l and c′_l are the l-th mel-cepstral coefficients of the reference and generated speech, and L is 34 in our implementation.
The traditional PESQ measure is given by
PESQ = a_0 + a_1 · D_ind + a_2 · A_ind,
where a_0, a_1, a_2 are the parameters, D_ind represents the average disturbance value, and A_ind represents the average asymmetrical disturbance value. STOI is a function of a TF-dependent intermediate intelligibility measure, which compares the temporal envelopes of the clean and degraded speech in short-time regions by means of a correlation coefficient. The following vector notation is used to denote the short-time temporal envelope of the clean speech:
x_j(m) = [X_j(m − N + 1), X_j(m − N + 2), ..., X_j(m)]^T,
where N = 30, which equals an analysis length of 384 ms.
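Given mel-cepstral sequences for the reference and generated speech, the MCD above can be computed roughly as follows. This is a sketch under common conventions (coefficient 0, the energy term, is excluded, and frames are assumed pre-aligned, e.g. by DTW); the function name and interface are ours.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn, order=34):
    """Frame-averaged MCD in dB between reference and synthesized mel cepstra.

    mc_ref, mc_syn: arrays of shape (frames, order + 1), time-aligned.
    Coefficient 0 (energy) is conventionally excluded from the sum."""
    diff = mc_ref[:, 1:order + 1] - mc_syn[:, 1:order + 1]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))  # average over frames, in dB
```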

B Detailed analysis of duration and pitch
To further examine the detailed performance of our model, we evaluate the duration and pitch errors of FluentSpeech and the baseline models. For duration errors, the ground-truth durations are obtained from the Montreal Forced Aligner (MFA) (McAuliffe et al., 2017). We calculate the MSE of word-level durations for the duration predictor (DP) used in Tan et al. (2021); Bai et al. (2022) and the masked duration predictor (MDP) in FluentSpeech. The results on the VCTK dataset are shown in Table 7. The masked duration predictor predicts more accurate durations, demonstrating the effectiveness of masked prediction training. For pitch errors, we compare FluentSpeech with all other baseline models.
We first extract frame-level pitch information using parselmouth, then calculate the MSE of the mean pitch distance between the model-generated speech and the ground-truth speech. The results on the VCTK dataset are shown in Table 8. FluentSpeech achieves the lowest average pitch error. Moreover, the average pitch error of FluentSpeech with the masked pitch predictor (MPP) is significantly lower than that of FluentSpeech with the pitch predictor proposed in Ren et al. (2020), demonstrating the effectiveness of our masked pitch predictor.

C More details of SASE dataset
The SASE dataset consists of approximately 40 hours of spontaneous speech recordings from 46 speakers with various accents. The speech recordings are crawled from online lectures and courses with accurate official transcripts. Each recording is sampled at 22050 Hz with 16-bit quantization. We substitute the speakers' names with speaker IDs to protect their personal information, and the dataset can only be accessed for research purposes.
To obtain the time-aligned stutter labels, we recruit annotators from a crowdsourcing platform, Zhengshu Technology, to label the stuttering regions according to the audio and transcription. Specifically, a stuttering region may be 1) a stammer or repeated words, for instance, "I am go...go...going... out for a...a...a... trip"; 2) a filled pause (FP) such as "em, um, then, due to, uh" arising from the speaker's habit of speaking; 3) a sudden occasion such as a cough, voice crack, etc. The annotators are asked to mark the corresponding time boundaries and give the stuttering label as shown in Figure 6. We then use the given timestamps in the official transcriptions to cut the audio and text into fragments ranging from 7 to 10 seconds. Finally, we convert each text sequence into a phoneme sequence with an open-source grapheme-to-phoneme tool. The audio samples in our SASE dataset are available at https://speechai-demo.github.io/FluentSpeech/.

Figure 1 :
Figure 1: The overall architecture of FluentSpeech. In subfigure (c), the spectrogram denoiser θ takes the noisy spectrogram x_t as input and computes f_θ(x_t|t, c) conditioned on the diffusion time index t and the context information c. The sinusoidal-like symbol, FC, Swish, and • denote the positional encoding, fully-connected layer, Swish activation function (Ramachandran et al., 2017), and element-wise multiplication. LR denotes the Length Regulator proposed in FastSpeech (Ren et al., 2019). N is the number of residual blocks. The dashed black line denotes that the operation is only executed when the dataset contains spontaneous speech.

Figure 2 :
Figure 2: The illustration of the discrepancy between the given transcription and stuttering speech content.

Figure 3 :
Figure 3: Average preference score (%) evaluation of naturalness and fluency on the SASE dataset, where "Neutral" stands for "no preference".

Figure 4 :
Figure 4: Visualizations of the ground-truth mel-spectrograms and those generated by different speech editing models. The original text is "We didn't enjoy the first game, but today they were excellent". In subfigures (b,c,d,e,f), the portion in the red box is "the first game", which is masked and reconstructed. MDP denotes the masked duration predictor.
The screenshots of instructions for the speech editing tests are shown in Figure 5(a) and Figure 5(b).
(a) Screenshot of MOS testing for speech quality in the speech editing evaluation.(b) Screenshot of MOS testing for speaker similarity in the speech editing evaluation.(c) Screenshot of average preference score testing for speech quality in the stutter removal evaluation.(d) Screenshot of average preference score testing for speech fluency in the stutter removal evaluation.

Table 1 :
The objective audio quality comparisons. We only measure the MCD, STOI, and PESQ of the masked region. MCD and PESQ indicate speech quality, and STOI reflects speech intelligibility.

Table 2 :
The MOS evaluation (↑) for speech quality on speech editing task on the VCTK dataset with 95% confidence intervals.

Table 3 :
The MOS evaluation (↑) for speaker similarity on speech editing task on the VCTK dataset with 95% confidence intervals.

Table 4 :
The stutter localization evaluation (↑) on the SASE dataset. Accuracy (%) denotes the overall accuracy; Precision (%) indicates the proportion of correctly classified stutter regions.

Table 5 :
Audio quality comparisons on the SASE dataset for the ablation study. MDP denotes the masked duration predictor; MPP denotes the masked pitch predictor; DP denotes the duration predictor used in Tan et al. (2021); Bai et al. (2022); PP denotes the pitch predictor proposed in Ren et al. (2020).

Table 7 :
Average duration error comparisons on the VCTK dataset.DP denotes the duration predictor used in Tan et al. (2021); Bai et al. (2022) and MDP denotes the masked duration predictor in our FluentSpeech.

Table 8 :
Average pitch error comparisons on the VCTK dataset. PP denotes the pitch predictor proposed in Ren et al. (2020) and MPP denotes the masked pitch predictor in our FluentSpeech.