FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis

Generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) have recently achieved impressive performance in image and audio synthesis. After revisiting their success in conditional speech synthesis, we find that 1) GANs sacrifice sample diversity for quality and speed, and 2) diffusion models exhibit superior sample quality and diversity at a high computational cost; achieving high-quality, fast, and diverse speech synthesis thus remains a challenge for all neural synthesizers. In this work, we propose to converge the advantages of GANs and diffusion models by incorporating both classes, introducing dual-empowered modeling perspectives: 1) FastDiff 2 (DiffGAN), a diffusion model whose denoising process is parametrized by conditional GANs, where the non-Gaussian denoising distribution makes it much more stable to implement the reverse process with large step sizes; and 2) FastDiff 2 (GANDiff), a generative adversarial network whose forward process is constructed by multiple denoising diffusion iterations, which exhibits better sample diversity than traditional GANs. Experimental results show that both variants enjoy an efficient 4-step sampling process and demonstrate superior sample quality and diversity.


Introduction
Speech synthesis has seen extraordinary progress with the recent development of deep generative models in machine learning (Lv et al., 2023b; Ye et al., 2023b; Zhang et al., 2021, 2022c; Li et al., 2023). Previous models (Oord et al., 2016; Kalchbrenner et al., 2018) generate waveforms autoregressively from mel-spectrograms yet suffer from slow inference speed. Non-autoregressive methods (Huang et al., 2022c, 2023a; Ye et al., 2023a; Jiang et al., 2021) have been designed to address this issue; they generate samples extremely fast and achieve voice quality comparable to autoregressive models.
Among them, generative adversarial networks (GANs) (Creswell et al., 2018; Mao et al., 2019; Jiang et al., 2022) and denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020; Song et al., 2020) are two popular classes of deep generative models that have demonstrated surprisingly good results and dominated speech synthesis: Jang et al. (2021) utilize local-variable convolution to capture different waveform intervals with adversarial learning. Kong et al. (2020a) propose multi-receptive field fusion (MRF) to model periodic patterns. Kong et al. (2020b) introduce a time-aware WaveNet for conditional diffusion modeling. Huang et al. (2022b) and Lam et al. (2022) utilize a noise predictor to learn a tight inference schedule for skipping denoising steps.
Despite their success in high-fidelity generation, few studies have compared these two classes of deep generative models in conditional speech synthesis. In this work, we conduct a comprehensive study to revisit GANs and diffusion models, and empirically demonstrate that: 1) GANs tend to generate high-quality speech but do not cover the whole distribution, sacrificing sample diversity for quality and speed; and 2) diffusion models exhibit superior sample quality and diversity, but they typically require a large number of iterative refinements. Simultaneously achieving high-quality and diverse speech synthesis at a low computational cost has thus become an open problem for all neural synthesizers.
In this work, we converge the advantages of both classes by incorporating GANs and diffusion models, introducing dual-empowered modeling perspectives for high-fidelity speech synthesis: 1) FastDiff 2 (DiffGAN): a diffusion model whose denoising process is parametrized by conditional GANs, where the non-Gaussian denoising distribution makes it much more stable to implement the reverse process with large step sizes; and 2) FastDiff 2 (GANDiff): a generative adversarial network whose forward process is constructed by multiple denoising diffusion iterations, which exhibits better sample diversity than traditional GANs. Experimental results show that both variants enjoy an efficient 4-step sampling process and demonstrate superior sample quality and diversity. Moreover, we show that both variants generalize well to the mel-spectrogram inversion of unseen speakers.
The main contributions of this work are summarized as follows:
• We revisit two popular classes of deep generative models (diffusion models and GANs) in conditional speech synthesis, introducing dual-empowered modeling perspectives to converge the advantages of both.
• FastDiff 2 (DiffGAN) removes the common assumption of a Gaussian denoising distribution and utilizes conditional GANs to parametrize the multimodal denoising distribution, implementing the reverse process with large step sizes more stably.
• FastDiff 2 (GANDiff) breaks the one-shot forward process of conditional GANs into several denoising diffusion steps, each of which is relatively simple to model, and thus exhibits better sample diversity than traditional GANs.
• Experimental results show that both variants enjoy an efficient 4-step sampling process, providing a principled way to achieve high-fidelity and diverse speech synthesis at a low computational cost.

Background on Speech Synthesis
With the development of deep generative models (Ye et al., 2023b; Lv et al., 2023a, 2022; Zhang et al., 2022a,b), speech synthesis technology has made rapid progress. Most models (Wang et al., 2017; Ren et al., 2019; Huang et al.; Cui et al., 2021; Huang et al., 2023b; Ye et al., 2022) first convert the input text or phoneme sequence into a mel-spectrogram, and then transform it into a waveform using a separately trained vocoder (Kumar et al., 2019; Kong et al., 2020a; Huang et al., 2022a). In this work, we focus on designing the second-stage model that efficiently synthesizes high-fidelity waveforms from mel-spectrograms.
Neural vocoders require diverse receptive field patterns to capture audio dependencies, and thus previous models (Oord et al., 2016; Kalchbrenner et al., 2018) generate waveforms autoregressively from mel-spectrograms yet suffer from slow inference speed. In recent years, non-autoregressive methods (Prenger et al., 2019; Kumar et al., 2019; Kong et al., 2020b) have been designed to address this issue; they generate samples extremely fast while achieving voice quality comparable to autoregressive models. Below we introduce the two popular classes of deep generative models (diffusion models and GANs) for conditional speech synthesis:

Generative Adversarial Networks
Generative adversarial networks (GANs) (Kumar et al., 2019; Huang et al., 2021) are one of the most dominant non-autoregressive model classes in speech synthesis. Morrison et al. (2021) propose a chunked autoregressive GAN for conditional waveform synthesis, Lee et al. (2022) utilize large-scale pretraining to improve out-of-distribution quality, and Bak et al. (2022) investigate GAN-based neural vocoders and propose an artifact-free GAN-based neural vocoder.
The generator G aims to transform noise z into G(z) that mimics real data, while the discriminator D learns to distinguish the generated samples G(z) from real ones. GANs jointly train a powerful generator G and discriminator D with a min-max game:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big].$$

However, GAN-based models are often difficult to train, collapsing (Creswell et al., 2018) without carefully selected hyperparameters and regularizers, and showing less sample diversity.
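As a minimal PyTorch sketch of this min-max game (the modules `G` and `D`, the latent size, and the non-saturating loss form are illustrative assumptions, not this paper's setup):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, z_dim=128):
    """One update of the min-max GAN game (non-saturating generator loss)."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = G(z)

    # Discriminator step: classify real samples as 1 and generated ones as 0.
    d_real, d_fake = D(real), D(fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into predicting 1 on fakes.
    d_fake = D(fake)
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```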

Diffusion Probabilistic Models
Denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) are likelihood-based generative models that have recently advanced the state-of-the-art results in most image and audio synthesis tasks. Denoting the data distribution as q(x_0), the diffusion process is defined by a fixed Markov chain from data x_0 to the latent variable x_T, which gradually adds noise to the data in T steps with a pre-defined noise schedule β_t:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big).$$

The reverse process recovers samples from Gaussian noise and is parameterized by shared parameters θ; a guarantee of high sample diversity typically comes at the cost of hundreds of denoising steps:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I\big).$$

It has been demonstrated that diffusion probabilistic models (Dhariwal and Nichol, 2021; Xiao et al., 2021) can learn diverse data distributions in multiple domains, such as images and time series. However, an apparent degradation can be witnessed when reducing the number of reverse iterations, making acceleration challenging.

Table 1: Comparison of GANs and diffusion models for speech synthesis. We crowd-source 5-scale MOS tests via Amazon Mechanical Turk, recorded with 95% confidence intervals (CI). We implement the real-time factor (RTF) assessment on a single NVIDIA V100 GPU.
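For illustration, a minimal sketch of the forward process using the standard closed form q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t)I) with ᾱ_t = ∏_{s≤t}(1 − β_s); the linear schedule and its endpoints are assumptions, not the schedule used in this paper:

```python
import torch

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Pre-defined noise schedule beta_t and its cumulative products alpha_bar_t."""
    betas = torch.linspace(beta_min, beta_max, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars):
    """Sample x_t ~ q(x_t | x_0) in closed form: one jump instead of t small steps."""
    noise = torch.randn_like(x0)
    a = alpha_bars[t].sqrt().view(-1, *([1] * (x0.dim() - 1)))
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, *([1] * (x0.dim() - 1)))
    return a * x0 + s * noise, noise
```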

Preliminary Study
In image generation, superior sample diversity (Dhariwal and Nichol, 2021; Ho et al., 2020; Song et al., 2020) is a crucial reason why diffusion models produce high-quality samples even on challenging datasets. Due to their distinctive advantages in diversity and distribution coverage over GANs, diffusion models have been shown to generate realistic and vivid images, achieving the current state of the art as measured by FID.
Despite the comprehensive studies of GANs and diffusion models for image generation, few have compared these two classes of deep generative models in speech synthesis, where the audio signal differs (Oord et al., 2016; Kalchbrenner et al., 2018) in its long-term dependencies, high sampling rate, and strong conditioning. In this section, we provide an empirical study that investigates the characteristics of both classes with comparable model capacity in speech. Specifically, we evaluate the performance (including sample quality, speed, and diversity) and explore how distribution coverage impacts sample quality as judged by auditory sensation.

Experimental Setup
We prepare 20 unseen samples from the benchmark LJSpeech dataset (Ito and Johnson, 2017) for evaluation. For a fair comparison, we implement the GAN and the diffusion model with a shared backbone (Huang et al., 2022b), which comprises three Diffusion-UBlocks and DBlocks with up/downsampling rates of [8, 8, 4]. Following common practice (Kumar et al., 2019; Yamamoto et al., 2020), we remove the time embedding in the GAN and introduce an auxiliary multi-resolution STFT loss to stabilize adversarial learning. More information is given in Appendix D.1.

Visualization
We further visualize the marginal distributions P(x | ph) of diffusion models and GANs in Figure 1. Specifically, we 1) randomly sample 100 latent noises z for each testing audio, obtaining 2,000 utterances in total; 2) split the generated utterances into phoneme-level samples according to the boundaries obtained by forced alignment (McAuliffe et al., 2017) and transform them into linear spectrograms; and 3) compute the histograms and smooth them into probability density functions with kernel density estimation for better visualization.
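A minimal sketch of step 3), assuming the phoneme-level spectrogram magnitudes are already collected in a 1-D array `values` (`scipy.stats.gaussian_kde` is one common choice; the binning and smoothing shown here are illustrative, not this paper's exact procedure):

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_marginal(values, label, grid_points=200):
    """Smooth the histogram of magnitudes into a density curve via Gaussian KDE."""
    kde = gaussian_kde(values)                 # kernel density estimate
    xs = np.linspace(values.min(), values.max(), grid_points)
    plt.plot(xs, kde(xs), label=label)

# Hypothetical usage: one array of phoneme-level magnitudes per model.
# plot_marginal(diffusion_vals, "Diffusion"); plot_marginal(gan_vals, "GAN")
# plt.legend(); plt.show()
```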

Analyses
Based on the evaluation results presented in Table 1 and the marginal distributions illustrated in Figure 1, we make the following observations: Diffusion models demonstrate better sample diversity at the cost of slow inference speed. A more diverse data distribution can be observed in samples generated by diffusion models, demonstrating better mode convergence. Diffusion models are better at data sharpness, diversity, and matching the marginal label distribution of the training data. However, sampling from diffusion models often requires thousands of network iterations, which is significantly slower than GANs and makes their application expensive in practice.
GANs trade off diversity for quality and speed. A distinct degradation of mode convergence can be witnessed in GANs, which tend to produce high-quality samples without covering the whole distribution, indicating a collapsed distribution and less sample diversity. To conclude, GANs sacrifice diversity for quality and speed, while the constrained distribution does not hinder their ability to generate high-fidelity samples. Compared to diffusion models, GANs enjoy high-quality speech synthesis with a minor gap of 0.08 in MOS, and even outperform them in the MCD evaluation. Regarding inference speed, GANs enjoy an efficient one-shot sampling process, significantly reducing inference time compared with competing diffusion mechanisms.

Methods
After revisiting GANs and diffusion models for speech synthesis, we observe that 1) GANs sacrifice sample diversity for better quality and speed, producing high-quality samples without covering the whole distribution, and 2) diffusion models exhibit superior sample quality and diversity but require iterative refinement at a high computational cost. In this section, we aim to converge the advantages of both classes, introducing dual-empowered modeling perspectives for high-fidelity, fast, and diverse speech synthesis.

Overview
This section presents our proposed models dually empowered by GANs and diffusion: 1) FastDiff 2 (DiffGAN): a diffusion model whose denoising process is parametrized by conditional GANs, so the non-Gaussian denoising distribution makes it much more stable to implement the reverse process with large step sizes; and 2) FastDiff 2 (GANDiff): a generative adversarial network whose forward process is constructed by multiple denoising diffusion iterations, thus exhibiting better sample diversity than traditional GANs.

Diffusion Mechanism Leveraging GAN
Diffusion models commonly assume that the denoising distribution can be approximated by a Gaussian. However, the Gaussian assumption holds only in the infinitesimal limit of small denoising steps, which requires numerous steps in the reverse process. As a result, reducing the number of iterative steps always causes a distinct degradation in perceptual quality.
In this work, we propose FastDiff 2 (DiffGAN), which leverages conditional GANs to model the denoising distribution q(x_{t−1} | x_t), so the non-Gaussian multimodal distribution makes it much more stable to implement the reverse process with large step sizes. Specifically, our forward diffusion process is set up under the main assumption that the number of diffusion iterations is small (T = 4).
The training is formulated as matching the conditional GAN generator p_θ(x_{t−1} | x_t) to the true denoising distribution q(x_{t−1} | x_t) with an adversarial loss that minimizes a divergence D_adv per denoising step. The discriminator D_ϕ(x_{t−1}, x_t, t) is designed to be diffusion-step-dependent, supervising the generator to produce high-fidelity speech samples. The min-max objective can be expressed as:

$$\min_\theta \max_\phi \; \sum_{t \geq 1} \mathbb{E}_{q(x_t)} \Big[ D_{adv}\big( q(x_{t-1} \mid x_t) \,\big\|\, p_\theta(x_{t-1} \mid x_t) \big) \Big],$$

where D_adv depends on the adversarial training setup and is estimated with the step-dependent discriminator D_ϕ: the fake samples from p_θ(x_{t−1} | x_t) are contrasted against the real ones from q(x_{t−1} | x_t).
Reparameterization of the diffusion model. Unlike conventional diffusion models that require hundreds of steps with small β_t to estimate the gradient of the data density, recent works (Salimans and Ho, 2022; Liu et al., 2022) have observed that approximating surrogate variables, e.g., the noiseless target data, gives better quality. We therefore reparameterize the denoising model to directly predict the clean data x_0. Free from estimating the gradient of the data density, it only needs to predict the unperturbed x_0 and then add perturbation with the posterior distribution q(x_{t−1} | x_t, x_0) (formulated in Appendix B), so the reverse transition distribution can be expressed as:

$$p_\theta(x_{t-1} \mid x_t) := q\big(x_{t-1} \mid x_t, \, \hat{x}_0 = G_\theta(x_t, z, t)\big), \qquad z \sim \mathcal{N}(0, I),$$

where G_θ denotes the conditional GAN generator predicting the clean sample.
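A minimal sketch of one such reverse step, assuming the standard posterior mean and variance of q(x_{t−1} | x_t, x_0) from Ho et al. (2020) and a placeholder generator signature g(x_t, z, t, c); this builds on the schedule helpers sketched earlier, not the paper's exact implementation:

```python
import torch

def reverse_step(g, x_t, t, c, betas, alpha_bars):
    """One denoising step: predict clean x0, then sample from q(x_{t-1} | x_t, x0)."""
    z = torch.randn_like(x_t)                  # GAN latent enabling multimodality
    x0_hat = g(x_t, z, t, c)                   # generator predicts the clean data
    if t == 0:
        return x0_hat                          # final step outputs the sample itself
    a_bar_t, a_bar_prev = alpha_bars[t], alpha_bars[t - 1]
    alpha_t = 1.0 - betas[t]
    # Posterior q(x_{t-1} | x_t, x0) of Ho et al. (2020): Gaussian with known moments.
    mean = (a_bar_prev.sqrt() * betas[t] / (1 - a_bar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    var = (1 - a_bar_prev) / (1 - a_bar_t) * betas[t]
    return mean + var.sqrt() * torch.randn_like(x_t)
```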

GAN Leveraging Diffusion Mechanism
GAN-based models are often difficult to train, collapsing (Mao et al., 2019) without carefully selected hyperparameters and regularizers, and showing less sample diversity. Besides, these models show distinct degradation in training stability, since the complex data distribution is difficult to fit in a single shot.
In this work, we propose FastDiff 2 (GANDiff), which leverages the diffusion mechanism to construct the forward process from multiple denoising iterations, and thus we expect it to exhibit better training stability and sample diversity than traditional one-shot GANs. To be more specific, we 1) initialize the generator G with a pre-trained diffusion teacher; 2) conduct 4-step denoising to generate x̂_0 with gradients retained, which is regarded as the forward process of the generator; and finally 3) let G play an adversarial game with the discriminator D, whose min-max objective can be expressed as:

$$\min_G \max_D \; \mathbb{E}_{x_0 \sim q(x_0)}\big[\log D(x_0)\big] + \mathbb{E}_{x_T \sim \mathcal{N}(0, I)}\big[\log\big(1 - D(\hat{x}_0)\big)\big],$$

where x̂_0 denotes the output of the 4-step denoising forward pass. We empirically find that initialization from the diffusion teacher provides a better understanding of noise schedules and reduces the difficulty of adversarial learning by orders of magnitude. FastDiff 2 (GANDiff) breaks the forward process of a one-shot conditional GAN into several denoising diffusion iterations, each of which is relatively simple to model. Thus, it exhibits better sample diversity than traditional one-shot GANs.
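Reusing the `reverse_step` sketch from the previous subsection, the generator's forward pass can be outlined as follows (shapes, conditioning, and schedule handling are simplified assumptions):

```python
import torch

def gandiff_forward(g, c, shape, betas, alpha_bars, T=4):
    """Generator forward: T-step denoising from Gaussian noise, gradients kept."""
    x = torch.randn(shape)                 # start from x_T ~ N(0, I)
    for t in reversed(range(T)):           # t = 3, 2, 1, 0
        x = reverse_step(g, x, t, c, betas, alpha_bars)
    return x                               # \hat{x}_0, fed to the discriminator D
```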

Architecture
As illustrated in Figure 2(a), we take a stack of time-aware location-variable convolutions (Huang et al., 2022b) as a shared backbone to efficiently model long-term time dependencies with adaptive conditions. The convolution is conditioned on dynamic variations in speech (diffusion steps and spectrogram fluctuations), which equips the model with diverse receptive field patterns and promotes robustness.
We build the basic architecture of the discriminator upon WaveNet (Oord et al., 2016). It consists of ten layers of non-causal dilated 1-D convolutions with weight normalization. The discriminator is trained to classify generated samples as fake while classifying the ground truth as real. More details are given in Appendix C.

Loss Objective
Adversarial GAN Objective. For the generator and discriminator, the training objectives follow Mao et al. (2017), replacing the binary cross-entropy terms of the original GAN objectives (Goodfellow et al., 2014) with least-squares loss functions for non-vanishing gradient flows.
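A minimal sketch of these least-squares terms (targets 1 for real and 0 for fake, as in Mao et al. (2017)):

```python
def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real outputs to 1, fake outputs to 0."""
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push fake outputs toward 1."""
    return ((d_fake - 1.0) ** 2).mean()
```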

Frequency-domain Reconstruction Objective.
To stabilize adversarial learning, we include a frequency-domain sample reconstruction loss by applying the multi-resolution STFT (Short-Time Fourier Transform) operation STFT(·) (given in Appendix F), which combines spectral convergence and log-magnitude terms across M resolutions:

$$\mathcal{L}_{rec} = \frac{1}{M} \sum_{m=1}^{M} \Big( \mathcal{L}_{sc}^{(m)}\big(x_0, \hat{x}_0\big) + \mathcal{L}_{mag}^{(m)}\big(x_0, \hat{x}_0\big) \Big),$$

where L_sc and L_mag denote the spectral convergence and log-magnitude losses computed from STFT(·) at the m-th resolution, respectively.
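A minimal sketch under the common spectral-convergence plus log-magnitude formulation (Yamamoto et al., 2020); the three (FFT, hop, window) resolutions below are illustrative, not necessarily those of Appendix F:

```python
import torch

def stft_loss(x, x_hat, fft, hop, win):
    """Spectral convergence + log-magnitude loss at one STFT resolution."""
    window = torch.hann_window(win, device=x.device)
    S = torch.stft(x, fft, hop, win, window=window, return_complex=True).abs().clamp(min=1e-7)
    S_hat = torch.stft(x_hat, fft, hop, win, window=window, return_complex=True).abs().clamp(min=1e-7)
    sc = torch.norm(S - S_hat, p="fro") / torch.norm(S, p="fro")
    mag = torch.nn.functional.l1_loss(S.log(), S_hat.log())
    return sc + mag

def multi_resolution_stft_loss(x, x_hat,
                               resolutions=((1024, 256, 1024),
                                            (2048, 512, 2048),
                                            (512, 128, 512))):
    """Average the STFT loss over several (fft, hop, win) configurations."""
    return sum(stft_loss(x, x_hat, *r) for r in resolutions) / len(resolutions)
```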

Training Algorithm
The training procedures of the proposed FastDiff 2 (DiffGAN) and FastDiff 2 (GANDiff) follow the steps described above; a sketch of one FastDiff 2 (DiffGAN) training iteration is given below.
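Below is a hedged sketch of one FastDiff 2 (DiffGAN) training iteration, combining the helper sketches above (`q_sample`, `reverse_step`, and the least-squares losses); the pairing of real and fake transitions and the omitted frequency-domain term are simplifications, not the paper's exact algorithm:

```python
import torch

def diffgan_train_step(g, d, x0, c, betas, alpha_bars, opt_g, opt_d, T=4):
    """One DiffGAN iteration: match p_theta(x_{t-1}|x_t) to q(x_{t-1}|x_t)."""
    t = torch.randint(1, T, (1,)).item()                # random denoising step
    # Real pair: x_{t-1} ~ q(x_{t-1}|x_0), then one forward step to obtain x_t.
    real_prev, _ = q_sample(x0, torch.tensor([t - 1]), alpha_bars)
    x_t = (1 - betas[t]).sqrt() * real_prev + betas[t].sqrt() * torch.randn_like(real_prev)

    # Fake pair: predict clean x0 with the GAN, then posterior-sample x_{t-1}.
    fake_prev = reverse_step(g, x_t, t, c, betas, alpha_bars)

    # Diffusion-step-dependent discriminator D_phi(x_{t-1}, x_t, t).
    d_loss = lsgan_d_loss(d(real_prev, x_t, t), d(fake_prev.detach(), x_t, t))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    g_loss = lsgan_g_loss(d(fake_prev, x_t, t))         # adversarial term only
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```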

Related Work

Diffusion Probabilistic Model
The diffusion probabilistic model is a family of generative models capable of learning complex data distributions, and it has recently attracted much research attention in several important domains. Diffusion models generate high-fidelity samples yet inherently suffer from slow sampling speed, and thus multiple methods have been investigated to accelerate the sampling process: Chen et al. (2020) utilize a grid-search algorithm for a shorter inference schedule. Liu et al. (2021) introduce a shallow diffusion mechanism that starts denoising at a particular distribution instead of Gaussian white noise. Huang et al. (2022b) and Lam et al. (2022) utilize a noise predictor to learn a tight inference schedule for skipping denoising steps. Their designs make diffusion models more applicable to real-world deployment, while the diffusion/denoising mismatch leads to quality degradation when skipping sampling steps. In this work, we avoid this mismatch by incorporating GANs into diffusion models, which makes it much more stable to implement the reverse process with large step sizes.

Generative Adversarial Network
Generative adversarial networks (GANs) (Jang et al., 2021; Kong et al., 2020a) are one of the most dominant deep generative model classes for speech generation. UnivNet (Jang et al., 2021) has demonstrated success in capturing different waveform intervals with local-variable convolution. HiFi-GAN (Kong et al., 2020a) proposes multi-receptive field fusion (MRF) to model periodic patterns. However, GAN-based models are often difficult to train, collapsing (Creswell et al., 2018) without carefully selected hyperparameters and regularizers, and showing less sample diversity. In contrast, we incorporate diffusion models into GANs and break the generation process into several conditional denoising steps, each of which is relatively simple to model. Thus, we expect our model to exhibit better sample diversity.

Experiments

Dataset

We use the LJSpeech dataset (Ito and Johnson, 2017), which consists of 13,100 audio clips at 22,050 Hz from a female speaker, totaling about 24 hours. To evaluate the model's generalization over unseen speakers in multi-speaker scenarios, we prepare the VCTK dataset (Yamagishi et al., 2019), downsampled to 22,050 Hz to match the sampling rate of LJSpeech. VCTK consists of approximately 44,200 audio clips uttered by 109 native English speakers with various accents. Following common practice, we conduct preprocessing and extract spectrograms with an FFT size of 1024, hop size of 256, and window size of 1024 samples.

Model Configurations
FastDiff 2 (DiffGAN) and FastDiff 2 (GANDiff) share the same backbone comprising three Diffusion-UBlocks and DBlocks with up/downsampling rates of [8, 8, 4], respectively. The discriminator consists of ten layers of non-causal dilated 1-D convolutions, whose strides increase linearly from one to eight except for the first and last layers. Channels and kernel sizes are set to 64 and 5, respectively. Both variants share the same number of denoising steps (T = 4) in both training and inference. The multi-resolution STFT loss is computed as the sum of three different STFT losses described in Appendix F.

Training and Evaluation
Both models are trained with a constant learning rate of lr = 2 × 10^−4 on 4 NVIDIA V100 GPUs. We use random short audio clips of 25,600 samples from each utterance with a batch size of 16 per GPU. We crowd-source 5-scale MOS tests via Amazon Mechanical Turk to evaluate audio quality. The MOS scores are recorded with 95% confidence intervals (CI). Raters listen to the test samples in random order and are allowed to evaluate each audio sample once. We adopt additional objective evaluation metrics, including STOI (Taal et al., 2010) and PESQ (Rix et al., 2001) for sample quality, and NDB and JS (Richardson and Weiss, 2018) for sample diversity. To evaluate inference speed, we implement the real-time factor (RTF) assessment on a single NVIDIA V100 GPU. More information about objective and subjective evaluation is given in Appendix E.

Comparison With Other Models
We compare our proposed models in audio quality and sampling speed with competing models, including 1) WaveNet (Oord et al., 2016), the autoregressive generative model for raw audio; 2) WaveGlow (Prenger et al., 2019), the parallel flow-based model; 3) HiFi-GAN V1 (Kong et al., 2020a) and UnivNet (Jang et al., 2021), the most popular GAN-based models; and 4) DiffWave (Kong et al., 2020b), WaveGrad (Chen et al., 2020), and FastDiff (Huang et al., 2022b), three diffusion probabilistic models that generate high-fidelity speech samples. The results are compiled in Table 2, and we have the following observations: For our GAN-empowered diffusion model, FastDiff 2 (DiffGAN) achieves the highest MOS among the compared models, with a gap of only 0.16 to the ground-truth audio. Regarding inference speed, it enjoys an efficient 4-step sampling process and runs 58x faster than real time on a single NVIDIA V100 GPU without engineered kernels. FastDiff 2 (DiffGAN) provides a principled way to accelerate DDPMs in both training and inference, avoiding the quality degradation caused by the training-inference mismatch in baseline diffusion models (FastDiff, WaveGrad, DiffWave). It is worth mentioning that FastDiff 2 (DiffGAN) maintains the superior sample diversity inherent in DDPMs.
For our diffusion-empowered GAN, FastDiff 2 (GANDiff) also demonstrates high-quality speech synthesis with a MOS of 4.12. For objective evaluation, it further sets new state-of-the-art results in PESQ and STOI, superior to all baseline models. Moreover, it achieves better NDB and JS scores than baseline GAN models. It breaks the generation process into several conditional denoising diffusion steps, each of which is relatively simple to model; thus, we expect our model to exhibit better mode coverage and sample diversity than traditional GANs (HiFi-GAN, UnivNet).
To conclude, by incorporating GAN and diffusion models, the dual-empowered speech models converge advantages from both classes and achieve high-quality and diverse speech synthesis at a low computational cost.

Ablation Study
We conduct ablation studies to demonstrate the effectiveness of several designs, including the diffusion reparameterization and the frequency-domain objective, in the dual-empowered speech models. The subjective and objective evaluation results are presented in Table 3, and we have the following observations: 1) Replacing the diffusion reparameterization and instead parameterizing the denoising model to predict the Gaussian noise ϵ leads to a distinct degradation in perceptual quality; FastDiff 2 (DiffGAN) directly predicts the clean data to avoid significant degradation when reducing reverse iterations. 2) Removing the sample reconstruction loss results in blurry predictions with distinct artifacts (Kumar et al., 2019) in both variants, demonstrating the effectiveness of the multi-resolution STFT regularization in stabilizing adversarial learning and improving the quality of generated waveforms, as reflected by a MOS gain.

Generalization To Unseen Speakers
We use 40 randomly selected utterances from 5 unseen speakers in the VCTK dataset, held out from training, for out-of-distribution testing. Table 4 shows the experimental results for the mel-spectrogram inversion of samples from unseen speakers: both variants produce high-fidelity samples and outperform the baseline models, generating audio with strong robustness for entirely new speakers outside the training set.

Conclusion
In this work, through revisiting two popular classes of deep generative models (diffusion models and GANs), we observed that 1) GANs tend to generate high-quality samples without covering the whole distribution, and 2) diffusion models exhibit superior sample quality and diversity while requiring iterative refinement at a high computational cost.
To achieve high-quality, fast, and diverse speech synthesis, we converged the advantages of GANs and diffusion models by incorporating both, introducing dual-empowered modeling perspectives: 1) FastDiff 2 (DiffGAN), a diffusion model whose denoising process is parametrized by conditional GANs, where the non-Gaussian denoising distribution makes it much more stable to implement the reverse process with large step sizes; and 2) FastDiff 2 (GANDiff), a generative adversarial network whose forward process is constructed by multiple denoising diffusion iterations, which exhibits better mode coverage and sample diversity. Experimental results showed that both variants enjoy an efficient 4-step sampling process and demonstrate superior sample quality and diversity. We envisage that our work will serve as a basis for future speech synthesis studies.

Limitations and Potential Risks
Adversarial learning still requires a careful selection of hyperparameters; otherwise, the training procedure can be unstable. Besides, training speech diffusion probabilistic models typically requires more computational resources, and degradation can be witnessed with decreased training data.
Our proposed model lowers the requirements for high-quality speech synthesis, which may cause unemployment for people in related occupations, such as broadcasters and radio hosts. In addition, there is the potential for harm from non-consensual voice cloning or the generation of fake media, and the voices of the speakers in the recordings might be used more than they expect.

Figure 1: Comparison of sample distribution coverage between diffusion models and GANs. We randomly choose 4 different phonemes (ph = AH2, S, AW1, TH) in this case study.

Figure 2: The overall architecture of the dual-empowered speech models. Subfigure (a) takes noisy audio x_t as input and conditions on the diffusion time index t and mel-spectrogram c.

Table 3: Ablation study results. Comparison of the effect of each component on quality. DR: diffusion reparameterization; RO: reconstruction objective.

Table 4: Comparison with other neural vocoders on synthesized utterances for unseen speakers.