DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models

Diffusion models have gained prominence in generating high-quality sequences of text. Nevertheless, current approaches predominantly represent discrete text within a continuous diffusion space, which incurs substantial computational overhead during training and results in slower sampling speeds. In this paper, we introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space, thereby enhancing its capacity to recover conditional signals. During the sampling phase, we employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process. Comprehensive experimental evaluations reveal that our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster, rendering it significantly closer to practical application.\footnote{The code is released at \url{https://github.com/Shark-NLP/DiffuSeq}.}


Introduction
After diffusion models gained significant attention in the vision domain (Ho et al., 2020), progress has been made in applying them to text generation tasks, including constrained text generation in Diffusion-LM (Li et al., 2022) and sequence-to-sequence (Seq2Seq) text generation in DiffuSeq (Gong et al., 2023). Numerous subsequent studies demonstrate that diffusion models achieve results comparable to traditional autoregressive and non-autoregressive models on tasks such as machine translation (Yuan et al., 2022; Gao et al., 2022; Zheng et al., 2023) and summarization (Lin et al., 2022; Zhang et al., 2023; Mahabadi et al., 2023; Zhou et al., 2023). However, most of these works suffer from slow convergence during training and slow generation speed, particularly because these approaches require the Minimum Bayes Risk (MBR) decoding strategy (Koehn, 2004) to enhance generation quality, which doubles the time consumption.
To further narrow the gap between diffusion models and the prevailing autoregressive models, we propose an accelerated version of DiffuSeq for both the training and sampling stages. For training, in addition to GPU acceleration techniques such as FP16, an improved training scheme can enable the model to represent knowledge better and learn the data distribution more quickly (Hang et al., 2023). For sampling, a well-trained diffusion model can achieve similar quality within a single sampling run and without the need for MBR decoding, thus saving generation time. Furthermore, we can borrow the state-of-the-art ODE sampler DPM-solver++ (Lu et al., 2022a,b), which is already applied in fast image generation. Progress in discrete text diffusion models (Zheng et al., 2023) exhibits their superiority in using fewer sampling steps, intuitively inspiring us to bridge the gap between continuous and discrete spaces.
Based on BERT (Devlin et al., 2019) and BART (Lewis et al., 2020), as well as the absorbing state in D3PM (Austin et al., 2021), we propose incorporating an extra learned soft absorbing state and discretely adding it on top of the Gaussian noise, so that the model jointly denoises the noise from both sources. The processes are illustrated in Figure 1. Specifically, after imposing Gaussian noise, we randomly replace the continuous vectors of the sequence with the absorbing state, with the replacement ratio set according to the time step. This approach bridges the gap between continuous and discrete diffusion processes and also aligns the training and inference stages better. It further facilitates the integration of DPM-solver++ and reduces the number of required sampling steps. In summary, our contributions are: 1. We introduce the learned soft absorbing state to help continuous diffusion models converge faster and to eliminate the need for MBR decoding to ensure quality during sampling.
2. We adapt the DPM-solver++ to our enhanced diffusion text generation approach and demonstrate its feasibility in accelerating the generation speed experimentally.

Continuous Diffusion Models
Ho et al. (2020) and Song et al. (2020) formulate diffusion models in continuous space with forward and reverse processes. The forward process gradually corrupts a data point $x_0$ into standard Gaussian noise $x_T \sim \mathcal{N}(0, I)$. For each forward step $t \in \{1, 2, \ldots, T\}$, the perturbation follows $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, with $\beta_t \in (0, 1)$ as different noise scales. After the forward process, the reverse denoising process tries to gradually reconstruct the original data $x_0$ by sampling from $x_T$ with a learned diffusion model $f_\theta(x_t, t)$. Diffusion-LM (Li et al., 2022) and DiffuSeq (Gong et al., 2023) design an embedding function $\mathrm{EMB}(w)$ to map the discrete text $w$ into a continuous space, and apply a clamping operation on $x_t$ at each sampling step to map it back to the word embedding space and reduce rounding errors.
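As a minimal illustration of the closed-form forward perturbation $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t) I)$ implied by the step-wise process above (a pure-Python sketch, not the authors' code; the linear beta schedule and toy dimensions are assumptions):

```python
import math
import random

def forward_sample(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I),
    where abar_t is the cumulative product of alpha_i = 1 - beta_i."""
    abar = 1.0
    for i in range(t):
        abar *= 1.0 - betas[i]
    scale, std = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [scale * v + std * random.gauss(0.0, 1.0) for v in x0]

# Toy linear beta schedule over T = 1000 steps (an assumption for illustration).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
x_T = forward_sample([0.5] * 8, T, betas)
```

At $t = T$ the cumulative $\bar\alpha_T$ is nearly zero, so $x_T$ is essentially pure Gaussian noise, matching $x_T \sim \mathcal{N}(0, I)$ above.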

Discrete Diffusion Models
For discrete diffusion probabilistic models, each $x_t$ is a discrete random variable represented as a one-hot vector in $\{0, 1\}^K$, indicating the current state of each token, where $K$ is the vocabulary size. Multinomial diffusion (Hoogeboom et al., 2021) adopts a uniform noise distribution over the vocabulary. D3PM (Austin et al., 2021) specifies $q(x_t \mid x_{t-1})$ through a transition matrix and makes it a point mass with the probability on an absorbing state [MASK]. Zheng et al. (2023) further derive an equivalent reparameterization of the discrete diffusion process; the resulting formulation is more amenable to training and leads to much-improved generation quality. However, discrete diffusion models may miss the opportunity to directly leverage existing techniques from continuous diffusion models.
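A minimal sketch of the absorbing forward process (for illustration only; the linear masking schedule $t/T$ is an assumption, whereas D3PM defines the process through explicit transition matrices):

```python
import random

MASK = "[MASK]"

def absorbing_forward(tokens, t, T):
    """D3PM-style absorbing corruption: by time t, each token has been
    independently replaced by [MASK] with probability t / T (a simple
    schedule assumed here for illustration)."""
    keep_prob = 1.0 - t / T
    return [tok if random.random() < keep_prob else MASK for tok in tokens]

fully_absorbed = absorbing_forward(["the", "cat", "sat"], t=1000, T=1000)
untouched = absorbing_forward(["the", "cat", "sat"], t=0, T=1000)
```

Because [MASK] is an absorbing state, a token never leaves it once entered, and at $t = T$ the whole sequence is absorbed.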

Methods
DiffuSeq formulates Seq2Seq tasks as conditional generation and learns $p(w^y \mid w^x)$, where $w^x$ and $w^y$ denote the input and target sequences, respectively. We follow its notation, concatenating the two sequences as $z_t = x_t \oplus y_t$, where $x_t$ and $y_t$ represent the parts of $z_t$ that belong to $w^x$ and $w^y$, respectively. To accelerate continuous diffusion models while leveraging the absorbing state of discrete diffusion models, we add a learnable soft absorbing state to DiffuSeq. We first combine the continuous Gaussian noise and the discrete absorbing noise, and then jointly denoise them. Detailed derivations can be found in Appendix A.

Training Stage
Forward process with soft absorbing state. In continuous space, we first add Gaussian noise $\epsilon$ at each time step, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$:
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{1}$$
Considering the $i$-th token in the hidden representation $z_t$ of the word sequence, we replace its representation with our soft absorbing state $m$ with a certain probability. The soft absorbing state $m$ has the same hidden dimension as the word embeddings and is jointly learned along with the whole diffusion process:
$$z_t^{(i)} \leftarrow \rho\, m + (1 - \rho)\, z_t^{(i)}, \tag{2}$$
where $\rho \sim \mathrm{Bernoulli}(\beta_t \gamma)$, and $\gamma$ is the [MASK] ratio when $t = T$. This operation keeps the diffusion model in continuous space but discretely replaces the representations of some tokens in the sequence, with a replacement probability scaled to the time step in the same way as $\beta_t$. Note that both kinds of noise are imposed only partially, on the target $y_t$ part, in the manner of DiffuSeq.
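The combined corruption can be sketched as follows (a pure-Python illustration, not the released code; in the actual method $m$ is a learned vector and the noise is applied only to the target part $y_t$):

```python
import math
import random

def corrupt(z0, t, betas, m, gamma=0.5):
    """Forward corruption combining both noise sources:
      1. Gaussian: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps
      2. Discrete: each token vector is replaced by the soft absorbing
         state m with probability beta_t * gamma (rho ~ Bernoulli)."""
    abar = 1.0
    for i in range(t):
        abar *= 1.0 - betas[i]
    a, s = math.sqrt(abar), math.sqrt(1.0 - abar)
    z_t = []
    for vec in z0:  # one embedding vector per token
        noisy = [a * v + s * random.gauss(0.0, 1.0) for v in vec]
        rho = random.random() < betas[t - 1] * gamma
        z_t.append(list(m) if rho else noisy)
    return z_t
```

Because the replacement probability scales with $\beta_t$, more tokens are absorbed at later (noisier) time steps, mirroring the schedule of the Gaussian noise.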

Sampling Stage
Previous continuous text diffusion models adopt a clamp operation to make the vector predictions more precise and reduce rounding errors (Li et al., 2022) during sampling. However, this operation is not deployed in the training stage, and the resulting gap between training and sampling (Tang et al., 2023) may hinder the performance and the further optimization of sampling speed.
In contrast, in our method the same discrete noise as in Eq. (2) is sprinkled into the continuous Gaussian noise during sampling, which bridges training and inference in the discrete space. Using the exact solution of diffusion ODEs proposed in DPM-solver++ (Lu et al., 2022a,b), given an initial value $z_s$ at time $s > 0$, we have:
$$z_t = \frac{\sigma_t}{\sigma_s} z_s + \sigma_t \int_{\lambda_s}^{\lambda_t} e^{\lambda} f_\theta(\hat{z}_\lambda, \lambda)\, \mathrm{d}\lambda,$$
where $\lambda$ is a strictly decreasing function of $t$, $\sigma_t$ is monotonic in $\beta_t$, and $f_\theta(\hat{z}_\lambda, \lambda)$ is aligned with the training objective. The integral term can be analytically computed by repeatedly applying integration by parts $n$ times; we approximate only the first several orders and drop the higher-order error terms. We use the second order in our experiments.
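One second-order multistep update of DPM-Solver++ in data-prediction form can be sketched as follows (variable names and the first-order fallback are our own; in practice the $\hat{z}_0$ predictions come from the denoiser $f_\theta$):

```python
import math

def dpmpp_2m_step(z_s, x0_s, x0_prev, lam_s, lam_t, lam_prev,
                  alpha_t, sigma_s, sigma_t):
    """One DPM-Solver++(2M) step in data-prediction form:
        z_t = (sigma_t / sigma_s) * z_s - alpha_t * (exp(-h) - 1) * D
    with h = lam_t - lam_s, lam = log(alpha / sigma), and the
    second-order combination of the last two z_0 predictions:
        D = (1 + 1/(2r)) * x0_s - 1/(2r) * x0_prev,
        r = (lam_s - lam_prev) / h.
    Falls back to a first-order step when no previous prediction exists."""
    h = lam_t - lam_s
    if x0_prev is None:
        D = list(x0_s)
    else:
        r = (lam_s - lam_prev) / h
        c = 1.0 / (2.0 * r)
        D = [(1.0 + c) * a - c * b for a, b in zip(x0_s, x0_prev)]
    ratio = sigma_t / sigma_s
    coef = alpha_t * (math.exp(-h) - 1.0)
    return [ratio * z - coef * d for z, d in zip(z_s, D)]
```

A sanity check on the design: if the model's $\hat{z}_0$ prediction is exact and constant, the step maps $\alpha_s \hat{z}_0$ to $\alpha_t \hat{z}_0$ exactly, i.e. it follows the noiseless ODE trajectory without discretization error.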

Experiments
We aim to validate two main research questions. RQ1: Can the soft absorbing state added to continuous diffusion models improve generation quality and boost training convergence? RQ2: To what extent does the DPM ODE solver improve sampling speed, and how does it affect performance?

Experiment setup
Dataset We adopt QQP for paraphrase generation, i.e., rewriting a sentence with the same semantic content, which is lightweight to train on and has been widely used in many Seq2Seq text diffusion models.
Baselines We choose DiffuSeq (Gong et al., 2023) and BG-DiffuSeq (Tang et al., 2023) as representatives of continuous diffusion models. The latter is an enhanced version of DiffuSeq that targets bridging the gap between training and sampling to generate high-quality texts within fewer steps. We choose multinomial diffusion (Hoogeboom et al., 2021), D3PM-absorbing (Austin et al., 2021), and their reparameterized versions (Zheng et al., 2023) as representatives of discrete diffusion models, trained with fewer diffusion steps.

Implementation details
We follow the training, sampling, and evaluation implementation of DiffuSeq. The experiments are run on NVIDIA A100 80G GPUs, with 2 GPUs for training. More details can be found in Appendix C.2.

Main results
As reported in Table 1, our method outperforms DiffuSeq, and our speedup version (step=2) still yields a relative improvement over DiffuSeq (MBR=1). Our method is also superior to BG-DiffuSeq, which bridges the gap between training and sampling in continuous space. The speedup versions of our method still hold an advantage over discrete diffusion models, especially in terms of generation speed. We do not directly compare with RDMs because it designs an algorithm to route the discrete change of words, while ours uses a vanilla transition that uniformly changes tokens to the soft absorbing state.

Training speed
We use FP16 for GPU acceleration (Ott et al., 2018), which reduces the total training time from 28 to 11 hours on 2 GPUs, a 2.5× speedup. According to Figure 2, the joint denoising training scheme expedites training convergence by at least 1.75×, probably because the absorbing state perturbs the sequence representation discretely, which gives the model a better capacity to reconstruct the discrete text information. In total, the training cost is reduced by more than 4×.

Sampling speed
Sampling with FP16 is approximately 2× faster than the original DiffuSeq. Furthermore, incorporating DPM-solver++ shrinks the number of sampling steps to 10 or even 2 without sacrificing much performance, as shown in Figure 3. This improvement is significant compared with the DDIM sampling (Nichol and Dhariwal, 2021) used in DiffuSeq. Comparing our speedup version (step=2) with DiffuSeq (MBR=1), texts of higher quality are generated conditionally, and meanwhile about 800× faster.

Ablation study
We test different sampling strategies in Table 2. After removing the clamp operation, the performance of DiffuSeq degrades noticeably, while ours barely drops, which validates our earlier assumption that adding the soft absorbing state bridges the gap between training and sampling, so the clamp operation can be removed. Further plugging in DPM-solver++ does not introduce extra rounding errors. We also test sampling without the soft absorbing state on our model, which still adds this discrete noise in the training stage. A significant drop can be seen, demonstrating the importance of aligning training and sampling. We further analyze the choice of the hyper-parameter γ. As seen in Figure 4, a [MASK] ratio that is too small or too large harms performance, and values closer to the middle tend to perform better, so we choose γ = 0.5 as the default setting in our experiments.

Conclusions
In this work, we present a simple but effective training scheme for joint discrete and continuous text diffusion models, which additionally resets some tokens to the soft absorbing state. The discrete noise bridges the training and sampling stages, reducing the time cost of both, and the plugged-in DPM-solver++ further speeds up sampling. Our method is orthogonal to many other techniques, such as self-conditioning (Mahabadi et al., 2023; Chen et al., 2022) and weighting tokens by importance (He et al., 2022), which could further enhance generation quality. Our method is fundamental to diffusion text generation and can be applied beyond DiffuSeq (Chen et al., 2023).

Limitations
Regarding the methods, we opt not to incorporate length prediction, unlike other approaches; instead, we use the [PAD] token to indicate the length automatically, which may require more GPU memory for text generation. We validate our methods on the QQP dataset, one of the sequence-to-sequence text generation tasks. However, due to resource and time constraints, we are unable to test the effectiveness of our methods on more complex tasks such as machine translation and summarization. Additionally, this work does not explore the impact of scaling up the model size.
and for brevity, we omit the coefficients of $z_t$ and $z_0$ as constants. We can use the variational lower bound to optimize the negative log-likelihood, $\mathbb{E}[-\log p_\theta(x_0)] \le \mathcal{L}_{\mathrm{VLB}}$. The objective can be further rewritten as a combination of several KL-divergence terms.
A.2 Sampling stage
DPM-solver++ (Lu et al., 2022a,b) is formulated entirely on top of continuous diffusion models. According to it, we have the exact solution of diffusion ODEs, given an initial value $z_s$ at time $s > 0$:
$$z_t = \frac{\sigma_t}{\sigma_s} z_s + \sigma_t \int_{\lambda_s}^{\lambda_t} e^{\lambda} f_\theta(\hat{z}_\lambda, \lambda)\, \mathrm{d}\lambda,$$
where $\lambda$ is a strictly decreasing function of $t$ and $\sigma_t$ is monotonic in $\beta_t$; specifically, $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$.
We need to approximate $\int e^{\lambda} f_\theta\, \mathrm{d}\lambda$, which can be analytically computed by repeatedly applying integration by parts $n$ times. According to the second-order multistep DPM-Solver++ algorithm, the reconstruction of $z_0$ relies on $f_\theta$; after imposing the discrete noise in our method, the algorithm still applies, since our $f_\theta(\hat{z}_\lambda, \lambda)$ is exactly aligned with the training objective.

B Related Work
Continuous diffusion models were first applied to image generation (Song et al., 2020; Ho et al., 2020) and then to text generation (Li et al., 2022; Gong et al., 2023). Meanwhile, discrete diffusion models (Austin et al., 2021; Hoogeboom et al., 2021; Zheng et al., 2023) are designed for text generation. He et al. (2022) and Zhou et al. (2023) directly leverage the [MASK] token used in pretrained language models. By contrast, our method learns the soft absorbing state from scratch along with the whole diffusion process and bridges discrete diffusion with the continuous space. The idea of an absorbing state, or of discretely corrupting data, can be seen in much NLP work such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020), and even in diffusion image generation (Hu et al., 2022).

C.1 General setting
We use 2 A100 80G GPUs for training with FP16 and a batch size of 425, and a single GPU for sampling with a batch size of 100. The generation speed is measured with batch size 100 on one NVIDIA A100 80G GPU for all models, averaged over 3 runs.
To evaluate quality, we use the standard BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) scores. Since string-similarity-based metrics can be unsatisfactory for open-ended generation, we also report BERTScore (Zhang et al., 2019), which assesses the semantic similarity between generated sentences and references. For sentence-level diversity evaluation, we consider sentence-level self-BLEU (Zhu et al., 2018) to measure the n-gram overlap among the set of outputs w.r.t. one source sentence. The self-BLEU score is computed using 2 samples for each test case, generated with different seeds. All ablation studies are conducted using the original version of our model unless otherwise specified, to validate the effectiveness of introducing the soft absorbing state.
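As a toy sketch of the sentence-level self-BLEU computation (a simplified geometric-mean n-gram precision without smoothing or brevity penalty; actual evaluation uses standard BLEU implementations):

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of hyp against a single reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))

def self_bleu(samples, max_n=2):
    """Sentence-level self-BLEU over samples for one source sentence:
    each sample is scored against each other sample as the reference.
    Higher self-BLEU means less diverse outputs."""
    scores = []
    for i, hyp in enumerate(samples):
        for j, ref in enumerate(samples):
            if i == j:
                continue
            precs = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
            if min(precs) > 0:
                scores.append(math.exp(sum(math.log(p) for p in precs) / max_n))
            else:
                scores.append(0.0)
    return sum(scores) / len(scores)
```

With the paper's setting of 2 samples per test case, this reduces to averaging the two pairwise scores.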

C.2 Baselines setting
All baselines we used are open-sourced. DiffuSeq (Gong et al., 2023) and BG-DiffuSeq (Tang et al., 2023) are implemented based on HuggingFace Transformers. RDM (Zheng et al., 2023) is implemented using Fairseq, where temperature sampling is adopted during sampling and the beam size is set to 1. For generation speed, we report the results on a single NVIDIA A100 80G GPU with batch size 100.

Figure 1: Training and sampling stages with discrete noise, which helps the two stages align better.

Figure 2: The test BLEU score along with training hours under different training schemes.
Figure 3: Performance under different numbers of sampling steps.

Figure 4: The test BLEU score at different training hours for different settings of the ratio γ.

Table 1: Sequence-to-sequence text generation results on QQP. All results are reported without MBR decoding unless otherwise specified. The best result is bolded, and the gray columns are excluded from the comparison for fairness. The sampling step of BG-DiffuSeq is 20. The relative improvement ↑ is computed between our speedup version (step=2) and DiffuSeq (MBR=1).

Jointly denoise. The reverse process jointly reconstructs the corrupted data point. The simplified loss function is almost the same as in DiffuSeq, except that $z_t$ follows the different noise strategy:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}\big[\lVert f_\theta(z_t, t) - z_0 \rVert^2\big]. \tag{3}$$

Table 2: Different sampling strategies. [C] denotes the clamp operation, and w/o [M] denotes stopping adding the discrete noise during sampling.