DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models

We present DiffusionBERT, a new generative masked language model based on discrete diffusion models. Diffusion models and many pre-trained language models share a training objective, i.e., denoising, making it possible to combine the two powerful models and enjoy the best of both worlds. On the one hand, diffusion models offer a promising training strategy that helps improve generation quality. On the other hand, pre-trained denoising language models (e.g., BERT) can serve as a good initialization that accelerates convergence. We explore training BERT to learn the reverse process of a discrete diffusion process with an absorbing state and elucidate several designs to improve it. First, we propose a new noise schedule for the forward diffusion process that controls the degree of noise added at each step based on the information carried by each token. Second, we investigate several designs for incorporating the time step into BERT. Experiments on unconditional text generation demonstrate that DiffusionBERT achieves significant improvement over existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous generative masked language models in terms of perplexity and BLEU score. Promising results on conditional generation tasks show that DiffusionBERT can generate texts of comparable quality to, and greater diversity than, a series of established baselines.


Introduction
Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) have recently emerged as a new class of state-of-the-art generative models, achieving high-quality synthesis results on image data (Ramesh et al., 2022; Rombach et al., 2022; Saharia et al., 2022). Though these models have captured widespread attention from not only the research community but also the public, applying diffusion models to text data remains challenging and under-explored due to the discrete nature of text.
Figure 1: (1) Diffusion models for discrete data, which generate noise samples via $q(x_t | x_{t-1})$. (2) The diffusion process of DiffusionBERT is non-Markovian in that it generates noise samples $x_t$ conditioned not only on $x_{t-1}$ but also on $x_0$; such a non-Markovian process is due to our proposed noise schedule.
Prior works exploring diffusion models on text data can be divided into two lines. The first extends diffusion models to discrete state spaces (Hoogeboom et al., 2021; Austin et al., 2021). The second performs the diffusion process and its reverse process in the continuous domain and bridges the continuous and discrete domains through embedding and rounding (Li et al., 2022; Gong et al., 2022). However, none of these works leveraged pre-trained language models (PLMs; Devlin et al. 2019; Lewis et al. 2020; Raffel et al. 2020; Brown et al. 2020; Qiu et al. 2020), which are an unmissable treasure of the NLP community.
This work is, to our knowledge, the first attempt to combine diffusion models with PLMs. The combination is built upon a shared training objective between diffusion models and PLMs, i.e., denoising. Diffusion models consist of a forward process (data to noise) and a reverse process (noise to data). In the forward process, a small amount of noise is gradually added to the data. Then, a neural network (p_θ in Figure 1) is employed to learn the reverse process step by step, i.e., to denoise. Such a denoising neural network is naturally related to a wide class of PLMs that are pre-trained with denoising objectives, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020). Hence, pre-trained denoising language models can serve as a good starting point for learning the reverse diffusion process. On the other hand, diffusion models also offer a promising training strategy for generative PLMs. In contrast to commonly used generative PLMs (e.g., GPT (Brown et al., 2020)) that rely on an autoregressive factorization of the joint probability, diffusion models provide another factorization, along the dimension of time, and therefore allow the model to be non-autoregressive. Thus, diffusion models can be combined with a variety of PLMs that may not be pre-trained for generation.
In the discrete domain, the forward diffusion process can be implemented by a chain of transition matrices that gradually corrupt the clean text. As shown in Figure 1, the clean text "Hello world !" is gradually corrupted into "[MASK] [MASK] [MASK]" during the diffusion process. In this work, we explore using pre-trained denoising language models (e.g., BERT) to learn the reverse diffusion process and demonstrate their advantages in accelerating convergence and improving generation quality. Further, we propose a new noise schedule for the forward process based on the principle of distributing the corrupted information uniformly across the forward process. The noise schedule, called the spindle schedule, generates noise for x_t conditioned not only on x_{t-1} but also on x_0, making the forward process non-Markovian without changing the original training objective. Note that the denoising model takes as input x_t and the time step t to predict x_{t-1}, where t is unseen during the pre-training of language models, so we investigate several ways of incorporating the time step into PLMs. We find that the best result is achieved by discarding the time information entirely, which we call time-agnostic decoding (TAD).
Experimental results on unconditional text generation demonstrate the benefit of combining diffusion models with PLMs: the proposed DiffusionBERT significantly improves generation quality over existing diffusion models for text (e.g., D3PM (Austin et al., 2021) and Diffusion-LM (Li et al., 2022)) and previous generative masked language models (e.g., BERT-Mouth (Wang and Cho, 2019)). The effectiveness of the proposed spindle schedule and time-agnostic decoding is confirmed by ablation studies. In a nutshell, DiffusionBERT enjoys the best of both worlds.

Diffusion Models
Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) are a class of latent variable models originally designed for continuous domains. A diffusion model consists of a forward diffusion process and a reverse diffusion process. Given a sample x_0 ∼ q(x_0), a Markov chain of latent variables x_1, ..., x_T is produced in the forward process by progressively adding a small amount of Gaussian noise to the sample:

$$q(x_t | x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \tag{1}$$

where $\{\beta_t \in (0, 1)\}_{t=1}^{T}$ is a noise schedule controlling the step size of adding noise. Eventually x_T approaches an isotropic Gaussian distribution. If β_t is small enough, the reverse transition q(x_{t-1} | x_t) is also Gaussian and can be learned by a parameterized model

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right), \tag{2}$$

where μ_θ(·) and Σ_θ(·) can be implemented by a U-Net or a Transformer. When conditioning also on x_0, q(x_{t-1} | x_t, x_0) has a closed form, so we can minimize the variational lower bound to optimize log p_θ(x_0):

$$\mathcal{L}_{\mathrm{vb}} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T|x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big) - \log p_\theta(x_0|x_1) \Big], \tag{3}$$

where $\mathbb{E}_q(\cdot)$ denotes the expectation over the joint distribution q(x_{0:T}).
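As a concrete illustration, the forward process admits a closed-form marginal q(x_t | x_0) that jumps from the data to any noise level in one step. A minimal pure-Python sketch (the schedule values and variable names are ours, chosen for illustration, not taken from the paper):

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule as in Ho et al. (2020)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas, t):
    """alpha_bar_t = prod_{j<=t} (1 - beta_j): fraction of signal surviving to step t."""
    out = 1.0
    for b in betas[: t + 1]:
        out *= 1.0 - b
    return out

def q_sample(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(ab) * x0, (1 - ab) I) in closed form."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
betas = linear_beta_schedule(1000)
x0 = [rng.gauss(0.0, 1.0) for _ in range(16)]
xT = q_sample(x0, 999, betas, rng)  # by t = T, almost all signal is destroyed
```

Because `alpha_bar` shrinks toward zero, the final latent is nearly pure Gaussian noise, matching the stationary distribution described above.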

Diffusion Models in Discrete Domain
For discrete domains, each element of x_t is a discrete random variable with K categories. For text data, K = |V| is the size of the vocabulary. Denoting x_t as a stack of one-hot vectors, the process of adding noise can be written as

$$q(x_t | x_{t-1}) = \mathrm{Cat}\left(x_t;\ p = x_{t-1} Q_t\right), \tag{4}$$

where Cat(·) is a categorical distribution and Q_t is a transition matrix that is applied to each token in the sequence independently. Conditioning also on x_0, the posterior has the closed form

$$q(x_{t-1} | x_t, x_0) = \mathrm{Cat}\left(x_{t-1};\ p = \frac{x_t Q_t^\top \odot x_0 \overline{Q}_{t-1}}{x_0 \overline{Q}_t x_t^\top}\right), \tag{5}$$

where $\overline{Q}_t = Q_1 Q_2 \cdots Q_t$. Note that ⊙ is element-wise multiplication and the division is row-wise. With q(x_{t-1} | x_t, x_0) at hand, according to Eq. (3), we can use a parameterized model p_θ(x_{t-1} | x_t, t) to learn the reverse diffusion process.
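The matrix algebra above can be checked on a toy example. The sketch below uses a uniform transition kernel for concreteness (the absorbing kernel of §3.1 plugs in the same way); all function names are ours:

```python
def uniform_Q(beta, K):
    """Uniform transition matrix: keep the token w.p. 1 - beta,
    otherwise resample uniformly over the K categories."""
    return [[(1 - beta) + beta / K if i == j else beta / K for j in range(K)]
            for i in range(K)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def vecmat(v, A):
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

def posterior(xt, x0, Qt, Qbar_tm1, Qbar_t):
    """q(x_{t-1} | x_t, x_0) = (x_t Qt^T ⊙ x_0 Qbar_{t-1}) / (x_0 Qbar_t x_t^T)."""
    left = vecmat(xt, [list(row) for row in zip(*Qt)])  # x_t Qt^T
    right = vecmat(x0, Qbar_tm1)                        # x_0 Qbar_{t-1}
    den = sum(a * b for a, b in zip(vecmat(x0, Qbar_t), xt))
    return [l * r / den for l, r in zip(left, right)]

K = 4
Q1, Q2 = uniform_Q(0.1, K), uniform_Q(0.2, K)
Qbar2 = matmul(Q1, Q2)          # Qbar_t = Q_1 Q_2
x0 = [1, 0, 0, 0]               # one-hot clean token
x2 = [0, 1, 0, 0]               # one-hot noised token at t = 2
post = posterior(x2, x0, Q2, Q1, Qbar2)  # distribution over x_1
```

The row-wise normalizer $x_0 \overline{Q}_t x_t^\top$ guarantees the returned vector is a valid probability distribution.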

DiffusionBERT
In contrast to recently proposed diffusion models for text, e.g., Diffusion-LM (Li et al., 2022) and DiffuSeq (Gong et al., 2022), which are based on continuous diffusion models, we instead explore discrete diffusion models with a PLM as the backbone. We first introduce a specific instance of discrete diffusion models (Austin et al., 2021) that uses a transition matrix with an absorbing state, for the sake of using PLMs (§3.1). Second, we introduce a new noise schedule for the forward diffusion process, called the spindle schedule, which is based on the principle of distributing the corrupted information uniformly across the forward process (§3.2). Then, we investigate several alternatives for incorporating the time step into PLMs for predicting x_{t-1} given x_t and t (§3.3).

Diffusion Models with a Discrete Absorbing State

To be combined with pre-trained denoising language models, we incorporate an absorbing state, e.g., [MASK] for BERT, into the Markov process. In particular, each token in the sequence either stays the same or transitions to [MASK] with some probability. Formally, each entry of the transition matrix at step t is as follows:

$$[Q_t]_{ij} = \begin{cases} 1 & \text{if } i = j = [\mathrm{M}], \\ \beta_t & \text{if } j = [\mathrm{M}],\ i \neq [\mathrm{M}], \\ 1 - \beta_t & \text{if } i = j \neq [\mathrm{M}], \\ 0 & \text{otherwise}, \end{cases}$$

where [M] is the abbreviation of [MASK]. Such a Markov process converges to a stationary distribution q(x_T) that places all probability mass on the sequence of all [MASK] tokens.
The t-step marginal $q(x_t^i | x_0^i)$, where $x_t^i$ denotes the i-th token in the sequence at step t, can be easily obtained in a closed form:

$$q(x_t^i | x_0^i) = \mathrm{Cat}\left(x_t^i;\ p = \alpha_t x_0^i + (1 - \alpha_t)\, e_{[\mathrm{M}]}\right), \quad \alpha_t = \prod_{j=1}^{t}(1-\beta_j), \tag{6}$$

where $e_{[\mathrm{M}]}$ is the one-hot vector of the [MASK] token. Combining this with Eq. (3) and (5), we can derive a training objective to optimize p_θ(x_{t-1} | x_t, t) and generate a sample by performing the reverse diffusion process.
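Because each token is corrupted independently, sampling from the t-step marginal of the absorbing chain reduces to masking each token with probability 1 − α_t. A minimal sketch under the standard β_t = (T − t + 1)^{-1} schedule (token ids and names are our toy conventions):

```python
import random

MASK = 0  # id reserved for the absorbing [MASK] token (toy convention)

def alpha_bar(betas, t):
    """alpha_t = prod_{j<=t} (1 - beta_j): survival probability at step t."""
    out = 1.0
    for b in betas[:t]:
        out *= 1.0 - b
    return out

def q_sample_absorbing(tokens, t, betas, rng):
    """t-step marginal of the absorbing chain: each token independently
    stays itself with prob alpha_t, else transitions to [MASK]."""
    a = alpha_bar(betas, t)
    return [tok if rng.random() < a else MASK for tok in tokens]

rng = random.Random(0)
T = 16
betas = [1.0 / (T - t + 1) for t in range(1, T + 1)]  # (T - t + 1)^{-1} schedule
x0 = [7, 3, 9, 2, 5, 8, 4, 6]
xT = q_sample_absorbing(x0, T, betas, rng)  # all [MASK] at t = T
```

Under this schedule the survival probability simplifies to α_t = 1 − t/T, so at t = T every token is absorbed, matching the stationary distribution above.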

Spindle Noise Schedule
The noise schedule in the continuous domain, such as the linear schedule (Ho et al., 2020) and the cosine schedule (Nichol and Dhariwal, 2021a), has been shown to be important to the performance of diffusion models.
In contrast to the continuous domain, where the degree of noise can be easily controlled through the variance of the Gaussian, (1) it is less obvious how to control the degree of noise added at each step in the discrete domain. For the discrete domain, the noise schedule β_t = (T − t + 1)^{-1} has been explored for the uniform transition matrix (Sohl-Dickstein et al., 2015; Hoogeboom et al., 2021) and the absorbing-state transition matrix (Austin et al., 2021). However, (2) such a schedule assumes all tokens carry the same amount of information and does not consider the linguistic differences among the tokens in a sequence. Besides, (3) it violates the easy-first-generation nature of denoising language models: the model tends to first generate tokens that appear most frequently (and are least surprising) in the training corpus to achieve a higher likelihood, and as the context becomes richer, more details come up in the sequence.
To address the above issues, we consider a noise schedule that (1) measures the noise added at each step by the amount of corrupted information and encourages the corrupted information to be distributed uniformly across the diffusion steps. Since the information is measured independently for each token, (2) different tokens in a sequence are assigned different probabilities of transitioning to the [MASK] token. Moreover, inspired by the easy-first-generation phenomenon, (3) we sort the tokens in a sequence in descending order of their information content and divide them into T buckets, each containing the same amount of information. That is, we mask the most informative tokens at the start of the forward process and the least informative tokens at the end, such that the learnable reverse process follows an easy-first generative behavior.
In particular, distributing corrupted information uniformly across the forward steps can be formally described by

$$\frac{1}{n}\sum_{i=1}^{n} \alpha_t^i\, H(x_0^i) = \left(1 - \frac{t}{T}\right) \cdot \frac{1}{n}\sum_{i=1}^{n} H(x_0^i), \tag{7}$$

where H denotes the entropy, which measures the amount of information of a random variable, $x^i$ denotes the i-th token in the sequence, and n denotes the length of the sequence. In Eq. (7), $\alpha_t^i = \prod_{j=1}^{t}(1-\beta_j^i)$ denotes the probability that the i-th token remains the same at step t, i.e., $x_t^i = x_0^i$.
We expect that $\alpha_t^i > \alpha_t^j$ if $H(x_0^i) < H(x_0^j)$, such that easy (low-information) tokens emerge earlier than hard (high-information) tokens during the reverse process.
Considering the aforementioned properties, we construct $\alpha_t^i$ as follows:

$$\alpha_t^i = 1 - \frac{t}{T} - S(t)\cdot \tilde{H}(x_0^i), \quad S(t) = \lambda \sin\frac{t\pi}{T}, \tag{8}$$

where $\tilde{H}(x_0^i)$ denotes the informativeness of the i-th token normalized over the sequence, and S(t) is introduced to control the effect of the informativeness at time step t. It is designed to be sinusoidal to ensure S(0) = S(T) = 0 such that x_t retains all (respectively, zero) information when t = 0 (respectively, t = T). The effect of S(t) is controlled by the hyperparameter λ. When λ = 0, the noise schedule degrades to β_t = (T − t + 1)^{-1} as in Sohl-Dickstein et al. (2015), Hoogeboom et al. (2021), and Austin et al. (2021). Figure 2 shows how α progresses during the forward process. The schedule is named spindle due to the shape of the probability curves. In our proposed schedule, the transition probability at time step t depends not only on the current state but also on the original text, making the forward diffusion process non-Markovian. Nevertheless, as revealed by Eq. (5), this does not change the original training objective.
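A sketch of how such a per-token schedule might be computed is given below. Note that the informativeness proxy (negative log unigram frequency) and the zero-centering of the normalized informativeness are our assumptions for illustration, not necessarily the paper's exact recipe:

```python
import math

def informativeness(freqs):
    """Proxy for H(x^i): negative log unigram frequency of each token.
    (An assumption; the paper measures per-token information from corpus
    statistics.)"""
    return [-math.log(f) for f in freqs]

def spindle_alpha(H, t, T, lam=0.3):
    """Per-token alpha^i_t = 1 - t/T - S(t) * H_tilde_i, with
    S(t) = lam * sin(t*pi/T) and H_tilde zero-centered so the
    sequence-average schedule stays 1 - t/T and S(0) = S(T) = 0."""
    mean_H = sum(H) / len(H)
    S = lam * math.sin(t * math.pi / T)
    return [min(1.0, max(0.0, 1.0 - t / T - S * (h / mean_H - 1.0)))
            for h in H]

# Toy sentence: one rare (informative) word and two frequent ones.
H = informativeness([0.001, 0.2, 0.3])
T = 16
mid = spindle_alpha(H, 8, T)  # the rare token already has the lowest alpha
```

Mid-way through the forward process the rare token is the most likely to be masked, so the reverse process learns to recover it last, as the easy-first principle prescribes.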

The Design Space of Feeding Time Steps
Typically, a diffusion model takes as input a noised sample and the time step to predict the denoised sample during the reverse process, i.e., p_θ(x_{t-1} | x_t, t). However, t is an additional variable that is unseen during the pre-training of language models, so it is not trivial how to feed the time information into a PLM. Here we explore three design choices for feeding time steps.
Layer-wise Time Embedding A straightforward choice is to include the time step in the same way as the positional encoding, i.e., using the Transformer sinusoidal embedding or a learnable MLP in each Transformer layer. This approach is commonly adopted in previous work (Ho et al., 2020; Austin et al., 2021).

Figure 2: Each token in a sequence has a specific noise schedule depending on how much information is lost when it is masked. For instance, in the sentence "Bella is sitting over there.", "Bella" is the most informative word. Thus it is encouraged to be masked at an early stage of the forward process so that our model learns to recover it in the last place.

Prefix Time Embedding Prompting language models by prepending trainable soft tokens to the input sequence has recently shown promising results (Lester et al., 2021; Sun et al., 2022). Hence, we also explore including a time step token embedding v(t) as a prefix of the input token embeddings v(x_t^1), v(x_t^2), ..., v(x_t^n). In particular, the time step token is inserted between the [CLS] token and the input sequence. The added time step token embeddings are trained along with the PLM.

Time-Agnostic Decoding
Another alternative is not to explicitly incorporate the time step t at all, because it can be implied by the noised sample x_t. In contrast to image data, the diffusion time step of text is easy to infer implicitly by counting the number of corrupted tokens (i.e., [MASK]) in the noised sequence. In this way, the PLM performs iterative decoding while being ignorant of the current time step, i.e., p_θ(x_{t-1} | x_t).
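The intuition that x_t already encodes t can be made concrete: under the absorbing chain with the (T − t + 1)^{-1} schedule, the expected fraction of [MASK] tokens at step t is t/T. A hypothetical sketch (names and token ids are ours):

```python
def implied_time_step(xt, mask_id, T):
    """Time-agnostic decoding never feeds t to the model; under the
    absorbing chain with the (T - t + 1)^{-1} schedule, the expected
    mask ratio at step t is t/T, so t is recoverable from x_t itself."""
    mask_ratio = sum(1 for tok in xt if tok == mask_id) / len(xt)
    return round(mask_ratio * T)

xt = [101, 0, 57, 0, 0, 88, 0, 0]  # 5 of 8 tokens are [MASK] (id 0)
t_hat = implied_time_step(xt, mask_id=0, T=2048)
```

This is only the expectation; the model sees the exact corruption pattern, which carries even more information than the scalar t.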

Experimental Setup
We mainly focus on unconditional text generation in complex scenarios where the training data covers a wide range of topics and is composed of a large vocabulary. Experiments are conducted on the One Billion Word dataset (LM1B) (Chelba et al., 2014). LM1B is a language corpus with about 30 million sentences and a vocabulary of about 793k. We use the standard train-test split and take 1% of the training set for validation. All text data are lower-cased to align with the settings of Austin et al. (2021).
Our DiffusionBERT is based on BERT-BASE-UNCASED with about 110M parameters. We train DiffusionBERT using the AdamW optimizer (Loshchilov and Hutter, 2019) for 1.9 million steps with a learning rate of 3e-6, a dropout probability of 0.1, and a batch size of 32. For the first 10K steps, we use a linear warmup schedule starting from a learning rate of 1e-8.
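The warmup schedule described above can be sketched as follows (the constant behavior after warmup is our assumption, since no post-warmup decay is specified):

```python
def lr_at_step(step, warmup=10_000, base_lr=3e-6, init_lr=1e-8):
    """Linear warmup from 1e-8 to the base learning rate of 3e-6 over
    the first 10K steps, then constant (a sketch of the setup above;
    the exact post-warmup behavior is an assumption)."""
    if step < warmup:
        return init_lr + (base_lr - init_lr) * step / warmup
    return base_lr
```

For example, halfway through warmup the learning rate sits midway between 1e-8 and 3e-6.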

Baselines
We compare against several non-autoregressive (NAR) baselines on unconditional text generation.

D3PM D3PM (Austin et al., 2021) is a general framework of discrete diffusion models. We implement an instance of D3PM with the absorbing state and a layer-wise time embedding. Both DiffusionBERT and D3PM are implemented with sequence length n = 128 and diffusion steps T = 2048. During inference, we advance 16 time steps in each iteration, for a total inference cost of 128 iterations, which is smaller than that chosen in existing diffusion or diffusion-like models for unconditional generation (Hoogeboom et al., 2021; Savinov et al., 2022). This has no impact on our conclusions since increasing the number of diffusion steps does not bring substantial improvement.
Diffusion-LM Diffusion-LM (Li et al., 2022) learns an embedding to map discrete text into a continuous space, where it performs a Gaussian diffusion process. A rounding step is required to map the continuous embeddings back into discrete text. We re-implemented Diffusion-LM with the model architecture of BERT and diffusion steps T = 2000. Since the performance drop of Diffusion-LM is larger than that of D3PM and DiffusionBERT when we sample fewer steps during generation, we do not skip steps, so its number of inference iterations is about 4 times that of DiffusionBERT; the exact generation time comparison is reported in §4.5.

BERT-Mouth BERT-Mouth (Wang and Cho, 2019) samples text from BERT via order-agnostic autoregressive masked language modeling. Starting from a sequence of [MASK] tokens, BERT samples one token at each time step in random order. Another option is decoding from left to right, as autoregressive models do; in our preliminary experiments, we find that random position sampling performs better. We continue pre-training BERT on LM1B to adapt it to the training corpus.

Main Results
Our main results are included in Table 2. We choose BLEU-4 as the metric for generation quality and diversity. For each method, we sample 1K texts for evaluating the BLEU score and another 1K for self-BLEU. Note that with different sampling strategies, the BLEU/self-BLEU results may vary. For a fair comparison, the sentences sampled by D3PM and DiffusionBERT have a fixed length n = 64 and are sampled with a top-K filter where K = 30. Diffusion-LM and BERT-Mouth are trained and sampled following their original implementations. Overall, DiffusionBERT achieves the best generation quality and diversity trade-off among the considered NAR methods. Besides, the perplexity of DiffusionBERT with the spindle noise schedule is substantially lower. The evidence lower bound is used as a proxy for the perplexities of DiffusionBERT and D3PM since the exact likelihood of diffusion models is intractable.
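For reference, self-BLEU scores each sample against the remaining samples as references, so lower values indicate higher diversity. A simplified pure-Python sketch (the paper's exact BLEU implementation may differ, e.g., in smoothing):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(cand, refs, max_n=4):
    """Minimal sentence BLEU with clipped n-gram precision and brevity
    penalty; a simplified stand-in for BLEU-4."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_counts = Counter(ngrams(cand, n))
        if not c_counts:
            return 0.0
        max_ref = Counter()
        for ref in refs:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in c_counts.items())
        if clipped == 0:
            return 0.0
        log_p += math.log(clipped / sum(c_counts.values())) / max_n
    ref_len = min((len(r) for r in refs), key=lambda L: abs(L - len(cand)))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * math.exp(log_p)

def self_bleu(samples):
    """Diversity proxy: average BLEU of each sample against the others
    (lower self-BLEU means more diverse samples)."""
    return sum(bleu(s, samples[:i] + samples[i + 1:])
               for i, s in enumerate(samples)) / len(samples)
```

A set of identical samples scores a self-BLEU of 1.0, while samples sharing no n-grams score 0.0.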
DiffusionBERT vs. Other Generative BERT Models We compare DiffusionBERT with another representative generative masked language model, BERT-Mouth (Wang and Cho, 2019). Experimental results show that DiffusionBERT achieves better performance in terms of both perplexity and BLEU score. We attribute the superior performance of DiffusionBERT to its sampling of all tokens at once, which helps DiffusionBERT generate more coherent text, especially over long ranges. Although such decoding may face the problem of multimodality (Gu et al., 2018), inappropriate phrases can be fixed in subsequent diffusion steps. The probabilistic modeling offers more flexibility in that generated tokens with low probability are more likely to be masked and re-sampled. In BERT-Mouth, however, tokens are fixed once sampled. Wang and Cho (2019) also proposed to continue masking and predicting tokens after the whole sequence is complete, revising the sentence for higher quality; but such randomness in the selection and replacement of tokens results in low inference speed.

Discrete vs. Continuous Diffusion Models
We then focus on the comparison of discrete and continuous diffusion models for text generation. To this end, we mainly compare DiffusionBERT with the recently proposed Diffusion-LM, which is based on continuous diffusion models. Despite its outstanding controllability, we find that the texts generated by Diffusion-LM are of lower quality than those of DiffusionBERT. Though DiffusionBERT and Diffusion-LM adopt the same Transformer configuration, it is worth noting that the superior performance of DiffusionBERT may be attributable not only to the discrete diffusion formulation but also to the use of a pre-trained model. To disentangle the effect of pre-training from that of discrete versus continuous diffusion, we also explore initializing Diffusion-LM with BERT. As shown in Table 2, training Diffusion-LM from the BERT initialization performs even worse than training it from scratch. We conjecture that the continuous nature of Diffusion-LM is not compatible with BERT initialization, since the embeddings learned by BERT may not be suitable for the Gaussian diffusion process. In contrast, the comparison of D3PM and DiffusionBERT shows that DiffusionBERT benefits greatly from BERT initialization thanks to its discrete diffusion process.

Effect of Time Step

In terms of both likelihood and generation quality, the layer-wise time embedding (LTE) lags far behind the other two time step designs for DiffusionBERT, while time-agnostic decoding (TAD) achieves the best result. By contrast, D3PM without time step embedding performs significantly worse. In a nutshell, simplifying the time step design has a positive effect on DiffusionBERT but is quite harmful for D3PM. This suggests that initializing p_θ with a PLM enables DiffusionBERT to perform generation without explicitly provided time information yet achieve better generation results. The resemblance between the BERT pre-training objective and absorbing diffusion models makes it easier for DiffusionBERT to generalize to noisier scenarios, while a Transformer encoder trained from scratch needs a specific time-aware module to model the reverse process.
Effect of the Spindle Noise Schedule We apply our proposed spindle noise schedule to both DiffusionBERT and D3PM. The perplexity is improved by 18% and 19% for D3PM and DiffusionBERT, respectively. Besides, D3PM with the spindle schedule outperforms that with the standard (T − t + 1)^{-1} schedule in generation quality. The same trend holds for DiffusionBERT, but with a smaller margin.

Quality-Diversity Trade-off
As shown in Figure 3, DiffusionBERT exhibits generation ability comparable to a Transformer decoder trained from scratch and pushes out the Pareto front of the NAR generation quality/diversity trade-off by a large margin. However, it still falls behind pre-trained AR models of the same size.

Efficiency of Training and Generation
One important feature of DiffusionBERT is that, with time-agnostic decoding, all parameters are initialized from the pre-trained model. Consequently, DiffusionBERT introduces no new parameters and is free from adapting them, improving both training and decoding efficiency.
Faster Convergence DiffusionBERT converges remarkably faster than D3PM. Figure 4 shows the curve of the validation ELBO during training. Even if the training budget is cut to 30% (i.e., 0.5 million steps), DiffusionBERT is still able to match the performance reported in Table 2.
Sampling Speed With the x_0-parameterization proposed in Song et al. (2021) and Austin et al. (2021), DiffusionBERT is able to perform inference within any given budget by controlling the step size of the reverse process. We also control the sampling time of BERT-Mouth by adjusting the maximum iteration count of its mask-predict process. We list the decoding speed and the corresponding perplexity on the LM1B test set in Table 3. Overall, DiffusionBERT exhibits competitive performance even when it reaches speed comparable to GPT, and it outperforms BERT-Mouth in the efficiency-performance trade-off.
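Budgeted sampling with the x_0-parameterization can be sketched as choosing evenly spaced time steps; for instance, T = 2048 with a budget of 128 iterations gives a stride of 16, matching the inference setup in §4.1 (function and variable names are ours):

```python
def inference_schedule(T, budget):
    """Evenly spaced reverse-process time steps under a decoding budget.
    With the x0-parameterization, the model predicts x0 at every visited
    step, so the reverse chain can jump from t to t - stride directly."""
    stride = T // budget
    return list(range(T, 0, -stride))

steps = inference_schedule(T=2048, budget=128)  # stride 16: 2048, 2032, ..., 16
```

Shrinking the budget trades a few perplexity points for proportionally faster decoding, which is the knob behind the speed comparison in Table 3.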

BERT for Text Generation
It has been shown by Wang and Cho (2019) that the transfer-learning ability of BERT not only helps to achieve impressive results in natural language understanding but also benefits sequential sampling for text generation. However, its bidirectional nature keeps BERT from matching decoder-only counterparts (Radford et al., 2018) in modeling text from left to right.

Diffusion Models for Text
This work lies in the line of diffusion models, a latent variable generative framework proposed by Sohl-Dickstein et al. (2015). It has been architecturally improved by Ho et al. (2020) and has gained broad attention for its impressive generation ability in continuous domains (e.g., image and audio) (Ramesh et al., 2022; Kong et al., 2021; Nichol and Dhariwal, 2021b). Despite their great success and state-of-the-art sample quality in these domains, diffusion models for text still struggle to match autoregressive models in various generation tasks. Since the Gaussian noise proposed in Sohl-Dickstein et al. (2015) cannot be directly applied to discrete data, they also introduced a discrete forward process with a Bernoulli transition kernel. Hoogeboom et al. (2021) made a step forward from Bernoulli to categorical distributions. A more general family of discrete diffusion processes was introduced by Austin et al. (2021) and Hoogeboom et al. (2022), including absorbing kernels and combinations of absorbing and uniform transition kernels. Diffusion-LM (Li et al., 2022) models text in the continuous embedding space, which is closer to the settings of earlier diffusion models and shows impressive performance in classifier-controlled text generation. However, its decoding and convergence speeds are substantially slower, and the generated text lacks coherence. Moreover, in scenarios with a large vocabulary, the k-nearest-neighbor algorithm used in decoding slows generation even further.

Non-Autoregressive Text Generation
Absorbing discrete diffusion models resemble conditional masked language models (CMLMs) (Ghazvininejad et al., 2019) in that both methods predict the whole sequence simultaneously and follow a construct-destruct pattern to iteratively refine the generated text. The main difference lies in the training objective: DiffusionBERT models a stochastic process and drives BERT to learn a group of distributions that gradually recover the training data, while CMLMs force the network to deterministically recover the whole sequence in every iteration and thus fail to explicitly model the denoising process. Savinov et al. (2022) proposed to approach non-autoregressive text modeling by unrolling the generation path to prepare the model for the partially corrupted sequences it will encounter during generation, which resembles the idea of diffusion models for unconditional text generation. Non-autoregressive models are also considered in translation but implemented in various ways, e.g., insertion/deletion (Gu et al., 2019) and iterative sequence alignment (Saharia et al., 2020).

Conclusion
This work approaches the problem of unconditional text generation with non-autoregressive models. To this end, we combine pre-trained language models with absorbing-state discrete diffusion models for text. The training procedure of our proposed DiffusionBERT includes two main deviations from current discrete diffusion models, i.e., a new family of time step designs and the spindle noise schedule, which assigns each token a schedule according to its frequency in the training corpus. Experimental results demonstrate the success of DiffusionBERT in terms of perplexity. It also pushes out the Pareto front of the quality-diversity trade-off of NAR methods by a large margin, comparable to a Transformer decoder trained from scratch.