InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation

Diffusion models have garnered considerable interest in the field of text generation. Several studies have explored text diffusion models with different structures and applied them to various tasks, including named entity recognition and summarization. However, there exists a notable disparity between the "easy-first" text generation process of current diffusion models and the "keyword-first" natural text generation process of humans, which has received limited attention. To bridge this gap, we propose InfoDiffusion, a non-autoregressive text diffusion model. Our approach introduces a "keyinfo-first" generation strategy and incorporates a noise schedule based on the amount of text information. In addition, InfoDiffusion combines self-conditioning with a newly proposed partially noising model structure. Experimental results show that InfoDiffusion outperforms the baseline model in terms of generation quality and diversity, as well as exhibiting higher sampling efficiency.


Introduction
Non-autoregressive (NAR) generation refers to a method of generating sequences where each element is generated independently, without relying on previously generated elements, allowing for faster parallel generation but potentially sacrificing generation accuracy (Xiao et al., 2022). Recently, diffusion models have demonstrated powerful generative capabilities in image generation tasks, gradually becoming a new paradigm in generative models. The successful application of diffusion models to continuous data such as images and audio has motivated researchers to introduce them to discrete data like text. Previous studies have attempted to incorporate diffusion models into non-autoregressive text generation, designing different text diffusion model structures and applying them to various text generation tasks, such as named entity recognition (Shen et al., 2023) and summarization (Zhang et al., 2023). However, these works have failed to recognize a fundamental difference between the process of generating text with diffusion models and the actual process of human text generation, which may be one reason why text diffusion models have consistently fallen short in terms of generation efficiency and quality.
Previous research has found that text diffusion models seem to follow an "easy-first" principle (Emelianenko et al., 2019; He et al., 2022) in the decoding process. The "easy-first" principle means the model tends to generate tokens that are most frequently observed (and least surprising) in the training corpus, in order to achieve a higher likelihood. As the context becomes richer, more details are incorporated into the sequence. Figure 1 illustrates the decoding process of some existing text diffusion models, where it can be observed that the model tends to prioritize generating simple, high-frequency, and semantically poor words like "the" and "of" before generating less frequent but more informative and semantically rich words like "structure" and "remember". This differs from the actual order in which humans process or generate text. People tend to prioritize the core parts of a sentence or a paragraph, which contain crucial information (Grice, 1975). For example, when asked "What are your upcoming plans?", you would answer: "I am going to finish a research paper." In this process, the words that come to your mind first are most likely "paper" or "finish", as they carry key information or have higher information entropy, rather than meaningless words like "a" or "the". It is difficult to imagine how someone whose first reaction is "the" would answer the question above. Likewise, this order is detrimental to a language model that we expect to endow with language abilities and even thought.
(Figure 1 near here: stepwise decoding traces of D3PM, DiffusionBERT, and DiffuSeq.)

This inconsistent decoding order in text diffusion models could lead to poor generation quality and low efficiency. On one hand, because the core part of a sentence (the key information) is only generated accurately in the later half of the sampling process, the model lacks a comprehensive understanding of the overall semantic meaning of the sentence in the early stages, resulting in unsatisfactory generation quality. On the other hand, the lack of guidance from key information in the first half of the sampling process leads to the generation of many meaningless or irrelevant words from the beginning. Due to the presence of these meaningless or even erroneous sampling steps, the efficiency of the model's sampling process is low.
To address the aforementioned issues, we propose a new non-autoregressive text generation model called InfoDiffusion. We devise a new noise schedule based on the information entropy of the text, enabling the model to be aware of the amount of information carried by each word in a sentence. This guidance helps the model prioritize generating key information during the sampling process, thereby enhancing the quality and speed of sampling. Furthermore, we integrate self-conditioning to further improve the quality of the generated output and utilize the "partially noising and conditional denoising" technique to realize sequence-to-sequence tasks.
In summary, our contributions are as follows:
• We propose a new non-autoregressive text generation model called InfoDiffusion, which makes the model aware of the information entropy contained in the text so that it prioritizes generating key information during the sampling process.
• We combine self-conditioning and "partially noising and conditional denoising" to achieve high-quality sequence-to-sequence text generation.
• Experimental results demonstrate that InfoDiffusion, which follows a "keyinfo-first" generation order consistent with humans, achieves better generation quality and higher efficiency than baseline models across four text generation tasks.

Diffusion Models
Diffusion models are a class of latent variable models characterized by a forward and a reverse Markov process (Sohl-Dickstein et al., 2015; Ho et al., 2020). In this framework, given a sample from the data distribution $x_0 \sim q(x_0)$, the forward process generates a sequence of latent variables $x_1, \ldots, x_T$ by sampling from:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big), \tag{1}$$

where $\beta_t \in (0, 1)$ is a noise schedule controlling the step size of adding noise. Based on the reparameterization trick, an arbitrary intermediate latent variable $x_t$ can be sampled in a closed form:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) I\big), \tag{2}$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. Following a predefined noise schedule, $\beta_t$ increases ($\alpha_t$ decreases) as the timestep grows and eventually corrupts $x_0$ into random noise. If $\beta_t$ is small enough, the reverse process $q(x_{t-1} \mid x_t)$ is also Gaussian, and is learned by a parameterized model:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big), \tag{3}$$

where $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ can be implemented by a denoising network $f_\theta(x_t, t)$ such as a U-Net or a Transformer (Li et al., 2023). During inference, the reverse process begins by sampling noise from a Gaussian distribution $p(x_T) = \mathcal{N}(x_T; 0, I)$ and iteratively denoises it with $p_\theta(x_{t-1} \mid x_t)$ until obtaining $x_0$. The learning objective of diffusion models is derived from the variational lower bound of the negative log-likelihood of the input $x_0$:

$$\mathcal{L}_{\text{vlb}} = \mathbb{E}_q\Big[ D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \Big], \tag{4}$$

where $\mathbb{E}_q$ denotes the expectation over the joint distribution $q(x_{0:T})$. With the additional condition on $x_0$, the posterior of the forward process $q(x_{t-1} \mid x_t, x_0)$ becomes tractable via Bayes' theorem, and the simplified objective $\mathcal{L}_{\text{simple}}$ can be expressed as:

$$\mathcal{L}_{\text{simple}} = \sum_{t=1}^{T} \mathbb{E}_q \big\| \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0) \big\|^2, \tag{5}$$

where $\hat{\mu}(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_t, x_0)$. Through different parameterization strategies, the prediction objective can also be the noise (Ho et al., 2020) or the original data $x_0$ (Li et al., 2022).
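To make the closed-form sampling in Equation 2 concrete, here is a minimal PyTorch sketch assuming a linear β schedule; the endpoint values and all names are illustrative rather than the paper's exact configuration.

```python
import torch

T = 2000
betas = torch.linspace(1e-4, 0.02, T)         # beta_t: assumed linear noise schedule
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) in closed form (Equation 2)."""
    noise = torch.randn_like(x0)
    return alpha_bars[t].sqrt() * x0 + (1.0 - alpha_bars[t]).sqrt() * noise

x0 = torch.randn(16, 128)                      # a batch of word vectors
x_mid = q_sample(x0, t=1000)                   # partially corrupted latent
```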

Continuous Text Diffusion Model
To adapt diffusion models for discrete text data, a straightforward approach is to employ word embeddings, mapping discrete tokens into a continuous word vector space before going through the continuous diffusion process. The continuous text diffusion model (Li et al., 2023), also known as the embedding diffusion model, introduces an embedding step $q_\phi(x_0 \mid w) = \mathcal{N}(\mathrm{EMB}(w), \sigma_0 I)$ in the forward process, where $\mathrm{EMB}(w)$ represents an embedding function, either randomly initialized or obtained from a pre-trained model (such as BERT), that projects the discrete token $w$ into the continuous word vector space. For the backward process, the text diffusion model maps continuous word vectors back to their corresponding actual words through a word rounding module $p_\theta(w \mid x_0) = \prod_{i=1}^{n} p_\theta(w_i \mid x_i)$. The inference process starts from random noise $x_T$ and follows the typical continuous diffusion process described in Section 2.1, combined with word rounding to reconstruct the target words from the noise. To jointly learn the denoising network and the word embedding, the continuous text diffusion model extends the training objective in Equation 4 to a new end-to-end objective (Li et al., 2022):

$$\mathcal{L}_{\text{vlb}}^{\text{e2e}} = \mathbb{E}_{q_\phi(x_0 \mid w)}\big[ \mathcal{L}_{\text{vlb}} + \log q_\phi(x_0 \mid w) - \log p_\theta(w \mid x_0) \big], \tag{6}$$

which can be further simplified as:

$$\mathcal{L}_{\text{simple}}^{\text{e2e}} = \mathbb{E}_{q_\phi(x_0 \mid w)}\big[ \mathcal{L}_{\text{simple}} + \|\mathrm{EMB}(w) - \mu_\theta(x_1, 1)\|^2 - \log p_\theta(w \mid x_0) \big]. \tag{7}$$

InfoDiffusion

In this section, we introduce the detailed design of InfoDiffusion. The overall model architecture of InfoDiffusion is depicted in Figure 2. InfoDiffusion incorporates an Information Entropy Aware Noise Schedule, enabling the model to follow a "keyinfo-first" generation strategy, thereby achieving text generation that aligns with human-like processes. Additionally, InfoDiffusion combines self-conditioning and partially noising to achieve faster and superior text generation.

Information Entropy Aware Noise Schedule
In diffusion models, the noise schedule is a crucial component. The noise schedule directly determines how the original data is gradually perturbed during the forward process and how the model learns to recover the target data from the noise during the reverse process. Consequently, noise scheduling has a significant influence on the quality and diversity of generated samples. Commonly used noise schedules, such as the linear schedule (Ho et al., 2020) and the cosine schedule (Nichol and Dhariwal, 2021), have shown promising results in image generation tasks. However, these schedules assume that all tokens carry the same amount of information and do not consider the linguistic differences among the tokens in a sequence. This directly leads to a "shortcut" taken by the model: it tends to generate the tokens that appear most frequently (and are easiest) in the training corpus to achieve a higher likelihood. However, this generation order contradicts the behavioral pattern of humans, who tend to prioritize thinking about and generating the core parts of text, which carry higher information content. We refer to this human strategy of prioritizing the generation of high-information parts of text as "key-information-first", abbreviated as "keyinfo-first".
In order to address the differences in the generation process mentioned above and enable the model to generate text more like a human, we design a new noise schedule that is aware of the information entropy of the words in a sentence. That is, at the initial stage of the forward process, low-information words are perturbed, and high-information words are perturbed at the final stage, thus guiding the model to prioritize generating key information during the reverse process.
Specifically, we first linearly interpolate the mutual information between the latent variables $x_t$ and the original data $x_0$ to zero, i.e., $I(x_t; x_0) = (1 - \frac{t}{T})\, H(x_0)$, where $H$ denotes the entropy, which measures the amount of information of a random variable. In this case, the noise function in the classical diffusion model becomes $\beta_t = \frac{1}{T-t+1}$ (Equation 1) and $\bar{\alpha}_t = 1 - \frac{t}{T}$ (Equation 2) (Austin et al., 2021). Furthermore, to perturb words with lower information before words with higher information, we design noise weights based on the information entropy of each word in a sentence $w$:

$$\bar{\alpha}_t^i = 1 - \frac{t}{T} + \lambda(t)\, e(w_i), \qquad e(w_i) = \frac{H(w_i) - \bar{H}(w)}{\max_j H(w_j)},$$

where $e(w_i)$ represents the normalized value of the information entropy of the $i$-th word in sentence $w$ and $\lambda(t)$ controls the effect of the informativeness at time step $t$. To ensure that the latent variable $x_t$ retains all information at the beginning of the process ($t = 0$) and zero information at the end ($t = T$), the noise schedule $\lambda(t)$ is designed to be sinusoidal, satisfying $\lambda(0) = \lambda(T) = 0$, following (He et al., 2022). Here $\bar{H}(w)$ represents the mean entropy of sentence $w$ and $\max_j H(w_j)$ represents the maximum entropy in sentence $w$.
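A short sketch of this per-token schedule follows, under the reconstruction above (a sinusoidal $\lambda(t)$ with $\lambda(0) = \lambda(T) = 0$); the token entropy values and the scale lam0 are assumed for illustration only.

```python
import math

def e_weights(entropies):
    """e(w_i) = (H(w_i) - mean H) / max H: relative informativeness per token."""
    mean_h = sum(entropies) / len(entropies)
    max_h = max(entropies)
    return [(h - mean_h) / max_h for h in entropies]

def alpha_bar(t, T, e_i, lam0=0.1):
    """Per-token retention \bar{alpha}_t^i, clipped to [0, 1] for safety."""
    lam_t = lam0 * math.sin(math.pi * t / T)   # lambda(0) = lambda(T) = 0
    return min(1.0, max(0.0, 1.0 - t / T + lam_t * e_i))

# "the bus passes by the school": "school" gets the highest (assumed) entropy,
# so it retains signal longer, i.e. it is perturbed at a later forward step.
entropies = [0.5, 1.2, 1.0, 0.6, 0.5, 3.0]
for h, e_i in zip(entropies, e_weights(entropies)):
    print(h, [round(alpha_bar(t, 2000, e_i), 3) for t in (500, 1000, 1500)])
```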
Figure 3 shows how $\bar{\alpha}_t$ progresses during the forward process. For example, consider the sentence "The bus passes by the school": the word "school" carries higher information content, so we encourage masking it at a later stage, allowing our model to learn to restore it at an earlier point of the reverse process.
It is worth noting that such a noise schedule does not alter the training objective, since it does not modify the conditional probability function $q(x_{t-1} \mid x_t, x_0)$ in Equation 4.

Self-Conditioning
In the sampling process of classical diffusion models, at each time step $t$, the model generates the current prediction $\tilde{x}_0^t(x_t, t, \theta)$ through a denoising network $f_\theta(x_t, t)$. However, this denoising network relies only on the updated $x_t$ from the previous time step and discards the estimate $\tilde{x}_0^{t+1}$, which means there is no connection between the predicted results of adjacent sampling steps.
In the text generation process of the text diffusion model, this implies that the semantic information between the generated results of adjacent time steps is inconsistent and incoherent, inevitably leading to subpar text quality and low inference efficiency.
To address the issue of semantic incoherence mentioned above, inspired by (Chen et al., 2022), we introduce self-conditioning. As shown in Figure 4, this technique uses a denoising function $\tilde{x}_0^t(x_t, \tilde{x}_0^{t+1}, t, \theta)$ that takes the previously estimated sample as an auxiliary input. Self-conditioning refines the denoising function based on previous estimations instead of starting from scratch with a new estimation. By doing so, direct connections and dependencies are established between the generation results of adjacent time steps, achieving semantic consistency and coherence. For more efficient model training, we adopt the same training strategy as Analog Bits (Chen et al., 2022).
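The following is a minimal sketch of self-conditioned sampling under the continuous formulation of Section 2.1; the denoiser f and the linear β schedule are assumed placeholders rather than the model's actual architecture.

```python
import torch

T = 2000
betas = torch.linspace(1e-4, 0.02, T + 1)      # indices 1..T used below
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def posterior_sample(x0_hat, x_t, t):
    """Sample x_{t-1} ~ q(x_{t-1} | x_t, x0_hat) with the standard posterior."""
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    mean = (ab_prev.sqrt() * betas[t] / (1 - ab_t)) * x0_hat \
         + (alphas[t].sqrt() * (1 - ab_prev) / (1 - ab_t)) * x_t
    if t == 1:
        return mean
    var = betas[t] * (1 - ab_prev) / (1 - ab_t)
    return mean + var.sqrt() * torch.randn_like(x_t)

@torch.no_grad()
def sample_self_conditioned(f, x_T):
    x_t, x0_prev = x_T, torch.zeros_like(x_T)  # no estimate at the first step
    for t in range(T, 0, -1):
        x0_hat = f(x_t, x0_prev, t)            # denoiser sees the prior estimate
        x_t = posterior_sample(x0_hat, x_t, t)
        x0_prev = x0_hat                       # the self-conditioning link
    return x_t

# e.g. with a toy denoiser that mixes the state and the previous estimate:
out = sample_self_conditioned(lambda x_t, x0_prev, t: 0.5 * x_t + 0.5 * x0_prev,
                              torch.randn(4, 128))
```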

Partially Noising and Conditional Denoising
In the classical sequence-to-sequence task, given a source text $s = \{w_1^s, w_2^s, \ldots, w_n^s\}$ with $n$ tokens, the goal is to generate a target text sequence $y = \{w_1^y, w_2^y, \ldots, w_m^y\}$. A sequence generation model can achieve this by modeling the conditional probability $p(y \mid s)$. To accomplish sequence-to-sequence text generation tasks, we employ Partially Noising and Conditional Denoising (Gong et al., 2022). This technique adds noise only to the target text $y$ during the forward process and applies denoising solely to $y$ during the denoising process.
Specifically, given a pair of texts, the source text $w^s$ and the target text $w^y$, we first perform word embedding and concatenate the pair as $\mathrm{EMB}(w^{s \oplus y})$. Then, we obtain the initial state $x_0$ of the forward process through $q_\phi(x_0 \mid w^{s \oplus y}) = \mathcal{N}(\mathrm{EMB}(w^{s \oplus y}), \beta_0 I)$. To simplify the notation, we use $s_t$ and $y_t$ to represent the parts of $x_t$ belonging to $w^s$ and $w^y$ at diffusion time step $t$, following (Gong et al., 2022). In the forward process, we only add noise to $y_t$ while keeping $s_t$ unchanged. In the reverse denoising process, $s_t$ is still kept unchanged and treated as the denoising condition, controlling and guiding the model to generate the desired text $y_t$ from the noise. The training objective at this point can be simplified as (Gong et al., 2022):

$$\mathcal{L}_{\text{simple}} = \sum_{t=1}^{T} \mathbb{E}_q \big\| \tilde{f}_\theta(x_t, t) - y_0 \big\|^2,$$

where $f_\theta$ is the denoising network and $\tilde{f}_\theta(x_t, t)$ denotes the part of its output corresponding to $y_t$.
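As a small sketch of the partial noising step, assuming the closed-form sampling of Equation 2 and an illustrative mask convention (1 for target positions, 0 for source positions):

```python
import torch

def partial_q_sample(x0: torch.Tensor, target_mask: torch.Tensor,
                     alpha_bars: torch.Tensor, t: int) -> torch.Tensor:
    """Noise only y_t (mask == 1); keep s_t equal to the clean source part."""
    noise = torch.randn_like(x0)
    x_t = alpha_bars[t].sqrt() * x0 + (1 - alpha_bars[t]).sqrt() * noise
    mask = target_mask.unsqueeze(-1).float()   # (seq, 1), broadcasts over dim d
    return mask * x_t + (1 - mask) * x0

alpha_bars = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 2000), dim=0)
x0 = torch.randn(10, 128)                      # concatenated EMB(w^{s + y})
target_mask = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # last 6 = target
x_t = partial_q_sample(x0, target_mask, alpha_bars, t=1500)
```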

Tasks and Datasets
Following (Gong et al., 2022), we conduct experiments on four typical and popular tasks: open-domain dialogue, question generation, text simplification, and paraphrase. Open-domain dialogue requires models to generate informative and meaningful responses given a dialogue context. We employ the widely used Commonsense Conversation Dataset (Zhou et al., 2018), with over 3 million conversational pairs covering a wide range of everyday topics. Question generation aims to generate questions that can be answered by the given contents. We utilize the Quasar-T dataset (Dhingra et al., 2017), processed by (Gong et al., 2022), containing 119K training samples of document-question pairs. Text simplification aims to modify complex text into simplified sequences by simplifying grammar and word choice. We use the corpus constructed by (Jiang et al., 2020), consisting of 666K complex-simple sentence pairs. Paraphrase involves rewriting a sentence with the same semantic meaning but a different surface form. We adopt the widely used Quora Question Pairs (QQP), sourced from the community question-and-answer platform Quora, which consists of 147K positive pairs.
Baselines

We compare InfoDiffusion against three groups of baselines:
• Pre-trained language models. We choose GPT-2 (Radford et al., 2019) and GP-VAE (Du et al., 2022). GPT-2 is trained with language modeling, and GP-VAE augments T5 (Raffel et al., 2020) with a VAE.
• Non-autoregressive model. We consider LevT (Gu et al., 2019), a widely used, strong iterative NAR model. It adopts insertion and deletion operations to generate and refine sequences iteratively.
• Text diffusion model. We choose DiffuSeq (Gong et al., 2022). It is a recent text diffusion model, and the performance of other text diffusion models is similar to it. We implement these models following their original papers.

Evaluation Metrics
When evaluating the generated sequences, both quality and diversity play vital roles. To assess quality, we employ BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) as standard metrics, measuring the overlapping n-grams between the generated and gold texts. However, since string matching alone may not suffice for open-ended generation, we also utilize BERTScore (Zhang et al., 2020) to evaluate semantic similarity at the embedding level. Greater scores in BLEU, ROUGE, and BERTScore indicate superior performance in text generation. In terms of diversity, we evaluate distinct n-grams using Distinct (Li et al., 2016) and the ratio of distinct n-grams to total words using Diverse (Deshpande et al., 2019). Furthermore, we incorporate self-BLEU (Zhu et al., 2018), a sentence-level metric that assesses overlapping n-grams among generated texts. A lower self-BLEU score and a higher diverse-4 value indicate a greater degree of diversity in the generated outputs. Following (Gong et al., 2022), we generate three samples per text condition to calculate the diversity metrics for each method.
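As an illustration, Distinct-n can be computed as the ratio of unique n-grams to total n-grams over the generated outputs; the following sketch is a minimal implementation, not the exact evaluation script used here.

```python
def distinct_n(texts, n=1):
    """Distinct-n: unique n-grams divided by total n-grams across all outputs."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / max(total, 1)

print(distinct_n(["the cat sat", "the dog ran"], n=1))  # 5 unique / 6 total ≈ 0.83
```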

Implementation Details
InfoDiffusion is built upon a 12-layer Transformer with 12 attention heads and has approximately 91M parameters. The maximum sequence length is set to 128, with an embedding dimension of $d = 128$. We perform $T = 2{,}000$ diffusion steps. To address out-of-vocabulary generation, we utilize Byte Pair Encoding (Sennrich et al., 2016) to construct the vocabulary. The accuracy metrics of InfoDiffusion are evaluated using MBR (Minimum Bayes Risk) decoding with a candidate sample size of $|S| = 10$. The experiments are deployed on NVIDIA RTX 3090 Tensor Core GPUs; we use 4 GPUs for training and a single GPU for sampling.
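A rough sketch of MBR selection over the |S| candidates: pick the sample with the highest expected utility against the other candidates. The unigram Jaccard utility below is an assumed placeholder for the actual risk function (e.g., negative BLEU).

```python
def overlap(a: str, b: str) -> float:
    """A simple Jaccard utility over unigrams (placeholder, e.g. for BLEU)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def mbr_select(candidates):
    """Return the candidate with the highest average similarity to the others."""
    def expected_utility(i):
        others = [o for j, o in enumerate(candidates) if j != i]
        return sum(overlap(candidates[i], o) for o in others) / max(len(others), 1)
    return candidates[max(range(len(candidates)), key=expected_utility)]

print(mbr_select(["i want to be a geologist",
                  "i want to become a geologist",
                  "how can i become a geologist"]))  # picks the consensus sample
```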
Results and Analysis

Text Generation Evaluation
As shown in Table 1, InfoDiffusion achieves comparable or even higher generation quality than strong baselines. First, compared to encoder-decoder autoregressive models and non-autoregressive models, InfoDiffusion exhibits a clear advantage in terms of quality and diversity. For instance, on question generation, the quality metric BLEU improves more than threefold, while Distinct increases by +0.12. The improvement in diversity metrics is equally significant: for example, diverse-4 increases from 0.64 to 0.98, an improvement of over 50%.
Second, compared to pre-trained models like GPT-2, InfoDiffusion outperforms the base variant and performs comparably to the large variant, which has 8 times more parameters than InfoDiffusion. In terms of diversity, InfoDiffusion leads in seven out of the twelve comparative scenarios, indicating a slight advantage over pre-trained models in generating diverse texts.
Last, compared to the well-performing diffusion model DiffuSeq, InfoDiffusion demonstrates superior text generation quality across all datasets: all quality metrics show an improvement of +0.01 to +0.03. On the other hand, although its self-BLEU score lags behind DiffuSeq on text simplification, there is a slight improvement in text diversity across the remaining datasets.

Inference Efficiency Comparison
One of the major concerns with diffusion models is inference efficiency. We compare InfoDiffusion with DiffuSeq in terms of inference efficiency. We conduct experiments on text simplification, setting the inference batch size to 50 and the number of diffusion time steps to 2000 for both models. The quality (i.e., BLEU) and diversity (i.e., div-4) curves during the generation process are shown in Figure 5. The quality and diversity of the text generated by DiffuSeq gradually improve in the later stages of sampling (the decreasing trend in the diversity metric arises because the sampling process gradually generates the target text from noise, and noise has a high level of diversity). InfoDiffusion exhibits the opposite behavior, generating high-quality text in the early and middle stages of sampling. Approximately halfway through the sampling process, the quality of the text generated by InfoDiffusion surpasses the final results of DiffuSeq. This indicates that InfoDiffusion converges to the target sentence more quickly and can shorten the sampling time by half compared to DiffuSeq while maintaining almost the same generation performance.

Ablation Analysis
To demonstrate the effectiveness of the techniques proposed in InfoDiffusion, we conduct ablation studies on the QQP dataset. As shown in Table 2, when we remove self-conditioning, the BLEU score decreases by 0.0126, while Dist-1 remains almost the same. Furthermore, when we additionally replace the proposed noise schedule with the sqrt schedule proposed in Diffusion-LM (Li et al., 2022), the BLEU score drops by a further 0.0051 and Dist-1 drops by 0.0018. This indicates that the proposed noise schedule and self-conditioning contribute to improving the quality of generated text, while the impact of self-conditioning on the diversity of generated text is minimal.

Case Study
We select an illustrative case and investigate the generation process of InfoDiffusion; more cases are provided in Appendix C. As shown in Table 3, the generation process reveals that InfoDiffusion follows the "keyinfo-first" generation order: it prioritizes generating words with higher information content, such as "i" and "geologist", and then sequentially generates words with lower information content, such as "can", "how", "become", and "good" to complete the sentence.
To illustrate more clearly the model's preference for generating key information, we select four categories of words that generally carry key information or higher information content: nouns, verbs, adverbs, and adjectives (Clark and Weir, 2002; Emelianenko et al., 2019). We compare the decoding order of these words in InfoDiffusion and DiffuSeq. As shown in Figure 6, it is evident that InfoDiffusion decodes these high-information words much earlier than DiffuSeq.

Conclusion
In this paper, we propose InfoDiffusion, a novel non-autoregressive text diffusion model. By designing an Information Entropy Aware Noise Schedule, we enable the diffusion model to follow a "keyinfo-first" text generation process that is more aligned with human text generation, thereby achieving improved effectiveness and efficiency in text generation. Experimental results on four benchmark datasets confirm the effectiveness of InfoDiffusion. This study is the first to investigate the decoding order of text diffusion models and the first attempt to alter it. Future work could explore using the proposed noise schedule to replace the existing noise schedules in related diffusion-based tasks, in order to further enhance model performance.

Limitations
Despite the strong performance of InfoDiffusion, it still has the following limitations. First, due to the strong preference of language for simple words, simple words may still appear early in the decoding process. Second, our evaluation relies solely on automatic metrics like BLEU, without assessing potential issues like hallucinations in the generated text. Future work could utilize both automatic metrics and human evaluation to comprehensively assess text quality across dimensions including grammar, semantics, and more. This multifaceted approach will facilitate the generation of truthful, logical, and reliable text.

A.1 Text Diffusion Models
Adapting diffusion models to non-autoregressive (NAR) text generation poses a challenge due to the discrete nature of text. Discrete tokens cannot be directly corrupted by continuous noise, necessitating the redesign of typical diffusion models for text data. In this section, we focus on diffusion models customized for text, which either perform diffusion in discrete space or incorporate an additional step of mapping discrete tokens to a latent space of token embeddings before applying continuous diffusion.
Discrete Text Diffusion Models. These text diffusion models extend diffusion models to discrete state spaces by corrupting and refining sentences at the token level. D3PM (Austin et al., 2021) employs Markov transition matrices instead of Gaussian noise to diffuse real-world distributions. DiffusER (Reid et al., 2022) generates a sequence of edits that effectively transforms random noise into output. DiffusionBERT (He et al., 2022) combines diffusion models with pre-trained language models to enhance their performance. Diffusion-NAT (Zhou et al., 2023) proposes an iterative self-prompting strategy for the denoising process.
Continuous Text Diffusion Models. Continuous text diffusion models introduce an additional step in which discrete tokens are mapped to the latent space of token embeddings, followed by the adoption of continuous diffusion. Diffusion-LM (Li et al., 2022) is the first to propose constructing diffusion models on a continuous word embedding space. DiffuSeq (Gong et al., 2022) focuses on sequence-to-sequence generation using encoder-only Transformers and utilizes partial noising to define the diffusion process and learn the denoising function. SeqDiffuSeq (Yuan et al., 2022) proposes an encoder-decoder diffusion model architecture for conditional generation and uses an adaptive noise schedule technique to improve generation quality. DiNoiSer (Ye et al., 2023) proposes an adaptive method to determine the range of noise scales sampled for counter-discreteness training, allowing the model to leverage amplified noise scales from the source conditions during inference. Masked Diffuse LM (Chen et al., 2023) follows easy-first generation and designs a soft masking strategy based on tf-idf. Additionally, DiffuSum (Zhang et al., 2023) applies diffusion models to enhance extractive summarization.

These results show that the model can follow the "keyinfo-first" generation order while maintaining high quality, showing the general applicability of our approach.
Meanwhile, we conduct experiments on three additional datasets: PersonaChat (Zhang et al., 2018), XSUM (Narayan et al., 2018), and SQuAD (Rajpurkar et al., 2016). PersonaChat is a dataset for dialogue generation, with the goal of predicting responses according to the dialogue history. XSUM is a dataset for summarization, with the goal of summarizing a document into a single sentence. SQuAD is a dataset for question generation, with the goal of generating questions based on given passages and answers. The results in Table 10 demonstrate that InfoDiffusion still achieves strong performance across these datasets.

Figure 1: Inference process of three text diffusion models, illustrating the "easy-first" generation order. Each row represents one inference step.

Figure 2: Overview of the proposed text diffusion model InfoDiffusion. Grey represents undecoded words, red underline indicates words decoded at the current time step, and black represents words decoded in previous time steps.

Figure 3: The noise schedule for each token in a sequence is determined based on its information entropy.

Figure 4: An illustration of reverse diffusion sampling steps with self-conditioning, sampling directly based on previously generated samples.

Figure 5: The curves of the BLEU/div-4 scores along the generation process.

Figure 6: Comparison of the distributions of different word types' relative generation order. The x-axis represents the diffusion step t, while the y-axis represents the number of words of a certain type that are first decoded at that diffusion step.

Table 1: Evaluation results on four conditional text generation tasks. The best results are denoted in bold fonts, and the best results without pre-trained language models are denoted by underlined fonts.

Table 3: A sampling case from the QQP dataset. We truncate the selected samples to the first 10 tokens and mark the generation process of each word with different colors.