DiffusionSL: Sequence Labeling via Tag Diffusion Process


Most current methods tackle SL in a discriminative manner (Akhundov et al., 2018): they employ a language model (Hochreiter and Schmidhuber, 1997; Devlin et al., 2019) to encode the sentence and add a token-level classifier to capture the conditional tag distribution (Zhang and Yang, 2018; Yan et al., 2019; Li et al., 2020; Cui et al., 2021; Wu et al., 2022).
Recently, generative pre-trained language models have demonstrated their versatility in various NLP tasks (Raffel et al., 2020). Consequently, researchers have attempted to formulate SL as a sequence generation problem (Athiwaratkun et al., 2020; Paolini et al., 2021; Yan et al., 2021; Lu et al., 2022). This paradigm offers the advantage of flexible generation formats (Raman et al., 2022), allowing models to be trained with teacher forcing to produce output in any desired format. In these generative approaches, the input sentence is encoded first, and the label sequence is then decoded in an autoregressive manner, as depicted in Figure 1(a). Nevertheless, the intrinsic token-by-token generation style of the autoregressive model results in inefficient inference, and the mismatch between the conditional signal at training time and at inference time leads to the exposure bias problem (Arora et al., 2022).
To this end, this paper resorts to the Diffusion Probabilistic Model (DPM) (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol and Dhariwal, 2021; Song and Ermon, 2019; Song et al., 2021b), renowned as one of the most powerful generative models in the field of Artificial Intelligence Generated Content (AIGC) (Cao et al., 2023; Wu et al., 2023a), especially in the realm of image synthesis (Liu et al.; Rombach et al., 2022; Hong et al., 2022). DPM gradually introduces noise into clean data through a predefined forward process and subsequently employs a denoising network in a reverse process to recover the original data from pure Gaussian noise. The generation process operates in parallel, so compared with an autoregressive decoder it naturally avoids the exposure bias problem.
Furthermore, the diffusion model generates the targets progressively, refining the contents step by step and thereby allowing some mistakes to be corrected during the generation process, whereas a conventional autoregressive generative model emits each target token once and cannot revise it. These advantages have contributed to the success of the diffusion model in generating data across various modalities (Zhang et al., 2023a; Trippe et al., 2023; Li et al., 2022c; Li et al., 2022b). Intuitively, we expect that DPM could also deliver superior performance on SL tasks. Therefore, this paper proposes a novel framework called DiffusionSL. The input sentence serves as the conditional signal, while the corresponding task-related tag sequence is the generation target. We call the entire diffusion process the Tag Diffusion Process. During the training phase, a fixed forward process sequentially perturbs the tags, and a reverse diffusion process is learned to eliminate the noise based on the noisy tags and the encoded sentence. In the inference phase, we sample noisy tags from a standard Gaussian distribution and employ the well-trained denoising model to refine these tags into clean ones in an incremental manner (shown in Figure 1(b)).
However, directly applying a DPM to an SL task is challenging. Classical DPMs model image data and Gaussian noise in a continuous space, whereas tags in SL are discrete. A few research efforts have addressed this issue for other discrete data (e.g., text symbols (Gong et al., 2023) and segmentation maps (Chen et al., 2023b, 2022)). Following them, this paper introduces a lightweight and flexible module named Bit-Tag Converter (BTConverter) that represents the tags as binary bits, which can be viewed as real numbers in continuous space. Additionally, we propose the Bit Diffusion Transformer (BitDiT) to model the noise-removal reverse process, which enjoys faster convergence and stronger performance.
Concurrently, Shen et al. (2023) also introduce a diffusion model into NER, an important SL task. However, their approach only models the diffusion process on entity span boundaries, restricting it to NER and impeding its extension to other SL tasks.
In summary, our contributions are as follows:
• This study proposes a novel framework DiffusionSL that addresses Sequence Labeling (SL) tasks using a non-autoregressive stepwise generative approach. To the best of our knowledge, we are the first to apply the diffusion model to SL in general rather than only NER.
• We introduce two key modules, BTConverter and BitDiT, which convert the discrete data into continuous bits and model the iterative denoising process, respectively. The first module effectively handles the discreteness issue while the second one accelerates convergence.
• We apply DiffusionSL to NER, CWS, and POS tagging. Thorough experiments on several datasets for these typical tasks indicate that DiffusionSL achieves better results than previous SOTAs, highlighting the superiority of the proposed Tag Diffusion Process.

Preliminary
The diffusion model is characterized as a latent generative model. It encompasses a forward process that gradually introduces noise into clean data $x_0$, and a reverse process that incrementally eliminates the noise, thereby generating data following the original distribution. The forward process $q(x_t \mid x_{t-1})$ can be mathematically expressed as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)$$

Here, $\mathcal{N}$ represents the Gaussian distribution and $\beta_t$ represents the noise introduced at timestep $t$.
The magnitude of the noise is governed by a noise schedule $\beta = \{\beta_t\}_{t=1}^{T}$. The noise intensity usually follows either a linear (Ho et al., 2020) or a cosine (Nichol and Dhariwal, 2021) function; in this study, we adopt the linear noise schedule. By employing the notations $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, we can deduce an equation to directly incorporate noise into $x_0$, yielding the noisy data at any given timestep:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$$

Additionally, the reverse process $q(x_{t-1} \mid x_t)$ allows us to reconstruct samples from pure Gaussian noise, but it is not tractable directly. We can use a neural network $p_\theta$ to estimate it:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

The conditional distribution of $x_{t-1}$ given $x_t$ in the reverse process can be obtained analytically when $x_0$ is available:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

Therefore, we train $\mu_\theta$ to predict $\tilde{\mu}_t$. Since $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, we can obtain $\mu_\theta$ by predicting it directly or by predicting $\epsilon$ or $x_0$ indirectly. The variance $\Sigma_\theta$ can be learned or fixed; we keep it fixed in this work. The training objective is the variational lower bound of the negative log-likelihood $-\log p_\theta(x_0)$, which can be simplified to predicting the clean data $x_0$ or the injected noise $\epsilon$.
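To make the schedule and the closed-form noising above concrete, the following is a minimal PyTorch sketch; the function names, tensor shapes, and the specific beta range are illustrative assumptions rather than the paper's exact implementation.

    import torch

    def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> torch.Tensor:
        # Linearly spaced noise magnitudes beta_1 ... beta_T.
        return torch.linspace(beta_start, beta_end, T)

    T = 1000
    betas = linear_beta_schedule(T)
    alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
    alphas_bar = torch.cumprod(alphas, dim=0)  # bar{alpha}_t = prod_{s<=t} alpha_s

    def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        # x_t = sqrt(bar{alpha}_t) * x_0 + sqrt(1 - bar{alpha}_t) * eps
        a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise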
To expedite inference, it is possible to skip certain timesteps in the reverse process. Taking in the noisier sample at timestep $t_i$, we can generate the less noisy sample at timestep $t_{i-1}$:

$$x_{t_{i-1}} = \sqrt{\bar{\alpha}_{t_{i-1}}}\,\hat{x}_0 + \sqrt{1-\bar{\alpha}_{t_{i-1}} - \sigma_{t_i}^2}\cdot\frac{x_{t_i} - \sqrt{\bar{\alpha}_{t_i}}\,\hat{x}_0}{\sqrt{1-\bar{\alpha}_{t_i}}} + \sigma_{t_i}\,\epsilon$$

where $\hat{x}_0$ represents the original data predicted by a learned network and $\sigma_{t_i}$ is controlled by a hyper-parameter $\eta$. $\eta$ is usually set to 0, resulting in a deterministic generation process known as DDIM sampling (Song et al., 2021a).
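The following is a hedged sketch of a single skipped-step update of this form; ddim_step is an illustrative name, and alphas_bar reuses the schedule sketch above. Setting eta to 0 reproduces the deterministic DDIM behaviour described in the text.

    import torch

    def ddim_step(x_t, x0_pred, t, t_prev, alphas_bar, eta: float = 0.0):
        # Recover the implied noise from the current sample and the predicted x_0.
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        eps_pred = (x_t - a_t.sqrt() * x0_pred) / (1.0 - a_t).sqrt()
        # sigma controls stochasticity; eta = 0 gives deterministic DDIM sampling.
        sigma = eta * ((1 - a_prev) / (1 - a_t)).sqrt() * (1 - a_t / a_prev).sqrt()
        dir_xt = (1.0 - a_prev - sigma ** 2).sqrt() * eps_pred
        return a_prev.sqrt() * x0_pred + dir_xt + sigma * torch.randn_like(x_t)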

Methodology
The details of DiffusionSL are illustrated in Figure 2, and we provide a comprehensive explanation of them in this section. First, we formally describe the overall procedure of the Tag Diffusion Process employed by DiffusionSL. Second, we present a detailed description of the two components we propose to help model the Tag Diffusion Process: BTConverter and BitDiT. Finally, we summarize the training and inference algorithms of DiffusionSL.

Tag Diffusion Process
The Tag Diffusion Process, shown in Figure 2(a), describes a conditional generation process where the conditional signal is the input sentence $\mathcal{W} = (w_1, \ldots, w_n)$ and the target is the task-specific tag sequence $\mathcal{L} = (\ell_1, \ldots, \ell_n)$. The set of all possible tags is denoted as $\mathcal{S}$, e.g., {B, M, E, S} for Chinese Word Segmentation. Initially, we construct indices for the tags in $\mathcal{S}$ and transform them into bits in continuous space using BTConverter. The resulting transformed clean tag data is referred to as $x_0$. Noise is incrementally injected into $x_0$ during the forward diffusion process, while a denoising network is learned to gradually eliminate the noise during the reverse diffusion process, ultimately generating the desired tag data. Specifically, we follow Equation (3) to perturb the data in the forward diffusion process, so the noisy tags $\{x_t\}_{t=1}^{T}$ at each noise level can be obtained as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. During the reverse diffusion process, we sample an incomplete timestep sequence of length $\lambda$ to generate the target tags. We train a denoising network, denoted as $\mathrm{bd}_{\theta,\phi}$ (bd is the abbreviation for BitDiT), which takes in the noisier tags $x_{t_i}$ at timestep $t_i$ along with the input sentence $\mathcal{W}$ and outputs the predicted $\hat{x}_0$; further details on this network are provided in Section 3.3. Subsequently, we can obtain the less noisy tags $x_{t_{i-1}}$ following Equations (9), (10), and (11). By iteratively performing $\lambda$ denoising steps, the desired target tags are acquired.
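As an illustration of how a training example might be perturbed under the Tag Diffusion Process, the sketch below converts gold tag indices to scaled bits and applies the closed-form forward noising. The helpers tag2bit (sketched in the Bit-Tag Converter subsection below), q_sample (from the Preliminary sketch), and the default values are our assumptions, not the exact implementation.

    import torch

    def make_training_example(tag_ids: torch.Tensor, n_tags: int, T: int = 1000, snr: float = 0.1):
        # tag_ids: [bsz, len] integer tag indices.
        bits = tag2bit(tag_ids, n_tags)                  # [bsz, len, n_bits] in {0, 1}
        x0 = (2.0 * bits - 1.0) * snr                    # shift and scale to [-b, b]
        t = torch.randint(0, T, (tag_ids.size(0),))      # one random noise level per sentence
        x_t = q_sample(x0, t, torch.randn_like(x0))      # closed-form forward perturbation
        return x_t, t, x0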

Bit-Tag Converter
The diffusion model cannot generate discrete data directly. Thus, we propose the Bit-Tag Converter (BTConverter) to transform the discrete sequence tags into bits during training (tag2bit) and convert the bits back into discrete tags during decoding (bit2tag). The functionality of the BTConverter is illustrated in Figure 2(b). We construct an index for every tag in $\mathcal{S}$. If the number of elements in the tag set is $|\mathcal{S}|$, we use $\lceil \log_2 |\mathcal{S}| \rceil$ bits to represent the tags. We assign the indices $0$ to $|\mathcal{S}|-1$ to the tags in $\mathcal{S}$ sequentially, and the bit representation of the $i$-th tag is its corresponding binary code (e.g., $(101)_2$ represents the 5-th tag), which is denoted as tag2bit($i$). After converting the tags into bits, every bit is treated as an independent real number. The bits are then shifted and scaled to the range $[-b, b]$, where $b$ is a hyper-parameter named the signal-to-noise ratio (SNR) (Xu et al., 2023; Ji, 2023). After inference with the reverse diffusion process, we use bit2tag(bits) to convert the bits back into discrete tags, which are then decoded based on the schema of the downstream task. Further details of the tag2bit and bit2tag functions are provided in Appendix A.
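Below is a minimal sketch of what the tag2bit and bit2tag functions could look like; the exact signatures, the thresholding rule, and the clamping of invalid codes are our assumptions for illustration (Appendix A gives the actual details).

    import math
    import torch

    def tag2bit(tag_ids: torch.Tensor, n_tags: int) -> torch.Tensor:
        # Encode integer tag indices as binary codes, one real number per bit.
        n_bits = max(1, math.ceil(math.log2(n_tags)))
        shifts = torch.arange(n_bits - 1, -1, -1, device=tag_ids.device)
        bits = (tag_ids.unsqueeze(-1) >> shifts) & 1     # [..., n_bits] in {0, 1}
        return bits.float()

    def bit2tag(bits: torch.Tensor, n_tags: int) -> torch.Tensor:
        # Generated bits live on the [-b, b] scale, so 0 is the decision boundary.
        hard_bits = (bits > 0).long()
        weights = 2 ** torch.arange(bits.size(-1) - 1, -1, -1, device=bits.device)
        tag_ids = (hard_bits * weights).sum(dim=-1)
        return tag_ids.clamp(max=n_tags - 1)             # guard against invalid codes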

Denoising Network: Bit Diffusion Transformer
To eliminate noise during the reverse diffusion process, we employ a denoising network based on the Transformer architecture (Vaswani et al., 2017). This network, named Bit Diffusion Transformer (BitDiT), takes the bit representation of the noisy tags $x_t$, the input sentence $\mathcal{W}$, and the timestep $t$ as inputs and generates the predicted clean tags $\hat{x}_0$. The architecture of BitDiT is illustrated in Figure 2(c). It consists of a BERT-style (Devlin et al., 2019) encoder that represents the sentence in hidden space and a decoder that generates the targets in parallel. The encoder and decoder are parameterized by $\theta$ and $\phi$ respectively. The overall functionality of the denoising network can be represented as:

$$\hat{x}_0 = \mathrm{bd}_{\theta,\phi}(x_t, t, \mathcal{W})$$

Algorithm 2 Inference for DiffusionSL
    # input_ids, attention_mask: [bsz, len]
    # encode conditional signal with PLM
    hid = encoder(input_ids, attention_mask)
    # begin with an isotropic Gaussian
    bits = Normal(0, 1)
    # denoise the data iteratively
    for t in range(T):
        bits_0_pred = bd(bits, t, hid)
        bits = ddim_sample(bits_0_pred)
    tags = bit2tag(bits)
    return tags

In more detail, we first obtain the encoding vectors of the input sentence and cache them for all inference timesteps:

$$\mathbf{H} = \mathrm{Encoder}_{\theta}(\mathcal{W})$$

Next, we use the decoder to generate the clean tags. The decoder comprises four components:

Condition Embedder: It encodes the timestep $t$ using a learned lookup table and fuses it with the cached $\mathbf{H}$ to obtain the conditional signal $c_t$, allowing the decoder to be aware of both the input and the time.

Pre Layer: This layer consists of a multi-layer perceptron (MLP) that projects the bit representations to a higher dimension for modeling more fine-grained interactions between tags.
Generator: This component consists of $N$ Transformer blocks of our own design. It generates the clean tags based on the conditional signal $c_t$.
Post Layer: This layer applies layer normalization to normalize the representations and then uses another MLP to project them back to the original dimension.
The generation process can be summarized as:

$$\hat{x}_0 = \mathrm{Post}\big(\mathrm{Generator}(\mathrm{Pre}(x_t),\, c_t),\, c_t\big)$$

where Pre and Post represent the Pre Layer and the Post Layer. Instead of using cross-attention to incorporate the conditional signal in the Transformer blocks of the Generator, we resort to adaptive layer normalization (Perez et al., 2018; Brock et al.; Li et al., 2022a; Peebles and Xie, 2022), which predicts element-wise affine parameters based on the condition $c_t$ to substitute for the naive layer normalization (Ba et al., 2016) used in the original Transformer (Vaswani et al., 2017). The layer normalization used in the Post Layer has been modified accordingly. We also learn an additional scale factor before each skip connection (He et al., 2016) of this Transformer to control the amount of newly learned information added. We refer the reader to Appendix B for the architecture diagram and calculation details of this Transformer block.

Training and Inference
For training, we use the Mean Squared Error (MSE) loss to regress the predicted bits:

$$\mathcal{L} = \big\| \mathrm{bd}_{\theta,\phi}(x_t, t, \mathcal{W}) - x_0 \big\|^2$$

For inference, we sample a sequence of tags in bit representation, with the same length as the input sentence, from a standard Gaussian distribution. Then, we conduct $\lambda$ denoising steps to obtain the desired clean bits. The discrete tags are acquired via the bit2tag function of the BTConverter. The pseudo-code of training and inference is provided in Algorithm 1 and Algorithm 2, respectively.
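For concreteness, the following sketch loosely mirrors one training step under the MSE objective; encoder, bit_dit, and make_training_example refer to the illustrative helpers sketched earlier and are not the paper's exact code.

    import torch.nn.functional as F

    def training_step(input_ids, attention_mask, tag_ids, n_tags, encoder, bit_dit):
        # Encode the conditional signal (the input sentence) once per step.
        hid = encoder(input_ids, attention_mask)
        # Noisy bits, sampled timesteps, and clean bits (helper sketched earlier).
        x_t, t, x0 = make_training_example(tag_ids, n_tags)
        x0_pred = bit_dit(x_t, t, hid)        # BitDiT predicts the clean bits
        return F.mse_loss(x0_pred, x0)        # regress the predicted bits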

Experimental Settings
We conduct experiments to evaluate the proposed DiffusionSL framework on seven different datasets. We choose MSRA (Levow, 2006), Resume (Zhang and Yang, 2018), and Conll03 (Tjong Kim Sang and De Meulder, 2003) for NER; MSRA, PKU (Emerson, 2005), and CTB6 (Xue et al., 2005) for CWS; and CTB5 (Xue et al., 2005) for POS tagging. The detailed statistics of these datasets are provided in Appendix C, and the detailed hyper-parameter settings in Appendix D.
We compare against previous models for NER, CWS, and POS tagging to test the effectiveness of DiffusionSL. The compared methods can be divided into discriminative and generative ones: the former capture the conditional tag distribution, while the latter generate the desired outputs at the decoder end. For NER, the baselines include the works of Huang et al. (2015a) and Zhang and Yang (2018), among others; the CWS (Yang et al., 2017; Ma et al., 2018; Yang et al., 2019; Tian et al., 2020) and POS tagging (Diao et al., 2020; Meng et al., 2019) baselines are all discriminative ones. We also compare against one of the most powerful Large Language Models (LLMs), gpt-3.5-turbo (OpenAI, 2023). More details of the baselines are deferred to Appendix E.

DiffusionSL achieves better performance on various datasets across different tasks, and the promising results validate its effectiveness over the previous baselines, including both discriminative and generative methods. The NER results in Table 1 show that DiffusionSL achieves 95.49, 96.40, and 92.70 F1 scores on MSRA, Resume, and Conll03, respectively. Meanwhile, for CWS, DiffusionSL achieves 98.33, 96.54, and 97.46 F1 scores on MSRA, PKU, and CTB6 according to Table 2. For POS tagging, our method achieves a 97.18 F1 score on CTB5, an improvement of 0.36, as also shown in Table 2. This strong performance proves that our non-autoregressive stepwise generative approach can generate task-specific tags effectively and thus tackle the corresponding problems efficiently. Compared to DiffusionNER, a concurrent work that applies the diffusion model to NER, we surpass it on all the datasets, illustrating the advantages of the Tag Diffusion Process over its entity span boundary diffusion process. In our view, its generation format (span boundaries) is unnatural and separates span localization from entity label classification to some extent, so the diffusion model cannot model the entity label information well. Besides, this generation format necessitates extra heuristic post-processing operations. As a result, it limits that framework to the NER task and results in sub-optimal performance. Our DiffusionSL can handle not only NER but also all other SL tasks, enjoying stronger performance and more flexibility at the same time.

We also compare against gpt-3.5-turbo (OpenAI, 2023), one of the most powerful LLMs in the OpenAI GPT series, on the NER datasets. We prompt the LLM with one demonstration example (one-shot) for the three NER datasets, and additionally experiment with more demonstrations on the Resume dataset. The experimental results in Table 3 show that performance does not increase indefinitely with the number of demonstrations, and that the LLM still cannot catch up with task-specific supervised small models. Thus, exploring specific algorithms for sequence labeling remains meaningful. The corresponding prompt details are shown in Appendix F.

Effects of BTConverter
We conduct experiments to compare BTConverter with the most intuitive and straightforward alternative, the embedding method, which creates a learnable embedding for each tag. Additionally, the decoding step of the embedding method requires calculating the distance from each generated tag embedding to each prototype, which is more complex than the bit2tag function in BTConverter, which operates only once per tag during decoding. These results demonstrate the lightweight and flexible nature of BTConverter.

Effects of BitDiT
BitDiT incorporates conditional signals through adaptive layer normalization, whereas the standard Transformer decoder block and DiffusionNER utilize an additional cross-attention layer to incorporate the conditional information. We refer to this method as Cross-Attn (Vaswani et al., 2017; Shen et al., 2023). Figure 3 shows the convergence of BitDiT and Cross-Attn on the Resume dataset. From the result, we find that Cross-Attn exhibits a slower convergence speed and inferior final performance compared to BitDiT. This finding underscores the significance of our proposed BitDiT architecture.

Effects of Denoising Steps
We validate the impact of the number of denoising steps on performance; the corresponding results are shown in Table 5. We find that performance improves as the number of steps increases until reaching a bottleneck. Owing to the iterative refinement nature of the diffusion model, we can increase the sampling steps to boost performance (Chen et al., 2022; Shen et al., 2023), but performance does not improve indefinitely with more steps. We therefore choose 10 as the default number of steps to balance performance and speed.

Case Study
We visually illustrate the details of the denoising process with one Chinese NER case. In Figure 4, we input the sentence 中共中央致中国致公党十一大的贺词 (The CPC Central Committee's congratulatory message to the 11th National Congress of the Chinese Zhigong Party) as the conditional signal and generate the corresponding tags in ten decoding steps by iterative refinement. We also show the label accuracy of the tags at the bottom for every step, which increases with the number of reverse steps. At the final step, we obtain the exactly correct NER tags of the input sentence. We can then decode the desired entities, (中共中央 (CPC Central Committee), NT) and (中国致公党十一大 (the 11th National Congress of the Chinese Zhigong Party), NT), from the semantics of these tags. To the best of our knowledge, we are the first to introduce this iterative denoising process and shift the traditional classification paradigm to a non-autoregressive generation paradigm for sequence labeling.
Related Work

Diffusion Model
The diffusion model originated in the image generation field (Sohl-Dickstein et al., 2015). The strong performance of diffusion-based generative models (Ho et al., 2020; Ramesh et al., 2021) significantly boosted the development of that field, and the diffusion model has since attracted attention from many other areas, e.g., speech synthesis (Zhang et al., 2023a), protein design (Trippe et al., 2023), time series forecasting (Li et al., 2022c), and natural language generation (Li et al., 2022b). For NLP, the discrete nature of language symbols poses several challenges; nonetheless, many techniques have been proposed to advance this field (Hoogeboom et al., 2021; Li et al., 2022b; Yuan et al., 2022; Lin et al., 2023; He et al., 2023; Zhou et al., 2023). We provide a more detailed survey of applying the diffusion model to natural language in Appendix G. Currently, the diffusion model dominates the field of generative AI, and many works seek to improve it from different perspectives, e.g., likelihood estimation (Kingma et al., 2021), efficient sampling (Song et al., 2021a), and condition-guided diffusion (Ho and Salimans, 2022).

Conclusion
In this paper, we cast the Sequence Labeling (SL) task as a conditional generation problem. Furthermore, we introduce a novel framework, DiffusionSL, to model SL with a non-autoregressive stepwise generation approach. Meanwhile, we propose BTConverter to handle the discreteness problem and BitDiT to model the progressive noise-removal process. Experimental results indicate that our model performs better on NER, CWS, and POS tagging than previous strong baselines, revealing the superiority of DiffusionSL. Our work is pioneering research on casting a Natural Language Understanding task as a non-autoregressive generation task. We hope DiffusionSL can be applied to more tasks that can be formulated as sequence labeling, e.g., semantic role labeling and text chunking, and leave this as future work.

Limitations
There are several disadvantages that are hard to avoid for DiffusionSL. First, more sampling steps result in a slower sampling speed; although we choose DDIM sampling to reduce the number of steps, it is still slower than discriminative models. Second, BitDiT introduces an extra randomly initialized decoder compared to BERT-style tagger models, which requires more memory and is harder to train, so we must search the hyper-parameters thoroughly to find an optimal result. Third, the initial noise is sampled from a Gaussian distribution, which brings randomness into the Tag Diffusion Process.

B Transformer Block of BitDiT Decoder
We show the architecture details of the Transformer block used in the BitDiT decoder in Figure 5. The noisy tag bits $x_t$ go through two sub-blocks. The first sub-block consists of an Adaptive Layer Normalization (ALN) layer, a Multi-Head Self-Attention (MHSA) layer, and a scaled skip connection. The second is basically the same except that the MHSA layer is replaced by a Feedforward Network (FFN) layer. The scale factor before each skip connection is learned based on the conditional embedding $c_t$. The calculation procedure can be summarized as follows:

$$h = x_t + \alpha_1 \cdot \mathrm{MHSA}\big(\mathrm{ALN}(x_t, c_t)\big)$$
$$h' = h + \alpha_2 \cdot \mathrm{FFN}\big(\mathrm{ALN}(h, c_t)\big)$$

where $\alpha_1$ and $\alpha_2$ are scale factors. The ALN learns the scale $\gamma$ and shift $\beta$ parameters based on $c_t$:

$$\mathrm{ALN}(h, c_t) = \gamma(c_t) \odot \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta(c_t)$$

where $\mu$ and $\sigma^2$ are the mean and variance of $h$, and $\epsilon$ is added to avoid division-by-zero errors and for numerical stability.
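A rough PyTorch sketch of one such block is given below; the modulation parameterization (a single linear layer predicting the two sets of affine parameters and the two skip scales from c_t) and all sizes are assumptions for illustration, not the exact implementation.

    import torch.nn as nn

    class AdaLNBlock(nn.Module):
        # One BitDiT decoder block: adaptive layer norm, self-attention, FFN,
        # and learned scale factors before each skip connection (illustrative sizes).
        def __init__(self, d_model: int, n_heads: int = 8):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model, elementwise_affine=False)
            self.norm2 = nn.LayerNorm(d_model, elementwise_affine=False)
            self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
            # Predict gamma/beta for both ALN layers plus the two skip scales from c_t.
            self.to_mod = nn.Linear(d_model, 6 * d_model)

        def forward(self, x, c_t):
            g1, b1, a1, g2, b2, a2 = self.to_mod(c_t).chunk(6, dim=-1)
            h = (1 + g1) * self.norm1(x) + b1       # ALN (the +1 keeps it near identity at init)
            h, _ = self.mhsa(h, h, h)
            x = x + a1 * h                          # scaled skip connection
            h = (1 + g2) * self.norm2(x) + b2       # ALN before the FFN
            return x + a2 * self.ffn(h)             # scaled skip connection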

D Hyper-parameters settings
For the conditional text encoder, we choose chinese-bert-wwm-ext for the Chinese datasets and bert-large-cased for the English dataset. We take AdamW (Loshchilov and Hutter, 2019) as the optimizer, with a peak learning rate of 1e-5, 1000 warm-up steps, and a linear decay schedule. The number of timesteps for training is set to 1000 and the DDIM sampling step is set to 10 for efficiency. The SNR scale is set to 0.1. We use the B-M-E-S schema instead of the B-I-O schema.
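A minimal sketch of the corresponding optimizer and scheduler setup is shown below, assuming the HuggingFace transformers scheduler; the model and the training-step budget are placeholders rather than the actual values.

    import torch.nn as nn
    from torch.optim import AdamW
    from transformers import get_linear_schedule_with_warmup

    model = nn.Linear(8, 8)        # placeholder for the full DiffusionSL model
    num_training_steps = 50_000    # placeholder training-step budget

    optimizer = AdamW(model.parameters(), lr=1e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps)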

E Baselines
E.1 NER
• BiLSTM-tagger (Huang et al., 2015b) uses a BiLSTM to encode the sentence and a CRF layer to decode the NER tags.
• TENER (Yan et al., 2019) designs a specific attention mechanism and improves the performance of the Transformer for NER.
• FLAT (Li et al., 2020) introduces the character-word lattice into the Transformer architecture to enhance the NER performance.
• NFLAT (Wu et al., 2022) proposes Inter-Former to apply lexical enhancement and removes the word-character and word-word attention interaction for memory efficiency.
• Athiwaratkun et al. (2020) use a generative model for joint sequence labeling and sequence classification. It can tackle many sequence labeling tasks at the same time and performs well in different scenarios.
• Paolini et al. (2021) propose augmented natural languages and cast many sequence labeling tasks as a translation between the original language sequence and the augmented one.
• UIE (Lu et al., 2022) utilizes the proposed structured extraction language to unify various information extraction tasks and uses unsupervised corpora for pretraining, thus resulting in better information extraction ability.
• DiffusionNER (Shen et al., 2023) models an entity as a labeled span (left boundary, right boundary, entity label) and casts NER as a boundary-denoising process, generating entities from noisy spans with a diffusion model. DiffusionNER sets the boundaries as the generation objective, which can only be used for NER. Conversely, DiffusionSL can be used for all sequence labeling tasks and thus enjoys more versatility.
• gpt-3.5-turbo (OpenAI, 2023) is one of the most powerful LLMs. We test its one-shot performance in generating sequence labeling targets; the corresponding prompts are in Table 6.

E.2 CWS
• Ma et al. (2018) use a simple unigram- and bigram-based Bi-LSTM to segment the words.
• WMSEG (Tian et al., 2020) introduces the memory mechanism to incorporate the wordhood information to enhance the segmentation model.
• Cui et al. (2021) enhance the BERT encoder by masking whole words.

E.3 POS tagging
For POS tagging, most works focus on adding more features to enhance the word representation ability of the text encoder.
• ZEN (Diao et al., 2020) incorporates the ngram information to enhance the BERT.
• Glyce (Meng et al., 2019) incorporates the visual information to enhance the BERT.
• Cui et al. (2021) enhance the BERT encoder by masking whole words.

F LLM experiments
We show the prompt used in the LLM experiments in Table 6.

G Detailed Related Work of Diffusion Model with Language
We provide a complete and detailed review of previous work on applying the diffusion model to language, so that interested readers can get familiar with this promising field. Most existing literature focuses on NLG instead of NLU.
• Multinomial-Diffusion (Hoogeboom et al., 2021) defines a multinomial diffusion process with categorical data to generate text or a segmentation map.
• D3PM (Austin et al., 2021) also defines the diffusion model in discrete space and the newly proposed discrete corruption process improves the performance.
• DiffusionLM (Li et al., 2022b) firstly proposes to embed the text into continuous space to circumvent the discreteness problem and round the final denoised vectors back into words.
• DiffuSEQ (Gong et al., 2023) first employs the diffusion model in the seq2seq text generation setting. It proposes to concatenate and embed the source and target text and to add noise only to the target text embedding while keeping the source text embedding clean.
• SSD-LM (Han et al., 2022) iteratively generates text blocks in a semi-autoregressive manner and conducts the diffusion process on the natural vocabulary space, which balances the advantages of autoregressive and non-autoregressive models and allows convenient guidance incorporation.
• SED (Strudel et al., 2022) casts the discrete language symbols into continuous embeddings and incorporates the self-condition trick into the backward denoising process.
• CDCD (Dieleman et al., 2022) uses the score-matching framework to solve several language modeling tasks in continuous time and input space.
• DiffusionBERT (He et al., 2023) uses the diffusion model to improve the masked language model, combining the advantages of the two denoising models. A newly proposed noise schedule and time embedding injection method are applied to it.
• Difformer (Gao et al., 2022) proposes an anchor loss function, a layer norm module over the embeddings, and a noise factor that controls the added noise, tackling problems including the collapse of the denoising objective, the imbalanced embedding norms among words, and the distraction resulting from adding standard noise.
• Lovelace et al. (2022) demonstrate that the latent space of a pretrained language model can be used to learn a diffusion model whose latent representations can be decoded into natural language.
• SeqDiffSeq (Yuan et al., 2022) uses a self-condition technique and a newly proposed adaptive noise schedule for a sequence-to-sequence diffusion language model based on the Transformer.
• Diff-Glat (Qian et al., 2022) proposes a modality diffusion process and residual glancing sampling to boost the performance of non-autoregressive parallel text generation.
• GENIE (Lin et al., 2023) pretrains a novel Transformer-based encoder-decoder diffusion language model on a large-scale corpus with a novel pretraining technique named continuous paragraph denoise.
• DiffusER (Reid et al., 2023) applies an edit-based generative model built on the diffusion denoising process to revise already existing text, making gradual text refinement possible.
• Dinoiser (Ye et al., 2023a) manipulates the noise in the training and inference stage to avoid the pitfall of discreteness of the embedding space and boost the impact of source sentences for conditional sequence learning.
• Masked-Diffusion LM (Chen et al., 2023a) uses a linguistic masking strategy to perturb the clean sentence, which enables an easy-first generation backward process.
• RenderDiffusion (Li et al., 2023) transforms the discrete language symbol generation problem into the glyph image generation problem, thus casting the less-studied text-to-text diffusion model as a well-studied text-to-image model.
• DiffusionSum (Zhang et al., 2023b) tackles extractive summarization by directly generating the desired summary sentence embeddings to match the corresponding natural sentences in the original document.
• Tang et al. (2023) propose two methods named Distance Penalty and Adaptive Decay Sampling to bridge the gap between training and inference (namely, the exposure bias problem in the traditional NLG setting) of the diffusion language model.
• Diffusion-NAT (Zhou et al., 2023) unifies BART and the discrete diffusion model and proposes a self-prompting technique for text-to-text generation.
• AR-Diffusion (Wu et al., 2023b) combines the autoregressive language model and the non-autoregressive diffusion language model, exploiting the left-to-right sequential nature of language to boost text generation performance and speed.
• DDLM (Balagansky and Gavrilov, 2023) reproduces the CDCD-based (Dieleman et al., 2022) LM, publicly releases the corresponding training code and checkpoint, and is the first to test its performance on downstream tasks, making it convenient for other researchers to explore this field.

Figure 1 :
Figure 1: (a) shows a traditional autoregressive generative method (Paolini et al., 2021) for one typical SL task, Named Entity Recognition (NER). This line of methods generates the target once and cannot refine it. (b) demonstrates our non-autoregressive stepwise generative framework DiffusionSL for NER. In the inference stage, the reverse diffusion process of DiffusionSL utilizes a well-trained denoising network to remove the noise step by step and finally recover the clean tags, allowing the contents (task-specific tags) to be refined in this process. Symbols marked with a yellow background are noisy, whereas symbols without a yellow background represent the desired generated ones.

Figure 2 :
Figure 2: The overall picture of our framework DiffusionSL. (a) depicts the Tag Diffusion Process, which eliminates the noise in the tag data in a progressive manner, transforming noisy tags sampled from a Gaussian distribution into clean tags. In the actual scenario, the tags in this diffusion process are in bit representation; we convert them into discrete tag representation for clarity in illustrating the denoising process. (b) demonstrates the Bit-Tag Converter (BTConverter), which enables flexible conversion between the discrete tags and the continuous bits. (c) presents the Bit Diffusion Transformer (BitDiT), the denoising neural network.

Figure 4 :
Figure 4: Visualization example for Chinese NER. 中共中央致中国致公党十一大的贺词 means The CPC Central Committee's congratulatory message to the 11th National Congress of the Chinese Zhigong Party.

Figure 5 :
Figure 5: Transformer block details of the BitDiT Decoder. MHSA represents Multi-Head Self-Attention and FFN represents Feedforward Network. β and γ are the learned affine parameters for adaptive layer normalization. α is the scale factor before each skip connection.

Table 1 :
Results of NER. * denotes the results of autoregressive generative methods, which exhibit inferior performance compared to the non-autoregressive ones. The results of DiffusionNER (Shen et al., 2023) on the Resume dataset were not reported in the original paper, so we reproduce and test it on this dataset along with the other two. △ represents the results reported in the original DiffusionNER paper, while † refers to our reproduced results. ‡ denotes the results obtained by prompting gpt-3.5-turbo with one demonstration example.

Table 2 :
Results of CWS and POS tagging. † denotes the results reproduced by us.

Table 3 :
gpt-3.5-turbo experiment results on the Resume dataset. Shot is the number of demonstration examples used in prompting. Due to the budget of the OpenAI API, we only test one dataset.

Table 4 :
Ablation study of tag representation on the Resume NER dataset and the CTB5 POS tagging dataset. #bits is the dimension of the bit representation.

Table 5 :
Ablation study of denoising sampling steps on Resume NER dataset and CTB5 POS tagging dataset.

Table 7 :
The statistics of the datasets. For the CWS task, whose dev sets are not provided, we randomly select 10% of the original training instances to serve as the dev set.

Table 6 :
For one-shot experiments, we only add one example following the task definition.