CODEFUSION: A Pre-trained Diffusion Model for Code Generation

Imagine a developer who can only change their last line of code: how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CODEFUSION, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CODEFUSION on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CODEFUSION (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M–175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy, due to its better balance of diversity and quality.


Introduction
Auto-regressive code generation models (Wang et al., 2021; Brown et al., 2020; Scholak et al., 2021; Feng et al., 2020; Fried et al., 2022) cannot easily reconsider tokens generated earlier in the decoding process. This limitation can lead to lower-diversity generations (Lin et al., 2023), as observed in the related domain of text. To balance the diversity and quality of generated candidates, prior work has explored decoding strategies such as grouped beam search (Vijayakumar et al., 2018) or nucleus sampling (Holtzman et al., 2019).
Diffusion models, which have shown remarkable performance in image generation (Dhariwal and Nichol, 2021), have recently been extended to generate diverse text (Li et al., 2022; Lin et al., 2023). These approaches use an embedding layer to convert discrete tokens into continuous embeddings, where Gaussian noise can be added and predicted to imitate the diffusion process. To map denoised embeddings back to discrete text, these approaches then select the vocabulary token with the closest embedding. In the code domain, where there are many syntactic and semantic constraints between tokens, independently projecting embeddings back to tokens can yield invalid programs.
We propose CODEFUSION, a natural language to code (NL-to-code) model that combines an encoder-decoder architecture (Raffel et al., 2020) with a diffusion process. The encoder maps the NL into a continuous representation, which the diffusion model uses as an additional condition for denoising a random Gaussian noise input. To generate syntactically correct code, we then feed the denoised embeddings to a transformer decoder, with full self-attention and cross-attention with the embedded utterance, to obtain probability distributions over code tokens. Finally, we select the token with the highest probability at each index.
To pre-train CODEFUSION for code generation, we extend the continuous paragraph denoising (CPD) task introduced in Lin et al. (2023) to the code domain. Specifically, we only apply noise to tokens that correspond to identifiers in code or to built-in keywords in the target language. This denoising task allows the model to learn relations between critical code tokens (like variable names, function names, and control-flow built-ins).
We find that CODEFUSION yields more diverse code (higher n-gram fraction, lower embedding similarity, and higher edit distance) than auto-regressive models (see Table 2). The CPD objective, which biases the model towards learning to remove noise in a context-aware fashion, paired with a decoder that has access to the full denoised representation, jointly leads CODEFUSION to produce 48.5% more syntactically correct generations (averaged over three languages) compared to GENIE, a text diffusion model (Table 3).

Methodology
Figure 1 shows CODEFUSION's architecture. This section describes each component and our training and inference procedures.

Architecture
The input to CODEFUSION is a natural language utterance s = {s_1, s_2, ..., s_k} and the output is a predicted code snippet ŷ = {ŷ_1, ŷ_2, ..., ŷ_s}. Both input and output are padded to a fixed dimension n. CODEFUSION has three main transformer-based components (an encoder E, a denoiser N, a decoder D) and a classification head H.
The transformer-based encoder (E) transforms the tokenized utterance s into a vector representation E_s = E(s). Conditioned on the encoded utterance E_s and the time t, the denoiser (N) predicts and removes noise ϵ_t from the noisy program embedding x_t to obtain a predicted denoised program embedding x̂_0 = N(x_t, t, E_s). N is a transformer block with cross-attention between x_t and E_s and full self-attention over x_t.
Before projecting the denoised embeddings back to discrete code tokens, we use a decoder (D), this time applying cross-attention to x̂_0 and E_s, with full self-attention over x̂_0, to compute a final hidden representation D_s = D(x̂_0, E_s). As opposed to prior text diffusion approaches, where tokens are generated independently, full self-attention allows each hidden dimension d_i to be generated with full information about the other dimensions.
Finally, D_s is projected to actual code tokens with a classification head H that computes a distribution over code tokens p(y|d_i). We do not perform a search over these tokens and simply select ŷ_i = argmax_y p(y|d_i) for each i.
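As an illustration of how these components fit together, the following is a minimal, hedged sketch of a single denoise-then-decode pass in PyTorch. The layer counts, module classes, and additive time conditioning shown here are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration; the paper's encoder has embedding dimension 512.
d_model, vocab_size, seq_len = 512, 32100, 128

encoder  = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 8, batch_first=True), num_layers=2)
denoiser = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=2)
decoder  = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, 8, batch_first=True), num_layers=2)
head     = nn.Linear(d_model, vocab_size)

def denoise_and_decode(utterance_emb, x_t, t_emb):
    """One denoising step followed by decoding and token projection (sketch)."""
    E_s    = encoder(utterance_emb)                    # encode the NL utterance: E_s = E(s)
    x0_hat = denoiser(tgt=x_t + t_emb, memory=E_s)     # self-attention over x_t, cross-attention to E_s
    D_s    = decoder(tgt=x0_hat, memory=E_s)           # full self-attention over x0_hat, cross-attention to E_s
    logits = head(D_s)                                 # p(y | d_i) for every position i
    return logits.argmax(dim=-1)                       # ŷ_i = argmax_y p(y | d_i); no search

# Example shapes: utterance_emb (1, k, 512), x_t (1, 128, 512), t_emb broadcastable to x_t.
```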

Training
We train CODEFUSION in two phases: unsupervised pre-training of the denoiser and decoder on code snippets, and supervised fine-tuning of the encoder, denoiser, and decoder on (utterance, code snippet) pairs. Following prior work on diffusion for text, we use a trainable embedding layer L to embed a code snippet y into a continuous space, where we can add (and remove) noise ϵ_t at timestep t.
We take inspiration from prior work on diffusion for text and adapt the loss from GENIE (Lin et al., 2023) to CODEFUSION by incorporating the hidden representation D_s from the decoder. At time step t, the loss is computed as

L_t = ‖ϵ̂_t − ϵ_t‖ + ‖D_s − L(y)‖ − log p(y | D_s)

and consists of three parts.
1. We minimize the error between the predicted noise ϵ̂_t and the actual noise ϵ_t to train N.
2. We minimize the error between D_s and the embedded code L(y) to train D and L.
3. We apply a standard cross-entropy loss between the outputs of the classification head, which produces predicted code tokens given D_s, and the ground-truth code snippet y.
The loss function allows us to train the three main components of our model (denoiser, decoder and classification head) with a diffusion objective.
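A hedged sketch of this three-part objective follows, assuming mean-squared error for the first two terms and an unweighted sum of the three parts (the exact norms and weights are assumptions):

```python
import torch.nn.functional as F

def codefusion_loss(eps_pred, eps_true, D_s, code_emb, logits, code_ids, pad_id=0):
    """Sketch of L_t: noise prediction + embedding reconstruction + token cross-entropy."""
    l_noise = F.mse_loss(eps_pred, eps_true)            # part 1: trains the denoiser N
    l_embed = F.mse_loss(D_s, code_emb)                 # part 2: trains the decoder D and embedding layer L
    l_token = F.cross_entropy(                          # part 3: trains the classification head H
        logits.transpose(1, 2), code_ids, ignore_index=pad_id)
    return l_noise + l_embed + l_token
```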
To pre-train the denoiser (N) and decoder (D) over a corpus of code snippets, we use two tasks: unsupervised code generation and our adaptation of continuous paragraph denoising (CPD) (Lin et al., 2023) to code. This code-specific CPD task only masks tokens associated with identifiers or built-in keywords from the target language. We randomly sample from these two tasks during pre-training.
Both pre-training and fine-tuning tasks use the loss L_t. Because there is no natural language utterance in pre-training, there is no input E_s to the denoiser N. In the unsupervised code generation task, E_s is replaced with Gaussian noise sampled at every denoising time step. In the CPD task, E_s is computed by passing the masked code y through the encoder E.
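To make the code-specific CPD task concrete, here is a hedged sketch for Python snippets that masks only identifier and keyword tokens. The mask rate, mask token, and use of Python's tokenize module are assumptions for illustration.

```python
import io
import random
import tokenize

def code_cpd_mask(code: str, mask_token: str = "<mask>", p: float = 0.15) -> str:
    """Mask only identifiers and built-in keywords, leaving operators and literals intact."""
    pieces = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        # In Python's tokenizer, both identifiers and keywords have type NAME.
        if tok.type == tokenize.NAME and random.random() < p:
            pieces.append(mask_token)
        else:
            pieces.append(tok.string)
    return " ".join(piece for piece in pieces if piece.strip())

# Example: code_cpd_mask("for x in range(10): print(x)") might yield
# "for <mask> in range ( 10 ) : print ( x )".
```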

Inference
During inference, we initialize x_t with Gaussian noise and iteratively remove a (scheduler-determined) proportion of the noise over T time steps to obtain x̂_0 (Ho et al., 2020). The decoder is not used during this iterative denoising. After the iterative procedure, the decoder produces the final predicted code ŷ. We post-process ŷ by selecting the tokens up to the first pad token.
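A hedged sketch of this inference loop is shown below; the `scheduler.step` call stands in for any standard DDPM-style update rule and, like the shapes, is an assumption rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def generate(E_s, denoiser, decoder, head, scheduler, T=1200, n=128, d=512, pad_id=0):
    """Start from Gaussian noise, iteratively denoise for T steps, then decode once."""
    x_t = torch.randn(1, n, d)                       # pure Gaussian noise
    for t in reversed(range(T)):
        x0_hat = denoiser(x_t, t, E_s)               # predicted denoised embedding at step t
        x_t = scheduler.step(x0_hat, x_t, t)         # keep a scheduler-determined amount of noise
    D_s = decoder(x0_hat, E_s)                       # decoder runs only after denoising finishes
    tokens = head(D_s).argmax(dim=-1)[0].tolist()    # per-position argmax over the vocabulary
    return tokens[: tokens.index(pad_id)] if pad_id in tokens else tokens  # cut at first pad token
```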

Evaluation Setup
We briefly describe training, baselines, benchmarks, and metrics. We provide further details on training and baselines in the Appendix.

Benchmarks
We evaluate CODEFUSION on NL-to-code for languages of varying complexity: Python, Bash, and conditional formatting (CF) rules in Microsoft Excel. The CoNaLa dataset (Yin et al., 2018) for Python consists of complex, multi-statement StackOverflow code snippets and associated NL questions. The Bash dataset (Lin et al., 2018) has complex, single-line Bash commands annotated with NL descriptions. The CF dataset (Singh et al., 2022) consists of Excel CF rules, which are single-line programs of low complexity, annotated with NL. These benchmarks (and the pre-training data; see next section) are made publicly available.

Training
For our experiments, we instantiate the encoder (E) as a pre-trained CodeT5 encoder (Wang et al., 2021) with embedding dimension 512, the denoiser (N) as a 10-layer transformer block, the decoder (D) as 6 transformer decoder layers, and the classification head (H) as a single fully connected layer.
In both the pre-training and fine-tuning phases, we use a square root noise schedule with 1200 diffusion steps (Wu et al., 2023). We use the tokenizer and vocabulary from CodeT5 (Wang et al., 2021) and a target code length of 128 tokens. We use the AdamW optimizer without weight decay (Loshchilov and Hutter, 2019) and a learning rate of 5e-4.
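As an illustrative sketch, a square-root noise schedule of the form popularized for text diffusion, together with the optimizer settings quoted above, could look as follows; the offset and clamping values are assumptions, not the exact schedule used here.

```python
import torch

def sqrt_alpha_bar_schedule(T: int = 1200, s: float = 1e-4) -> torch.Tensor:
    """Cumulative signal level under a square-root schedule: ᾱ_t = 1 - sqrt(t/T + s)."""
    t = torch.arange(T + 1, dtype=torch.float32) / T
    return (1.0 - torch.sqrt(t + s)).clamp(min=1e-5)

# Optimizer as described in the text: AdamW without weight decay, learning rate 5e-4.
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.0)
```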
We pre-train the denoiser and decoder on code snippets only. For Excel, we use a public corpus of 450K conditional formatting rules (Singh et al., 2022). For Python and Bash, we scrape GitHub notebooks and StackOverflow posts with the tags python, bash, and powershell, using a regex extractor to detect code (Lin et al., 2018).

Metrics
We evaluate Bash generation using the template match metric provided with the dataset, which performs some basic normalization. We evaluate Python using CodeBERTScore (Zhou et al., 2023), which has been shown to be a high-quality non-execution-based code matching metric. We evaluate CF using execution match (Singh et al., 2022), executing a rule on the data column and comparing the result to the expected output.

Evaluation
We investigate the following questions. Q1: Does CODEFUSION generate correct and diverse code? Q2: How do different design decisions impact performance? Q3: How does the latent representation evolve over the diffusion steps?

Performance and Diversity (Q1)
Table 1 summarizes performance in top-1, top-3 and top-5 settings for CODEFUSION and baselines.
In top-1, CODEFUSION performs on par with or better than (much larger) auto-regressive models. For Python, only GPT-3 (175B) performs better than CODEFUSION (75M). In top-3 and top-5, CODEFUSION outperforms all baselines, consistent with previous observations that auto-regressive models with high top-1 performance sacrifice diversity in their generations (Poesia et al., 2022).
Table 2 shows diversity results, averaged across all benchmark tasks over the top-5 generations for each model, for CODEFUSION and the auto-regressive (T5, CodeT5, StarCoder, CodeGen, GPT-3) baselines. CODEFUSION produces generations of higher diversity than the auto-regressive models.
Like CODEFUSION, the other diffusion methods (Diffusion-LM and GENIE) improve in top-3 and top-5 relative to top-1. They fall short of CODEFUSION because they generate syntactically invalid programs. Table 3 shows the fraction of syntactically valid generations for CODEFUSION and the diffusion baselines. CODEFUSION generations are more often syntactically valid than those of diffusion models not designed for code: 33.8% more than Diffusion-LM and 26.2% more than GENIE, averaged across all three languages.

Ablations (Q2)
Table 4 shows the results of CODEFUSION with various changes. Removing either pre-training task significantly reduces performance (-10.9% for code generation and -4.6% for CPD, averaged across the three languages). Replacing D and H with grounding (picking the closest vocabulary token at the last denoising step) or clamping (picking the closest vocabulary token at each denoising step) highlights the benefit of using a decoder before rounding.
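For reference, the grounding/clamping baselines replace the decoder and head with a nearest-neighbor projection onto the embedding table, roughly as in this hedged sketch (the distance metric is an assumption):

```python
import torch

def round_to_nearest_token(x_hat: torch.Tensor, embedding_table: torch.Tensor) -> torch.Tensor:
    """Map each denoised embedding to the id of the closest vocabulary embedding.
    x_hat: (n, d) denoised embeddings; embedding_table: (V, d) token embeddings."""
    dists = torch.cdist(x_hat, embedding_table)   # (n, V) pairwise L2 distances
    return dists.argmin(dim=-1)                   # nearest vocabulary token per position
```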

Gradual Refinement (Q3)
We study how CODEFUSION gradually reaches the final result. For this experiment, we stop the denoising at a timestep t ∈ [0, T] and generate a code snippet from the current state. We measure the normalized string edit distance obtained at each time step (in increments of 100 steps). Figure 2 shows that the edit distance decreases with t. The drop is much faster for CF rules, as these are simpler to generate than full programs and Bash commands. An example visualizing this is shown in Figure 3.
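The normalized string edit distance used in this experiment can be computed as in the sketch below; normalizing by the longer string's length is an assumption about the exact normalization.

```python
def normalized_edit_distance(pred: str, target: str) -> float:
    """Levenshtein distance divided by the length of the longer string (0 = identical)."""
    m, n = len(pred), len(target)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                                  # deletion
                         row[j - 1] + 1,                              # insertion
                         prev + (pred[i - 1] != target[j - 1]))       # substitution
            prev = cur
    return row[n] / max(m, n, 1)
```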

Conclusion
We propose CODEFUSION, the first diffusion-based natural language to code (NL-to-code) generation model. With a decoder and code-focused pre-training, CODEFUSION generates more syntactically correct programs than existing text diffusion models and more diverse programs than existing auto-regressive code models. Our evaluation shows that CODEFUSION competes with state-of-the-art transformer code generation models on Python, Bash, and Excel conditional formatting rules.
Limitations

CODEFUSION is not a global system, as we only consider natural language utterances in English. Furthermore, natural language specifications can be provided at varying levels of detail; utterances with less detail may result in worse performance. We consider several programming languages, but more complex languages may result in worse performance. We also find that CODEFUSION struggles when tasked with generating longer code snippets or programs that have long-range syntactic dependencies. Because diffusion-based models denoise iteratively, inference latency is substantial, rising exponentially with target generation length.

C.1 Exact Match Results

Table 6 shows these results. CODEFUSION performs comparably to transformer-based models and better than other diffusion-based text generation approaches in top-1 accuracy. In top-3 accuracy, CODEFUSION outperforms all baselines. This is consistent with the results in Table 1 and shows that CODEFUSION produces better and more diverse candidate programs across a variety of tasks.

C.2 Visualizing Diffusion
CODEFUSION iteratively denoises the latent representation to construct the final target. This can be visualized by mapping the representation at each time step to discrete tokens. We follow the setup explained in Section 5.3. Figure 4 shows a success example from the Python benchmarks where CODEFUSION generates the correct code.
We can see how CODEFUSION gradually denoises and generates the correct code. Figure 5 shows a failure case where CODEFUSION is unable to generate the correct code. The user asks to remove the directory tree 'folder_name/'. CODEFUSION's generation is incorrect because os.removedir is not a valid function; the correct function name is os.removedirs. Further, this function only removes empty directories, while the user wanted to remove the directory tree, which includes files.

C.3 Effect of Diffusion Time Step
The number of diffusion timesteps is directly related to generation quality, as shown by Saharia et al. (2022a). We explore how CODEFUSION is affected by the number of timesteps on Python. We try different timestep values T ∈ {200, 400, 600, 800, 1000, 1200} and plot the CodeBERTScore for each setting. Figure 6 shows the CodeBERTScore against increasing timesteps. Quality improves with more timesteps for CODEFUSION as well, but with diminishing gains: the plot flattens around t = 1000. Note that adding timesteps also increases the inference latency and memory requirements of the model.

C.4 Latency and Memory
Diffusion models are known to be complex, with millions of parameters, and to have higher latency and memory requirements than comparable auto-regressive transformer models, due to repeated sampling and sequential denoising operations. CODEFUSION has 75 million parameters and requires 544 MB of disk space. The average inference latency on our benchmarks was 2318 ms. The average GPU memory used was 928 MB and the maximum GPU memory used was 1042 MB.

D Background

D.1 Transformer-based Sequence Generation
Transformer-based language models (Vaswani et al., 2017) are conditional generative models implemented through auto-regressive (AR) decoding. These models predict the likelihood of the target token y_t using the conditional input encoding and the previously generated tokens y_1, y_2, ..., y_{t−1}. The likelihood of the generated sequence is given by:

P(y | x) = ∏_{i=1}^{N} p(y_i | y_{1:i−1}; x)   (1)
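Equation (1) corresponds to summing per-token log-probabilities under teacher forcing, as in this small sketch (the shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """log P(y|x) = sum_i log p(y_i | y_{1:i-1}; x).
    logits: (L, V) per-position scores from an AR decoder; target_ids: (L,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, target_ids.unsqueeze(-1)).sum()
```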

D.2 Diffusion Model
A diffusion process is a discrete-time Markov process. The process starts with an initial state x_0 at timestep t = 0, where x_0 is drawn from the original data distribution. The process moves forward by gradually adding Gaussian noise to x_0 according to the variance schedule β_1, ..., β_T. Since the forward process only adds noise based on a schedule, at any timestep t+1, x_{t+1} can be expressed in terms of x_t as

q(x_{t+1} | x_t) = N(x_{t+1}; √(1 − β_{t+1}) x_t, β_{t+1} I).

During training, a diffusion model learns to perform the inverse diffusion process, wherein it predicts the noise added at each step and removes it to recover the original data.
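Concretely, one forward (noising) step of this Markov process can be sampled as follows; this is the standard DDPM update and is shown only for illustration.

```python
import torch

def forward_diffusion_step(x_t: torch.Tensor, beta_next: float) -> torch.Tensor:
    """Sample x_{t+1} ~ N(sqrt(1 - β_{t+1}) x_t, β_{t+1} I)."""
    eps = torch.randn_like(x_t)                                     # ε ~ N(0, I)
    return (1.0 - beta_next) ** 0.5 * x_t + beta_next ** 0.5 * eps
```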

Figure 2: Average normalized edit distance for CODEFUSION generations against increasing diffusion timesteps.

Figure 3: Successive stages of denoising by CODEFUSION on an example from the Python benchmarks.

Figure 4: Stages of denoising in CODEFUSION on an example from the Python benchmarks where CODEFUSION succeeds. CODEFUSION starts from pure noise and gradually denoises it to generate the target code.

Figure 5: Stages of denoising in CODEFUSION on an example from the Python benchmarks where CODEFUSION fails. CODEFUSION starts from pure noise and gradually denoises it, but the final generation is incorrect: the correct function name is os.removedirs, and that function only removes empty directories, while the user wanted to remove a directory containing files.

Figure 6: CodeBERTScore of CODEFUSION generations for Python against increasing diffusion timesteps.

Table 1 :
Comparison of CODEFUSION with baselines on the task of NL to code generation for Python, Bash and CF rules.We report top-1, top-3 and top-5 predictions.Model denotes the underlying base model's checkpoint name.#P denotes the number of model parameters.We note the metric used for each language in parentheses.

Table 2 :
Comparison of diversity in the top-5 code generations of CODEFUSION and baselines for Python, Bash, and CF rules. We report the fraction of distinct token-level n-grams, pairwise similarities of CodeBERT embeddings of outputs, and statistics over pairwise normalized string edit distances of outputs.

Table 3 :
% of top-1 generations that are syntactically valid for CODEFUSION and text diffusion-based baselines. CODEFUSION generates more valid candidates.

Table 6 :
Comparison of CODEFUSION with baselines on the task of text-to-code generation for Python, Bash, and CF rules. We report top-1, top-3, and top-5 exact code match of the predictions. The "Model" column denotes the underlying base LLM used by the system. #P denotes the number of parameters in the model.