CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications, while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Second, they often employ a limited set of pretraining objectives that might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which mitigates the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs, without training from scratch, to efficiently scale up our models, and explore instruction tuning to align the models with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on the HumanEval code generation task against other open code LLMs.


Introduction
Large language models (LLMs) [Chen et al., 2021, Wang et al., 2021b, Nijkamp et al., 2023] have recently demonstrated remarkable success on a broad set of downstream tasks in the code domain [Husain et al., 2019, Hendrycks et al., 2021]. By pretraining on massive code-based data (e.g., GitHub public data), these code LLMs can learn rich contextual representations that can be transferred to various code-related downstream tasks. However, we found that many existing models are designed to perform well on only a subset of tasks. We argue that this is mainly due to two limitations in terms of architecture and pretraining tasks.
From an architectural perspective, existing code LLMs often adopt encoder-only or decoder-only models that perform well only on certain understanding or generative tasks. Specifically, encoder-only models [Feng et al., 2020] are often used to facilitate understanding tasks such as text-to-code retrieval. For generative tasks such as code generation [Chen et al., 2021, Hendrycks et al., 2021], decoder-only models [Chen et al., 2021, Nijkamp et al., 2023] often demonstrate stronger performance. Besides, several recent models have adopted more unified encoder-decoder architectures [Wang et al., 2021b] to adapt to different types of tasks. While these models can support both understanding and generative tasks, they still suffer from suboptimal performance on certain tasks. Guo et al. [2022] found that encoder-decoder models fail to beat state-of-the-art (SoTA) encoder-only or decoder-only baselines on retrieval and code completion tasks respectively. This shortfall stems from the limitation of a single-module architecture that is generally adapted to all tasks. In summary, prior approaches are not designed with compositionality such that individual components can be activated to better suit different types of downstream tasks.

Figure 1: An overview of our CodeT5+ approach: CodeT5+ is a family of code large language models to address a wide range of code understanding and generation tasks. The framework contains a diverse mixture of pretraining objectives on unimodal and bimodal data. Individual modules of CodeT5+ can be flexibly detached and combined to suit different downstream applications in zero-shot, finetuning, or instruction-tuning settings.
From a learning objective perspective, current models employ a limited set of pretraining tasks. These tasks can lead to performance degradation on certain downstream tasks due to the discrepancy between the pretraining and finetuning stages. For instance, T5-based models such as [Wang et al., 2021b] are often trained with a span denoising objective, whereas in downstream tasks such as code generation [Chen et al., 2021, Hendrycks et al., 2021], most state-of-the-art models are pretrained with a next-token prediction objective that auto-regressively predicts a program token by token. Furthermore, many models are not trained to learn contrastive code representations, which are vital for understanding tasks such as text-to-code retrieval. Although recent attempts [Guo et al., 2022, Wang et al., 2021a] introduce a contrastive learning task to alleviate this issue, these approaches ignore the fine-grained cross-modal alignment between text and code representations.
To address the above limitations, we propose "CodeT5+", a new family of encoder-decoder code foundation LLMs for a wide range of code understanding and generation tasks (see Fig. 1 for an overview). Despite being an encoder-decoder based model, our CodeT5+ can flexibly operate in encoder-only, decoder-only, and encoder-decoder modes to suit different downstream applications. Such flexibility is enabled by our proposed pretraining tasks, which include span denoising and causal language modeling (CLM) tasks on code data and text-code contrastive learning, matching, and CLM tasks on text-code data. We found that such a wide set of pretraining tasks can help learn rich representations from both code and text data, and bridge the pretrain-finetune gap in various applications. Besides, we show that the integration of the matching task with contrastive learning is crucial to capture the fine-grained text-code alignments and improve retrieval performance.
Furthermore, we scale up the model size of CodeT5+ with a compute-efficient pretraining strategy that leverages off-the-shelf code LLMs [Nijkamp et al., 2023] to initialize the components of CodeT5+. Specifically, we employ a "shallow encoder and deep decoder" architecture [Li et al., 2022b], where both the encoder and decoder are initialized from pretrained checkpoints and connected by cross-attention layers. We freeze the deep decoder LLM and only train the shallow encoder and the cross-attention layers, largely reducing the number of trainable parameters for efficient tuning. Finally, recent work in the NLP domain [Taori et al., 2023, Wang et al., 2022b, Ouyang et al., 2022] inspired us to explore CodeT5+ with instruction tuning to better align the models with natural language instructions.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks under various settings, including zero-shot, finetuning, and instruction-tuning. Results show that CodeT5+ yields substantial performance gains over SoTA baselines on many downstream tasks, e.g., 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4). On 2 math programming tasks, the MathQA and GSM8K benchmarks [Austin et al., 2021, Cobbe et al., 2021], CodeT5+ models of sub-billion-parameter sizes significantly outperform many LLMs of up to 137B parameters. Particularly, in the zero-shot text-to-code generation task on the HumanEval benchmark [Chen et al., 2021], our instruction-tuned CodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 model. Finally, we showcase that CodeT5+ can be seamlessly adopted as a semi-parametric retrieval-augmented generation system, which significantly outperforms similar methods in code generation. All CodeT5+ models will be open-sourced to support the research and developer communities.

Related Work
Following the success of large language models (LLMs) such as BERT [Devlin et al., 2019] and GPT [Radford et al., 2019] in natural language processing (NLP), recent years have witnessed a surge of LLM research in the code domain, leading to new SoTA results on a wide spectrum of code-related tasks. Typically, code-based LLMs fall into three architectures: encoder-only models [Feng et al., 2020, Wang et al., 2022a], decoder-only models [Chen et al., 2021, Fried et al., 2022, Nijkamp et al., 2023], and encoder-decoder models [Wang et al., 2021b, Niu et al., 2022, Chakraborty et al., 2022]. Encoder-only and decoder-only models are often ideal for either understanding tasks such as code retrieval [Husain et al., 2019] or generation tasks such as code synthesis [Chen et al., 2021, Hendrycks et al., 2021], respectively. Encoder-decoder models can be adapted to both code understanding and generation but do not always achieve better performance [Wang et al., 2021b] than decoder-only or encoder-only models. In this work, we propose a new family of encoder-decoder code LLMs that can flexibly operate in various modes, including encoder-only, decoder-only, and encoder-decoder modes.
Prior code LLMs are also limited by their pretraining tasks, which do not transfer well to some downstream tasks. For instance, T5-based models such as [Wang et al., 2021b], pretrained with a span denoising objective, are not ideal for auto-regressive generation tasks like next-line code completion [Svyatkovskiy et al., 2020b], as these models are trained to recover short spans of limited lengths rather than a whole program. Inspired by recent advances in NLP research [Tay et al., 2022, Soltan et al., 2022], we explore combining span denoising with CLM tasks to give the model better causal generation capability. Additionally, most models lack specific pretraining tasks (e.g., contrastive learning) to facilitate the learning of contextual representations that can distinguish code samples of different semantics, which can lead to suboptimal performance on code understanding tasks like code retrieval [Husain et al., 2019]. In light of this observation, our pretraining objectives include a contrastive learning task to learn better unimodal representations and a matching task to learn richer bimodal representations; these tasks have demonstrated positive impacts in related vision-language pretraining.
More related to our work is UniXcoder [Guo et al., 2022], which adopts a UniLM-style design [Dong et al., 2019] and supports various tasks by manipulating input attention masks. However, as it relies on a single encoder to support all tasks, UniXcoder suffers from inter-task interference, leading to performance degradation, especially on sequence-to-sequence tasks such as code generation. UniXcoder and related work [Wang et al., 2021b, Guo et al., 2022, Wang et al., 2022a] also use code-specific features such as abstract syntax trees and identifiers. In CodeT5+, we efficiently activate component modules for different tasks and do not rely on code-specific features.

Figure 2: Model architecture: The encoder learns to encode contextual representations from code/text sequences (either complete, partial, or span-masked sequences) while the decoder is trained to generate different types of outputs, depending on the pretraining tasks. S1: first-stage pretraining with the unimodal code corpus. S2: second-stage pretraining with the bimodal code-text corpus. The diagram on the right illustrates our proposed compute-efficient training with frozen code LLMs to scale up the model: we employ a "shallow encoder and deep decoder" architecture and only keep the small encoder and the cross-attention layers trainable while freezing the deep decoder LLM.
Finally, also related to our work is research on parameter-efficient LLM training, which aims to scale LLMs with limited computational resources. A common strategy is to train only a small number of (extra) model parameters while freezing a large part of the LLM [Hu et al., 2022, Sung et al., 2022]. Another common approach is prompting, with either continuous or discrete prompts, to efficiently align models to downstream tasks [Lester et al., 2021, Liu et al., 2022, Ponti et al., 2023]. In this work, we scale our models by leveraging LLMs to initialize the encoder and decoder components of CodeT5+ from pretrained checkpoints. We employ the "shallow encoder and deep decoder" architecture of Li et al. [2022b] and only keep the small encoder and the cross-attention layers trainable while freezing the deep decoder LLM. We then combine this training scheme with instruction tuning [Taori et al., 2023, Wang et al., 2022b, Ouyang et al., 2022], using only a small set of synthetic instruction-following prompts from Chaudhary [2023], to efficiently guide CodeT5+ towards better alignment with downstream tasks.

CodeT5+: Open Code Large Language Models
We develop CodeT5+, a new family of open code large language models for code understanding and generation tasks (see Fig. 1 for an overview and more architecture/pretraining details in Fig. 2 and Fig. 3). Based on the encoder-decoder architecture [Wang et al., 2021b], CodeT5+ is enhanced with the flexibility to operate in various modes for different downstream tasks through our proposed mixture of pretraining objectives on unimodal and bimodal data.
In the first stage of unimodal pretraining, we pretrain the model with massive code data using computationally efficient objectives (Sec. 3.1). In the second stage of bimodal pretraining, we continue to pretrain the model with a smaller set of code-text data with cross-modal learning objectives (Sec. 3.2). For each stage, we jointly optimize multiple pretraining objectives with equal weights. We found that this stage-wise training approach can efficiently expose our models to more diverse data to learn rich contextual representations. Additionally, we explore initializing CodeT5+ with off-the-shelf code LLMs to efficiently scale up the model (Sec. 3.3). Finally, model components in CodeT5+ can be dynamically combined to suit different downstream application tasks (Sec. 3.4).

Unimodal Pretraining on Code Data
In the first stage, we pretrain CodeT5+ on large-scale code unimodal data, which can be obtained from open-source platforms like GitHub. Although such data also contain texts such as user-written code comments, we denote them as unimodal data to distinguish them from the bimodal data of text-code pairs used in the second pretraining stage. It is non-trivial to separate the code and text due to the various commenting styles of programmers and the different commenting syntax across languages. In this stage, we pretrain the model from scratch using a mixture of span denoising and CLM tasks as shown in Fig. 3. These tasks enable the model to learn to recover code contexts at different scales: code spans, partial programs, and complete programs.

Figure 3: Self-supervised pretraining on open-source code data: we pretrain CodeT5+ on code data using a mixture of tasks: (i) span denoising (top); (ii) decoder-only causal LM (middle); and (iii) Seq2Seq causal LM (bottom). This mixture of tasks lets the models learn meaningful representations of code contexts and recover missing information at different levels: code spans, partial programs, and complete programs.
Span Denoising. Similar to T5 [Raffel et al., 2020], we randomly replace 15% of the tokens with indexed sentinel tokens (e.g., [MASK0]) in the encoder inputs and require the decoder to recover them by generating the masked spans. We follow CodeT5 in employing whole-word masking, sampling spans (with lengths determined by a uniform distribution with a mean of 3) before subword tokenization to avoid masking partial words. To accelerate training, we concatenate different code files into sequences and truncate them into chunks of fixed length.
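As an illustration, the span corruption format can be sketched as follows. This is a simplified token-level version (the actual implementation samples whole-word spans before subword tokenization), and all function names here are illustrative:

```python
import random

def span_corrupt(tokens, mask_rate=0.15, mean_span_len=3, seed=0):
    """Token-level sketch of T5-style span corruption: sampled spans are
    replaced by indexed sentinels in the encoder input, and the decoder
    target lists each sentinel followed by the tokens it hides."""
    rng = random.Random(seed)
    budget = int(len(tokens) * mask_rate)  # ~15% of tokens masked in total
    source, target = [], []
    i = sid = masked = 0
    while i < len(tokens):
        if masked < budget and rng.random() < mask_rate:
            span = rng.randint(1, 2 * mean_span_len - 1)  # mean ~= mean_span_len
            sentinel = f"[MASK{sid}]"
            source.append(sentinel)
            target.append(sentinel)
            target.extend(tokens[i:i + span])
            i += span
            masked += span
            sid += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target
```

Replacing each sentinel in the source with its span from the target reconstructs the original sequence, which is exactly the mapping the decoder is trained to produce.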
Causal Language Modeling (CLM). Inspired by Tay et al. [2022] and Soltan et al. [2022], we introduce two variants of CLM to optimize our model for auto-regressive generation. In the first variant, we randomly select a pivot location and regard the context before it as the source sequence and the sequence after it as the target output. We denote this variant as the sequence-to-sequence (Seq2Seq) causal LM objective. We restrict the pivot location to be uniformly sampled between 10% and 90% of the whole sequence and prepend a special token [CLM] to the source sequence. The second CLM variant is a decoder-only generation task and can be viewed as an extreme case of the first variant: we always pass a [CLM] token to the encoder input and require the decoder to generate the full code sequence. Compared to the first variant, this task provides denser supervision signals to train the decoder as an independent, full-fledged code generation module.
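The two CLM input/output formats can be sketched as follows; this is a minimal illustration, and the pivot-sampling details of the real data pipeline may differ:

```python
import random

def seq2seq_clm_split(tokens, seed=0):
    """Seq2Seq causal LM: sample a pivot between 10% and 90% of the
    sequence; the prefix (tagged with [CLM]) is the source, the suffix
    is the generation target."""
    rng = random.Random(seed)
    lo = max(1, int(0.1 * len(tokens)))
    hi = max(lo, int(0.9 * len(tokens)))
    pivot = rng.randint(lo, hi)
    return ["[CLM]"] + tokens[:pivot], tokens[pivot:]

def decoder_only_clm(tokens):
    """Decoder-only variant: the encoder sees only the [CLM] tag and the
    decoder must generate the complete sequence."""
    return ["[CLM]"], list(tokens)
```

In both variants, source and target together cover the whole sequence; the decoder-only variant simply pushes all of it onto the decoder side.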

Bimodal Pretraining on Text-code Data
In the second stage, we pretrain the model using text-code bimodal data at the function level [Husain et al., 2019]. In this setting, each text-code pair contains a code function and a corresponding docstring describing its semantics. Such a bimodal data format facilitates model training for cross-modal understanding and generation. The bimodal pretraining tasks consist of cross-modal contrastive learning, matching, and causal LM tasks, as shown in Fig. 2.
Text-Code Contrastive Learning. This task aims to align the feature spaces of text and code representations by pulling together the representations of positive text-code pairs and pushing apart those of negative pairs. Guo et al. [2022] demonstrated the benefits of such a learning task for code understanding. This task only activates the encoder, which encodes a text or code snippet into a continuous representation through bidirectional self-attention [Vaswani et al., 2017]. Similar to BERT [Devlin et al., 2019], we prepend a special token [CLS] to the input and regard its output embedding at the final Transformer layer as the representation of the corresponding input text or code. We further add a linear layer and use L2 normalization to map the output to 256-dimensional embeddings. To enrich the negative samples, we use a momentum encoder to store embeddings of samples from previous mini-batches, as similarly adopted by He et al. [2020]. Specifically, the momentum encoder maintains a queue that enqueues the samples in the current mini-batch and dequeues the samples from the oldest mini-batch. We update the momentum encoder by linear interpolation of the original encoder and the momentum encoder to ensure the consistency of representations across training steps.
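A minimal NumPy sketch of the queue-based contrastive setup follows. Dimensions are hypothetical, and the real model computes embeddings with Transformer encoders rather than using raw vectors:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class MomentumQueue:
    """FIFO queue of momentum-encoder embeddings used as extra negatives."""
    def __init__(self, dim, size, seed=0):
        self.queue = l2norm(np.random.RandomState(seed).randn(size, dim))
        self.ptr = 0

    def enqueue_dequeue(self, keys):
        # overwrite the oldest entries with the current batch's embeddings
        idx = (self.ptr + np.arange(len(keys))) % len(self.queue)
        self.queue[idx] = keys
        self.ptr = (self.ptr + len(keys)) % len(self.queue)

def momentum_update(enc_params, mom_params, m=0.999):
    """Momentum encoder update: linear interpolation of the two encoders."""
    return {k: m * mom_params[k] + (1 - m) * enc_params[k] for k in enc_params}

def contrastive_logits(text_emb, code_emb, queue, temperature=0.07):
    """InfoNCE logits: in-batch positives on the diagonal, queued momentum
    embeddings as additional negatives."""
    t, c = l2norm(text_emb), l2norm(code_emb)
    return np.concatenate([t @ c.T, t @ queue.queue.T], axis=1) / temperature
```

A cross-entropy loss over these logits (with the diagonal entries as labels) then pulls each text toward its paired code and away from both in-batch and queued negatives.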
Text-Code Matching. This task activates the decoder and aims to predict whether a text and a code snippet share the same semantics. It enables the model to learn better bimodal representations that capture the fine-grained alignment between the text and code modalities. Given a code sample, the decoder first passes it to an embedding layer and a causal self-attention layer. The self-attention representations are then passed to a cross-attention layer, which queries relevant signals from the text representations (received from the encoder). A task-specific [Match] token is prepended to the code input sequence to inform the decoder of the text-code matching functionality, and an [EOS] token is appended to the end of the code input. Since the decoder employs causal self-attention masks and only the last decoder token can attend to the whole context, we treat the output embedding of [EOS] at the last decoder layer as the text-code cross-modal alignment representation. Finally, we apply a linear layer on top of this output embedding for a binary matching task, predicting whether a text-code pair is positive (matched) or negative (unmatched).
In order to find more informative negatives, we employ a hard negative mining strategy. Specifically, we sample hard negatives based on the contrastive similarity scores between the current sample and previous samples in the queue maintained by the momentum encoder; as such, harder negatives are more likely to be selected. For a batch of positive pairs, we construct two batches of negative pairs by mining negatives from the text/code queue with a code/text query.
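The similarity-weighted sampling behind hard negative mining can be sketched as follows. This is an illustrative NumPy version; the temperature value is an assumption, and in practice the true positive is excluded from the candidate pool:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def hard_negative_probs(query, queue, temperature=0.07):
    """Sampling distribution over queued candidates: probability grows with
    contrastive similarity to the query, so harder (more similar) negatives
    are favored."""
    sims = l2norm(queue) @ l2norm(query)
    logits = sims / temperature
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return w / w.sum()

def mine_hard_negative(query, queue, rng, temperature=0.07):
    """Draw one hard-negative index from the queue for the given query."""
    return rng.choice(len(queue), p=hard_negative_probs(query, queue, temperature))
```

Mining with a code query against the text queue (and vice versa) yields the two batches of negative pairs used by the matching task.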
Text-Code Causal LM. This task activates both the encoder and decoder and focuses on a cross-modal generative objective through dual multimodal conversions: text-to-code generation and code-to-text generation. Specifically, when the input is a text sample, we prepend a [CDec] token to the decoder input sequence; in this case, the decoder operates in code generation mode. Alternatively, when the input is a code sample, we prepend a [TDec] token to the decoder input sequence, and the decoder operates in text generation mode. This type of causal LM has been shown to be an effective learning objective for closing the pretrain-finetune gap on multimodal generative downstream tasks such as code summarization [Wang et al., 2021b].

Compute-efficient Pretraining with Frozen Off-the-shelf LLMs
To efficiently scale up the model without pretraining from scratch, we propose a compute-efficient pretraining strategy to initialize the model components (i.e., encoder and decoder) of CodeT5+ with off-the-shelf pretrained LLMs [Nijkamp et al., 2023] (see the rightmost diagram of Fig. 2). For this extension, inspired by Li et al. [2022b], we employ a "shallow encoder and deep decoder" architecture instead of the equally sized encoder and decoder of conventional T5 models [Raffel et al., 2020, Wang et al., 2021b]. As noted by Li et al. [2022b], the decoder in a T5-based model often has to deal with a higher level of complexity in generation tasks and thus should be equipped with a larger number of parameters.
To connect the separately pretrained encoder and decoder, we insert randomly initialized cross-attention layers into the decoder blocks after the self-attention layers. For efficient tuning, we only insert cross-attention layers into the top-L decoder layers (L=1 in our experiments). We only keep the small encoder and the cross-attention layers trainable while freezing the majority of the decoder parameters. We also explored other advanced designs, such as adding a gating function to improve training stability or inserting multiple cross-attention layers at a certain frequency [Alayrac et al., 2022]. However, we did not observe significant performance improvements, and these design choices introduce considerable computational overhead.
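The resulting trainable/frozen layout can be sketched schematically; the layer counts below are hypothetical, and only the wiring pattern matters:

```python
def frozen_llm_layout(encoder_layers=6, decoder_layers=32, top_L=1):
    """Schematic (name, layer_index, trainable) module list for the
    'shallow encoder and deep decoder' design: the small encoder and the
    newly inserted cross-attention layers are trainable, while every
    pretrained decoder block stays frozen."""
    modules = [("encoder_block", i, True) for i in range(encoder_layers)]
    for i in range(decoder_layers):
        modules.append(("decoder_self_attn", i, False))  # frozen LLM weights
        if i >= decoder_layers - top_L:                  # only the top-L blocks
            modules.append(("cross_attn", i, True))      # random init, trainable
        modules.append(("decoder_ffn", i, False))        # frozen LLM weights
    return modules
```

Counting the trainable entries makes the efficiency argument concrete: only the shallow encoder and L cross-attention layers receive gradients, regardless of how deep the frozen decoder is.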

Adaptation to Downstream Understanding and Generation Tasks
After the two stages of pretraining, CodeT5+ can flexibly operate in various modes to support different tasks, including Seq2Seq generation tasks, decoder-only tasks, and understanding-based tasks.
Seq2Seq Generation Tasks. As an encoder-decoder model, CodeT5+ can be naturally adapted to a variety of Seq2Seq generation tasks such as code generation and summarization. We also adapt CodeT5+ as a retrieval-augmented generation model, using the encoder to retrieve code snippets, which are then used by both the encoder and decoder for code generation.
Decoder-only Tasks. In this setting, we always feed a [CLM] token to the encoder input and pass the source sequence to the decoder as the prefix context. We freeze the weights of the encoder and of the cross-attention layers in the decoder. This strategy only activates parts of the decoder, cutting roughly half of the total model parameters from the active computation. We use next-line code completion tasks to evaluate the decoder-only generation capability of CodeT5+.
Understanding Tasks. CodeT5+ can support understanding tasks in two ways: first, it employs the encoder to obtain text/code embeddings, which can either be passed to a binary classifier for detection tasks or used directly for retrieval; alternatively, the encoder can be combined with the decoder to predict text-code matching scores for text-to-code retrieval tasks.

Pretraining Dataset
We enlarge the pretraining dataset of CodeSearchNet [Husain et al., 2019] with the recently released GitHub Code dataset. We select nine PLs (Python, Java, Ruby, JavaScript, Go, PHP, C, C++, C#) and filter the dataset by preserving only permissively licensed code and files with 50 to 2000 tokens. Besides, we filter out any overlap with CodeSearchNet and the other downstream tasks covered in our evaluation by checking their GitHub repository names. Note that although we employ the deduplicated data version, in which duplicates are filtered out based on exact match (ignoring whitespace), there might be some remaining duplicates; however, we do not expect any remaining duplication to impact our model performance significantly. We use the CodeT5 tokenizer to tokenize the multilingual dataset, resulting in 51.5B tokens, ∼50x larger than CodeSearchNet.
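A file-level filter implementing these criteria might look as follows; the token-count thresholds come from the paper, while the license list and extension set are illustrative subsets rather than the exact ones used:

```python
import os

# Illustrative subsets: the paper keeps permissively licensed files only.
PERMISSIVE = frozenset({"mit", "apache-2.0", "bsd-3-clause"})
TARGET_EXTS = frozenset({".py", ".java", ".rb", ".js", ".go",
                         ".php", ".c", ".cpp", ".cs"})

def keep_file(path, tokens, license_id):
    """Keep a file only if it is in one of the nine target PLs, carries a
    permissive license, and contains 50 to 2000 tokens."""
    ext = os.path.splitext(path)[1].lower()
    return (ext in TARGET_EXTS
            and license_id in PERMISSIVE
            and 50 <= len(tokens) <= 2000)
```

Deduplication and downstream-overlap removal (by repository name) would run as separate passes over the corpus.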
We report the data statistics of both the unimodal code and bimodal text-code pretraining datasets in Table 1. From the table, we can see that our curated dataset from GitHub code has a much larger data size at the file level than the CodeSearchNet bimodal data at the function level, allowing our model to learn rich representations in the first stage of pretraining. Different from CodeT5 [Wang et al., 2021b], which employs both the unimodal and bimodal data in CodeSearchNet [Husain et al., 2019], we only employ its bimodal subset for the second-stage pretraining of CodeT5+. We use this stage mainly to adapt our model to text-code tasks like text-to-code retrieval and generation.
Figure 4: Examples of the generated instruction data. Instruction: "Create a SQL query to get the list of employee names and ids with a monthly income greater than 4,000." Input: n/a. Output: "SELECT id, name FROM Employees WHERE monthly_income > 4000;" Instruction: "Optimize the given Python program to improve the speed of execution."

Pretraining Setup
We pretrained two groups of CodeT5+ models: 1) CodeT5+ 220M and 770M, which are trained from scratch following T5's architecture [Raffel et al., 2020] (T5-base and T5-large respectively); 2) CodeT5+ 2B, 6B, and 16B, in which the decoders are initialized from the CodeGen-mono 2B, 6B, and 16B models [Nijkamp et al., 2023] and the encoders are initialized from CodeGen-mono 350M. Note that following our model scaling strategy, the latter group of CodeT5+ models adds only a small number of trainable parameters (the 350M encoder plus one cross-attention layer of 36M, 67M, and 151M parameters for the 2B, 6B, and 16B models respectively) compared to the original CodeGen models. We employ the CodeT5 tokenizer and the CodeGen tokenizer for these two groups of models respectively. In pretraining, we adopt a stage-wise strategy: we pretrain CodeT5+ first on the large-scale unimodal dataset and then on the smaller bimodal dataset, using a cluster of 16 A100-40G GPUs on Google Cloud Platform.
In the first stage, we warm up the model with the span denoising task for 10k training steps, and then jointly train it with the two CLM tasks with equal weights for 100k steps. We employ a linear-decay learning rate (LR) scheduler with a peak LR of 2e-4 and set the batch size to 2048 for denoising and 512 for CLM. To prepare the input and output data, we set the maximum length to 512 for the denoising task, the maximum source and target lengths to 768 and 600 for the Seq2Seq (code completion) CLM, and 1 and 1024 for the decoder-only generation CLM (whose encoder input is just the [CLM] token). In the second stage, we jointly optimize four losses (contrastive learning, matching, and two CLM losses) with equal weights for 10 epochs with a batch size of 256. We employ a peak LR of 1e-4 and set the maximum sequence lengths to 420 and 128 for code and text sequences respectively.
In all experiments, we employ the AdamW optimizer [Loshchilov and Hutter, 2019] with 0.1 weight decay. We also employ DeepSpeed ZeRO Stage 2 [Rasley et al., 2020] with FP16 mixed-precision training for acceleration. For the training of CodeT5+ 2B, 6B, and 16B, we keep the frozen decoder weights in FP16 and the other, trainable weights in FP32, and use DeepSpeed ZeRO Stage 3's parameter partitioning for the 6B and 16B models.

Instruction Tuning
In the NLP domain, recent work [Wang et al., 2022b, Taori et al., 2023] studied the benefits of augmenting pretrained LMs with synthetic instruction data. Models finetuned on this type of data can better understand natural language instructions and demonstrate improved alignment with the corresponding tasks [Wang et al., 2022b, Ouyang et al., 2022]. We are motivated to transfer this technique to the code domain to improve our CodeT5+ models. Following Taori et al. [2023], we employ over 20k instruction data points in the code domain curated by Chaudhary [2023]. The data are generated by letting a pretrained LLM, i.e., text-davinci-003, generate novel tasks, including task instructions, inputs (if any), and expected outputs. We trained our models on this augmented dataset for up to 3 epochs and denote the instruction-tuned models as "InstructCodeT5+". Note that the instruction data are generated fully independently of any downstream evaluation tasks, and we still evaluate the instruction-tuned models in a zero-shot manner. Fig. 4 illustrates some examples of the generated instruction data. Since we rely on LM-generated data, including the annotations of expected outputs, not all of the data are perfectly correct; for instance, the code optimization example in Fig. 4 contains a wrong output. Wang et al. [2022b] treated such examples as data noise and found that tuned models still benefit from the majority of the synthetic instruction dataset.
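For illustration, an Alpaca-style prompt template in the spirit of Taori et al. [2023] could be formatted as follows; the exact template used for InstructCodeT5+ may differ:

```python
def format_instruction(instruction, inp=None):
    """Hypothetical Alpaca-style prompt template for instruction tuning:
    the model is trained to generate the response after '### Response:'."""
    if inp:
        return ("Below is an instruction that describes a task, paired with an "
                "input that provides further context. Write a response that "
                "appropriately completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{inp}\n\n### Response:\n")
    return ("Below is an instruction that describes a task. Write a response "
            "that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n### Response:\n")
```

During tuning, the loss is typically applied only to the response tokens, so the model learns to follow the instruction rather than to reproduce the prompt.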

Experiments
We conducted comprehensive experiments on a wide range of code understanding and generation tasks over 20+ code-related datasets across 9 different programming languages (PLs). In addition, we consider a variety of evaluation settings, including zero-shot evaluation, instruction tuning, and task-specific finetuning. Additional results and detailed finetuning setups can be found in Appendices C and D.
Baselines. We implemented a family of CodeT5+ models with sizes ranging from 220M to 16B. Note that CodeT5+ 220M and 770M employ the same architecture as T5 [Raffel et al., 2020] and are pretrained from scratch, while CodeT5+ 2B, 6B, and 16B employ the "shallow encoder and deep decoder" architecture with encoders initialized from CodeGen-mono 350M and decoders initialized from CodeGen-mono 2B, 6B, and 16B, respectively. We compare CodeT5+ with state-of-the-art code LLMs of 3 types: encoder-only, decoder-only, and encoder-decoder models.
• For encoder-only models, we consider RoBERTa, CodeBERT [Feng et al., 2020] trained with masked language modeling, GraphCodeBERT, which uses data flow extracted from the abstract syntax tree (AST) of code, and SYNCOBERT [Wang et al., 2021a] and UniXcoder [Guo et al., 2022], which incorporate AST and contrastive learning. Note that UniXcoder can also be viewed as a decoder-only model, as it employs UniLM-style masking [Dong et al., 2019].
• For encoder-decoder models, we consider PLBART and CodeT5 [Wang et al., 2021b], which employ a unified framework to support both understanding and generation tasks.
Note that billion-parameter LLMs such as Codex and CodeGen typically use most of the source code on GitHub for model training and, unlike us, do not remove overlap with the downstream tasks covered in this work. It is therefore difficult to ensure a fair comparison with these models on those tasks, especially code summarization and completion. Moreover, these models are very expensive to finetune on specific tasks and hence are often employed only in zero-shot evaluation. In this work, we mainly compare CodeT5+ with these LLMs in the zero-shot HumanEval code generation task (Sec. 5.1). In other experiments, we focus on the finetuning setting and compare our models with smaller-scale LMs, including CodeGen-multi-350M despite its potential data leakage issues during pretraining. In some of the finetuning evaluations, such as the code summarization (Sec. 5.3) and text-to-code retrieval (Sec. 5.5) tasks, we found that the performance improvement already becomes relatively saturated as the model size increases. This implies that with enough finetuning data, these tasks might not benefit as much from model scaling (to billions of parameters) as zero-shot evaluation settings do.

Zero-shot Evaluation on Text-to-Code Generation Tasks
[Table 2: Results of pass@k (%) on HumanEval. We compare our models with (i) closed-source models (top) such as AlphaCode [Li et al., 2022b], Codex [Chen et al., 2021], and GPT-4 [OpenAI, 2023]; (ii) open-source models (middle) such as CodeGen [Nijkamp et al., 2023], InCoder [Fried et al., 2022], and LLaMA [Touvron et al., 2023]; and (iii) models with enhanced generation strategies (bottom) such as unit test generation.]

[Figure 5: GSM8K benchmark: an example of converting a natural language solution into a Python program (Python solution: n0 = 48; n1 = 2; t0 = n0 / n1; answer = n0 + t0).]

We first evaluate the model capabilities to generate Python code from natural language specifications in a zero-shot setting. In this task, we activate both the encoder and decoder modules of CodeT5+: the encoder encodes the input text sequence and the decoder generates the corresponding program conditioned on it. We use the HumanEval benchmark [Chen et al., 2021], which
consists of 164 Python problems. To evaluate models for code generation, exact match or BLEU scores can be limited, as there can be multiple correct program solutions. Besides, Chen et al. [2021] found that the functional correctness of generated code correlates poorly with its BLEU score. Therefore, we evaluate model performance by testing generated code against unit tests, and report the passing rate pass@k in Table 2. Following prior approaches on this benchmark, we adopt nucleus sampling during inference with temperatures of 0.2, 0.6, and 0.8 for k = 1, 10, and 100, respectively. In this experiment, we follow Nijkamp et al. [2023] and continue to pretrain our CodeT5+ models for another epoch on the Python subset of the data with the causal LM objective, to better adapt them to Python code generation.
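For reference, the pass@k metric reported here is commonly computed with the unbiased estimator of Chen et al. [2021]: given n samples per problem of which c pass the unit tests, the per-problem estimate is 1 - C(n-c, k) / C(n, k). A minimal sketch (the function name is illustrative):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n-c, k) / C(n, k), for n samples with c correct."""
    if n - c < k:
        # Fewer incorrect samples than k: at least one draw must be correct.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

The per-problem estimates are then averaged over the benchmark to obtain the reported pass@k.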
These superior results against decoder-only baselines demonstrate the advantage of the encoder-decoder architecture of CodeT5+ and validate the effectiveness of our proposed compute-efficient pretraining strategy with frozen off-the-shelf code LLMs.
Finally, we evaluated the models with enhanced generation strategies, following CodeT [Chen et al., 2023]. In this setting, we let the models generate additional test cases (by prompting the models with an assert statement). We then used these generated test cases to filter and select generated code samples for evaluation. We observe that this strategy selects better code candidates and brings performance gains, achieving up to 42.9% pass@1 and 67.8% pass@10. We do notice performance gaps between CodeT5+ and closed-source models such as GPT-4 [OpenAI, 2023] and code-davinci-002. However, as the implementation details and model weights/sizes of these models have not been released, it is difficult to diagnose the root causes of these gaps.
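The test-case filtering step above can be sketched as follows. This is a simplified toy version (real pipelines execute candidates in a sandbox with timeouts, and CodeT additionally clusters candidates by agreement; `filter_by_tests` is an illustrative name):

```python
def filter_by_tests(candidates, test_snippets):
    """Keep only candidate programs that pass every model-generated test.

    candidates: list of source strings, each defining the target function.
    test_snippets: list of assert statements produced by the model.
    """
    survivors = []
    for code in candidates:
        ok = True
        for test in test_snippets:
            env = {}
            try:
                exec(code, env)   # define the candidate function
                exec(test, env)   # run the generated assert against it
            except Exception:
                ok = False
                break
        if ok:
            survivors.append(code)
    return survivors
```

Candidates surviving the generated tests form the pool from which final samples are drawn, which is what drives the pass@1 gains reported above.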

Evaluation on Math Programming Tasks
We consider two other code generation tasks, specifically the math programming benchmarks MathQA-Python [Austin et al., 2021] and GSM8K [Cobbe et al., 2021]. The task is to generate Python programs that solve mathematical problems described in natural language, where code correctness is measured based on the execution outputs of the generated programs. We follow Austin et al. [2021] to convert the solutions in GSM8K into Python programs (henceforth GSM8K-Python; one example is illustrated in Fig. 5). We employ pass@k to measure the percentage of problems solved using k generated programs per problem. We compare our models with very large-scale decoder-only models including LaMDA [Austin et al., 2021], LLaMA [Touvron et al., 2023], Minerva [Lewkowycz et al., 2022], code-davinci [Chen et al., 2021], GPT-Neo [Black et al.], and CodeGen [Nijkamp et al., 2023]. Some of the prior approaches are enhanced with generation strategies such as self-sampling optimization [Ni et al., 2022] and majority voting [Lewkowycz et al., 2022].

[Figure 6: Results on MathQA-Python by problem complexity: compared to CodeT5, CodeT5+ is more robust to the complexity of the problems (i.e., the number of reasoning steps required), demonstrating improved reasoning capabilities in addition to understanding and generation skills.]
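To make the conversion concrete, a GSM8K-Python solution in the style of Fig. 5 is a short straight-line program whose final value is bound to `answer` (numeric values taken from the figure; the original word problem is omitted here, and the comments are illustrative):

```python
# GSM8K-Python style solution: each natural language reasoning step
# becomes one assignment; the final result is stored in `answer`.
n0 = 48           # first quantity mentioned in the problem
n1 = 2            # second quantity
t0 = n0 / n1      # intermediate reasoning step
answer = n0 + t0  # final answer, checked against the gold result
```

Correctness is then judged by executing the program and comparing `answer` to the reference value, rather than by string matching.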
Our CodeT5+ models outperform many baselines of comparable or larger sizes (e.g., GPT-Neo 2.7B and CodeGen-mono 2B), and outperform LaMDA 137B and code-davinci in the few-shot evaluation setting. We do observe that our models still have some performance gap against Minerva [Lewkowycz et al., 2022]. Note that this model was initialized with pretrained PaLM [Chowdhery et al., 2022] and further finetuned on large-scale scientific corpora. It also employs a majority voting strategy to select the most common answers as final predictions.
In Fig. 6, we analyze model performance by the complexity of the math programming problems in MathQA-Python. For each problem, we extract the number of reasoning steps required to solve it. We observe that, compared to CodeT5, CodeT5+ is more robust to problem complexity (i.e., the number of reasoning steps required). CodeT5's performance tends to deteriorate drastically as the number of reasoning steps increases. For CodeT5+, the downward trend is much less severe, and the model still achieves good results on very complex problems (more than 10 steps). Please see Appendix C.3 for more qualitative examples.

Evaluation on Code Summarization Tasks
The code summarization task aims to summarize a code snippet into natural language docstrings. We employ the clean version of the CodeSearchNet dataset [Husain et al., 2019] in six programming languages to evaluate our models on this task. We employ BLEU-4 [Lin and Och, 2004] as the performance metric, which measures the token-based similarity between predicted and ground-truth summaries. From pretrained CodeT5+, we activate both the encoder and decoder for this task.
From Table 4, we found that encoder-decoder models (CodeT5 and CodeT5+) generally outperform both encoder-only models [Feng et al., 2020] and decoder-only models [Nijkamp et al., 2023], as well as the UniLM-style model UniXcoder [Guo et al., 2022]. This observation demonstrates the benefit of the encoder-decoder architecture of CodeT5+ in better encoding code contexts and generating more accurate summaries. Finally, we also observed performance gains over CodeT5 [Wang et al., 2021b], indicating the advantage of our proposed mixture of diverse pretraining objectives beyond the span denoising objective used in CodeT5.

Evaluation on Code Completion Tasks
We evaluate the decoder-only generation capability of CodeT5+ through a line-level code completion task, which aims to complete the next code line based on the previous code context. We employ PY150 [Raychev et al., 2016] and GitHub JavaCorpus [Allamanis and Sutton, 2013] from CodeXGLUE, and use exact match (EM) accuracy and Levenshtein edit similarity [Svyatkovskiy et al., 2020a] as evaluation metrics. In this task, we employ a decoder-only model from CodeT5+, so that only about half of the total model parameters are activated. Table 5 shows that both CodeT5+ (in decoder-only mode) and decoder-only baselines (the top block) significantly outperform encoder-decoder models (the middle block), validating that decoder-only models are better suited to this task; CodeT5+ performs competitively on both PY150 and JavaCorpus. This is mainly due to our causal LM objective in the first-stage pretraining, which allows the decoder to see longer sequences instead of the combination of discrete spans in CodeT5, leading to better causal generation capability.
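As a reference for the metrics above, the Levenshtein edit similarity can be sketched with plain dynamic programming (a minimal version; the official CodeXGLUE script may apply additional normalization, and `edit_similarity` is an illustrative name):

```python
def edit_similarity(pred: str, target: str) -> float:
    """Levenshtein edit similarity in [0, 1]: 1 - distance / max_length."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 1.0  # two empty strings are identical
    # Row-by-row DP over the edit-distance matrix.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)
```

Exact match is simply `pred == target` after whitespace normalization, so edit similarity gives partial credit for nearly-correct completions.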

Evaluation on Text-to-Code Retrieval Tasks
We evaluate the code understanding capabilities of CodeT5+ through text-to-code retrieval tasks across multiple PLs. This task aims to find the most semantically related code snippet at the function level from a collection of candidate code samples, given a natural language query. We consider three datasets for evaluation: CodeSearchNet [Husain et al., 2019], CosQA, and AdvTest, which are curated from the original CodeSearchNet by filtering out data with low-quality queries, adopting real-world queries from a modern search engine, and obfuscating identifiers to normalize the code, respectively. In this task, we activate both the encoder and decoder of CodeT5+ and use Mean Reciprocal Rank (MRR) as the evaluation metric.
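For reference, MRR averages the reciprocal of the rank at which the gold code snippet appears for each query. A minimal sketch (the function name is illustrative):

```python
def mean_reciprocal_rank(ranks):
    """MRR given the 1-based rank of the gold code for each query.

    A rank of None (gold snippet not retrieved) contributes 0.
    """
    total = 0.0
    for r in ranks:
        if r:
            total += 1.0 / r
    return total / len(ranks)
```

A perfect retriever that always ranks the gold snippet first scores 1.0; ranking it second on every query scores 0.5.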
From Table 6, our CodeT5+ 220M significantly outperforms all existing encoder-only/decoder-only models (the top block) and encoder-decoder models (the middle block). Our CodeT5+ 770M further sets new SoTA results, surpassing the previous SoTA UniXcoder by more than 3 absolute MRR points on all 3 tasks across 8 datasets. This implies that CodeT5+ is a robust code retriever that can handle queries with diverse formats and PLs. Besides, CodeT5+ 220M yields substantial performance gains over the CodeT5 model of the same size. These gains can be attributed to the text-code contrastive learning and matching objectives, which facilitate better unimodal and bimodal representation learning. Particularly, compared to SYNCOBERT and UniXcoder, which are also pretrained with contrastive learning, our models achieve much better results, which can be attributed to our text-code matching pretraining task enabling the exploitation of more fine-grained text-code alignments.

Ablation Study
We conduct an ablation study to analyze the impact of our proposed pretraining objectives: a) the causal LM objective at stage-1 unimodal pretraining on two generative tasks, code completion and math programming; b) the text-code matching and causal LM objectives at stage-2 bimodal pretraining on an understanding task, text-to-code retrieval. We employ CodeT5+ 770M and report the results of three representative tasks over 10 datasets in Table 7. We found that the causal LM objective plays a crucial role in the code completion and math programming tasks, as evidenced by a significant performance drop after removing it. This indicates that causal LM can complement the span denoising objective and improve the generation capability of our models. Additionally, we found that the text-code matching objective is critical to retrieval performance (a drop of 2.6 avg. MRR over 6 datasets without it), implying that this objective learns a better bimodal representation that captures the fine-grained alignment between text and code. Besides, we found that retrieval tasks can also benefit from joint training with the causal LM objective despite their task differences.

Unified Retrieval-Augmented Generation Paradigm
As our model is capable of both code retrieval and generation, it can naturally be exploited as a unified semi-parametric retrieval-augmented generator. To explore this adaptation, we follow Parvez et al. [2021] to evaluate two code generation tasks, constructed by reversing the input and output order of code summarization on Java and Python, using their released deduplicated retrieval codebase. We evaluate our models in three settings: retrieval-based, generative, and retrieval-augmented (RA) generative. In the retrieval-based setting, we activate our encoder to retrieve the top-1 code sample as the prediction given a text query; in the RA generative setting, we append the top-k retrieved samples (k=1 in our work) to the encoder input and activate the decoder.
As shown in Table 8, we found that our CodeT5+ achieves better results in all categories, especially in the retrieval-based and RA generative settings. While the previous SoTA model REDCODER-EXT [Parvez et al., 2021] separately employs GraphCodeBERT as the retriever and PLBART as the generator, our model can be flexibly used as an end-to-end system with both retrieval and generation capabilities. We further include a qualitative example in Fig. 7, where the retrieved code provides crucial context (e.g., the use of "urllib3" for an HTTP request) to guide the generative process toward a more correct prediction. In contrast, the generative-only model gives an incorrect prediction that only captures the concepts of "download" and "compress". Additionally, we analyze the effects of various top-k retrievals on code generation performance (see Appendix C.2).

[Figure 7: Example code generation output: our CodeT5+ retrieval-augmented generation model retrieves relevant code context and uses it to facilitate better code generation.]

Conclusion
We present CodeT5+, a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (encoder-only, decoder-only, and encoder-decoder) to support a wide range of code understanding and generation tasks. To train CodeT5+, we introduce a mixture of pretraining tasks, including span denoising, causal language modeling, contrastive learning, and text-code matching, to learn rich representations from both unimodal code data and bimodal code-text data. Additionally, we propose a simple yet effective compute-efficient training method that initializes our model with frozen off-the-shelf LLMs to efficiently scale up the model, and we explore instruction tuning to further align the model with natural language instructions. Extensive experiments on a broad set of code intelligence tasks over 20 datasets have verified the superiority of our model. Particularly, on the zero-shot HumanEval code generation task, our instruction-tuned CodeT5+ 16B achieves new SoTA results against other open code LLMs.

A Ethics Statement
Advancements in code understanding and generation systems hold immense potential to create positive societal impacts by improving programming accessibility and enhancing developer productivity through natural language interfaces. However, deploying such systems at scale requires careful consideration of various ethical aspects, as extensively discussed by Chen et al. [2021].
One critical concern is the potential risk of generated code summaries or comments incorporating toxic or insensitive language, which can have detrimental effects. Several studies have explored techniques to address this issue, such as reinforcement learning from human feedback [Ouyang et al., 2022], weighted decoding [Krause et al., 2021], and safety-specific control tokens [Xu et al., 2020]. These approaches aim to ensure non-toxic natural language generation, promoting responsible and ethical use of large language models for code.
Additionally, it is essential to recognize the broader intellectual property implications of code generation and retrieval systems before deployment. Deep learning models generating code may inadvertently introduce security vulnerabilities. To mitigate this risk, it is crucial to conduct expert reviews and rigorous security assessments before adopting such code. This review process ensures that the generated code meets necessary security standards, safeguarding against potential exploits and vulnerabilities. In code retrieval scenarios, providing appropriate attribution to the source along with the retrieved results is paramount. This attribution not only respects the rights of code authors but also enhances transparency, traceability, and collaboration within the programming community. By acknowledging the original authors and promoting a collaborative, ethical, and legally compliant environment, code retrieval systems can foster knowledge sharing and contribute to a reputable programming ecosystem.
By considering these ethical considerations, we can promote the responsible deployment of large language models for code, maximizing their potential benefits while mitigating potential harms to individuals, communities, and the overall software ecosystem. It is imperative to prioritize safety, nontoxicity, intellectual property rights, security, and collaboration in the development and deployment of these systems, ensuring they align with ethical principles and societal needs.

B Bimodal Pretraining Details
To expose the model to a more diverse set of pretraining data, we employ a stage-wise pretraining process: we first train CodeT5+ on large-scale code-only data with span denoising and causal language modeling (CLM) tasks, then train on a smaller set of text-code bimodal data using text-code contrastive learning, matching, and causal LM tasks. Below, we provide detailed formulas for the text-code contrastive learning and matching tasks in the second-stage pretraining on text-code pairs.
Text-Code Contrastive Learning activates the encoder to learn better unimodal (text/code) representations by computing a similarity score such that parallel text-code pairs have higher scores. Given a text $T$ and a code $C$, we first learn representations $h^t$ for $T$ and $h^c$ for $C$ by mapping the [CLS] embeddings from the encoder to normalized lower-dimensional (256-d) representations. Given a batch of $N$ text-code pairs, we obtain text vectors $\{h^t_i\}_{i=1}^{N}$ and code vectors $\{h^c_i\}_{i=1}^{N}$ to compute text-to-code and code-to-text similarities:

$$s^{t2c}_{i,j} = {h^t_i}^{\top} h^c_j, \qquad s^{c2t}_{i,j} = {h^c_i}^{\top} h^t_j,$$

$$p^{t2c}_i(T) = \frac{\exp(s^{t2c}_{i,i}/\tau)}{\sum_{j=1}^{N} \exp(s^{t2c}_{i,j}/\tau)}, \qquad p^{c2t}_i(C) = \frac{\exp(s^{c2t}_{i,i}/\tau)}{\sum_{j=1}^{N} \exp(s^{c2t}_{i,j}/\tau)},$$

where $s^{t2c}_{i,j}$ represents the text-to-code similarity between the text of the $i$-th pair and the code of the $j$-th pair, $s^{c2t}_{i,j}$ is the corresponding code-to-text similarity, and $\tau$ is a learned temperature parameter. $p^{t2c}_i(T)$ and $p^{c2t}_i(C)$ are the softmax-normalized text-to-code and code-to-text similarities for the $i$-th text and code.
Let $y^{t2c}(T)$ and $y^{c2t}(C)$ denote the ground-truth one-hot similarities, where negative pairs have a probability of 0 and the positive pair has a probability of 1. The text-code contrastive loss over a corpus $D$ of text-code pairs is defined as the cross-entropy $\mathrm{H}$ between $p$ and $y$:

$$\mathcal{L}_{tcc} = \frac{1}{2}\,\mathbb{E}_{(T,C)\sim D}\big[\mathrm{H}\big(y^{t2c}(T),\, p^{t2c}(T)\big) + \mathrm{H}\big(y^{c2t}(C),\, p^{c2t}(C)\big)\big].$$

Text-Code Matching activates the decoder with a bimodal matching functionality to predict whether a pair of text and code is positive (matched) or negative (unmatched). We employ the output embedding of the [EOS] token as the fused bimodal representation for a text-code pair $(T, C)$, as this token attends to all the previous context of the text-code pair input. Followed by a linear layer and softmax, we compute a two-class probability $p^{tcm}(T, C)$ and define the matching loss:

$$\mathcal{L}_{tcm} = \mathbb{E}_{(T,C)\sim D}\,\mathrm{H}\big(y^{tcm}(T, C),\, p^{tcm}(T, C)\big),$$

where $y^{tcm}(T, C)$ is a 2-dimensional one-hot vector representing the ground-truth label.
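As a concrete reference, the symmetric text-code contrastive objective can be sketched in pure Python. This is a toy version over plain lists of already L2-normalized vectors (the actual implementation operates on batched GPU tensors, and `contrastive_loss` is an illustrative name):

```python
import math

def contrastive_loss(text_vecs, code_vecs, tau=0.07):
    """Symmetric InfoNCE over N paired text/code vectors.

    The positive for query i is key i; all other in-batch keys are negatives.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(text_vecs)
    loss = 0.0
    # Average the text-to-code and code-to-text directions.
    for queries, keys in ((text_vecs, code_vecs), (code_vecs, text_vecs)):
        for i in range(n):
            logits = [dot(queries[i], keys[j]) / tau for j in range(n)]
            # Numerically stable log-partition.
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += -(logits[i] - log_z)  # cross-entropy with positive at j=i
    return loss / (2 * n)
```

The loss decreases as each text vector becomes more similar to its paired code vector than to the other in-batch codes, which is exactly the behavior the softmax-normalized similarities above encode.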
Text-Code Causal LM. This task focuses on a cross-modal causal LM objective between text and code through a dual multimodal conversion: text-to-code generation and code-to-text generation (i.e., code summarization). Let $\mathcal{L}_{t2c}$ and $\mathcal{L}_{c2t}$ denote the losses for text-to-code and code-to-text generation, respectively. The full second-stage pretraining loss of our CodeT5+ is:

$$\mathcal{L}_{stage2} = \mathcal{L}_{tcc} + \mathcal{L}_{tcm} + \mathcal{L}_{t2c} + \mathcal{L}_{c2t}.$$

C Additional Experimental Results
In this section, we provide additional experimental results including two understanding tasks of code defect detection and clone detection from the CodeXGLUE benchmark  (Appendix C.1), analysis of the effects of top-k retrievals in retrieval-augmented code generation tasks (Appendix C.2), and more qualitative results in math programming tasks (Appendix C.3).

C.1 Code Defect Detection and Clone Detection from CodeXGLUE
Defect detection is to predict whether a code sample is vulnerable to software systems or not, while clone detection aims to measure the similarity between two code snippets and predict whether they share a common functionality. We use benchmarks from CodeXGLUE and use accuracy and F1 score as the metrics. In Table 9, we can see that CodeT5+ models achieve a new SoTA accuracy of 66.7% on the defect detection task. For the clone detection task, our model achieves results comparable to SoTA models, where performance gains tend to be saturated, as observed from the close performance gaps among multiple baselines.

C.2 Effects of Top-k Retrievals in Retrieval-augmented Code Generation

We further conduct an ablation study to analyze the effects of top-k retrievals in retrieval-augmented code generation tasks and report the results in Table 10. We found that increasing the number of retrievals boosts model performance, which becomes saturated at k=5. This saturation is due to the maximum sequence length of 600, which might not accommodate a large number of retrieved code samples. Overall, our CodeT5+ significantly outperforms the prior SoTA baseline, which uses top-10 retrievals, in all cases, even with only a top-1 retrieved code.

C.3 Qualitative Results in Math Programming tasks
For the math programming tasks, we provide qualitative examples predicted by our models in Fig. 8 and Fig. 9. Overall, we found that CodeT5+ is able to generate decent programs that solve math problems at various levels of difficulty, from simple math operations to more complex problems requiring multiple reasoning steps. From the rightmost example of Fig. 9, we found that CodeT5+ is able to leverage external libraries such as math when synthesizing solutions.

D.1 Text-to-Code Retrieval
Text-to-code retrieval (or code search) is the task of finding the code sample most relevant to a natural language query from a collection of code candidates. We experiment with CodeT5+ on three benchmarks: CodeSearchNet (CSN), CosQA, and AdvTest.
CosQA and AdvTest are two related benchmarks derived from the CSN data. Specifically, instead of natural language queries, CosQA uses logs from the Microsoft Bing search engine as queries, each annotated by 3 human annotators. AdvTest is created from the Python split of the CSN data, but the code samples are normalized with obfuscated variable names to better evaluate the understanding abilities of current models.
For training, we set the maximum sequence lengths to 350 and 64 for code and text, respectively. We set the learning rate to 2e-5 and finetune the model for 10 epochs. We employ distributed training on 8 A100s with a total batch size of 64. For the momentum encoders, we maintain separate text/code queues with a size of 57,600, and allow the matching decoder to retrieve 64 hard negatives from the queues for hard negative mining.
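The momentum queues above can be sketched as fixed-size FIFOs of past embeddings that serve as extra negatives beyond the current batch (a simplified stand-in for the MoCo-style mechanism; `FeatureQueue` and its methods are illustrative names, and real queues store momentum-encoder outputs):

```python
from collections import deque

class FeatureQueue:
    """Fixed-size FIFO of past text/code embeddings used as negatives."""

    def __init__(self, maxlen=57600):
        # deque with maxlen silently drops the oldest entries on overflow.
        self.buf = deque(maxlen=maxlen)

    def enqueue(self, batch_vecs):
        """Add the current batch's embeddings after each training step."""
        self.buf.extend(batch_vecs)

    def negatives(self, n):
        """Return up to n of the most recently enqueued embeddings."""
        return list(self.buf)[-n:]
```

Hard negative mining would then rank the queued embeddings by similarity to the query and pass the closest ones to the matching decoder; the FIFO here only shows the storage side.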

D.2 Code Summarization
Code summarization is the task of generating a natural language summary of a code snippet. We use the task dataset from CodeXGLUE, which curated a code summarization benchmark from the CSN data [Husain et al., 2019]. The benchmark covers six PLs: Ruby, JavaScript, Go, Python, Java, and PHP. It is the same clean version of the CSN data that we use for the text-to-code retrieval tasks. For training, we set the maximum sequence lengths of the source and target to 256 and 128, respectively. We use a learning rate of 2e-5 and a batch size of 64 for 10 epochs of finetuning. We set the beam size to 5 during inference.

D.3 Code Defect Detection
Defect detection is the task of classifying whether a code sample contains vulnerability points or not. We adopt the defect detection benchmark from CodeXGLUE, which curated data from the Devign dataset [Zhou et al., 2019]. The dataset contains in total more than 27,000 annotated functions in the C programming language. All samples are collected from popular open-source projects such as QEMU and FFmpeg. Following prior work, we adopt 80%/10%/10% of the dataset as the training/validation/test split. For training, we set the learning rate to 2e-5, the batch size to 32, and the max sequence length to 512, and finetune the model for 10 epochs.

D.4 Code Clone Detection
The task of clone detection aims to detect whether two code samples have the same functionality or semantics. We conduct experiments using the clone detection benchmark from CodeXGLUE, which is curated from the BigCloneBench dataset [Svajlenko et al., 2014]. For finetuning, we set the learning rate to 2e-5 and finetune the model for 2 epochs, with the batch size set to 10 and the max sequence length to 400.
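For reference, the F1 score used for clone detection (label 1 = clone pair) can be sketched as follows (a minimal binary version; the function name is illustrative):

```python
def f1_score(preds, golds):
    """Binary F1 over parallel prediction/gold label lists (1 = clone)."""
    tp = sum(1 for p, g in zip(preds, golds) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(preds, golds) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(preds, golds) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Unlike accuracy, F1 is insensitive to the large number of easy non-clone pairs, which is why it is the conventional metric on this benchmark.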

D.5 Code Completion
In code completion, given a source sequence containing a partial code sample, a model is required to generate the remaining part of the code sample. We conduct experiments on line-level code completion using two major benchmarks: PY150 [Raychev et al., 2016] and JavaCorpus [Allamanis and Sutton, 2013]. PY150 consists of 150,000 Python source files collected from GitHub. From the test set of PY150, 10,000 samples from different files were selected, with lines randomly sampled to be predicted for the code completion task. The average numbers of tokens in the source and target sequences are 489.1 and 6.6, respectively. JavaCorpus contains over 14,000 Java projects collected from GitHub. Similarly to PY150, 3,000 samples from different files in the test set were selected, with lines randomly sampled to be predicted. The average numbers of tokens in the source and target sequences are 350.6 and 10.5, respectively.
For both tasks, we set the learning rate as 2e-5 and batch size as 32, and set the maximum sequence length of 1024 for the decoder. We finetune the model for 30 epochs. During inference, we employ beam search with a beam size of 5.

D.6 Math Programming
Math programming is the task of solving math problems with programming. Compared to conventional code generation tasks, this task focuses more on computational reasoning skills, and its problem descriptions are also more complex. We employ two major benchmarks for this task: MathQA-Python [Austin et al., 2021] and GSM8K (GradeSchool-Math) [Cobbe et al., 2021].
MathQA-Python [Austin et al., 2021] is developed from the MathQA dataset [Amini et al., 2019] where given a mathematical problem description in natural language, a system is required to solve this problem via generating a program that returns the final answer. Austin et al. [2021] translated these programs into Python programs and filtered for cleaner problems. In total, MathQA-Python contains ∼24,000 problems, including 19,209/2,822/1,883 samples for training/validation/test splits.
GSM8K focuses on problems of moderate difficulty that an average grade school student should be able to solve. In total, the GSM8K data contains 8,500 problems, divided into 7,500 training and 1,000 test problems. We translated the solutions described in natural language into Python programs by following the construction process of MathQA-Python by Austin et al. [2021]. In this way, we successfully converted 5,861 of the 7,500 training samples.
For training, we set the maximum sequence lengths of the source and target to 256 and 256 for MathQA-Python, and to 246 and 138 for GSM8K-Python. We use a learning rate of 2e-5 and a batch size of 32 for 30 epochs of finetuning. During inference, we employ a beam size of 5 to obtain pass@1 results. For pass@80 and pass@100, we found the results to be quite sensitive to the diversity of the generations, so we employ nucleus sampling with a temperature of 1.2 and top-p = 0.95.
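The nucleus (top-p) sampling used here can be sketched as follows. This is a minimal single-step version over an explicit probability list; actual decoding applies it per generated token to temperature-scaled softmax outputs, and `nucleus_sample` is an illustrative name:

```python
import random

def nucleus_sample(probs, p=0.95, rng=random.random):
    """Top-p sampling: keep the smallest prefix of tokens (by descending
    probability) whose cumulative mass reaches p, renormalize, and sample."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Sample proportionally within the kept nucleus.
    total = sum(probs[i] for i in kept)
    r = rng() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

A higher temperature flattens the distribution before this step, enlarging the nucleus and increasing sample diversity, which is why it helps pass@80/pass@100 more than pass@1.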

D.7 Retrieval-augmented Code Generation
Developers often search for relevant code snippets from sources on the web such as GitHub or StackOverflow as references to aid their software development process. Motivated by this behaviour, we explore a retrieval-augmented code generation setting, where, given a natural language description, a retriever first retrieves similar candidates from a search codebase and then augments the input for the generator to produce the target code. Such a retrieval-augmented generation (or retrieve-then-generate) paradigm has been widely used in open-domain question answering [Karpukhin et al., 2020] in NLP and recently extended to some code-related tasks such as code generation and summarization [Parvez et al., 2021] with significant improvements. As our CodeT5+ is capable of both retrieval and generation, it can be seamlessly adapted as a unified retrieval-augmented generator. This brings unique benefits such as lower computational cost compared to prior work that employs a separate retriever and generator. We evaluate CodeT5+ on two Java and Python code generation datasets from the CodeXGLUE benchmark [Lu et al., 2021].
Specifically, we leverage the encoder to encode the code snippets in the retrieval base and build a search index with the faiss library [Johnson et al., 2019]. The search index is a set of representations (of 256 dimensions) for all the code snippets in the retrieval codebase. Let (x_i, y_i) denote one training instance, where x_i is the input text description and y_i is the corresponding target code snippet. We employ the same encoder to obtain the embedding of x_i and retrieve the top-k similar code samples from the search base using the L2 similarity metric, with k being a hyperparameter. We ensure that the training example's target string (y_i) is not present in any of these k retrieved samples.
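The index search can be approximated in pure Python as follows (faiss performs the same flat L2 nearest-neighbor search, but vectorized over the whole index; `topk_l2` is an illustrative stand-in):

```python
def topk_l2(query_vec, index_vecs, k):
    """Return the indices of the k nearest index vectors by L2 distance."""
    def sq_l2(a, b):
        # Squared L2 distance; monotonic in true L2, so ranking is identical.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    order = sorted(range(len(index_vecs)),
                   key=lambda i: sq_l2(query_vec, index_vecs[i]))
    return order[:k]
```

In the actual pipeline, `index_vecs` would be the 256-d encoder representations of the retrieval codebase and `query_vec` the embedding of x_i, with the retrieved indices mapped back to code snippets.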
After retrieving the top-k relevant code samples, we join them with a special token [SEP] and concatenate them to the end of the source input x_i. Unlike Parvez et al. [2021], we do not augment docstrings or text descriptions and only augment the code snippets for simplicity. We then finetune CodeT5+ on this augmented dataset. During inference, we retrieve similar code samples from the search base and append them to the input x_i. For training, we set the maximum sequence lengths of the source and target to 600 and 320, respectively. We use a learning rate of 2e-5 and a batch size of 32 to finetune the model for 10 epochs. We set the beam size to 5 during inference with beam search.
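The input augmentation step can be sketched as follows (a simplified string-level sketch; the real pipeline truncates at the token level with the model's tokenizer, and `build_ra_input` is an illustrative name):

```python
SEP = "[SEP]"

def build_ra_input(query: str, retrieved: list, max_len: int = 600) -> str:
    """Append top-k retrieved code snippets to the text query, separated
    by [SEP], truncating to a maximum (character) budget."""
    parts = [query]
    for code in retrieved:
        parts.append(SEP)
        parts.append(code)
    return " ".join(parts)[:max_len]
```

The augmented string is then fed to the encoder, and the decoder generates the target code conditioned on both the query and the retrieved context.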