Better Language Models of Code through Self-Improvement

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to address this limitation by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into state-of-the-art language models such as CodeT5, CodeBERT, and UniXCoder. The results show that our framework significantly improves PLMCs' performance on code-related sequence generation tasks, such as code summarization and code generation, in the CodeXGLUE benchmark.


Introduction
Pre-trained models for code (PLMCs), such as CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021), CodeT5 (Wang et al., 2021), UniXCoder (Guo et al., 2022), and DISCO (Ding et al., 2022), have become the foundation for solving many practical software engineering tasks such as code summarization, code translation, and program repair. These PLMCs, like large language models (LLMs), are typically first pretrained on very large-scale datasets with a variety of multi-modal objectives in a self-supervised training style. They can then be fine-tuned on task-specific datasets in a supervised training style.
We hypothesize that, while fine-tuned models may not achieve peak performance, PLMCs can produce reasonable outputs that can be regarded as high-quality data because they have been pretrained on large-scale datasets, and that such data can be leveraged as additional high-quality training data. Our framework utilizes the self-improvement capability of PLMCs through a simple data augmentation step. This approach is particularly useful for tasks involving code-related sequence generation, such as code summarization and code generation. Our method involves fine-tuning a PLMC on a downstream dataset, allowing the model to gain knowledge about the task. The model then generates an augmented version of the original training data, which is used for further fine-tuning. Our framework is similar to sequence-level knowledge distillation (Kim and Rush, 2016), but our approach focuses on improving model performance without compressing the model while utilizing the same technique.
Our empirical evaluation results show that our framework improves state-of-the-art PLMCs, including CodeBERT, CodeT5, and UniXCoder, by significant margins. In short, we summarize our contributions as follows.
• We present a simple self-improvement framework and show how it can be easily adapted to PLMCs for the task of code-related sequence generation.
• We conduct extensive evaluation on two tasks: code summarization and code generation, and compare our framework with well-known, state-of-the-art PLMCs. The results show that our framework consistently improves over all PLMCs by a significant margin on those tasks.
• We provide analysis and explanations on how utilizing a simple framework consistently improves the performance of PLMCs.
Our work is publicly available.

Related Work
Exposure bias and hallucination in Sequence Generation Tasks The exposure bias problem refers to the discrepancy between the training and inference phases of auto-regressive sequence generation models. Previous work has attempted to reduce exposure bias in the training phase (Bengio et al., 2015; Ranzato et al., 2015; Wiseman and Rush, 2016; Wang and Sennrich, 2020). In the sense that our self-improvement step involves training the model on its own predictions, our approach is closely related to exposure bias.
Code understanding and generation Code learning problems have recently emerged as one of the primary tasks for assessing the capability of language models. Most recent code models are pretrained on multi-modal objectives before being fine-tuned on specific downstream tasks (Feng et al., 2020; Ahmad et al., 2021; Wang et al., 2021; Guo et al., 2022; Ding et al., 2022).
Knowledge Distillation Knowledge distillation is the process of transferring knowledge from a large, unwieldy model or set of models to a single smaller model that can be practically deployed under real-world constraints; such a smaller model can usually match or even exceed the performance of the original model (Hinton et al., 2015; Kim and Rush, 2016; Wang et al., 2020; Chen et al., 2020; Mukherjee et al., 2021). Since we perform an additional self-improvement step to improve the original model without using external resources, our work is related to knowledge distillation.

Method
Algorithm 1 Data Augmentation Process
Input:
• θ_fine-tuned, the fine-tuned model checkpoint on a specific task T ∈ {code summarization, code generation, etc.}.
• D, the training dataset of task T.
1: procedure Augment(θ_fine-tuned, D)
2:   D̂ ← ∅
3:   for each (x_i, y_i) ∈ D do
4:     L_K ← K-best predictions of θ_fine-tuned on x_i via beam search
5:     ỹ_i ← argmax_{ŷ_ij ∈ L_K} sim(ŷ_ij, y_i)
6:     D̂ ← D̂ ∪ {(x_i, ỹ_i)}
7:   end for
8:   return D̂
9: end procedure
Our method utilizes three distinct sets of model parameters: θ_pre-trained, θ_fine-tuned, and θ_improved. Each corresponds to the state of the model parameters after pre-training, fine-tuning, and self-improvement, respectively. The model generates tokens in an autoregressive manner, progressing token by token.
Typically, models are pretrained on large-scale corpora, resulting in a pre-trained checkpoint θ_pre-trained. These pre-trained models are subsequently fine-tuned on a targeted downstream dataset D using a supervised learning approach, yielding a set of fine-tuned parameters θ_fine-tuned. Our investigation has revealed that the model's performance can be further enhanced if we continue to fine-tune these parameters on an augmented version of D. Figure 1 portrays our proposed self-improvement step within the broader training framework; this step encompasses a data augmentation technique and an additional round of fine-tuning, in addition to the standard pre-training and fine-tuning paradigm.
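Abstractly, the three-stage pipeline can be sketched as below. This is an illustrative skeleton, not our actual implementation: the function names `self_improve`, `fine_tune`, and `augment` and the representation of model parameters are assumptions for exposition.

```python
def self_improve(theta_pretrained, train_data, fine_tune, augment):
    # Stage 1: supervised fine-tuning on the original dataset D,
    # producing theta_fine-tuned.
    theta_fine_tuned = fine_tune(theta_pretrained, train_data)
    # Stage 2: the fine-tuned model builds the augmented dataset D-hat.
    augmented_data = augment(theta_fine_tuned, train_data)
    # Stage 3: continue fine-tuning on D-hat to obtain theta_improved.
    return fine_tune(theta_fine_tuned, augmented_data)
```

The key design point is that the same fine-tuning procedure is reused in stage 3; only the training data changes.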
The process of augmenting the dataset is illustrated in Figure 2, and we provide a detailed algorithm for this procedure in Algorithm 1. For each training pair of sequences (x_i, y_i) in the training dataset D, we employ beam search to generate a list of K-best predictions L_K. This list comprises k predictions, where k represents the beam size. Subsequently, we evaluate the similarity between each prediction ŷ_ij and its corresponding ground-truth sequence y_i using a similarity function sim based on the BLEU score. The prediction with the highest similarity is then selected: ỹ_i = argmax_{ŷ_ij ∈ L_K} sim(ŷ_ij, y_i). Finally, we add the sequence pair (x_i, ỹ_i) to a new, initially empty dataset D̂, which we refer to as the augmented dataset. Essentially, the augmented dataset contains the same number of datapoints as the original training dataset due to the one-to-one mapping during augmentation. Moreover, the augmentation process occurs offline, with each newly augmented datapoint being saved to storage before being used for training in the self-improving phase.
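A minimal sketch of this selection step is shown below. The `ngram_overlap` function is a toy stand-in for the smoothed-BLEU similarity `sim` (a real implementation would call an actual BLEU scorer), and `k_best` is a hypothetical placeholder for the model's beam-search decoder.

```python
from collections import Counter

def ngram_overlap(candidate, reference, n=2):
    # Toy stand-in for sim: the fraction of candidate n-grams
    # (up to bigrams here) that also occur in the reference.
    cand, ref = candidate.split(), reference.split()
    matched, total = 0, 0
    for k in range(1, n + 1):
        c = Counter(tuple(cand[i:i + k]) for i in range(len(cand) - k + 1))
        r = Counter(tuple(ref[i:i + k]) for i in range(len(ref) - k + 1))
        matched += sum((c & r).values())
        total += max(sum(c.values()), 1)
    return matched / total

def augment_dataset(dataset, k_best, sim=ngram_overlap):
    # dataset: list of (x_i, y_i) pairs; k_best(x_i) returns the K
    # beam-search predictions for x_i. For each pair, keep the
    # prediction most similar to the ground truth y_i.
    augmented = []
    for x, y in dataset:
        best = max(k_best(x), key=lambda cand: sim(cand, y))
        augmented.append((x, best))
    return augmented
```

Because the selection is a simple argmax over the beam, swapping in a different similarity function (e.g. CodeBLEU, discussed later) only changes the `sim` argument.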
The subsequent step involves fine-tuning θ_fine-tuned on D̂ until convergence. This results in a new set of model parameters denoted as θ_improved. It is important to note that the index j in ŷ_ij denotes the j-th prediction in the beam, rather than the j-th token of the predicted sequence. Additionally, only the training dataset D is augmented, while the validation and test datasets remain unchanged for evaluation purposes.

Experimental Setup
Our goal is to show that for any fine-tuned model on a sequence generation task (F-PLMC), applying our self-improvement method (S-PLMC) improves the result.

Evaluation Results
The results of our code summarization task are presented in Table 1. The "Beam sizes" column indicates the beam size used in the beam search algorithm, while the "Methods" column indicates whether or not our self-improved algorithm was utilized. We also included other models as references to compare the relative improvement of our model. On average, we observed a 0.76 BLEU score increase in performance across all languages. This improvement was consistent across various beam sizes (1, 5, 10), which confirms the effectiveness of our self-improved approach across a wide range of PLMCs. When comparing our model to other strong baselines, we found that our method improved the performance of CodeBERT for JavaScript from 15.78 to 16.39, surpassing the performance of PolyglotCodeBERT (15.80). This highlights the benefit of our self-improved method in improving weak models. The results of our code generation study are presented in Table 2; the performance increases by 0.81 BLEU score on average. When evaluated with EM and CodeBLEU, the improvement is also consistent.
It is important to note that, while conducting our experiments, we did not selectively choose the most favorable random seed to optimize the performance of each entry. Instead, we utilized the default seed provided in each model repository to ensure fairness and consistency. Our code summarization experiments encompassed six different programming languages, and both the code summarization and code generation experiments were evaluated using three distinct beam sizes. In total, we conducted 60 runs to gather comprehensive results. The reported numbers consistently demonstrate the improvement achieved in each individual run, thereby affirming the robustness of the proposed method.

Ablation Study
Improvement Study In this section, we examine the factors that influence the improvement achieved by θ_improved over θ_fine-tuned through code summarization. We define r_1 as the difference in performance, measured by BLEU, between inference with a beam size of 10 and inference with a beam size of 1. Additionally, we define r_2 as the improvement in BLEU, at the same beam size of 1, between θ_fine-tuned and θ_improved. By evaluating these values across a variety of beam sizes and programming languages in the code summarization dataset, we visualize the results in Figure 3. Additionally, we have calculated the Pearson correlation score, which is 0.77, indicating a strong correlation between r_1 and r_2. Our analysis demonstrates that a larger r_1 is correlated with a better r_2, suggesting that our method is more likely to yield better overall performance when r_1 is large. We believe this is a crucial finding, as it provides a simple indicator of the model's fully trained capability.
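The correlation analysis above can be reproduced with a plain Pearson computation over the per-language (r_1, r_2) pairs. The r_1/r_2 values below are illustrative placeholders, not our measured results:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# r1: BLEU(beam=10) - BLEU(beam=1) for theta_fine-tuned
# r2: BLEU(theta_improved, beam=1) - BLEU(theta_fine-tuned, beam=1)
# Placeholder values for illustration only:
r1 = [0.3, 0.5, 0.9, 1.2, 0.7]
r2 = [0.4, 0.6, 1.1, 1.3, 0.6]
```

A positive `pearson(r1, r2)` on real measurements is what underlies the 0.77 score reported above.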
CodeBLEU as Similarity Function Our primary findings in code generation are presented using BLEU as the similarity function. However, for a more comprehensive assessment of the correctness of the generated code, we consider CodeBLEU, which incorporates the specific characteristics of source code. CodeBLEU therefore aligns better with the objective of measuring similarity in data augmentation than BLEU, which relies on n-gram matching. This section examines the impact of using CodeBLEU as the similarity function in the data augmentation process. We present the results of our UniXCoder self-improving model with both BLEU-augmentation and CodeBLEU-augmentation in Table 3. The results indicate that CodeBLEU-augmentation enhances both BLEU and CodeBLEU scores compared to BLEU-augmentation. This suggests that using CodeBLEU as a similarity function improves the generated code at a local level, encompassing aspects such as fluency, semantics, and syntax. However, it does have a negative impact on exact match (EM). As code problems may not have a unique solution, EM as an evaluation metric should allow for a more lenient assessment. Consequently, we argue that a slight decrease in EM has minimal impact on the actual correctness of the generated solution. Thus, we propose placing greater emphasis on CodeBLEU as an evaluation metric for code generation.

Conclusion
We introduced a self-improvement technique as a final fine-tuning step to enhance model performance. Our experiments showed that this method, when applied to popular pre-trained code models (CodeBERT, CodeT5, and UniXCoder), significantly improves performance on code summarization and code generation tasks. We also provided insights on when this method is most effective in improving PLMCs. We intend to apply our technique to larger-scale models and other tasks, and believe it is an efficient way to optimize the capabilities of any code language model without the need for extensive architecture modifications or large-scale dataset assembly. We leave all of these investigations for the future.

Limitations
We discuss a few limitations of our work. One limitation of Self-Improved is its usage complexity. The data augmentation process involves generating predictions for the entire training dataset with a large beam size, resulting in a time complexity of O(nk), where n is the training dataset size and k is the beam size. Additionally, the fine-tuning step to derive θ_improved also adds a significant amount of computational cost. This limitation is discussed in Section 6, weighing the performance benefits of our method against the computational sacrifices. Another limitation is that Self-Improved has only been applied to encoder-decoder models in this work. However, it is also applicable to other types of auto-regressive models, including decoder-only models, which are commonly used for tasks such as code completion (Radford et al., 2019; Lu et al., 2021; Guo et al., 2022). Examples include the GPT models (Radford et al., 2019; Brown et al., 2020), Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), etc. Further research into these applications is left for future work.

A Data statistics
We include the statistics for the data used in our experiments. We download the data directly from the CodeXGLUE page. CodeXGLUE gathers datasets from multiple sources for each downstream task. For code summarization, the CodeSearchNet (Husain et al., 2019) dataset is used.

B Hyperparameter selection
In the fine-tuning phase, we maintain the model hyperparameter values as specified in the training scripts provided by the official repositories. However, we make a modification by increasing the batch size to fully utilize the memory capacity of an NVIDIA A100 80GB. In the self-improvement phase, we further decrease the learning rate by a factor of 10.
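As a sketch, the two phases differ only in the learning rate. The base values below are illustrative assumptions, not the repositories' actual defaults:

```python
# Illustrative base hyperparameters (assumed values, not the
# official repository defaults).
FINE_TUNE = {"learning_rate": 5e-5, "batch_size": 64}

# Self-improvement phase: identical settings except the learning
# rate, which is divided by 10.
SELF_IMPROVE = {**FINE_TUNE, "learning_rate": FINE_TUNE["learning_rate"] / 10}
```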

Figure 3 :
Figure 3: Scatter plot visualizing the performance gap (in BLEU score) between inference with different beam sizes (i.e., 10 and 1) of θ_fine-tuned vs. the performance gained (in BLEU score) by θ_improved with inference at a beam size of 1.

Table 1 :
Results on code summarization evaluated with smoothed BLEU-4. The "Overall" column presents average scores over six programming languages. The bolded numbers represent the best scores (higher is better) when comparing Baseline and Self-Improved for each model and each value of beam size.

Table 2 :
Results on code generation evaluated with EM, BLEU, and CodeBLEU. The bolded numbers represent the best scores (higher is better for all metrics) when comparing Baseline and Self-Improved for each model and each value of beam size.

Table 3 :
Comparing BLEU and CodeBLEU as similarity function in Data Augmentation Process on UniXCoder.

Table 4 :
Statistics of data used in code summarization. The number of examples for each train/dev/test split is reported in Table 4. The dataset comprises code-text pairs in 6 programming languages. The CONCODE (Iyer et al., 2018) dataset is employed as the benchmark for code generation, with statistics reported in Table 5. It contains pairs of Java member functions and natural language descriptions.

Table 5 :
Statistics of data used in code generation.