Towards Low-Resource Automatic Program Repair with Meta-Learning and Pretrained Language Models



Introduction
Program repair is critical to improving the productivity and stability of software development. However, it is resource-consuming and cost-prohibitive (Weiß et al., 2007; Planning, 2002; Jørgensen and Shepperd, 2007). A reliable automatic program repair (APR) system is thus crucial to reduce manual debugging efforts and development time (Gazzola et al., 2019; Winter et al., 2023).

Figure 1: Illustration of our Meta-APR framework with CodeT5 for low-resource error-specific automatic program repair. We first meta-train CodeT5 on high-resource bugs, where the backbone model is updated via gradient descent with respect to θ. After that, Meta-APR is finetuned on the target low-resource bugs using few-shot examples (10, 50, 100).
With the advances in deep learning (DL) models (Vaswani et al., 2017) and accessibility to large code corpora (Tufano et al., 2019; Lu et al., 2021), neural approaches to APR have achieved remarkable performance by exploiting existing code patches (Chen et al., 2021b; Zhu et al., 2021). These models are typically trained and evaluated on datasets that comprise a mix of various error types, which are diverse in nature: they vary in the number of bug-fix pairs per error type and are typically imbalanced. Moreover, the performance gaps across different error types are tremendous (Berabi et al., 2021), which can significantly impair the APR models' performance.
These observations motivate us to consider the following idea: rather than training the model jointly on all error types, could we train a model that is quickly adaptable to any low-resource error-specific APR task? Inspired by the success of meta-learning on low-resource NLP tasks like machine translation (Gu et al., 2018; Park et al., 2021) and dialogue generation (Mi et al., 2019; Lin et al., 2021), in this work we drive pioneering efforts in formalizing low-resource APR and propose an effective meta-learning framework that utilizes a code pretrained model to enhance APR performance.
Low-resource APR formulation: Unlike traditional APR approaches that jointly learn a model from a mix of error types, our formulation considers each rare error type as a low-resource target task. Accordingly, we create datasets specifically to support the evaluation of low-resource error-specific APR, based on three practical APR benchmarks in various programming languages: TFix in JavaScript (Berabi et al., 2021), ManySStuBs4J in Java (Karampatsis and Sutton, 2020), and TSSB-3M in Python (Richter and Wehrheim, 2022). We observe diverse and imbalanced error type distributions in these benchmarks, e.g., TFix, ManySStuBs4J, and TSSB-3M respectively have 31, 8, and 9 error types that are low-resource¹, along with 21, 6, and 14 high-resource error types.
Meta-learning for low-resource APR: To better address the task distribution issue while adapting the model to low-resource (synonymously, few-shot) tasks, we propose a novel meta-learning approach integrated with pretrained code language models. To the best of our knowledge, this is the first work to study low-resource error-specific APR. We build Meta-APR with a code-aware pretrained encoder-decoder Transformer model, CodeT5 (Wang et al., 2021), and an efficient first-order meta-learning algorithm, Reptile (Nichol et al., 2018), for the challenging low-resource APR tasks. Fig. 1 illustrates the overview of our Meta-APR approach. Specifically, we first meta-train a CodeT5-based APR model on high-resource bug-fix pairs to learn a better model initialization that captures error-specific knowledge, which enables faster adaptation to the target low-resource bugs via finetuning on few-shot examples. In our experiments, we show that Meta-APR effectively aligns the representations between high-resource and low-resource bugs so that they have a closer distance in the representation vector space.
We extensively evaluate Meta-APR on three curated low-resource multilingual APR benchmarks with different degrees of low-resource settings, i.e., different numbers of training samples (10, 50, 100).
We show that Meta-APR significantly outperforms the standard transfer-learning method in all settings. As Meta-APR is a model-agnostic framework that can be integrated with any other DL model, we also compare its performance when integrated with other pretrained models like UniXcoder (Guo et al., 2022). Our results demonstrate that Meta-APR consistently enhances performance. Further analysis confirms that Meta-APR is a more robust and effective approach to fixing bugs with various buggy patch lengths and error types.
We further compare with closed-source language models such as ChatGPT (OpenAI) in fixing these low-resource bugs. We find that Meta-APR achieves much better performance than ChatGPT under zero-shot/few-shot settings. Besides, we observe that ChatGPT often predicts "no bugs", probably because it does not capture the fixing patterns of these low-resource bugs well due to their data scarcity.
Besides, learning-based approaches (Chen et al., 2021b; Lutellier et al., 2020; Jiang et al., 2021) have been shown to achieve promising results by learning fix patterns from previous bug-fix pairs in an end-to-end manner. Motivated by the success of Neural Machine Translation (NMT) in the NLP domain, one notable learning-based APR method is formulated as a sequence-to-sequence generation problem (Tufano et al., 2019), which aims to translate buggy code into its correct version. This technique is further enhanced by using pretrained models such as T5 (Raffel et al., 2020) in Berabi et al. (2021). In this work, we propose to exploit the pretrained code-aware CodeT5 (Wang et al., 2021), following Bui et al. (2022); Wang et al. (2023a).
Meta-Learning for Low-Resource Tasks Meta-learning has been well studied for few-shot learning as a learning-to-learn approach, which attempts to learn new concepts based on past experiences (Bengio et al., 2013; Vilalta and Drissi, 2002). Recently, optimization-based techniques have yielded substantial improvements in many low-resource NLP tasks (Zhao et al., 2022). Among them, Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017) has been widely used to tackle low-resource NLP tasks such as machine translation (Gu et al., 2018; Park et al., 2021), dialogue generation (Mi et al., 2019; Lin et al., 2021), and text-to-speech alignment (Lux and Vu, 2022). MAML has shown exceptional efficacy in learning a good parameter initialization for fast adaptation with limited resources.
Recently, meta-learning approaches have been adapted to solve low-resource software intelligence tasks such as code summarization (Rauf et al., 2022; Xie et al., 2022) and code search (Chai et al., 2022). To the best of our knowledge, we are the first to formulate the low-resource error-specific APR task based on error type distributions and to investigate the effectiveness of meta-learning methods. In addition, unlike prior approaches that mostly use the second-order meta-learning algorithm MAML, we exploit a more efficient first-order meta-learning method, Reptile (Nichol et al., 2018). In the ablation studies, we show that it outperforms MAML in various low-resource settings.

Approach
Fig. 1 illustrates the overview of our proposed Meta-APR, a meta-learning framework that leverages a code pretrained model for low-resource error-specific APR. We first formulate the task of low-resource error-specific APR in §3.1. Then, we describe our error-specific Meta-APR dataset creation in §3.2 and our Meta-APR method in §3.3.

Task Formulation
Assume that we have a set of error types T = {T_1, T_2, ..., T_n}. Each error type T_i is associated with a collection of bug-fix pairs D_i = {(B_j, F_j)}_{j=1}^{|D_i|}, where (B_j, F_j) denotes the j-th bug-fix pair. For the error types in T, we define their resourceness based on the total number of bug-fix pairs |D_i|. Considering the actual data distribution across the three benchmarks, we select an empirical cutoff of 1,000 instances: an error type is identified as low-resource if it has fewer than 1,000 samples; otherwise, we treat it as high-resource.
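As a concrete illustration, the resourceness split might be sketched as below; the 1,000-sample cutoff follows the text, while the error-type names and counts are hypothetical:

```python
# Split error types into high-/low-resource groups by their
# number of bug-fix pairs, using the empirical cutoff of 1,000.
CUTOFF = 1000

def split_by_resource(type_counts):
    """type_counts: dict mapping error type -> number of bug-fix pairs."""
    low = {t for t, n in type_counts.items() if n < CUTOFF}
    high = {t for t, n in type_counts.items() if n >= CUTOFF}
    return low, high

# Hypothetical counts, for illustration only.
counts = {"no-undef": 25000, "no-unsafe-negation": 320, "no-dupe-keys": 980}
low, high = split_by_resource(counts)
```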
Given a buggy code B_j, the APR model f_θ learns to generate the fixed code F_j in an autoregressive manner. Formally,

p(F_j | B_j) = ∏_{k=1}^{N} p(F_{j,k} | F_{j,1:k-1}, B_j),

where F_{j,1:k-1} is the previously generated sequence at the k-th token and N denotes the total number of tokens in the target sequence. Then, we apply the first-order meta-learning algorithm Reptile (Nichol et al., 2018) to update f_θ via gradient descent. After that, the model f_θ is finetuned on a low-resource error type with few-shot examples. The underlying idea of Meta-APR is to meta-train a model on high-resource error types such that it is quickly adaptable to low-resource types with few-shot examples.

Low-resource APR Dataset Construction
As there are no available low-resource APR benchmarks for evaluation, we curate three low-resource APR datasets in various low-resource settings from three existing APR benchmarks with error type annotations: TFix in JavaScript (Berabi et al., 2021), ManySStuBs4J in Java (Karampatsis and Sutton, 2020), and TSSB-3M in Python (Richter and Wehrheim, 2022). As mentioned in §3.1, we define the low-resource error types based on the actual counts of their associated bug-fix pairs (< 1,000). To construct more challenging low-resource scenarios, we randomly select 10, 50, and 100 samples from each low-resource error type. Following common practice (Gao et al., 2021), we repeat this few-shot sampling process with five different random seeds (13, 21, 42, 87, and 100). For evaluation, we report the averaged results over the five seeds to reduce random noise.
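The few-shot sampling procedure could look roughly like this; the shot sizes and seeds come from the text, while `sample_few_shot` and the illustrative pool of pairs are our own:

```python
import random

SEEDS = [13, 21, 42, 87, 100]   # five random seeds from the paper
SHOTS = [10, 50, 100]           # low-resource settings

def sample_few_shot(pairs, shot, seed):
    """Draw `shot` bug-fix pairs from one low-resource error type,
    deterministically for a given seed."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(shot, len(pairs)))

# Illustrative pool of 200 bug-fix pairs for a single error type.
pool = [(f"bug_{i}", f"fix_{i}") for i in range(200)]
splits = {(shot, seed): sample_few_shot(pool, shot, seed)
          for shot in SHOTS for seed in SEEDS}
```

Reporting the average over the five seeded splits then mirrors the evaluation protocol described above.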

Model-Agnostic Meta-APR Framework
Base APR Model CodeT5 (Wang et al., 2021) is a unified code-aware encoder-decoder Transformer model pretrained on a large-scale source code corpus covering eight programming languages. CodeT5 has been shown to achieve SoTA performance in many code understanding and generation tasks such as defect detection and program refinement. In this work, we propose to adopt CodeT5 as the base model of our Meta-APR to leverage its strong code understanding capability.
θ'_s = θ − α ∇_θ L(f_θ; B_s^support),
θ ← θ + β (1/M) Σ_s (θ'_s − θ),

where θ denotes the global model parameters and θ'_s the local error-specific model parameters; α and β denote the learning rates of the inner loop and outer loop, respectively; and L denotes the cross-entropy loss function. The error-specific local gradients are grouped every M steps to update the global APR model parameters θ. The meta-training procedure of our Meta-APR is summarized in Algorithm 1. In the low-resource setting, we set the size of the support and query sets to 10, and we leverage the support sets for the inner-loop update. The query sets are used to track the meta-loss and are not involved in parameter updating.
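To make the inner/outer-loop interplay concrete, here is a toy first-order Reptile sketch on synthetic scalar tasks; it is not the actual CodeT5 training loop, and the quadratic losses merely stand in for the per-task cross-entropy:

```python
# Toy first-order Reptile: each "task" is a quadratic loss
# L_s(theta) = 0.5 * (theta - c_s)^2 with task-specific optimum c_s.
def reptile_step(theta, task_optima, alpha=0.1, beta=0.5, inner_steps=5):
    deltas = []
    for c in task_optima:          # one inner loop per error-specific batch
        local = theta
        for _ in range(inner_steps):
            grad = local - c       # d/dtheta of 0.5 * (theta - c)^2
            local -= alpha * grad  # inner-loop SGD with learning rate alpha
        deltas.append(local - theta)
    # Outer update: move theta toward the averaged task-adapted parameters.
    return theta + beta * sum(deltas) / len(deltas)

theta = 0.0
for _ in range(100):               # meta-train over two tasks with optima 1 and 3
    theta = reptile_step(theta, [1.0, 3.0])
```

After meta-training, theta settles near 2.0, the point equidistant from both task optima, illustrating how Reptile finds an initialization from which each task is reachable in a few gradient steps.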
Low-Resource APR Adaptation After the meta-training, we adapt Meta-APR to the target low-resource APR tasks by directly finetuning the meta-learned global APR model on few-shot training samples. The meta-learned APR model is expected to capture error-specific knowledge by providing a better model initialization, which enables faster adaptation to fix low-resource bugs. In finetuning, the objective is to minimize the cross-entropy loss between model predictions and ground-truth fixes.
Experimental Setup

Error-Specific APR Dataset

ManySStuBs4J (Karampatsis and Sutton, 2020) has small and large versions comprising 10,231 and 63,923 bug-fix pairs respectively in Java. It is organized at the level of single-statement changes for each bug-fix pair. We consider ManySStuBs4J-large with 14 error types in this work.
TFix (Berabi et al., 2021) is a large-scale program repair dataset that consists of a ground-truth repair patch for each buggy patch in JavaScript. It focuses on syntax and stylistic errors from open-source GitHub commits, comprising 104,804 bug-fix pairs. Among them, 52 error types are detected by the static analyzer ESLint² (Tómasdóttir et al., 2020).
² https://eslint.org/

Data Preprocessing As discussed in §3.2, we process all three benchmarks to create high-resource and low-resource APR tasks based on the number of bug-fix pairs in each error type. The data statistics are reported in Table 1. We further provide bug-fix examples for each benchmark in Fig. 2. To prepare the source input to Meta-APR, we follow Berabi et al. (2021) and combine the error type, error message, and error context into a single piece of text in the following format:

fix {error type} {error message} {error context}

where the error context consists of the given localized error line and its two neighboring code lines, forming a buggy code patch. The corresponding fixed line is used as the target sequence.
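The input construction might be sketched as follows; `build_source` is a hypothetical helper, and the exact window handling in the released code may differ:

```python
def build_source(error_type, error_message, lines, error_line_idx):
    """Build the Meta-APR source text: the localized error line plus its
    two neighboring lines form the error context."""
    lo = max(0, error_line_idx - 1)          # line above, if any
    hi = min(len(lines), error_line_idx + 2) # line below, if any
    context = "\n".join(lines[lo:hi])
    return f"fix {error_type} {error_message} {context}"

src = build_source(
    "no-undef", "'foo' is not defined.",
    ["var x = 1;", "foo(x);", "return x;"], 1)
```

The corresponding fixed line would then serve as the target sequence for the seq2seq model.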

Metrics and Baselines
Metrics Following common practice (Berabi et al., 2021), we use Exact Match (EM) accuracy to measure APR performance. Specifically, EM requires the prediction to be identical to the ground-truth fix, which reflects how well model predictions align with historical correct fixes from human developers. EM is commonly utilized to uphold correctness standards, especially when static analyzers or unit tests are not available.
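A minimal EM implementation could look like this; the whitespace stripping is our assumption, as the text does not specify any normalization:

```python
def exact_match(preds, golds):
    """Fraction of predictions identical to the ground-truth fixes,
    compared after stripping surrounding whitespace."""
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)

# One of two predictions matches its ground truth exactly.
acc = exact_match(["x = 1;", "y = 2;"], ["x = 1;", "y = 3;"])
```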
Baselines We compare Meta-APR with three learning settings: 1) finetuning only on low-resource bugs; 2) transfer-learning from high-resource to low-resource bugs; and 3) multi-task learning on both high-resource and low-resource bugs, with or without upsampling strategies. Specifically, for the transfer-learning baseline, we first finetune the model on the high-resource training data and then perform another stage of finetuning on the low-resource training data. Under the multi-task learning setting, we jointly finetune our models on a mix of both high-resource and low-resource training data.
Besides, we compare other code pretrained models as the backbone model, including the encoder-only CodeBERT (Feng et al., 2020), the decoder-only UniXcoder (Guo et al., 2022), and the encoder-decoder CodeT5 (Wang et al., 2021). For our Meta-APR method, we perform two ablation studies, replacing either the Reptile meta-learning approach with MAML or the backbone model CodeT5 with UniXcoder, to verify the effectiveness of our design choices.

Implementation Details
We implement Meta-APR in the deep learning framework PyTorch³. We employ CodeT5-base⁴ with 220M parameters as our backbone model. All of our experiments are conducted on a single NVIDIA A100-40GB GPU. We use the library Higher (Grefenstette et al., 2019) to meta-train the model on high-resource error types for 50 epochs with a batch size of 10, where the first 5 instances serve as the support set and the remaining 5 instances as the query set. For inner-loop gradient updates, we use the SGD optimizer with an inner-loop learning rate α of 1e-4. For the global gradient updates, we use the AdamW optimizer (Loshchilov and Hutter, 2019) and set the outer-loop learning rate β to 5e-5. Moreover, in the meta-training stage, we warm up the first 1,000 steps with a linear decay. The meta update step size M is set to 150, 20, and 150 for TFix, ManySStuBs4J, and TSSB-3M respectively. For low-resource APR adaptation, we finetune the meta-trained model for 50 epochs on low-resource error types with a batch size of 25 and a learning rate of 5e-5. For testing, we select the checkpoint with the best EM on a held-out validation set.
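For reference, the hyperparameters above can be gathered into one configuration sketch; the values are transcribed from this section, while the dictionary layout and key names are ours:

```python
# Hyperparameters of Meta-APR as reported in the paper (key names are ours).
META_APR_CONFIG = {
    "backbone": "codet5-base",          # 220M parameters
    "meta_epochs": 50,
    "meta_batch_size": 10,              # 5 support + 5 query instances
    "inner_optimizer": "SGD",
    "inner_lr_alpha": 1e-4,
    "outer_optimizer": "AdamW",
    "outer_lr_beta": 5e-5,
    "warmup_steps": 1000,               # linear decay afterwards
    "meta_step_M": {"TFix": 150, "ManySStuBs4J": 20, "TSSB-3M": 150},
    "finetune_epochs": 50,
    "finetune_batch_size": 25,
    "finetune_lr": 5e-5,
}
```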

Experimental Results and Analysis
In this section, we compare Meta-APR with other code pretrained models under different training settings on our curated low-resource error-specific APR tasks from three benchmarks (§5.1), followed by a detailed analysis of the effects of different error types and token lengths (§5.2), and a pilot study comparing with a closed-source large language model, ChatGPT, in fixing these challenging low-resource bugs (§5.3).

Low-Resource APR Performance
Tables 2 to 4 present the exact match (EM) accuracies on the ManySStuBs4J, TSSB-3M, and TFix benchmarks respectively at different low-resource settings. We observe that Meta-APR consistently outperforms other baselines in various few-shot settings across the three benchmarks in different programming languages. Among different models, we find that CodeT5 achieves consistent performance gains over CodeBERT and UniXcoder in most cases, demonstrating that it can serve as a better backbone model for APR tasks with its encoder-decoder architecture. Among different learning paradigms, we find that transfer-learning from high-resource to low-resource bugs and multi-task learning on both yield much better results than directly finetuning on low-resource bugs, validating our assumption that low-resource APR can benefit from the bug-fixing knowledge learned from high-resource bug-fix data. These two approaches generally exhibit comparable performance across benchmarks, and the upsampling strategy often proves helpful in multi-task learning. Overall, our Meta-APR further improves the adaptation from high-resource to low-resource bugs, leading to superior APR performance. Notably, the performance gain of Meta-APR over other learning paradigms becomes more significant when fewer or even no low-resource training samples are available. This implies that Meta-APR learns a better model initialization that captures error-specific knowledge, enabling faster adaptation to the target low-resource error types.
Ablation Study We consider two variants of Meta-APR to verify the design choices in our proposed framework, where "→MAML" means that we replace the first-order meta-learning algorithm with the second-order meta-learning approach MAML, and "→UniXcoder" means that we change the backbone model from CodeT5 to UniXcoder. From the results, we find that both CodeT5 and the first-order meta-learning algorithm are important in enhancing low-resource APR performance, as evidenced by a consistent performance drop for these two variants in most settings across the three benchmarks. Note that Meta-APR's first-order meta-learning is also more efficient than MAML's second-order approach.

Further Analysis
We proceed to analyze the model predictions to better understand how Meta-APR behaves in fixing various bugs compared to other approaches. All results in this section are under the 100-shot setting.

Effect of Bug Sequence Length
We analyze how Meta-APR performs in fixing low-resource bugs with varying numbers of bug tokens. Fig. 3a shows the cumulative fractions of bugs by their number of tokens, grouped by the Meta-APR repair outcome (EM). Comparing the blue and orange lines, we observe that the blue one is consistently above the orange one; if we select a fixed cumulative fraction on the y-axis, the blue line (correct fixes) corresponds to fewer tokens (i.e., shorter bugs) than the orange one (wrong fixes), indicating that the bugs successfully fixed by Meta-APR tend to be shorter than those fixed incorrectly. We further compare Meta-APR with other training strategies, based on the same CodeT5 backbone, in fixing bugs of various lengths in Fig. 3b. We observe a monotonic performance decline as the bug sequence length increases for all models, suggesting that shorter bugs are easier to fix, which might be due to their limited complexity.

Effect of Error Type
We further analyze how Meta-APR performs in addressing various error types. Fig. 4 presents the distribution of correct fixes over the low-resource error types. To examine how Meta-APR improves adaptation through error-specific meta-training compared to the default transfer-learning approach, we visualize the embeddings of both high-resource and low-resource bugs after the high-resource finetuning from transfer-learning and from Meta-APR in Fig. 5. We observe that Meta-APR better aligns the representations of high-resource and low-resource bugs, so that they are distributed at a closer distance in the embedding vector space, enabling faster adaptation from high-resource to low-resource bugs with limited training samples.
Case Study We provide three qualitative examples from our multilingual low-resource APR benchmarks in Fig. 2. We find that our Meta-APR is able to fix the bugs using various fix operations such as deletion, boolean conversion, and identifier renaming, while the standard transfer-learning approach fails, simply copying the buggy line as the fixed line. This indicates that Meta-APR enables faster and better adaptation to low-resource APR scenarios.

Comparison with ChatGPT
Recent studies (Prenner and Robbes, 2021; Joshi et al., 2022) have shown that large language models (LLMs) are capable of bug fixing in zero-shot/few-shot settings. To investigate their performance in fixing challenging low-resource bugs, we use ChatGPT (GPT-3.5-Turbo⁵) and evaluate it on 80 randomly sampled test bug-fix pairs for each benchmark. As illustrated in Fig. 6, we construct the zero-shot prompt to provide the code context and its buggy line, together with the instruction "fix the buggy line:". Besides, we randomly select one bug-fix pair from the same error type to design the one-shot prompt for in-context learning.
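The prompt construction might be sketched as below; the instruction string follows the text, while the surrounding format is our reconstruction and the example bug-fix pair is hypothetical:

```python
def zero_shot_prompt(code_context, buggy_line):
    # Provide the code context and its buggy line, then the instruction.
    return (f"{code_context}\n"
            f"buggy line: {buggy_line}\n"
            f"fix the buggy line:")

def one_shot_prompt(example_bug, example_fix, code_context, buggy_line):
    # Prepend one bug-fix pair of the same error type for in-context learning.
    return (f"buggy line: {example_bug}\n"
            f"fixed line: {example_fix}\n\n"
            + zero_shot_prompt(code_context, buggy_line))

p = one_shot_prompt("if (!key in obj)", "if (!(key in obj))",
                    "function has(obj, key) {", "if (!key in obj)")
```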
We report the comparison results in Fig. 7. We observe that Meta-APR significantly surpasses ChatGPT in both zero-shot and one-shot settings across the three tasks. This shows that ChatGPT is still limited in handling challenging low-resource bugs, as it did not see much such bug-fix data during training due to the data scarcity issue. Additionally, we find that the one-shot example is not always beneficial for low-resource APR and might introduce noise compared to the zero-shot setting: it substantially improves performance on TFix but leads to some performance degradation on ManySStuBs4J and TSSB-3M. By inspecting the predictions, we find that ChatGPT often predicts "no bugs", as it might require more semantic information for decision-making. Besides, ChatGPT performs quite well in fixing bugs related to syntax errors such as the error type "no-unsafe-negation", which is fixed by simply adding parentheses to an expression after a negation operator. This is probably because ChatGPT has been pretrained on a large-scale code corpus and understands program syntax well.

Conclusion
In this work, we present Meta-APR, a simple yet effective framework that extends CodeT5 with meta-learning for low-resource APR. It is a model-agnostic framework that can be integrated with any learning-based model. To the best of our knowledge, we are the first to investigate APR in the low-resource setting and to curate error-specific datasets with different low-resource degrees from three APR benchmarks in Python, Java, and JavaScript. Comprehensive experiments have verified the superiority of Meta-APR over other learning strategies with various code pretrained models. Further analysis shows that Meta-APR can better align the representations of high-resource and low-resource bugs, and fix bugs with various sequence lengths and error types. A pilot comparison with ChatGPT further shows that our Meta-APR is more capable of fixing these challenging low-resource bugs.

Limitations
As we are the first to investigate low-resource APR tasks, we curated three datasets with different low-resource degrees (i.e., shot = 10/50/100) from existing APR benchmarks to support our study. Such data construction inherits any data quality issues from the original APR datasets. Besides, the low-resource sub-sampling may introduce some randomness. To mitigate this issue, we performed multiple rounds of random sampling with different seeds and report the average results. Furthermore, to evaluate APR performance, we employ exact match scores to compare the predicted fixes with the ground-truth fixes written by developers, which might fail to capture other correct fixes with different formats and styles.

Ethics Statement
Our work complies with the ACL Ethics Policy. In this work, we construct our datasets using publicly available APR benchmarks, which are widely used to examine program repair performance. We provide detailed procedures to create our low-resource APR datasets and provide proper citations to their source benchmarks. We will publicly release our curated datasets under the same licenses as their source datasets. As an APR tool, one potential risk of Meta-APR is that its predicted fixes cannot be guaranteed to be correct, and directly adopting them without manual checking could pose security risks to software development. We suggest that all fixes receive a manual check from experts before real adoption.

A Appendix
A.1 More Dataset Statistics

We provide detailed statistics of our curated low-resource APR benchmarks in Table 5, Table 6, and Table 7 for ManySStuBs4J, TSSB-3M, and TFix respectively. We observe a very imbalanced error type distribution across these benchmarks.

During the meta-training stage, we randomly sample a batch of bug-fix pairs B_s from the high-resource error types. Each batch B_s is further divided into B_s^support and B_s^query.

High-Resource APR Meta-Training During the meta-training phase, each mini-batch of data simulates the low-resource scenario. In our Meta-APR approach, we iterate through the set of high-resource error types as training tasks to update f_θ. We first merge all high-resource error-specific training datasets into D_h^train, and randomly segment D_h^train into N equally sized batches {B_1, B_2, ..., B_N}. Then, each B_s is further split into B_s^support and B_s^query to form a local error-specific meta-learning task that updates the global APR model f_θ via gradient descent.

Figure 3: (a) Cumulative fraction of programs by number of tokens in the source buggy patch, grouped by whether Meta-APR produces a correct fix. (b) Distribution of correct fixes over number of tokens for low-resource finetuning, transfer-learning from high-resource to low-resource bugs, and our Meta-APR.

Figure 4: Distribution of correct fixes on 9 low-resource error types from ManySStuBs4J. The number of bugs for each type is included in parentheses on the x-axis. The details of error types A-H can be found in Table 5.

Figure 7: Evaluation results of correct fixes on a subset of 80 bugs from the test data across three benchmarks.
Algorithm 1: Meta-Training for APR
Require: A set of high-resource error types T_h = {T_1, T_2, ..., T_n}, where each T_i ∈ T_h is paired with associated bug-fix pairs D_i = {(B_j, F_j)}_{j=1}^{|D_i|}; an APR model f_θ; inner-loop learning rate α; outer-loop learning rate β; meta update step size M
Initialize: θ from the APR model f_θ
Output: Optimal meta-trained APR model f_θ
while not done do
    D_h ← ∅
    forall T ∈ T_h do
        append the training dataset of T to D_h
    end
    randomly divide the merged training dataset D_h into batches B_s
    forall B_s do
        split B_s into B_s^support and B_s^query
        update f_θ on B_s^support via inner-loop gradient descent; every M steps, apply the outer-loop update to θ
    end
end

Table 1: Statistics of our curated error-specific low-resource APR benchmarks. During low-resource finetuning, we randomly sample (10, 50, 100) shots for each error type to construct various low-resource settings.

TSSB-3M (Richter and Wehrheim, 2022) is a dataset of over 3 million isolated single-statement bug fixes across 23 error types. Each bug fix is associated with a commit in an open-source Python project that does not modify source code in other files or statements. We randomly down-sample by 10% for each error type. To facilitate future research in this new field, we release our curated error-specific low-resource APR datasets at https://github.com/wang-weishi/Meta-APR. See Appendix A.1 for more detailed statistics.