Jointly Learning to Repair Code and Generate Commit Message

We propose a novel task of jointly repairing program code and generating commit messages. Code repair and commit message generation are two essential and related tasks in software development, yet existing work usually performs them independently. We construct a multilingual triple dataset of buggy code, fixed code, and commit messages for this novel task. We first introduce a cascaded method with two models: one generates the fixed code, and the other generates the commit message based on the fixed and original code. We enhance the cascaded method with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method. To deal with the error propagation problem of the cascaded method, we also propose a joint model that repairs the program code and generates the commit message in a unified framework. Extensive experiments on our buggy-fixed-commit dataset demonstrate the challenge of this task and show that the enhanced cascaded model and the proposed joint model significantly outperform baselines in the quality of both code and commit messages.


Introduction
Deep learning has proven remarkably adept at numerous natural language processing (NLP) tasks, such as machine translation (Bahdanau et al., 2014), relation extraction (Zhang et al., 2017), and grammar error correction (Ge et al., 2018). The success of deep learning in NLP has also spurred its development for programming languages (Clement et al., 2020; Lu et al., 2021). Recently, researchers have applied deep learning to programming-language related tasks, such as code completion, automated code repair (Tufano et al., 2018), commit message generation (Xu et al., 2019), and code search (Gu et al., 2018). Among these tasks, automated code repair and commit message generation are the two most active and closely related. The former repairs software bugs automatically without the intervention of a human programmer. The latter generates natural language descriptions of code changes, which act as a record of feature additions and bug repairs.
Because these two tasks can reduce debugging costs in software development and help programmers understand the high-level rationale of changes, much work has been proposed to address automated program repair (Tufano et al., 2018; Chen et al., 2019; Dinella et al., 2020; Yasunaga and Liang, 2020) and commit message generation (Loyola et al., 2017), respectively. However, existing work tackles the two tasks independently, ignoring the underlying relationship between them, e.g., after fixing a bug, the commit message can record the process of code repair. It is therefore worth exploring how to bridge these two tasks and perform code repair and commit message generation simultaneously.

In this paper, we formulate a novel task to jointly repair program code and generate a commit message, given the buggy code. To facilitate the study of this task, we create a dataset covering multiple programming languages. The dataset is collected from the commit and buggy-fixed histories of open-source software projects, where each example consists of buggy code, fixed code, and the corresponding commit message. We first introduce a cascaded method as a baseline, which employs one model to repair the code and another to generate the commit message successively. We enhance this cascaded model with three training approaches inspired by low-resource machine translation: the teacher-student method, the multi-task learning method (Domhan and Hieber, 2017), and the back-translation method (Sennrich et al., 2016a). To address the error propagation problem of the cascaded method, we propose a joint model that achieves both code repair and commit message generation in a single model. We train and evaluate our models using the created triple (buggy-fixed-commit) dataset.
The results demonstrate the validity of our proposed methods, which achieve a significant improvement over the baseline in the quality of both code and commit messages. In particular, the enhanced cascaded method obtains the best performance on the code repair task, while the joint method performs better than the cascaded method on commit message generation.
Our main contributions are as follows:
• We propose the novel task of jointly repairing code and generating commit messages. Moreover, we collect and release a multilingual buggy-fixed-commit dataset for the task.
• We perform an empirical study of different machine learning-based methods for code repair and commit message generation.
• To the best of our knowledge, this is the first work to investigate the effectiveness of jointly modeling the code repair process and commit message generation.

Task Definition
We aim to generate the repaired code and the commit message based on the given buggy code. For the example in Figure 1, our goal is to replace the buggy return type "int" with "float" for the "multiplication" method, and then generate a commit message describing the change.
Formally, given a triple (buggy-fixed-commit) where the i-th sample consists of a buggy code snippet B_i, its fixed version F_i, and a commit message C_i describing the changes from B_i to F_i, our goal is to learn the probability distribution P(C, F | B). In practice, the commit message C is hard to estimate without full consideration of both B and F. A reasonable strategy is therefore to first predict the fixed code F based on B, and then learn to generate an appropriate message C according to B and F. The probability P(C, F | B) can be decomposed as:

P(C, F | B) = P(F | B) · P(C | B, F)    (1)

Thus, given a new buggy code B, we can generate its fixed version F and the commit message C following the conditional probabilities P(F | B) and P(C | B, F), respectively.
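The chain-rule decomposition above can be illustrated numerically. The following sketch uses entirely hypothetical toy distributions (the candidate names `fixed_v1`, `fix return type`, etc. are illustrative, not from the paper) to show how the joint probability factorizes into the two conditional terms:

```python
# Toy illustration of the factorization P(C, F | B) = P(F | B) * P(C | B, F).
# The distributions below are hypothetical numbers, not learned models.
p_f_given_b = {"fixed_v1": 0.7, "fixed_v2": 0.3}          # P(F | B)
p_c_given_bf = {                                           # P(C | B, F)
    "fixed_v1": {"fix return type": 0.9, "refactor": 0.1},
    "fixed_v2": {"fix return type": 0.4, "refactor": 0.6},
}

def joint_prob(commit, fixed):
    """P(C = commit, F = fixed | B) under the chain-rule decomposition."""
    return p_f_given_b[fixed] * p_c_given_bf[fixed][commit]

# Marginal probability of a commit message, summing over candidate fixes.
p_commit = sum(joint_prob("fix return type", f) for f in p_f_given_b)
```

In practice the two factors are realized by neural sequence models rather than lookup tables, but the decoding order (first F, then C) follows directly from this factorization.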

Approach
In this section, we first introduce the cascaded models enhanced by the teacher-student method, the multi-task learning method, and the back-translation method (Section 3.1), which generate the repaired code and the commit message in a two-stage manner. Then, we propose a joint model (Section 3.2), which is capable of jointly optimizing the generation of repaired code and commit message in an end-to-end manner. The models described in this section are all built on the Transformer model (Vaswani et al., 2017); each model takes in some representation of the input and yields a distribution over the output vocabulary.

Cascaded Model
The cascaded model is one of the most straightforward methods for this problem, where F serves as a hinge bridging B and C. Formally, given the buggy code B, the generation of the commit message C is conducted in two steps. The first step generates F conditioned on B by minimizing the following negative log-likelihood loss:

L_R(θ) = − Σ_i log P(F_i | B_i; θ)    (2)

The second step generates the commit message C based on B and the previously generated F:

L_C(θ) = − Σ_i log P(C_i | g(B_i, F_i); θ)    (3)

where g(B, F) is a function that combines B and F into the model input, either by concatenating them or by using their change information 1. In the following sections, the training loss of commit message generation is optimized by Equation (3), unless explicitly specified.
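The paper's footnote notes that Python's difflib is used to represent code changes. A minimal sketch of a change-based g(B, F), assuming a unified-diff serialization (the authors' exact serialization may differ), could look like:

```python
import difflib

def g(buggy: str, fixed: str) -> str:
    """Represent the model input as a unified diff between buggy and fixed
    code. This is a sketch of the change-based g(B, F); the exact
    serialization used in the paper may differ."""
    diff = difflib.unified_diff(
        buggy.splitlines(), fixed.splitlines(),
        lineterm="", n=1,  # n controls how much unchanged context is kept
    )
    return "\n".join(diff)

buggy = "public int multiplication(int a, int b) {\n    return a * b;\n}"
fixed = "public float multiplication(int a, int b) {\n    return a * b;\n}"
diff_text = g(buggy, fixed)
```

The diff representation keeps the sequence short: only changed lines plus a small window of context enter the model, which is why the paper prefers it over concatenating the two full code snippets.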
To further enhance the modeling capabilities of the cascaded model, we introduce three alternative methods by incorporating the teacher-student framework, multi-task learning method, and backtranslation method, respectively.
Teacher-student Method We attempt to improve the performance of code repair with the help of the commit message. Different from previous works that directly use comments or compiler error messages (Yasunaga and Liang, 2020) as prior information, we utilize the commit message as posterior information to supervise the generation of F in code repair. Specifically, the teacher-student framework (Hinton et al., 2015) is employed to distill knowledge from a teacher model into a student model. We first learn a teacher model P(F | B, C) using the ground-truth commit message C. Then, the teacher model teaches the student model P(F | B) by minimizing the KL divergence (Kullback and Leibler, 2006):

L_KL(θ) = KL( Q(F | B, C) || P(F | B) )    (4)

where Q(F | B, C) represents the teacher's sequence distribution over the sample space of all possible sequences. When optimizing L_KL, the posterior distribution Q(F | B, C) can be regarded as a label, so that the model is instructed to use the prior distribution P(F | B) to accurately approximate Q(F | B, C). During the training stage of code repair, the student model learns not only from the output probabilities of the teacher model but also from the correct context, which is formulated by:

L_TS(θ) = L_R(θ) + L_KL(θ)    (5)

Multi-task Learning Inspired by previous work showing that providing the buggy lines can significantly improve the performance of code repair (Chen et al., 2019; Wen et al., 2018; Saha et al., 2017), we use an alternative way to improve code repair: the multi-task learning method.

1 We use difflib (https://docs.python.org/3/library/difflib.html) to represent code changes. In this paper, we use the code changes to build the model input instead of the concatenation, since the latter results in overlong sequences, which degrades model performance by a significant margin.
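The KL term in the teacher-student objective can be sketched with toy sequence-level distributions. The candidate fixes and probability values below are hypothetical placeholders, and a real implementation would compute this token-wise over model logits:

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) over a shared support; Q is the teacher's distribution,
    P the student's. Toy dictionaries stand in for real model outputs."""
    return sum(q[s] * math.log(q[s] / p[s]) for s in q)

# Hypothetical distributions over three candidate fixed sequences.
teacher_q = {"fix_a": 0.8, "fix_b": 0.15, "fix_c": 0.05}   # Q(F | B, C)
student_p = {"fix_a": 0.5, "fix_b": 0.3,  "fix_c": 0.2}    # P(F | B)

# The distillation term the student minimizes during training.
loss_kl = kl_divergence(teacher_q, student_p)
```

Minimizing this term pushes the student's prior distribution toward the teacher's posterior, which is exactly how the commit message supervises code repair without being available at test time.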
Specifically, we introduce a line-level binary sequence classification task as an auxiliary task to assist code repair, which reduces the difficulty for the model of locating the buggy lines. To help the model distinguish line-level from token-level information, we add a "[CLS]" token at the beginning of each line of the buggy code B, aligned with a tagging label T ∈ {0, 1}, where tag 0 means the line is error-free and tag 1 means the line is buggy. To identify the buggy lines, we build a sequence classifier on top of the encoder output to implement this line-level binary tagging task. The line-level sequence classification loss is defined as:

L_T(θ) = − Σ_i Σ_j log P(T_{i,j} | B_i; θ)    (6)

where T_{i,j} is the tag of the j-th line of B_i. At the stage of code repair, we jointly optimize the objectives of the sequence classification task and the sequence generation task, i.e.,

L_MT(θ) = L_R(θ) + L_T(θ)    (7)

Back-translation Method Back-translation has been demonstrated to be an effective way of data augmentation (Sennrich et al., 2016a; Lachaux et al., 2020); it leverages monolingual data to build pseudo-parallel data in a weakly-supervised manner. More precisely, we first train a back-directional model, i.e., a repaired-code-to-buggy-code model parameterized by P(B | F; θ_{F→B}). Then, the pseudo-parallel data is created by the back-directional model, which takes the repaired code as input and predicts its corresponding buggy version:

B̂ = argmax_B P(B | F; θ_{F→B}),  F ∈ M    (8)

where θ_{F→B} is learned by maximum likelihood estimation on M, a non-parallel corpus of fixed code used to build the pseudo-parallel data. After obtaining B̂, the pseudo-parallel data P = {(B̂, F)} is merged with the parallel data D to obtain the augmented parallel data D′, which is used to train the code repair model according to Equation 2.
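The "[CLS]"-prefixed input and line-level tags for the multi-task objective can be sketched as follows. The set-membership heuristic for labeling is a simplification for illustration; the paper derives labels from the buggy-fixed diff, which also handles moved or duplicated lines:

```python
def build_tagging_input(buggy_lines, fixed_lines):
    """Prepend "[CLS]" to every buggy line and derive binary tags:
    1 if the line does not appear in the fixed code, else 0.
    Simplified heuristic; real labels come from the buggy-fixed diff."""
    fixed_set = set(fixed_lines)
    tokens, tags = [], []
    for line in buggy_lines:
        tokens.append("[CLS] " + line)
        tags.append(0 if line in fixed_set else 1)
    return tokens, tags

buggy = ["int mul(int a, int b) {", "return a + b;", "}"]
fixed = ["int mul(int a, int b) {", "return a * b;", "}"]
tokens, tags = build_tagging_input(buggy, fixed)
```

Each "[CLS]" position gives the encoder a dedicated slot whose output representation is fed to the line-level classifier, so line-level and token-level information are kept separate.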

Joint Model
Although the above three methods boost the performance of the cascaded method, they still suffer from three problems: (1) the generated fixed code may contain errors, which are propagated to the subsequent commit generation step; (2) they lose the inter-dependencies among global features that represent the changing details of code repair during commit message generation; and (3) the two-stage method results in low decoding efficiency. These problems may lead to poor performance of commit generation. To this end, we propose a joint method that incorporates a novel changes-aware dynamic attention mechanism to jointly decode the fixed code and the commit message.

Model Architecture
The overview of our model is shown in Figure 2. Our model consists of three components: a buggy encoder, a fixed decoder, and a message decoder with a changes-aware dynamic attention module. First, the buggy encoder encodes the buggy code B and maps it into a sequence of outputs z_b ∈ R^{n×H}, where n and H are the length of B and the hidden size of the model, respectively. z_b is used for line-level binary sequence tagging (optimized as Equation 6) and is an indispensable component for producing change information. Then, the fixed decoder generates a high-level representation z_f ∈ R^{m×H}, which is used to generate the repaired code F̂ and, together with z_b, to produce change information. After that, the commit decoder, equipped with the changes-aware dynamic attention mechanism, generates an output representation z_c ∈ R^{l×H}, which attends over the representations z_b and z_f to produce the final output distribution for generating messages. In the following, we introduce the proposed changes-aware dynamic attention mechanism, as well as the method to jointly train our model.
Changes-aware Dynamic Attention While decoding the message, the output z_c generated by the message decoder attends over z_b and z_f to obtain the context representations c_b and c_f via dot-product attention:

c_φ = softmax( z_c z_φ^T / √H ) z_φ,  φ ∈ {b, f}    (9)

Similar to Vaswani et al. (2017), we use the scaling factor √H to make the gradient more stable during optimization. Intuitively, the context vectors dynamically indicate the alignments of changes over the attended features while decoding the commit messages.
We subtract c_b from c_f to represent the semantic changes from the buggy code to its fixed version, and add c_b and c_f to denote their semantic summarization:

δ = c_f − c_b,  ζ = c_f + c_b    (10)

where δ, ζ ∈ R^{l×H} represent the changes context and the summarization context, respectively. Intuitively, δ is necessary for generating informative commit messages since it describes the changes from the buggy code to its fixed version. Nevertheless, the summarization ζ is also indispensable during decoding, since it contains vital information for generating meaningful tokens (e.g., function or class names), which may not be included in δ.
We develop control gates to balance the contributions of δ and ζ during message decoding. The control gates determine the degree to which each feature of δ and ζ is attended:

[g_δ; g_ζ] = σ( W_g [δ; ζ] + b_g )    (11)

where [a; b] denotes the concatenation of a and b, σ is the sigmoid function, and W_g ∈ R^{2H×2H} and b_g ∈ R^{2H×l} are learnable parameters. The gates g_δ ∈ R^{H×l} and g_ζ ∈ R^{H×l} control δ and ζ, respectively.
The final output of the changes-aware dynamic attention module is a linear combination of the output state of the commit message decoder and the gated fusion of the context representations:

z_o = W_o z_c + g_δ ⊙ δ + g_ζ ⊙ ζ    (12)

where W_o ∈ R^{H×H} is a learnable weight matrix and ⊙ denotes the element-wise product.
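The gated fusion of the changes and summarization contexts can be sketched in plain Python for a single decoding step. Scalar parameters `w_g` and `b_g` stand in for the learned gate matrices, and the vectors are tiny hand-picked examples, so this is an illustration of the mechanism rather than the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_contexts(c_b, c_f, w_g=0.5, b_g=0.0):
    """Sketch of the changes-aware fusion for one decoding step.
    delta captures the semantic change, zeta the summarization; scalar
    gate parameters stand in for the learned W_g / b_g of the paper."""
    delta = [f - b for b, f in zip(c_b, c_f)]       # c_f - c_b
    zeta  = [f + b for b, f in zip(c_b, c_f)]       # c_f + c_b
    # Toy gates: sigmoid of a (hypothetical) linear function per feature.
    g_delta = [sigmoid(w_g * d + b_g) for d in delta]
    g_zeta  = [sigmoid(w_g * z + b_g) for z in zeta]
    # Element-wise gated combination of the two context signals.
    return [gd * d + gz * z
            for gd, d, gz, z in zip(g_delta, delta, g_zeta, zeta)]

fused = fuse_contexts(c_b=[0.2, -0.1], c_f=[0.5, 0.3])
```

The gates let the decoder emphasize the change signal δ when describing what was fixed, and the summarization signal ζ when copying identifiers shared by both code versions.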

Joint Training
We jointly train our model in an end-to-end manner; the overall loss is defined as

L(θ) = L_R(θ) + L_C(θ) + L_T(θ)    (13)

where L_R(θ), L_C(θ), and L_T(θ) optimize the repaired code generation, the commit message generation, and the binary sequence classification, respectively. When training a multilingual model for fixing code and predicting commit messages, following multilingual neural machine translation (Johnson et al., 2017), we mix the training corpora and add a special token (e.g., <java>) at the beginning of each input sequence to distinguish among programming languages.
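The language-token trick for multilingual training is simple to sketch. The tag format `<java>` follows the example in the text; the tokenized snippets below are illustrative:

```python
def add_language_token(code_tokens, language):
    """Prefix a language tag (e.g. "<java>") to the input sequence,
    following the multilingual NMT recipe of Johnson et al. (2017)."""
    return ["<%s>" % language] + code_tokens

batch = [
    (["public", "void", "f", "(", ")"], "java"),
    (["def", "f", "(", ")", ":"], "python"),
]
mixed = [add_language_token(toks, lang) for toks, lang in batch]
```

Because all languages share one vocabulary and one model, the single leading tag is enough for the model to condition its repair and message style on the source language.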

Data
In this section, we describe the creation of the dataset in detail. We first describe how we collect the data in the wild. Then, we introduce the preparation process of the data to make it suitable for our tasks.

Data Collection
We collected data from GitHub Archive 3 using the GH Archive API. We first download the event data for each public event from 2018 to 2020, and then filter it using the GitHub API 4 to obtain the meta information of each commit. Specifically, we retain the commits that consist of edits to files in one of five programming languages (i.e., Java, JavaScript, Python, C#, C++). Moreover, to ensure that the prior and post files constitute a code repair, we follow Fischer et al. (2003) and filter out commits without the specific patterns (i.e., "fix" or "solve") in their commit messages. After obtaining the meta information of the filtered commits, we download the buggy file (i.e., the file prior to the commit) and the fixed file (i.e., the file following the commit) in pairs. Apart from the above multilingual dataset, we also build a Java-only monolingual triple dataset from the buggy-fixed corpus released by Tufano et al. (2018) 5.
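The pattern-based filter from Fischer et al. (2003) can be sketched as a simple regular-expression check. The exact pattern list and matching rules used in the paper may differ (e.g., additional keywords or case handling), so this is an illustrative simplification:

```python
import re

# Repair keywords from the text; the paper's full pattern set may be larger.
FIX_PATTERNS = re.compile(r"\b(fix|solve)", re.IGNORECASE)

def is_repair_commit(message: str) -> bool:
    """Keep only commits whose message contains a repair pattern."""
    return FIX_PATTERNS.search(message) is not None

messages = ["Fix NPE in parser", "Add new feature", "solved the overflow bug"]
kept = [m for m in messages if is_repair_commit(m)]
```

Matching on a word-boundary prefix also catches inflected forms such as "fixes", "fixed", or "solved" without enumerating them.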

Data Preparation
The commit messages are filtered by (i) removing messages shorter than 3 words or longer than 100 words; (ii) removing URLs in commit messages; and (iii) removing messages that appear more than 3 times in the dataset. The rationale behind the last filter is to remove data with meaningless commit messages (e.g., "fix bug.", "fix an error.", etc.). For processing the file-level buggy and fixed code, we follow Tufano et al. (2018), where both the buggy code and the fixed code are split into method-level fragments, since the file-level granularity is too large for learning transformation patterns. After preparation, we obtain clean triples consisting of buggy code, fixed code, and a commit message. The statistics of the dataset used in this paper are summarized in Table 1. More processing details and statistics can be found in Appendix A and Appendix B. We release the datasets at https://github.com/jqbemnlp/BFCsData.
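The three message filters can be sketched together. The URL regex and whitespace tokenization are assumptions; the paper does not specify its exact implementation:

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

def clean_messages(messages):
    """Apply the three filters described above: strip URLs, drop messages
    shorter than 3 or longer than 100 words, and drop messages occurring
    more than 3 times. A sketch; tokenization details may differ."""
    stripped = [URL_RE.sub("", m).strip() for m in messages]
    counts = Counter(stripped)
    return [m for m in stripped
            if 3 <= len(m.split()) <= 100 and counts[m] <= 3]

msgs = ["fix bug."] * 4 + ["fix off-by-one error in loop http://example.com/42"]
cleaned = clean_messages(msgs)
```

Counting duplicates after URL stripping matters: otherwise near-identical boilerplate messages that differ only in a link would escape the frequency filter.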

Experimental Settings
Evaluation Metrics We conduct evaluations on both code repair and commit message generation. For code repair, we use exact match accuracy to measure the percentage of predicted fixed code that exactly matches the ground-truth fixed code. In addition, we introduce the BLEU-4 score (Papineni et al., 2002) as a supplementary metric to evaluate partial matches. For commit message generation, we use BLEU-4 and ROUGE-L (Lin, 2004) to evaluate our model.
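Exact match accuracy is straightforward to compute. The whitespace normalization below is a common convention, not necessarily the paper's exact choice:

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference after
    whitespace normalization (a common convention; the paper may
    normalize differently)."""
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

acc = exact_match_accuracy(
    ["return a * b ;", "return x ;"],
    ["return a * b ;", "return y ;"],
)
```

Exact match is strict but meaningful for code: unlike BLEU, a fix only counts if it compiles to the same token sequence as the reference.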

Results on Monolingual Dataset
We first compare the performance of our proposed cascaded model and joint model for code repair and commit message generation tasks on the monolingual dataset, as listed in Table 2.

Automated Code Repair
The teacher-student model achieves the highest BLEU-4 score, which indicates that the commit message can provide effective guidance for the model to repair code. It also indicates that the teacher-student method successfully distills knowledge from the teacher model to the student model without much loss of accuracy 9. The multi-task learning model outperforms the other experimental models on exact match and obtains performance comparable to the teacher-student model on BLEU-4. The intuition is that the model has learned the location of the buggy line from the supervision of the line-level binary sequence classification task, which provides potential guidance for the model to correctly repair code. It is worth noting that the Naive method, which directly compares the buggy code with its repaired version, also obtains a distinct BLEU-4 score, which indicates a high overlap ratio between the buggy-fixed pairs.

6 An open-source deep learning platform (https://pytorch.org/). 7 The BPE codes are learned by fastBPE (https://github.com/glample/fastBPE). 8 https://github.com/facebookresearch/XLM 9 We evaluated both the teacher model and the student model; the results show that the student model obtains performance comparable to the teacher model.
Commit Message Generation Both the joint model and the cascaded model generate more meaningful commit messages than the naive method, which directly uses the buggy code for message generation. The joint model outperforms the cascaded models on all evaluation metrics of commit message generation. Specifically, it achieves about 10.8% and 5.1% improvements on BLEU-4 and ROUGE-L, respectively, compared to the multi-task learning model, one of the most competitive models on code repair. It is highly likely that the joint model effectively captures the inter-dependencies among global features representing the changing details of code repair during commit message generation, thereby mitigating the error propagation of the two-stage model.

Results on Multilingual Dataset
We further analyze the results of the joint model on the multilingual dataset for both code repair and commit message generation. Table 3 shows the results on five program languages.
With regard to code repair, the results show that the multilingual model achieves significant improvements over the monolingual model on the Java, C++, and C# datasets, and obtains comparable performance on the Python and JavaScript datasets, whether BLEU-4 or exact match is used as the evaluation metric. The intuition is that the corpus mixed from multiple programming languages helps make up for the lack of monolingual data during code repair. In other words, the model can learn potential bug-fixing patterns from multiple languages and apply them to the limited monolingual data, alleviating the data-scarcity problem. A similar observation holds for commit message generation: as shown in Table 3, the multilingual model outperforms the monolingual model on all evaluation metrics and languages. We believe the embedding spaces across programming languages become aligned, since the languages share the same alphabet or anchor tokens such as variables, digits, strings, and method names, which allows the model to learn these alignments simultaneously while generating commit messages.

Discussion
Ablation Study To further study the effects of different techniques, Table 4 shows the results of different joint model variants on the monolingual dataset. First, we remove the line-level sequence classification task from our joint model, so the model is optimized without the loss L_T(θ) mentioned in Equation 13. We observe that the results of both code repair and commit message generation drop noticeably, which shows that locating the buggy lines is important for repairing code and generating commit messages. Then, we remove the proposed changes-aware dynamic attention module. This modification has little impact on code repair but degrades commit message generation by a large margin. The main reason is that the changes-aware dynamic attention module effectively models the changes from the buggy code to its fixed version, thereby improving the performance of commit message generation.

Lengths Analysis
We further analyze model performance across different sequence lengths, conducting a comprehensive study of both code repair and commit message generation on the monolingual dataset. Figures 3(a) and 3(b) present the results of code repair and commit message generation, respectively. Figure 3(a) demonstrates the challenge of this task, especially when the repaired code is long. Even though the exact match scores of both models decline with the growing length of the repaired code, the joint model still outperforms the cascaded model over all length ranges, which demonstrates its stronger capability of modeling code repair. Figure 3(b) presents the comparative results of the cascaded model and the joint model on generating commit messages. We observe that the joint model outperforms the cascaded model when the sequence length exceeds 20, which demonstrates the superiority of the joint model in exploiting the underlying semantic relationships between code changes and their corresponding commit messages when generating long messages.

Figure 4: Examples on the monolingual dataset. The repaired lines are highlighted with different colors: yellow means the buggy code was wrongly repaired, while green means it was correctly repaired.
Case Study We conduct a case study on both the monolingual and multilingual datasets. Figure 4 presents the results on the monolingual dataset. Regarding code repair, the cascaded model wrongly modifies the code by replacing public with private in the first line of the buggy code, which indicates that it is hard for the model to locate and repair the buggy code without any prior information. It is worth noting that the joint model correctly repairs the code; we believe that the line-level binary sequence classification task assists the model in locating the buggy lines, thereby improving its performance when repairing the code. As for commit message generation, the joint model successfully captures the changes from the buggy code to its fixed version and generates an appropriate message, while the cascaded model's failure may be due to error propagation. More examples on the multilingual dataset are shown in Appendix D.

Related Work
Our work draws on two lines of research: automated code repair and commit message generation. We discuss these topics in the following.
Automated Code Repair Conventional approaches mainly focused on a relatively limited and manually-crafted set of fixing patterns, which can only fix bugs in a given language or a specific application domain (Saha et al., 2017; Jin et al., 2011; Nguyen et al., 2013). More recently, deep learning based approaches have been proposed to automatically repair code by learning from massive open-source projects with numerous buggy-fixed pairs (Tufano et al., 2018; Chen et al., 2019; Vasic et al., 2019; Yasunaga and Liang, 2020). Tufano et al. (2018) first proposed using an end-to-end neural machine translation model for learning bug-fixing patches. Besides, prior work demonstrated that appropriately incorporating natural language descriptions into the pre-trained model can further improve the performance of code repair.
Commit Message Generation Early work on automatic commit message generation translated source code changes (such as feature additions and bug repairs) into natural language based on predefined rules and templates (Buse and Weimer, 2010; Cortés-Coy et al., 2014). To overcome their high complexity and poor extensibility, some researchers employed information retrieval methods, which attempt to re-use the commit messages of similar code changes (Huang et al., 2017). Recent work has focused on machine learning based techniques for the commit message generation problem, usually training a sequence-to-sequence model to translate the source changes into commit messages (Jiang et al., 2017; Loyola et al., 2017; Xu et al., 2019). Although automated code repair and commit message generation have developed rapidly in recent years, existing work usually regards them as two separate tasks and ignores the potential relationship between them. Different from previous work, we attempt to bridge the two tasks, since the commit message can record the process of code repair. Specifically, we propose a novel task of repairing code and generating the commit message simultaneously, with the proposed cascaded and joint methods, based on our collected buggy-fixed-commit dataset.

Conclusion
In this paper, we propose a novel task to jointly repair code and generate commit messages. We provide several competitive architectures, including a cascaded model and a joint model. To train and evaluate our models, we collect a multilingual buggy-fixed-commit dataset from GitHub. An empirical study demonstrates the effectiveness of our proposed methods. For future work, we plan to incorporate the tree structure of code into the task and employ more indicative metrics to evaluate model performance.

A Data Processing Details
To ensure that the commit message describes the change from a buggy method to its fixed version, we only consider changes inside a single method within a file. Changes involving multiple methods are not considered in our work, since it is ambiguous which method the commit message refers to. Besides, we develop a heuristic based on lexical overlap to filter out commits whose message does not describe the changes from the buggy method to its fixed version. Specifically, we first tokenize the commit message and the code with nltk 10 and pygments 11, respectively. Then, we retain only the commits in which at least one token of the commit message matches a code token belonging to the buggy or repaired code 12. To check whether the message indeed describes the changes from the buggy code to its fixed version, we randomly selected 100 samples and employed two well-educated annotators to independently analyze the identified commits. After resolving 4 cases of disagreement, they concluded that 97% of the identified commits were true positives.
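The lexical-overlap heuristic can be sketched as follows. A crude regex tokenizer and a tiny stop-word list stand in for the nltk/pygments tokenization and the paper's full filtering of meaningless tokens, so treat this as an illustration of the idea only:

```python
import re

def tokenize(text):
    """Crude identifier/word tokenizer standing in for nltk/pygments."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

# Illustrative stop-word list; the paper also removes punctuation,
# URLs, change IDs, etc.
STOP_WORDS = {"the", "a", "an", "in", "to", "of", "fix", "bug"}

def message_overlaps_code(message, buggy_code, fixed_code):
    """Retain a commit only if at least one content token of the message
    also appears in the buggy or fixed code."""
    msg_tokens = tokenize(message) - STOP_WORDS
    code_tokens = tokenize(buggy_code) | tokenize(fixed_code)
    return bool(msg_tokens & code_tokens)

keep = message_overlaps_code(
    "fix return type of multiplication",
    "public int multiplication(int a, int b) { return a * b; }",
    "public float multiplication(int a, int b) { return a * b; }",
)
```

Removing stop words before the overlap check is essential; otherwise generic words like "fix" would match nothing in the code yet common words could create spurious matches.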

B Data Statistics
We further analyze the monolingual and multilingual datasets. Table 5 summarizes the average number of tokens per buggy method, the average number of lines of code per buggy method, and the average number of tokens per commit message. We observe that the average numbers of lines and tokens for the buggy code are considerable, which indicates the difficulty of this task. Figure 5 and Figure 6 present the distribution of the number of tokens for the buggy code and the commit messages, respectively. We observe that the density for the buggy code has a long tail extending beyond 300 tokens, while the density for the commit messages has a peak before 10 tokens. Figure 7 shows the distribution of the line number of the first buggy line. The density for the line number also has a peak before the fifth line of the buggy code.

10 https://www.nltk.org/ 11 https://pygments.org/ 12 During commit filtering, we remove meaningless tokens in a commit message, such as punctuation, stop words, URLs, and change IDs, to prevent them from affecting the quality of the filtered results.

C Hyperparameter Settings
We train our models using the Adam optimizer with an initial learning rate of 3×10^-4. The mini-batch size and the dropout rate are 16 and 0.1, respectively. We train our models for a maximum of 50 epochs.
To avoid overfitting, we apply early stopping if the validation performance does not increase for 10 consecutive iterations.
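The early-stopping rule can be sketched as a check over the validation history. The patience-window logic is one standard interpretation of "no improvement for 10 consecutive iterations":

```python
def should_stop(val_scores, patience=10):
    """Stop when the validation score has not improved for `patience`
    consecutive evaluations (one standard early-stopping convention)."""
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    # No score in the last `patience` evaluations beat the earlier best.
    return max(val_scores[-patience:]) <= best_before

history = [0.30, 0.35, 0.36] + [0.36] * 10   # plateau of 10 evaluations
```

In practice the check runs after each validation pass, and the checkpoint with the best validation score (here the first 0.36) is the one kept for evaluation.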

D Case Study
To further analyze our model under a low-resource setting, we present an example collected from the C++ dataset. As shown in Figure 8, both the cascaded model and the joint model under the monolingual setting fail to correctly repair the code and generate an appropriate commit message; the most likely reason is that the limited data does not allow the models to capture useful patterns for the specific task. In contrast, the joint model under the multilingual setting successfully fixes the return-value bug. Besides, it captures the keyword "HitRecord" in the generated commit message, which makes the message more relevant to the code context. This example demonstrates that the model performs better under the multilingual setting than under the monolingual setting, especially when the amount of monolingual data is limited.