CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model

A commit message is a document that summarizes source code changes in natural language. A good commit message clearly describes the changes, which enhances collaboration between developers. Our work therefore develops a model that writes the commit message automatically. To this end, we release a dataset of 345K pairs of code modifications and commit messages in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby). As in a neural machine translation (NMT) model, we feed the code modification to the encoder input and the commit message to the decoder input, and measure the quality of the generated commit message with BLEU-4. We also propose the following two training methods to improve commit message generation: (1) a method of preprocessing the code modification before feeding it to the encoder input, and (2) a method that uses initial weights suited to the code domain to reduce the gap in contextual representation between programming language (PL) and natural language (NL).


Introduction
A commit message is the smallest unit that summarizes source code changes in natural language. A good commit message allows developers to grasp the commit history at a glance, so many teams try to make high-quality commits by creating rules for commit messages. For example, Conventional Commits is one such rule set: the first word must be a verb of a specified type, like 'Add' or 'Fix', and the message length is limited. It is very tricky to follow all these rules and write a good-quality commit message, so many developers ignore them for lack of time and motivation. It would therefore be very efficient if the commit message were written automatically when a code modification is given.
Similar to text summarization, many studies take the code modification X = (x 1 , ..., x n ) as encoder input and the commit message Y = (y 1 , ..., y m ) as decoder input in an NMT (neural machine translation) model (Loyola et al., 2017; van Hal et al., 2019). However, taking the code modification without distinguishing between the added and deleted parts makes it difficult for the NMT model to understand the context of the modification. In addition, previous studies tend to train from scratch, which does not perform well because it leaves a large gap in the contextual representation between programming language (PL) and natural language (NL). To overcome these problems and train a better commit message generation model, our approach follows two stages: (1) Collecting and processing data as pairs of the added and deleted parts of the code, X = ((add 1 , del 1 ), ..., (add n , del n )). To feed this pair dataset into a Transformer-based NMT model (Vaswani et al., 2017), we use the BERT (Devlin et al., 2018) fine-tuning method for two-sentence pairs consisting of the added and deleted parts. This shows a better BLEU-4 score (Papineni et al., 2002) than previous works using the raw git diff. Similar to CodeSearchNet (Husain et al., 2019), our data is collected for six languages (Python, PHP, Go, Java, JavaScript, and Ruby) from Github to show good performance across languages. We finally release 345K code modification and commit message pairs.
(2) To reduce the large gap in contextual representation between programming language (PL) and natural language (NL), we use CodeBERT (Feng et al., 2020), a language model well-trained on the code domain, as the initial weight. Using CodeBERT as the initial weight yields a better BLEU-4 score for commit message generation than random initialization or RoBERTa (Liu et al., 2019). Additionally, when we pre-train on the Code-to-NL task, which documents source code in CodeSearchNet, and use the result as the initial weight for commit generation, the gap in contextual representation between PL and NL is further reduced.

Related Work
Commit message generation has been studied in various ways. One line of work collects 2M commits from Mauczka et al. (2015) and the top 1K Java projects on Github. Among the commit messages, only those that follow a "Verb + Object" format are kept and grouped into verb types with similar characteristics, and a classification model is trained with a naive Bayes classifier. Follow-up work uses this commit data to generate messages with an attention-based RNN encoder-decoder NMT model, filtering the 2M commits again for a "verb/direct-object pattern" and finally using 26K commit messages. Loyola et al. (2017) use a similar NMT model, but train on git diff and commit pairs collected from one to three repositories each of Python, Java, JavaScript, and C++. Liu et al. (2018) propose a retrieval model using the same 26K commits as training data: a code modification is represented as a bag-of-words vector, and the message with the highest cosine similarity is retrieved. Xu et al. (2019) collect only '.java' files and use a 509K-pair dataset to train an NMT model; to mitigate the out-of-vocabulary (OOV) problem of code-domain input, they combine a generation distribution with a copying distribution, similar to pointer-generator networks (See et al., 2017). van Hal et al. (2019) argue that the earlier data is noisy and propose a pre-processing method that filters for better commit messages. Other work argues that it is challenging to represent the information required from source code input in an NMT model with a fixed length; to alleviate this, only the added and deleted parts of the code modification are abbreviated as an abstract syntax tree (AST) and fed to a Bi-LSTM model. Nie et al. point out a large gap between the contextual representations of source code and natural language when generating commit messages.
In contrast to previous studies that use RNN or LSTM models, they use a Transformer model and, like several other studies, train on the data of Liu et al. (2018). To reduce this gap, they jointly minimize two losses: predicting the next code line (Explicit Code Changes) and predicting randomly masked words.

Git Process
Git is a version control system that manages version history and supports efficient collaboration. Git tracks all files in a project across the working directory, the staging area, and the repository. The working directory shows the files in their current state. After modifying files, developers move them to the staging area with the add command to record the modified contents, then write a commit message through the commit command. Therefore, a commit message may contain two or more file changes.

Text Summarization based on Encoder-Decoder Model
With the advent of sequence-to-sequence learning (Seq2Seq) (Sutskever et al., 2014), various tasks between a source and a target domain are being solved. Text summarization is one of these tasks, showing good performance through Seq2Seq models with more advanced encoders and decoders. The encoder and decoder are trained by maximizing the conditional log-likelihood below, based on source input X = (x 1 , ..., x n ) and target input Y = (y 1 , ..., y m ):

L(θ) = Σ_{t=1}^{T} log P(y_t | y_0 , ..., y_{t-1} , X; θ)

where T is the length of the target input, y 0 is the start token, y T is the end token, and θ is the parameter of the model. In the Transformer (Vaswani et al., 2017) model, the source input is vectorized into hidden states through self-attention across the encoder layers. The target input then learns the generation distribution through self-attention and attention over the encoder hidden states. This shows better summarization results than existing RNN-based models (Nallapati et al., 2016).
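The objective above can be illustrated numerically. The following minimal sketch (with made-up toy distributions standing in for the decoder's softmax outputs) sums the log-probabilities of the reference tokens; training maximizes this quantity:

```python
import math

def sequence_log_likelihood(step_probs, target_tokens):
    """Sum of log P(y_t | y_<t, X) over the target sequence.

    step_probs: one per-step probability distribution (dict) per target
                position -- a toy stand-in for the decoder's softmax output.
    target_tokens: the reference token at each position.
    """
    return sum(math.log(p[y]) for p, y in zip(step_probs, target_tokens))

# Toy example: a 3-token target with hypothetical decoder distributions.
probs = [
    {"Fix": 0.7, "Add": 0.3},
    {"typo": 0.6, "bug": 0.4},
    {"</s>": 0.9, "typo": 0.1},
]
target = ["Fix", "typo", "</s>"]
ll = sequence_log_likelihood(probs, target)
# Training maximizes ll (equivalently, minimizes -ll as cross-entropy loss).
```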
To improve performance, most machine translation systems use beam search. At each step, it keeps the K most likely hypotheses as the search area and expands them in the next step to generate better text. Generation stops when the predicted y t is the end token or the maximum target length is reached.
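The decoding procedure above can be sketched as follows. This is a minimal toy implementation, with a hypothetical next-token table standing in for the model's decoder:

```python
import math

def beam_search(step_fn, start, end, beam_size, max_len):
    """Minimal beam search sketch. step_fn(seq) returns (token, log_prob)
    pairs for the possible next tokens given the prefix seq."""
    beams = [([start], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_fn(seq):
                hyp = (seq + [tok], score + lp)
                # A hypothesis ending with the end token stops expanding.
                (finished if tok == end else candidates).append(hyp)
        if not candidates:
            break
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_size]   # keep only the K best prefixes
    if not finished:
        finished = beams
    return max(finished, key=lambda b: b[1])[0]

# Toy next-token table (hypothetical probabilities, for illustration only).
table = {
    ("<s>",): [("Fix", math.log(0.6)), ("Add", math.log(0.4))],
    ("<s>", "Fix"): [("typo", math.log(0.9)), ("bug", math.log(0.1))],
    ("<s>", "Add"): [("tests", math.log(0.5)), ("docs", math.log(0.5))],
    ("<s>", "Fix", "typo"): [("</s>", 0.0)],
    ("<s>", "Fix", "bug"): [("</s>", 0.0)],
    ("<s>", "Add", "tests"): [("</s>", 0.0)],
    ("<s>", "Add", "docs"): [("</s>", 0.0)],
}

def step_fn(seq):
    return table[tuple(seq)]
```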

CodeSearchNet
CodeSearchNet (Husain et al., 2019) is a dataset for searching code function snippets with natural language. It pairs code function snippets in six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby) with docstrings summarizing these functions in natural language. A total of 6M pairs are collected from projects with re-distributable licenses. Using the CodeSearchNet corpus, one can retrieve the code corresponding to a natural language query, and also document code by summarizing it in natural language (Code-to-NL).

CodeBERT
Recent NLP studies have shown state-of-the-art results on various tasks through transfer learning consisting of pre-training and fine-tuning (Peters et al., 2018). In particular, BERT (Devlin et al., 2018) is a language model pre-trained by predicting masked words in a randomly masked input sequence, using only the encoder of the Transformer (Vaswani et al., 2017). It shows good performance on various datasets and is now extending beyond the natural language domain to the voice, video, and code domains.
CodeBERT is a language model pre-trained on the code domain to learn the relationship between programming language (PL) and natural language (NL). To learn representations across the two domains, it adopts the training method of ELECTRA (Clark et al., 2020), which consists of a generator and a discriminator. The NL and code generators predict words from comment and code tokens masked at a specific rate. The NL-Code discriminator, which becomes CodeBERT, is then trained through binary classification to predict whether each token is replaced or original.
CodeBERT shows good results on all tasks in the code domain. In particular, it shows higher scores than other pre-trained models on the code-to-natural-language (Code-to-NL) task and on NL-based code retrieval using the CodeSearchNet corpus. In addition, CodeBERT uses the Byte Pair Encoding (BPE) tokenizer (Sennrich et al., 2015) used in RoBERTa, so it does not generate unknown (unk) tokens on code-domain input.
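The merge-learning step of BPE can be illustrated with a toy sketch. This is not RoBERTa's actual tokenizer, only a minimal illustration of how frequent symbol pairs are merged so that rare words decompose into known subwords instead of unk tokens:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn byte-pair merges from a {word: frequency} dict (toy sketch).

    Each word starts as a tuple of characters plus an end-of-word marker;
    the most frequent adjacent symbol pair is merged at every step.
    """
    vocab = {tuple(w) + ("</w>",): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges, vocab

# Toy corpus: after two merges, "low" becomes a single subword symbol.
merges, vocab = bpe_merges({"low": 5, "lower": 2}, 2)
```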

Dataset
We collect 345K pairs of code modifications and commit messages from 52K repositories of six programming languages (Python, PHP, Go, Java, JavaScript, and Ruby) on Github. When raw git diff is used as model input, it is difficult to distinguish between the added and deleted parts, so unlike previous work, our dataset focuses only on the added and deleted lines in git diff. The detailed data collection and pre-processing method is shown as pseudo-code in Algorithm 1 (code modification parser from the list of repositories). To collect only code with a re-distributable license, we list the Github repository names in the CodeSearchNet dataset. After that, all repositories are cloned through multi-threading. The functions that collect the commit hashes in a repository and the code modifications in a commit hash are as follows:
• get_commits is a function that gets the commit history from a repository. Only commits on the master branch are kept, excluding merge commits, and only commits with code modifications in the six program-language extensions (.py, .php, .js, .java, .go, .rb) are collected. To implement this, we use the open-source pydriller (Spadini et al., 2018).
• get_modifications is a function that gets the lines modified in a commit. Through this function, only the added or deleted parts are collected, not the entire git diff.
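The extraction of changed lines can be sketched by parsing the unified diff format directly. This is a simplified stand-in for the get_modifications step (the paper uses pydriller; the function name here is illustrative):

```python
def extract_changed_lines(diff_text):
    """Split a unified git diff into added and deleted lines,
    ignoring context lines and file headers (toy stand-in for
    the paper's get_modifications step)."""
    added, deleted = [], []
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file headers, not code changes
        if line.startswith("+"):
            added.append(line[1:].strip())
        elif line.startswith("-"):
            deleted.append(line[1:].strip())
    return added, deleted

# Example diff with one changed line.
diff = """--- a/utils.py
+++ b/utils.py
@@ -1,3 +1,3 @@
 def add(a, b):
-    return a - b
+    return a + b
"""
```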
While collecting the pair dataset, we find that the relationship between some code modifications and the corresponding commit messages is obscure and very abstract. We also find that some code modifications or commit messages are meaningless dummy files. To filter these, we create the following filtering function and rules.
1. To collect commit messages with various format distributions, we limit collection to at most 50 commits per repository.
2. We keep commits whose number of changed files is one or two.
3. Commit messages containing an issue number are removed because their detailed information is abbreviated.
4. Similar to previous work, non-English commit messages are removed.
5. Since some commit messages are very long, only the first line is kept.
6. If any code token produced by tree-sitter, a parser generator tool, exceeds 32 characters, the commit is excluded. This removes unnecessary content such as changes to binary files in the code diff.
7. Following the Conventional Commits (§ 1) rules, commit messages that begin with a verb are collected. We use spaCy for POS tagging.
8. We keep commit messages with the 13 most frequent verb types. Figure 2 shows the collected verb types and their ratio over the entire dataset.
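A simplified version of these message filters can be sketched as below. The verb list is a hypothetical subset of the 13 collected types, and a fixed list stands in for the paper's spaCy POS tagging:

```python
import re

# Illustrative subset of the 13 collected verb types (hypothetical list).
COLLECTED_VERBS = {"add", "fix", "remove", "update", "use", "change"}

def filter_message(message):
    """Simplified sketch of the filtering rules above. Returns the
    cleaned first line, or None if the commit message is dropped."""
    if not message.strip():
        return None
    first_line = message.strip().splitlines()[0]   # rule 5: keep only the first line
    if re.search(r"#\d+", first_line):             # rule 3: issue number present
        return None
    if not first_line.isascii():                   # rough stand-in for rule 4 (non-English)
        return None
    words = first_line.split()
    if not words or words[0].lower() not in COLLECTED_VERBS:  # rules 7-8
        return None
    return first_line
```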
As a result, we collect 345K pairs of code modifications and commit messages from 52K Github repositories and split the data into 80-10-10 train/validation/test sets. These results are shown in Table 1.

CommitBERT
We propose generating a commit message with the CodeBERT model on our dataset (§ 4). To this end, this section describes how to feed the code modification (X = ((add 1 , del 1 ), ..., (add n , del n ))) and commit message (Y = (msg 1 , ..., msg m )) to CodeBERT, and how to use pre-trained weights more efficiently to reduce the gap in contextual representation between programming language (PL) and natural language (NL).

CodeBERT for Commit Message Generation
We feed the code modification to the encoder input and the commit message to the decoder input, following the NMT model. Especially for the code modification in the encoder, similar inputs are concatenated, and different types of inputs are separated by a sentence separator (sep). Applying this to our CommitBERT, added tokens (Add = (add 1 , ..., add n )) and deleted tokens (Del = (del 1 , ..., del n )) of the same type are concatenated, and sentence separators are inserted between them. Therefore, the conditional likelihood is as follows:

P(M | C; θ) = Π_{t=1}^{T} P(msg t | msg 1 , ..., msg t-1 , C; θ), where C = concat([cls], Add, [sep], Del, [sep])

where M is the commit message tokens, C is the code modification tokens, and concat is a list concatenation function.
[cls] and [sep] are special tokens: a start token and a sentence separator token, respectively. Other notation is the same as in Section 3.2.
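The encoder input construction can be sketched as a simple concatenation. The <s>/</s> token strings below follow RoBERTa's convention and are an assumption here, not taken from the paper:

```python
def build_encoder_input(added_tokens, deleted_tokens, cls="<s>", sep="</s>"):
    """Build the encoder input [cls] add_1 ... add_n [sep] del_1 ... del_n [sep],
    mirroring the two-sentence-pair layout described above (sketch only)."""
    return [cls] + added_tokens + [sep] + deleted_tokens + [sep]

# Example: a one-line change, pre-tokenized for illustration.
encoder_input = build_encoder_input(
    ["return", "a", "+", "b"],   # added line tokens
    ["return", "a", "-", "b"],   # deleted line tokens
)
```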
Unlike previous works, we do not use the entire git diff as input; only the changed lines in the code modification are used. Since this removes unnecessary input, it shows a significant performance improvement in summarizing code modifications in natural language. Figure 3 shows how the code modification is actually taken as model input.

Initialize Pretrained Weights
To reduce the gap between the two domains (PL and NL), we use pre-trained CodeBERT as the initial weight. Furthermore, we observe that our dataset (§ 4) with the deleted tokens removed is similar to the Code-to-NL task in CodeSearchNet (Section 3.3). Using this property, we first train CodeBERT on the Code-to-NL task and then use the resulting weights as the initial weights for commit message generation. This training method shows better results than using the CodeBERT weights alone.

Experiment
To verify the proposals in Section 5 on the commit message generation task, we run two experiments: (1) comparing commit message generation when using all code modifications as input versus using only the added or deleted lines, and (2) an ablation study over several initial model weights to find the weights with the smallest gap in contextual representation between PL and NL.

Experiment Setup
Our implementation uses CodeXGLUE's code-text pipeline library. We use the same model architecture and experimental parameters for both experiments. The encoder uses 12 Transformer layers and the decoder uses 3. We use a learning rate of 5e-5 and train on one V100 GPU with a batch size of 32. We use 256 as the maximum source input length, 128 as the maximum target input length, 10 training epochs, and a beam size k of 10.

Compare Model Input Type
To experiment with generating a commit message according to the input type, we collect only the 4,135 examples with code modifications in '.java' files from the 26K training examples of Loyola et al. (2017). We then transform these 4,135 examples into two types and train on each with RoBERTa and CodeBERT weights: (a) the entire code modification in git diff and (b) only the changed lines in the code modification. Figure 3 shows these two differences in detail. Table 3 shows the BLEU-4 scores on the test set after training on each type. For both initial weights, type (a) shows worse results than type (b), even though type (a) provides a longer input to the model. This shows that lines other than the changed lines disturb training when generating the commit message.

Ablation study on initial weight
We do an ablation study over the initial weight of the model on the 345K datasets in six programming languages collected in Section 4. As mentioned in Section 5.2, when model weights with a high comprehension of the code domain are used as the initial weights, we assume that the large gap in contextual representation between PL and NL is greatly reduced. To prove this, we train the commit message generation task with four initial model weights: Random, RoBERTa (https://huggingface.co/roberta-base), CodeBERT (https://huggingface.co/microsoft/codebert-base), and the weights trained on the Code-to-NL task (Section 3.3) with CodeBERT. Except for the initial weight, all training parameters are the same. Table 2 shows BLEU-4 on the test set and PPL on the dev set for each of the four weights after training. Using the weights trained on the Code-to-NL task with CodeBERT as the initial weight shows the best results for test BLEU-4 and dev PPL. It also shows good performance regardless of the programming language.

Conclusion and Future Work
Our work presented a model that summarizes code modifications, addressing the difficulty of manually writing commit messages. To this end, this paper proposed methods for collecting data, feeding it to a model, and improving performance. As a result, our proposed methods generated commit messages successfully. Consequently, our work can help developers who have difficulty writing commit messages.
Although a pre-trained model can generate high-quality commit messages, future studies on understanding code syntax structure remain. As a solution, CommitBERT could convert the code modification to an AST (Abstract Syntax Tree) before it is taken into the encoder, as in prior work.