Novel Natural Language Summarization of Program Code via Leveraging Multiple Input Representations

The lack of a description for a piece of program code is a major hurdle for developers who are new to a code base and trying to understand it. To tackle this problem, previous work on code summarization, the task of automatically generating a description for a given piece of code, reported that an auxiliary model trained to produce API (Application Programming Interface) embeddings showed promising results when applied to a downstream code summarization model. However, different code snippets with different summaries can have the same API sequence. If we train a model to generate summaries from the API sequence alone, the model will not be able to learn effectively. Nevertheless, we note that the API sequence can still be useful and has not been actively utilized. This work proposes a novel multi-task approach that simultaneously trains two similar tasks: 1) summarizing a given code (code to summary), and 2) summarizing a given API sequence (API sequence to summary). We propose a novel code-level encoder based on BERT that is capable of expressing the semantics of code and obtains a representation for every line of code. Our work is the first code summarization work that utilizes a natural language-based contextual pre-trained language model in its encoder. We evaluate our approach using two common datasets (Java and Python) that have been widely used in previous studies. Our experimental results show that our multi-task approach improves over the baselines and achieves a new state of the art.


Introduction
Developers spend most of their time writing code but little time writing its description. It has been reported that much of developers' code does not have any description (Hu et al., 2018b). This has detrimental effects on other developers who will be reading and trying to understand the code base. To alleviate the effort of writing code descriptions, code summarization, the task of automatically generating a code description given a piece of code, has been proposed in the software engineering and AI communities (Haiduc et al., 2010; Moreno et al., 2013; Iyer et al., 2016; Hu et al., 2018a).
Figure 1: Code 1 and Code 2 implement different functionalities but use the same API sequence.
(a) Code 1, having API sequence getArea:
    /**
     * Returns area by name
     */
    public int getAreaByName(String name) {
        return getArea(name);
    }
(b) Code 2, having API sequence getArea:
    /**
     * Returns twice the area given a name
     */
    public int getTwiceAreaByName(String buildingName) {
        return getArea(buildingName) * 2;
    }
Previous work on code summarization using an auxiliary model trained to produce API (Application Programming Interface) embeddings showed promising results when applied to a separate code summarization model (Hu et al., 2018b). However, different code snippets may share the same API sequence. For example, Figures 1a and 1b show two different code snippets that have the same API sequence, getArea. Despite having the same API sequence, the code summaries shown in the comments on top are different: Figure 1a is about getting an area given a name, while Figure 1b is about doubling an area. Thus, training a model to summarize the given code based only on its API sequence may confuse the model. Nevertheless, we note that the API sequence can still be useful and has not been actively explored.
In this work, we further leverage the API sequence for code summarization. Specifically, we propose a novel multi-task approach that simultaneously trains two similar tasks: 1) code to summary, and 2) API sequence to summary. Our model has an encoder-decoder architecture, for which we propose a novel code-level encoder based on BERT (Devlin et al., 2019), which has recently shown remarkable improvement in numerous downstream NLP tasks. Our encoder is able to express the semantics of code and obtain a representation for every line of code. Our work is also the first code summarization work that utilizes a natural language-based contextual pre-trained language model in its encoder. For multi-task learning, the two tasks share the same set of layers, which produce the contextual embeddings used to train the individual tasks.
We evaluate our approach on two popular datasets (Java and Python) that have been widely used in previous studies. Our experimental results show that by learning to identify lines of code, our model is able to learn more effectively. Furthermore, our multi-task approach improves over the baselines and achieves the new state-of-the-art performance.
In summary, our key contributions include:
• A novel multi-task learning model that consists of two different but semantically similar tasks: generating summaries from code and generating summaries from API sequences.
• A novel approach of representing lines of code that leads to improved performance.
• Experimental results compared with baselines that achieve state-of-the-art performance in code summarization.

Related Work
Most existing approaches that perform code summarization using neural networks define the output task to be a sequence generation (Iyer et al., 2016;Hu et al., 2018a,b;Liang and Zhu, 2018). These approaches leverage recurrent encoder-decoder models with attention mechanisms. One prior work proposed a new convolutional attention model for code summarization that outputs short name-like summaries (Allamanis et al., 2016). Different from this previous work, our work is a multi-task (Code to Summary and API Sequence to Summary) approach that utilizes a contextual pre-trained language model in the area of code summarization.
There has been research that leverages source code representations for code summarization. A number of recent works transform the source code into an Abstract Syntax Tree (AST) and encode it using a TreeTransformer (Harer et al., 2019), a Tree-LSTM (Shido et al., 2019), or Graph Neural Networks (LeClair et al., 2020). Another prior work improved code summarization using the AST by flattening it into a sequence through structure-based traversal (Hu et al., 2018a). Later work further improved the model by proposing a representation that decouples the code structure from structure-based traversal (LeClair et al., 2019). Our work leverages a more readable and simplified structure than the AST by tokenizing every line of code and prefixing it with a [CLS] token.
Other existing work leverages various learning techniques such as reinforcement learning (Wan et al., 2018), dual learning, and retrieval-based techniques (Zhang et al., 2020) to build code summarization models. Recent work leverages a Transformer model to generate natural language summaries (Ahmad et al., 2020). In contrast, our work uses a pre-trained language model as a code-level encoder.
A previous technique uses API usage information to enhance the code summarization model, showing the effectiveness of the knowledge from the API sequence (Hu et al., 2018b). We note that the exploration of the API sequence in code summarization remains limited. Although multiple different code snippets can have the same API sequence, and the API sequence may not contain the full information of the code, the API sequence directly provides structural information about the code, e.g., the data types used and their order. It can be viewed as a friendlier, more structured, and cleaner intermediate representation. Though it is a lossy view of the code, such an intermediate representation facilitates model training, as our experiments show that it improves model performance. In contrast to Hu et al. (2018b), our work is a multi-task approach that summarizes the code as well as the API sequence. Our experiments show that this multi-task approach is useful in learning a better code-to-summary model.
There has been work that uses a dual-task model in which one model is trained in step s, and the other model is trained in step s + 1. The result of the model in step s is then used for the model in step s + 1. This cycle repeats until convergence. Our work does not require a cyclic dependency between two models but trains the two tasks simultaneously, which is more efficient.
Figure 2: Overview of the proposed approach. Different from the original BERT structure, we prefix every line of code with a [CLS] symbol for learning the line-of-code representation. Each line of code also has a different type of segment embedding. Our experimental results show that differentiating among lines of code achieves higher performance. Our decoder is a standard 6-layer Transformer decoder. For multi-task learning, our proposed model trains two different but similar tasks: 1) Code to Summary Generation, and 2) API sequence to Summary Generation.
CodeBERT (Feng et al., 2020) and PYMT5 (Clement et al., 2020) present pre-trained models based on a multi-layer bidirectional Transformer encoder and a text-to-text transfer Transformer (T5), respectively. Both support multiple downstream tasks, such as code search (Feng et al., 2020), code generation (Clement et al., 2020), and code summarization (Feng et al., 2020; Clement et al., 2020). Different from these techniques, our model is designed specifically for code summarization and thus provides better performance, as our results in Section 5 show. Furthermore, our proposed approach can be combined with pre-trained models other than BERT, potentially improving upon their original performance.

Proposed Approach
As shown in Figure 2, our proposed model has a common encoder-decoder architecture, consisting of training two different but similar tasks: 1) Code to Summary Generation, and 2) API sequence to Summary Generation. Although we trained two different tasks, our main task is Code to Summary Generation. We first describe the encoder-decoder architecture, followed by the multi-task learning framework of the two tasks.

Encoder Architecture
Large-scale pre-trained language models have shown remarkable performance in recent NLP studies. However, the use of such models has rarely been studied for programming languages. In this work, we explore the potential of a popular pre-trained language model, BERT (Devlin et al., 2019). Specifically, we make use of the uncased base model. We also conducted experiments using cased pre-trained models, taking camel case into consideration. However, the results were not better than those of the uncased pre-trained models, so we omit them. We are aware of a recent code representation model, CodeBERT (Feng et al., 2020). However, there has been no study on how a popular BERT-like structure that performs effectively in downstream NLP tasks can be leveraged in code summarization, given the similarity between programming languages and the English language. In this study, we propose learning a line-of-code representation in a BERT-like encoder and show that it achieves better performance than a vanilla BERT encoder.
In the original BERT model, every sentence is prefixed with the [CLS] token and ends with a [SEP] token. We observed that the indentation of code lines can have a special meaning for certain programming languages. For example, combining two different lines of code in Python may cause errors. Thus, we believed that a model may be able to train better if it can identify the line difference. In the encoder, our work models every line of code distinctly by inserting an additional [CLS] token at the start of every line of code. Similar to vanilla BERT, [SEP] is appended as the last token for every line of code. We use the original code indentation and we do not further process the code to conform to a certain indentation. In our early experiment, we have attempted to model different forms of whitespace indentation. However, these modelings do not improve the overall model performance. Thus, we exclude them in our final model design. Additionally, all the code and summary are lower-cased, and every non-alphanumerical symbol in the code is treated as a separate token.
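As an illustration, the line-level input construction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the helper names `tokenize_line` and `encode_code` are hypothetical, blank lines are skipped, leading whitespace is dropped, and the subsequent sub-word (WordPiece) tokenization applied by BERT is not shown.

```python
import re
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def tokenize_line(line: str) -> List[str]:
    """Lower-case a line; split it into alphanumeric runs and single symbols."""
    # Every non-alphanumeric, non-whitespace character becomes its own token.
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", line.lower())


def encode_code(code: str) -> List[List[str]]:
    """Wrap every non-empty line of code as [CLS] tokens ... [SEP]."""
    lines = []
    for raw in code.splitlines():
        tokens = tokenize_line(raw)
        if tokens:  # assumption: blank lines carry no tokens and are skipped
            lines.append([CLS] + tokens + [SEP])
    return lines


example = "public int getAreaByName(String name) {\n    return getArea(name);\n}"
for line in encode_code(example):
    print(" ".join(line))
# [CLS] public int getareabyname ( string name ) { [SEP]
# [CLS] return getarea ( name ) ; [SEP]
# [CLS] } [SEP]
```

Each inner list corresponds to one line of code, so the number of [CLS] tokens equals the number of non-empty lines.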
The input is a running sequence of code tokens arranged in lines of code (LOC), beginning from the top of every method/function, e.g.,
$$x = [x^{1}_{1}, x^{1}_{2}, x^{1}_{3}, \ldots] = [\texttt{[CLS]}, \texttt{public}, \texttt{string}, \ldots],$$
where public and string refer to the second and the third tokens of the input (i.e., $x^{1}_{2}$ and $x^{1}_{3}$ in the input $x$ above). Note that the first token of every line of code (e.g., $x^{1}_{1}$) is [CLS]. As shown in Figure 2, each token $x^{j}_{i}$ (token $i$ of line $j$) is assigned three kinds of embeddings: token embeddings, segment embeddings, and position embeddings. Token embeddings capture the semantics of each token. Segment embeddings are used to distinguish between different lines of code. For example, for each LOC, the approach assigns segment embedding $E_A$ or $E_B$ depending on whether the line of code is even or odd, as shown in Figure 2. Position embeddings indicate the position of each token within the line of code.
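The assignment of segment and position indices described above can be illustrated with a short Python sketch. The function name `build_ids` is hypothetical, and two choices are assumptions: which line parity maps to E_A versus E_B, and restarting the position index at every line (a running index over the whole sequence is another possible reading).

```python
from typing import List, Tuple


def build_ids(lines: List[List[str]]) -> Tuple[List[str], List[int], List[int]]:
    """Flatten token lines and return (tokens, segment_ids, position_ids)."""
    tokens, segment_ids, position_ids = [], [], []
    for line_idx, line in enumerate(lines):
        segment = line_idx % 2  # assumption: 0 selects E_A, 1 selects E_B
        for pos, tok in enumerate(line):
            tokens.append(tok)
            segment_ids.append(segment)
            position_ids.append(pos)  # position of the token within its line
    return tokens, segment_ids, position_ids


lines = [
    ["[CLS]", "public", "[SEP]"],
    ["[CLS]", "return", "x", "[SEP]"],
    ["[CLS]", "}", "[SEP]"],
]
tokens, segment_ids, position_ids = build_ids(lines)
print(segment_ids)   # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
print(position_ids)  # [0, 1, 2, 0, 1, 2, 3, 0, 1, 2]
```

The three index lists would then select rows of the token, segment, and position embedding tables, whose vectors are summed per token.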
These three embeddings are summed into a single input vector and fed into a bidirectional Transformer with multiple layers, i.e.,
$$\tilde{h}^{l} = \mathrm{LN}\left(h^{l-1} + \mathrm{MHAtt}\left(h^{l-1}\right)\right) \tag{1}$$
$$h^{l} = \mathrm{LN}\left(\tilde{h}^{l} + \mathrm{FFN}\left(\tilde{h}^{l}\right)\right) \tag{2}$$
where $h^{0} = x$ is the input vector. The superscript $l$ indicates the depth of the stacked layer. LN is the layer normalization operation (Ba et al., 2016) and MHAtt is the multi-head attention operation (Vaswani et al., 2017). FFN is a feed-forward network. As a result, the encoder generates an output vector $T_i$ (shown in Figure 2) for each token with rich contextual information.

Decoder Architecture
Our decoder is a six-layer Transformer (Vaswani et al., 2017) that is initialized randomly. While our encoder is a pre-trained model, our decoder must be trained from scratch. This mismatch makes it unsuitable to fine-tune BERT with a single shared optimizer setting. To mitigate this imbalance, separate Adam optimizers (Kingma and Ba, 2015), each with β1 = 0.9 and β2 = 0.999, are used for the encoder and the decoder.
Additionally, different warm-up steps and learning rates are imposed on the encoder and decoder, following the schedule of Liu and Lapata (2019), i.e.,
$$lr_E = \tilde{lr}_E \cdot \min\left(step^{-0.5},\ step \cdot warmup_E^{-1.5}\right) \tag{3}$$
$$lr_D = \tilde{lr}_D \cdot \min\left(step^{-0.5},\ step \cdot warmup_D^{-1.5}\right) \tag{4}$$
where $lr_E$ and $lr_D$ denote the learning rates for the encoder and decoder, respectively, and $warmup_E$ and $warmup_D$ denote the warm-up steps for the encoder and the decoder, respectively. The initial values $\tilde{lr}_E$ and $warmup_E$ are set to $2 \times 10^{-3}$ and 20,000, respectively, and $\tilde{lr}_D$ and $warmup_D$ are set to 0.1 and 10,000, respectively. The encoder thus uses a lower learning rate and a longer warm-up than the decoder, so that the encoder can be trained with more accurate gradients once the decoder is becoming stable (Liu and Lapata, 2019). The same encoder and decoder learning rates and warm-up steps are used for both tasks.
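To make the effect of the two warm-up settings concrete, the following Python sketch evaluates a warm-up schedule for the encoder and decoder values given above. It assumes the inverse-square-root warm-up form of Liu and Lapata (2019), lr = base · min(step^-0.5, step · warmup^-1.5); the function names are hypothetical, and steps are counted from 1.

```python
def warmup_lr(step: int, base_lr: float, warmup: int) -> float:
    """Linear warm-up up to `warmup` steps, then inverse-square-root decay."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup ** -1.5)


def encoder_lr(step: int) -> float:
    return warmup_lr(step, base_lr=2e-3, warmup=20_000)


def decoder_lr(step: int) -> float:
    return warmup_lr(step, base_lr=0.1, warmup=10_000)


for step in (1_000, 10_000, 20_000, 40_000):
    print(step, f"{encoder_lr(step):.2e}", f"{decoder_lr(step):.2e}")
```

Under this form, each learning rate peaks exactly at its warm-up step, so the decoder reaches its peak first and the encoder's peak is roughly two orders of magnitude smaller.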

Multi-task Learning
Our multi-task learning approach is similar to one designed for natural language. The lower layers in Figure 2 indicate the layers shared across all tasks, and the top layer represents the task-specific outputs. The shared layers produce the final contextual embeddings, which are the output of multiple stacked Transformer layers. The input to the Transformer layers is the summation of the token embeddings, segment embeddings, and position embeddings. The task-specific layer uses a Transformer decoder for the two tasks, where the input to the decoder is the contextual embeddings from the shared layers. Algorithm 1 illustrates our multi-task learning procedure. During multi-task learning, for every mini-batch of Task #1 or Task #2, the model is updated according to the objective of Task #1 or Task #2, respectively. Such a setup has been reported to be effective and to approximately optimize the sum of all multi-task objectives. We describe Task #1 and Task #2 in detail as follows.
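The mini-batch-level alternation between the two tasks can be sketched in Python as below. This is only an illustration of the control flow: the function `multitask_epoch` and the task names are hypothetical, the update callables stand in for "compute the task loss, then update the shared and task-specific parameters", and interleaving the tasks by shuffling the merged mini-batches is an assumption.

```python
import random
from typing import Callable, Dict, List


def multitask_epoch(
    task_batches: Dict[str, List[list]],
    update_fns: Dict[str, Callable[[list], None]],
    seed: int = 0,
) -> List[str]:
    """Visit every mini-batch of every task once; return the task order used."""
    schedule = [(task, batch)
                for task, batches in task_batches.items()
                for batch in batches]
    random.Random(seed).shuffle(schedule)  # assumption: tasks are interleaved
    order = []
    for task, batch in schedule:
        update_fns[task](batch)  # one update with this task's objective
        order.append(task)
    return order


# Toy usage: count how many updates each task receives.
updates = {"code_to_summary": 0, "api_to_summary": 0}


def make_update(task: str) -> Callable[[list], None]:
    def update(batch: list) -> None:
        updates[task] += 1
    return update


order = multitask_epoch(
    {"code_to_summary": [["c1"], ["c2"], ["c3"]],
     "api_to_summary": [["a1"], ["a2"]]},
    {task: make_update(task) for task in updates},
)
print(updates)  # {'code_to_summary': 3, 'api_to_summary': 2}
```

Because every update touches the shared encoder, both tasks shape the same contextual embeddings even though each mini-batch optimizes only one objective.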

Task #1: Code to Summary Generation
This task takes the code as the input and gives the summary as the target output. Each line of code is prefixed with the [CLS] token for learning the line-of-code representation. The last token of every line of code is [SEP]. We use the cross-entropy loss as the objective function, i.e.,
$$L(\Theta) = -\sum_{i=1}^{N} y^{T_1}_{i} \log \hat{y}^{T_1}_{i} \tag{5}$$
where $y^{T_1}_{i}$ denotes the target token of the summary at time step $i$ for Task #1, and $\hat{y}^{T_1}_{i}$ denotes the probability of generating the token for Task #1 at time step $i$. $N$ is the total number of words generated.
Algorithm 1: Training an MT model
    for all mini-batches do
        1. Compute loss L(Θ): Eq. 5 for Task #1; Eq. 6 for Task #2
        2. Compute gradient: ∇(Θ)
        3. Update model: Θ = Θ − ∇(Θ)
    end for

Task #2: API sequence to Summary Generation
Task #2 is similar to Task #1 except that, instead of taking every code token as input, the API sequence is used as input. Furthermore, our approach does not distinguish between different lines of code in Task #2, and the entire API sequence of a function is treated as a single line of code. The objective function is also the cross-entropy loss, i.e.,
$$L(\Theta) = -\sum_{i=1}^{N} y^{T_2}_{i} \log \hat{y}^{T_2}_{i} \tag{6}$$
where $y^{T_2}_{i}$ denotes the target token of the summary at time step $i$ for Task #2, and $\hat{y}^{T_2}_{i}$ denotes the probability of generating the token for Task #2 at time step $i$. $N$ is the total number of words generated.
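Both tasks use the same form of objective, which can be illustrated with a small Python sketch. It assumes one-hot targets, so the loss reduces to the summed negative log-probability that the decoder assigns to each target token; the function name and the toy probabilities are hypothetical.

```python
import math
from typing import List, Sequence


def sequence_cross_entropy(step_probs: Sequence[Sequence[float]],
                           target_ids: List[int]) -> float:
    """Sum of -log p(target token) over the N decoding steps."""
    if len(step_probs) != len(target_ids):
        raise ValueError("need one probability distribution per target token")
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


# Two decoding steps over a toy vocabulary of size 2.
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
loss = sequence_cross_entropy(probs, targets)
print(round(loss, 4))  # 0.9808
```

The only difference between Task #1 and Task #2 is the encoder input that conditions these probabilities (code tokens versus the API sequence).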

Experimental Setup
This section describes the datasets used in our experiments (Section 4.1), the different metrics used in the automatic evaluation (Section 4.2), and the qualitative evaluation (Section 4.3). The different baselines are discussed in Section 4.4 and the hyperparameters to our models are listed in Section 4.5.

Datasets
We made use of two common datasets, Java (Hu et al., 2018b) and Python (Miceli Barone and Sennrich, 2017; Wan et al., 2018). They have been widely used in previous work (Hu et al., 2018b; Wan et al., 2018; Ahmad et al., 2020). Each dataset consists of pairs of code and a single-sentence summary describing the code. Each dataset was split into separate Train, Validation, and Test sets (shown in Table 1). We use exactly the same splits as previous studies (Ahmad et al., 2020) without any alteration.
Java. The Java methods and their summaries were collected from Java projects on GitHub from 2015 to 2016. The first sentence of the Javadoc of every method was extracted as the ground-truth code summary. As a result, each Java method forms a <code, summary> pair. Following prior work (Gu et al., 2016), the API sequence of a Java method was collected by parsing the Java method using Eclipse's JDT compiler², constructing an AST, and extracting the API sequence represented in the AST. The second column of Table 1 presents the data statistics of Java.
Python. The Python functions and their summaries were collected from Python projects on GitHub in 2016. If a Python function contains a docstring, the first sentence of the docstring is treated as the ground-truth code summary, and the corresponding function forms the code of the summary. Similar to Java, the Python code is first parsed into an AST using asttokens³, which is a common library for transforming Python code into AST form. The API sequence of a Python function is then extracted from the AST. The third column of Table 1 shows the data statistics of Python.
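As a rough illustration of extracting an API sequence from an AST, the following Python sketch collects the names of called functions and methods. It uses the standard-library `ast` module rather than asttokens, and the extraction rule (call names in pre-order traversal) is an assumption, since the exact rule used to build the datasets is not specified here.

```python
import ast
from typing import List


def api_sequence(source: str) -> List[str]:
    """Collect called function/method names in pre-order AST traversal."""
    calls: List[str] = []

    class CallCollector(ast.NodeVisitor):
        def visit_Call(self, node: ast.Call) -> None:
            func = node.func
            if isinstance(func, ast.Name):         # e.g. open(...)
                calls.append(func.id)
            elif isinstance(func, ast.Attribute):  # e.g. data.keys()
                calls.append(func.attr)
            self.generic_visit(node)               # descend into nested calls

    CallCollector().visit(ast.parse(source))
    return calls


code = """
def list_keys(path):
    handle = open(path)
    data = json.load(handle)
    return sorted(data.keys())
"""
print(api_sequence(code))  # ['open', 'load', 'sorted', 'keys']

# Two different functions can share one API sequence.
area = "def area(name):\n    return get_area(name)"
twice = "def twice_area(name):\n    return get_area(name) * 2"
print(api_sequence(area) == api_sequence(twice))  # True
```

The last two lines mirror the situation where functions with different summaries yield the same API sequence, which is why the API sequence alone is an ambiguous input.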

Metrics for Quantitative Analysis
We evaluate our approach using three widely used metrics in code summarization, as follows.
BLEU (Papineni et al., 2002) quantifies the lexical similarity of the generated summary to the ground truth summary by counting the common n-grams. METEOR (Banerjee and Lavie, 2005) measures the alignment between the generated and the ground truth summary by exact, stem, synonym, and paraphrase matches between words and phrases. ROUGE-L (Lin, 2004) measures the longest common subsequence overlap between the generated and ground truth summary, and focuses on recall scores.
² https://www.eclipse.org/jdt/
³ https://pypi.org/project/asttokens/
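The longest-common-subsequence idea behind ROUGE-L can be illustrated with a short Python sketch. This is a simplified, whitespace-tokenized version for exposition, not the official scoring script, and the function names are hypothetical.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> Tuple[float, float]:
    """Return (recall, precision) of the LCS between two summaries."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0
    return lcs / len(ref), lcs / len(cand)


recall, precision = rouge_l("create a directory",
                            "delete and create a directory")
print(recall, precision)  # 0.6 1.0
```

In this toy pair the generated summary is a subsequence of the reference, so precision is perfect while recall is penalized for the missing words.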

Qualitative Analysis
We randomly selected 200 generated summaries along with their original code, 100 pairs each for Java and Python, following prior research (Liu and Lapata, 2019; Grusky et al., 2018). Amazon Mechanical Turk (MTurk) workers were hired to rate the quality of the generated summaries. The MTurkers participated voluntarily and were compensated one cent for each rated summary. We used four common criteria to evaluate the summarization quality (Liu and Lapata, 2019):
Informativeness: How well does the summary capture the key points of the code?
Relevance: Are the details provided in the summary consistent with the details in the code?
Fluency: Are the summaries well-written and grammatically correct?
Comprehension: Do the summaries help in understanding the code?
Three different workers were required to rate each summary between one and five, where one is the worst and five is the best. We also asked the MTurkers about their Java/Python coding experience and whether they understood the generated summaries and code. In addition to the MTurk surveys, we performed an additional analysis on the same set of code-summary pairs as those reviewed by the MTurkers. The purpose is to further examine the quality of our generated summaries by investigating the differences between the generated and ground-truth summaries.

Baseline Models
We compare our approach with the following eight baseline models, as seen in Table 2. CODE-NN (Iyer et al., 2016) uses token embeddings as source code embeddings, and the overall model architecture is based on LSTM. Additionally, it uses a global attention mechanism that computes a weighted sum of the source code embeddings during the decoding process. In Tree2Seq (Eriguchi et al., 2016), the code is transformed into a structural tree representation, which is used as the input to the encoder.
RL+Hybrid2Seq (Wan et al., 2018) uses both the code and the Abstract Syntax Tree of the corresponding code as the input to a reinforcement learning model. DeepCom (Hu et al., 2018a) also uses both the code and the Abstract Syntax Tree as input, but for a general sequence-to-sequence model. Its attention mechanism considers both the code and the Abstract Syntax Tree. API+CODE (Hu et al., 2018b) first trains a sequence-to-sequence model on API sequences and code summaries. A secondary model is then created for the code-to-summary task. In the secondary model, if a code token is an API, its embedding is borrowed from the first model. CodeBERT 1 tunes CodeBERT's code summarization task with CodeBERT's own training and validation datasets, while CodeBERT 2 tunes it with the common training and validation datasets presented in Table 1. For both, the default tuning settings recommended by the authors of CodeBERT are used. In testing, we use the same common test datasets in Table 1 for all models.
Table 2: Comparison of our proposed approach with the baseline results. Our proposed model and ablation settings are shown in the bottom three rows. Our proposed model and ablation settings consistently achieve state-of-the-art performance when compared with all the other baseline models on both the Java and Python datasets.

Hyperparameters
We applied dropout with probability 0.1 before all linear layers and used label smoothing (Szegedy et al., 2016) with a smoothing factor of 0.1. During decoding, the beam size is set to 5. We follow prior work (Ahmad et al., 2020) in setting the maximum lengths of code and summary to 150 and 50, respectively. Other hyperparameters, including the number of epochs, are tuned based on the model performance on the validation set. All experiments were conducted on eight NVIDIA RTX 2080 GPUs over a four-week period.
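For readers unfamiliar with label smoothing, the following Python sketch shows the smoothed target distribution for a smoothing factor of 0.1. It assumes the uniform-prior formulation of Szegedy et al. (2016), in which a fraction eps of the probability mass is spread evenly over the whole vocabulary; the function name and the toy vocabulary size are hypothetical.

```python
from typing import List


def smooth_labels(target_id: int, vocab_size: int, eps: float = 0.1) -> List[float]:
    """Mix a one-hot target with a uniform distribution over the vocabulary."""
    dist = [eps / vocab_size] * vocab_size  # uniform share for every token
    dist[target_id] += 1.0 - eps            # remaining mass on the true token
    return dist


dist = smooth_labels(target_id=2, vocab_size=4)
print([round(p, 3) for p in dist])  # [0.025, 0.025, 0.925, 0.025]
```

Training against this distribution instead of the one-hot target discourages the decoder from becoming over-confident in a single token.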

Results
We provide the automatic evaluation of the baseline models and our proposed model in Section 5.1 followed by the human evaluation under Section 5.2. In addition to the human evaluation, we also provide several examples for additional qualitative analysis.

Quantitative Results
To evaluate the effectiveness of our LOC modeling and multi-task model, discussed in Sections 3.1 and 3.3, respectively, we performed an ablation study by running the experiments without these two components. The second row from the bottom of Table 2 shows the result of our proposed model without the multi-task component. This means that the model considers only the single task of code to summary (i.e., Task #1 in Figure 2) without the task of API sequence to summary (i.e., Task #2 in Figure 2). The third row from the bottom shows the result of our proposed model without both the multi-task and LOC modeling components. The model without LOC modeling treats the code as a contiguous sequence of tokens. In short, the comparison between the second and third rows from the bottom shows the effectiveness of our LOC modeling, and the comparison between the second-to-last and last rows shows the effectiveness of our multi-task model. In the table, we omit the results of training a single-task model on API sequence to summary because different code may have the same API sequence, and thus the model would not be well trained. We illustrated this using Figures 1a and 1b in the Introduction.
Table 2 shows that learning the LOC modeling improves the majority of the metrics. For example, on the Java dataset, two (BLEU and ROUGE-L) out of three metrics in "Ours w/o multi-task" show improvement over "Ours w/o multi-task w/o LOC modeling". On the Python dataset, all three metrics improve. This suggests that our proposed LOC modeling is effective. Furthermore, our proposed multi-task approach (last row in Table 2) achieves the best performance on both the Java and Python datasets, setting a new state of the art.
Figure 4 shows qualitative examples of the generated summaries. The first column shows a Java example, the creation of a GridLayout.
Although the generated summary and the ground truth summary differ largely in terms of unigrams, they have the same semantics. The second column shows an example of Python code for listing keys, where the generated summary and the ground truth summary are identical. The figure shows that our generated summaries achieve good quality by conveying the same semantics as the ground truth.
Table 3 shows the survey results from Amazon MTurkers on the generated summaries given the Java and Python code. The majority of the MTurkers have between 1 and 5 years of experience: 86.3% in the Java survey and 72.2% in the Python survey. Only a minority of the MTurkers did not understand the code and the generated summary: 19.4% for Java and 27.3% for Python. On average, MTurkers found that the generated summaries for Java are informative, relevant, and fluent. For Python, MTurkers found the generated summaries less informative, less relevant, and less fluent than the summaries for Java. We believe that the main reason for the lower performance on Python may be the flexibility of the Python programming language (as compared to Java): it is dynamically typed and allows developers to write multiple different variants of code for the same functionality. For example, in Python, instead of writing multiple lines of code in a loop, the developer can combine them into a single-line list comprehension. For both Java and Python, MTurkers found that the generated summaries can help them understand the code better.
In the authors' analysis, the majority of the generated summaries convey the same meaning as the ground truth: 35% (Java) and 16% (Python) are identical to the ground truth, and 29% (Java) and 38% (Python) have a different structure but the same semantics as the ground truth.
The generated summaries with different meanings are still of high quality: they miss just a few points from the ground truth (e.g., "delete and create a directory" in the ground truth becomes "create a directory" in the generated summary) or use slightly different adjective phrases that do not damage the key points (e.g., "latex preamble" in the ground truth becomes "current preamble").

Conclusions
In this work, we proposed a novel and effective multi-task approach for generating summaries from code. Two different but similar tasks, 1) generating summaries from code, and 2) generating summaries from the API sequence, are trained simultaneously. Our proposed model also models every line of code (LOC), whereas existing work treats code as a single contiguous sequence. To the best of our knowledge, this is the first work that utilizes a natural language-based pre-trained language model for a code summarization task. Our experimental results on two popular datasets, Java and Python, show that our proposed model performs better than all baselines, achieving new state-of-the-art performance. Additionally, both the multi-task component and the LOC modeling component of our proposed model are demonstrated to be effective. Furthermore, our proposed approach can be combined with pre-trained models other than BERT, potentially improving upon their original performance.