Learning Sequential and Structural Information for Source Code Summarization

We propose a model that learns both the sequential and the structural features of code for source code summarization. We adopt the abstract syntax tree (AST) and graph convolution to model the structural information and the Transformer to model the sequential information. We convert code snippets into ASTs and apply graph convolution to obtain structurally-encoded node representations. Then, the sequences of the graph-convolutioned AST nodes are processed by the Transformer layers. Since structurally-neighboring nodes will have similar representations in graph-convolutioned trees, the Transformer layers can effectively capture not only the sequential information but also the structural information such as sentences or blocks of source code. We show through experiments and human evaluations that our model outperforms the state of the art for source code summarization.


Introduction
Descriptions of source code are very important documents for programmers. Good descriptions help programmers understand the meaning of code quickly and easily.
Source code has sequential information (code tokens) and structural information (dependency and structure). To understand the content of the code and generate a good summary, both kinds of information are essential. However, most previous works used only one kind of information. Iyer et al. (2016), Liang and Zhu (2018), Hu et al. (2018b), and Allamanis et al. (2016) simply converted the source code into sequences of tokens and tried to extract the features of the source code from this sequential information. They rarely considered the structural information about the relationships between tokens.
On the other hand, Shido et al. (2019); Harer et al. (2019); LeClair et al. (2020); Scarselli et al. (2008) proposed tree-based models to capture the features of the source code. They used the structural information from parse trees but hardly considered the sequence information of code tokens.
In order to accurately understand and represent the source code, it is necessary to encode the structural information as well as the sequential information. Recently, Ahmad et al. (2020) tried to represent both the sequential and the structural information using the Transformer model with relative encoding. Since relative encoding clips the maximum distance for attention without considering the parse tree, it is limited in its ability to represent the structure of the code.
In this work, we propose a model that learns both the structural and sequential information of source code. We represent code snippets as abstract syntax trees (ASTs), and apply graph convolution (Kipf and Welling, 2017) to the ASTs to obtain the node representation reflecting the tree structure such as parents, children, and siblings. Nodes that are close to each other in a tree, such as parent and child nodes and sibling nodes, will have similar representations. Next, we convert the graph-convolutioned ASTs into sequences by the pre-order traversal and process them with Transformer layers (Vaswani et al., 2017). Since structurally-neighboring nodes have similar representations, such nodes will pay more attention to one another in the Transformer layers. Thus, the Transformer layers can easily capture not only the sequential information but also the structural information such as sentences or blocks of source code.
We also modify ASTs to better represent the structural information of the source code. We add sibling edges to represent neighboring blocks in source code and add a node representing the name of the function for Python. With the modification, our model can better catch the blocks in the source code.
In the experiments, we show that our model outperforms the state of the art for source code summarization. We use two well-known datasets collected from GitHub, one for Java (Hu et al., 2018b) and one for Python (Wan et al., 2018). We additionally perform human evaluations and analyze the attention maps between code tokens to see how well the model has captured the structural information of code. The results show that modeling both structural and sequential information is very effective for source code summarization.
We describe the modified AST (mAST) in Section 2 and present the proposed approach in Section 3. In Section 4, we show the superiority of our approach with experimental results and human evaluations. We describe the related work in source code summarization and compare their approaches with our proposed approach in Section 5. Finally, we conclude the paper in Section 6.
Figure 2: The Python AST parser we used does not create a node for the function name, unlike the Java AST parser. Since the function name is a very important keyword in generating summaries, we add the function name (red box (a)) to the FunctionDef node in the Python AST.

Representing Code as mAST
Figure 1 shows a code snippet and its AST in Java. An abstract syntax tree (AST) represents the abstract syntactic structure of code in a programming language. Source code is separated into blocks and can be transformed into a tree structure. The leaf nodes of an AST represent code identifiers and names. The non-leaf nodes represent the grammar or the structure of the language. All non-leaf nodes in an AST carry structural information about which block they belong to (parent node) and which blocks they contain (child nodes). So, we can easily capture the structural information of code from ASTs.
In order to represent structural information more effectively, we modify ASTs by adding edges between siblings. Statements or blocks at the same level in a code snippet are represented as sibling nodes. For example, the Expression node (line 2 in the code), the Foreach node (line 3), and the Return node (line 7) are statements or blocks at the same level, and they are represented as siblings in the AST. However, in ASTs, blocks at the same level (sibling nodes) are not directly connected, as shown in Figure 1b. We can capture such information only indirectly, via parent nodes.
Neighboring blocks are very important for the sequential and structural understanding of source code. To directly represent neighboring blocks, we add edges between sibling nodes.
In the case of Python, the Python AST parser we used does not create a node for the function name, unlike the Java AST parser. As the function name is a very important keyword in generating summaries, we add the function name to the FunctionDef node as a child in the Python AST, as shown in Figure 2. Then, we modify Python ASTs by adding sibling edges.
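The two modifications described in this section can be illustrated with a minimal sketch. It assumes Python's standard `ast` module as a stand-in for the parser used in the experiments; the function name `build_mast` and the returned lists are hypothetical names chosen for illustration. It also assumes that the function name becomes the first child of the FunctionDef node and that sibling edges connect adjacent siblings only, neither of which is specified in the text.

```python
import ast


def build_mast(source):
    """Build a modified AST (mAST) from Python source code.

    Returns (labels, tree_edges, sibling_edges). Nodes are numbered in
    pre-order and labels[i] is the label of node i.
    """
    labels, tree_edges, sibling_edges = [], [], []

    def visit(node, parent):
        idx = len(labels)
        labels.append(type(node).__name__)
        if parent is not None:
            tree_edges.append((parent, idx))

        child_ids = []
        # The standard parser stores the function name as a string attribute,
        # not as a node, so it is added explicitly as a child of FunctionDef.
        # (Assumption: it is placed before the other children.)
        if isinstance(node, ast.FunctionDef):
            name_idx = len(labels)
            labels.append(node.name)
            tree_edges.append((idx, name_idx))
            child_ids.append(name_idx)

        for child in ast.iter_child_nodes(node):
            child_ids.append(visit(child, idx))

        # Sibling edges: connect each child to the next child of the same parent.
        sibling_edges.extend(zip(child_ids, child_ids[1:]))
        return idx

    visit(ast.parse(source), None)
    return labels, tree_edges, sibling_edges


if __name__ == "__main__":
    labels, tree_edges, sibling_edges = build_mast(
        "def add(a, b):\n    c = a + b\n    return c\n"
    )
    for parent, child in tree_edges:
        print(f"tree    {labels[parent]} -> {labels[child]}")
    for left, right in sibling_edges:
        print(f"sibling {labels[left]} -- {labels[right]}")
```

Parent-child edges and sibling edges are kept in separate lists so that the original tree remains available for traversal while the union of both lists defines the graph.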

Proposed Model
We propose a model using graph convolution layers and Transformer layers to summarize the source code. To encode both the structural and the sequential information of code, we combine the two kinds of layers. Figure 3 shows the overview of our model. A given snippet is represented as a modified AST (mAST). The initial representation of nodes in the mAST is generated by the embedding layer. Since the embedding layer generates the representation considering only the nodes themselves, we use graph convolution to capture the structural information. We apply graph convolution to each node in the mAST. Then, we obtain node representations that reflect the structural features as well as the node features.
The graph-convolutioned mASTs are converted into sequences by pre-order traversal, and the sequences are given to the Transformer encoder. Since structurally-neighboring nodes have similar representations, the Transformer encoder can effectively capture not only the sequential features but also the structural features such as sentences or blocks of source code. After the Transformer encoder generates the representations by reflecting the sequential and structural information, the Transformer decoder generates summaries.

Graph Convolution Network
The graph convolutional network (Kipf and Welling, 2017) is a graph neural network that represents each node based on the features of its neighborhood in graph data. In this paper, graph convolution layers are used to capture the structural information of mASTs. Since the mAST extracted from a given code C is a graph, we denote it as G(C) = {V, E}, where V is a set of nodes and E is a set of edges. Initially, nodes in V are one-hot encoded tokens, which are then mapped into representation vectors, X, by the embedding layer.
Given representation X of nodes and E of edges, new representations of nodes are calculated by graph convolution layers as follows.
$$H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right), \qquad H^{(0)} = X$$
where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $W^{(l)} \in \mathbb{R}^{d \times d}$ is the graph convolution weight matrix in the $l$-th layer, $\sigma$ is the activation function, $n$ is the total number of nodes in an mAST, and $d$ is the embedding dimension. The feature of each node represented by the graph convolution layer is denoted as $H \in \mathbb{R}^{n \times d}$. In the experiment, the dimension of the weight matrix in a graph convolution is $d = 512$.
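A single graph convolution layer of the kind described here, H' = σ(A H W) in the symbols of the text, can be sketched in plain Python. Several details are assumptions made for illustration and are not stated in the text: edges are treated as undirected, self-loops are added so that a node keeps its own features, ReLU is used as the activation, and no degree normalization is applied. The helper names `adjacency` and `graph_convolution` are hypothetical.

```python
def relu(x):
    return max(0.0, x)


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adjacency(n, edges, self_loops=True):
    """Build an n x n adjacency matrix from a list of (u, v) edges.

    Assumption: edges are undirected and self-loops are included.
    """
    a = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1.0
        a[v][u] = 1.0
    if self_loops:
        for i in range(n):
            a[i][i] = 1.0
    return a


def graph_convolution(a, h, w, activation=relu):
    """One layer: activation(A @ H @ W), applied element-wise."""
    z = matmul(matmul(a, h), w)
    return [[activation(v) for v in row] for row in z]


if __name__ == "__main__":
    # A path of three nodes: 0 - 1 - 2, with 2-dimensional node features.
    a = adjacency(3, [(0, 1), (1, 2)])
    h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, to expose the aggregation
    print(graph_convolution(a, h, w))
```

With identity weights, each output row is simply the sum of a node's own features and those of its neighbors, which is why neighboring nodes end up with similar representations.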

Transformer Encoder-Decoder
After the graph convolution layers, the mAST is converted into a sequence by pre-order traversal. The pre-order traversal is applied to the original AST, not the mAST, because adding sibling edges turns mASTs into graphs. The mAST is used to obtain the structural representation of nodes by considering nodes and their neighbors. Since the original AST contains the original structure of the source code, we use it to obtain the sequence. The Transformer encoder and decoder follow the graph convolution layers. The sequence of the mAST nodes is fed into the Transformer encoder. The Transformer architecture is good at capturing long-term dependencies in a sequence. Since we use graph convolutions, which generate similar representations for structurally-neighboring nodes, the Transformer encoder can easily capture dependencies between nodes in the same code block, between similar code blocks, and between code blocks at the same level. As a result, the Transformer encoder can generate new representation vectors that reflect both the sequential and the structural information well.
Next, the Transformer decoder generates the summary tokens from the vectors generated by the Transformer encoder. In the experiment, the dimension of nodes and summary tokens is d_model = 512. The Transformer encoder and decoder are each composed of a stack of N = 6 layers.
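The conversion from tree to sequence can be sketched as follows. The sketch assumes the original AST is available as a mapping from each node index to its ordered list of children (sibling edges are deliberately absent from this mapping), and that the graph convolution has already produced one vector per node. The names `preorder` and `to_sequence` are hypothetical.

```python
def preorder(children, root=0):
    """Return node indices in pre-order (parent first, then children left to right)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


def to_sequence(children, node_vectors, root=0):
    """Arrange graph-convolved node vectors in the pre-order of the original AST."""
    return [node_vectors[i] for i in preorder(children, root)]


if __name__ == "__main__":
    # Original AST only: node 0 has children 2 and 1, node 2 has child 3.
    children = {0: [2, 1], 2: [3]}
    vectors = {0: [0.0], 1: [0.1], 2: [0.2], 3: [0.3]}
    print(preorder(children))
    print(to_sequence(children, vectors))
```

The resulting list of vectors is what an encoder would consume as its input sequence; only the ordering comes from the tree, while the vectors themselves carry the neighborhood information from the graph.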

Experiment
We perform various experiments to show the superiority of our model for source code summarization.

Setup
Datasets We evaluate our model using the Java dataset (Hu et al., 2018b) and the Python dataset (Wan et al., 2018). The statistics of the experiment datasets are shown in Table 1. To extract the abstract syntax trees of the code, we used the Java parser used by Alon et al. (2019) and the Python parser used by Wan et al. (2018).
Baselines We compare our model with baseline models based on sequential information by Iyer et al. (2016); Hu et al. (2018a,b); Wei et al. (2019); Ahmad et al. (2020) and based on structural information by Eriguchi et al. (2016); Wan et al. (2018). We refer to the baseline results reported by Ahmad et al. (2020).
Hyper-parameters We set the maximum length to 200, and the vocabulary sizes for code and summary to 50,000 and 30,000, respectively. We train our proposed model using Adam optimizer (Kingma and Ba, 2015). The mini-batch size and dropout rate are 32 and 0.2. We set the maximum training epoch to 200, and use early stopping. We adopt beam search during inference time and set the beam size to 4.

Quantitative Result
Overall Result Table 2 shows the overall performance of the models. We present three proposed models: AST-Only, AST+GCN and mAST+GCN.
AST-Only is the proposed model without graph convolution layers. The model converts code snippets into ASTs and does not include graph convolutions. The sequenced AST nodes are given to the Transformer encoder and decoder. We present this model to verify how effective ASTs are for source code summarization. It performs better than the baselines except for the TransRel model proposed by Ahmad et al. (2020). This result shows that the AST, which carries more structural information about source code, is a better input than plain code for source code summarization.
AST+GCN is the proposed model with graph convolution layers but without AST modification (adding sibling edges). Code snippets are converted into ASTs and node representations are generated by graph convolutions. The sequenced graph-convolutioned AST nodes are input to the Transformer encoder and decoder. This model verifies how useful the graph convolutions are. It shows better performance than the baselines.
mAST+GCN is the proposed model with ASTs modified by adding sibling edges and with graph convolution layers. It outperforms all baseline models. The performance improves by 0.91 and 0.3 BLEU, 0.74 and 0.35 METEOR, and 0.06 and 0.08 ROUGE-L points in comparison to TransRel for the Java and Python datasets, respectively. The proposed model with the AST modification performs better on BLEU, METEOR, and ROUGE-L (except for METEOR in Java) than the model without the modification. This indicates that modified ASTs help models learn more structural information of code than general ASTs.

Position of graph convolution layers
We perform additional experiments with different positions of graph convolution layers. The positions are the front of the Transformer encoder, the back of the Transformer encoder, and both the front and back of the Transformer encoder. Table 3 shows the performance scores according to the position of the graph convolution layers. The front model is the same as mAST+GCN which has one graph convolution layer in front of the encoder.
The back model does not have graph convolution layers in front of the Transformer encoder but has one after the encoder. Nodes in an mAST are input to the encoder without graph convolutions, but graph convolutions are applied to the output of the encoder. Since the Transformer encoder can catch structural patterns in simple sequences, the graph convolution at the back of the encoder may work better than the one in front because it can feed sharper structural information to the decoder.
The front+back model has two graph convolution layers: one in front of and the other at the back of the Transformer encoder. It may catch much stronger structural patterns. The representations structurally encoded by a graph convolution layer are fed to the encoder, and the output of the encoder is structurally enhanced once more by a graph convolution layer.
The front model performs best, the front+back model is next, and the back model is the worst, which means that the graph convolution before the encoder is effective.
Since the Transformer has the ability to extract comprehensive features considering not only sequential but also structural information in sequences, the convolution layer at the back of the encoder may destroy such features and degrade the performance. However, the graph convolution layer in front of the encoder can help the encoder analyze structural patterns and extract better features because it enhances structural information.
Table 4: Performance by number of graph convolution layers

Number of graph convolution layers
We analyze the performance according to the number of graph convolution layers. Graph convolutions are effective at capturing structural features, so more layers might be expected to improve the performance. We tried one to three layers in front of the encoder. Table 4 shows the results of the models with 1, 2, and 3 layers. The results show that our proposed model with one graph convolution layer in front of the Transformer encoder performs better than the others. We think that this is because the graph structure of an AST is not as complex as that of a general graph, so the node representations suffer from the over-smoothing problem when graph convolution layers are stacked deep.
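The over-smoothing effect mentioned above can be illustrated on a toy tree. To isolate the smoothing behavior, the sketch below makes simplifying assumptions that differ from the trained model: each step replaces a node's value by the mean of itself and its neighbors, with no learned weights and no activation. The names `mean_aggregate` and `spread` are hypothetical.

```python
def mean_aggregate(adj_lists, h):
    """One smoothing step: each node becomes the mean of itself and its neighbors."""
    out = []
    for i, row in enumerate(h):
        group = [i] + adj_lists[i]
        out.append([sum(h[j][k] for j in group) / len(group) for k in range(len(row))])
    return out


def spread(h):
    """Largest per-dimension gap between any two node representations."""
    return max(max(col) - min(col) for col in zip(*h))


if __name__ == "__main__":
    # A root (node 0) with two children (nodes 1 and 2), one feature per node.
    adj_lists = [[1, 2], [0], [0]]
    h = [[0.0], [1.0], [2.0]]
    for layer in range(4):
        print(layer, spread(h))
        h = mean_aggregate(adj_lists, h)
```

In this toy setting the gap between node representations shrinks with every additional step, so after several steps the nodes become hard to tell apart, which is consistent with the explanation given for why a single layer worked best.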

Qualitative Result
We present the qualitative analysis of our model. The attention map of an example code is compared to show how much our model catches the structural information. In order to further validate the performance metrics of our model, we perform a human evaluation on randomly sampled code snippets.

Attention Map Comparison
We analyze attention maps of mAST+GCN, AST+GCN and the baseline by Ahmad et al. (2020) to verify how our model generates node representations compared to the others. Since we try to emphasize the structural information, we need to verify how much our model reflects the structural information to generate representations.
We observe the attention maps for the sample code in Figure 5. We draw an attention map by evaluating the pairwise dot product of the output of a Transformer layer in the encoder. For mAST+GCN and AST+GCN, the output of a layer is the sequence of the mAST nodes, and for the baseline, it is the sequence of the program tokens.
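The pairwise dot product used to draw these maps can be sketched in a few lines. The sketch assumes the output of a Transformer layer is available as a list of vectors, one per node or token; the function name `attention_map` is hypothetical, and any scaling or normalization applied before plotting is left out because it is not described in the text.

```python
def attention_map(outputs):
    """Return the matrix of pairwise dot products between layer output vectors."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in outputs]
        for u in outputs
    ]


if __name__ == "__main__":
    # Two nodes with identical representations followed by a dissimilar one.
    outputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    for row in attention_map(outputs):
        print(row)
```

In the toy input, the first two vectors are identical, so the top-left 2 x 2 corner of the map holds high values while the entries linking them to the third vector are zero. This is the kind of rectangle on the diagonal that the analysis looks for.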
We compare the attention maps of the first and the last Transformer layer in the encoder of each model. Figures 4a, 4b, and 4c are the attention maps of the first layer of mAST+GCN, AST+GCN, and the baseline. Figures 4d, 4e, and 4f are the attention maps of the last layer of mAST+GCN, AST+GCN, and the baseline.
Rectangles on the diagonal, such as the one in the red box in Figure 4a, represent blocks in the snippets. Since nodes or tokens in a block may have high similarities, we can see rectangles along the diagonal. In the attention maps of the first layers, the rectangles are faint because the structural information has not been processed much yet, but in those of the last layers we can see many distinct rectangles, which means that the structural features are clearly captured.
If we compare the attention maps by our models and the baseline, there are many small rectangles in the baseline. On the contrary, in our models, we can see a few large rectangles in which there are small rectangles. We see a hierarchical structure in the attention maps of our model.
The fact that the baseline produces many small rectangles implies that the baseline can capture only small structural features. We can also note that these small structural features are smaller than statements, considering that the example snippets have only 4 lines. The baseline hardly captures large structural features.
On the other hand, our model effectively captures large and hierarchical structural features. We can easily identify rectangles that match with statements or blocks in the attention maps by our proposed models.
The attention maps from mAST+GCN and AST+GCN are very similar. However, we can see differences in the attention maps of both the first and the last layer. If we compare the first-layer attention maps, the blocks of mAST+GCN are more distinct, which implies that the modification of ASTs by adding sibling edges helps to model structural information. If we compare the last-layer attention maps, we can see that the hierarchical structures of rectangles are clear in the map by mAST+GCN, which also suggests that the modification is effective for capturing structural features.

Human Evaluation
We performed a human evaluation (Kryscinski et al., 2019) on the Java dataset to assess the quality of the summaries our model generates. We randomly chose 100 snippets and asked 4 people with knowledge of Java to evaluate the summaries. They are CS graduate students with many years of experience in Java. We asked them to evaluate the following 3 aspects:
• Fluency (quality of the summary)
• Relevance (selection of the important content from source code)
• Coverage (selection of the whole content of source code)
We showed pairs of summaries from our model and the baseline (Ahmad et al., 2020) to the annotators, and asked them to select one of win, tie, and loss for each of the three aspects. Our model shows superiority in all aspects, as shown in Table 5. The scores of fluency and relevance are higher than the baseline, which means that our model generates more appropriate summaries using more natural expressions. Figure 6 shows some examples of summaries for the qualitative comparison. We chose 6 Java snippet examples from the snippets on which all the annotators made the same decision in each aspect. The three snippets on the left are ones for which the annotators chose win (our model is better), and the three on the right are ones for which they chose loss.
Table 5: Human evaluation of the appropriateness of the generated summaries on the Java dataset. We ask annotators to select a more appropriate summary from two candidates generated by different models. Our proposed model outperforms the baseline.

Related Work
As deep learning techniques have developed, research on source code summarization has largely been based on sequence-to-sequence models. Iyer et al. (2016) proposed a model that performed the source code summarization task for the first time. Allamanis et al. (2016) summarized the source code using a convolutional attention network model. Hu et al. (2018a) proposed an RNN-based sequence-to-sequence model using the pre-order traversal sequence of the abstract syntax tree. Also, Hu et al. (2018b) summarized source code with the knowledge of imported APIs using two encoders (a source code encoder and an API encoder). Ahmad et al. (2020) proposed a Transformer model with relative position encoding for summarizing source code. Wei et al. (2019) proposed a dual model that learned the code and summary sequences simultaneously. Wan et al. (2018) adopted reinforcement learning to summarize source code. These approaches mainly focused on the sequential and context information of code, but gave little consideration to the structural information about the relationships between code tokens.
There are also studies that convert source code to ASTs to represent the structure of source code. Liang and Zhu (2018) proposed a tree-based recursive neural network to represent the syntax tree of code. Shido et al. (2019) represented source code using the tree structure encoder of tree-LSTM. Harer et al. (2019) adopted a tree-transformer to encode the structure of ASTs. Fernandes et al. (2019) proposed a structured neural model for source code summarization. Alon et al. (2019) represented source code based on AST paths between pairs of tokens. LeClair et al. (2020) proposed a model that encoded the AST of source code using graph neural networks. These approaches utilized ASTs to capture structural features, but gave less consideration to the sequential characteristics of code in a programming language.

Conclusion
We proposed a model that learned both the sequential and the structural features of code for source code summarization. We adopted the abstract syntax tree (AST) and graph convolution to model the structural information and the Transformer to model the sequential information. We also modified the AST to deliver more structural information.
We verified through quantitative and qualitative analysis that modified ASTs and graph convolutions were very effective at capturing the structural features of code. We also showed the superiority of our model over the state of the art for source code summarization by experiments and human evaluations.