Code Summarization with Structure-induced Transformer

Code summarization (CS) is a promising area of language understanding that aims to automatically generate sensible natural-language descriptions for source code, offering great convenience to software developers. It is well known that programming languages are highly structured. Thus previous works attempt to apply structure-based traversal (SBT) or non-sequential models like Tree-LSTM and graph neural networks (GNN) to learn the structural semantics of programs. However, it is surprising that incorporating SBT into an advanced encoder like the Transformer instead of an LSTM yields no performance gain, which leaves GNNs as the only remaining means of modeling such necessary structural clues in source code. To relieve this inconvenience, we propose the structure-induced Transformer, which encodes sequential code inputs together with multi-view structural clues through a newly-proposed structure-induced self-attention mechanism. Extensive experiments show that the proposed structure-induced Transformer achieves new state-of-the-art results on benchmark datasets.


Introduction
By 2020, software development and maintenance have become an indispensable part of human work and life. Various assistive techniques have been developed to make software development more enjoyable, and code summarization, which automatically generates sensible natural-language annotations for code, is especially welcomed by programmers. In the early days, code summarization was treated as a derivative of information retrieval (Haiduc et al., 2010; Eddy et al., 2013; Wong et al., 2013, 2015): the most similar code snippets already labeled with summaries were retrieved and reused. Such methods lack generalization and perform unsatisfactorily. In recent years, researchers have therefore treated code summarization as a language generation task (Iyer et al., 2016; Liang and Zhu, 2018), usually relying on RNN-based Seq2Seq models (Cho et al., 2014; Bahdanau et al., 2015).
It is already known that RNN-based models encounter a bottleneck when modeling long sequences because of their limited long-term dependency. For instance, a normal Java snippet, such as the one shown in Table 1, usually has hundreds of tokens. More recently, Ahmad et al. (2020) used an enhanced Transformer-based model to capture long-range and non-sequential information of source code, outperforming previous RNN-based models by a large margin.
On the other hand, in light of the structural nature of programming languages, structural clues are expected to greatly benefit programming-language processing tasks such as code summarization (Fernandes et al., 2019). Indeed, substantial empirical studies have shown that the Abstract Syntax Tree (AST) may help models better comprehend code snippets and produce more sensible generation results. Previous approaches can be divided into two categories. The first employs non-sequential encoders (e.g., TBCNN (Mou et al., 2016), Tree-LSTM (Shido et al., 2019), Tree-Transformer (Harer et al., 2019), and graph neural networks (Allamanis et al., 2018; Alex et al., 2020; Wang et al., 2021)) to directly model structural inputs. The other pre-processes structural inputs so that sequential models can be applied to them. Uri et al. (2019) used an LSTM to encode code structure by sampling possible paths of the AST. Another similar work is structure-based traversal (SBT) (Hu et al., 2018a), which flattens ASTs into linear sequences.

Though existing studies achieve more or less success on code summarization, there is still room for improving both of the above modeling approaches. RNN encoders like LSTM have only limited capability in capturing long-range dependencies, and GNN-like models may be too sensitive to local information, which suggests a natural solution: what if we incorporate SBT into the Transformer? Surprisingly, SBT works effectively with LSTM but not with the Transformer according to Ahmad et al. (2020). We attribute this to the inconsistency between the linear form of SBT and the nonlinear form of the encoder. SBT enables sequential encoders to learn non-sequential relationships (such as syntax) in an elaborate linear form. An RNN can be effectively enhanced by SBT precisely because of its sequential architecture combined with the attention mechanism. The Transformer, in contrast, learns features through a self-attention network (SAN), which acts more like a non-sequential process. Consequently, such sequential features are unsuitable for a non-sequential architecture to extract implicit structural information from. We boldly call this the Feature-Model Match problem in Table 2.

In this paper, we thus design an improved Transformer variant, the structure-induced Transformer (SiT), to alleviate this difficulty through a structure-induced self-attention mechanism, so that the resulting model enjoys both merits: capturing long-range dependencies and more global information. The proposed model has been applied to benchmark datasets and achieves new state-of-the-art performance.

Figure 1: Use of the adjacency matrix to transform original self-attention (left-hand side, a complete graph) into structure-induced self-attention (right-hand side, a clear-cut graph). Note that we omit self-loops for concision.

Structure-based Code Summarization
The following sections present our code summarization method in two parts: the first concerns the structure representation of code, and the second introduces our proposed structure-induced Transformer.

Structure Representation of Code
Note that programming languages are subtle in that slightly different formats may result in different compilations. Thus pre-processing can have a great impact on code summarization. We adopt the Abstract Syntax Tree (AST) to represent the grammar of source code as usual. Figure 2 depicts a typical AST, which is composed of terminal and non-terminal nodes. A non-terminal node represents a certain construct such as If or BinaryOp, while terminal nodes represent its semantic components, such as identifiers and numbers.
In model implementation, we adopt an adjacency matrix A to represent the AST instead of the structure-based traversal method of Hu et al. (2018a), which represents the tree structure in a sequential format. This choice is well compatible with the Transformer, which calculates attention weights by the dot product of key-query pairs and results in an l × l attention matrix. We let l equal the number of AST nodes; code summarization with the Transformer then becomes possible through a position-wise multiplication of A and the original attention matrix. Inspired by the Code Property Graph (CPG) (Yamaguchi et al., 2014), we further expand the AST into a multi-view network (MVN, or multi-view graph) (Sindhwani et al., 2005; Zhou and Burges, 2007; Kumar et al., 2011). An MVN is composed of multiple views, each view corresponding to one type of structural relationship, while all views share the same set of vertices (Shi et al., 2018). In this paper, we construct a three-view graph based on different code semantics: abstract syntax, control flow and data dependency. We show an example in Figure 2, where colorful strokes describe different compositions of the graph. Note that we only utilize terminal nodes, which are marked as rounded rectangles.

Specifically, we first generate an AST, on the basis of which we add edges that further represent the flow of control and data. For control flow, since the Transformer is already order-sensitive thanks to position encoding, we only need to focus on each statement node. For instance, the nodes b, =, a, +, 1 make up a complete statement b=a+1; we connect them since they share the same execution order. For data dependencies, we connect related data across the whole program, for example the variable b in the expression print(b) and in the assignment b=a+1, where the former is loaded from the definition in the latter. We thereby obtain three adjacency matrices for syntax, flow and dependency, colored red, yellow and blue in Figure 2, and combine them into a multi-view graph. Additionally, we add global attention on the root node, which is allowed to attend to all tokens in the code, and all tokens in the code can attend to it. With this aggregated structure, our structure-based code summarization is expected to capture various program semantics.
Note that our multi-view graph differs from the CPG, which is designed for C/C++ only; we did not find an appropriate analysis platform for other languages.
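To make the construction concrete, the following sketch (Python/PyTorch) builds the three view-specific adjacency matrices from edge lists and combines them with per-view weights, adding self-loops and global root attention. The helper names, the root index and the toy token indexing are illustrative assumptions rather than the actual pre-processing script.

import torch

def view_adjacency(edges, num_nodes):
    # One view: a symmetric 0/1 matrix over the shared set of code tokens.
    A = torch.zeros(num_nodes, num_nodes)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def multi_view_adjacency(ast_edges, flow_edges, dep_edges, num_nodes,
                         alpha=1.0, beta=1.0, gamma=1.0):
    A_ast = view_adjacency(ast_edges, num_nodes)    # abstract syntax
    A_flow = view_adjacency(flow_edges, num_nodes)  # control flow (same-statement tokens)
    A_dep = view_adjacency(dep_edges, num_nodes)    # data dependency (def-use links)
    A_mv = alpha * A_ast + beta * A_flow + gamma * A_dep
    A_mv = A_mv + torch.eye(num_nodes)              # keep self-attention for every token
    A_mv[0, :] = 1.0                                # global attention on the root node
    A_mv[:, 0] = 1.0                                # (root at index 0 is an illustrative choice)
    return A_mv

# Toy example: "b = a + 1" followed by "print(b)"
# token indices: 0:<root> 1:b 2:= 3:a 4:+ 5:1 6:print 7:( 8:b 9:)
flow = [(1, 2), (2, 3), (3, 4), (4, 5)]  # tokens of one statement share execution order
dep = [(1, 8)]                           # b is defined at index 1 and loaded at index 8
A_mv = multi_view_adjacency([], flow, dep, num_nodes=10)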

Structure-induced Transformer
Following appropriate structure representation and graph construction, we now propose our structure-induced Transformer (SiT) for code summarization, a structure-sensitive Transformer model (Zhang et al., 2020b; Narayan et al., 2020) that is able to comprehend code snippets both semantically and syntactically. Meanwhile, we introduce no extra parameters in SiT so as to guarantee training efficiency. In this section, we first review the self-attention network (SAN) of the Transformer in terms of its attention graph, and then propose structure-induced self-attention to build the structure-induced Transformer.
Vanilla Self-Attention The Transformer is composed of stacks of identical layers for both the encoder and decoder (Vaswani et al., 2017). Each layer centers on the self-attention mechanism, which is denoted as:

\mathrm{SAN}(X) = \mathrm{softmax}(QK^{\top} / \sqrt{d_k})\,V, \quad Q = XW^Q,\; K = XW^K,\; V = XW^V, \qquad (1)

where X = (x_1, ..., x_l) denotes the input sequence of sub-words, l denotes the sequence length and d_k denotes the hidden size per head. If we view each sub-word as a vertex n and the inner product of each key-query pair as a directed edge e, the SAN can be described as a directed cyclic graph, and Equation 1 can be rewritten as:

E = QK^{\top} / \sqrt{d_k}, \quad \mathrm{SAN}(X) = \mathrm{softmax}(E)\,N, \qquad (2)

where the attention scores E = {e_ij} form a weight matrix of edges in which e_ij represents how strongly node n_i attends to node n_j, and the value matrix N = {n_i} holds the node representations. Note that SAN actually generates a fully connected cyclic graph without consideration of the structure-aware representation needed for our concerned task.
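As a point of reference for the structure-induced variant introduced next, a minimal single-head realization of Equations 1-2 (ignoring multi-head projection and position encoding) might look as follows.

import torch
import torch.nn.functional as F

def san(X, W_q, W_k, W_v):
    # Vanilla scaled dot-product self-attention (Equation 1).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    E = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5   # edge weights e_ij (Equation 2)
    return F.softmax(E, dim=-1) @ V                   # every node attends to every node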

Structure-induced Self-Attention
To represent the needed structural information, we propose the structure-induced self-attention network (Si-SAN).
Specifically, we introduce the multi-view network into Equation 1, that is, we multiply the key-query scores position-wise by the adjacency matrix:

\mathrm{Si\text{-}SAN}(X) = \mathrm{softmax}(A_{mv} \odot QK^{\top} / \sqrt{d_k})\,V, \qquad (3)

where A_mv refers to the multi-view representation of the code and ⊙ denotes position-wise multiplication.
Note that Si-SAN does not change the input code but incorporates code structure into SAN by changing its attention pattern. As shown in Figure 1, when a_ij = 0 in A_mv, the attention between n_i and n_j is dropped (Wu et al., 2021). We consequently obtain a more explicit attention graph. Different from the original SAN, which computes global information over the whole sequence, Si-SAN is expected to capture structural information more precisely.
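A minimal single-head sketch of Si-SAN in PyTorch is shown below, ignoring relative position encoding and multi-head projection for brevity. The element-wise product with A_mv follows Equation 3; masking the zeroed entries with a large negative constant before the softmax is our implementation choice to realize the "dropped" connections and is not prescribed by the equation itself.

import torch
import torch.nn.functional as F

def si_san(X, A_mv, W_q, W_k, W_v):
    # X:    (l, d_model) token representations
    # A_mv: (l, l) multi-view adjacency matrix; a_ij = 0 means "no structural link"
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # ordinary scaled dot-product scores
    scores = scores * A_mv                          # position-wise product with A_mv (Eq. 3)
    # Entries with a_ij = 0 get a large negative score so they vanish after the softmax.
    scores = scores.masked_fill(A_mv == 0, -1e9)
    return F.softmax(scores, dim=-1) @ V

# Usage with l = 10 tokens and hidden size 512
l, d = 10, 512
X = torch.randn(l, d)
A_mv = (torch.rand(l, l) > 0.7).float() + torch.eye(l)  # stand-in multi-view adjacency
W_q, W_k, W_v = (torch.randn(d, d) * 0.02 for _ in range(3))
out = si_san(X, A_mv, W_q, W_k, W_v)                    # (l, d)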

Structure-induced Module
To enhance robustness and avoid over-pruning, we introduce the structure-induced module, a stack of two layers: SAN followed by Si-SAN, whose output combines both layers. Specifically, given an input sequence X = (x_1, ..., x_l), where l denotes the sequence length, we first pass it through a SAN layer to obtain the hidden representation H = (h_1, ..., h_l):

H = \mathrm{Concat}(\mathrm{SAN}_1(X), \dots, \mathrm{SAN}_h(X))\,W^O,

where h refers to the number of heads of multi-head attention and SAN_i refers to the self-attention of the i-th head. H is then fed into a Si-SAN layer, and the module output aggregates both representations:

O = H + \mathrm{Si\text{-}SAN}(H),

where the aggregation we use is a simple position-wise sum. We find that the structure-induced module is more robust and leads to better performance. In each stack, the model begins by learning global information with SAN, where all connections are available; subsequently, through Si-SAN, the model is told which connections are useful and which should be shut down, thus avoiding over-pruning. Note that SiT with 3 stacked structure-induced modules still consists of 6 encoder layers and 6 decoder layers; it only changes how the layers of the Transformer are arranged and introduces no extra parameters. Figure 3 depicts the overall architecture of SiT. Compared to the original Transformer, our SiT with Si-SAN encodes a more accurate relational representation of code by pruning redundant connections.

Table 3: BLEU, ROUGE-L and METEOR for our approach compared with other baselines. † refers to pre-trained models and * refers to models we rerun. The results of the upper part are directly taken from Ahmad et al. (2020). Note that we only rerun Transformer and CodeBERT since they are much stronger than the other baselines; our results are even stronger. We report the ranges compared to the Transformer in Ahmad et al. (2020).
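Under the same assumptions as before, the structure-induced module can be sketched as a plain multi-head SAN layer followed by a Si-SAN layer, aggregated by a position-wise sum. Residual connections, layer normalization and the feed-forward sub-layer of a full Transformer layer are omitted, and the exact wiring in the actual implementation may differ from this sketch.

import torch.nn as nn
import torch.nn.functional as F

class SiSAN(nn.Module):
    # Single-head structure-induced self-attention (see the earlier sketch).
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, X, A_mv):
        Q, K, V = self.q(X), self.k(X), self.v(X)
        scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5
        scores = (scores * A_mv).masked_fill(A_mv == 0, -1e9)
        return F.softmax(scores, dim=-1) @ V

class StructureInducedModule(nn.Module):
    # SAN followed by Si-SAN; the outputs are aggregated by a position-wise sum.
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.san = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.si_san = SiSAN(d_model)

    def forward(self, X, A_mv):
        H, _ = self.san(X, X, X)   # global view: every connection is available
        S = self.si_san(H, A_mv)   # structural view: prune connections absent from A_mv
        return H + S               # position-wise sum aggregation

# A SiT encoder stacks three such modules (6 layers in total), e.g.
# encoder = nn.ModuleList(StructureInducedModule(512, 8) for _ in range(3))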

SiT-based Code Summarization
Based on our structure-induced Transformer (SiT), we now specify the full code summarization process.
We first transform the input code into adjacency matrices of multiple views and combine them through a weighted sum:

A_{mv} = \alpha A_{ast} + \beta A_{flow} + \gamma A_{dep},

where α, β and γ refer to the corresponding weights of the views. We then pass the code sequence and the resulting adjacency matrix into the SiT encoder, which contains 3 Si-SAN layers. For the decoder, we apply the original Transformer decoder with cross-attention. Finally, the summary of the input code is generated through autoregressive decoding.

Datasets and Pre-processing
Datasets Our experiments are conducted on two benchmarks of Java (Hu et al., 2018a) and Python (Wan et al., 2018), and for both we follow their training, test and development divisions.
Graph Construction For Java code, we refer to the method provided in Hu et al. (2018a), which uses the javalang module of Python to compile Java and fetch the AST in dictionary form. For Python code, we generate trees ourselves based on the ast and asttokens modules. Finally, we write a script to resolve ASTs into multi-view adjacency matrices, where we let α = β = γ = 1 in all experiments.
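For the Python side, the following is a minimal illustration of how syntax-view (parent-child) edges can be collected with the built-in ast module. The actual script additionally uses asttokens for token alignment and keeps only terminal nodes, so the helper below is a simplified sketch rather than the pre-processing code itself.

import ast

def ast_edges(source):
    # Collect parent-child edges of the abstract syntax tree of a Python snippet.
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node, parent_id=None):
        node_id = len(nodes)
        nodes.append(type(node).__name__)   # e.g. Module, Assign, BinOp, Name, ...
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(tree)
    return nodes, edges

nodes, edges = ast_edges("b = a + 1\nprint(b)")
# nodes[0] == 'Module'; edges connect every node to its syntactic parent
# and can serve as the syntax view of the multi-view adjacency builder.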
Out-Of-Vocabulary A code corpus may have a much larger vocabulary than natural language, including a vast number of operators and identifiers, so keeping the vocabulary at a regular size would require introducing many out-of-vocabulary (OOV) tokens (usually replaced by UNK) (Hu et al., 2018a). To avoid the OOV problem, we apply CamelCase and snake_case tokenizers (Ahmad et al., 2020) to reduce the code vocabulary and remove all extra nodes that do not correspond to specific tokens.
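As a concrete illustration of this vocabulary reduction, a simple splitter along snake_case and CamelCase boundaries may look as follows; the tokenizers actually used follow Ahmad et al. (2020) and may differ in detail.

import re

def split_identifier(token):
    # Split a code identifier on snake_case and CamelCase boundaries.
    sub_tokens = []
    for part in token.split("_"):                       # snake_case
        # CamelCase / PascalCase: runs of capitals, capitalized words, lowercase runs, digits
        sub_tokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
    return [t.lower() for t in sub_tokens if t]

print(split_identifier("getHTTPResponse_code"))  # ['get', 'http', 'response', 'code']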

Baselines
We take all three categories of state-of-the-art models as our baselines for comparison.
Transformer We refer to the enhanced Transformer of Ahmad et al. (2020), which is equipped with copy attention (See et al., 2017) and relative position encoding (RPE) (Shaw et al., 2018). For a fair comparison, we run their model on our machine under the same environment as SiT. Note that we also utilize RPE in SiT because of its better capability in handling long sequences, while we do not use copy attention.

Training Details
We train our model on a single NVIDIA Titan RTX with batch size in {32, 64}. The learning rate is in {3e-5, 5e-5} with a warm-up rate of 0.06 and L2 weight decay of 0.01. The maximum number of epochs is set to 150 for Transformer and 30 for CodeBERT. For validation we simply use greedy search, while for evaluation we use beam search with beam size in {4, 5, 8} and choose the best result.

Main Results
Scores Table 3 shows the overall results on the Java and Python benchmarks. The Transformer baseline is strong, outperforming all previous works by a significant margin. Our model is more powerful still, boosting the Transformer by more than 1 BLEU point on both Java and Python and achieving new state-of-the-art results. Specifically, SiT achieves larger gains on Python, increasing BLEU, ROUGE-L and METEOR by 1.59, 1.62 and 1.34 points respectively. According to the dataset statistics, Python contains 5 times more unique code tokens than Java, which makes it much more challenging, so the superiority of SiT on Python is notable. Even so, SiT still boosts the Transformer on Java by 1.18, 0.82 and 1.15 points on BLEU, ROUGE-L and METEOR respectively.
Convergence Moreover, Figure 4 shows the trend of BLEU scores on the development set over training steps. SiT converges much faster than the Transformer. On the Python dataset, for instance, SiT reaches the best performance of the Transformer in about 100 epochs, while the latter needs 50 more epochs to reach its optimum. Note that the per-epoch running time of both models is the same. Such a fast convergence rate further showcases the necessity of Si-SAN.
Pre-training On the other hand, CodeBERT also achieves competitive results on both Java and Python. However, SiT is still stronger on most metrics, outperforming CodeBERT by 2.15, 0.95 and 1.15 points on BLEU, ROUGE-L and METEOR respectively on Java. CodeBERT performs much better on Python, where it outperforms SiT by 1.00 and 0.58 points on ROUGE-L and METEOR. Note that CodeBERT is much bigger in size than Transformer and SiT (see Appendix A). For further verification, we follow CodeBERT and build a RoBERTa-based (Liu et al., 2019) SiT, which we further fine-tune on both Java and Python. As shown in Table 3, the pre-trained SiT obtains attractive results, further improving over CodeBERT on all metrics, which implies that our elaborate encoder design remains effective even with powerful pre-training assistance.

Ablation Study and Analysis
This section reports ablation studies to validate our model on the Python-V2 dataset (Barone and Sennrich, 2017), for which we conduct standard and unified pre-processing for a strictly fair comparison.

Si-SAN vs. SAN
To validate the effectiveness of Si-SAN, we gradually replace SAN layers in the original Transformer with Si-SAN. Taking the Transformer with a Si-SAN proportion of 50% as an instance, we replace the second, fourth and last encoder layers with Si-SAN and do not apply the structure-induced module.
The results of variant models with increasing proportions of Si-SAN layers are shown in Table 4. All of the Transformers obtain improvements when equipped with Si-SAN layers. We can also see that SiT outperforms the Transformer with a similar proportion of Si-SAN, which demonstrates the effectiveness of the structure-induced module. Interestingly, the Transformer with all 6 layers replaced by Si-SAN still outperforms the original Transformer, even though it may be over-pruned.

Si-SAN vs. Sparse SAN
To further validate our structure-based approach, we compare structure-induced attention with other sparse attention patterns: window attention as in Longformer and ETC (Beltagy et al., 2020), and random attention as in BigBird (Zaheer et al., 2020). We depict the different attention patterns in Figure 5. The default sequence length in SiT is 400, and we set both w and r to 64 for window and random attention respectively.
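For reference, these baselines differ from Si-SAN only in how the 0/1 attention mask is generated. The sketch below produces the three patterns, taking w as the number of neighbours per side and r as the number of randomly attended tokens per row; both choices are our reading of the hyper-parameters above, not definitions from the cited papers.

import torch

def window_mask(l, w):
    # Longformer/ETC-style sliding window: each token attends to w neighbours per side.
    idx = torch.arange(l)
    return ((idx[None, :] - idx[:, None]).abs() <= w).float()

def random_mask(l, r, seed=0):
    # BigBird-style random attention: each token attends to r uniformly sampled tokens.
    g = torch.Generator().manual_seed(seed)
    mask = torch.eye(l)
    for i in range(l):
        mask[i, torch.randperm(l, generator=g)[:r]] = 1.0
    return mask

def structure_mask(A_mv):
    # Si-SAN: the pattern is whatever the multi-view graph provides.
    return (A_mv > 0).float()

l, w, r = 400, 64, 64
print(window_mask(l, w).sum(-1).mean(), random_mask(l, r).sum(-1).mean())  # avg. links per token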
As shown in

Si-SAN vs. SBT
We reproduce the SBT method on Java (Hu et al., 2018a) and apply it to our Transformer. For a fair comparison, we set β = γ = 0 and run a single-view SiT which only leverages AST information. As depicted in Figure 6, flattening ASTs into linear sequences does not yield improvement, which is consistent with Ahmad et al. (2020). However, we achieve a substantial improvement when incorporating the AST into the Transformer with Si-SAN, which indicates that our improved model design is indeed effective. In addition, the average length of the input code becomes much longer with SBT, which introduces additional training cost; as shown in Figure 6, SiT is 1.5 times faster than the Transformer with SBT.

Large Model
It is known that for nearly all deep models, increasing model size may cover much of the gain from structural design improvements. Thus, it is possible that an improvement on a base-size model would not hold for a large-size one. To validate this, we compare SiT with the Transformer at a larger scale. As shown in Table 7, with increasing parameter scale, SiT with 12 heads and with 16 heads outperforms the corresponding Transformer by 0.42 and 0.54 BLEU points respectively.

Parameter Sharing
Recently, parameter sharing on BERT (Devlin et al., 2019) has achieved promising results (Lan et al., 2020). Similar to ALBERT, we introduce cross-layer parameter sharing in both the Transformer and SiT, sharing all parameters across all encoder layers. Note that we train the models from scratch and leave the decoder unchanged.
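Cross-layer sharing in this setting amounts to instantiating a single encoder module and applying it at every depth, as in ALBERT. A sketch under the same assumptions as before (it reuses the hypothetical StructureInducedModule defined earlier) is shown below.

import torch.nn as nn

class SharedSiTEncoder(nn.Module):
    # ALBERT-style cross-layer sharing: one parameter group applied n_layers times.
    def __init__(self, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.layer = StructureInducedModule(d_model, n_heads)  # single shared module
        self.n_layers = n_layers

    def forward(self, X, A_mv):
        for _ in range(self.n_layers):   # reuse the same weights at every depth
            X = self.layer(X, A_mv)
        return X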
As shown in Table 7, SiT handles parameter sharing much better than the Transformer does. We believe that code summarization highly depends on structural information, which is why SiT can still achieve good results with a single group of encoder parameters while the Transformer suffers a serious decline. This also opens the possibility of lite models that balance efficiency and performance.

Related Work
RNN-based Approaches While a number of works on code summarization (Haiduc et al., 2010; Eddy et al., 2013; Wong et al., 2013, 2015; Zhang et al., 2020a) depended on information retrieval, most recent works treat it as a machine translation problem, with attention mechanisms broadly used to better capture long-range features. Allamanis et al. (2016) proposed a Convolutional Neural Network (CNN) with copy attention, and, more commonly, Iyer et al. (2016) and Liang and Zhu (2018) proposed to use Recurrent Neural Networks (RNN) with attention to summarize code snippets into natural language. Hu et al. (2018b) introduced API knowledge from related tasks, while Cai et al. (2020) introduced type information to assist training, also with promising results. Additionally, reinforcement learning (Wan et al., 2018) and dual learning (Wei et al., 2019; Ye et al., 2020) have been shown effective in boosting model performance.
Transformer-based Approaches It is known that RNN-based models may encounter a bottleneck when modeling long code sequences. Ahmad et al. (2020) proposed an enhanced Transformer with copy attention and relative position encoding, while Gupta (2020) and Dowdell and Zhang (2020) also proposed to use the Transformer (Vaswani et al., 2017). Pre-training approaches for our concerned task are worth further exploration and practice.

Conclusion
This paper presents a novel structure-induced Transformer model for the code summarization task. With a well-designed architecture, the proposed model effectively incorporates multi-view structure into the attention mechanism without tricky implementation. We further adopt a new module architecture to aggregate both global self-attention and structure-induced self-attention representations. Experiments on two challenging benchmarks, Java and Python, show that the proposed model yields new state-of-the-art results.