BLOCSUM: Block Scope-based Source Code Summarization via Shared Block Representation

Code summarization, which aims to automatically generate natural language descriptions of source code, has become an essential task in software development for better program understanding. The Abstract Syntax Tree (AST), which represents the syntactic structure of the source code, helps improve the quality of code summaries when used together with the sequence of code tokens. Recent works on code summarization attempted to capture the sequential and structural information of the source code, but they paid little attention to the fact that source code consists of multiple code blocks. In this paper, we propose BLOCSUM, BLOck scope-based source Code SUMmarization via shared block representation, which utilizes block-scope information by representing various structures of the code block. We propose a shared block position embedding to effectively represent the structure of code blocks and merge both code and AST. Furthermore, we develop variant ASTs to learn rich information such as block and global dependencies of the source code. To validate our approach, we perform experiments on two real-world datasets, the Java dataset and the Python dataset. We demonstrate the effectiveness of BLOCSUM through various experiments, including ablation studies and a human evaluation.


Introduction
A description of source code is very important in software development because it helps developers better understand programs. Advances in deep learning have enabled automatic code summarization and increased software maintenance efficiency. Previous approaches to automatic source code summarization can be categorized into sequence-based, structure-based, and hybrid approaches. Sequence-based approaches generated summaries by capturing the sequential information of source code (Iyer et al., 2016; Allamanis et al., 2016; Liang and Zhu, 2018; Hu et al., 2018b; Wei et al., 2019; Ye et al., 2020; Ahmad et al., 2020). They tokenized source code into a sequence of code tokens and encoded them using seq2seq models. Meanwhile, structure-based approaches used the Abstract Syntax Tree (AST) to capture the structural information of code (Hu et al., 2018a; Fernandes et al., 2019; Shido et al., 2019; Harer et al., 2019; Zhang et al., 2019; LeClair et al., 2020; Liu et al., 2021; Lin et al., 2021; Allamanis et al., 2018; Wang and Li, 2021; Wu et al., 2021). They parsed the source code into an AST and utilized graph models such as Graph Neural Networks (GNNs). Some works flattened the AST into a pre-order traversal sequence (Alon et al., 2019, 2018; LeClair et al., 2019; Wang and Li, 2021; Choi et al., 2021). Hybrid approaches utilized both the token sequence and the AST of the code (Wan et al., 2018; Wei et al., 2020; Zhang et al., 2020; Shi et al., 2021). They processed token sequences and ASTs in parallel with independent encoders and tried to merge them in the decoder.
However, the existing approaches have some limitations. Sequence-based approaches treated the source code as a single statement, so they incorporated only the sequence information of the code without any structural information such as code blocks. Structural information is very important for understanding code because a snippet of code can be considered a hierarchy of blocks.
Structure-based approaches tried to capture such structural information, but they paid less attention to the sequential information of code. A snippet of code is a sequence, so sequential information is also important for understanding code. Another problem is that they depended only on ASTs to capture the structural information of code. However, ASTs are not well suited to capturing structural information because they are syntax trees built for grammatical purposes. Since the AST is a tree, there is only one path between every pair of nodes. Any two nodes in an AST are connected, but often by a relatively long path, which hinders capturing the structural relations of nodes. This makes it difficult to propagate structural information to distant nodes in the AST. Some structure-based approaches tried to alleviate these problems by providing additional graphs such as the Control Flow Graph (CFG) and the Program Dependence Graph (PDG), but the cost and time required to produce these graphs and integrate them into one graph are not negligible.
Hybrid approaches utilized both code and its AST. Since both the sequential and the structural information of code are necessary to understand code, these approaches showed higher performance than the previous ones. However, they failed to effectively merge the two different types of information. They simply adopted independent encoders for each of them and tried to merge them in the decoder. Because the encoders are independent, the representations they learn tend to be independent as well, which makes it hard to effectively merge the sequential and structural information of code. Since the token sequence and the AST of code are just different descriptions of the same code, they need to be encoded so that they are correlated with each other.
To address the limitations mentioned above, we exploit the fact that source code is a set of blocks, each consisting of multiple statements that serve a specific purpose. The code tokens in one code block are arranged for the same purpose. As shown in Figure 1a, the code block if { ... } (orange) consists of statements that are executed when a certain condition is true. It is therefore necessary to consider which block each code token belongs to. To better capture structural information, we need to give each token not only positional information but also block positional information when encoding.
Since ASTs do not have enough information to capture the structural information of code, we need to modify ASTs. The additional information we add is the block dependency and the global dependency between nodes. For example, as shown in Figure 1b, the node "max" in the orange block of the AST is only connected to its parent node, "Statement", but there exists an implicit block dependency: the node "max" belongs to the same orange block as the node "data" in the purple dashed line. Furthermore, there exists an implicit global dependency between nodes. For example, the node "max" in the orange box is the same variable as the nodes in the green dashed line. We need to add such information to ASTs. We also need to match blocks in the token sequence with blocks in the AST. For example, the code tokens ("if", "(", "data", ..., "}") (orange) in Figure 1a can be mapped to their corresponding nodes in the AST ("If", "data", ..., "println") (orange) in Figure 1b. Here, the block in the code and the block in the AST are the same part with an equal role, so we utilize information on which block of code corresponds to which block of the AST. Such information correlates the code with the AST and assists the effective merging of the two kinds of information.
In this paper, we propose BLOCSUM, BLOck scope-based source Code SUMmarization via shared block representation, which utilizes block-scope information of token sequences and ASTs. First, we propose the shared block position embedding for effectively representing the structure of the code block and establishing a correlation between the code and the AST encoders. Furthermore, we develop simple yet effective variants of ASTs to learn rich information such as block and global dependencies of the source code. To validate our approach, we perform experiments on the Java dataset and the Python dataset. We demonstrate the effectiveness of BLOCSUM through various experiments, including ablation studies and a human evaluation.

BLOCSUM
In this section, we present the details of our model. Figure 2 shows the overall architecture of BLOCSUM. We first introduce the shared block position embedding and the abstract syntax tree variants and then explain the architecture of BLOCSUM in detail.

Shared Block Position Embedding
We suppose that there are code tokens $c_i$ in a code snippet $C = \{c_1, c_2, ...\}$ and AST nodes $n_i$ in its AST sequence $N = \{n_1, n_2, ...\}$. We aim to predict a summary given the code tokens and the AST nodes.
Code blocks are the basic structural components of source code. Usually, the code tokens in a block are gathered for a certain purpose, so tokens in the same block need to be identified as such. To distinguish the blocks, we assign an index to each block in the order in which the blocks appear in the code. Each token then has a block position, which is the index of the block it belongs to. If a code token is in nested blocks, we choose the index of the innermost block as its block position.
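The block position assignment described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a brace-delimited language such as Java, an already tokenized snippet, blocks that open at `{` and close at `}`, and index 0 for tokens outside every block. The function name `block_positions` is hypothetical, and in this simplified version a block header such as `if ( x )` is assigned to the enclosing block rather than to the block it introduces.

```python
def block_positions(tokens):
    """Return, for each token, the index of the innermost block containing it.

    Assumptions of this sketch (not taken from the paper): blocks are
    delimited by "{" and "}", blocks are numbered 1, 2, ... in order of
    appearance, and tokens outside every block get index 0.
    """
    positions = []
    stack = [0]        # indices of the enclosing blocks, innermost last
    next_index = 1
    for tok in tokens:
        if tok == "{":
            stack.append(next_index)   # a new block starts at this brace
            next_index += 1
            positions.append(stack[-1])
        elif tok == "}":
            positions.append(stack[-1])
            if len(stack) > 1:
                stack.pop()            # return to the enclosing block
        else:
            positions.append(stack[-1])
    return positions


tokens = "void f ( ) { if ( x ) { y = 1 ; } z = 2 ; }".split()
for token, position in zip(tokens, block_positions(tokens)):
    print(token, position)
```

Tokens of the nested block receive the inner index, which matches the innermost-block rule stated in the text.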
To utilize the block information of each token, we develop a block position embedding layer. Tokens in the same block have the same block position embedding. The code token embedding for the code encoder, $E_c$, is defined as follows:

$$E_c(t) = W_c(t) + P_c(t) + B_c(t) \qquad (1)$$

for code token $t$. In Equation 1, $W_c$, $P_c$, and $B_c$ are the word, position, and block position embedding layers for the code encoder, respectively. The two position embeddings, $P_c$ and $B_c$, are learnable positional encodings. We also combine the AST nodes with block position embeddings to ensure that nodes in the same block have identical block information when node representations are learned by the AST encoder. As with the block position of the code tokens, each node is assigned a block position value. Since a block in the AST is a sub-tree of the AST, each node takes the index of the sub-tree to which it belongs. As with code tokens, nodes in the same block have the same block position embedding. The AST node embedding for the AST encoder, $E_s$, is defined as follows:

$$E_s(n) = W_s(n) + P_s(n) + B_s(n) \qquad (2)$$

for node $n$. In Equation 2, $W_s$, $P_s$, and $B_s$ are the word, position, and block position embedding layers for the AST encoder, respectively. The position of a node is defined as its position in the pre-order traversal sequence of the AST. The two position embeddings, $P_s$ and $B_s$, are also learnable.
There are two different types of inputs, code and AST, but they are just different descriptions of the same snippet. If the two encoders learn representations for code tokens and AST nodes separately, their representations tend to be independent, which makes it very hard to effectively combine the sequential and structural information.
To correlate the representations learned by the two encoders, we let the encoders share the block position embedding layer. If a code token and an AST node belong to the same block, they have the same block position embedding value. That is, we utilize additional information on which parts of the code correspond to which parts of the AST to generate better representations. When the block position embedding layer is shared, the embeddings for a token $t$ and a node $n$ are as follows:

$$E_c(t) = W_c(t) + P_c(t) + B(t) \qquad (3)$$

$$E_s(n) = W_s(n) + P_s(n) + B(n) \qquad (4)$$

where $B$ is the shared block position embedding layer. The shared block position embedding can effectively merge the information from the two encoders. It also helps the code and the AST encoders capture the structure of the source code by providing block information.
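To make the sharing concrete, the sketch below builds separate word and position tables for the two encoders and a single block position table used by both. It is only an illustration under stated assumptions: the three embeddings are combined by element-wise addition, the tables are plain Python lists with random values rather than learned parameters, and all names (`Embedding`, `embed_code_token`, `embed_ast_node`, the sizes) are hypothetical.

```python
import random


class Embedding:
    """A tiny lookup table mapping an integer index to a vector."""

    def __init__(self, num_embeddings, dim, rng):
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(num_embeddings)]

    def __call__(self, index):
        return self.table[index]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


rng = random.Random(0)
DIM, VOCAB, MAX_POS, MAX_BLOCKS = 4, 10, 8, 5   # illustrative sizes only

W_c, P_c = Embedding(VOCAB, DIM, rng), Embedding(MAX_POS, DIM, rng)  # code encoder
W_s, P_s = Embedding(VOCAB, DIM, rng), Embedding(MAX_POS, DIM, rng)  # AST encoder
B = Embedding(MAX_BLOCKS, DIM, rng)  # one block table shared by both encoders


def embed_code_token(token_id, position, block):
    # Assumed combination: word + position + shared block position.
    return add(W_c(token_id), P_c(position), B(block))


def embed_ast_node(node_id, position, block):
    # Same shared table B, so equal block indices give equal block vectors.
    return add(W_s(node_id), P_s(position), B(block))


token_vec = embed_code_token(token_id=3, position=5, block=2)
node_vec = embed_ast_node(node_id=7, position=1, block=2)
print(token_vec)
print(node_vec)
```

Because `B` is a single table, a token and a node with the same block index receive exactly the same block component, while their word and position components remain encoder-specific.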

Abstract Syntax Tree Variants
The original AST is a structure in which a node is connected only to its parent and child nodes. It contains local information, but it does not include the structural information of the entire code.
Two nodes in the same block have an implicit block dependency. There is also a global dependency: two nodes can have the same meaning even if their hop distance is very long. To utilize rich structural information such as block and global dependencies, we develop a simple yet effective method to construct variants of the AST. We define three variants of the AST: AST-original, AST-block, and AST-global.
AST-original is the original AST, which contains information on the local dependency between nodes, as shown in Figure 3a. The other variants are graphs whose nodes are the same as those of the AST but whose edges differ.
AST-block is the first variant of the AST. We obtain it by removing the edges of AST-original and adding new edges between the nodes belonging to the same block to represent the block structure information, as shown in Figure 3b. It represents information on the block dependency between nodes in the AST.
AST-global is the second variant of the AST. As shown in Figure 3c, we fully connect all the nodes in the AST. It represents the global and complete dependency between nodes in the AST. If the pre-order sequence of AST-global is learned by a graph model, the node representations in the sequence represent the context information of the AST.
The three AST variants represent local, block, and global dependencies between nodes in the AST, respectively. When they are learned jointly by the AST encoder, the node representations contain rich structural information of the AST.
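As an illustration of the three variants, the sketch below builds their adjacency matrices from a parent array and a per-node block index. It is a minimal sketch under assumptions that the text does not state: edges are undirected, every variant includes self-loops, and the input format (`parents`, `blocks`) and the function name `ast_variants` are hypothetical.

```python
def ast_variants(parents, blocks):
    """Build adjacency matrices for AST-original, AST-block, and AST-global.

    parents[i] is the index of the parent of node i (-1 for the root) and
    blocks[i] is the block index of node i. Self-loops and undirected edges
    are assumptions of this sketch.
    """
    n = len(parents)

    # AST-original: parent-child edges only (plus self-loops).
    original = [[int(i == j) for j in range(n)] for i in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            original[child][parent] = 1
            original[parent][child] = 1

    # AST-block: tree edges are dropped; nodes of the same block are connected.
    block = [[int(blocks[i] == blocks[j]) for j in range(n)] for i in range(n)]

    # AST-global: every node is connected to every node.
    global_ = [[1] * n for _ in range(n)]
    return original, block, global_


# Node 0 is the root; nodes 1-3 form one block below it; node 4 is a sibling.
parents = [-1, 0, 1, 1, 0]
blocks = [0, 1, 1, 1, 0]
adj_original, adj_block, adj_global = ast_variants(parents, blocks)
for name, adj in (("original", adj_original), ("block", adj_block), ("global", adj_global)):
    print(name, adj)
```

In this example nodes 2 and 3 are siblings, so they are not adjacent in AST-original but become adjacent in AST-block because they share a block.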

BLOCSUM Architecture
Code Encoder Our code transformer encoder consists of 6 transformer layers (Vaswani et al., 2017). Each layer of the code transformer encoder is composed of two sub-layers: multi-head self-attention (Vaswani et al., 2017) and a feed-forward network. A residual connection (He et al., 2016) and layer normalization (Ba et al., 2016) are applied to each of the two sub-layers. The transformer encoder captures the sequential and block information of the code tokens, to which the shared block position embedding is added.
AST Encoder We use Graph Attention Networks (GATs) (Velickovic et al., 2018) to learn the three AST variants defined above: AST-original, AST-block, and AST-global. Our AST encoder consists of 6 multiple-GAT encoder layers. Each layer consists of three GATs, one for each AST variant. The three GATs capture local dependencies for AST-original, block dependencies for AST-block, and global dependencies for AST-global, respectively.
For the $l$-th layer of the AST encoder, the process is performed as follows:

$$h^l_{ln} = GAT_{ln}(A_{ln}, h^{l-1}_n)$$

$$h^l_{bn} = GAT_{bn}(A_{bn}, h^{l-1}_n)$$

$$h^l_{gn} = GAT_{gn}(A_{gn}, h^{l-1}_n)$$

where $GAT_{ln}$, $GAT_{bn}$, $GAT_{gn}$ and $A_{ln}$, $A_{bn}$, $A_{gn}$ denote the GAT layers for the three variant ASTs and the adjacency matrices of AST-original, AST-block, and AST-global, respectively. In particular, the GAT for AST-global is the same as self-attention for learning the context of all nodes in the AST.
Finally, the three representations are combined, and a residual connection and layer normalization are applied. Here, $h^l_n$ is the concatenated node embedding in the $l$-th layer of the AST GAT encoder, LN denotes layer normalization, and FFN is a feed-forward network.
With the deep AST encoder layers, the node representations combine and propagate the local, block, and global information of the AST.
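The idea of running one attention module per AST variant can be sketched with adjacency-masked attention. This is a simplified stand-in, not the model described above: it uses scaled dot-product scoring without learned weights instead of the learned scoring of a GAT, it only concatenates the three outputs and omits the feed-forward network, residual connection, and layer normalization, and it assumes every adjacency matrix contains self-loops. The names `masked_attention` and `multi_view_layer` are hypothetical.

```python
import math


def masked_attention(h, adj):
    """Scaled dot-product attention in which node i attends only to nodes j
    with adj[i][j] == 1. Assumes every row of adj has at least one 1."""
    dim = len(h[0])
    out = []
    for i, query in enumerate(h):
        neighbours = [j for j in range(len(h)) if adj[i][j]]
        scores = [sum(q * k for q, k in zip(query, h[j])) / math.sqrt(dim)
                  for j in neighbours]
        top = max(scores)                       # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * h[j][d] for w, j in zip(weights, neighbours))
                    for d in range(dim)])
    return out


def multi_view_layer(h, adj_original, adj_block, adj_global):
    """Run one masked attention per AST variant and concatenate the results."""
    views = [masked_attention(h, adj)
             for adj in (adj_original, adj_block, adj_global)]
    return [local + block + glob for local, block, glob in zip(*views)]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
full = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(multi_view_layer(h, identity, identity, full))
```

With a fully connected adjacency matrix the mask removes nothing, so the module behaves as ordinary self-attention over all nodes, which mirrors the remark about AST-global.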

Summary Decoder
The summary transformer decoder consists of 6 transformer decoder layers (Vaswani et al., 2017). Code token representations are learned with sequence and block information in the code transformer encoder, and node representations are learned with the local, block, and global dependencies of the code in the AST encoder. Given the code and node representations learned by each encoder, the summary transformer decoder learns to predict the summary of the original code by fusing the code and the AST representations. The multi-head self-attention in the decoder is performed sequentially on the code representations and the node representations.
Finally, when the summary transformer decoder predicts the t-th word, the copy mechanism (See et al., 2017) is applied to directly use the code tokens and AST nodes.
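For readers unfamiliar with the copy mechanism, the sketch below shows the pointer-generator mixture of See et al. (2017) for a single source sequence: the final probability of a word is the generation probability times its vocabulary probability plus the copy probability times the attention mass on source positions holding that word. How BLOCSUM combines copying from code tokens and AST nodes is not specified in this section, so the single-source setup, the function name `copy_distribution`, and the toy numbers are assumptions.

```python
def copy_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix a vocabulary distribution with a copy distribution.

    p_gen          probability of generating from the vocabulary
    vocab_probs    dict mapping each vocabulary word to its probability
    attention      attention weights over the source positions (sum to 1)
    source_tokens  the source tokens at those positions
    """
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    for weight, token in zip(attention, source_tokens):
        # Out-of-vocabulary source tokens can still receive probability mass.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


mixed = copy_distribution(
    p_gen=0.5,
    vocab_probs={"get": 0.6, "the": 0.4},
    attention=[0.7, 0.3],
    source_tokens=["wsgi", "get"],
)
print(mixed)
```

In the toy example the word "wsgi" is absent from the vocabulary but still obtains probability through copying, which is the property that lets a summarizer reproduce identifiers from the source.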

Setup
Datasets We evaluate on two real-world benchmark datasets, the Java dataset (Hu et al., 2018b) and the Python dataset (Wan et al., 2018). The datasets are divided into 69,708/8,714/8,714 and 55,538/18,505/18,502 examples for train/valid/test, respectively. To extract the ASTs, we used the Java parser javalang for the Java dataset and the Python parser ast for the Python dataset, as used by Wan et al. (2018). Refer to Appendix A for detailed statistics of the datasets.
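As a small illustration of AST extraction on the Python side, the sketch below parses a snippet with the standard `ast` module and lists node type names in pre-order. It is only a sketch: the actual preprocessing pipeline, including which node attributes are kept, is not described here, and the function name `preorder_node_types` is hypothetical.

```python
import ast


def preorder_node_types(source):
    """Return the AST node type names of `source` in pre-order."""
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)      # visit the node first
        for child in ast.iter_child_nodes(node):  # then its children, in order
            visit(child)

    visit(ast.parse(source))
    return sequence


print(preorder_node_types("def f(x):\n    return x + 1\n"))
```

Note that `ast.walk` traverses the tree in no guaranteed pre-order, which is why the sketch recurses over `ast.iter_child_nodes` instead.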

Hyper-parameters We set the maximum lengths of code, AST, and summary to 200, 200, and 50, respectively. To train the model, we use the Adam optimizer (Kingma and Ba, 2015). We set the mini-batch size to 80. The maximum number of training epochs is 100, and we stop early if the performance does not improve for 5 epochs. Refer to Appendix C for the implementation details.
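The stopping rule above (at most 100 epochs, stop after 5 epochs without improvement) can be sketched as follows. The validation metric is not named in the text, so the sketch assumes a score where higher is better; `train_with_early_stopping` and `evaluate_epoch` are hypothetical names, and the callback stands in for one epoch of training followed by validation.

```python
def train_with_early_stopping(evaluate_epoch, max_epochs=100, patience=5):
    """Run up to `max_epochs` epochs and stop once the validation score has
    not improved for `patience` consecutive epochs.

    `evaluate_epoch(epoch)` is assumed to train for one epoch and return a
    validation score where higher is better.
    """
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score


# Toy validation curve: improves for three epochs, then stagnates.
scores = [0.1, 0.2, 0.3, 0.25, 0.25, 0.2, 0.2, 0.2, 0.9]
print(train_with_early_stopping(lambda epoch: scores[epoch - 1]))
```

In the toy curve training stops at epoch 8, so the late improvement at epoch 9 is never reached; this is the usual trade-off of a patience-based rule.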
Evaluation Metrics We use BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004) as metrics. We adopt S-BLEU, which is the average sentence-level BLEU score. Refer to Appendix B for details.

Quantitative Result
Overall Result Table 1 shows the overall performance of BLOCSUM and the baselines on the Java and Python benchmark datasets. First, BLOCSUM improves the performance by 4.47 and 2.64 BLEU, 5.68 and 4.66 METEOR, and 4.63 and 3.85 ROUGE-L on the Java and Python datasets, respectively, compared to the sequence-based approach, TransRel. In comparison with the structure-based approach, SITTransformer, BLOCSUM improves by 3.29 and 1.05 BLEU, 4.53 and 2.11 METEOR, and 3.84 and 2.23 ROUGE-L on the two datasets. The result shows that it is effective to capture the overall structure of code when the AST is utilized together with the sequence of code tokens. Moreover, BLOCSUM performs better than the hybrid approaches. The comparison with GCNTransformer shows that it is more effective to capture both the sequential and structural information of the code by considering the local, block, and global dependencies of the AST than by using a flattened AST. Also, BLOCSUM, which considers the correlation between code and node representations, performs better than Tripletpos, which uses two independent encoders for the code and the AST. Finally, we compared our approach with CodeBERT, a strong pre-trained programming language model. BLOCSUM performs significantly better than CodeBERT, which is trained on large code data. The result shows that our approach is more appropriate for code modeling than the pre-trained model in the code summarization task.

Qualitative Result
Ablation Study We perform ablation studies to validate the effectiveness of the shared block position embedding and the AST variants on the Java and Python datasets. First, we design five models for comparison to verify the shared block position embedding: 1) no block position embedding (unuse), 2) only the code block position embedding (code block emb), 3) only the AST block position embedding (ast block emb), 4) separate block position embeddings for code and AST (separate), and 5) the shared block position embedding (share).
In Table 2, code block emb and ast block emb perform better than unuse. This result shows that the block position embedding is effective in capturing the block information in each encoder. Also, when the code and AST encoders each use a separate block position embedding, performance is better than when only one block position embedding is used. Moreover, share has the best performance among all the models. Unlike separate block position embeddings, the shared block position embedding can learn the correlation between the code and AST encoders through the block-scope information. It not only effectively captures the structure of the code block but also helps connect the code and the AST. Second, to verify the effectiveness of the variant ASTs, we compare our approach with other combinations of 1) AST-original (o), 2) AST-block (b), and 3) AST-global (g).
Human Evaluation We performed a human evaluation on the Python dataset to assess the quality of the generated summaries. We randomly selected 100 code snippets and asked three people with knowledge of the Python language to evaluate the summaries. They are CS graduate students with many years of experience in Python. Following the human evaluation metrics of Choi et al. (2021), we asked them to evaluate the following three metrics: 1) Fluency (quality of the summary and grammatical correctness), 2) Relevance (selection of content consistent with the source code), and 3) Coverage (selection of the important content in the source code). We showed the evaluators pairs of summaries generated by BLOCSUM and by the fine-tuned baseline (Feng et al., 2020), and for each of the three metrics they selected one of win, tie, or loss. Table 4 shows the results of the human evaluation of the generated summaries on the Python dataset. The fluency score is lower, but the relevance and coverage scores are much higher than those of the baseline, CodeBERT. We analyzed the summaries generated by the two models and found that BLOCSUM generates summaries similar to the ground truth, reflecting the keywords of the code. CodeBERT, a pre-trained language model, can generate more fluent and grammatical summaries, but they are relatively short and very plain, with no keywords. The average number of tokens in the ground truth is 10.14, while the averages for the summaries generated by BLOCSUM and CodeBERT are 9.91 and 8.16, respectively. We think that short sentences have a grammatical advantage over long sentences. For BLOCSUM, tie is the most frequent outcome in terms of fluency, whereas win is the most frequent outcome in terms of relevance and coverage. This result means that BLOCSUM reflects the core characteristics of the code better than CodeBERT.
Comparison with the baselines Table 5 shows summary examples generated by our proposed model on the Python dataset. The example shows that BLOCSUM predicts the keywords "wsgi" and "request" by reflecting block-scope information. Although there is no dependency between the two words in the code and the original AST, BLOCSUM utilizes three different types of AST variants and jointly learns structural dependency in three aspects to improve the performance of the model. Refer to Appendix D for more summary examples on the Java and Python datasets.
Ahmad et al. (2020) proposed a Transformer model using a relative position. However, these approaches have limitations in that they did not explicitly incorporate the structural information of the source code, which is just as crucial as capturing the code semantics. Also, they did not learn the code block information because they learned the code as a sequence of tokens.
Li et al. (2021) leverage the retrieve-and-edit framework to improve the performance of code summarization. Allamanis et al. (2018), Wang and Li (2021), and Wu et al. (2021) tried to capture rich information using additional graphs such as CFG and PDG. However, these approaches considered only the structural information of the AST without considering the sequential information of the code tokens.
2020) proposed a retrieval-based approach using syntactic and semantic similarity for source code summarization. Liu et al. (2021) proposed a hybrid GNN using a retrieval augmented graph method. Wei et al. (2020) proposed a comment generation framework using the AST, similar code, and exemplars from code. Choi et al. (2023) proposed a self-attention network that adaptively learns the structural and sequential information of code. However, they tried to model the code using more code information through retrieval methods.

Conclusion
In this paper, we proposed BLOCSUM, BLOck scope-based source Code SUMmarization via shared block representation, which utilizes block-scope information by representing various structures of the code block. We designed two methods based on the fact that a code block is a fundamental structural component of source code. First, we proposed the shared block position embedding for effectively representing the structure of the code block and establishing a correlation between the code and the AST encoders. Furthermore, we developed simple yet effective AST variants to learn rich information such as block and global dependencies of the source code. Experimental results demonstrated the effectiveness of BLOCSUM and confirmed the importance of block-scope information in the code.

Limitations
In this paper, we conducted experiments on code summarization using two benchmark datasets, the Java dataset (Hu et al., 2018b) and the Python dataset (Wan et al., 2018). BLOCSUM may need to be tested for its generalizability to other programming languages. We chose two programming languages (Java and Python) that are easily parsed in order to map the block positions of the code and the AST. We believe that, since other programming languages have similar syntactic structures, BLOCSUM should be able to achieve similar performance on them as well.

Ethics Statement
This paper proposes block scope-based source code summarization via shared block representation that utilizes block-scope information by representing various structures of the code block, which is beneficial to increase the efficiency of developers.The research conducted in this paper will not cause any ethical issues or have any negative social effects.The data used is all publicly accessible and is commonly used by researchers as a benchmark for program and language generation tasks.Our proposed method does not introduce any ethical or social bias or worsen any existing bias in the data.

A Statistics of Experiment Datasets
To obtain the ASTs of the Java and Python datasets, we use the javalang and ast libraries, respectively. We also tokenize the source code and the AST into subtokens by splitting CamelCase and snake_case identifiers.
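Subtoken splitting of this kind can be sketched with a single regular expression. The exact rules used for the datasets (for example, whether subtokens are lowercased or how digits are handled) are not given here, so the pattern and the function name `split_subtokens` are assumptions of this sketch.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word
# ("HTTP" in "HTTPResponse"), a capitalized or lowercase word, a trailing
# acronym, or a run of digits. Underscores are never matched, so they act
# as separators.
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_subtokens(identifier):
    """Split a CamelCase or snake_case identifier into subtokens."""
    return _SUBTOKEN.findall(identifier)


for name in ("getMaxValue", "max_value", "parseHTTPResponse"):
    print(name, split_subtokens(name))
```

Splitting identifiers this way keeps the vocabulary small, since rare identifiers are decomposed into frequent subtokens.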

B Evaluation Metrics
BLEU (Papineni et al., 2002), the Bilingual Evaluation Understudy, measures the quality of generated code summaries. The formula for computing BLEU is as follows:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} \omega_n \log p_n\right)$$

where $p_n$ is the modified $n$-gram precision, so that the exponential term is the geometric average of the modified $n$-gram precisions, $\omega_n$ is the uniform weight $1/N$, and $BP$ is the brevity penalty. METEOR (Banerjee and Lavie, 2005) was designed so that metric scores match human judgments of translation quality more closely. Unigram precision ($P$) and unigram recall ($R$) are computed and combined via a harmonic mean. The METEOR score is computed as follows:

$$\mathrm{METEOR} = \left(1 - \gamma \cdot frag^{\beta}\right) \cdot \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R}$$

where $frag$ is the fragmentation fraction. $\alpha$, $\beta$, and $\gamma$ are three penalty parameters whose default values are 0.9, 3.0, and 0.5, respectively.
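ROUGE-L (Lin, 2004) applies the Longest Common Subsequence (LCS) to summarization evaluation. ROUGE-L uses an LCS-based F-measure to estimate the similarity between two summaries $X$ of length $m$ and $Y$ of length $n$, assuming $X$ is a reference summary sentence and $Y$ is a candidate summary sentence. The LCS-based recall and precision are:

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}$$

where $LCS(X, Y)$ is the length of a longest common subsequence of $X$ and $Y$.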
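A sentence-level ROUGE-L score can be sketched directly from the LCS-based recall and precision. The sketch follows the standard F-measure of Lin (2004) with the weight set to the ratio of precision to recall, as in this appendix; it operates on pre-tokenized sentences, returns 0 when there is no common subsequence, and does not reproduce the preprocessing of any particular evaluation package. The function names are hypothetical.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of sequences x and y."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(reference, candidate):
    """LCS-based F-measure between a reference and a candidate token list."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0                      # also covers empty inputs
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    beta = precision / recall           # weight as defined in this appendix
    return ((1 + beta ** 2) * recall * precision
            / (recall + beta ** 2 * precision))


reference = "return the maximum value".split()
candidate = "return maximum value found".split()
print(rouge_l(reference, candidate))
```

When recall and precision are equal, the weight is 1 and the score reduces to that common value.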

C Implementation Detail
We conducted experiments on Ubuntu 18.04 with four 2080 Ti GPUs. The server environment supports Python 3.9, CUDA 10.2, PyTorch 1.9, and PyTorch Geometric 1.7.

Figure 1: An example of a code snippet and its AST in the Java language. (a) The orange box denotes the code block, and the number is the index of the code block. (b) The nodes in the orange rectangular box belong to the same block in the AST.

Figure 3: We introduce three different types of AST: (a) AST-original, which is the original AST, (b) AST-block, in which edges connect nodes that are in the same code block, and (c) AST-global, in which edges connect all the nodes in the AST.
Sequence-based Approaches Iyer et al. (2016) and Allamanis et al. (2016) proposed to use Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNNs) for source code summarization. Liang and Zhu (2018) proposed a tree-based recursive neural network to represent the syntax tree of code. Hu et al. (2018b) and Chen et al. (2021) summarized the source code with API knowledge. Wei et al. (2019) used a dual training framework that trains the code summarization and code generation tasks together. Also, Ye et al. (2020) considered the probabilistic correlation between the two tasks. Choi et al. (2020) proposed attention-based keyword memory networks for code summarization. Ahmad et al. (
Structure-based Approaches Hu et al. (2018a) proposed an RNN-based model using the pre-order traversal sequence as input. Shido et al. (2019) and Harer et al. (2019) adopted Tree-LSTM and Tree-Transformer to encode tree-based inputs. LeClair et al. (2020) encoded the AST using graph neural networks and trained an LSTM. Liu et al. (2021) proposed a retrieval augmented method with a Graph Neural Network (GNN). Zhang et al. (2019) proposed an AST-based Neural Network (ASTNN) for encoding the subtree. Lin et al. (2021) proposed Tree-LSTM to represent the split AST for code summarization. Li et al. (

$$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $\beta = P_{lcs} / R_{lcs}$ and $F_{lcs}$ is the value of ROUGE-L.

Table 1: Comparison of our proposed model with the baseline models on the Java and Python datasets. We fine-tuned CodeBERT* with an input length of 200 and an output length of 50 for the two datasets.
As illustrated in Table 3, the results show that leveraging more structural information, such as block or global dependencies, performs better than modeling AST-original with only local dependency. Also, combining AST-original, AST-block, and AST-global has the best performance in comparison with the other combinations.

Table 3: Ablation study on combinations of AST variants.

Table 4: Human evaluation of the generated summaries on the Python dataset.

Table 5: A qualitative example on the Python dataset.

The average training and inference times for BLOCSUM are about 40 and 0.5 hours, respectively. BLOCSUM has about 76 million parameters. We used the same hyperparameters as the previous study. We adopted the median value among the 3 models.

Table 7: Hyper-parameters of BLOCSUM.