CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure

Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied. However, these methods fail to consider the inherent characteristics of code. In this paper, to address the problem, we propose a novel probing method, CAT-probing, to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers to filter those tokens whose attention scores are too small. After that, we define a new metric, CAT-score, to measure the commonality between the token-level attention scores generated in CodePTMs and the pair-wise distances between corresponding AST nodes. The higher the CAT-score, the stronger the ability of CodePTMs to capture code structure. We conduct extensive experiments to integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our code and data are publicly available at https://github.com/nchen909/CodeAttention.


Introduction
In the era of "Big Code" (Allamanis et al., 2018), programming platforms such as GitHub and Stack Overflow have generated massive amounts of open-source code data. Under the assumption of "Software Naturalness" (Hindle et al., 2016), pre-trained models (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019) have been applied in the domain of code intelligence.
Existing code pre-trained models (CodePTMs) can be mainly divided into two categories: structure-free methods (Feng et al., 2020; Svyatkovskiy et al., 2020) and structure-based methods (Wang et al., 2021b; Niu et al., 2022b). The former only utilize the information from raw code texts, while the latter employ code structures, such as data flow (Guo et al., 2021) and flattened abstract syntax trees (ASTs) (Guo et al., 2022), to enhance the performance of pre-trained models. For more details, readers can refer to Niu et al. (2022a).
Recently, several works have used probing techniques (Clark et al., 2019a; Vig and Belinkov, 2019; Zhang et al., 2021) to investigate what CodePTMs learn. For example, Karmakar and Robbes (2021) first probe into CodePTMs and construct four probing tasks to explain them. Troshin and Chirkova (2022) also define a series of novel diagnostic probing tasks about code syntactic structure. Further, Wan et al. (2022) conduct qualitative structural analyses to evaluate how CodePTMs interpret code structure. Despite their success, all these methods lack a quantitative characterization of how well CodePTMs learn from code structure. Therefore, a research question arises: Can we develop a new probing method to quantitatively evaluate how CodePTMs attend code structure?
In this paper, we propose a metric-based probing method, namely CAT-probing, to quantitatively evaluate how CodePTMs' Attention scores relate to distances between AST nodes. First, to denoise the input code sequence in the original attention score matrix, we classify the rows/cols by token types that are pre-defined by compilers, and then retain tokens whose types have the highest proportion scores to derive a filtered attention matrix (see Figure 1(b)). Meanwhile, inspired by prior works (Wang et al., 2020; Zhu et al., 2022), we add edges to improve the connectivity of the AST and calculate the distances between nodes corresponding to the selected tokens, which generates a distance matrix as shown in Figure 1(c). After that, we define CAT-score to measure the matching degree between the filtered attention matrix and the distance matrix. Specifically, the point-wise elements of the two matrices are matched if both of the following conditions are satisfied: 1) the attention score is larger than a threshold; 2) the distance value is smaller than a threshold. If only one condition is satisfied, the elements are unmatched. We calculate the CAT-score as the ratio of the number of matched elements to the sum of matched and unmatched elements. Finally, the CAT-score is used to interpret how CodePTMs attend code structure, where a higher score indicates that the model has learned more structural information.

Figure 1: Visualization of the U-AST structure, the attention matrix generated in the last layer of CodeBERT (Feng et al., 2020), and the distance matrix. (a) A Python code snippet with its corresponding U-AST. (b) Heatmaps of the averaged attention weights after attention matrix filtering. (c) Heatmaps of the pair-wise token distance in U-AST (filtered). In the heatmaps, the darker the color, the more salient the attention score, or the closer the nodes. In this toy example, only the token "." between "tmpbuf" and "append" is filtered. More visualization examples of filtering are given in Appendix D.
Our main contributions can be summarized as follows:
• We propose a novel metric-based probing method CAT-probing to quantitatively interpret how CodePTMs attend code structure.
• We apply CAT-probing to several representative CodePTMs and perform extensive experiments to demonstrate the effectiveness of our method (see Section 4.3).
• We draw two fascinating observations from the empirical evaluation: 1) The token types that PTMs focus on vary with programming languages and are quite different from the general perceptions of human programmers (see Section 4.2). 2) The ability of CodePTMs to capture code structure differs dramatically across layers (see Section 4.4).
Code Background

Code Basics
Each piece of code can be represented in two modalities: the source code and the code structure (AST), as shown in Figure 1(a). In this paper, we use Tree-sitter to generate ASTs, where each token in the raw code is tagged with a unique type, such as "identifier", "return" and "=". Further, following prior works (Wang et al., 2020; Zhu et al., 2022), we connect adjacent leaf nodes by adding data flow edges, which increases the connectivity of the AST. The upgraded AST is named U-AST.
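To make the construction concrete, the following is a minimal sketch, not the authors' implementation, of adding edges between adjacent leaves of a toy tree and measuring shortest-path lengths between leaves with breadth-first search. The node names, the edge list, and the helper names `build_u_ast` and `leaf_distances` are all hypothetical.

```python
from collections import deque


def build_u_ast(parent_edges, leaves):
    """Return an undirected adjacency map for a toy U-AST.

    parent_edges: (parent, child) pairs taken from the AST.
    leaves: leaf node ids in source order; adjacent leaves get an extra edge.
    """
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for parent, child in parent_edges:
        link(parent, child)
    # Extra edges between neighbouring leaves (the "upgrade" step).
    for left, right in zip(leaves, leaves[1:]):
        link(left, right)
    return adj


def leaf_distances(adj, leaves):
    """Pair-wise shortest-path lengths between leaves, via BFS from each leaf."""
    matrix = []
    for src in leaves:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        matrix.append([dist[t] for t in leaves])
    return matrix


# Hypothetical toy tree: root -> (call, args); call -> (a, b); args -> (c,)
edges = [("root", "call"), ("root", "args"),
         ("call", "a"), ("call", "b"), ("args", "c")]
leaves = ["a", "b", "c"]
D = leaf_distances(build_u_ast(edges, leaves), leaves)
```

In this toy tree, the added leaf edges shorten the path between "b" and "c" from 4 hops (through the root) to 1.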

Code Matrices
There are two types of code matrices: the attention matrix and the distance matrix. Specifically, the attention matrix denotes the attention scores generated by the Transformer-based CodePTMs, while the distance matrix captures the distances between nodes in U-AST. We transform the original subtoken-level attention matrix into the token-level attention matrix by averaging the attention scores of subtokens in a token. For the distance matrix, we use the shortest-path length to compute the distance between the leaf nodes of U-AST. Our attention matrix and distance matrix are shown in Figure 1(b) and Figure 1(c), respectively.
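The subtoken-to-token averaging can be sketched as follows. This is an illustrative reading rather than the paper's code: it assumes that each token-level entry is the mean over the whole block of subtoken rows and columns belonging to the two tokens, and the function name `token_level_attention` and the `spans` format are hypothetical.

```python
def token_level_attention(sub_attn, spans):
    """Average a subtoken-level attention matrix into a token-level one.

    sub_attn: square list-of-lists of subtoken attention scores.
    spans: for each token, the list of subtoken indices that form it.
    Assumption: entry (p, q) is the mean over all subtoken pairs (i, j)
    with i in token p and j in token q.
    """
    token_attn = []
    for rows in spans:
        out_row = []
        for cols in spans:
            block = [sub_attn[i][j] for i in rows for j in cols]
            out_row.append(sum(block) / len(block))
        token_attn.append(out_row)
    return token_attn


# Three subtokens; the first two belong to one token, the third to another.
sub_attn = [
    [0.2, 0.4, 0.4],
    [0.6, 0.2, 0.2],
    [0.1, 0.3, 0.6],
]
spans = [[0, 1], [2]]
A = token_level_attention(sub_attn, spans)  # 2 x 2 token-level matrix
```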

CAT-probing

Code Matrices Filtering
As pointed out by Zhou et al. (2021), the attention scores in the attention matrix follow a long-tail distribution, which means that the majority of attention scores are very small. To address the problem, we propose a simple but effective algorithm based on code token types to remove the small values in the attention matrix. Due to space limitations, we give the pseudocode of the algorithm in Algorithm 1 (Appendix A). We only keep the rows/cols corresponding to frequent token types in the original attention matrix and distance matrix to generate the selected attention matrix and distance matrix.
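The row/column selection step can be illustrated with a small sketch. It assumes the set of frequent token types has already been computed (that selection is the job of Algorithm 1 and is not reproduced here). The function name `filter_by_token_type` and all numeric values are hypothetical; the tokens mirror the toy example of Figure 1, where only "." is filtered.

```python
def filter_by_token_type(matrix, token_types, frequent_types):
    """Keep only the rows/cols whose token type is in frequent_types.

    Returns the filtered square matrix and the indices that were kept.
    """
    keep = [i for i, t in enumerate(token_types) if t in frequent_types]
    filtered = [[matrix[i][j] for j in keep] for i in keep]
    return filtered, keep


tokens = ["tmpbuf", ".", "append"]
token_types = ["identifier", ".", "identifier"]
frequent_types = {"identifier"}  # assumed result of the frequent-type selection

# Illustrative values only, not taken from the paper.
attention = [[0.5, 0.1, 0.4], [0.3, 0.4, 0.3], [0.45, 0.05, 0.5]]
distance = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

# The same indices are dropped from both matrices so they stay aligned.
A, kept = filter_by_token_type(attention, token_types, frequent_types)
D, _ = filter_by_token_type(distance, token_types, frequent_types)
```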

CAT-score Calculation
After the two code matrices are filtered, we define a metric called CAT-score to measure the commonality between the filtered attention matrix A and the distance matrix D. Formally, the CAT-score is formulated as:

$$\text{CAT-score} = \frac{\sum_{C} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}_{A_{ij} > \theta_A \text{ and } D_{ij} < \theta_D}}{\sum_{C} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}_{A_{ij} > \theta_A \text{ or } D_{ij} < \theta_D}}$$

where C is the number of code samples, n is the length of A or D, 1 is the indicator function, and θ_A and θ_D denote the thresholds to filter matrices A and D, respectively. Specifically, we calculate the CAT-score of the last layer in CodePTMs. The larger the CAT-score, the stronger the ability of CodePTMs to attend code structure.
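As a sketch of this calculation, following the description of matched and unmatched elements in the Introduction: an element counts as matched when both threshold conditions hold, and it enters the denominator when at least one holds. The function name `cat_score`, the threshold values, and the toy matrices are hypothetical, and returning 0.0 for an empty denominator is an added convention.

```python
def cat_score(samples, theta_a, theta_d):
    """Ratio of matched elements to matched-plus-unmatched elements.

    samples: iterable of (A, D) pairs of equally sized square matrices,
             one pair per code sample.
    An element is matched when A[i][j] > theta_a and D[i][j] < theta_d;
    it is counted in the denominator when at least one condition holds.
    """
    matched = 0
    either = 0
    for A, D in samples:
        n = len(A)
        for i in range(n):
            for j in range(n):
                high_attn = A[i][j] > theta_a
                close = D[i][j] < theta_d
                matched += high_attn and close
                either += high_attn or close
    # Assumed convention: score 0.0 when no element satisfies either condition.
    return matched / either if either else 0.0


# One toy sample with illustrative values and thresholds.
A = [[0.5, 0.4, 0.1], [0.3, 0.5, 0.2], [0.05, 0.35, 0.6]]
D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
score = cat_score([(A, D)], theta_a=0.25, theta_d=2)
```

Here six elements satisfy both conditions and one more satisfies only the distance condition, so the toy score is 6/7.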

Experimental Setup
Task We evaluate the efficacy of CAT-probing on code summarization, which is one of the most challenging downstream tasks for code representation. This task aims to generate a natural language (NL) comment for a given code snippet, using smoothed BLEU-4 scores (Lin and Och, 2004) as the metric.
Datasets We use the code summarization dataset from CodeXGLUE (Lu et al., 2021) to evaluate the effectiveness of our method on four programming languages (PLs for short): JavaScript, Go, Python and Java. For each programming language (PL for short), we randomly sample C = 3,000 examples from the training set for probing.
Pre-trained models We select four models, including one PTM, namely RoBERTa (Liu et al., 2019), and three RoBERTa-based CodePTMs, which are CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), and UniXcoder (Guo et al., 2022). All these PTMs are composed of 12 Transformer layers with 12 attention heads. We conduct layer-wise probing on these models, where the layer attention score is defined as the average of the 12 heads' attention scores in each layer. The comparison of these models is given in Appendix B, and the details of the experimental implementation are given in Appendix C.
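The layer attention score described above is an element-wise mean over heads. A minimal sketch, using two hypothetical heads instead of 12 for brevity and a hypothetical function name `layer_attention`:

```python
def layer_attention(head_attns):
    """Element-wise average of the per-head attention matrices of one layer.

    head_attns: list of square matrices, one per attention head.
    """
    num_heads = len(head_attns)
    n = len(head_attns[0])
    return [
        [sum(head[i][j] for head in head_attns) / num_heads for j in range(n)]
        for i in range(n)
    ]


# Illustrative values only.
head_1 = [[1.0, 0.0], [0.5, 0.5]]
head_2 = [[0.0, 1.0], [0.5, 0.5]]
layer = layer_attention([head_1, head_2])
```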
In the experiments, we aim to answer the following three research questions:
• RQ1 (Frequent Token Types): What kind of language-specific frequent token types do these CodePTMs pay attention to?
• RQ2 (CAT-probing Effectiveness): Is CAT-probing an effective method to evaluate how CodePTMs attend code structure?
• RQ3 (Layer-wise CAT-score): How does the CAT-score change with layers?

Frequent Token Types
Figure 2(a)-(d) shows the language-specific frequent token types for the four PLs, respectively. From this figure, we see that: 1) Each PL has its language-specific frequent token types, and these types are quite different. For example, the Top-3 frequent token types for Java are "public", "s_literal" and "return", while those for Python are "for", "if" and ")". 2) There is a significant gap between the frequent token types that CodePTMs focus on and the general perceptions of human programmers. For instance, CodePTMs assign more attention to code tokens such as brackets. 3) The attention distribution on Python code snippets differs significantly from that on the other PLs. This is because Python has fewer token types than the other PLs; thus, the models are more likely to concentrate on a few token types.

CAT-probing Effectiveness
To verify the effectiveness of CAT-probing, we compare the CAT-scores with the models' performance on the test set (using both best-bleu and best-ppl checkpoints). The comparison among different PLs is shown in Figure 3. We find strong concordance between the CAT-score and the performance of encoder-only models, including RoBERTa, CodeBERT, and GraphCodeBERT. This demonstrates the effectiveness of our approach in bridging CodePTMs and code structure. Also, this result (GraphCodeBERT > CodeBERT > RoBERTa) suggests that for PTMs, the more code features are considered in the input and pre-training tasks, the better structural information is learned.
In addition, we observe that UniXcoder has completely different outcomes from the other three models. This is because UniXcoder utilizes three modes in the pre-training stage (encoder-only, decoder-only, and encoder-decoder), which leads to a very different distribution of learned attention and thus different results in the CAT-score.

Layer-wise CAT-score
We end this section with a study on layer-wise CAT-scores. Figure 4 gives the results of the CAT-score on all the layers of the PTMs. From these results, we observe that: 1) The CAT-score generally decreases as the layer index increases, on all the models and PLs. This is because attention scores gradually focus on some special tokens, reducing the number of matching elements. 2) The relative ordering of CAT-scores (GraphCodeBERT > CodeBERT > RoBERTa) is almost always preserved across all the layers and PLs, which indicates the effectiveness of the CAT-score in recognizing the ability of CodePTMs to capture code structure. 3) In the middle layers (4-8), all the CAT-scores change drastically, which indicates that the middle layers of CodePTMs may play an important role in transferring general structural knowledge into task-related structural knowledge. 4) In the last layers (9-11), the CAT-scores gradually converge, i.e., the models learn the task-specific structural knowledge, which explains why we use the score at the last layer in CAT-probing.

Conclusion
In this paper, we proposed a novel probing method named CAT-probing to explain how CodePTMs attend code structure.We first denoised the input code sequences based on the token types predefined by the compilers to filter those tokens whose attention scores are too small.After that, we defined a new metric CAT-score to measure the commonality between the token-level attention scores generated in CodePTMs and the pairwise distances between corresponding AST nodes.
Experiments on multiple programming languages demonstrated the effectiveness of our method.

Limitations
The major limitation of our work is that the adopted probing approaches mainly focus on encoder-only CodePTMs, which could be just one aspect of the inner workings of CodePTMs.In our future work, we will explore more models with encoder-decoder architecture, like CodeT5 (Wang et al., 2021b) and PLBART (Ahmad et al., 2021), and decoder-only networks like GPT-C (Svyatkovskiy et al., 2020).

A Frequent Token Types Filtering Algorithm
Algorithm 1 describes the procedure to generate frequent token types.

C Experimental Implementation
We keep the same hyperparameter setting for all CodePTMs. The detailed hyperparameters are given in Table 1.
Our code is implemented in PyTorch. All the experiments were conducted on a Linux server with two interconnected NVIDIA V100 GPUs.

D Case Study
In addition to the example visualized in Figure 1, we provide three more examples to show the effectiveness of the filtering strategy in Section 3.1. The visualizations are shown in Table 3.

Figure 2: Visualization of the frequent token types on four programming languages.

Figure 2(a)-(d) shows the language-specific frequent token types for the four PLs, respectively; each PL has its own language-specific frequent token types.
Table 2 gives the comparison of the PTMs used in our experiments from three perspectives: the inputs of model, the pre-training task, and the training mode.