Benchmarking Language Models for Code Syntax Understanding

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding; they represent the input as a token sequence without explicitly modeling its structure. Prior works show that pre-trained language models can capture the syntactic rules of natural languages without finetuning on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models understand code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack an understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.


Introduction
Large-scale pre-training of language models has become the de-facto paradigm for a variety of natural language processing tasks. Furthermore, recent studies show that models pre-trained on a massive amount of code also achieve competitive performance on many tasks, e.g., code generation and code classification. These tasks are closely related to natural language (NL) tasks in their problem formulation. Nowadays, the common practice for solving these coding tasks is to utilize the language model architectures and training schemes originally designed for NL. The design principle of these neural language models is significantly different from that of classic rule-based program generation systems. Specifically, neural language models take the program as a token sequence, while classic program generation systems utilize the language grammar and code structure. Despite the advanced performance of pre-trained language models on code understanding tasks, what these models have learned from the code corpus remains unclear.

† Corresponding authors. Our code and dataset are available at https://github.com/dashends/CodeSyntax.

Figure 2: A preview of the model performance comparison on NL and PL syntax understanding tasks. Pre-trained models capture NL syntax relatively well, but perform worse in understanding PL syntax. The Offset baseline picks the token using a fixed positional offset. We use BERT-large and RoBERTa-base configurations (corresponding to the configurations of CuBERT and CodeBERT). The plot shows top-1 scores. See Tables 3 and 4 for the full results.
In this work, we investigate whether large-scale pre-training is all we need for code representation learning. In particular, we conduct the first systematic study to analyze how well pre-trained language models understand the syntactic structures of programs. To this end, we introduce CodeSyntax, a large-scale benchmark consisting of programs annotated with the syntactic relationships between different tokens. The ground truth syntactic relationships are extracted from the edges in the abstract syntax trees (AST) of the programs. Figure 1 shows some examples. These syntactic relations are functionally similar to dependency relations in NL, where prior work has demonstrated that the attention heads of pre-trained language models can help identify NL relation types (Clark et al., 2019; Raganato et al., 2018). To measure how well the pre-trained language models capture code syntactic structures, we adapt this approach to the PL domain. We focus on investigating the zero-shot capability of existing pre-training methods in our experiments, and we evaluate the pre-trained models without finetuning them on our benchmark.
We evaluate the state-of-the-art pre-trained language models for code representation learning, including CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020). A common characteristic of these models is that they share the same Transformer-based architectural design as NL models (Vaswani et al., 2017; Devlin et al., 2019). This allows us to directly compare their performance in capturing syntax structure. We present a preview of our key results in Figure 2. Our main observation is that pre-training is insufficient for learning the syntactic relations in code. First, we find that the models pre-trained on code do not always outperform models pre-trained on an NL corpus alone. Surprisingly, compared to CodeBERT, which is trained on both text and code corpora, RoBERTa achieves better performance with an identical model architecture despite not being trained on any code. This indicates that pre-training on programs as token sequences does not help learn the syntactic relations. On the contrary, even without supervision from dependency relations, pre-training still enables language models to understand NL syntax to some extent.
Moreover, for code syntax understanding, the pre-trained models even perform worse than simple baselines that pick tokens at a fixed offset. For example, always selecting the (p+2)-th token as the p-th token's dependent yields higher accuracy than any attention head for several relation types. On the other hand, the same model architectures pre-trained on text corpora achieve decent accuracy in identifying dependency relations in the NL domain, where the performance of the same simple baselines lags far behind.
Our analysis reveals several key differences between NL and PL that lead to different syntax understanding capabilities of pre-trained models. First, programs are more structured than NL sentences. Programs usually contain hierarchical structures representing long-term dependencies between code tokens. Consequently, many syntactic relation types connect distant tokens, which can be difficult for attention heads to recognize. On the contrary, the dependency relations in NL sentences mostly connect nearby token pairs, and in this case the attention heads are more capable of identifying the correct relations. Meanwhile, language models are good at recognizing keyword-based relations, such as picking the corresponding else keyword for an if token. Interestingly, we find that the inclusion of tokens such as newlines and semicolons notably affects the performance in the code domain.
Our findings suggest that existing pre-trained models perform quite differently in the PL and NL domains in terms of the ability to understand syntax. Thus, directly applying training paradigms developed for NL could be suboptimal for program learning, and we consider designing better approaches to model the code structure as future work.

CodeSyntax: Benchmarking Code Syntax Understanding
We construct the CodeSyntax benchmark to evaluate the performance of language models on code syntax understanding. We focus on Python and Java, on which the publicly released model checkpoints of both CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020) are pre-trained. We obtain the code samples from CodeSearchNet (Husain et al., 2019), a large-scale dataset consisting of code in different programming languages. Note that we can easily extend the dataset to cover more languages, since the workflow for extracting relations is automated and AST parsers are available for most popular programming languages.
We observe several characteristics of relations in CodeSyntax. First, the keywords in PL play an important role in recognizing the code structure. Specifically, some relation types have fixed keywords as the edge nodes, such as the If:if→else relation. Meanwhile, compared to the dependency relations in NL, the relation edges in the program AST tend to connect nodes that are much farther away from each other. As shown in Figure 3, the average offset between head and dependent nodes is no more than 10 for dependency relations in NL, while the average offset for a relation type can be more than 100 tokens in programs. Specifically, in CodeSyntax, there are 22 near dependency types whose average offsets are less than 10, and 12 far dependency types whose average offsets are above 10.
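For concreteness, these offset statistics can be computed in a few lines; the sketch below is ours (the function name and input format are assumptions, not part of the released benchmark):

```python
def average_offsets(edges_by_relation):
    """Average token offset between head and dependent for each relation type.

    edges_by_relation: dict mapping a relation name (e.g., "If:if->else")
    to a list of (head_index, dependent_index) token-position pairs.
    """
    return {rel: sum(t - h for h, t in edges) / len(edges)
            for rel, edges in edges_by_relation.items()}

# Relation types with an average offset below 10 are "near" dependencies;
# those above 10 are "far" dependencies, per the categorization above.
```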

Evaluation Setup
Do pre-trained language models capture the code structure without direct supervision of syntactic information? To investigate this question, we evaluate several pre-trained language models without finetuning, and compare their performance in understanding the syntax of NL and PL.
Natural language benchmark. To compare the performance on CodeSyntax against NL syntax understanding, we construct an NL benchmark that includes English and German. Specifically, we use the English News Text Treebank: Penn Treebank Revised (Bies et al., 2015) for English.

Attention probing approach. Some prior works demonstrate that a Transformer architecture (Vaswani et al., 2017) pre-trained on a text corpus, such as BERT (Devlin et al., 2019), contains attention heads that specialize in certain dependency relations in NL (Raganato et al., 2018; Clark et al., 2019). Specifically, in the Transformer architecture, the vector $e_i$ of each input token is transformed into query and key vectors $q_i$ and $k_i$ via linear transformations, and the transformations vary among different attention heads. For the i-th token, the attention weight assigned to the j-th token is

$$\alpha_{i,j} = \frac{\exp(q_i^\top k_j)}{\sum_{j'} \exp(q_i^\top k_{j'})},$$

which indicates how important the j-th token is with respect to the i-th token.
Typically, different attention heads learn different weights between input tokens. Therefore, to measure the correctness of recognizing a relation type r, for each edge <h, t, r> in the program AST, where h is the head node and t is the dependent node, we enumerate all attention heads to compute the attention weight α_{h,t}. If an attention head tends to assign high attention weights connecting the pairs of tokens belonging to the relation type r, we consider the relation type to be captured. We defer more implementation details of attention map extraction to Appendix B.
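As a rough illustration of this probing setup, per-head attention maps can be extracted from a pre-trained checkpoint with the HuggingFace transformers library; the sketch below uses the public microsoft/codebert-base checkpoint for convenience, which may differ from the exact checkpoints probed here:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base",
                                  output_attentions=True)
model.eval()

code = "if x > 0:\n    y = f(x)\nelse:\n    y = 0"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len): entry [0, h, i, j] is the
# weight alpha_{i,j} that head h assigns from token i to token j.
print(len(outputs.attentions), outputs.attentions[0].shape)
```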
Metrics. We use the unlabeled attachment score (UAS) to measure syntax understanding performance, and we consider top-k scores with different values of k. To compute the top-k scores for language models, for each attention head, given the head token h in a relation edge <h, t, r>, we compute the attention weights over all tokens in the input code, and we consider the prediction to be correct if the dependent token t is among the k tokens with the highest attention weights. For each relation, we select the best-performing attention head and use its score as the model's score for that relation. We calculate a model's average score over all relations as the final score of the model.
In NL dependency parsing problems, the dependent node t usually corresponds to a single word. However, in PL, the dependent can be a block that contains multiple code tokens. For example, in the If:if→body relation, the head is the keyword if, while the dependent is the entire body block. Therefore, we measure three metrics. First-token metric and last-token metric: the prediction is deemed correct if it successfully predicts the first or last token of the dependent block, respectively. Any-token metric: the prediction is considered correct if it predicts any token within the dependent block. While we agree that these are not perfect metrics and any single metric may be incomplete, we observe that our findings generally hold for all three metrics we evaluated. Note that the first-token metric is stricter than the any-token metric by design. Unless otherwise specified, we report the top-k scores using the first-token metric by default.
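The following sketch shows one way to compute these top-k scores under the three metrics; the function names and the (start, end) block representation are our own assumptions, not the authors' released code:

```python
import numpy as np

def topk_correct(attn_row, block, k, metric="first"):
    """Check one relation edge for one attention head.

    attn_row: attention weights from the head token over all input tokens.
    block:    (start, end) inclusive token span of the dependent node.
    """
    topk = np.argsort(-attn_row)[:k]  # k tokens with highest attention
    start, end = block
    if metric == "first":
        return start in topk
    if metric == "last":
        return end in topk
    return any(start <= j <= end for j in topk)  # any-token metric

def relation_score(attn_heads, edges, k=1, metric="first"):
    """Top-k score of the best-performing head for one relation type.

    attn_heads: array of shape (num_heads, seq_len, seq_len).
    edges:      list of (head_index, dependent_block) pairs.
    """
    scores = [np.mean([topk_correct(head[h], blk, k, metric)
                       for h, blk in edges])
              for head in attn_heads]
    return max(scores)  # the best head's score is the model's score
```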
Model architectures. Table 2 summarizes the models evaluated in this work. For language models over code, we consider CuBERT (Kanade et al., 2020) and CodeBERT (Feng et al., 2020), and we evaluate their released pre-trained checkpoints. Both of them are based on architectures initially designed for NL. Specifically, CuBERT utilizes the BERT (Devlin et al., 2019) architecture, and CodeBERT (Feng et al., 2020) utilizes the RoBERTa (Liu et al., 2019) architecture. For NL models, we also evaluate multilingual variants of BERT and RoBERTa on the German dataset, i.e., Multilingual BERT (Pires et al., 2019) and XLM-RoBERTa (Conneau et al., 2020). Both code language models are cased, so we also evaluate the cased versions of the NL models.
Baselines. To examine how well the attention performs in comparison, we design a simple offset baseline and a simple keyword baseline. The offset baseline with an offset value of i always selects the token i positions after the input token as its prediction when i > 0, and the token i positions before the input token when i < 0. The keyword baseline with a keyword key always predicts the next occurrence of the key token. In our experiments, we evaluate offset baselines with each possible offset value between 0 and 512 for PL, and -512 to 512 for NL. We use all Python and Java keywords for the keyword baselines on the Python and Java datasets respectively, including tokens such as if, for, in, etc. To evaluate the top-k scores for baselines where k ≥ 2, we combine k simple baselines with different offset (keyword) values to give k predictions. To select the k offset (keyword) values, we repeatedly and greedily include the next value that yields the highest performance increase for the relation type under consideration.
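A minimal sketch of the two baselines (the helper names are ours) might look as follows:

```python
def offset_baseline(head_idx, offset, seq_len):
    """Predict the token at a fixed positional offset from the head token,
    clamped to the valid token range."""
    return min(max(head_idx + offset, 0), seq_len - 1)

def keyword_baseline(head_idx, tokens, keyword):
    """Predict the next occurrence of `keyword` after the head token."""
    for j in range(head_idx + 1, len(tokens)):
        if tokens[j] == keyword:
            return j
    return None  # no such keyword after the head token
```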

Experiments
In this section, we present the results of pre-trained language models on both PL and NL syntax understanding tasks, and discuss the key observations that distinguish PL from NL. We present our main results comparing the performance in syntactic relation understanding on PL and NL in Tables 3 and 4, respectively. First, on CodeSyntax, language models generally perform worse than the simple offset baseline and its combination with the keyword baseline, which indicates that pre-training on code alone is insufficient for capturing syntactic relations. Meanwhile, we present the any-token results on CodeSyntax in Table 5. Although the best combined baseline still outperforms language models, the performance gap shrinks drastically. In particular, CuBERT achieves better scores than the offset baseline, and the improvement on Java is more notable. We defer the full results of different top-k scores on both PL and NL benchmarks to Appendix D. In the following sections, we discuss the key factors that affect prediction performance.

Figure 4: Top-k scores for Java syntax understanding using the last-token metric.

Main Results
To examine why the offset baseline outperforms CodeBERT and CuBERT, and why the relative performance differences shrink under the any-token metric, we conduct case studies and error analysis in Section 4.2 and Section 4.3, which categorize the error patterns both quantitatively and qualitatively.
First, we investigate the most frequently attended code tokens, and we observe that the attention heads tend to recognize the reserved tokens and keywords in PL. For example, CuBERT and CodeBERT get an improved score on Java because the semicolon token is often part of the ground truth dependent node, and it is a popular token attended to by language models. Based on this observation, we perform an ablation study on the presence of the semicolon in ground truth annotations. When the semicolon tokens are removed from ground truth dependent nodes, we also prevent the language models from attending to semicolons in the input code. Since the semicolon appears at the end of each Java statement, here we compute the last-token score, which may be significantly affected by semicolons. As shown in Figure 4, CuBERT substantially outperforms the baselines when semicolons are included in the ground truth labels. On the other hand, CuBERT reaches lower scores than the baselines when semicolons are excluded from ground truth labels and predictions. The comparison suggests that attention heads are more capable of identifying frequent keywords in the model input. We defer the full ablation study on both Python and Java to Appendix F.
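Under our reading, this ablation can be sketched as masking the attention columns at semicolon positions before ranking candidate tokens (with ground truth spans adjusted accordingly); the implementation below is an assumption, not the authors' code:

```python
import numpy as np

def mask_token_columns(attn, tokens, banned=(";",)):
    """Zero out attention toward banned tokens (e.g., semicolons) so that
    they can never be selected as predictions."""
    attn = attn.copy()
    cols = [j for j, tok in enumerate(tokens) if tok in banned]
    attn[:, cols] = 0.0
    return attn
```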
We further discuss the breakdown results with respect to relation types, and we select some representative relations for Python that highlight the performance differences between CuBERT and the offset baseline in Table 6. First, the attention is highly capable of performing keyword matching, which leads to decent accuracy on relations that connect popular keywords, such as If:if→else. However, when the head and dependent tokens are diverse, it becomes challenging for the language model to recognize the relation. For example, in the relation types Assign:target→value and Call:func→args, both head and dependent nodes can take various identifier names defined by different programmers. In particular, CuBERT cannot effectively utilize the relative positions of tokens to learn the relations, even if the dependent node is near the head node. In such situations, the offset baseline with a fixed offset value of 2 already surpasses the pre-trained model. To categorize the wrong predictions of the attention, we manually examine 50 error cases for each relation selected in Table 6, and present the error patterns in Table 7. Again, we observe that the attention often incorrectly selects frequently occurring tokens such as brackets. Moreover, the model has difficulty capturing the hierarchical code structure, so it often attends to nearby keywords regardless of logical code blocks. The full breakdown results of all relation types on both Python and Java can be found in Appendix G.
Take as an example the relation If:if→else, on which the language model generally achieves the best performance. Figure 5 shows two sample if-statements, where the first one does not contain nested control flow blocks, while the second one contains a while keyword inside the if-body ("..." denotes that some code is omitted). Visualizing the attention weights of the head that performs best on the relation If:if→else, we observe that it correctly attends to the else token in the first example, but wrongly attends to the while token inside the if-body in the second example. More examples like these can be found in Appendix E.

Related Work
Transformer-based language models have been widely used for natural language processing (Devlin et al., 2019; Liu et al., 2019; Wang et al., 2020, 2021; Shen et al., 2022; Wang et al., 2022). Hewitt and Manning (2019) show that syntax trees are implicitly embedded in BERT's word representation space via a structural probe. Another line of work studies what is learned by the attention in language models (Clark et al., 2019; Raganato et al., 2018; Voita et al., 2019; Michel et al., 2019; Vig, 2019; Burns et al., 2018; Marecek and Rosa, 2018; Voita et al., 2018). In particular, Clark et al. (2019) evaluate the attention heads of BERT on dependency parsing tasks using the English Penn Treebank corpus, where the attention significantly outperforms offset baselines. On the contrary, we demonstrate that attention-based models largely perform worse than offset baselines on code syntax understanding.
The success of Transformer-based models for natural language processing has led to their application in the PL domain (Kanade et al., 2020; Feng et al., 2020; Rozière et al., 2020, 2021; Clement et al., 2020; Dehghani et al., 2019). Chen et al. (2021) evaluate model performance by measuring functional correctness on unit tests. Chirkova and Troshin (2021) empirically show that Transformers can utilize syntactic information to make predictions in some code processing tasks, while we analyze the attention's ability to understand syntactic relations. Karmakar and Robbes (2021) probe pre-trained models on four code understanding tasks. They focus more on code classification, e.g., they train a classifier to predict the AST node tag and the code length. On the contrary, we probe the attention heads for syntactic relation understanding, and we aim to present a comprehensive study of the differences between pre-trained language models on NL and PL in capturing syntax structures.
There have been some efforts to take code structure into account during the pre-training of Transformer-based models for code. For example, GraphCodeBERT (Guo et al., 2021) utilizes data flow for pre-training, i.e., the relation of "where-the-value-comes-from" between variables. On our Python benchmark, GraphCodeBERT achieves a top-1 first-token score of 39.3, which is better than the 33.1 of CodeBERT and comparable to the 39.2 of CuBERT. However, this score is still worse than the 43.6 of the offset baseline, and the trend is consistent when evaluating with other metrics. These results show that pre-training on data flow helps improve the model's ability to understand code syntax, but there is still large room for improvement.

Conclusion
In this work, we introduce CodeSyntax, a large-scale benchmark for measuring the performance of code syntax understanding. Based on CodeSyntax, we conduct the first comprehensive study to analyze the capability of pre-trained language models to understand code syntactic structures without further finetuning. We demonstrate that while the attention heads of pre-trained language models are able to identify dependency relations in NL to some extent, they have difficulty recognizing the syntactic relations in programs. Pre-trained models even generally perform worse than simple offset baselines, and they tend to attend to frequently occurring nearby tokens without taking the hierarchical code structure into consideration.
We also analyze the differences between NL and PL from the perspectives of pre-trained models.
Our evaluation suggests that PL has unique characteristics that distinguish it from NL, such as the long-term dependencies between code tokens and the hierarchy in the syntactic structures. Therefore, simply treating a program as a token sequence is insufficient for modeling the program structure, which could ultimately limit the potential of language models for code understanding tasks. We consider developing new model architectures and pre-training algorithms that leverage and represent the code structure and dependency graph as important future work.

Limitations
Regarding the limitations of our benchmark, the gold annotations are based on AST parsers, so adding new programming languages whose parsers are unavailable would require additional labeling effort. A limitation of our experimental setup is that we have only benchmarked six models across two natural languages and two programming languages. Finally, the main focus of our study is to probe language models for code understanding. As a result, we have not proposed models that better handle code syntax in natural language and programming language applications. Future work could include developing such models to capture both semantics and structure.

Ethical Considerations
We hereby acknowledge that all co-authors of this work are aware of the provided ACM Code of Ethics and honor the code of conduct. The following outlines both our ethical considerations and our potential impact on the community. This work creates a benchmark to test the code syntax understanding of pre-trained language models, where programming languages, rather than natural language alone, are used for pre-training. We do not anticipate the production of harmful outputs from using our benchmark and existing models, especially towards vulnerable populations.

Environmental Considerations
We use several pre-trained language models. According to the estimation of Strubell et al. (2019), pre-training a model of a size similar to the ones used in this work costs 1,507 kWh·PUE of energy and emits 1,438 lbs of CO2. This work focuses on inference; therefore, our energy cost and CO2 emissions are relatively small.

A More Details on CodeSyntax Construction
Since the CodeSearchNet dataset does not come with syntactic relation labels, we devise a way to extract syntactic relations automatically. We first utilize Python's tokenize module and the javalang module to produce code tokens from source code, and then label these code tokens with syntactic relations by running AST parsers on the source code. We utilize Python's ast module (Foundation, 2021) and Java's org.eclipse.jdt.core.dom.ASTParser class (Contributors, 2014) to parse source code into AST nodes.
The AST structure captures syntactic relations. An AST node has children AST nodes and a name that denotes its class. We use the class of the node as the relation label, and the children nodes as dependents and heads when generating annotations. For example, the source code A = B, which assigns the value B to the target variable A, is parsed into an Assign node whose children correspond to the target A and the value B. The full list of relation types is shown in Table 8.
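A minimal sketch of this extraction for Python, using only the standard ast module (the mapping from AST nodes back to token positions, via lineno/col_offset, is omitted for brevity):

```python
import ast

def extract_relations(source):
    """Yield (relation, head_node, dependent_node) triples from an AST.

    Each named child field of a node induces a relation such as
    Assign:targets or If:orelse; the node class plus field name
    serves as the relation label.
    """
    for node in ast.walk(ast.parse(source)):
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    yield (f"{type(node).__name__}:{field}", node, child)

for rel, head, dep in extract_relations("a = b"):
    print(rel, type(head).__name__, "->", type(dep).__name__)
# Module:body    Module -> Assign
# Assign:targets Assign -> Name
# Assign:value   Assign -> Name
# Name:ctx       Name   -> Store   (and similar context nodes)
```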

B More Details on Attention Map Extraction for Code Language Models
Our experiments follow the work of Clark et al. (2019), who evaluate the attention heads of BERT on dependency parsing tasks on an English dataset; we extend this analysis to the PL domain. We adopt and extend some of their code, such as the functions for extracting attention from BERT and plotting attention weights. The main differences between our work and theirs are that we construct a novel dataset for PL syntax understanding tasks and design evaluation metrics that accommodate the characteristics of PL.

B.1 Model Input
Each of our code samples is an entire Python or Java function. To prepare the input to be fed to the models, we run CuBERT and CodeBERT tokenization to obtain sequences of input ids for each code sample. We insert a [CLS] token at the beginning and append a [SEP] token at the end. If the input length is longer than 512 tokens (the maximum number of tokens allowed), we discard that code sample. We never split a long code sample into several input sentences, because the span of some dependency relations within a function can be very long.
For example, for an if statement, the else block may be far away from the keyword if. If we split them into two input sentences, then the attention would not be able to capture the relation between them. To keep the evaluated samples consistent across models, we remove a code sample from both CuBERT's and CodeBERT's inputs if it is longer than 512 tokens after either CuBERT's or CodeBERT's tokenization.
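A sketch of this filtering step, assuming HuggingFace-style tokenizers (the helper name and interface are ours):

```python
def prepare_input(code, tokenizers, max_len=512):
    """Tokenize a whole function with every model's tokenizer; drop the
    sample if any tokenization exceeds the length limit, so that all
    models are evaluated on exactly the same set of samples."""
    all_ids = [tok(code, add_special_tokens=True)["input_ids"]
               for tok in tokenizers]
    if any(len(ids) > max_len for ids in all_ids):
        return None  # sample discarded for every model
    return all_ids
```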

B.2 Token Alignment And Word-level Attention
BERT uses WordPiece tokenization (Wu et al., 2016) and RoBERTa uses byte-level Byte-Pair Encoding (BPE) (Sennrich et al., 2016), both of which may split a word into several subtokens. Additionally, CuBERT imposes some special rules when producing the program vocabulary. However, our dataset's labels use code tokens generated by the tokenize module and the javalang module. Therefore, we need to align CuBERT/CodeBERT subtokens with code tokens in order to evaluate the models on our dataset. We first generate an alignment that maps each code token to a set of CuBERT/CodeBERT subtokens, and then convert the original subtoken-level attention to word-level attention. We follow Clark et al. (2019) in combining the attention weights of subtokens, i.e., we sum up their attention weights.
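A sketch of the subtoken-to-word conversion. We assume the Clark et al. (2019) convention that attention *to* a word sums over its subtokens while attention *from* a word averages over its subtokens; the text above only states that weights are summed, so the averaging step is our assumption:

```python
import numpy as np

def to_word_level(attn, word_to_subtokens):
    """Convert a subtoken-level attention map to word level.

    attn:              (num_subtokens, num_subtokens) map for one head.
    word_to_subtokens: list mapping each code token to its subtoken indices.
    """
    n = len(word_to_subtokens)
    out = np.zeros((n, n))
    for i, rows in enumerate(word_to_subtokens):
        for j, cols in enumerate(word_to_subtokens):
            # Sum attention into the target word's subtokens, then average
            # over the source word's subtokens.
            out[i, j] = attn[np.ix_(rows, cols)].sum(axis=1).mean()
    return out
```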

C More Reproducibility Information
Here we provide more information according to the EMNLP 2022 Reproducibility Criteria.
• Train/validation/test splits for datasets used: We do not finetune the pre-trained models on our benchmark. The validation set of CodeSyntax contains the code samples that come from the validation set of CodeSearchNet, and our test set contains the samples from CodeSearchNet's test set. We use our test partition to probe the pre-trained attention heads, while the validation set is not used.

E Examples of Correct and Incorrect Predictions
In this section, we present some visualization examples where the attention correctly or incorrectly predicts the dependents. The heads chosen in these examples are the best-performing heads of CuBERT evaluated using the first-token metric. We feed the entire function as input to the Transformer; however, we only present relevant snippets here for simplicity. In the displayed source code, "..." denotes that the remaining part of the code is omitted. As a result, the attention weights from a token may not sum to one in these figures, because the rest of the function is omitted.
Relation Call:func→args. The corresponding attention weights are visualized in Figure 9 for Python and Figure 10 for Java.

Figure 1: Examples of syntactic relations for (a) natural languages (NL) and (b) programming languages (PL). Each relation is represented by an arrow. The relations in PL represent the syntax of code in a way similar to those in NL.

Figure 3: Offset distribution of relation types in (a) CodeSyntax and (b) the NL corpus. The x-axis is the average positional offset between heads and dependents for each relation type. The y-axis is the number of relation types with that average offset value. See Section 3 for more details on the NL corpus.

Figure 5: Two sample cases for the relation If:if→else and the corresponding attention weights of CuBERT's head 17-2.

Figure 6: PL top-k scores on the test set.

Figure 7: English top-k scores.

Table 1: Dataset statistics of selected relation types in CodeSyntax. For each relation type, we highlight the head and dependent nodes in the examples in bold, with the head in blue and the dependent in red. We defer the full statistics of all relation types to Table 8 in the appendix.

Table 2: Model architectures evaluated on PL and NL benchmarks. Models in the same row share the same architecture, but are pre-trained on different corpora.

Table 4: Top-k scores for NL syntax understanding. Note that BERT-large and CuBERT share the same model configuration, and CodeBERT and RoBERTa-base have the same model architecture. Unlike Table 3, we exclude the Keyword and Combined baselines because they do not improve upon the Offset baseline.

Table 7: Error analysis using CuBERT.

Table 9: Attention vs. offset baseline with a fixed offset for each relation on the Python dataset, using the first-token metric. In the score columns, we present the accuracy scores for CuBERT and the offset baseline. In the offset column, the chosen offset is shown. Score differences are calculated as the CuBERT score minus the offset baseline score for each relation, where a positive value indicates that the language model surpasses the baseline performance. Since CuBERT always outperforms CodeBERT, we only include results for CuBERT.

Table 10: Attention vs. offset baseline with a fixed offset for each relation on the Python dataset, using the any-token metric.

Table 12: Attention vs. offset baseline with a fixed offset for each relation on the Java dataset, using the any-token metric.