A Static Evaluation of Code Completion by Large Language Models

Large language models trained on code have shown great potential to increase the productivity of software developers. Several execution-based benchmarks have been proposed to evaluate the functional correctness of model-generated code on simple programming problems. Nevertheless, it is expensive to perform the same evaluation on complex real-world projects considering the execution cost. On the other hand, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only more efficient, but also applicable to code in the wild. For experiments, we collect code context from open-source repositories to generate one million function bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors made by language models. Through extensive studies, we also show the impact of sampling temperature, model size, and context on static errors in code completions.


Introduction
Automatic code completion by large language models trained on numerous code repositories has demonstrated great potential in accelerating software development. Code assistant services powered by these models provide developers with code suggestions following the current context in real time. However, a recent study has shown that about 70% of the suggestions are discarded by users (Ziegler et al., 2022). Even worse, misleading recommendations can lead to failure in completing programming tasks (Vaithilingam et al., 2022). Therefore, it is important to understand the weaknesses of current code generation models through comprehensive evaluation and analysis. Recently, execution-based evaluation has become increasingly popular, where model-generated code is executed with unit tests to check functional correctness. Several benchmarks have been proposed along this direction, such as HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), MBXP (Athiwaratkun et al., 2022), CodeContests (Li et al., 2022), and DS-1000 (Lai et al., 2022). Although these benchmarks are highly reliable and accurate, they only focus on well-defined algorithmic and data science problems, which do not reflect the needs of general software development. Running execution-based evaluation on real-world codebases is, however, prohibitively expensive because each project requires a different setup and the computation cost is potentially unbounded.
In contrast to the execution-based approach, static program analysis (or static analysis) can analyze programs without executing them. Although static analysis is usually unable to determine functional correctness, it covers a large collection of static error types, such as the undefined names or unused variables illustrated in Figure 1. More importantly, the analysis can be very fast and does not require any project-specific environment setup, which allows us to evaluate model completions for complex real-world code at large scale. Static analysis tools such as linters have been widely used, for example in code editors, to examine human-written code, but their value in evaluating code generation models has not been well explored yet.
In this work, we propose a static evaluation framework for the Python language. Code snippets are first parsed into Abstract Syntax Trees (ASTs) and then analyzed by Pyflakes, a popular static analysis tool for Python. To simulate real-world use cases of auto completion, we collect code from public GitHub repositories to build a function completion dataset of 100K problems. In each problem, we randomly mask out a function body in a Python file and ask the model to complete it given the preceding context up until the function header. We then evaluate public models by sampling 10 completions for each problem, resulting in one million generations for each model and sampling temperature, which are examined by our static evaluation pipeline.
During AST parsing, we find most of the errors arise from incomplete generations that hit the max length limit. Otherwise, models of all sizes perform quite well in producing parsable code. Moving forward, Pyflakes analysis reveals that Undefined Name and Unused Variable are the most prominent static errors in model-generated code. We also observe that higher temperatures consistently lead to more errors. Scaling up the model, while able to reduce errors of many types, does not show a clear benefit for preventing undefined names. Through a more fine-grained classification, we find larger models generate fewer undefined variables but more undefined methods, which add up to a mixed result. Finally, we demonstrate that errors in context can lead to errors of the same type in generation, which is likely a consequence of large language models' in-context learning capability.
In summary, our main contributions include the following. (1) We propose a static evaluation framework for code completion. (2) Our evaluation on public models reveals common static errors and how they are impacted by various factors such as temperature, model size, and context.
Abstract Syntax Tree An Abstract Syntax Tree (AST) represents source code in a concise tree form. By discarding unnecessary details of the underlying code, an AST presents only the main structural content of the source code following the language grammar (Aho et al., 2007).
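For instance, Python's built-in ast module exposes this tree form directly; a minimal sketch:

```python
import ast

# Parse a small snippet into an AST and inspect its top-level structure.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# The module body holds a single FunctionDef node for "add";
# formatting details such as whitespace are discarded.
func = tree.body[0]
print(type(func).__name__)  # FunctionDef
print(func.name)            # add
```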
Static Analysis Static analysis is a common way to detect software bugs without executing the program (Ayewah et al., 2008; Chess and McGraw, 2004; Chess and West, 2007; Zheng et al., 2006). Static analyzers detect bugs by analyzing the static code text, its AST, documentation, etc. Users usually need to specify the error patterns, and static analyzers use different AST, graph, and path analyses to find those patterns in the code. There is a plethora of static analysis tools, and they can detect a wide range of errors depending on the specified patterns (Emanuelsson and Nilsson, 2008). For example, linters are popular tools that check for coding style errors and thus try to enforce a coding standard (Van Oort et al., 2021).

The Function Completion Dataset
We introduce the function completion task, which is one of the most important use cases of auto completion services. Given an input code snippet that ends with a function signature plus an optional docstring, the model is asked to generate the function body. Previous works on code completion (Lu et al., 2021; Svyatkovskiy et al., 2019) have mainly focused on single-line completion. However, a single line is often too short to reveal models' capability in writing syntactically correct code. We believe the function, as the fundamental building block in most programming languages, better serves this purpose. Software developers use code generation models as black-box services on a diverse set of coding projects. To better simulate the real-world scenario, we build an evaluation set by sampling from public GitHub repositories. Specifically, we collect permissively licensed Python code in repositories that were created between April 2022 and August 2022. This selection criterion precludes any chronological overlap between our evaluation data and the training data of the models tested in this work (CodeGen models were trained on data up until October 2021). The collected Python files are reformatted as function completion problems. We first use tree-sitter (https://tree-sitter.github.io/tree-sitter/) to parse the whole file and identify all the functions. Then a function that contains a docstring is randomly selected. The code from the beginning of the file up until the end of the docstring is used as the context, and the function body is considered as the ground truth. The rest of the file is discarded. At test time, we prompt the model with the context part as input, and let the model generate the function body. We choose only functions with docstrings so that the context is well-defined and the model can generate meaningful code completions. We further select test samples whose context length is between 64 and 768 tokens, and whose ground-truth length is shorter than 256 tokens, to match our model generation setting. Our final evaluation set consists of 100K function completion problems.
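The context/ground-truth split described above can be sketched with Python's standard ast module (the paper itself uses tree-sitter; the helper name and the deterministic choice of the first docstringed function are our own simplifications):

```python
import ast

def make_completion_problem(source: str):
    """Split a file into (context, ground-truth body) at the first
    function that has a docstring. Illustrative simplification of the
    paper's random selection over all docstringed functions."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                continue  # only docstringed functions give well-defined context
            docstring_stmt = node.body[0]  # the docstring expression
            # Context runs from the top of the file to the docstring's end.
            context = "".join(lines[: docstring_stmt.end_lineno])
            body = "".join(lines[docstring_stmt.end_lineno : node.end_lineno])
            return context, body
    return None

src = (
    "import math\n"
    "def area(r):\n"
    '    """Return circle area."""\n'
    "    return math.pi * r * r\n"
)
context, body = make_completion_problem(src)
print(body.strip())  # return math.pi * r * r
```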

Static Error Analysis
We propose an evaluation pipeline to detect errors in function completions generated by models, illustrated in Figure 2. Suppose the model generates a completion x given the input context c. We cannot directly analyze x, which is partial code without context. Meanwhile, c may also contain errors, especially in real-world cases. Therefore, we perform our analysis in two passes. We first check c for any errors in the input that need to be excluded, and then do another pass on the full code (c, x), the concatenation of the context and model completion. Any error that is identified in (c, x) but not in c must arise from x, or in other words, be generated by the model. More specifically, we conduct the following two steps of analysis for Python code.

AST parsing
In the first step, we parse both c and (c, x) into abstract syntax trees using Python's native ast module. If the code is parsable, an AST will be returned. Otherwise, a syntax error is captured. Based on the parsing outcomes, we take the following actions:
1. If c is not parsable, we are unable to conclude any error in generation. Empirically this rarely happens, as we will show in the next section.
2. If c is parsable but (c, x) is not, then we can confirm the reported syntax error is caused by model generation. However, notice that only one error will be returned even if there are multiple, due to the nature of AST parsing.
3. If both c and (c, x) are parsable, there is no AST error in model generation. The ASTs will be used for static analysis in the next step.
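The three cases above can be sketched as follows with the ast module (the function name and return labels are illustrative):

```python
import ast

def ast_check(context: str, completion: str) -> str:
    """Two-pass parse: attribute a syntax error to the model only when
    the context alone is parsable but context + completion is not."""
    try:
        ast.parse(context)
    except SyntaxError:
        return "unparsable-context"  # case 1: cannot conclude anything
    try:
        ast.parse(context + completion)
    except SyntaxError as e:
        return f"generation-error: {e.msg}"  # case 2: error caused by model
    return "ok"  # case 3: proceed to static analysis

ctx = 'def double(x):\n    """Return 2x."""\n'
print(ast_check(ctx, "    return x * 2\n"))   # ok
print(ast_check(ctx, "    return (x * 2\n"))  # generation-error: ...
```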

Static analysis with Pyflakes
If both c and (c, x) can be parsed into ASTs, we perform static analysis using Pyflakes. Pyflakes is a static analysis tool that checks a Python source file for errors by examining the AST. One advantage is that the analysis does not rely on the dependencies of the source file, which is important given the diversity of packages used in real-world code. We run Pyflakes on c and (c, x) to identify errors in the context and in the full code. Errors that are detected in (c, x) but not in c are considered as introduced by model completion.
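The differencing logic can be sketched as follows. Pyflakes covers many error types; to keep the example self-contained, a toy unused-import checker written with the standard ast module stands in for it here:

```python
import ast

def unused_imports(code: str) -> set:
    """Toy stand-in for one Pyflakes check: names imported via 'import'
    statements that are never loaded anywhere in the code."""
    tree = ast.parse(code)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
    return imported - used

context = "import os\nimport sys\n"               # 'os' and 'sys' unused so far
completion = "print(sys.argv)\nimport json\n"     # uses sys, adds unused json

# Errors found in the full code but not already present in the context
# are attributed to the model completion.
new_errors = unused_imports(context + completion) - unused_imports(context)
print(new_errors)  # {'json'}
```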

Experiments
With the proposed pipeline we conduct error analysis for CodeGen models (Nijkamp et al., 2022) on the test set described in Section 3, and present the analysis results.

Experiment Setup
We evaluate CodeGen-mono models of all sizes, ranging from 350M to 16B parameters. We generate function completions using nucleus sampling with top-p 0.95. The sampling temperature is varied between 0.2 and 0.8 for the 2B model, and fixed to 0.4 for the remaining models. We sample 10 generations for each problem, which results in one million code completions for each model and temperature. The maximum generation length is 256 tokens. Generated code completions are then passed to our static evaluation pipeline built with Python 3.8 and Pyflakes 3.0.1. Evaluating one million generations takes only a few hours on a single CPU thread, and can be fully parallelized for acceleration.

Validation of Model Output
While we mainly focus on static errors in this study, it is also important to validate that the models do generate relevant code. A counter-example would be to generate a single line of "return" for every function signature, which is syntactically correct but not meaningful at all. Towards this end, we calculate the edit similarity between model generation and ground truth, and compare against Pass@1 from HumanEval (Chen et al., 2021). As shown in Table 1, larger models achieve higher edit similarity, indicating that they do produce relevant code. Finally, the strong positive correlation between the last two columns shows that edit similarity on the function completion dataset can be used as an alternative metric for model comparison.
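The paper does not spell out its exact edit-similarity formula; one common choice, used here purely as an assumed sketch, is the character-level matching ratio from Python's difflib:

```python
import difflib

def edit_similarity(generation: str, reference: str) -> float:
    # Matching-character ratio: 2 * M / (len(a) + len(b)), in [0, 1],
    # where M is the number of characters in matching blocks.
    return difflib.SequenceMatcher(None, generation, reference).ratio()

gen = "return sorted(items)\n"
ref = "return sorted(items, reverse=True)\n"
print(round(edit_similarity(gen, ref), 2))  # 0.75
```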

AST Results
We run AST parsing and find there are only 0.42% of cases with unparsable context that need to be discarded. For the rest, we report the percentage of generations with AST errors in Table 2. A full list of error types is included in Appendix A. For each type, we also show a code example in Appendix B.
While about 7-8% of generations are unparsable, most of the parsing errors happen at the end of file (EOF), which means the generated code is incomplete due to the 256 max token limit. Extending the generation length may help reduce EOF errors, but will require more computation and increase the perceived latency of the auto-completion service.
On the other hand, non-EOF errors only account for a tiny fraction, usually around 0.1-0.2%, which indicates CodeGen models can generally follow the abstract syntax grammar to produce parsable code, regardless of model size and temperature.
Finding 1. Codes generated by models, unless incomplete, are mostly parsable into ASTs, regardless of model size or temperature.
We also show the top 3 non-EOF error types ranked by frequency, which are Invalid Syntax, Print Missing Parentheses, and Keyword Argument Repeated. Notably, the first two categories are often related to Python's interpreter version. To illustrate, a Python 2-style print statement like print "abc" will lead to Print Missing Parentheses in Python 3.
Another example is that using async as a variable name will cause Invalid Syntax because async has become a reserved word since Python 3.7. Models learn to make such errors from their training data, which consists of code written for different Python versions. In many cases, it is difficult for a model to infer the intended interpreter version directly from the limited context. An interesting future direction is to guide models to generate version-compatible code given the target environment.
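Both version-mismatch cases can be reproduced directly with the Python 3 parser:

```python
import ast

# Python 2-style print without parentheses is a syntax error in Python 3,
# reported as Print Missing Parentheses.
try:
    ast.parse('print "abc"')
except SyntaxError as e:
    print("caught:", e.msg)

# Since Python 3.7, async is a reserved keyword, so using it as a
# variable name yields Invalid Syntax.
try:
    ast.parse("async = 1")
except SyntaxError as e:
    print("caught:", e.msg)
```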
Finding 2. Interpreter version mismatch is one of the major reasons for non-EOF AST errors.

Pyflakes Results
We present the frequencies of the top 6 linter errors from Pyflakes in Table 3, with code examples in Appendix B. While Pyflakes also finds other problems in code, most of them are very sparse and thus less important, which we leave to Appendix A. Notice that one code snippet may contain multiple errors. We count each type only once in every test sample. Among all errors, Undefined Name and Unused Variable are the most common ones, where the model either calls a variable that is not defined, or defines a variable but never uses it. Closely related are Unused Import, Redefined While Unused, and Undefined Local, which can be considered as special cases of the first two. Models also sometimes unnecessarily use f-strings without giving any placeholder. It is worth pointing out that not all Pyflakes errors will impact execution. In fact, among the six types, only Undefined Name and Undefined Local may cause runtime problems. However, all these errors can harm readability and maintenance, which are critical for software development. Hence, it is important to address them to improve the quality of auto code completion.
Across sampling temperatures, we observe in every column that more errors are generated under higher temperatures, which is expected because generations in such cases are less confident.
Finding 3. Higher temperature always leads to more errors of every type.
The impact of model size on error rate is less consistent, though. For Unused Variable, Unused Import, and Undefined Local, the error rate does decrease as the model scales up. However, the other three categories do not manifest such a correlation. We investigate the underlying reason for this mixed result, particularly in the case of Undefined Name. Notice that if an undefined name is a function call, it can potentially be defined afterwards, outside the current function completion scope. While not guaranteed, the model might be able to fix this error by itself if we allow generating longer code instead of only one function. In contrast, using a variable without first defining it is usually a mistake. Even in the rare cases where the variable definition is made up correctly after the usage, such ordering is often less preferred in terms of coding style. In Figure 3, we break down the undefined names into variables and functions. We find that larger models yield fewer undefined variables but more undefined functions, which demonstrates that the correlation between error count and model size varies for different error types.
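The paper does not describe exactly how it separates the two kinds of undefined names; one plausible sketch, assumed here, is to check whether each occurrence of a name is the callee of a Call node in the AST:

```python
import ast

def classify_uses(code: str, name: str) -> list:
    """Label each occurrence of `name`: 'function' when it is the callee
    of a call, 'variable' otherwise. A sketch, not the paper's method."""
    tree = ast.parse(code)
    # Collect the identity of every Name node used as a call target.
    callee_ids = {id(n.func) for n in ast.walk(tree)
                  if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    return ["function" if id(node) in callee_ids else "variable"
            for node in ast.walk(tree)
            if isinstance(node, ast.Name) and node.id == name]

snippet = "result = helper(value)\n"
print(classify_uses(snippet, "helper"))  # ['function']
print(classify_uses(snippet, "value"))   # ['variable']
```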
Finding 4. While larger models are more accurate code generators (Nijkamp et al., 2022), scaling up model size does not lead to reduction in error counts for all error categories.

Correlation with Errors in Context
We further study the correlation between errors in the context and in the generation. Denote by c the input context, x the model generation, and e the error type.
We write e ∈ c to mean c contains an error of type e.
For every e, we calculate P(e ∈ x | e ∈ c), the generation error rate when the context contains the same type of error(s). We also report the relative ratio P(e ∈ x | e ∈ c) / P(e ∈ x | e ∉ c) to measure the impact of context. From Table 4, if the model observes errors in context, it is more likely to produce the same type of errors in generation, and the error rate can be amplified by 7∼200 times depending on the type. This is possibly an undesired consequence of the in-context learning capability of large language models. We also calculate P(e ∈ c | e ∈ x) to show how many of the generation errors co-occur with context errors. As indicated by the last column of Table 4, even though context errors can significantly amplify generation errors, the co-occurrences of the two do not account for a large fraction. This implies that problematic context is not the only factor behind problematic generation, and it is often the case that models produce errors even with correct context.
Finding 5. Errors in context generally lead to more errors in generation.
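Given per-sample booleans for "error of type e in context" and "error of type e in generation", the reported quantities reduce to conditional frequencies; a sketch with made-up counts for illustration:

```python
# Each sample: (error of type e in context, error of type e in generation).
# The counts below are fabricated purely to demonstrate the computation.
samples = [
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
    (False, False), (False, False), (False, False), (False, False),
]

def gen_error_rate(pairs):
    """Fraction of pairs whose generation contains the error."""
    return sum(1 for _, g in pairs if g) / len(pairs)

with_ctx_err = [p for p in samples if p[0]]
clean_ctx = [p for p in samples if not p[0]]

p_cond = gen_error_rate(with_ctx_err)   # P(e in x | e in c)
p_clean = gen_error_rate(clean_ctx)     # P(e in x | e not in c)
ratio = p_cond / p_clean                # relative amplification by context
print(round(p_cond, 3), round(p_clean, 3), round(ratio, 2))
```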

Discussion
We present a static evaluation framework for code completions generated by large language models. Utilizing the proposed framework, we conduct error analysis of CodeGen models on a large-scale real-world Python evaluation set. Our experiments reveal common static errors made by pretrained models, as well as their frequency trends across model sizes and sampling temperatures. By pointing out weaknesses of existing models, we hope our study also sheds light on future directions towards more accurate code generation.
There are a few limitations of this study. First, we focus on left-to-right code generation without considering right-side and cross-file context, which could be used to determine broader categories of errors with improved precision. Second, each static analysis tool has its own limitations. Thus, the presented analysis is limited by Pyflakes's accuracy and coverage in detecting certain code issues.

A Full Error Categories
In addition to those discussed in Section 5, we list all error categories that can be detected in model generated code in our experiments, with a minimal frequency of 0.001% by any of the models (i.e. 10 observations out of the total 1 million generations).
AST errors (EOF errors indicated by asterisk):

Figure 1: A function completion example, with an Unused Variable error (gray) in context, and an Undefined Name error (red) in completion.

Figure 2: Evaluation pipeline. Left: We parse [context] and [context + generation] into ASTs. If [context] is not parsable, we stop without reporting any error on generation. If [context] is parsable, but [context + generation] is not, we report the AST error in generation. Right: If both are parsable, we run Pyflakes on the trees, which reports errors in [context] and errors in [context + generation]. Taking the difference gives us errors in generation.

Figure 3: Number of undefined variables versus undefined functions. Larger models generate more undefined functions but fewer undefined variables.

Table 1: Edit similarity on the function completion dataset and Pass@1 on HumanEval, of CodeGen models across different sizes and temperatures. (1) Edit similarity and HumanEval Pass@1 are positively correlated across different settings, which justifies that edit similarity can be used as an alternative metric for model evaluation. (2) As expected, larger models have better edit similarity (a proxy for accuracy) on the function completion task.

Table 2: Percentages of AST errors across different model sizes and temperatures. We show (1) total AST errors; (2) errors at the end of file (EOF); (3) errors not at EOF; (4) top 3 non-EOF errors. Models generally perform well at the AST level except for EOF errors caused by the max generation length limit.

Table 3: Percentages of Pyflakes errors across different model sizes and temperatures. Higher temperatures always lead to more errors in every category. On the other hand, larger models do not necessarily generate fewer errors.

Table 4: Correlation between errors in context and in generation for the 2B model. The first two columns indicate errors in context can amplify errors in generation; the last column shows not all generation errors can be attributed to context. Other models have similar results.
Below we list one code example for each of the error categories shown in Tables 2 and 3. Following the definition of the function completion task, in every example, the context is from the beginning until the end of the docstring of the last function, and the model completion is the body of the last function.