Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they are unable to leverage the entire relevant context of large files and packages for any given task. Complementing the many efforts to extend the context window, we introduce an architecture-independent approach for leveraging the syntactic hierarchies of source code to incorporate entire file-level context into a fixed-length window. Using the concrete syntax tree of each source file we extract syntactic hierarchies and integrate them into the context window by selectively removing from view the more specific, less relevant scopes for a given task. We evaluate this approach on code generation tasks and joint translation of natural language and source code in the Python programming language, achieving a new state of the art in code completion and summarization for Python in the CodeXGLUE benchmark. We also introduce new CodeXGLUE benchmarks for user-experience-motivated tasks: code completion with normalized literals, and method body completion/code summarization conditioned on file-level context.

A major difference between transformers and antecedents like recurrent neural networks (RNNs) is their strictly enforced finite context window. Whereas an RNN can iteratively consume as many tokens as required, a transformer can only consume up to a fixed number of tokens decided at training time. Further, it is impractical to simply expand the window, as the memory and compute requirements of the attention mechanism scale quadratically with context length. There have been efforts to economically expand the window by modifying the attention mechanism with low-rank queries and keys (Beltagy et al., 2020), sparse connections (Parmar et al., 2018; Zaheer et al., 2020), and more recently approximation by kernel methods (Choromanski et al., 2021). Methods have also been developed to condition generation on retrieved documents (Lewis et al., 2020b,a) for knowledge-intensive applications. Complementary to these approaches, and compatible with any sequence model or architecture, we propose a method for extracting the most important features distant from the task at hand, implemented in this case using the syntax tree of source code.
A source code document has nested scopes and references to other documents and libraries, and software engineering tasks must leverage knowledge at all of these scales. Even single source code files often exceed the length of the context window, so it is clear that progress in modeling source code requires overcoming this limitation. Instead of proposing a novel transformer architecture capable of ingesting more context, we propose a method for compressing the context of source code files using the syntax of the programming language itself. In this paper we introduce eWASH: Extended Window Access by Syntax Hierarchy, which leverages the syntax hierarchy of source code to give our models longer-range vision by prioritizing higher-level scopes and names for source code elements in a file which are not immediately in focus. eWASH assumes that higher-level constructs like function signatures summarize the whole method; viewed this way, eWASH could also be applied to natural language by extracting key terms in a long document or summarizing features distant from the task at hand.
We start by explaining eWASH with a motivating example and define its features for three important software engineering tasks. The first is code completion, in which some number of tokens are predicted to extend a partially complete source code file. The second is method completion, wherein a whole method body is predicted from a method signature and docstring. The third is code summarization or docstring generation, wherein a method is mapped to a natural language docstring. We then discuss the Python training data and the models employed in this study, which include the auto-regressive GPT-C (Svyatkovskiy et al., 2020) and its eWASH-extended version XGPT-C, the Python method/docstring prediction model PyMT5 (Clement et al., 2020) and the similarly extended XPyMT5, and Performer (Choromanski et al., 2021) and Reformer (Kitaev et al., 2020) baselines. We demonstrate through experiments the value of eWASH for the source code domain, with state-of-the-art performance for code completion, method completion, and code summarization (docstring generation). Finally, we study user-experience-motivated properties of XGPT-C and extend the source code benchmark set CodeXGLUE (Lu et al., 2021) with two new tasks: literal-normalized code completion and method completion. Surprisingly, eWASH excels at code completion even when the context window is not a limiting factor.

Motivating example
In this paper we consider three software engineering tasks: code completion, method completion, and docstring completion or code summarization. Code completion is the auto-regressive prediction of one or more tokens conditioned on a provided context. Figure 1 shows an incomplete Python program implementing a neural network. A developer would like a model which can predict the body of ConvNet.forward (method completion) by composing the layers defined in ConvNet.__init__ and globally imported operations. There is a limited context window, illustrated at the left of the figure, and so while the model can (in this fortunate case) see the layer definitions, it is ignorant of the imports and the global LOGGER object.
In many cases predicting a whole method is not easy, but asking for a few members or a completed line is still desirable. The bottom of Fig. 1 shows the code completion task finishing an assignment and completing a line in a partial implementation of ConvNet.forward.
Here, again, crucial information is missing from the input of the model, which will prevent it from being able to reference the imports. Further, one can easily imagine a scenario in which more spurious information is fed into the model instead of, for example, the definitions of the neural network layers in the __init__. How can we ensure the model is shown important information for predictions?

Extended Window Access by Syntax Hierarchy
Software developers carefully organize their code into scopes and portable elements using hierarchies like methods and classes, so we hypothesize that the labels of these scopes are more important for long-range modeling than their contents.

Figure 1: An example scenario where both method completion (top right) and code completion (bottom right) are performed by our eWASH models XPyMT5 and XGPT-C, respectively. Method completion aims to predict a whole method body conditioned on a signature, docstring, and other context, and code completion aims to predict any number of tokens to complete a member, line, or even a scope, conditioned on any incomplete code string. Here XPyMT5 is tasked with predicting the whole body of ConvNet.forward; clearly both the layers assigned in ConvNet.__init__ and the import statements above are important information. XGPT-C, in turn, aims to complete the assignment, for which the class attributes and import statements are again important. Both models have a limited context window, illustrated at left, which excludes important information.

We propose
eWASH, Extended Window Access by Syntax Hierarchy, in which we compress the context provided to our model by prioritizing, for example, function signatures over function bodies. Since most code is written inside methods, we center method bodies as the focus of the modeling task for eWASH, calling each method being modeled the 'focal method.' Figure 2 shows how eWASH uses syntax hierarchies to prioritize elements of the context for the method and code completion examples of Fig. 1. The focal method in this case is ConvNet.forward, but it could be any other method in the module. The most important context for modeling the body of this focal method is its signature and docstring (if present) and its containing class definition (if the focal method is a class method). After this we prioritize global import statements and globally assigned names (but not yet the assigned expressions), followed by class attributes, peer class method signatures, the class docstring, peer class method docstrings, and finally global expressions and the code bodies of peer class methods.
In practice, eWASH is implemented by taking the concrete syntax tree of the source file, organizing the syntactic elements according to our priority list, tokenizing each element, and then descending the priority list, taking elements until the context window has been filled. For training the method completion of XPyMT5, we arrange the eWASH context in the input with a control code indicating which method is to be completed (# target body in Fig. 1), and we arrange the target to be the method body. eWASH thus yields N total training samples from a file with N total methods and class methods. For docstring completion or code summarization, the source contains the method signature and body, the target contains the desired docstring, and a control code instructs the model which task it is to perform, just like PyMT5 (Clement et al., 2020).
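The priority-descending budget fill described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the priority labels, the `elements` mapping, and the `tokenize` hook are all hypothetical stand-ins for features extracted from the concrete syntax tree.

```python
# Priority order described in the text: higher entries are kept first.
PRIORITY = [
    "focal_signature_and_docstring",
    "enclosing_class_definition",
    "imports_and_global_assigned_names",
    "class_attributes",
    "peer_method_signatures",
    "class_docstring",
    "peer_method_docstrings",
    "global_expressions_and_peer_bodies",
]

def pack_context(elements, budget, tokenize):
    """Fill a fixed token budget by descending the priority list.

    `elements` maps a priority label to a list of code snippets (assumed
    already extracted from the syntax tree); `tokenize` maps a snippet to
    a token list.
    """
    packed, used = [], 0
    for level in PRIORITY:
        for snippet in elements.get(level, []):
            toks = tokenize(snippet)
            if used + len(toks) > budget:
                return packed  # budget exhausted: stop descending
            packed.append(snippet)
            used += len(toks)
    return packed
```

The key property is that low-priority material (peer method bodies, global expressions) is the first to fall out of view when the file is large.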
For code completion, since we use an auto-regressive decoder in the form of XGPT-C, there is no special 'position,' so we create a rolling window across the focal method body. We reserve 3/4 of the tokens (768/1024) for the context and 1/4 (256/1024) for the rolling window over the body. When a method exceeds 256 tokens, the training sample for that method is decomposed into multiple 'windows,' and one file yields at least N training samples for a file with N method and class method definitions.

Figure 2: The eWASH (Extended Window Access by Syntax Hierarchy) method selectively fills the model context going down the file, in the order of priority indicated, and stops when the token budget of the model context is filled. eWASH presupposes that the names of entities at higher scopes are more relevant to the task at hand than entities at lower scopes.
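The window decomposition above can be sketched as a simple chunking of the body token sequence. This is an assumption-laden illustration (the paper does not spell out the stride or overlap of its rolling window; non-overlapping chunks are used here for simplicity):

```python
def rolling_windows(body_tokens, window=256):
    """Decompose a focal-method body into training windows.

    Mirrors the split described in the text: with a 1024-token model,
    768 tokens are reserved for eWASH context and 256 for the body, so
    a body longer than 256 tokens yields multiple training samples.
    """
    return [body_tokens[i:i + window]
            for i in range(0, len(body_tokens), window)]
```

A 600-token body, for example, would produce three samples, each paired with the same eWASH-packed context.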

Pre-training
The data for training is the same for both XGPT-C and XPyMT5, and consists of all GitHub repositories with 5 or more stars which are primarily Python, keeping only files which were either Python 3 compliant or were successfully fixed by lib2to3. Further, a time limit of 10 seconds was placed on the parsing process to eliminate files which are essentially data files, as they tend to contain, for example, very large literal lists. Table 2 shows summary statistics of this dataset for a sense of scale.
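A minimal sketch of the per-file filter described above, under stated assumptions: the `try_fix` hook stands in for a lib2to3-style fixer (the paper used lib2to3 itself), and the 10-second parse time limit is omitted here since it depends on the execution environment.

```python
import ast

def keep_file(source, try_fix=None):
    """Return cleaned source if the file should enter the corpus, else None.

    A file is kept if it parses as Python 3, or if `try_fix` (a 2to3-style
    converter, hypothetical here) produces a version that parses.
    """
    try:
        ast.parse(source)      # already Python 3 compliant
        return source
    except SyntaxError:
        pass
    if try_fix is not None:
        try:
            fixed = try_fix(source)
            ast.parse(fixed)   # accept only if the fixed file parses
            return fixed
        except Exception:
            pass
    return None                # drop: not valid and not fixable
```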

Fine-Tuning and Evaluation
For evaluation and fine-tuning of code completion we used the Py150 dataset (Raychev et al., 2016) from CodeXGLUE (Lu et al., 2021), and for method and docstring completion we used the CodeSearchNet (Husain et al., 2019) dataset. Py150 is larger than CodeSearchNet for Python, but CodeSearchNet selected repositories with good docstring coverage, allowing better evaluation of the method/docstring completion task.

Baseline Models
We consider state-of-the-art transformer models for the code completion and code summarization tasks in the CodeXGLUE benchmark as our baselines: the generative pre-trained transformer model for code, GPT-C (Svyatkovskiy et al., 2020), and the Python method text-to-text transfer transformer model, PyMT5 (Clement et al., 2020). We also experiment with two memory-efficient transformers, Reformer (Kitaev et al., 2020) and Performer (Choromanski et al., 2021), which enable modeling of context lengths in excess of 1024 tokens.

GPT-C
GPT-C is an auto-regressive language model pre-trained on a large unsupervised source code corpus. Treating the source code data as a sequence of lexically defined tokens, GPT-C extracts training samples as a sliding window over each source code file. This baseline uses an approach based on statistical language modeling of source code, with several normalization rules derived from the concrete syntax tree of a program. To overcome differing styles and whitespace or tab conventions, it transforms the code into symbolic program tokens using a custom tokenizer and regenerates the code with a common style. During pre-processing, GPT-C parses the program code in each file, extracts information about token types, normalizes uncommon literals, trains a sub-token vocabulary, and encodes the files into sub-token sequences. This is done both for training and inference. The GPT-C decoder-only model has about 125M parameters and a context length of 1024 tokens.

PyMT5
PyMT5 is a transformer encoder-decoder model jointly pre-trained on a large-scale corpus of Python source code and the natural language contained in docstring summaries. PyMT5 training samples are supervised pairs of function code features - function signatures, docstrings, and bodies - extracted by means of a parser. PyMT5 is fine-tuned to translate between all non-degenerate combinations of code features in a multi-modal setting, e.g. signature and docstring to body, signature to docstring and body, signature to body, etc. PyMT5 only uses information from a single method and so naturally is missing imports, peer class and method definitions, and global assignments. PyMT5 has 406M parameters and a context width of 1024 tokens for both the encoder and decoder.

Memory-Efficient Transformers
Reformer and Performer transformer models attempt to break the infamous quadratic attention bottleneck and allow for efficient modeling of contexts much longer than the standard 1024-token window. Reformer (Kitaev et al., 2020) includes three memory optimizations: reversible layers (to trade off memory for time), axial positional embeddings, and bucketed (locality-sensitive hashing) attention. Performer (Choromanski et al., 2021) develops a linear approximation to the attention layer AV = σ(QKᵀ)V, where K, Q, and V are the key, query, and value matrices of the attention mechanism and A = σ(QKᵀ) is approximated via random features of the softmax kernel σ, and exploits the resulting linearity to improve computational efficiency.
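To make the reassociation concrete, here is a minimal sketch of kernel-approximated attention in the spirit of Performer. It uses a simplified positive random-feature estimator of the softmax kernel; the actual FAVOR+ construction uses orthogonal random features and more careful normalization, so this is an illustration of the idea rather than Performer itself.

```python
import numpy as np

def linear_attention(Q, K, V, n_features=256, seed=0):
    """Approximate softmax attention as phi(Q) (phi(K)^T V).

    phi(x) = exp(Wx - |x|^2/2)/sqrt(m) gives an unbiased estimate of the
    (untempered) softmax kernel exp(q.k); reassociating the matrix product
    makes cost linear, not quadratic, in sequence length.
    """
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, n_features))

    def phi(X):
        return np.exp(X @ W - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(n_features)

    Qp, Kp = phi(Q), phi(K)
    # Reassociate: (Qp Kp^T) V == Qp (Kp^T V); the right-hand grouping never
    # materializes the n x n attention matrix.
    num = Qp @ (Kp.T @ V)
    den = Qp @ Kp.sum(axis=0, keepdims=True).T  # row normalizer
    return num / den
```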

Code completion
Code completion is the auto-regressive completion of source code tokens, as illustrated in the bottom of Fig. 1. We perform code completion as defined in CodeXGLUE (Lu et al., 2021), as well as with normalized literals. Literal normalization improves the user experience of a code completion tool by abstracting away personally identifiable information and encouraging the model to focus on code modeling over arbitrary strings: names, phone numbers, IP addresses, and more may appear in string or numeric literals. We therefore normalize literals in source code to special placeholder tokens. Since frequently used literals may contain useful information, e.g. "__main__" or "utf-8", we preserve the 200 most frequent string and 30 most frequent numeric literals.
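A sketch of this normalization using Python's standard tokenizer. The placeholder token names and the tiny allowlists below are illustrative assumptions; the paper's pipeline builds its allowlists from the 200/30 most frequent literals in the corpus.

```python
import io
import token
import tokenize

def normalize_literals(source,
                       keep_strings=frozenset({'"__main__"', '"utf-8"'}),
                       keep_numbers=frozenset({"0", "1"})):
    """Replace string/numeric literals with placeholders unless allowlisted."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.STRING and tok.string not in keep_strings:
            out.append((token.STRING, '"<STR_LIT>"'))
        elif tok.type == token.NUMBER and tok.string not in keep_numbers:
            out.append((token.NUMBER, "<NUM_LIT>"))
        else:
            out.append((tok.type, tok.string))
    # 2-tuples put untokenize in compatibility mode: it re-joins tokens
    # without needing exact source positions.
    return tokenize.untokenize(out)
```

Note the allowlist compares the literal exactly as written, quotes included, so in practice both quoting styles of a frequent string would need entries.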

Method completion
Method completion is the prediction of a method body implementation conditioned on a signature, an optional docstring, and any additional context. The authors of PyMT5 performed this task using no context beyond the focal method, while XPyMT5 uses eWASH-compressed file-level context. We contribute method and docstring completion conditioned on file-level information as a task to CodeXGLUE, based on the CodeSearchNet dataset, in order to bolster its user-experience-motivated tasks.

Docstring Completion/Code Summarization
Docstring completion is the prediction of a docstring conditioned on a focal method and optional context, and was also performed by PyMT5 on focal methods alone. Code summarization is closely related, as docstrings often express a summary of their method, but docstrings also include annotated arguments, return values, exceptions, and even test cases via doctest. We train on docstring completion but evaluate on the CodeSearchNet dataset, which attempts to remove everything but the summary, assumed to be the first paragraph of the docstring.
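The first-paragraph convention mentioned above can be sketched in a few lines; this is a simplified stand-in for CodeSearchNet's actual cleaning pipeline, which applies additional filters.

```python
def docstring_summary(docstring):
    """Extract a docstring's summary, assumed to be its first paragraph
    (the convention described for the CodeSearchNet evaluation data)."""
    first_paragraph = docstring.strip().split("\n\n")[0]
    # collapse internal whitespace so wrapped summaries become one line
    return " ".join(first_paragraph.split())
```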
Model Training

XGPT-C
We trained XGPT-C on the Python dataset described in the Pre-training section. Each training sample is a method body along with its corresponding extended context. In XGPT-C we follow GPT-C, using a multi-layer Transformer decoder as the model architecture and the causal language modeling training objective. We use a 12-layer Transformer decoder with 12 attention heads, a hidden dimension of 768, and a sentencepiece (https://github.com/google/sentencepiece) BPE vocabulary of size 50,000. The total number of model parameters is 125M. Pre-training takes 2 weeks on sixteen 32GB Tesla V100 GPUs, and all hyperparameters were left as in GPT-C.

XPyMT5
We trained XPyMT5 on the same Python dataset as XGPT-C. Similar to PyMT5, each Python file yielded between N and 3N training samples, where N is the number of methods and class methods in the file. Each method teaches the model to complete the method body conditioned on its signature (and docstring, if it exists), to predict the docstring (if it exists) from the method, and to predict the whole method from just the docstring (if it exists). In this way XPyMT5 can also jointly predict code and natural language, though we did not include all of the combinations used by PyMT5, as the training set was already much larger due to the extended context. XPyMT5 uses the same whitespace-augmented GPT-2 (Radford et al., 2018) tokenizer as PyMT5, with a vocabulary size of about 50,000, and has the same architecture and hyperparameters as PyMT5, with 12 layers and 406M parameters. XPyMT5 was trained on sixteen 32GB Tesla V100 GPUs for 4 weeks, about 10 epochs total, using the same hyperparameters as reported by Clement et al. (2020). XPyMT5 was initialized with the English pre-trained BART (Lewis et al., 2019) weights (with whitespace embeddings) and pre-trained using the BART de-noising objective for 5 weeks on the same hardware.

Reformer/Performer
We trained both Performer and Reformer models on the Python dataset described in the Pre-training section, but without eWASH. Each training sample is a whole source code file with literal normalization applied. We adapt the open-sourced model implementations, setting the architecture parameters of each to match the parameter count of XGPT-C as closely as possible. Both used 12 layers, a context length of 4096, and a hidden dimension of 768. All other hyperparameters were unchanged from their defaults.

Metrics
The metrics we used to evaluate eWASH, and thus XGPT-C and XPyMT5, are consistent with GPT-C, PyMT5, and other works in the literature. We report the longest-common-subsequence metric ROUGE-L, as we expect that in a developer-tool scenario users will want predicted code requiring the fewest edits. To that end, we also report the edit distance between the ground truth and the hypothesis. To compare with other code completion models we report ExactMatch@N (EM) metrics (Rajpurkar et al., 2016), which count the fraction of exactly correct predictions of some length (@N). For method and docstring completion we report BLEU-4 and ROUGE-L, but not exact match, as it is too strict to be meaningfully interpreted for longer source-target pairs. For method completion we also report the fraction of syntactically correct methods, as judged by the Python 3.8 grammar.
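The EM@N metric admits two readings in this paper: exact match of the first N predicted tokens, and exact match among the top-N beam candidates (beam width 5 is used for inference). The sketch below implements the beam-rank reading as an assumption; the helper name and data shapes are illustrative.

```python
def exact_match_at_n(beam_lists, references, n):
    """Fraction of examples whose reference appears among the top-n
    beam candidates (one plausible reading of ExactMatch@N)."""
    hits = sum(ref in beams[:n] for beams, ref in zip(beam_lists, references))
    return hits / len(references)
```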

Experimental Conditions
We aim to evaluate how well the XGPT-C model can infer developers' true intents. We randomly selected 833 unique Python functions from the code completion test benchmark in CodeXGLUE (Lu et al., 2021) and, except at the first two tokens of each line, prompted the model at all other points inside the methods. The predictions are compared to the true continuation of the code. For method and docstring completion, the CodeSearchNet repositories at their specific commit hashes were re-downloaded in order to extract the eWASH features in addition to the individual methods released. We will release this expanded CSN dataset and task to CodeXGLUE to improve its user-experience-motivated metrics. Inference in all cases was performed with beam search with a beam width of 5.

Figure 3: Comparing baseline GPT-C with XGPT-C in an offline evaluation of ExactMatch@1-5 code completion as a function of total token context length for the normalized-literal scenario. Surprisingly, eWASH leads XGPT-C to benefit most over GPT-C at the shorter context lengths; XGPT-C also predicts tokens more exactly at longer contexts.

Table 3: Code completion evaluated on the CodeXGLUE test set by ExactMatch@1-5, and overall EM results for XGPT-C and GPT-C.

Code Completion Evaluation Results
As shown in Table 1, eWASH allows XGPT-C to beat both the GPT-C baseline and the memory-efficient transformers on all the metrics computed. About 10% of our Python test files were longer than 1024 tokens; evaluating separately on that subset yielded slight improvements for Performer/Reformer, but Performer beat XGPT-C only in terms of EM@5, at 55.9%. These evaluations were performed for source code inside methods, as the eWASH technique follows the syntactic hierarchy used by developers. Note that the bottom rows are trained and evaluated on the normalized-literal dataset. XGPT-C sees a large absolute increase in ExactMatch@5 of 13% with normalized literals, showing that, in addition to protecting user data, normalizing literals is an important part of a good IDE programming assistant.

Hellendoorn et al. (2019) showed that artificial evaluation scenarios are often much more forgiving than real-world scenarios. To better evaluate whether these models can predict a developer's intent, we compute the ExactMatch@1-5 benchmark, described in the Experimental Conditions section, broken down by total token context and by length of same-line context. Figure 3 shows EM@1-5 metrics for the normalized-literal scenario, binned by the context length available for the completion, for GPT-C (left) and XGPT-C (right). It is clear that in all measured cases eWASH allows XGPT-C to better predict exact matches. Perhaps most strikingly, the largest relative increase in EM occurs for shorter context lengths, so the syntactic-hierarchy hypothesis underlying eWASH appears most beneficial even for context lengths well within the context window. Figure 4 shows the same EM metrics broken down by same-line context length, to test how much the most proximal tokens matter for prediction. We see the same overall benefit of eWASH in XGPT-C, and only a slow increase as a function of same-line context.
The average line length in our data is 18 tokens, so with 7 tokens of same-line context, XGPT-C can complete 5 tokens exactly more than 80% of the time while GPT-C can do so just shy of 60% of the time. Again, this is very interesting as eWASH confers great benefit even when context lengths do not exceed the context window, and supports our hypothesis that user-defined syntax hierarchies are very important signals for predicting method bodies.
Modern IDE environments like Visual Studio IntelliCode can present multiple predictions, which Hellendoorn et al. (2019) showed can improve real-world user acceptance. Table 3 shows the overall ExactMatch@1-5 metrics for code completion regardless of context length. XGPT-C is the clear winner again on all the EM metrics, boosting total exact matches by over 12% for top-1 predictions and reaching 88.7% overall for top-5 predictions. We interpret this to mean that eWASH will enable superior on-line user acceptance of code completions.

Method Completion Evaluation Results
We evaluate eWASH for method generation, illustrated in the top of Fig. 1. Table 4 shows the comparison between XPyMT5 and PyMT5: XPyMT5 is superior in all the source-target comparison metrics. Its syntax correctness is slightly lower, but the difference is not necessarily meaningful. The ROUGE-L metrics are dramatically improved, which is not necessarily surprising, as XPyMT5 is conditioned on much more information than PyMT5. The syntax correctness of our fine-tuned models is slightly lower than the 92.1% reported by Clement et al. (2020).

Conclusions
Inspired by the performance of transformer models, their limited context window size, and the especially long-range nature of source code as documents, we developed Extended Window Access by Syntax Hierarchy. Our hypothesis was that the syntax hierarchy imposed by developers is a real signal of importance in a task context, and that methods, containing most lines of code, are most dependent on the higher-level scopes of their file-level attributes. Our XGPT-C results for code completion supported this hypothesis and, strikingly, showed the most relative benefit at shorter context lengths. We showed with strict exact match metrics that eWASH allows a large relative improvement in code completion predictions. Finally, we showed dramatic improvement in method completion and code summarization with XPyMT5. eWASH can be applied to any programming language and, in principle, to any language with hierarchical syntactic or stylistic structure. For this reason we believe eWASH to be a general-purpose modeling approach for more optimally using finite context windows on structured documents, one which could improve natural language understanding tasks as well. Further, any model, even the largest GPT-3 language model (Brown et al., 2020), can leverage eWASH features. Accompanying this manuscript we submit 3 new tasks to CodeXGLUE to bolster its user-experience-motivated metrics: literal-normalized code completion, method-level code completion, and method/docstring completion conditioned on whole-file context.