Syntax-Aware Retrieval Augmented Code Generation

Neural code generation models are nowadays widely adopted to generate code automatically from natural language descriptions. Recently, pre-trained neural models equipped with token-level retrieval capabilities have exhibited great potential in neural machine translation. However, applying them directly to code generation poses challenges: the retrieval-based mechanism inevitably introduces extraneous noise into the generation process, which can even result in syntactically incorrect code. Computationally, such models necessitate frequent searches of the cached datastore, which turns out to be time-consuming. To address these issues, we propose kNN-TRANX, a token-level retrieval-augmented code generation method. kNN-TRANX searches smaller datastores tailored for the code generation task, and it leverages syntax constraints during datastore retrieval, which reduces the impact of retrieval noise. We evaluate kNN-TRANX on two public datasets, and the experimental results confirm the effectiveness of our approach.


Introduction
Neural code generation aims to map an input natural language (NL) description to code snippets using deep learning. Due to its great potential to streamline software development, it has garnered significant attention from both the natural language processing and software engineering communities. Various methods have been explored to facilitate code generation (Yin and Neubig, 2018; Wang et al., 2021; Guo et al., 2022). Recent progress in neural machine translation (NMT) shows that the non-parametric k-nearest-neighbor machine translation (kNN-MT) approach may significantly boost the performance of standard NMT models (Khandelwal et al., 2021) and other text generation models (Kassner and Schütze, 2020; Shuster et al., 2021) by equipping the models with a token-level retriever. In particular, this neural-retrieval-in-the-loop approach facilitates the integration of external knowledge into the pre-trained model and provides a simple yet effective way to update the model by switching the retrieval datastore, without fine-tuning the model parameters.
Can such a neural-retrieval-in-the-loop approach benefit neural code generation? Our preliminary experiments reveal three main issues (cf. the example in Table 1) if it is adopted outright. First, model performance may be negatively affected by noise in the retrieved knowledge. For example, "all" does not match the intent of the description, but it is recognized as the target token by the retriever, resulting in incorrect code. Second, the code generated by kNN-MT is not guaranteed to be syntactically correct, as demonstrated by the mismatched parentheses in the given example. Third, the token-level retrieval method requires a similarity search over the entire datastore at each time step of inference, which hinders the deployment of such an approach.
In this paper, we propose a novel code generation approach, kNN-TRANX, to overcome the limitations of the neural-retrieval-in-the-loop paradigm for code generation tasks. The basic idea is to integrate symbolic knowledge to ensure the quality of the generated code and to expedite the retrieval process. To achieve this, we leverage a sequence-to-tree (seq2tree) model that generates an abstract syntax tree (AST), a hierarchical tree structure representing the code, rather than generating the target code snippet directly. This enables us to use AST construction rules to guarantee the syntactic correctness of the generated code and to filter out retrieval noise.
We design kNN-TRANX as a two-step process (cf. Figure 2). In the first step, we construct two separate datastores, a syntactic datastore and a semantic datastore, based on the type of AST nodes. This allows us to determine the type of the next node to be predicted according to the grammar rules and to query only the relevant datastore. In the second step, we utilize syntactic rules to filter out irrelevant knowledge and convert the similarity retrieval results for the current target token into a probability distribution, the kNN probability. This probability, together with the probability from the neural network, yields the probability of the action used for AST generation via a learnable confidence parameter. This helps minimize retrieval noise and dynamically exploits combinations of the two probabilities, resulting in improved code generation performance.
To evaluate the effectiveness of kNN-TRANX, we perform experiments on two publicly available code generation datasets (CoNaLa and Django). The experimental results show a 27.6% improvement in the exact match metric on the CoNaLa dataset and a 4.2% improvement in the BLEU metric on the Django dataset, surpassing five state-of-the-art models under comparison. Additionally, we conduct an experiment on model canonical incremental adaptation, which updates kNN-TRANX by switching the datastore. The results demonstrate that our model can achieve performance comparable to fully fine-tuned models while reducing the number of trainable parameters by a factor of 7,000.

Background
In this section, we provide an overview of the kNN-MT paradigm and the seq2tree model.

kNN-MT
The kNN-MT paradigm (Khandelwal et al., 2021) is a translation mechanism that enhances the quality of model generation by incorporating an additional translation retriever. This allows NMT models to benefit from the retrieved knowledge. The paradigm comprises two main parts: datastore building and model inference.

Datastore Building. The datastore consists of a set of key-value pairs, where the key is the decoder hidden state and the value is the corresponding target token. Formally, given a bilingual sentence pair (x, y) from the training corpus (X, Y), a pre-trained NMT model f_NMT(·) generates the i-th context representation h_i = f_NMT(x, y_<i); the datastore D is then constructed as

D = {(h_i, y_i) | ∀ y_i ∈ y, (x, y) ∈ (X, Y)}.
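The datastore-building step can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: `hidden_state_fn` stands in for the decoder f_NMT, and `toy_state` is a fake placeholder for real decoder representations.

```python
# Sketch of kNN-MT datastore building: one (hidden state, target token)
# pair is stored per target position of every training example.

def build_datastore(corpus, hidden_state_fn):
    """corpus: iterable of (source, target_tokens) pairs.
    hidden_state_fn(x, prefix) -> tuple of floats (decoder state)."""
    datastore = []  # list of (key, value) = (h_i, y_i)
    for x, y in corpus:
        for i, token in enumerate(y):
            key = hidden_state_fn(x, y[:i])  # context representation h_i
            datastore.append((key, token))   # value is the target token y_i
    return datastore

# Toy stand-in for a decoder: fake hidden states from input lengths.
def toy_state(x, prefix):
    return (float(len(x)), float(len(prefix)))

ds = build_datastore([("add two ints", ["a", "+", "b"])], toy_state)
```

With this toy corpus, `ds` holds three key-value pairs, one per target token.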
Model Inference. During inference, at time step i, given the already-generated tokens ŷ_<i and the contextual representation ĥ_i, the kNN-MT model generates y_i by retrieving from the datastore:

p_kNN(y_i | x, ŷ_<i) ∝ Σ_{(h_j, y_j) ∈ N_i} 1[y_i = y_j] · exp(−d_j / T),   (1)

where N_i denotes the set of retrieved neighbors, T is the temperature, and d_j indicates the ℓ2 distance between the query ĥ_i and the retrieved key h_j.
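Equation (1) can be illustrated with a small, self-contained sketch: retrieved neighbors (ℓ2 distances paired with target tokens) are mapped to a normalized distribution over tokens, with closer neighbors contributing more mass. The neighbor list and temperature below are made-up values.

```python
import math
from collections import defaultdict

# Sketch of Equation (1): aggregate exp(-d/T) scores per retrieved token,
# then normalize into a probability distribution.

def knn_distribution(neighbors, T=10.0):
    """neighbors: list of (l2_distance, token) pairs; T: temperature."""
    scores = defaultdict(float)
    for d, token in neighbors:
        scores[token] += math.exp(-d / T)  # closer neighbors weigh more
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}

p = knn_distribution([(1.0, "split"), (2.0, "split"), (8.0, "all")])
```

Here "split" dominates because two nearby neighbors vote for it, while the distant "all" neighbor gets little mass.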

Seq2tree Model
The purpose of seq2tree code generation models is to generate ASTs instead of directly outputting code snippets. Compared to sequence-to-sequence (seq2seq) models, seq2tree models ensure the syntactic correctness of the generated code. Among seq2tree models, BertranX (Beau and Crabbé, 2022) was recently proposed and represents the state-of-the-art architecture. BertranX employs BERT to process the input natural language and features a grammar-based decoder.
Figure 1: Example of ASDL for Python. ASDL defines a set of grammatical symbols, which are denoted in orange and distinguished by a unique constructor name highlighted in blue. Each rule assigns names to its fields, or the symbols marked in black. The grammatical symbols can be classified into two types: nonterminals (e.g., expr) and terminals or primitives (e.g., identifier). Some grammatical symbols may have qualifiers (*) that allow zero or more iterations of the symbol.
BertranX describes ASTs using sequences of actions based on the ASDL (Wang et al., 1997) grammar, which gives concise notations for describing the abstract syntax of programming languages (cf. Figure 1 for an example). With ASDL, BertranX defines two distinct types of actions that generate ASTs, namely PREDRULE and GENERATE. The first type initiates the generation of a new node from its parent node; we mark such nodes as syntactic nodes in this paper. The second type produces terminal or primitive symbols, which we mark as semantic nodes.
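As a rough illustration, the action sequence for a snippet like `x.split()` interleaves the two action types. The rule strings below are hypothetical stand-ins, not the actual ASDL productions:

```python
# Hypothetical action trace for `x.split()` under an ASDL-style grammar:
# PREDRULE actions expand syntactic nodes; GENERATE actions emit
# terminal/primitive (semantic) tokens.

actions = [
    ("PREDRULE", "Expr -> Call"),        # syntactic node
    ("PREDRULE", "Call -> func, args"),  # syntactic node
    ("GENERATE", "x"),                   # semantic (primitive) node
    ("GENERATE", "split"),               # semantic (primitive) node
]

syntactic = [a for a in actions if a[0] == "PREDRULE"]
semantic = [a for a in actions if a[0] == "GENERATE"]
```

Separating the trace this way mirrors the syntactic/semantic node distinction used throughout the paper.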

kNN-TRANX
The workflow of kNN-TRANX is depicted in Figure 2. It consists of two main components: datastore building and model inference.

Datastore Building
Given a pre-trained seq2tree model and the training dataset, we first parse code snippets into ASTs and generate all instances in the training corpus. This process allows us to capture and store the decoder representations along with their corresponding target tokens as key-value pairs. The actions that constitute an AST can be categorized into two groups, rules and primitives, which align with the PREDRULE and GENERATE actions, respectively. As shown in Figure 3, the two types of nodes differ significantly in both type and quantity. Combining them into the same datastore could reduce retrieval accuracy. Therefore, we employ a separate datastore for each node type, referred to as the syntactic and semantic datastores, respectively. Nodes representing the structural information of the code (e.g., Expr and Call) are put into the syntactic datastore, while nodes representing its semantic information (e.g., text and split) are put into the semantic one.
Given an NL-code pair (x, y) from the training corpus (X, Y), we first transform the code snippets Y into AST representations Z. Next, we calculate the i-th context representation h_i = f_θ(x, z_<i), where f_θ(·) refers to the trained seq2tree model and z ∈ Z. The datastore is constructed by taking the h_i's as keys and the z_i's as values, namely

D = {(h_i, z_i) | ∀ z_i ∈ z, (x, z) ∈ (X, Z)}.

As a result, two separate symbolic datastores can be constructed based on the types of target actions within the training set. Constructing datastores in this manner is more effective than storing both types of actions in a single datastore, since it helps reduce noise during retrieval. Moreover, the type of the subsequent token can be determined from the grammar rules, enabling retrieval against a specific datastore and accelerating the retrieval process.
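The split into two datastores can be sketched as follows. The `SYNTACTIC_TOKENS` set here is a toy predicate; in the actual method the routing is determined by the action type (PREDRULE vs. GENERATE), not by a hard-coded token list.

```python
# Sketch: route AST actions into separate syntactic/semantic datastores.

SYNTACTIC_TOKENS = {"Expr", "Call", "Assign"}  # illustrative only

def build_datastores(entries):
    """entries: iterable of (hidden_state, ast_token) pairs."""
    syntactic_ds, semantic_ds = [], []
    for h, z in entries:
        if z in SYNTACTIC_TOKENS:
            syntactic_ds.append((h, z))   # structural nodes
        else:
            semantic_ds.append((h, z))    # terminal/primitive nodes
    return syntactic_ds, semantic_ds

syn, sem = build_datastores([((0.1,), "Expr"), ((0.2,), "Call"),
                             ((0.3,), "text"), ((0.4,), "split")])
```

At inference time, only the datastore matching the grammatically legal next node type needs to be searched.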

Model Inference
The process of model inference can be divided into three main phases, as shown in Figure 2. First, the input x is fed into the trained model to generate the context representation h_i and compute the neural network distribution (p_NN). Then, we query the datastore with this representation to obtain the k-nearest-neighbor distribution (p_kNN). Finally, we combine the two distributions to predict the target token. In the following, we discuss three pivotal components of kNN-TRANX: syntax-constrained token-level retrieval, the meta-k network, and the confidence network.

Syntax-constrained token-level retrieval. Given the current context representation h_i generated by the model, we first calculate the ℓ2 distance d_j = ℓ2(h_i, ĥ_j) between h_i and each neighbor (ĥ_j, ẑ_j) in the datastore to determine the k nearest neighbors. Previous studies (Meng et al., 2022; Dai et al., 2023) have restricted the search space based on potential input-output patterns to improve decoding efficiency and reduce the impact of noise. However, a restricted search space may also exclude valuable knowledge.
To mitigate this problem, our approach features a syntax-aware retrieval capability. In contrast to conventional seq2seq models, our model generates ASTs, which allows us to incorporate symbolic knowledge and constrain the retrieved tokens by means of syntactic rules. During retrieval, we can determine the type of the next token based on the tokens already produced. If the next token is expected to represent syntactic information, we retrieve only from the syntactic datastore, which accelerates the retrieval process, and vice versa. Additionally, we can use the ASDL rules to exclude illegitimate tokens and thereby reduce the amount of irrelevant information. For example, as shown in Figure 2, our model has already generated the node Expr in the previous time step. Note that kNN-TRANX has retrieved both Call and alias nodes according to their distances. However, the child nodes of Expr do not include alias under the ASDL grammar. We therefore filter such nodes out of the search results, reducing noise without excluding valuable information.
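A minimal sketch of the filtering step, assuming a toy grammar table (`ALLOWED_CHILDREN` is illustrative and not the full Python ASDL):

```python
# Sketch of syntax-constrained retrieval: drop retrieved candidates whose
# node type is not a legal child of the current parent node.

ALLOWED_CHILDREN = {          # parent node -> legal child nodes (toy subset)
    "Expr": {"Call", "Name", "Attribute"},
}

def filter_neighbors(parent, neighbors):
    """neighbors: list of (distance, node); keep only grammatical ones."""
    legal = ALLOWED_CHILDREN.get(parent, set())
    return [(d, n) for d, n in neighbors if n in legal]

# `alias` is retrieved by distance but is not a legal child of Expr,
# mirroring the example in Figure 2, so it is filtered out.
retrieved = [(0.5, "Call"), (0.7, "alias"), (0.9, "Name")]
kept = filter_neighbors("Expr", retrieved)
```

The surviving neighbors then feed into the kNN distribution as usual.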
Meta-k network. We retrieve the k most relevant pieces of knowledge from the datastore and then map the distances between the query vector and the cached representations to probabilities. Empirically, the number of retrievals k is crucial in our model: too few retrievals may cause valuable information to be ignored, while too many may introduce noise. To alleviate this problem, we employ the meta-k network (Zheng et al., 2021) to dynamically evaluate the weight of the retrieved knowledge. In this way, the kNN model can expand the search space while reducing the impact of retrieval noise.
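The meta-k idea can be sketched as follows: instead of a fixed k, several candidate k values (powers of two up to the cap K, plus k = 0) are weighted by a learned scorer over distance and count features. Here `score` is a stand-in for the trained network f_β, and the toy scorer is arbitrary.

```python
import math

# Sketch of meta-k weighting: softmax over candidate k values,
# scored from retrieval features.

def meta_k_distribution(distances, counts, score):
    """distances/counts: top-K retrieval features; score: stand-in for f_beta."""
    K = len(distances)
    candidates = [0] + [2 ** j for j in range(int(math.log2(K)) + 1)]
    logits = [score(k, distances, counts) for k in candidates]
    m = max(logits)                       # numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {k: e / total for k, e in zip(candidates, exps)}

# Toy scorer (arbitrary): penalize larger k by the nearest distance.
toy_score = lambda k, d, c: -k * d[0]
w = meta_k_distribution([0.1, 0.2, 0.3, 0.4], [1, 1, 2, 2], toy_score)
```

Each candidate k then contributes its own kNN prediction, mixed by these weights.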
Confidence network. To utilize the knowledge in the symbolic datastores while maintaining the generalization ability of the neural network, we combine the two probability distributions by weighting. Previous studies have integrated the kNN and NN distributions using a fixed or learnable parameter λ to measure their respective weights. Khandelwal et al. (2021) combine the probability distributions using fixed weights, but this approach fails to dynamically adjust the weights based on the retrieval distance. Zhu et al. (2023) adjust the weights based on the retrieval distance, but overlook the confidence of the neural network output. Khandelwal et al. (2021) utilize the probability of the neural network and the retrieval distance as features to dynamically select the value of λ, which considers the confidence of the two distributions, but this approach neglects the correlation between them.
To address this issue, we propose a confidence network that estimates the confidence of both probabilities based on the kNN and NN distributions. In addition, we incorporate the weights of each k value as features into the confidence network, making the model aware of the number of tokens that require attention. As such, our model can capture the relationship between the two distributions. In cases where the two distributions conflict, we assign a higher weight to the distribution with higher confidence. The confidence λ_i is calculated from the feature vector W as λ_i = S(W), where S denotes the sigmoid activation function.
The final distribution for prediction z_i is calculated as a weighted sum of the two distributions with λ_i, i.e., p(z_i | x, ẑ_<i) = λ_i · p_kNN(z_i | x, ẑ_<i) + (1 − λ_i) · p_NN(z_i | x, ẑ_<i).
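The interpolation can be sketched as follows; `logit` stands in for the output of the confidence network, and the two toy distributions are made up:

```python
import math

# Sketch of the final mixing step: a sigmoid confidence lambda in (0, 1)
# interpolates between the kNN and NN distributions.

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def combine(p_knn, p_nn, logit):
    """Weighted sum: lambda * p_kNN + (1 - lambda) * p_NN."""
    lam = sigmoid(logit)
    vocab = set(p_knn) | set(p_nn)
    return {z: lam * p_knn.get(z, 0.0) + (1 - lam) * p_nn.get(z, 0.0)
            for z in vocab}

p_knn = {"split": 0.8, "all": 0.2}
p_nn = {"split": 0.4, "all": 0.6}
p = combine(p_knn, p_nn, logit=0.0)  # logit 0.0 gives lambda = 0.5
```

With equal confidence (λ = 0.5), the mixed distribution still prefers "split", since the kNN evidence for it outweighs the NN's slight preference for "all".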

Experiments
In this section, we first introduce the datasets and evaluation metrics. Then, we conduct a comprehensive study and analysis of code generation and model canonical incremental adaptation.

Datasets and evaluation metrics
We evaluate kNN-TRANX on two code generation datasets, the CoNaLa dataset (Yin et al., 2018) and the Django dataset (Oda et al., 2015). The CoNaLa dataset comprises 600k NL-code pairs collected from StackOverflow, of which 2,879 NL descriptions were rewritten by developers. This dataset contains questions that programmers encounter in real-world projects. The Django dataset consists of 18,805 examples, each comprising one line of Python code accompanied by corresponding comments. Compared to CoNaLa, approximately 70% of the examples in Django are simple tasks, such as variable assignment, method definition, and exception handling, that can easily be inferred from the corresponding NL. We employ BLEU (Papineni et al., 2002), CodeBLEU (Ren et al., 2020), and exact match (EM) metrics to assess performance.

Code Generation
Implementation details. We use BertranX as the base seq2tree model in our experiments, trained on the annotated data and 100k mined data.
To expedite the implementation, we leverage kNN-box (Zhu et al., 2023), an open-source toolkit for building kNN-MT, to implement kNN-TRANX. As explained in Section 3, kNN-TRANX creates two datastores. Due to the considerable difference in vocabulary size between the two datastores, we use separate settings for the syntactic and semantic datastores. For the syntactic datastore, we set the upper limit K_rule to 4 to account for its limited token variety; for the semantic datastore, we set K_pri to 64. To train the meta-k network and the confidence network, we employ the AdamW optimizer with a learning rate of 3e-4. To accelerate datastore retrieval, we use the FAISS library (Johnson et al., 2019) for similarity search. All experiments are performed on a single NVIDIA 2080Ti GPU.

Baselines. We compare kNN-TRANX against five state-of-the-art code generation models.
• TRANX (Yin and Neubig, 2018) is a seq2tree model consisting of a bidirectional LSTM encoder that learns the semantic representation and a decoder that outputs a sequence of actions for constructing the tree.

Analysis. We conduct an ablation study of kNN-TRANX, as shown in Table 3. The experimental results demonstrate that the retrieval filtering method can significantly enhance the performance of the code generation models. For combining the NN and kNN distributions, we compare against the retrieval-distance-based weighting method implemented in adaptive kNN-box (Zhu et al., 2023). Experimental results show that our approach, which considers both distributions comprehensively, achieves better results. We also evaluate the effect of placing both types of actions in the same datastore. The result shows that this significantly reduces the quality of the generated code by introducing a substantial amount of noise. Moreover, we analyze the effect of K_rule and K_pri on the experimental results, as presented in Table 4.
The results align with our conjecture that retrieving a small number of syntactic nearest neighbors and a relatively large number of semantic entries leads to better code generation. Furthermore, Figure 5 shows that increasing the size of the datastores can improve the quality of the generated code. Additionally, Figures 5(a) and 5(d) show that even a small amount of stored syntactic knowledge can achieve high-quality code generation. In contrast, the quality of the generated code keeps improving as the size of the semantic datastore increases. We believe this is because semantic knowledge is relatively scarce compared to syntactic knowledge, as demonstrated by Figure 3.

Model Canonical Incremental Adaptation
Implementation details. Although recently proposed large-scale models such as Codex (Chen et al., 2021) and GPT-4 (OpenAI, 2023) have demonstrated powerful code generation capabilities, one of their challenges is that the trained models are difficult to update. Models need to be continuously trained via incremental learning, which consumes huge computing resources. To make matters worse, incremental learning with new data can lead to catastrophic forgetting (Li and Hoiem, 2016). To address this, we validate our model using incremental learning by only updating the datastore, without adapting the model parameters. We use BertranX† as our base model to simulate the scenario for incremental learning. Then, we update the datastore using {0k, 10k, 50k, 95k} mined data from the CoNaLa dataset, respectively. In our experiment, updating the datastore amounts to efficient fine-tuning: compared to fine-tuning all parameters (which requires training 122,205k parameters), our method only needs to train 17k parameters, greatly reducing the GPU memory consumption required for training while achieving results comparable to fine-tuning all parameters.
It is worth noting that, compared to previous kNN generative models, our model includes two datastores, for syntactic and semantic information respectively. There are 109 kinds of tokens in the syntactic datastore, with 1,603k corresponding datastore entries. However, the types of tokens in the semantic datastore may be unbounded, depending on the actual vocabulary, so the knowledge corresponding to each vocabulary item is relatively scarce, with only 518k entries in this experiment. Figure 5 confirms that providing even a small amount of syntactic knowledge can improve the performance of the model. Therefore, we consider two ways to update the datastores in our experiments: updating both datastores, and updating the semantic datastore only.

Main results. We adopt the same evaluation metrics as for code generation. As shown in Table 5, without using an additional datastore, kNN-TRANX† already outperforms BertranX†. As the knowledge in the datastore is updated, kNN-TRANX improves on all three evaluation criteria. In terms of BLEU, kNN-TRANX† with 95k external data achieves performance comparable to that of the training-based BertranX. Furthermore, updating only the semantic datastore is also effective. We also provide two examples demonstrating how k-nearest-neighbor retrieval assists model decision-making in Appendix A.1. It should be noted that the CoNaLa dataset was obtained through data mining, and the majority of the mined NL-code pairs are irrelevant, which adds considerable noise to our retrieval. Therefore, we believe that kNN-TRANX can perform even better on more reliable data sources through incremental learning.

Related Work
Code generation. Code generation aims to generate target code from natural language to improve programmers' development efficiency. Yin and Neubig (2018) propose TRANX to generate ASTs instead of generating code snippets directly. Based on TRANX, Beau and Crabbé (2022) propose BertranX, which relies on a BERT encoder and a grammar-based decoder. Poesia et al. (2022) propose a framework for substantially improving the reliability of pre-trained models for code generation. Chen et al. (2022) propose CodeT, which leverages pre-trained language models to generate both code snippets and test cases, and selects the best code based on the number of test cases passed. Besides, researchers have used pre-training methods to incorporate more external knowledge into the model, effectively improving its performance on downstream tasks (Feng et al., 2020; Wang et al., 2021; Ahmad et al., 2021). Recently, large-scale pre-trained models have demonstrated remarkable capabilities in code generation (Li et al., 2023; Wang et al., 2023; Touvron et al., 2023).
Retrieval-augmented models. Parvez et al. (2021) propose REDCODER, which uses retrieved natural language and code to improve the performance of code generation and code summarization models.
Recently, Khandelwal et al. (2021) propose kNN-MT, a non-parametric paradigm that constructs a datastore using decoder representations as keys and the corresponding target tokens as values. Generation is performed by retrieving the top k neighbors as the result. Based on this paradigm, a number of optimization methods have been proposed. Zheng et al. (2021) use a meta-k network to dynamically adjust the retrieval weights. Jiang et al. (2022) improve the robustness of kNN-MT in terms of both model structure and training methods. Meng et al. (2022) propose Fast kNN-MT, which improves retrieval efficiency by constructing multiple smaller datastores.

Conclusion
In this paper, we propose kNN-TRANX. By providing syntactic and semantic datastores for the seq2tree model, we are able to outperform the baselines. In addition, we can provide more knowledge to the model by switching the datastores without fine-tuning the neural network. Experimental results show that kNN-TRANX exhibits competitive performance against learning-based methods through incremental learning. In the future, we plan to construct smaller and more fine-grained syntactic datastores to reduce the search space of the model and accelerate model inference.

Limitations
Although our proposed approach can enhance the generalizability of the seq2tree model and enable rapid updates by switching datastores, incorporating extra datastores necessitates a similarity search at each time step of inference. Even though only one of the two datastores needs to be retrieved, the inference time of the model may still increase considerably (cf. Appendix A.2). Furthermore, the additional datastores consume storage space, which may grow with the training data. Although previous work has proposed pruning approaches to reduce datastore content, pruning also leads to model performance degradation.

Figure 2 :
Figure 2: Workflow of kNN-TRANX. The meta-k network dynamically evaluates the weight of the retrieved knowledge: it considers a range of values smaller than an upper bound K, instead of using a fixed value of k. Typically the range is set as S = {0, 1, 2, 4, ..., 2^⌊log2 K⌋}. To evaluate the weight of each of the values, we use the distance d_j and the count of distinct values among the top j neighbors, c_j, as features, and obtain normalized weights via p_β(k) = softmax(f_β([d_1, ..., d_K; c_1, ..., c_K])), where f_β(·) denotes the meta-k network. The kNN prediction is then obtained as p_kNN(z_i | x, ẑ_<i) = Σ_{k_r ∈ S} p_β(k_r) · p_{k_r}NN(z_i | x, ẑ_<i), where p_{k_r}NN indicates the k_r-nearest-neighbor prediction calculated as in Equation (1).

Figure 4 :
Figure 4: The syntax match score of the generated code on the CoNaLa dataset.

Figure 5 :
Figure 5: The impact of datastore size on the BLEU and CodeBLEU scores for the CoNaLa and Django datasets. We use three strategies to reduce the datastore size: reducing the stored syntactic knowledge, the semantic knowledge, and both types of knowledge.

Table 1 :
An example of code generated by kNN-MT.

Table 2 :
Comparative results of models on the CoNaLa and Django test sets.

Table 3 :
Ablation study of different strategies and networks on the CoNaLa and Django datasets.

Table 5 :
The results of model canonical incremental adaptation. BertranX† is trained on 2k cleaned and 5k mined data. kNN-TRANX† is built on top of BertranX† with a datastore of the same size. When Only Semantics is set to False, both datastores are updated simultaneously; True means only the semantic datastore is updated. External data refers to the quantity of training data used to update the datastore.