Analysis of Tree-Structured Architectures for Code Generation

Code generation is the task of generating code snippets from input user specifications in natural language. Leveraging the linguistically-motivated hierarchical structure of the input can benefit code generation, especially since the specifications are complex sentences containing multiple variables and operations over various data structures. Moreover, recent advances in Transformer architectures have led to improved performance with tree-to-tree style generation for other seq2seq tasks, e.g., machine translation. Hence, we present an empirical analysis of the significance of input parse trees for code generation. We run text-to-tree, linearized tree-to-tree, and structured tree-to-tree models, using constituency-based parse trees as input, where the target is the Abstract Syntax Tree (AST) of the code. We evaluate our models on the Python-based code generation dataset CoNaLa and a semantic parsing dataset ATIS. We find that constituency trees encoded using a structure-aware model improve performance for both datasets. We also provide an analysis of those aspects of the input parse trees which are most impactful. For instance, we find that structure-aware encodings are better at modelling inputs with multiple variables and capturing long-range dependencies for code generation.


Introduction
Code generation is the task of converting input user specifications written in natural language (NL) to code snippets in a target programming language. It is a task-driven variant of semantic parsing, which translates natural language input to a formal machine-executable representation. Recent works have utilized the Abstract Syntax Tree (AST), which is the syntactic tree representation of the target source code, to generate better code snippets (Yin and Neubig, 2017, 2018; Sun et al., 2020; Rabinovich et al., 2017). The use of ASTs has achieved strong results, but there has been relatively little work on utilizing the parse trees of the NL input. Constituency or dependency trees representing the syntactic structure of the input can be leveraged to perform sub-tree alignment with the corresponding AST of the target code and benefit the downstream task. Hence, in this paper, we present several tree-to-tree models for the code generation task that convert the parse tree representation of the NL input to the AST representation of the target source code. First, we base our model on the Transformer architecture (Vaswani et al., 2017). However, the standard Transformer is not designed to preserve the tree structure of the input parse trees. Hence, to better encode the trees, we modify a structure-aware Tree Transformer model (Nguyen et al., 2020) for the tree-to-tree code generation task. Because of space constraints, we focus on constituency-based parse trees in this paper. Moreover, as pointed out by Nguyen et al. (2020), there is little evidence of constituency structures being learned implicitly in language models, whereas dependency structures have been shown to be implicitly embedded in models like BERT (Devlin et al., 2019; Hewitt and Manning, 2019). We evaluate our models on the CoNaLa dataset (Yin et al., 2018) and find that incorporating constituency parse trees in the input using structure-aware encoders improves the quality of generated code.
We further evaluate our models on the ATIS dataset (Hemphill et al., 1990), which translates natural language sentences into their lambda calculus logical forms and show that a structure-aware Transformer significantly improves performance over a standard Transformer.
We also focus on analyzing the input parse trees to find the aspects that benefit code generation. Our analysis comprises ablation experiments on our proposed structure-aware model and pattern analysis of the output from different models with respect to the characteristics of the input natural language specification. Specifically, we analyze the variation in performance with the presence of user-defined identifiers and variable entities in the input sentence, and with the complexity of input trees. We find that the structure-aware model improves performance when such identifiers and variables are present towards the end of the input sentences and when the input sentences are short.

Related Work
Code Generation. Code generation for general-purpose programming languages is a recent phenomenon; earlier works focused on domain-specific languages (Gulwani and Marron, 2014; Raza et al., 2015). Recent works have mainly applied sequence-to-tree models for code generation, with the tree being the AST of the target source code (Dong and Lapata, 2016; Yin and Neubig, 2017; Rabinovich et al., 2017; Yin and Neubig, 2018, 2019; Shin et al., 2019; Xu et al., 2020; Sun et al., 2020). While the use of ASTs for code generation has been substantially studied, to the best of our knowledge, the use of input parse trees for code generation is largely unexplored.
Semantic Parsing. Several methods have been proposed to parse natural language sentences to formal meaning representations like lambda calculus (Wong and Mooney, 2007), Alexa Meaning Representation Language (Kumar et al., 2017), Abstract Meaning Representations (AMR) (Banarescu et al., 2013), structured queries (Iyer et al., 2017; Yin and Neubig, 2018), etc. Many of the recent works on semantic parsing have focused on sequence-to-tree models that leverage tree structures like the AST as the intermediate representation for the target meaning representation (Yin and Neubig, 2018; Sun et al., 2020). Code generation can also be regarded as a form of semantic parsing where the target meaning representation is a programming language snippet.
Source Trees and Structure-Aware Models. Several structure-aware tree encoders have also been proposed to process the source trees (Chen et al., 2017a,b; Yang et al., 2017; Nguyen et al., 2020). While many of the tree encoders depend on a recurrent mechanism and hence are not parallelizable, Nguyen et al. (2020) propose a Transformer-based structure-aware model that is parallelizable. Concurrently, several tree-to-seq models have been proposed that leverage source syntactic trees for NLP tasks like machine translation (Eriguchi et al., 2016; Yang et al., 2017; Eriguchi et al., 2017; Chen et al., 2017b) and sentence modeling (Shi et al., 2018). There has been some work on leveraging the hybrid tree, a joint tree-like representation of the NL sentence and the corresponding meaning representation, for semantic parsing (Lu et al., 2008; Jie and Lu, 2018), while Harer et al. (2019) made use of source tree structures for code correction. However, the same is unexplored in the context of code generation. We study the use of tree-to-tree models for code generation and provide an analysis of their various modules.

Baseline (Sequence-to-Tree Model)
We use a standard Transformer model (Vaswani et al., 2017) as our natural language-to-code baseline. We build a sequence-to-tree model with a regular Transformer encoder and decoder. The encoder maps the source sequence $x = x_1, x_2, \ldots, x_n$ to its vector representation $\tilde{x} = \tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n$, which is passed into the decoder. At each time step $t$, we linearize the AST generated up to time step $t-1$, i.e., the partial AST $y_{<t}$, and concatenate its embedding with the embedding of the corresponding parent actions, following Yin and Neubig (2017). The decoder takes this partial AST vector representation and the source vector representation $\tilde{x}$ from the encoder as input and expands the frontier non-terminal node of the partial AST. Here, ASTs are linearized by pre-order depth-first traversal, and the expansion of the AST at each time step is constrained by the grammar rules of the underlying programming language. We adopt the ASDL grammar and transition system (Yin and Neubig, 2018) that decomposes the production of an AST into a sequence of actions. At each time step $t$, the action $a_t$ can be of 3 types (see Appendix for details). Given the input specification $x$, the probability of generating an AST $y$ can be expressed in terms of the probabilities of generating the corresponding actions: $p(y \mid x) = \prod_t p(a_t \mid x, y_{<t})$. Here, $a_t$ is the action at time step $t$ and $y_{<t}$ is the partial AST generated before time step $t$. We also use a pointer network (Vinyals et al., 2015) to allow the model to copy relevant entities from the input sequence while generating a terminal AST node.
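As a rough illustration of the factorization above, the probability of an AST is the product of its per-step action probabilities, which is usually computed as a sum of log-probabilities. The sketch below is only a numerical illustration: the function name and the step probabilities are hypothetical and do not come from the trained model.

```python
import math


def ast_log_prob(action_probs):
    """Return log p(y | x) as the sum of log p(a_t | x, y_<t) over all steps.

    `action_probs` holds one probability per action in the derivation;
    the values used below are made up for illustration.
    """
    if any(p <= 0.0 or p > 1.0 for p in action_probs):
        raise ValueError("each action probability must lie in (0, 1]")
    return sum(math.log(p) for p in action_probs)


# Hypothetical probabilities for a derivation made of three actions.
step_probs = [0.9, 0.8, 0.5]
log_p = ast_log_prob(step_probs)
p = math.exp(log_p)  # equals 0.9 * 0.8 * 0.5 up to floating-point error
print(round(p, 6))
```

Working in log space avoids numerical underflow for long action sequences, which matters when derivations run to many time steps.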

Linearized Tree-to-Tree Model
Here, we use the same model architecture as our baseline (see Sec. 3.1) but we replace the input NL sequence with its linearized constituency-based parse tree. Constituency trees aim to describe the syntactic structure of a sentence by dividing it into sub-phrases. As discussed in Sec. 1, this structural information can promote alignment between source and target sub-trees (AST), thereby improving the downstream generation task. In our model, constituency trees are linearized by pre-order depth-first traversal (see Fig. 5 in Appendix). Our output is the AST representation of the code.
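The pre-order depth-first linearization can be sketched as follows. This is a minimal sketch under assumptions: the nested-tuple tree encoding, the function name, and the example parse are hypothetical, and the exact token format used in our experiments (for instance, whether bracket tokens are emitted) is not reproduced here.

```python
def linearize(tree):
    """Pre-order depth-first traversal of a constituency tree.

    A tree is either a word (str) or a tuple (label, child_1, ..., child_k).
    The parent label is emitted before its children, left to right.
    """
    if isinstance(tree, str):  # leaf: a word of the sentence
        return [tree]
    label, *children = tree
    tokens = [label]
    for child in children:
        tokens.extend(linearize(child))
    return tokens


# Hypothetical parse of the intent "sort the list".
example_tree = ("S", ("VP", ("VB", "sort"), ("NP", ("DT", "the"), ("NN", "list"))))
linear_tokens = linearize(example_tree)
print(linear_tokens)
# ['S', 'VP', 'VB', 'sort', 'NP', 'DT', 'the', 'NN', 'list']
```

The resulting flat sequence can be fed to a standard Transformer encoder, but the parent-child relations are no longer explicit in the input.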

Structured Tree-to-Tree Model with Hierarchical Accumulation
A standard Transformer encoder (see Sec. 3.1) is not designed to process the structural information of input parse trees. On the other hand, many tree-based models have been proposed in the past to process the structural information (Chen et al., 2018; Eriguchi et al., 2016; Rao et al., 2019), but most of them are based on a recurrent mechanism and hence are not parallelizable like Transformer-based models. This observation motivated us to build a Transformer-based structure-aware tree-to-tree model. In this paper, we adapt Tree Transformer, an attention-based tree-to-tree model with hierarchical accumulation proposed by Nguyen et al. (2020), for code generation. Hierarchical accumulation aims to encode the tree by performing a series of operations including upward cumulative-average and weighted aggregation on the interpolated tree matrix. Furthermore, the model includes hierarchical embeddings to induce biases that reflect hierarchy within each branch of the tree and among the siblings within a subtree. Finally, subtree masking is used to filter out irrelevant information during the upward cumulative-average and weighted aggregation operations. In this model, our target is identical to that of our baseline, i.e., the AST representation of the source code, which is later converted to source code with the help of the transition system. We linearize the AST in the same fashion as our baseline, concatenate it with the corresponding parent actions vector in the hidden dimension, and pass it into the decoder along with the leaves and nodes vector representations from the encoder. We also add a pointer network (Vinyals et al., 2015) to allow the model to copy from the leaves of the input parse tree while generating a terminal AST node.
Following previous works, we use corpus-level BLEU-4 and exact-match accuracy metrics for evaluation on the CoNaLa and ATIS datasets respectively. See Appendix for details on training and inference.
Table 1 shows BLEU scores from our experiments on the CoNaLa dataset. Our baseline Transformer model outperforms the previous state-of-the-art LSTM-based model (Xu et al., 2020) by 0.93 BLEU points. The linearized constituency tree-to-tree model lowers the BLEU score compared to our baseline. However, the structured constituency tree-to-tree model significantly outperforms the baseline by 2.17 BLEU points (p<0.01) and the linearized constituency tree-to-tree model by 2.59 BLEU points (p<0.01). It also outperforms the baseline model by 8% in terms of human-evaluated code quality (see Appendix). This suggests that structured inputs can provide important cues for generating high-quality code snippets through structure-aware encodings. This information is lost when trees are converted to linearized inputs, thereby leading to a drop in performance over the text-to-tree baseline. It is important to note that our models have a significantly higher number of parameters than that of Xu et al. (2020) (roughly 44-49M for our models vs. 2M for their model), as their model consists of only one LSTM layer. However, we ran the LSTM model with a higher number of parameters by increasing the embedding dimensions and the number of hidden layers in the encoder LSTM, and we did not see a significant improvement in BLEU score. This indicates that the superior performance of our models is primarily due to the model architecture rather than the increased count of learnable parameters.
Table 2 shows accuracy scores from our experiments on the ATIS dataset. Our baseline model performs significantly worse than the LSTM-based TRANX (Yin and Neubig, 2018), and the accuracy drops further with the linearized constituency tree-to-tree model. Our structured model, however, performs significantly better than the aforementioned models (p<0.01), a trend we observed in the results on the CoNaLa dataset as well. The accuracy of the structured model is still slightly worse than that of the TreeGen model (Sun et al., 2020). This might be because the TreeGen model includes an AST reader, which encodes the partial code tree generated in previous time steps using structure-aware tree convolutions during generation at each time step. Our model lacks such a module for the target AST. Nonetheless, the overall trend among our three models suggests that parse trees benefit semantic parsing as long as their structure is incorporated in the model. If this extra hierarchical information is instead encoded in a linear fashion, it contributes negatively to semantic parsing (row 4 in Table 2). Overall, our results also provide motivation for joint modelling of both input and output parse trees for semantic parsing.

Ablation Tests
We ablate our best model to understand the effect of the various modules in Tree Transformer on the target task (see Table 3). First, we remove subtree masking, so that during hierarchical accumulation each node of the tree can also attend over nodes that are not in the subtree rooted at that node. Second, we remove the use of hierarchical embeddings in our model. On the CoNaLa dataset, both experiments have a negative impact on the model's performance. This suggests that subtree masking is a crucial mechanism for structure-aware encoding, i.e., for each node in the parse tree, only the relevant information within the subtree rooted at that node is useful. Comparing the two ablations, the results show that subtree masking is more important than hierarchical embeddings.
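To make the notion of subtree masking concrete, the sketch below builds a boolean matrix that records, for every node, which nodes lie in the subtree rooted at it. This is a simplified illustration under assumptions, not the masking code of Tree Transformer: the `children` dictionary, the integer node ids, and the function name are hypothetical, and how the mask is applied inside the accumulation and attention steps is not shown.

```python
def subtree_mask(children, num_nodes):
    """Return mask[i][j] == True when node j is inside the subtree rooted at i.

    `children` maps a node id to the list of its child ids; leaves may be
    omitted from the mapping. A node always belongs to its own subtree.
    """
    def descendants(node):
        found = {node}
        for child in children.get(node, []):
            found |= descendants(child)
        return found

    mask = []
    for i in range(num_nodes):
        inside = descendants(i)
        mask.append([j in inside for j in range(num_nodes)])
    return mask


# Hypothetical tree: node 0 is the root with children 1 and 2;
# node 1 has children 3 and 4.
toy_children = {0: [1, 2], 1: [3, 4]}
toy_mask = subtree_mask(toy_children, 5)
print(toy_mask[1])  # [False, True, False, True, True]
```

Removing subtree masking, as in the first ablation, corresponds to replacing this matrix with one that is `True` everywhere.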

Pattern Analysis
Following Yin and Neubig (2017) and Xu et al. (2020), we next analyze the input intents and the corresponding code generated by the baseline model and the structured model (on a subset of test samples of the CoNaLa dataset) to find recurring patterns. First, we observe that input specifications in the CoNaLa dataset contain quoted strings, which often occur as user-defined identifiers or strings in the generated code as well. We find that when these quoted strings appear towards the end of the input sentence, the difference in the quality of the output code of the two models in terms of average BLEU score is higher than usual, i.e., more than 5 BLEU points (row 2 of Table 4). Moreover, when the input sentence contains two or more quoted strings, the baseline model often fails to capture the semantic relationship between those strings in the output code, resulting in lower BLEU scores (row 3 of Table 4). However, in the absence of any quoted strings, the structure-aware model does better than the baseline by only 2 BLEU points (row 4 of Table 4). This shows that the structured input, when paired with a structure-aware encoder, helps capture dependencies between semantic units. Fig. 1 and Fig. 2 provide examples of both these scenarios and Table 4 compares the average BLEU scores. Similarly, we notice that there are variable entities like city, airline, airport, time, etc. in the input specifications, which also appear in the corresponding outputs in the ATIS dataset. We find that our structure-aware model outperforms the baseline model by 12.57 points when such variables occur at the end of the input sentence (see row 3 in Table 6), suggesting that the model is able to capture long-term dependencies (see Fig. 8).
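One way to group intents by the quoted-string patterns discussed above is sketched below. This is a hypothetical helper, not the script used for Table 4: the regular expression, the function name, and in particular the `tail_fraction` threshold that decides what counts as "towards the end" are all assumptions made for illustration.

```python
import re

# Matches spans in backticks, single quotes, or double quotes.
# Assumption: apostrophes inside ordinary words (e.g. "don't") are not handled.
QUOTED = re.compile(r"`[^`]*`|'[^']*'|\"[^\"]*\"")


def quoted_string_profile(intent, tail_fraction=0.33):
    """Count quoted strings and flag whether the last one ends near the end.

    "Near the end" is taken to mean that the last quoted span finishes within
    the final `tail_fraction` of the characters; this cutoff is hypothetical.
    """
    spans = [m.span() for m in QUOTED.finditer(intent)]
    cutoff = len(intent) * (1 - tail_fraction)
    return {
        "count": len(spans),
        "at_end": bool(spans) and spans[-1][1] >= cutoff,
    }


two_quotes = quoted_string_profile("sort list `xs` by key 'name'")
no_quotes = quoted_string_profile("reverse a list")
print(two_quotes, no_quotes)
```

With such a profile per test example, the per-group average BLEU of two models can be compared for intents with a trailing quoted string, with two or more quoted strings, or with none.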

Comparison Based on Input Complexity
We compare the performance of the baseline text-to-tree and structured tree-to-tree models w.r.t. input complexity, i.e., the length of input sentences and the height of the input parse trees, in the test set of the CoNaLa dataset. The variation of mean BLEU scores w.r.t. the length of the input sentence and the height of the input trees is shown in Figures 3 and 4 respectively. In both figures, we observe that the structure-aware model outperforms the baseline by wider margins for inputs of shorter length and height. There are smaller but consistent improvements for inputs of medium complexity. The margins are largest for samples of high complexity, but this observation is supported by relatively few data points (see scatter plots in Appendix). From these results, we infer that the structured model is particularly helpful for short input sentences or parse trees in code generation. Similarly, the structured model significantly outperforms the baseline for shorter intent lengths in the ATIS dataset. However, we did not find any clear linkage between the height of the input tree and the performance of our models on the ATIS dataset (see Figures 9 and 10 in Appendix).

Conclusion
We experimented with models to utilize input constituency parse trees for code generation and semantic parsing. Our tree-to-tree model significantly outperforms other approaches for code generation and is competitive for semantic parsing. We find that the hierarchical structure of parse trees helps the structure-aware model capture semantic relationships between user-defined identifiers and variable entities in the input intent.

We use a standard Transformer model (Vaswani et al., 2017) as our natural language-to-code baseline. We build a sequence-to-tree model with a regular Transformer encoder and decoder. The encoder maps the source sequence $x = x_1, x_2, \ldots, x_n$ to its vector representation $\tilde{x} = \tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n$, which is passed into the decoder. At each time step $t$, we linearize the AST generated up to time step $t-1$, i.e., the partial AST $y_{<t}$, and concatenate its embedding with the embedding of the corresponding parent actions, following Yin and Neubig (2017). The decoder takes this partial AST vector representation and the source vector representation $\tilde{x}$ from the encoder as input and expands the frontier non-terminal node of the partial AST. Here, the ASTs are linearized by pre-order depth-first traversal, and the expansion of the AST at each time step is constrained by the grammar rules of the underlying programming language. We adopt the ASDL grammar for Python and the transition system (Yin and Neubig, 2018) that decomposes the production of an AST into a sequence of actions. At each time step $t$, the action $a_t$ can be of 3 types:
• ApplyRule Action: Applies production rule R to the partial AST.
• Reduce Action: Denotes the completion of a field with optional or multiple cardinalities.
• GenToken Action: Expands a terminal node by generating a leaf token.
Given the input specification $x$, the probability of generating an AST $y$ can be expressed in terms of the probabilities of generating the corresponding actions: $p(y \mid x) = \prod_t p(a_t \mid x, y_{<t})$. Here, $a_t$ is the action at time step $t$ and $y_{<t}$ is the partial AST generated before time step $t$. We also use a pointer network (Vinyals et al., 2015) to allow the model to copy relevant entities from the input sequence while generating a terminal AST node with the GenToken action.
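The three action types can be illustrated by replaying an action sequence into a tree. The sketch below is a heavily simplified toy, not the ASDL transition system of Yin and Neubig (2018): the two-constructor `TOY_GRAMMAR`, the `Node` class, the tuple encoding of actions, and the single-token treatment of GenToken are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


# Hypothetical toy grammar: constructor -> list of (field name, cardinality).
TOY_GRAMMAR = {
    "Call": [("func", "single"), ("args", "multiple")],
    "Name": [("id", "token")],
}


def replay(actions):
    """Rebuild a tree from (kind, argument) actions in pre-order."""
    pos = 0

    def expand():
        nonlocal pos
        kind, rule = actions[pos]
        pos += 1
        if kind != "ApplyRule":
            raise ValueError("expected ApplyRule, got " + kind)
        node = Node(rule)
        for _name, cardinality in TOY_GRAMMAR[rule]:
            if cardinality == "token":
                kind, token = actions[pos]
                pos += 1
                if kind != "GenToken":
                    raise ValueError("expected GenToken, got " + kind)
                node.children.append(Node(token))  # terminal leaf
            elif cardinality == "single":
                node.children.append(expand())
            else:  # "multiple": keep expanding until a Reduce closes the field
                while actions[pos][0] != "Reduce":
                    node.children.append(expand())
                pos += 1
        return node

    root = expand()
    if pos != len(actions):
        raise ValueError("unused actions remain")
    return root


def to_sexpr(node):
    if not node.children:
        return node.label
    return "(" + " ".join([node.label] + [to_sexpr(c) for c in node.children]) + ")"


# Hypothetical derivation of a call such as len(xs).
toy_actions = [
    ("ApplyRule", "Call"),
    ("ApplyRule", "Name"), ("GenToken", "len"),
    ("ApplyRule", "Name"), ("GenToken", "xs"),
    ("Reduce", None),
]
toy_ast = replay(toy_actions)
print(to_sexpr(toy_ast))  # (Call (Name len) (Name xs))
```

Because each action either expands the frontier node or closes a field, a decoder that predicts one action per time step can only produce trees that respect the grammar.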

B Experimental Setup
Training and Inference. All of our models have 6 encoder layers and 6 decoder layers. Our models are trained on GPUs using Google Colab, and each model takes 2-3 hours for a single run. We perform manual hyperparameter tuning, using 4-5 runs for each model. We tried learning rates within the range [1e-4, 5e-5]. After manual tuning, for the CoNaLa dataset, we trained all the models using a learning rate of 1e-4. For the ATIS dataset, we use a learning rate of 2e-4 for our baseline model, 5e-5 for our Linearized Tree-to-Tree model and 4e-5 for our structured model. We use a batch size of 64.

Models                         Wins   Losses   Tie
Structured T2T vs. Baseline    35%    27%      38%
Table 5: Results from human evaluation of generated code. Wins refers to the percentage of times the code generated by the structured tree-to-tree model was chosen over that from the baseline model; Losses refers to the percentage of times the baseline output was chosen.
We parse source text into constituency trees using Stanford CoreNLP parser (Manning et al., 2014). During inference, we use beam search with beam size of 30 for CoNaLa dataset and beam size of 1 for ATIS dataset to predict the output AST for a given natural language intent. We begin the beam search with one AST initialized with the root node and run until maximum time-step T or until we find K complete ASTs, where K is the beam-size. The maximum time-step is set to 200.
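The stopping rule of the beam search described above (stop at the maximum time step T or once K complete hypotheses are found) can be sketched as follows. This is a generic sketch, not our decoder: hypotheses here are plain token lists rather than partial ASTs, and `expand`, `is_complete`, and the toy distribution are hypothetical stand-ins for the model and the transition system.

```python
import math


def beam_search(expand, is_complete, start, beam_size, max_steps):
    """Keep the `beam_size` best hypotheses per step.

    Stops after `max_steps` or once `beam_size` complete hypotheses exist.
    `expand(seq)` returns (token, log_prob) continuations of a hypothesis.
    """
    beam = [(0.0, [start])]
    complete = []
    for _ in range(max_steps):
        candidates = []
        for score, seq in beam:
            for token, log_prob in expand(seq):
                candidates.append((score + log_prob, seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:
            if is_complete(cand[1]):
                complete.append(cand)
            else:
                beam.append(cand)
        if len(complete) >= beam_size or not beam:
            break
    return sorted(complete, key=lambda c: c[0], reverse=True)


# Toy model: every hypothesis can continue with "a" (p=0.6) or stop (p=0.4).
def toy_expand(seq):
    return [("a", math.log(0.6)), ("<end>", math.log(0.4))]


def toy_is_complete(seq):
    return seq[-1] == "<end>"


results = beam_search(toy_expand, toy_is_complete, "<root>", beam_size=2, max_steps=5)
best_sequence = results[0][1]
print(best_sequence)  # ['<root>', '<end>']
```

With a beam size of 1, as used for ATIS, this procedure reduces to greedy decoding.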

C.1 Human Evaluation
We also perform human evaluation of 100 samples from the CoNaLa dataset (see Table 5). The annotator (a non-coauthor graduate student, proficient in Python) was instructed to pick the better code output for a given input specification. The samples contained shuffled outputs from our baseline and structure-aware models. Outputs from the structure-aware model were preferred 35% of the time, while those from the baseline were preferred 27% of the time, and the rest of the instances ended in a tie over code quality.

C.2 Scatter Plots
We present two scatter plots for demonstrating the effect of input complexity on model performance for the CoNaLa dataset. Fig. 6 compares the sentence-level BLEU score of predictions from our baseline and structured models against the length of input sentences. Similarly, Fig. 7 compares the sentence-level BLEU score of predictions from our baseline and structured models against the height of input parse trees. Both of these comparisons are performed on the test set of the CoNaLa dataset.

C.3 Pattern Analysis
Following Yin and Neubig (2017) and Xu et al. (2020), we next analyze the input intents and the corresponding code generated by the baseline model and our best model, i.e., Structured Constituency Tree-to-Tree (on a small test sample), to find recurring patterns. We observe that the code generated by the structured model is significantly better for input intents with certain characteristics. First, we observe that input specifications in the CoNaLa dataset contain quoted strings (see placeholder str_0 in Fig. 1 of the main text). These strings often occur as user-defined identifiers in the input sentence and as strings in the generated code as well. We find that when these quoted strings appear towards the end of the input sentence, the difference in the quality of the output code of the two models in terms of average BLEU score is higher than usual, i.e., more than 5 BLEU points (see row 2 in Table 4). We also find that when the input sentence contains two or more quoted strings, the baseline model often fails to capture the semantic relationship between those strings in the output code, resulting in lower BLEU scores. However, the structure-aware model succeeds at the task, resulting in higher BLEU scores (see row 3 in Table 4). In the absence of any quoted strings, the structure-aware model does better than the baseline by only 2 BLEU points (see row 4 in Table 4). This shows that when the structured input is paired with a structure-aware encoder, it helps capture the semantic relationships between multiple units. Fig. 1 and Fig. 2 provide examples of both these scenarios, and Table 4 compares the average BLEU score of all the examples in the test set with 1) a quoted string at the end of the input sentence, 2) two or more quoted strings, and 3) zero quoted strings.
Similarly, we notice that there are variable entities like city (ci0 in Fig. 8), airline, airport, time, etc. in the input specifications, which also appear in the corresponding outputs in the ATIS dataset. Such variables are anonymized with identifiers of the same type, following Dong and Lapata (2016). We find that when such variables occur at the end of the input sentence, our structured model does significantly better than our baseline (see row 3 in Table 6), but the difference decreases in cases where the variables do not occur at the end of the input sentence (see row 2 in Table 6). Fig. 8 provides examples of the cases where variable entities occur at the end of the input sentences.
I: what airport is at ci0
BM: (lambda $0 e (and (airport $0) (loc:t $0) (to $0 ci0 ))) ✗
SM: ( lambda $0 e ( and ( airport $0 ) ( loc:t $0 ci0 ) ) ) ✓
I: look for a flight to ci0
BM: (lambda $0 e ( and ( flight $0 ) ( to $0 ci0 ) ( from $0 ci0 ))) ✗
SM: (lambda $0 e ( and ( flight $0 ) ( to $0 ci0 ) ) ) ✓
Figure 8: Outputs of our baseline and structured model for the ATIS dataset when variable entities appear at the end of input sentences.

C.4 Comparison Based on Input Complexity
We compare the performance of our two models with respect to the complexity of input sentences. We rank the complexity of an input sentence by its length and the height of the corresponding parse tree, i.e., the length of the longest path from the root node of the tree to its leaves. First, we do this analysis on the CoNaLa dataset. Fig. 3 presents a plot of the length of input sentences against mean BLEU scores of generated code snippets. Fig. 4 presents a plot of the height of input trees against mean BLEU scores of generated code snippets. Similarly, Figures 6 and 7 are scatter plots of sentence-level BLEU scores for generated code snippets vs. the length of input sentences and the height of input parse trees respectively. We can see in both Fig. 3 and 4 that there is a wider gap between the mean BLEU score of our structured model and that of the baseline in the beginning, with the structured model performing significantly better. The gap narrows in the middle and widens towards the end. However, as we can see from Figures 6 and 7, there are too few data points towards the end to draw any conclusion. From these observations, we infer that for code generation, the structured model is particularly helpful for short input sentences or for short input parse trees. Similarly, the structured model significantly outperforms the baseline for shorter intent lengths in the ATIS dataset. However, we did not find any clear linkage between the height of input parse trees and the performance of our models for semantic parsing with the ATIS dataset.
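The two complexity measures and the per-bucket averaging behind plots like Fig. 3 and Fig. 4 can be sketched as follows. This is an illustrative sketch under assumptions: the nested-tuple tree encoding, the function names, and the scores in the example are hypothetical, and the height here counts edges down to the word leaves, which may differ by a constant from the convention used for our figures.

```python
from collections import defaultdict


def tree_height(tree):
    """Length (in edges) of the longest path from the root to a leaf.

    A tree is a word (str) or a tuple (label, child_1, ..., child_k) with at
    least one child; a bare word has height 0.
    """
    if isinstance(tree, str):
        return 0
    _label, *children = tree
    return 1 + max(tree_height(child) for child in children)


def mean_score_by_key(keys, scores):
    """Average the scores of all examples that share the same key."""
    buckets = defaultdict(list)
    for key, score in zip(keys, scores):
        buckets[key].append(score)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Hypothetical parse of "sort the list": S -> VP -> NP -> DT -> "the".
example_tree = ("S", ("VP", ("VB", "sort"), ("NP", ("DT", "the"), ("NN", "list"))))
height = tree_height(example_tree)
print(height)  # 4

# Made-up sentence-level scores for three examples with heights 4, 4 and 5.
means = mean_score_by_key([4, 4, 5], [0.2, 0.4, 0.6])
print(means)
```

Running the same bucketing once for the baseline and once for the structured model gives the two curves whose gap is discussed above; sentence length can be used as the key in place of tree height.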