Exploring Dynamic Selection of Branch Expansion Orders for Code Generation

Due to the great potential in facilitating software development, code generation has attracted increasing attention recently. Generally, dominant models are Seq2Tree models, which convert the input natural language description into a sequence of tree-construction actions corresponding to the pre-order traversal of an Abstract Syntax Tree (AST). However, such a traversal order may not be suitable for handling all multi-branch nodes. In this paper, we propose to equip the Seq2Tree model with a context-based Branch Selector, which is able to dynamically determine optimal expansion orders of branches for multi-branch nodes. Particularly, since the selection of expansion orders is a non-differentiable multi-step operation, we optimize the selector through reinforcement learning, and formulate the reward function as the difference of model losses obtained through different expansion orders. Experimental results and in-depth analysis on several commonly-used datasets demonstrate the effectiveness and generality of our approach. We have released our code at https://github.com/DeepLearnXMU/CG-RL.


Introduction
Joint work with Pattern Recognition Center, WeChat AI, Tencent Inc, China. * Equal contribution. † Corresponding author.

Code generation aims to automatically generate a source code snippet from a natural language (NL) description, and it has attracted increasing attention recently due to its potential value in simplifying programming. Instead of modeling the abstract syntax tree (AST) of a code snippet directly, most methods for code generation convert the AST into a sequence of tree-construction actions. This allows the use of natural language generation (NLG) models, such as the widely-used encoder-decoder models, and has achieved great success (Ling et al., 2016; Dong and Lapata, 2016, 2018; Rabinovich et al., 2017; Yin and Neubig, 2017; Hayati et al., 2018; Sun et al., 2019, 2020; Wei et al., 2019; Shin et al., 2019; Xu et al., 2020; Xie et al., 2021). Specifically, an encoder is first used to learn word-level semantic representations of the input NL description. Then, a decoder outputs a sequence of tree-construction actions, with which the corresponding AST is generated through pre-order traversal. Finally, the generated AST is mapped into surface code via certain deterministic functions.

Generally, during the generation of dominant Seq2Tree models based on pre-order traversal, the branches of each multi-branch node are expanded in a left-to-right order. Figure 1 gives an example of the NL-to-Code conversion conducted by a Seq2Tree model. At timestep $t_1$, the model generates a multi-branch node using the action $a_1$, whose grammar contains three fields: type, name, and body. Thus, during the subsequent generation process, the model expands the node of $t_1$ to sequentially generate several branches in a left-to-right order, corresponding to the three fields of $a_1$. The left-to-right order is a convention that most humans follow when handling multi-branch nodes, which, however, may not be optimal for expanding branches.
Alternatively, if we first expand the field name to generate a branch, which informs us of the name 'e', it becomes easier to expand the field type with an 'Exception' branch, owing to the high co-occurrence of 'e' and 'Exception'.
To verify this conjecture, we choose TRANX (Yin and Neubig, 2018) to construct a variant, TRANX-R2L, which conducts depth-first generation in a right-to-left manner, and then compare their performance on the DJANGO dataset. We find that about 93.4% of ASTs contain multi-branch nodes, and 17.38% of AST nodes are multi-branch ones. Table 1 reports the experimental results. We can observe that 8.47% and 7.66% of multi-branch nodes can only be correctly handled by TRANX and TRANX-R2L, respectively. Therefore, we conclude that different multi-branch nodes have different optimal branch expansion orders, which can be dynamically selected based on context to improve the performance of conventional Seq2Tree models.

Table 1: The percentages of multi-branch nodes that can only be correctly handled by one of the two models. TRANX-R2L is a variant of TRANX that handles multi-branch nodes in a right-to-left order.

Model             Percentage
Only TRANX        8.47
Only TRANX-R2L    7.66
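The "only correctly handled by" statistics above amount to two set differences over the multi-branch nodes. The following is a minimal sketch of that bookkeeping, assuming each model's correctly predicted nodes are available as sets of node ids; the function name `exclusive_rates` and the toy ids are hypothetical and not taken from the released code.

```python
def exclusive_rates(correct_a, correct_b, total):
    """Return the percentages of nodes handled correctly by only one model.

    correct_a / correct_b: sets of multi-branch node ids that model A / B
    predicts correctly (hypothetical representation).
    total: number of multi-branch nodes under evaluation.
    """
    only_a = len(correct_a - correct_b)   # right for A, wrong for B
    only_b = len(correct_b - correct_a)   # right for B, wrong for A
    return 100.0 * only_a / total, 100.0 * only_b / total


# Toy data: 10 multi-branch nodes, ids 0..9.
l2r_correct = {0, 1, 2, 3, 5}   # left-to-right model
r2l_correct = {0, 1, 2, 4}      # right-to-left model
only_l2r, only_r2l = exclusive_rates(l2r_correct, r2l_correct, total=10)
```

Nodes that both models get right, or both get wrong, do not contribute to either percentage, which is why the two numbers isolate the effect of the expansion order.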
In this paper, we explore dynamic selection of branch expansion orders for code generation. Specifically, we propose to equip the conventional Seq2Tree model with a context-based Branch Selector, which dynamically quantifies the priorities of expanding different branches for multi-branch nodes during AST generation. However, such a non-differentiable multi-step operation poses a challenge to model training. To deal with this issue, we apply reinforcement learning to train the extended Seq2Tree model. In particular, we augment the conventional training objective with a reward function based on the difference in model training loss between different expansion orders of branches. In this way, the model is trained to determine optimal expansion orders of branches for multi-branch nodes, which in turn benefits AST generation.
To summarize, the major contributions of our work are three-fold:
• Through in-depth analysis, we point out that different orders of branch expansion are suitable for handling different multi-branch AST nodes, and thus dynamic selection of branch expansion orders has the potential to improve conventional Seq2Tree models.
• We propose to incorporate a context-based Branch Selector into the conventional Seq2Tree model and then employ reinforcement learning to train the extended model. To the best of our knowledge, our work is the first attempt to explore dynamic selection of branch expansion orders for code generation.
• Experimental results and in-depth analyses demonstrate the effectiveness and generality of our model on various datasets.

Background
As shown in Figure 1, the procedure of code generation can be decomposed into three stages. Based on the learned semantic representations of the input NL utterance, the dominant Seq2Tree model first outputs a sequence of abstract syntax description language (ASDL) grammar-based actions. These actions are then used to construct an AST following the pre-order traversal. Finally, the generated AST is mapped into surface code via a user-specified function AST_to_MR(·).
In the following subsections, we first describe the basic ASDL grammars of Seq2Tree models. Then, we introduce the details of TRANX (Yin and Neubig, 2018), which is selected as our basic model due to its extensive applications and competitive performance (Yin and Neubig, 2019; Shin et al., 2019; Xu et al., 2020).

ASDL Grammar
Formally, an ASDL grammar contains two components: types and constructors. The value of a type can be composite or primitive. As shown in the 'ActionSequence' and 'AST z' parts of Figure 1, a constructor specifies a language component of a particular type using its fields, e.g., ExceptHandler(expr? type, expr? name, stmt* body). Each field specifies the type of its child node and contains a cardinality (single, optional ?, or sequential *) indicating the number of child nodes it holds. For instance, expr? name denotes that the field name has an optional child node. A field with a composite type (e.g., expr) can be instantiated by constructors of the corresponding type, while a field with a primitive type (e.g., identifier) directly stores a token.
There are three kinds of ASDL grammar-based actions that can be used to generate the action sequence:
1) APPLYCONSTR[c]. Using this action, a constructor c is applied to the composite field of the parent node with the same type as c, expanding the field to generate a branch ending with an AST node. Here we refer to the field of the parent node as the frontier field.
2) REDUCE. It indicates the completion of generating branches for a field with optional or multiple cardinality.
3) GENTOKEN[v]. It expands a primitive frontier field to generate a token v.
Obviously, a constructor with multiple fields can produce multiple AST branches (see footnote 2), whose generation order has an important effect on the model performance, as previously mentioned.
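The grammar components described above can be illustrated with a small data-structure sketch. This is only an illustration under stated assumptions: the class names `Field` and `Constructor`, the helper `parse_field`, and the cardinality labels are hypothetical and are not the representation used by TRANX itself.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    name: str          # e.g. "type", "name", "body"
    type: str          # composite (e.g. "expr") or primitive (e.g. "identifier")
    cardinality: str   # "single", "optional" (?) or "multiple" (*)


@dataclass(frozen=True)
class Constructor:
    name: str
    fields: Tuple[Field, ...]

    def is_multi_branch(self) -> bool:
        # A constructor with more than one field yields several AST branches.
        return len(self.fields) > 1


def parse_field(spec: str) -> Field:
    """Parse a field spec such as 'expr? type' or 'stmt* body'."""
    type_part, name = spec.split()
    if type_part.endswith("?"):
        return Field(name, type_part[:-1], "optional")
    if type_part.endswith("*"):
        return Field(name, type_part[:-1], "multiple")
    return Field(name, type_part, "single")


# The constructor used as the running example in the text.
except_handler = Constructor(
    "ExceptHandler",
    tuple(parse_field(s) for s in ("expr? type", "expr? name", "stmt* body")),
)
```

With this representation, a multi-branch node is simply one whose constructor has more than one field, which is the case where the expansion order becomes a choice.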

Seq2Tree Model
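Similar to other NLG models, TRANX is trained to minimize the following objective function:

$$L_{mle}(a \mid x; \theta) = -\sum_{t=1}^{T} \log p(a_t \mid a_{<t}, x), \tag{1}$$

where $a_t$ is the $t$-th action, $T$ is the length of the action sequence, $\theta$ denotes the model parameters, and $p(a_t \mid a_{<t}, x)$ is modeled by an attentional encoder-decoder network.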
For an NL description $x = x_1, x_2, ..., x_N$, we use a BiLSTM encoder to learn its word-level hidden states. Likewise, the decoder is also an LSTM network. Formally, at timestep $t$, the temporary hidden state $h_t$ is updated as

$$h_t = f(E(a_{t-1}), s_{t-1}, p_t), \tag{2}$$

where $f(\cdot)$ denotes the LSTM update, $E(a_{t-1})$ is the embedding of the previous action $a_{t-1}$, $s_{t-1}$ is the previous decoder hidden state, and $p_t$ is a concatenated vector involving the embedding of the frontier field and the decoder hidden state for the parent node. Furthermore, the decoder hidden state $s_t$ is defined as

$$s_t = \tanh(W [h_t : c_t]), \tag{3}$$

where $[\cdot : \cdot]$ denotes concatenation, $c_t$ is the context vector produced from the encoder hidden states, and $W$ is a parameter matrix.

Footnote 2: We also note that a field with sequential cardinality will be expanded into multiple branches. However, in this work, we do not consider this scenario, which is left as future work.
Here, we calculate the probability of action $a_t$ according to the type of its frontier field:
• Composite. We adopt an APPLYCONSTR action to expand the field or a REDUCE action to complete the field. The probability of using APPLYCONSTR[c] is defined as follows:

$$p(a_t = \text{APPLYCONSTR}[c] \mid a_{<t}, x) = \mathrm{softmax}(E(c)^{\top} W s_t), \tag{4}$$

where $E(c)$ denotes the embedding of the constructor $c$.
• Primitive. We apply a GENTOKEN action to produce a token $v$, which is either generated from the vocabulary or copied from the input NL description. Formally, the probability of using GENTOKEN[v] can be decomposed into two parts:

$$p(a_t = \text{GENTOKEN}[v] \mid a_{<t}, x) = p(gen \mid a_{<t}, x)\, p_{gen}(v \mid a_{<t}, x) + (1 - p(gen \mid a_{<t}, x))\, p_{copy}(v \mid a_{<t}, x), \tag{5}$$

where $p(gen \mid a_{<t}, x)$ is modeled as $\mathrm{sigmoid}(W s_t)$.
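The generate-or-copy decomposition for primitive fields can be sketched as a two-way mixture gated by a sigmoid, as is common for pointer-style copy mechanisms. The mixture form, the function names, and the dictionary-based distributions below are assumptions made for illustration; a real implementation would operate on tensors.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gen_token_prob(token, gate_logit, p_vocab, p_copy):
    """Probability of emitting `token` at a primitive frontier field.

    gate_logit: scalar logit of the generate-vs-copy gate (hypothetical input).
    p_vocab:    dict token -> probability of generating it from the vocabulary.
    p_copy:     dict token -> probability of copying it from the NL input.
    """
    p_gen = sigmoid(gate_logit)
    return (p_gen * p_vocab.get(token, 0.0)
            + (1.0 - p_gen) * p_copy.get(token, 0.0))


# 'e' is both in the vocabulary and in the input; 'gzip' can only be copied.
p_vocab = {"e": 0.2, "Exception": 0.8}
p_copy = {"e": 0.6, "gzip": 0.4}
prob_e = gen_token_prob("e", gate_logit=0.0, p_vocab=p_vocab, p_copy=p_copy)
prob_gzip = gen_token_prob("gzip", gate_logit=0.0, p_vocab=p_vocab, p_copy=p_copy)
```

The copy path is what lets the decoder emit out-of-vocabulary identifiers that appear verbatim in the NL description.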
Please note that our proposed dynamic selection of branch expansion orders does not affect other aspects of the model.

Figure 2: The reinforced training of the extended TRANX model with the branch selector. We first feed the information of the field and the parent node into the branch selector. Then, from the policy probability distribution of the branch selector, we sample an order $o$ and infer an order $\hat{o}$. Finally, we calculate the reward based on the model loss difference between $o$ and $\hat{o}$, and use the gradients to update the parameters of the extended model.

Branch Selector
As described in Section 2.2, the action prediction at each timestep is mainly affected by its previous action, frontier field and the action of its parent node. Thus, it is reasonable to construct the branch selector determining optimal expansion orders of branches according to these three kinds of information.
Specifically, given a multi-branch node $n_t$ at timestep $t$, where the ASDL grammar of action $a_t$ contains $m$ fields $[f_1, f_2, ..., f_m]$, we feed the branch selector with three vectors: 1) $E(f_i)$: the embedding of field $f_i$, 2) $E(a_t)$: the embedding of action $a_t$, and 3) $s_t$: the decoder hidden state, and then calculate the priority score of expanding each field as follows:

$$g(f_i) = W_2(\tanh(W_1 [s_t : E(a_t) : E(f_i)])), \tag{6}$$

where $W_1 \in \mathbb{R}^{d_1 \times d_2}$ and $W_2 \in \mathbb{R}^{d_2 \times 1}$ are learnable parameters (we omit the bias term for clarity). Afterwards, we normalize the priority scores of all fields into a probability distribution:

$$p_{n_t}(f_i) = \frac{\exp(g(f_i))}{\sum_{j=1}^{m} \exp(g(f_j))}. \tag{7}$$

Based on the above probability distribution, we can sample $m$ times to form a branch expansion order $o = [f_{o_1}, ..., f_{o_m}]$, whose policy probability is computed as

$$\pi(o) = \prod_{i=1}^{m} p_{n_t}(f_{o_i}). \tag{8}$$

It is notable that during the sampling of $f_{o_i}$, we mask the previously sampled fields $f_{o_{<i}}$ to ensure that no field is sampled twice.
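The masked sampling of an expansion order can be sketched in plain Python as follows. The priority scores are taken as given inputs here. Renormalizing the distribution over the remaining fields after each draw is an assumption of this sketch, since the text only states that previously sampled fields are masked; the function names are hypothetical.

```python
import math
import random


def masked_probs(scores, remaining):
    """Softmax over the priority scores of the not-yet-expanded fields."""
    top = max(scores[i] for i in remaining)            # for numerical stability
    exps = [math.exp(scores[i] - top) for i in remaining]
    total = sum(exps)
    return [e / total for e in exps]


def sample_order(scores, rng):
    """Sample an expansion order without replacement; return (order, log pi(o))."""
    remaining = list(range(len(scores)))
    order, log_prob = [], 0.0
    while remaining:
        probs = masked_probs(scores, remaining)
        pos = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        log_prob += math.log(probs[pos])
        order.append(remaining.pop(pos))               # mask the chosen field
    return order, log_prob


def greedy_order(scores):
    """Order obtained by always expanding the highest-scoring remaining field."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


scores = [0.1, 2.0, -1.0]                              # toy scores for 3 fields
sampled, sampled_log_prob = sample_order(scores, random.Random(0))
```

Working with the log of the policy probability rather than the product itself avoids underflow and is the quantity a policy-gradient update needs.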

Training with Reinforcement Learning
During the generation of ASTs, with the above context-based branch selector, we deal with multi-branch nodes according to the dynamically determined order instead of the standard left-to-right order. However, two issues make model training challenging: the multi-step selection of expansion orders is non-differentiable, and the optimal expansion order is not known in advance. To deal with these issues, we introduce reinforcement learning to train the extended Seq2Tree model in an end-to-end way. Concretely, we first pre-train a conventional Seq2Tree model. Then, we employ self-critical training with a reward function that measures the loss difference between different branch expansion orders to train the extended Seq2Tree model.

Pre-training
It is known that a well-initialized network is very important for applying reinforcement learning (Kang et al., 2020). In this work, we require the model to automatically quantify effects of different branch expansion orders on the quality of the generated action sequences. Therefore, we expect that the model has the basic ability to generate action sequences in random order at the beginning. To do this, instead of using the pre-order traversal based action sequences, we use the randomly-organized action sequences to pre-train the Seq2Tree model.
Concretely, for each multi-branch node in an AST, we sample a branch expansion order from a uniform distribution, and then reorganize the corresponding actions according to the sampled order. We apply the same operation to all multi-branch nodes of the AST, forming a new training instance. Finally, we use the regenerated training instances to pre-train our model.
In this way, the pre-trained Seq2Tree model acquires the preliminary capability to make predictions in any order.
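The construction of randomly ordered pre-training instances can be sketched as a pre-order linearization in which the children of every multi-branch node are visited in a uniformly shuffled order. The nested-tuple tree format and the action strings are simplifications assumed for this sketch (REDUCE actions and field information are omitted), not the data format of the released code.

```python
import random


def linearize(node, rng=None):
    """Return the action sequence of a (sub)tree in depth-first order.

    node: either a string (token of a primitive field) or a pair
          (constructor_name, [child, ...]) with one child per field.
    rng:  if given, the branches of every multi-branch node are expanded in a
          uniformly sampled order; otherwise left-to-right (pre-order).
    """
    if isinstance(node, str):
        return [f"GenToken[{node}]"]
    name, children = node
    actions = [f"ApplyConstr[{name}]"]
    order = list(range(len(children)))
    if rng is not None and len(children) > 1:
        rng.shuffle(order)                     # uniform over all permutations
    for i in order:
        actions.extend(linearize(children[i], rng))
    return actions


# Toy tree loosely following the ExceptHandler example: type, name, body.
tree = ("ExceptHandler", [("Name", ["Exception"]), ("Name", ["e"]), ("Pass", [])])
left_to_right = linearize(tree)
shuffled = linearize(tree, random.Random(1))
```

Both sequences contain the same actions; only the order in which sibling subtrees appear differs, which is what the pre-trained model has to become robust to.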

Self-Critical Training
With the above initialized parameters, we then perform self-critical training (Rennie et al., 2017;Kang et al., 2020) to update the Seq2Tree model with branch selector.
Specifically, we train the extended Seq2Tree model by combining the MLE objective and the RL objective. Formally, given the training instance $(x, a)$, we first apply the sampling method described in Section 3.1 to all multi-branch nodes, reorganizing the initial action sequence $a$ to form a new action sequence $a_o$, and then define the model training objective as

$$L(\theta) = L_{mle}(a_o \mid x; \theta) + \frac{\lambda}{|N_{mb}|} \sum_{n \in N_{mb}} L_{rl}(o; \theta), \tag{9}$$

where $L_{mle}(*)$ denotes the conventional training objective defined in Equation 1, $L_{rl}(*)$ is the negative expected reward of branch expansion order $o$ for the multi-branch node $n$, $\lambda$ is a balancing hyper-parameter, $N_{mb}$ denotes the set of multi-branch nodes in the training instance, and $\theta$ denotes the parameter set of our enhanced model.
More specifically, $L_{rl}(*)$ is defined as

$$L_{rl}(o; \theta) = -\mathbb{E}_{o \sim \pi}[r(o)] \approx -r(o), \quad o \sim \pi, \tag{10}$$

where we approximate the expected reward with that of a single order $o$ sampled from the policy $\pi$. Inspired by successful applications of self-critical training in previous studies (Rennie et al., 2017; Kang et al., 2020), we propose the reward $r(*)$ to accurately measure the effect of any order on the model performance. As shown in Figure 2, we calculate the reward using two expansion orders of branches: one is $o$, sampled from the policy $\pi$, and the other is $\hat{o}$, inferred from the policy $\pi$ with the maximal generation probability. The reward $r(o)$ is based on the difference between the model losses obtained with $o$ and $\hat{o}$. Please note that we extend the standard reward function by setting a threshold $\eta$ to clip the reward, which prevents the network from being overconfident in the current expansion order of branches. Finally, we apply the REINFORCE algorithm (Williams, 1992) to compute the gradient:

$$\nabla_{\theta} L_{rl}(o; \theta) \approx -r(o)\, \nabla_{\theta} \log \pi(o).$$
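The self-critical objective can be sketched with scalar stand-ins for the losses. Two points are assumptions of this sketch rather than statements about the authors' implementation: the reward is taken as the loss of the greedy order minus the loss of the sampled order (so that a better sampled order is rewarded), and clipping is done symmetrically to the interval [-eta, eta]. Averaging the RL term over multi-branch nodes is likewise assumed, and all function names are hypothetical.

```python
def clipped_reward(loss_sampled, loss_greedy, eta):
    """Self-critical reward with the greedy order as baseline.

    Positive when the sampled order yields a lower model loss than the greedy
    order. Clipping to [-eta, eta] is an assumed form of the threshold.
    """
    raw = loss_greedy - loss_sampled
    return max(-eta, min(eta, raw))


def rl_surrogate_loss(log_prob_sampled, reward):
    """REINFORCE surrogate: its gradient is -reward * grad(log pi(o)),
    treating the reward as a constant."""
    return -reward * log_prob_sampled


def total_objective(mle_loss, rl_losses, lam):
    """MLE loss plus the lambda-weighted mean RL loss over multi-branch nodes."""
    if not rl_losses:                  # instance without multi-branch nodes
        return mle_loss
    return mle_loss + lam * sum(rl_losses) / len(rl_losses)


# Toy numbers: the sampled order is much better than the greedy one.
reward = clipped_reward(loss_sampled=1.0, loss_greedy=3.0, eta=0.5)
node_loss = rl_surrogate_loss(log_prob_sampled=-2.0, reward=reward)
objective = total_objective(mle_loss=4.0, rl_losses=[node_loss, 3.0], lam=1.0)
```

In an autodiff framework the reward would be detached from the computation graph, so that only the log-probability of the sampled order receives gradients from the RL term.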

Experiments
To investigate the effectiveness and generalizability of our model, we carry out experiments on several commonly-used datasets.

Datasets
Following previous studies (Yin and Neubig, 2018, 2019; Xu et al., 2020), we use the following four datasets:
• DJANGO (Oda et al., 2015). This dataset contains 18,805 lines of Python source code in total, which are extracted from the Django Web framework, and each line is paired with an NL description.
• ATIS. This dataset is a set of 5,410 inquiries of flight information, where the input of each example is an NL description and its corresponding output is a short piece of code in lambda calculus.
• GEO. It is a collection of 880 U.S. geographical questions, with meaning representations defined in lambda logical forms like ATIS.
• CONALA. It consists of 2,879 examples in total, each a manually annotated NL question paired with its Python solution from STACK OVERFLOW. Compared with DJANGO, the examples of CONALA cover real-world NL queries issued by programmers with diverse intents, and are significantly more difficult due to the broad coverage and high compositionality of the target meaning representations.

Baselines
To facilitate the descriptions of experimental results, we refer to the enhanced TRANX model as TRANX-RL. In addition to TRANX, we compare our enhanced model with several competitive models:
• TRANX (w/ pre-train). It is an enhanced version of TRANX with pre-training. We compare with it because our model involves a pre-training stage.
• COARSE2FINE (Dong and Lapata, 2018). This model adopts a two-stage decoding strategy to produce the action sequence. It first generates a rough sketch of its meaning, and then fills in missing details.
• TREEGEN (Sun et al., 2020). It introduces the attention mechanism of Transformer (Vaswani et al., 2017), and a novel AST reader to incorporate grammar and AST structures into the network.
• TRANX-R2L. It is a variant of the conventional TRANX model, which deals with multi-branch AST nodes in a right-to-left manner.
• TRANX-RAND. It is also a variant of the conventional TRANX model, dealing with multi-branch AST nodes in a random order.
• TRANX-RL (w/o pre-train). In this variant of TRANX-RL, we train our model from scratch. By doing so, we can discuss the effect of pre-training on our model training.
To ensure fair comparisons, we use the same experimental setup as TRANX (Yin and Neubig, 2018). Concretely, the sizes of action embedding, field embedding and hidden states are set to 128, 128 and 256, respectively. For decoding, the beam sizes for GEO, ATIS, DJANGO and CONALA are 5, 5, 15 and 15, respectively. We pre-train models for 10 epochs on all datasets. We set λ to 1.0 according to the model performance on validation sets. As in previous studies (Alvarez-Melis and Jaakkola, 2017; Yin and Neubig, 2018, 2019), we use exact matching accuracy (Acc) as the evaluation metric for all datasets. For CONALA, we use corpus-level BLEU as a complementary metric.

Main Results
Table 2 reports the main experimental results. Overall, our enhanced model outperforms baselines across all datasets. Moreover, we can draw the following conclusions:
First, our reimplemented TRANX model achieves performance comparable to the previously reported results of TRANX (Yin and Neubig, 2019). Therefore, we confirm that our reimplementation of TRANX is reliable.
Second, compared with TRANX, TRANX-R2L and TRANX-RAND, our TRANX-RL exhibits better performance. This result demonstrates the advantage of dynamically determining branch expansion orders in dealing with multi-branch AST nodes.
Third, the TRANX model with pre-training does not achieve better performance. In contrast, removing pre-training degrades the performance of our TRANX-RL model. This result is consistent with the conclusion of previous studies (Kang et al., 2020) that pre-training is very important for applying reinforcement learning.

Effects of the Number of Multi-branch Nodes
As done in related studies on other NLG tasks, such as machine translation (Bahdanau et al., 2015), we split each of the two relatively large datasets (DJANGO and ATIS) into groups according to the number of multi-branch AST nodes, and report the performance of various models on these groups. Tables 4 and 5 show the experimental results. On most groups, TRANX-RL achieves performance better than or equal to that of the other models. Therefore, we confirm that our model generalizes to data with different numbers of multi-branch nodes.
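The grouping analysis can be sketched as bucketing test examples by their number of multi-branch nodes and computing exact-match accuracy per bucket. The fixed bucket width and the `(count, is_exact_match)` input format are assumptions of this sketch; the group boundaries actually used in Tables 4 and 5 are not specified here.

```python
from collections import defaultdict


def accuracy_by_group(examples, bucket_size):
    """Exact-match accuracy per group of examples.

    examples:    iterable of (num_multi_branch_nodes, is_exact_match) pairs.
    bucket_size: width of each group (hypothetical choice).
    Returns a dict mapping (low, high) node-count ranges to accuracy.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n_nodes, correct in examples:
        bucket = n_nodes // bucket_size
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {
        (b * bucket_size, (b + 1) * bucket_size - 1): hits[b] / totals[b]
        for b in sorted(totals)
    }


# Toy predictions: (number of multi-branch nodes, prediction matches reference).
toy = [(1, True), (2, False), (5, True), (6, True)]
grouped = accuracy_by_group(toy, bucket_size=5)
```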

Accuracy of Action Predictions for the Child Nodes
Given a multi-branch node, its child nodes have an important influence on the subtree. Therefore, we focus on the accuracy of action predictions for the child nodes. For fair comparison, we predict actions with the previous ground-truth history actions as inputs. Table 3 reports the experimental results. We observe that TRANX-RL still achieves higher prediction accuracy than other baselines on most groups, which again supports the effectiveness of our model.
Figure 3 shows two examples from DJANGO. In the first example, TRANX first generates the leftmost child node at timestep $t_2$, incorrectly predicting GENTOKEN['gzip'] as a REDUCE action. By contrast, TRANX-RL puts this child node in the last position and successfully predicts its action, since our model benefits from the previously generated token 'GzipFile' of the sibling node, which frequently co-occurs with 'gzip'. In the second example, TRANX incorrectly predicts the second child node at timestep $t_{10}$, while TRANX-RL predicts it first, at timestep $t_6$. We attribute this error to sequential generation: errors made at early timesteps accumulate and harm the predictions of later sibling nodes. By comparison, our model can flexibly generate subtrees with shorter lengths first, alleviating error accumulation.

Related Work
With the prosperity of deep learning, researchers have introduced neural networks into code generation. In this aspect, Ling et al. (2016) first explore a Seq2Seq model for code generation. Then, due to the advantage of tree structure, many attempts resort to Seq2Tree models, which represent code as trees of meaning representations (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017; Rabinovich et al., 2017; Yin and Neubig, 2017, 2018; Sun et al., 2019, 2020). Typically, Yin and Neubig (2018) propose TRANX, which introduces ASTs as intermediate representations of code and has become the most influential Seq2Tree model. Then, Sun et al. (2019, 2020) respectively explore CNN and Transformer architectures, and Xu et al. (2020) exploit external knowledge to enhance neural code generation models. Generally, all these Seq2Tree models generate ASTs in pre-order traversal, which, however, is not suitable for handling all multi-branch AST nodes. Different from the above studies that deal with multi-branch nodes in a left-to-right order, our model determines the optimal expansion orders of branches for multi-branch nodes.
Some researchers have also noticed that the selection of decoding order has an important impact on the performance of neural code generation models. For example, Alvarez-Melis and Jaakkola (2017) introduce a doubly recurrent RNN model that combines width and depth recurrences to traverse each node. Dong and Lapata (2018) first generate a rough code sketch, and then fill in missing details by considering the input NL description and the sketch. Gu et al. (2019a) present an insertion-based Seq2Seq model that can flexibly generate a sequence in an arbitrary order. In general, these studies still deal with multi-branch AST nodes in a left-to-right manner. Thus, these models are theoretically compatible with our proposed branch selector.
Finally, it should be noted that there have been many NLP studies exploring other decoding methods to improve other NLG tasks (Zhang et al., 2018; Welleck et al., 2019; Stern et al., 2019; Gu et al., 2019a,b). However, to the best of our knowledge, our work is the first attempt to explore dynamic selection of branch expansion orders for tree-structured decoding.

Conclusion and Future Work
In this work, we first point out that the generation of dominant Seq2Tree models based on pre-order traversal is not optimal for handling all multi-branch nodes. Then we propose an extended Seq2Tree model equipped with a context-based branch selector, which is capable of dynamically determining optimal branch expansion orders for multi-branch nodes. In particular, we adopt reinforcement learning to train the whole model with a carefully designed reward that measures the model loss difference between different branch expansion orders. Extensive experimental results and in-depth analyses demonstrate the effectiveness and generality of our proposed model on several commonly-used datasets.
In the future, we will study how to extend our branch selector to deal with the indefinite number of branches produced by fields with sequential cardinality.