Code Generation from Natural Language with Less Prior Knowledge and More Monolingual Data

Training datasets for semantic parsing are typically small due to the higher expertise required for annotation compared to most other NLP tasks. As a result, models for this application usually need additional prior knowledge built into the architecture or algorithm. The increased dependency on human experts hinders automation and raises development and maintenance costs in practice. This work investigates whether a generic transformer-based seq2seq model can achieve competitive performance with minimal code-generation-specific inductive bias design. By exploiting a relatively sizeable monolingual corpus of the target programming language, which is cheap to mine from the web, we achieve 81.03% exact match accuracy on Django and a 32.57 BLEU score on CoNaLa. Both are SOTA to the best of our knowledge. This positive evidence highlights a potentially easier path toward building accurate semantic parsers in practice.


Introduction
For a machine to act upon users' natural language inputs, a model needs to convert the natural language utterances into a machine-understandable meaning representation, i.e. perform semantic parsing (SP). The output meaning representation goes beyond shallow identification of topic, intent, entities or relations: it consists of complex structured objects expressed as logical forms, query languages or general-purpose programs. Therefore, annotating a parallel corpus for semantic parsing requires more costly expertise.
SP shares some resemblance with machine translation (MT). However, SP datasets are typically smaller, with only a few thousand to at most tens of thousands of examples, smaller even than most low-resource MT problems. Simultaneously, because the predicted outputs generally need to be exactly correct to execute and produce the right answer, the accuracy requirement is higher than in MT. As a result, inductive bias design in architecture and algorithm has been prevalent in the SP literature (Dong and Lapata, 2016; Yin and Neubig, 2017, 2018; Dong and Lapata, 2018; Guo et al., 2019; Wang et al., 2019; Yin and Neubig, 2019).

* Work done during internship at BorealisAI. † Code at https://github.com/BorealisAI/code-gen-TAE

Figure 1: TAE: the monolingual corpus is used both as source and target. The encoder is frozen in the computation branch on the monolingual data.
While their progress is remarkable, excessive task-specific expert design makes the models complicated, hard to transfer to new domains, and challenging to deploy in real-world applications. In this work, we look at the opposite end of the spectrum and try to answer the following question: with little inductive bias in the model, and no additional labelled data, is it still possible to achieve competitive performance? This is an important question, as the answer could point to a much shorter road to practical SP without breaking the bank. This paper shows that the answer is encouragingly affirmative. By exploiting a relatively large monolingual corpus of the programming language, a transformer-based seq2seq model (Vaswani et al., 2017) with little SP-specific prior can attain results superior to or competitive with the state-of-the-art models specially designed for semantic parsing. Our contributions are three-fold:
• We provide evidence that transformer-based seq2seq models can reach performance competitive with or superior to models specifically designed for semantic parsing. This suggests an alternative route for future progress other than inductive bias design;
• We empirically analyze previously proposed approaches for incorporating monolingual data and show the effectiveness of our modified technique on a range of datasets;
• We set the new state of the art on Django (Oda et al., 2015), reaching 81.03% exact match accuracy, and on CoNaLa, with a BLEU score of 32.57.

Previous Work on Semantic Parsing
Different sources of prior knowledge about the SP problem structure can be exploited. Input structure: Wang et al. (2019) adapt the transformer relative position encoding (Shaw et al., 2018) to express relations among database schema elements as well as with the input text spans. Herzig and Berant (2020) proposed a span-based neural parser with a compositional inductive bias built in; they also leverage CKY-style (Cocke, 1969; Kasami, 1966; Younger, 1967) inference to link input features to output codes. Output structure: The implicit tree or graph-like structures in the programs can also be exploited. Dong and Lapata (2016) proposed a parent-feeding LSTM following the tree structure. Dong and Lapata (2018) proposed a coarse-to-fine decoding approach. Guo et al. (2019) crafted an intermediate meaning representation to bridge the large gap between the input utterance and the output SQL queries. Yin and Neubig (2017, 2018) proposed TranX, a more general-purpose transition-based system, to ensure grammaticality of predictions. Using TranX, the neural model predicts the linear sequence of AST-constructing actions instead of the program tokens. However, a human expert needs to craft the grammar, and the design quality impacts learning and generalization for the neural nets. Sequential models with fewer SP-specific priors have been investigated (Dong and Lapata, 2016; Ling et al., 2016b; Zeng et al., 2020). However, they generally fell short in accuracy compared to the best of the structure-exploiting models listed above.
The work most closely related to ours is that of Xu et al. (2020) on incorporating external knowledge from extra datasets, which used a noisy parallel dataset mined from StackOverflow to pre-train the SP model before fine-tuning it on the primary dataset. Their approach's main limitation is the remaining need for (noisy) parallel data, albeit cheaper than the primary labelled set. Nonetheless, as we shall see in the experiment section, our approach achieves better results when using the same amount of data mined from the same source, despite ignoring the source sentences.

Background and Methodology
The BERT (Devlin et al., 2018) class of pre-trained models can make up for the lack of inductive bias on the input side to some degree. On the output side, we hope to learn the necessary prior knowledge about the target meaning representation from unlabelled monolingual data.

Target Autoencoding with Frozen Encoder
We assume a parallel corpus of natural language utterances and their corresponding programs, $B = \{(\mathbf{x}_i, \mathbf{y}_i)\}$. The goal is to train a translator model (TM) to maximize the conditional log probability of $\mathbf{y}_i$ given $\mathbf{x}_i$, $\log T_{\boldsymbol{\theta}}(\mathbf{y}_i \mid \mathbf{x}_i)$, over the training set: $\mathcal{L}_{\mathrm{sup}} = \sum_{B} \log T_{\boldsymbol{\theta}}(\mathbf{y}_i \mid \mathbf{x}_i)$, where $\boldsymbol{\theta}$ is the vector of TM parameters. Let $M = \{\mathbf{y}_i\}$ denote the monolingual dataset in the target language. Currey et al. (2017) and Burlot and Yvon (2019) demonstrated that in low-resource MT, autoencoding the monolingual data besides the main supervised training is helpful. Following the same path, we add an autoencoding objective term on the monolingual data: $\mathcal{L}_{\mathrm{full}} = \mathcal{L}_{\mathrm{sup}} + \sum_{M} \log T_{\boldsymbol{\theta}}(\mathbf{y}_i \mid \mathbf{y}_i)$. The targets $\mathbf{y}_i$ are reconstructed using the shared encoder-decoder model.
We conjecture that monolingual data autoencoding mainly helps the decoder, so we propose to freeze the encoder parameters on monolingual data. Writing the encoder and decoder parameters separately as $\boldsymbol{\theta} = [\boldsymbol{\theta}_e, \boldsymbol{\theta}_d]$, $\boldsymbol{\theta}_e$ is updated using the gradient of the supervised objective $\mathcal{L}_{\mathrm{sup}}$ only, whereas the decoder gradient comes from $\mathcal{L}_{\mathrm{full}}$. We verify this hypothesis in Section 4.1.
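The gradient routing above can be sketched in plain Python with toy scalar "parameters" and precomputed "gradients" (all names are illustrative, not from the released code):

```python
def tae_update(params, grads_sup, grads_mono, lr=1e-4):
    """One TAE optimization step with a frozen encoder on monolingual data.

    params:     {'encoder': ..., 'decoder': ...} current parameter values
    grads_sup:  gradients of the supervised loss L_sup
    grads_mono: gradients of the autoencoding loss on monolingual data
    The encoder is updated only from L_sup; the decoder from both losses.
    """
    new_params = dict(params)
    # Encoder: supervised gradient only (frozen w.r.t. the monolingual branch).
    new_params['encoder'] = params['encoder'] - lr * grads_sup['encoder']
    # Decoder: gradients from both the supervised and autoencoding losses.
    new_params['decoder'] = params['decoder'] - lr * (
        grads_sup['decoder'] + grads_mono['decoder'])
    return new_params
```

In a real framework (e.g. PyTorch) the same effect is obtained by detaching the encoder output, or zeroing the encoder gradients, on monolingual batches.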
In terms of model architecture, our TM is a standard transformer-based seq2seq model with copy attention (Gu et al., 2016) (illustrated in Fig. 2 of Appendix C). We fine-tune BERT as the encoder and use a 4-layer transformer decoder. There is little SP-specific inductive bias in the architecture. The only special structure is the copy attention, which is not a strong SP-specific inductive bias, as copy attention is widely used in other tasks as well.
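Copy attention mixes the decoder's vocabulary distribution with its attention over the source tokens. A minimal sketch of one decoding step under the usual pointer-generator formulation (variable names are ours, not from the paper's code):

```python
def copy_distribution(p_vocab, attention, src_tokens, p_gen):
    """Final output distribution of a copy-attention decoder step.

    p_vocab:    dict token -> probability from the vocabulary softmax
    attention:  attention weights over source positions (sums to 1)
    src_tokens: source token at each position
    p_gen:      probability of generating from the vocabulary vs. copying
    """
    p_final = {tok: p_gen * p for tok, p in p_vocab.items()}
    for a, tok in zip(attention, src_tokens):
        # Attention mass on a source position adds copy probability
        # to the token at that position.
        p_final[tok] = p_final.get(tok, 0.0) + (1.0 - p_gen) * a
    return p_final
```

This lets the model emit rare identifiers (e.g. variable names) verbatim from the utterance even when they are absent from the output vocabulary.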
We refer to the method of using copied monolingual data and freezing the encoder over them as target autoencoding (TAE). Unless otherwise specified in the ablation studies, the encoder is always frozen.

Experiments
For our primary experiments we considered two Python datasets, namely Django and CoNaLa. The former is based on the Django web framework, and the latter consists of annotated code snippets from StackOverflow answers. Additionally, we experiment on the SQL versions of GeoQuery and ATIS from Finegan-Dollak et al. (2018) (with query split), WikiSQL (Zhong et al., 2017), and Magic (Java) (Ling et al., 2016b).
Python Monolingual Corpora: CoNaLa comes with 600K mined questions from StackOverflow. We ignored the noisy source intents/sentences and only used the Python snippets. To be comparable with Xu et al. (2020), we also selected a corresponding 100K subset for comparison. See Appendix A for details on the SQL and Java monolingual corpora.
Experimental Setup: In all experiments, we use label smoothing with a parameter of 0.1 and Polyak averaging (Polyak and Juditsky, 1992) of parameters with a momentum of 0.999, except for GeoQuery, where we use 0.995. We use Adam (Kingma and Ba, 2014) and early stopping based on the dataset-specific evaluation metric on the dev set. The learning rate for the encoder is $1 \times 10^{-5}$ on all datasets. For the decoder, we use a learning rate of $7.5 \times 10^{-5}$ on all datasets except GeoQuery and ATIS, where we use $1 \times 10^{-4}$. The architecture overview is shown in Fig. 2. At inference time, we use beam search with a beam size of 10 and length normalization based on Wu et al. (2016). We run each experiment with 5 different random seeds and report the average and standard deviation. WordPiece tokenization is used for both the natural language utterances and the programming code.
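Polyak averaging maintains an exponential moving average of the parameters that is used at evaluation time in place of the raw weights. A one-line sketch of the update, with the momentum values quoted above:

```python
def polyak_update(avg, current, momentum=0.999):
    """Update the exponential moving average of one parameter.

    avg:      running average used at evaluation time
    current:  parameter value after the latest optimizer step
    momentum: 0.999 in most runs above (0.995 on GeoQuery)
    """
    return momentum * avg + (1.0 - momentum) * current
```

In practice the update is applied element-wise to every parameter tensor after each optimizer step; a higher momentum averages over a longer window of training.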

Empirical Analysis
First, we considered a scenario where the monolingual corpus comes from the same distribution as the bitext. We simulate this setup by using 10% of the Django training data as labeled data, while using all the Python examples from Django as a monolingual dataset that is 10 times bigger. Results with "Authentic Dataset" in Fig. 3 show the effectiveness of TAE versus other approaches.
Next, we used the monolingual dataset prepared for Python (the StackOverflow corpus), which is from a different distribution. Fig. 3 shows an even more considerable improvement, thanks to the larger monolingual set. We considered the noisy intents provided in the CoNaLa monolingual corpus, as well as dummy source sentences where each monolingual sample is paired with a random-length array of zeros. We also compared against other well-known approaches like fusion and back-translation; see experiment details in Appendix D. TAE outperforms all those approaches by a large margin.

Now one important question is: what part of the model benefits most from monolingual data? In Sec. 3.1, we conjectured that autoencoding of monolingual data should mostly help the decoder, not the encoder. To verify this, we perform an ablation comparing freezing the encoder parameters versus not freezing them over the monolingual set. Fig. 3 shows that without freezing the encoder, performance drops slightly for TAE on authentic Django data, while dropping significantly when copying on the StackOverflow data. This confirms that the performance gain is due to the effect on the decoder, while the copied monolingual data might even hurt the encoder.

Finally, we compare against Xu et al. (2020), who also leverage the same extra data mined from StackOverflow (EK in Table 3). As mentioned in Sec. 2, they used the noisy parallel corpus for pre-training, whereas we only leverage the monolingual set. Nonetheless, we obtain both a larger relative improvement over our baseline (32.29 from 30.98) compared to Xu et al. (2020) (28.14 from 27.20), and better absolute results in the best case. In fact, with only the 100K StackOverflow monolingual data, our result is on par with the best one from Xu et al. (2020), which uses the additional Python API bitext data. Note that part of our superior performance is due to using BERT as the encoder.

Main Results on Full Data
Finally, TAE also yields improvements on other programming languages, as shown for GeoQuery (SQL), ATIS (SQL) and Magic (Java) in Table 4. We observe no improvement on WikiSQL, but this is not surprising given its large dataset size and the simplicity of its targets. As observed in previous work (Finegan-Dollak et al., 2018), more than half of its queries follow the simple pattern "SELECT col FROM table WHERE col = value".
The main results, in terms of improvement over the previous best methods, are statistically significant in Tables 2-3. On Django, our result is better than Reranker (Yin and Neubig, 2019) (the best previous method in Table 2) with a p-value < 0.05 under a one-tailed two-sample t-test for mean equality. Since the previous state of the art on CoNaLa (EK + 100k + API in Table 3) did not provide the standard deviation, we cannot conduct a two-sample t-test against it. Instead, we performed a one-tailed two-sample t-test against the TranX+BERT baseline and observed that our improvement is statistically significant with a p-value < 0.05. In Table 4,

Discussion
Thus far, we have verified that the decoder benefits from TAE while the encoder does not. For a better understanding of what TAE improves in the decoder, we propose two metrics, namely copy accuracy and generation accuracy. Copy accuracy only considers tokens appearing in the source sentence: if the model produces all of the tokens that need to be copied from the source sentence, and in the right order, the score for the example is one, otherwise zero. Generation accuracy ignores tokens appearing in the source intent and computes the exact match accuracy of the prediction. We show how to compute these metrics for the following example:

Question: "define the function timesince with d, now defaulting to none, reversed defaulting to false as arguments."

Ground Truth: "def timesince(d, now=none, reversed=false): pass"

We iterate over the ground-truth script tokens one by one and remove those that can be copied from the source, leading to this code:

Generation Ground Truth: "def (=none=):pass"

The removed tokens are considered the copy ground truth:

Copy Ground Truth: "timesince d , now , reversed false"

We then use the copy and generation ground-truth strings to compute each metric. Note that the order of tokens is still important and exact equality is required.
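Under one simplified reading of the procedure above (treating the source as a multiset of tokens that is consumed while scanning the ground truth; tokenization details in the paper may differ), the split and scoring can be sketched as:

```python
from collections import Counter

def split_copy_generation(src_tokens, gt_tokens):
    """Split ground-truth tokens into a copy part and a generation part.

    A ground-truth token counts as 'copied' if it still has an unused
    occurrence in the source; otherwise it belongs to the generation part.
    """
    budget = Counter(src_tokens)  # remaining copyable occurrences
    copy_part, gen_part = [], []
    for tok in gt_tokens:
        if budget[tok] > 0:
            budget[tok] -= 1
            copy_part.append(tok)
        else:
            gen_part.append(tok)
    return copy_part, gen_part

def exact_match(pred_part, gt_part):
    # Order matters and exact equality is required, per the metric definition.
    return 1.0 if pred_part == gt_part else 0.0
```

Copy accuracy and generation accuracy are then the averages of `exact_match` over the dataset for the respective parts of each prediction.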
As shown in Table 5, both metrics are improved. Table 1 illustrates one example of each type, with more samples in Appendix E. Copy accuracy is important for producing the right variable names mentioned in the utterance, and it is improved as expected. It is also encouraging to see, quantitatively and qualitatively, that grammar mistakes are reduced, meaning that the lack of prior knowledge of the target language structure is compensated for by learning from monolingual data.

Conclusion
This work has shown that it is possible to achieve competitive or even SOTA performance on semantic parsing with little to no inductive bias design.
Besides the usual large-scale pre-trained encoders, the key is to exploit relatively large monolingual corpora of the meaning representation. The modified copied-monolingual-data approach from the machine translation literature works well in this extremely low-resource setting. Our results point to a promising alternative direction for future progress.

References

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464-468.

A Datasets
We used 6 datasets in total. Django includes programs from the Django web framework, and CoNaLa contains a diverse set of intents annotated on Python snippets gathered from StackOverflow. WikiSQL, GeoQuery, and ATIS include natural language questions and their corresponding SQL queries. WikiSQL contains only single-table queries, while GeoQuery and ATIS require queries over more than one table. Finally, Magic has Java class implementations of game cards with different methods used during the game. Table 6 summarises all the parallel datasets. For GeoQuery we used the query split provided by Finegan-Dollak et al. (2018).
Monolingual Corpora: CoNaLa comes with 600K mined questions from StackOverflow. We ignored the noisy source intents/sentences and only used the Python snippets. To be comparable with Xu et al. (2020), we also selected a corresponding 100K subset for comparison. For SQL, Yao et al. (2018) automatically parsed StackOverflow questions related to SQL and provided a set containing 120K SQL examples. We automatically parsed the SQL code and removed samples with grammatical mistakes. We also filtered out samples not starting with the SELECT token. For Java, Allamanis and Sutton (2013) downloaded the full repositories of individual projects that were forked at least once; duplicate projects were removed. We randomly sampled 100K Java examples from more than 14K projects and used that as the monolingual set. Table 7 summarises all the monolingual datasets.
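The SELECT filter described above can be as simple as the following check (the grammar check itself requires a SQL parser and is omitted here; the function name is ours):

```python
def keep_sql_snippet(code):
    """Keep only snippets whose first token is SELECT (case-insensitive)."""
    tokens = code.strip().split()
    return bool(tokens) and tokens[0].upper() == 'SELECT'
```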

C Architecture and Experiment Details
We selected the decoder learning rate based on a linear search over $[2.5 \times 10^{-5}, 1 \times 10^{-3}]$. The number of decoder layers was decided based on a search over {2, 3, 4, 5, 6} layers; a 4-layer decoder showed superior performance (we used a single run for hyperparameter selection). Each model has 150M parameters, optimized using a single GTX 1080 Ti GPU. With a batch size of 16, each step takes 1.7s on the GeoQuery dataset (other datasets have very similar runtimes). On Django and CoNaLa, we followed Xu et al. (2020) in replacing quoted values with "str#", where # is a unique id. On the Magic dataset, we replaced all newline "\n" tokens with "#"; following Ling et al. (2016a), we split CamelCase words (e.g., class TirionFordring → class Tirion Fordring) and all punctuation characters. We filtered out Magic data with Java code longer than 350 tokens in order to fit in GPU memory.
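The CamelCase splitting used for Magic can be done with a small regular expression; this is our own sketch of the preprocessing rule, not the released code:

```python
import re

def split_camel_case(name):
    """Insert a space at every lower-to-upper case boundary.

    e.g. 'class TirionFordring' -> 'class Tirion Fordring'
    """
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
```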

D Back-Translation and Fusion details
For fusion we follow equation 1, where TM stands for translation model and LM for language model; τ limits the confidence of the language model and λ controls the balance between the TM and the LM. Figure 4 shows the performance of a base TM trained on 10% of the Django training data (test exact match accuracy of 31.80) over different values of λ and τ. The LM is trained on the full Django training set. For back-translation, we first trained a model with the same architecture explained above in the backward direction, using BLEU score as the evaluation metric for early stopping. Using greedy search, we generated the corresponding source intent for each code snippet. In the end, the synthetic data is merged with the bitext and a forward model is trained.
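Since equation 1 is not reproduced in this excerpt, the following sketches one common shallow-fusion scoring rule that is consistent with the description (λ weights the LM against the TM; τ caps the LM's log-probability, limiting its confidence). This is an assumption about the form of the equation, not necessarily the paper's exact formulation:

```python
def fused_score(logp_tm, logp_lm, lam=0.2, tau=-1.0):
    """Shallow fusion of translation-model and language-model scores.

    logp_tm, logp_lm: per-token log-probabilities from the two models
    lam: weight of the LM relative to the TM (lambda in the text)
    tau: ceiling on the LM log-probability, limiting its confidence
    """
    # Cap the LM's (log-)confidence at tau before mixing.
    capped_lm = min(logp_lm, tau)
    return logp_tm + lam * capped_lm
```

At decoding time, beam-search hypotheses would be ranked by the sum of these fused per-token scores instead of the TM score alone.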