Natural Language to Code Translation with Execution

Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly incorporate program semantics (i.e., execution results) during training, they are able to generate correct solutions for many problems. However, choosing a single correct program from a generated set for each problem remains challenging. In this work, we introduce execution result–based minimum Bayes risk decoding (MBR-EXEC) for program selection and show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks. We select output programs from a generated candidate set by marginalizing over program implementations that share the same semantics. Because exact equivalence is intractable, we execute each program on a small number of test inputs to approximate semantic equivalence. Across datasets, execution or simulated execution significantly outperforms the methods that do not involve program semantics. We find that MBR-EXEC consistently improves over all execution-unaware selection methods, suggesting it as an effective approach for natural language to code translation.


Introduction
The recent success of large pretrained language models (Radford et al., 2019; Brown et al., 2020) has extended to translating natural language descriptions into executable code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). After pretraining on large corpora of code with a simple language modeling objective, the models demonstrate the ability to follow few-shot prompts (Radford et al., 2019; Brown et al., 2020) to translate natural language to various programming languages. While code sampled from such models obtains surprisingly good BLEU scores against ground-truth programs and relatively high execution accuracies, it often includes obvious mistakes and is of much lower quality than code written by intermediate-level human programmers (Li et al., 2022). In addition, choosing a single correct program from a set of generated programs remains challenging.

Figure 1: Illustration of MBR-EXEC on translating natural language to Python code: we (1) sample programs from Codex (Chen et al., 2021), (2) execute each program on one test case, and (3) select the example with the minimal execution result-based Bayes risk. Numbers around dotted lines denote the 0/1 matching loss between execution results, while the Bayes risk of a program is defined by the sum of the losses between itself and the other examples. In the figure, either Code #1 or Code #3 can be selected. Ground-truth program output is not needed for selection.
In this work, we translate natural language to executable code with awareness of execution results on a limited number of test case inputs, which we require only at inference time. Our approach is built on the hypothesis that a pretrained code model spreads probability mass over multiple semantically equivalent code forms that implement the same functionality. Given a text description of a desired program function, we (1) sample a set of programs from a pretrained code model (§3.1) and (2) select a single candidate program using execution-result-based minimum Bayes risk (MBR) decoding (§3.2). Intuitively, we score each sampled program by its agreement with the other samples in terms of execution results, and select a program with maximal overall agreement.
Our evaluation focuses on a challenging setting where only a single program can be submitted as the solution to a given problem. We show that the execution result-based selection method (i.e., MBR-EXEC) significantly outperforms all no-execution baselines across all considered datasets, despite the model having never executed any code during training, and even when it has no access to ground-truth outputs. In addition, we show that MBR decoding with a BLEU-based risk function performs consistently well across datasets and can be considered a promising alternative when we are not able to execute programs.

Language to Code with Neural Networks
With the progress of neural network-based language modeling and conditional text generation, there has been much work exploring natural language to code generation with end-to-end neural model architectures (Xiao et al., 2016; Ling et al., 2016; Rabinovich et al., 2017; Dong and Lapata, 2018; Suhr et al., 2018; Xu et al., 2020; Lachaux et al., 2021, inter alia). Recently, large Transformer-based (Vaswani et al., 2017) pretrained code models have shown surprisingly strong generation performance across programming languages (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). In this work, we explore selection (i.e., inference) methods to apply to these pretrained models, showing that selecting programs using their execution results can greatly improve program generation.

Prompting Pretrained Language Models
The GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) models have shown strong prompting performance: after conditioning on a task-related prompt, the language models are often able to make accurate output predictions for unseen inputs. These results have led to prompt-based approaches for few-shot or zero-shot text classification (Shin et al., 2020; Gao et al., 2021; Min et al., 2021, inter alia), question answering (Khashabi et al., 2020), machine translation (Radford et al., 2019), and evaluation of generated text (Yuan et al., 2021), where no more than a few examples are used to construct the prompts. Few-shot examples are usually formatted into natural language prompts, and continuations generated by the models for these prompts are then converted to task-specific predictions. The prompt formatting can be either manually designed (Jiang et al., 2020) or automatically learned (Li and Liang, 2021; Lester et al., 2021). Recently, Wang et al. (2022) found that self-consistency-based decoding improves chain-of-thought prompting (Wei et al., 2022). We refer the readers to Liu et al. (2021) for a more comprehensive survey.
In this work, we prompt a pretrained code model (Codex; Chen et al., 2021) in a few-shot setting (§3.1) and perform execution-based selection over the samples. We also find that the Codex model performs well with a fairly programming-language-agnostic prompt format (Table 1).

Minimum Bayes Risk Decoding
In structured prediction, minimum Bayes risk (MBR) decoding (Bickel and Doksum, 1977) selects a structured output that minimizes the expected error by introducing an explicit loss function into the decision criterion. This method has outperformed the maximum a posteriori (MAP) method on many tasks, including syntactic parsing (Titov and Henderson, 2006; Shi et al., 2019; Zhang et al., 2020), statistical machine translation (Kumar and Byrne, 2004; Zhang and Gildea, 2008), and neural machine translation (Eikema and Aziz, 2020, 2021).
In machine translation, MBR decoding is usually implemented by reranking candidates (Goel and Byrne, 2000; Kumar and Byrne, 2004; Tromble et al., 2008, inter alia). Let F denote the input, and E denote the corresponding ground-truth translation. Given a loss function ℓ(·, ·) between translations and a probability model P(E | F), MBR decoding can be formulated as

Ê = argmin_{E′ ∈ E_h} Σ_{E ∈ E_e} ℓ(E, E′) P(E | F),    (1)

where E_h is the hypothesis space and E_e is the evidence space: both are sets of possible translations. We define execution-based MBR loss functions, and show that they are crucial in the sample selection process for natural language to code translation with a pretrained large language model.

Proposed Approach: MBR-EXEC
Our execution-based framework consists of two parts: (1) collecting samples from a pretrained code model (§3.1) and (2) selecting the best candidate using minimum Bayes risk decoding (§3.2).

Sample Collection
To obtain the corresponding code, we query the pretrained code model with few-shot prompts followed by the text description, using a unified markup-style few-shot prompting template (Table 1).² In addition to the generated programs themselves, most existing models also allow us to obtain the probability of generating each token w_i conditioned on the prompt tokens C = ⟨c_1, …, c_n⟩ and all the previously generated tokens w_1, …, w_{i−1}, denoted by P(w_i | C, w_1, …, w_{i−1}).

² While existing work on prompting language models usually requires a task-specific design of prompts (Shin et al., 2020; Zhong et al., 2021; Gao et al., 2021, inter alia), we find that a fairly general pattern (Table 1), which does not involve any programming-language-specific information, works well across programming languages on Codex.
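As a rough illustration of such a markup-style template, one might assemble a few-shot prompt as follows. This is only a sketch: the `<info>`/`<text>`/`<code>` tag names and the helper functions are our assumptions, while the paper's exact template is given in its Table 1.

```python
def format_example(text, code=None, info=None):
    """Render one example in a markup-style prompt format.

    The <info>/<text>/<code> tag names are illustrative assumptions;
    the paper's exact template is given in its Table 1.
    """
    parts = []
    if info is not None:
        parts.append(f"<info>{info}</info>")
    parts.append(f"<text>{text}</text>")
    # For few-shot examples the code is given; for the query we leave
    # an open <code> tag and let the model generate until </code>.
    parts.append(f"<code>{code}</code>" if code is not None else "<code>")
    return "\n".join(parts)


def build_prompt(few_shot_examples, query_text, query_info=None):
    """Concatenate few-shot examples and the query into one prompt."""
    blocks = [format_example(t, c, i) for (t, c, i) in few_shot_examples]
    blocks.append(format_example(query_text, None, query_info))
    return "\n\n".join(blocks)


prompt = build_prompt(
    [("Add two numbers.", "def add(a, b):\n    return a + b", None)],
    "Multiply two numbers.",
)
```

The prompt ends with an open `<code>` tag, so a pattern-following model is expected to emit the program and then a closing `</code>` token.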

Execution-Based MBR Decoding
Given a problem with natural language description C, we sample a set of programs P = {p_i}_{i=1}^N using the method in §3.1. We formulate execution-based MBR (MBR-EXEC) decoding by selecting

p̂ = argmin_{p ∈ P} L_MBR(p; P), where L_MBR(p; P) = Σ_{p_ref ∈ P} ℓ(p; p_ref),    (2)

as the best candidate, where L_MBR(·; ·) denotes the MBR loss of a program conditioned on a set of references, and ℓ is a predefined, execution-based loss function that examines the discrepancy between two programs. Intuitively, this finds a consensus candidate which has a low loss relative to all other candidates. The above implementation is an unbiased estimation of Eq. (1). We introduce the following execution result-based loss function:

ℓ(p_i; p_j) = 1[∃t ∈ T, p_i(t) ≠ p_j(t)],    (3)

where T is the set of available test inputs and p_i(t) denotes the execution result of program p_i on test input t. There may be multiple programs receiving the same MBR loss L_MBR(·; P), which are all minima. We break any ties by selecting the program with the largest likelihood among them.
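To make the selection procedure concrete, a minimal sketch follows. It assumes a caller-supplied `execute` function and per-program likelihoods for tie-breaking; the function and argument names are ours, not the paper's.

```python
def mbr_exec_select(programs, test_inputs, execute, loglik):
    """Select a program by execution result-based minimum Bayes risk.

    programs:    list of candidate program strings
    test_inputs: list of test-case inputs (no ground-truth outputs needed)
    execute:     execute(program, test_input) -> output (None on failure)
    loglik:      loglik(program) -> model log-likelihood, to break ties
    """
    # Execution results fully determine the 0/1 loss between two programs.
    results = {p: tuple(execute(p, t) for t in test_inputs) for p in programs}

    def loss(p_i, p_j):
        # 1 iff the two programs disagree on any test input
        return int(results[p_i] != results[p_j])

    # Bayes risk of p = sum of losses against all candidates (references)
    risk = {p: sum(loss(p, q) for q in programs) for p in programs}
    min_risk = min(risk.values())
    minima = [p for p in programs if risk[p] == min_risk]
    # Break ties by model likelihood, as in the selection rule above
    return max(minima, key=loglik)


# Toy usage: candidates are Python expressions over a single input x.
def execute(prog, x):
    try:
        return eval(prog, {"x": x})
    except Exception:
        return None  # failed execution


cands = ["x + 1", "1 + x", "x * 2"]
ll = {"x + 1": -1.0, "1 + x": -2.0, "x * 2": -0.5}
best = mbr_exec_select(cands, [3], execute, ll.get)
```

Here `x + 1` and `1 + x` agree on the test input (both output 4) and so receive the minimal risk; the tie is broken in favor of the more likely candidate.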

Experiments
We evaluate (§4.3) and analyze (§4.4) the performance of MBR-EXEC, after introducing the datasets and evaluation metrics (§4.1) as well as non-execution-based baselines (§4.2). Finally, we show and discuss oracle performance on the considered tasks (§4.5).
MBPP. The MBPP dataset (Austin et al., 2021)⁴ consists of 974 basic Python programming problems, with 500 of them used for testing and the rest for training or few-shot prompting. A ground-truth program and three assertions (i.e., test cases with input and ground-truth output) are associated with the description of each problem. When collecting the samples, we use one assertion as the extra information ([INFO]; Table 1). Throughout the selection process, MBR-EXEC has access to neither the ground-truth test case outputs nor the ground-truth programs. This is compatible with many real scenarios, e.g., in a programming competition, where valid test inputs are easier to access than ground-truth outputs. Programs are evaluated with execution accuracy, where a program is considered as passing if all three test cases are correct.

⁴ https://github.com/google-research/google-research/tree/master/mbpp
Spider. The Spider dataset (Yu et al., 2018) is a text-to-SQL dataset, which requires a model to translate text descriptions into SQL commands. There are 7,000 examples for training and 1,034 for development. When prompting models to produce candidate commands, we concatenate the corresponding SQL table and column names as the [INFO]. Commands are evaluated with execution accuracy, where a command is considered as passing if it returns the same result as the ground-truth command when executed on the same database.
NL2Bash. The NL2Bash dataset (Lin et al., 2018) aims to translate natural language to bash commands. We do not include [INFO] in the sample collection process. Because it is difficult to execute bash commands in a sandbox, we split each bash command with bashlex, a rule-based bash parser, and use the token-level BLEU-4 score between commands as an estimate of execution result similarity. We consider a command to be inexecutable when bashlex fails to parse it. Following Lin et al. (2018), commands are evaluated with character-level BLEU-4.
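A rough sketch of this simulated-execution similarity is below. It substitutes the stdlib `shlex` for the bashlex tokenizer the paper uses, and implements a plain sentence-level BLEU-4 by hand; both choices and all helper names are our assumptions for illustration.

```python
import math
import shlex
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def token_bleu4(hyp_cmd, ref_cmd):
    """Token-level BLEU-4 between two shell commands (a sketch; the
    paper tokenizes with bashlex, we use shlex as a stand-in)."""
    try:
        hyp, ref = shlex.split(hyp_cmd), shlex.split(ref_cmd)
    except ValueError:
        return 0.0  # unparseable command: treated as inexecutable
    precisions = []
    for n in range(1, 5):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / 4
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_avg)
```

Identical commands score 1.0, and a failed parse scores 0, mirroring the inexecutable case above.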
Across datasets, we use 15 examples from the training set for few-shot prompting. A detailed example showing prompt formatting can be found in Appendix A. Unless otherwise specified, we collect samples by querying Codex with five different prompts, each containing 3 examples, using temperature 0.3. We combine the candidates sampled across the five prompts to get a set of candidate samples to use in our selection methods. For execution on MBPP and Spider, we apply a memory limit of 128GB and a time limit of 10 seconds on a single Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz CPU, and consider programs that exceed these limits as inexecutable; unless otherwise specified, we only execute each program on the first test input provided for the example, and use the output for calculating the Bayes risk in the inference process.

Figure 2: Primary evaluation results: performance of the evaluated selection criteria (best viewed in color). For each sample size, we evaluate the methods on 5 different groups of samples and report the average performance (lines) and the standard deviations (shaded regions). All samples are collected from Codex with temperature 0.3.
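The time-limited execution step described above could be sketched as follows. This is a simplified stand-in for the paper's sandbox, not its actual implementation: the function name and the subprocess approach are our assumptions, and enforcing the memory cap would additionally need something like `resource.setrlimit` on POSIX systems.

```python
import subprocess
import sys


def run_with_limits(code, timeout=10.0):
    """Run a candidate Python program with a wall-clock time limit.

    Returns captured stdout on success, or None if the program errors
    out or exceeds the limit (i.e., it is treated as inexecutable).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None
```

Treating timeouts and nonzero exit codes uniformly as `None` matches the convention above of marking limit-exceeding programs as inexecutable.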

Baselines
We first compare with the most basic baselines, which involve no selection, prompting Codex with three examples in the Table 1 format:

• Greedy decoding. We perform token-by-token greedy decoding to generate the output.
• Sampling. We sample the output token by token with a fixed temperature, which we set to 0.3 in all of our experiments.
In addition, we consider the following baseline sample selection methods:

• Maximizing likelihood (ML). Given a set of sampled candidate programs P, we select the one with the largest log likelihood. Formally, we select

p̂ = argmax_{p ∈ P} Σ_{i=1}^{n_p} log P(w_{p,i} | C, w_{p,1}, …, w_{p,i−1}),

where n_p denotes the number of tokens in a generated program p, and w_{p,i} denotes its i-th token.
• Maximizing average log likelihood (MALL) across tokens. To address the practical issue that ML typically favors shorter sequences, we follow Chen et al. (2021) and propose another baseline that uses the average log likelihood across tokens as the selection criterion, where we select

p̂ = argmax_{p ∈ P} (1/n_p) Σ_{i=1}^{n_p} log P(w_{p,i} | C, w_{p,1}, …, w_{p,i−1}).

• BLEU score-based MBR (MBR-BLEU). To study the effect of execution-based MBR in sample selection, we consider BLEU score-based MBR, where the Bayes risk is calculated using the following risk function:

ℓ_BLEU(p_i; p_j) = −BLEU(p_i, p_j),

where BLEU(p_i, p_j) is the BLEU score between the two programs. We use character-level (MBR-charBLEU) or token-level (MBR-tokenBLEU) BLEU-4 in all of our experiments.
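The two likelihood-based criteria differ only in length normalization. A minimal sketch, assuming candidates come paired with their per-token log-probabilities (this data layout is our assumption):

```python
def select_ml(candidates):
    """ML: pick the candidate with the largest summed log-probability."""
    return max(candidates, key=lambda c: sum(c[1]))[0]


def select_mall(candidates):
    """MALL: pick the candidate with the largest per-token average,
    which removes ML's bias toward shorter programs."""
    return max(candidates, key=lambda c: sum(c[1]) / len(c[1]))[0]


# Two toy candidates: a short program and a longer one whose tokens are
# individually more confident. ML prefers the short program (higher sum);
# MALL prefers the long one (higher per-token average).
cands = [
    ("short_prog", [-0.1, -0.1]),   # sum -0.2, mean -0.10
    ("long_prog", [-0.05] * 10),    # sum -0.5, mean -0.05
]
```
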

Primary Results
We evaluate MBR-EXEC on the three datasets (§4.1) with dataset-specific metrics, using one test case for each problem. MBR-EXEC outperforms all baselines without a selection process by a significant margin (Table 2). In addition, we find that MBR-EXEC outperforms all baseline selection methods (Figure 2), and is especially effective on the two datasets (MBPP and Spider) that use execution-based evaluation. The MBR-BLEU methods are also strong and robust across datasets, suggesting the effectiveness of finding a consensus candidate that has generally low discrepancy with the other samples. While more samples lead to better performance for most methods, MALL consistently performs worse with a larger sample size: we find that MALL generally favors programs with unnecessary repetitions, and a larger sample size generally leads to a larger chance of drawing such a sample.

Analysis
We analyze the performance of MBR-EXEC from the following perspectives: the effectiveness across different sample collection temperatures (§4.4.1), the effectiveness of using groups of 3-shot prompts (§4.4.2), and the contribution of using execution results rather than simply checking the executability of programs (§4.4.3).

Effect of Sample Temperature
We first compare sampling with temperature 0.3 to greedy decoding (i.e., temperature τ = 0) from the Codex model (Table 3). Given the same number of examples, MBR-EXEC on candidates sampled with temperature 0.3 consistently reaches competitive or better performance than MBR-EXEC on greedily decoded candidates. We plot the performance of MBR-EXEC for various sampling temperatures (Figure 3). Across datasets, we find that MBR-EXEC with a decoding temperature lower than 0.5 usually leads to reasonably good performance. When the temperature approaches 1.0, the results drop rapidly for all considered selection methods on MBPP and Spider; however, MALL generally achieves higher performance on NL2Bash with a higher temperature.

Figure 4: Performance with different types of prompts, where groups of 3-shot denotes the prompt formatting in Table 1, while concatenation of 15 denotes concatenating all available 15 examples as prompts for data collection.
Based on the evidence discussed above, we recommend sampling with a low temperature (specifically, lower than 0.5) for candidate sample collection, and performing MBR-EXEC for final program selection.

Figure 5: Comparison between applying methods to all possible candidates vs. applying methods to only executable candidates (best viewed in color), where executability-X denotes applying selection criterion X on executable candidates only. For clarity, we do not include MBR-tokenBLEU and MALL and their combinations with the executability check in this figure; a full analysis of execution vs. executability can be found in Appendix B.

Effect of Different 3-shot Prompts
We analyze the necessity of choosing multiple groups of 3-shot prompts instead of simply concatenating the available 15 examples as the prompt (Figure 4).¹⁰ We allow different orders of the 15 examples when collecting samples. On both the MBPP and NL2Bash datasets, we find that using different groups of 3-shot prompts clearly outperforms concatenating all 15 examples, suggesting that different groups of fewer-shot prompts followed by post-hoc decoding may be more effective than using all available examples all the time.

¹⁰ We only include MBPP and NL2Bash results here, as concatenating 15 Spider examples usually results in exceeding the token number limit of the pretrained models.

Executability vs. Execution Results
We perform an ablation study to identify the contribution of execution results vs. program executability (Figure 5) on the MBPP and Spider datasets. We try to execute all candidates on the test cases, and apply the baseline selection methods only to the candidates that successfully execute within the time limit. On both datasets, we find that simply adding an executability check significantly improves the performance of all selection methods that do not use semantic features; on Spider, applying ML over executable commands only even outperforms MBR-EXEC across sample sizes.

Soft Loss as the Bayes Risk Function
While all the above evaluations are based on executing one test case per problem, more test cases can lead to more accurate judgments of semantic equivalence between programs (Zhong et al., 2020). Therefore, we introduce more test cases, and compare ℓ (§3.2) with ℓ_soft, a soft version of the loss function, as the Bayes risk function in MBR-EXEC. We define ℓ_soft as follows:

ℓ_soft(p_i; p_j) = (1/|T|) Σ_{t ∈ T} 1[p_i(t) ≠ p_j(t)],

which assesses equivalence based on the number of test cases that receive the same output. If there is only one test case available, ℓ and ℓ_soft are equivalent.

Figure 7: Sample size vs. oracle performance curves on the considered datasets. We calculate each expected Pass@K with 5 different sets of candidates for each sample size, while using the same sets to perform MBR-EXEC for fair comparison.
We experiment on the MBPP dataset (Figure 6), as it provides three test cases per problem. While MBR-EXEC with multiple test cases clearly outperforms MBR-EXEC with one test case across sample sizes, we did not find a significant difference between ℓ and ℓ_soft, nor between using two or three test cases.
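The two risk functions can be sketched side by side; the helper names are ours, and execution results are assumed to be precomputed per test input.

```python
def hard_loss(res_i, res_j):
    """ℓ: 1 iff the two programs disagree on any test input."""
    return int(any(a != b for a, b in zip(res_i, res_j)))


def soft_loss(res_i, res_j):
    """ℓ_soft: fraction of test inputs on which the programs disagree."""
    return sum(a != b for a, b in zip(res_i, res_j)) / len(res_i)


# With a single test case the two losses coincide ...
assert hard_loss([4], [4]) == soft_loss([4], [4]) == 0
assert hard_loss([4], [5]) == soft_loss([4], [5]) == 1
# ... but with several test cases, ℓ_soft credits partial agreement
# while ℓ treats any disagreement as a full mismatch.
```
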

Oracle Performance
We report the upper bound performance of all inference methods (Figure 7). Here, we define the expected Pass@K on one problem q by

ExPass@K(q) = E_{P′ ⊆ P, |P′| = K} [1[∃p ∈ P′ : ∀t ∈ T_q, p(t) = G(t)]],

where the expectation is over uniformly sampled size-K subsets P′ of the candidate set P, and G(t) denotes the ground-truth output for test case input t. Intuitively, to calculate the performance upper bound, a problem q is considered to be solved if there exists one program in the sampled candidate subset that passes all associated test cases T_q. The dataset-level expected Pass@K is defined as the average expected Pass@K over all problems.
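Under this definition, the expectation over subsets has the closed form 1 − C(n−c, K)/C(n, K) used by Chen et al. (2021), where n is the number of candidates and c of them pass all test cases; a sketch (the function name is ours):

```python
from math import comb


def expected_pass_at_k(n, c, k):
    """Probability that a uniformly chosen size-k subset of n candidates
    contains at least one of the c candidates that pass all test cases."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing candidate
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 5 candidates of which c = 2 pass, a single draw (K = 1) solves the problem with probability 2/5.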
In addition, we report supervised performance on these datasets, where all available training data are used for model training or finetuning: for MBPP, the results are from Austin et al. (2021), who use all 374 training examples to finetune their pretrained code model; for Spider, we compare to the current state-of-the-art result (Scholak et al., 2021); for NL2Bash, we finetune GPT-2 (Radford et al., 2019) on all training examples with the same prompting setup as in Table 1.
It is worth noting that the upper bounds already outperform the state-of-the-art supervised performance on all datasets by a significant margin when a reasonable number of samples is given. This further demonstrates the effectiveness of pretrained code models, and points out a potential next step: while such models are able to generate correct programs, designing effective inference algorithms may be a promising path toward translating natural language to code in real-world applications.

Discussion
We presented and systematically analyzed MBR-EXEC, an execution-based inference algorithm for pretrained language-to-code models, on datasets that cover three representative programming languages. Our results showed that using execution, even with access only to inputs (not outputs) for test cases, or only to an executability checker, substantially helps improve the quality of generated programs, especially in settings that use execution accuracy as the evaluation metric (MBPP and Spider). Given the consistently strong performance, we suggest that future work on program synthesis with large pretrained models consider MBR-EXEC as an effective selection algorithm. When we are not able to execute programs, or no test inputs are available, our results suggest considering an alternative MBR metric (e.g., MBR-BLEU) for selection.

Figure 3 :
Figure 3: Performance of the evaluated selection criteria across temperatures (best viewed in color). For each temperature, we perform the methods on 5 different groups of 25 examples and report the average performance (lines) and the standard deviations (shaded regions).

Figure 6 :
Figure 6: Execution accuracies with respect to sample size on the MBPP dataset, where the number in parentheses denotes the number of test cases per problem used for MBR-EXEC. Best viewed in color.

Table 1 :
Prompt formatting template for queries to pretrained code models. For instantiation, we substitute [TEXT] and [CODE] with natural language descriptions and corresponding code snippets respectively. We also provide compatibility for an optional [INFO] section to provide the model extra information (e.g., the desired function identifier and example function calls) that helps code generation. In general, we expect the pretrained code models to generate a </code> token at the end of each code snippet given their pattern-following ability (Brown et al., 2020; Chen et al., 2021); otherwise we truncate the generated code to a maximum of 1024 tokens.

Table 3 :
MBR-EXEC performance on greedily decoded and sampled programs: for each problem, we use 25 groups of 3-shot prompts, decode or sample one program with each prompt, and use MBR-EXEC to select the best program. For sampling with temperature 0.3, we repeat the process five times and report the average performance and standard deviations. The dataset-specific metrics can be found in §4.1. The best number in each row is in boldface. Note that the greedy performances are different from those reported in Table 2, as here we perform MBR-EXEC over the greedy decoding outputs, while Table 2 reports the average performance.

Table 4 :
MBPP example prompt and response from Codex: we use the first assertion in the dataset as the extra information (i.e., [INFO] in Table 1).