Execution-Based Evaluation for Open-Domain Code Generation

To extend the scope of coding queries to more realistic settings, we propose ODEX, the first Open-Domain EXecution-based natural language (NL) to Python code generation dataset. ODEX has 945 NL-Code pairs spanning 79 diverse libraries, along with 1,707 human-written test cases for execution. Our NL-Code pairs are harvested from StackOverflow forums to encourage natural and practical coding queries. Moreover, ODEX supports intents in four natural languages: English, Spanish, Japanese, and Russian. ODEX unveils intriguing behavioral differences among top-performing code language models (LMs). While CODEX achieves better overall results, CODEGEN improves effectively via scaling -- CODEGEN 6.1B performs comparably with CODEX 12B. Both models show substantial gaps between open and closed domains, but CODEGEN gaps tend to decrease with model size while CODEX gaps increase. We release ODEX to facilitate research into open-domain problems for the code generation community.


Introduction
Evaluations of NL-to-code generation systems, especially for general-purpose programming languages such as Python, have put an increasing emphasis on methods that execute code to verify the results. The predominant approach for creating such test sets is to manually write test cases for canonical code solutions (Chen et al., 2021; Austin et al., 2021; Lai et al., 2022; Huang et al., 2022). The correctness of model predictions is then evaluated by checking whether the generated code passes the test cases (Chen et al., 2021). Compared to execution-free metrics such as text match against reference solutions, execution-based methods more rigorously assess the functional correctness of code (Hendrycks et al., 2021; Chen et al., 2021).
We analyze and highlight three aspects of ODEX (§3). First, ODEX has broad domain coverage of 79 libraries, with 53.4% of the problems employing at least one library. Second, ODEX contains queries in four different languages, with 439, 90, 164, and 252 samples in English, Spanish, Japanese, and Russian, as shown in Figure 1. Third, ODEX addresses three unique challenges in open-domain code execution: irreproducible runs (Figure 1a), randomized outputs (Figure 1b), and specialized equivalence checks (Figure 2).
We evaluate two state-of-the-art code LLM families, CODEX and CODEGEN, on ODEX (§5). Our study shows that larger model sizes and augmented training data improve execution accuracy. Meanwhile, we observe satisfactory multilingual capabilities, even though neither model was specifically designed for multilingual usage. However, we find that models face greater, yet varied, challenges with open-domain queries than with closed-domain queries (§5). Specifically, CODEX achieves higher overall results, while CODEGEN presents better parameter efficiency and more balanced open-closed domain performance as model size scales up. By comparing the execution-based metric with a series of execution-free metrics (§6), we further confirm the advantage of execution in allowing alternative solutions, but also show the potential of lexical metrics to identify simple bug fixes. ODEX jointly facilitates practical open-domain code generation and execution-based evaluation. It serves as a comprehensive data benchmark for NL-to-code systems, supporting diverse NL contexts, library usage, and evaluation methods. By addressing the unique challenges of test creation and execution, we hope to lay a foundation for evaluating open-domain code via execution.

The ODEX Dataset
In this section, we describe our four-step process of constructing the ODEX dataset. We first collect resources of natural, open-domain coding queries (§2.1). Next, we establish the annotation standard and procedures for test case creation (§2.2). We then describe the annotator hiring and working processes (§2.3). Finally, we conduct checks to ensure data quality (§2.4).

Resource Collection
We take two NL-to-code datasets, CoNaLa (Yin et al., 2018) and MCoNaLa (Wang et al., 2023), as sources for ODEX, and refer to them together as (M)CoNaLa. Their NL-Code pairs are collected from StackOverflow, which contains abundant coding queries that (1) naturally reflect practical program usage, and (2) cover diverse domains as measured by the libraries used. These properties align well with our main focus on open-domain queries. (M)CoNaLa further proofreads and clarifies its NL intents using human annotators to ensure data quality.

Annotation Standard and Procedures
Given each source NL-Code pair, our main annotation task is to write test cases to check code execution correctness, as illustrated by the four steps in Figure 2. A qualified test case should verify the main functionality of the canonical code solution.
In cases where annotators do not understand the language of the intent, we use translation tools such as the Google Translate API.

[Figure 2: the annotation steps, illustrated on the intent "Calculate sum over all rows of 2D numpy array `a`": Step 1 code wrapping, Step 2 library import, Step 3 test case writing.]

Step 1: Wrapping Snippets into Functions Code solutions in (M)CoNaLa are often short snippets (e.g., x = np.zeros(5)) to ensure more precise matches with NL intents, but to be executable they often need additional context such as variable assignments. We therefore wrap code into standalone functions by specifying input and output arguments as contexts. For example, Step 1 in Figure 2 identifies variable a as an input argument.
Step 2: Specifying Library Prerequisites Due to the open-domain coverage of (M)CoNaLa, some code snippets require extra library imports to execute correctly. Accordingly, our second step is to specify the prerequisite libraries for code solutions.
Step 3: Test Case Annotation Next, we write test cases that contain three parts: (1) input: passing values to input arguments, (2) output: stating expected execution outputs, and (3) assertion: checking if execution results match the expected outputs.
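As a minimal sketch of these three parts, here is a hand-written test case for Figure 2's running example; the wrapper name `f` and the concrete values are illustrative, not taken from the dataset:

```python
import numpy as np

# canonical solution from Step 1, wrapped with `a` as the input argument
def f(a):
    return a.sum(axis=1)

# (1) input: a concrete value for the input argument
inp = np.array([[1, 2], [3, 4]])
# (2) output: the expected execution result
expected = np.array([3, 7])
# (3) assertion: check the execution result against the expected output
assert np.array_equal(f(inp), expected)
```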
However, test case creation for open-domain code faces three challenges. First, safe and reproducible execution can be hard to achieve. As in Figure 1a, it is impractical to send an HTTP request when evaluating this sample. Instead, we use mock to simulate the output (a success response status code 200). Second, some code snippets entail randomness (e.g., random.randint(3,5)) and have no definite output value. We instead make bounding assertions, e.g., checking that all elements are integers within the range [3, 5]. Third, standard equivalence checks via == may be invalid, since library-specific objects often require specialized equality checks. For example, checking the equivalence of two NumPy arrays a and b uses np.array_equal(a,b), while a == b would cause execution errors.
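Minimal sketches of the three workarounds follow; the fake HTTP client and variable names are illustrative, using only the standard library's mock plus NumPy:

```python
import random
from unittest import mock

import numpy as np

# (1) irreproducible runs: mock the HTTP call instead of hitting the network
http = mock.Mock()
http.get.return_value.status_code = 200
assert http.get("http://example.com").status_code == 200

# (2) randomized outputs: bound the value range instead of asserting equality
samples = [random.randint(3, 5) for _ in range(100)]
assert all(isinstance(s, int) and 3 <= s <= 5 for s in samples)

# (3) specialized equivalence: `a == b` on arrays is element-wise, so
# truth-testing the result raises; np.array_equal yields a single boolean
a, b = np.arange(3), np.array([0, 1, 2])
assert np.array_equal(a, b)
```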
Step 4: Self-Verification In the last step, we perform self-verification to efficiently ensure annotation quality. We execute the canonical code solution on each newly created test case; a test case is taken as a valid annotation only if the canonical solution passes it.
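The self-verification loop can be sketched as a small harness (hypothetical, not the authors' exact tooling): each test asserts on the solution's output, and only passing tests are kept.

```python
def self_verify(solution, tests):
    """Keep only the test cases that the canonical solution passes."""
    valid = []
    for test in tests:
        try:
            test(solution)       # each test asserts on the solution's output
            valid.append(test)
        except AssertionError:
            pass                 # a failing test must be revised, not kept
    return valid

# usage: one valid and one faulty test for a sorting snippet
def good_test(f):
    assert f([2, 1]) == [1, 2]

def bad_test(f):
    assert f([2, 1]) == [2, 1]   # wrong expectation

kept = self_verify(sorted, [good_test, bad_test])
assert kept == [good_test]
```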

Annotator Hiring and Task Fulfillment
As our data involves diverse functionalities from multiple libraries, our annotation task holds a relatively high standard for annotators. A qualified annotator should be proficient in Python and common libraries, and in writing workable test cases.
We chose to hire undergraduate students with strong computer science backgrounds in Python. From the 20 applicants, we first conducted a resume screening to filter candidates with sufficient programming experience. Next, we gave each candidate an annotation test with five randomly selected NL-Code pairs. Since the test mirrors the official annotation process, we provided clear instructions about each step (as in §2.2) and code scripts for self-verification. Candidates were asked to finish their tests within three calendar days. Based on their test performance, we hired four candidates to officially participate in this job.

Quality Check
We put great effort into ensuring data quality throughout the annotation process. To help annotators write workable test cases more efficiently and accurately, we require them to execute each written test case using the verification code that we provided, and to explicitly report whether the canonical code solution passes all the test cases they created.
After the annotation, the authors performed post-hoc verification to check that each test case reads reasonably and executes correctly. In our final rounds of automatic quality checks, we confirm that the pass rate of all canonical code solutions over their annotated test cases is 100%.
We collect a total of 945 samples with NLs in four languages, including 439 samples in English, 90 in Spanish, 164 in Japanese, and 252 in Russian.

Diversity
One unique property of ODEX is its broad domain coverage. We categorize code that entails library usage (both built-in and third-party) as being in the open domain, and code with no library usage as in the closed domain. Different libraries often serve specific functions and have unique capabilities. For instance, the datetime library is designed to handle date/time operations, while other libraries focus on various other fields such as data analysis or web requests. Therefore, in this work, we view the diversity in libraries as a representation of distinct domains.

Complexity
To measure dataset complexity, we first calculate the lengths of NL intents and code snippets. We tokenize NL intents with the spaCy tokenizers in the respective languages; we follow Yin and Neubig (2018) to tokenize code. For code, we also parse the AST using the Python standard ast library, and count the number of input and output variables to quantify the complexity of execution contexts. English samples tend to require simpler execution environments, which could stem from the relative simplicity of SO queries asked in English.

Execution Support
We systematically compare code generation datasets that concern execution or open-domain code in Table 3. ODEX is the first dataset that supports execution-based evaluation for open-domain code. While ODEX does not have the largest number of test cases, we discuss in §7 how these test cases can still reliably measure code correctness.

Experiment Setup
Code LLMs have achieved strong results on multiple code generation tasks, yet their open-domain proficiency is understudied due to the limited domain settings of past datasets. To examine model capabilities in the open domain, we evaluate two top-performing model families, CODEX and CODEGEN, on ODEX. We perform evaluations in a prompting setting, without finetuning any model.
We introduce the baseline models and the prompt settings, and lay out the metrics for evaluation. We evaluate the CODEGEN family (Nijkamp et al., 2023) alongside CODEX. Evaluation Metrics We follow Chen et al. (2021) and measure execution accuracy using the pass@k metric, computing the fraction of problems with at least one correct prediction within k samples. We also compare it with a series of execution-free metrics later in §6.
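The pass@k metric is usually computed with the unbiased estimator from Chen et al. (2021): given n generated samples per problem of which c pass all tests, the per-problem estimate is 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generated samples, c: samples passing all tests, k: budget.
    """
    if n - c < k:   # fewer than k failures: every size-k draw contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(10, 10, 1) == 1.0      # all samples correct
assert pass_at_k(10, 0, 5) == 0.0       # no sample correct
assert abs(pass_at_k(2, 1, 1) - 0.5) < 1e-9
```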

Experiment Results
We first present the overall performance of the two model families on ODEX (§5.1). Next, given the unique challenges of open-domain code, we study the variances between open- and closed-domain problems (§5.2), and in individual domains (§5.3).

Baseline Performance CODEX Results
As shown in Table 4, aligning with existing works and our intuition, the larger DAVINCI 175B models outperform the smaller CUSHMAN 12B model, and the 002 version improves over 001. This trend holds for all languages and all sampling sizes. Somewhat surprisingly, all models attain decent results on non-English problems, even though CODEX is not designed for multilingual use. This high accuracy on non-English problems suggests the multilingual potential of CODEX models.

CODEGEN Results
We report results of MONO models in Table 4 given their superior performance over NL and MULTI variants (Nijkamp et al., 2023).
The pass rate increases as CODEGEN grows from 350M to 2.7B, and continues to increase in non-English languages when further scaling to 6.1B. CODEGEN exhibits multilingual capacity, as its results on non-English subsets are close to those on English, and consistently increase during scaling.
Although CODEX and CODEGEN have comparable performance on existing datasets such as HumanEval, ODEX effectively unveils the efficacy of CODEGEN on open-domain coding queries even with many fewer parameters: CODEGEN 6.1B yields similar pass@1 to the 175B CODEX DAVINCI-001 model, although not necessarily so when k increases. More detailed results (pass@k at 1 ≤ k ≤ 10) for both models are in §B.

Gaps between open and closed domains slightly shrink in Spanish, but increase in English and Japanese. While DAVINCI-002 performs the best, it also exhibits the most severe gaps. These findings suggest that common practices to improve LLMs may not address the complexities inherent in open-domain coding problems. It is hence imperative that more advanced strategies are employed.

CODEGEN Results
As shown in Figure 7 (right), CODEGEN also has substantial gaps between open and closed domains, though smaller than the CODEX gaps across all languages, by 6.0 points on average. As model size increases from 2.7B to 6.1B, the gaps reduce by about 6.3 points in English and 1.7 points in Spanish. This is in contrast to CODEX, whose gaps continue to increase by 4.9 points on average when scaling up to DAVINCI-002, indicating that scaling up CODEGEN more effectively catches up on open-domain performance.

Domain Variance
We now dive deeper into the results within individual domains. We focus on the CODE-DAVINCI-002 model as it has the best performance across all models. In Figure 8, we plot accuracy with respect to the domain frequency, as approximated in §3.1.
Execution accuracy is not low in all open domains. For example, CODE-DAVINCI-002 achieves 50% pass@1 for several common libraries such as random and math. But high domain frequency does not ensure model proficiency. For example, on libraries with complex functionalities such as matplotlib and tensorflow, pass@1 can go below 10%. See §C for more domain-wise results.

Comparing to Execution-Free Metrics
In this section, we study the alignment between execution-based evaluation and five execution-free metrics, identifying advantages for both types.
Model Ranking Using Different Metrics We evaluate models using five execution-free metrics based on lexical, syntax, and semantic matches: BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), ChrF (Popović, 2015), and CodeBLEU (Ren et al., 2020). Refer to §D.1 for more descriptions. We analyze using CODEX, given its better performance. As shown in Figure 9, model rankings by execution-free metrics do not precisely correlate with their rankings by execution accuracy.
Even when the rankings align, the differences are largely not proportional. Comparing the metrics, ChrF and METEOR have smaller inter-model variances, while BLEU and ROUGE change more and correlate better with pass rates. Notably, CodeBLEU is low in most settings and might not be suitable for evaluating snippet-style code.

Metric Correlation
We next evaluate whether execution-free metrics can be used to discriminate between passed and failed samples. We take BLEU as an example since it shows ranking patterns similar to execution. Figure 10 shows negligible variances between the BLEU scores of passed and failed groups. The other four metrics exhibit similar patterns, as shown in §D.3.

What Affects Model Performance?
Besides differences in model configurations, we study three factors that might affect performance.

Number of In-Context Examples
Models might benefit from example NL-Code pairs. We thus explore few-shot prompting by prefixing N ∈ {1, 2, 3} input-output pairs in prompts. Figure 11 (right) shows that injecting as few as one exemplar test case significantly improves execution accuracy, yet adding more cases brings little further gain. This potentially implies that one test case is sufficient to show the main functionality.
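The few-shot prompt construction can be sketched as follows; the comment-based format and the helper name are illustrative, not the exact prompt template used in the paper:

```python
def build_prompt(exemplars, intent, n_shots=1):
    """Prefix N input-output (NL, code) pairs before the query intent."""
    parts = []
    for ex_nl, ex_code in exemplars[:n_shots]:
        parts.append(f"# {ex_nl}\n{ex_code}\n")
    parts.append(f"# {intent}\n")    # the model completes the code from here
    return "\n".join(parts)

prompt = build_prompt([("sort list `x`", "x = sorted(x)")], "reverse list `x`")
```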

Number of Evaluation Test Cases
Execution results could be more reliable when using more test cases for evaluation. However, there is a trade-off between evaluation effectiveness and annotation efficiency, due to the high cost of human effort. To study this trade-off, we observe how results change with respect to the number of tests. Compared to using all cases by default, we also try using one randomly selected case. For simplicity, we do not include any test cases in prompts.
As shown in Figure 12, evaluating on one random test largely preserves the accuracy of using all tests, indicating that one case is sufficient to test the main functionality for most queries. Check §E for analysis of other factors such as function naming.

Related Work
[Figure 12: pass@1 when executing one or all test cases.]

Open-Domain Code Generation Programs often use APIs from different Python libraries. Some datasets preserve natural coverage from interactive Jupyter Notebooks (Agashe et al., 2019) or StackOverflow posts (Yin et al., 2018; Wang et al., 2023), but face challenges in enabling execution (Lai et al., 2022; Chandel et al., 2022). Our ODEX dataset addresses execution for open-domain code.
Execution-Based Evaluation Evaluation by execution has long been used for SQL (Zhong et al., 2017) or logical forms (Dong and Lapata, 2016). Many datasets have begun to support Python execution via test cases, but focus on built-in functions (Chen et al., 2021; Austin et al., 2021; Hendrycks et al., 2021) or specific domains (Lai et al., 2022; Huang et al., 2022). Our test cases, in contrast, cover diverse libraries in the open domain.

Conclusion
We present ODEX, an open-domain code generation dataset supporting execution-based evaluation via human-written test cases. ODEX not only supports execution-based evaluation of code using test cases, but also extends the task to the open domain, covering 79 diverse Python libraries and four natural languages (English, Spanish, Japanese, and Russian). Comparing two state-of-the-art code generation models, CODEX and CODEGEN, our dataset effectively unveils their varied behaviors across program domains and language contexts. ODEX serves as a comprehensive NL-to-code benchmark given its open-domain coverage, multi-natural-language queries, and multi-metric support. When bringing code execution to open-domain scenarios, our explorations also reveal emerging challenges in test creation and reliable execution, which we hope our dataset will enable future work to tackle.

Limitations
ODEX aims to serve as a comprehensive testbed, by enabling execution-based evaluation of code in the open domain, with flexible intent inputs in four natural languages. However, we should maintain continuous awareness of execution security, multilingual support, and evaluation reliability.
First, execution support in ODEX enables more rigorous evaluations than execution-free methods. However, due to the increased complexity of open-domain code, more inspection is required for execution safety, for both code solutions and test cases. We should always stay alert to avoid concealing malicious code (Wallace et al., 2021) or generating code with security vulnerabilities (Verdi et al., 2020; Pearce et al., 2021).
Second, in addition to English inputs, ODEX also includes intents specified in three other languages. Still, its language coverage is bounded by the available forums on StackOverflow. We hope our initiative can highlight the multilingual nature of program developers, encourage the emergence of similar data resources in other languages, and continuously promote AI programming assistance in languages worldwide.
Third, as ODEX covers wide-ranging code queries in the open domain, it is more suitable for less resource-demanding scenarios such as downstream evaluation or few-shot learning. Although ODEX is larger than many previous datasets with human-written test cases, its size is still limited due to the intense human effort required by the curation process. We therefore encourage users of the dataset to conduct significance testing (Dror et al., 2018) and report more substantial model improvements.

Ethics Statement
Our work has received IRB approval and is licensed under a Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 International License. The resulting ODEX dataset is built to serve as a benchmark for open-domain code generation, to further facilitate technological advances in AI programming assistance, meanwhile supporting multiple languages to encourage its universal accessibility.
We strive to ensure high data quality and optimize annotation efficiency. We build the ODEX dataset with natural and practical StackOverflow resources and hire annotators with qualified programming proficiency. We provide our annotators with clearly documented instructions, flexible annotation interfaces (Google Sheets, Jupyter Notebooks), and self-verification tools. We (the authors) conduct pilot annotation to confirm the clarity of the annotation standards and the feasibility of the annotation task. We conduct post-hoc examinations of the annotation results, both manually and automatically, to ensure data quality (100% pass rate).
We respect the contribution and privacy of our annotators. We offer competitive remuneration for their annotation work and treat each one of them fairly. All annotators possess the right to withdraw at any time. We ensure that all their personal information is removed before public release.
We conduct systematic analysis from multiple perspectives in the paper, in an attempt to foster public awareness of generating and evaluating programs in the open domain, both by encouraging more advances in this direction and by raising concerns about the robustness and security of such unique coding problems.

A.1 Library Distribution Statistics
Aside from the illustrations in § 3.1, we list out the detailed statistics of libraries in ODEX, the eight comparison datasets, and the approximated natural distribution.

A.2 More Annotation Details
Along with the NL-Code pair, we also provide the ID of the source StackOverflow post, with which annotators can trace back to the original post webpage and gain a better understanding of the question. If any errors or under-specification are spotted in the given NL or code, we ask the annotators to correct them by making the minimal change possible.
We encourage the annotators to use the language of the given NL intent when creating the test cases, especially if the code involves string-related operations (e.g., writing regular expressions in Japanese). We also encourage the annotators to write reasonably many and diverse test cases, by varying the values or types of variables.
Please find the full instructions and examples for annotation in our code repository.

B Baseline Results
Following the baseline results in §5.1, we provide more detailed evaluation results on the execution pass rate, ranging from the top-1 to top-10 model predictions. Table 8 and Table 9 show the zero-shot execution accuracy of CODEX and CODEGEN models, respectively.

C Domain-Wise Execution Results
We list out detailed results for experiments in §5.

C.1 Open Domain Versus Closed Domain
Table 10 and Table 11 show the execution accuracy of CODEX and CODEGEN on open-domain and closed-domain problems, respectively.

C.2 Domain-wise Execution Accuracy
As introduced in § 5.3, we take CODE-DAVINCI-002, and report its execution accuracy on each domain in Table 12.

C.3 Qualitative Error Analysis
To provide more intuitive explanations of the aforementioned domain divergence, we conduct error analysis over 60 randomly selected examples from the ODEX dataset (15 for each language). By examining the error patterns of these examples, we aim to answer: what are the common error types on open- and closed-domain problems, and what are the main differences between them? As in the previous section, we take CODE-DAVINCI-002 since it scores the best and presents clear domain gaps, which might yield more intuitive variances between domains.

Closed-Domain Errors
Of the 60 random samples we analyzed, 31 are closed-domain problems, and CODEX predicts erroneous code solutions for 22 of them. We identify four main types of errors from these samples: (1) 11 cases (50.0%) use Python built-in functions incorrectly, mostly in string manipulation and number calculation; (2) 7 cases (31.8%) fail at complex functions, which usually require multi-step implementations; (3) 4 cases (18.2%) receive empty predictions, potentially because they involve topics unfamiliar to the model; (4) 2 cases (9.1%) import extra libraries or add redundant implementations.
Note that the number of error cases in these four categories does not add up to 22. Since we analyze all erroneous predictions among the model's top-10 predictions, one case can present multiple error types across its different predictions.
Open-Domain Errors Of the other 29 problems belonging to the open domain, 26 have erroneous predictions. Errors in the open domain exhibit more diversity than in the closed domain. The major error, covering 16 cases (61.5%), is the failure to use the prerequisite libraries, or missing some of them when multiple libraries are involved. The next major type is using incorrect functions, which happens in 9 cases (34.6%). Similarly to the closed-domain errors, 5 cases (19.2%) use the correct functions erroneously, 4 cases (15.4%) struggle with complex multi-step implementations, and 3 cases (11.5%) produce empty predictions.
Open- and closed-domain problems share some error categories such as function misuse and complex operations. Nonetheless, open-domain problems introduce extra challenges: correct selection and usage of libraries and functions in the wild.

D Evaluation Metrics
We describe each of the non-execution metrics (§D.1) as introduced in §6, report model performance with each (§D.2), and visualize their correlations with execution accuracy (§D.3).

D.1 Metric Description
BLEU BLEU (Papineni et al., 2002) is a lexical-based evaluation metric that calculates the n-gram overlap between a text prediction and (multiple) references. Most default implementations calculate up to 4-grams and adopt the smoothing function introduced in Lin and Och (2004).

ROUGE ROUGE (Lin, 2004) is another, more recall-oriented, lexical-based evaluation metric. It was originally designed for measuring text summarization, mainly by counting the number of overlapping units (n-grams, word sequences, and word pairs) between prediction and references. Among the multiple variants proposed (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S), we use the most common ROUGE-L in our experiments.
METEOR METEOR (Banerjee and Lavie, 2005) is a unigram-based metric originally intended for machine translation.It builds on a generalized unigram concept by involving unigram precision, unigram recall, and word order measures.
ChrF ChrF (Popović, 2015) targets lexical match at the character level, calculating the character-level n-gram F-score between predictions and references. ChrF was also originally proposed for the machine translation task, but was later adopted in some code evaluation works (Evtikhiev et al., 2023).
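To make the definition concrete, here is a toy re-implementation of the ChrF idea (character n-gram F-score with recall weight β = 2); the official sacreBLEU implementation differs in details such as whitespace handling and default n-gram orders:

```python
from collections import Counter

def chrf(pred, ref, max_n=6, beta=2.0):
    """Toy character n-gram F-score in the spirit of ChrF (Popović, 2015)."""
    def ngrams(s, n):
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))

    precs, recs = [], []
    for n in range(1, max_n + 1):
        p, r = ngrams(pred, n), ngrams(ref, n)
        if sum(p.values()) and sum(r.values()):
            overlap = sum((p & r).values())      # clipped n-gram matches
            precs.append(overlap / sum(p.values()))
            recs.append(overlap / sum(r.values()))
    if not precs:
        return 0.0
    P, R = sum(precs) / len(precs), sum(recs) / len(recs)
    if P + R == 0:
        return 0.0
    # F-score weighting recall beta^2 times as much as precision
    return (1 + beta ** 2) * P * R / (beta ** 2 * P + R)
```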
CodeBLEU CodeBLEU (Ren et al., 2020) is specifically designed for code evaluation, jointly considering surface-form match, syntactic similarity, and semantic data flows.

D.2 Evaluating with Non-execution Metrics
Table 13 and Table 14 show the scores of CODEX and CODEGEN using non-execution metrics.

D.3 Visualizing Metric Correlations
Following the discussion in §6, we visualize the non-execution metric scores of samples that pass and fail at execution time.
All experiments use CODE-DAVINCI-002 predictions for evaluation.

D.4 Why is Execution Better?
To give more intuitive reasons for the advantages of execution, we randomly sample 15 cases from each language subset and identify two major benefits: execution tolerates alternative solutions and allows execution results as outputs.
Alternative Code Implementation Probably the greatest advantage of execution is that it only requires correct execution results, without restricting alternative methods, as in Figure 17.

Directly Generating Execution Results Another interesting category is directly generating the code execution results instead of the implementation steps. This often happens with simple coding queries such as basic string manipulation, where predicting the results might cost the model similar effort to producing the programmatic solution. In Figure 18, instead of the string decoding program, the model directly outputs the result string "JLK". While this is somewhat unexpected under the NL-to-code task, execution effectively handles such cases and would judge them as correct.

D.5 Potential Benefit of Lexical-based Metrics
Lexical-based metrics, although relatively ineffective for judging functional correctness, can still be helpful for debugging and interpretation. They are effective for small errors of two types: (1) a single function misuse and (2) slight variance in complex strings. A high lexical match in such cases indicates less effort for fixing (Deng et al., 2021).
Function Misuse Some code predictions are correct except for a single place where a wrong function is used or an argument is misplaced. For example, in Figure 19, the code imports the library and copies all strings correctly, but it uses the wrong function match instead of the correct findall. Although the execution fails, the code is similar to the solution. Given the sign of a high BLEU score of 92.5, we could readily spot such similarities and fix them with simple edits.
String Difference Another frequent error concerns string copying, where the code calls the correct functions but copies the string differently.
The example in Figure 20 gets a 100.0 BLEU score, but the string inside actually misses a single whitespace, which the BLEU tokenization would discard. Such code also resembles the solution and could easily be fixed even by rule-based methods.
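As a toy illustration: whitespace splitting collapses runs of spaces, so an extra or missing space inside a string literal is invisible after tokenization (here `str.split` stands in for BLEU's whitespace tokenizer):

```python
pred = "print('a  b')"   # double space inside the literal: fails execution tests
ref = "print('a b')"     # canonical single space

assert pred != ref                   # the difference matters at execution time
assert pred.split() == ref.split()   # but vanishes after whitespace tokenization
```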

E Ablation Studies
This section provides the results tables according to each ablation study section in § 7.

E.1.2 Number of Input Test Cases
Table 18 shows the effects on execution accuracy of adding one or more test cases to prompts. Experiments use CODE-DAVINCI-002 as an example.
We conjecture that the gain brought by whitespace stripping comes from better distributional alignment with the CODEGEN training data. As CODEGEN might be pre-trained on whitespace-stripped text sequences, inputs without trailing whitespace are potentially more aligned with them, hence resulting in better test-time performance. Meanwhile, note that the tokenization processes for text (natural language) and code (programming language) differ in whitespace-style tokens such as \n or \t. These tokens are removed by text tokenizers by default, but preserved by code tokenizers since they carry structural information in code.

E.2 Number of Evaluation Test Cases
Table 21 shows the effect when using different numbers of test cases for execution-based evaluation.

E.3 Semantics of Function Names
Because code is wrapped into functions to enable execution, how functions are named may affect model predictions. By default, we name functions using the post ID (e.g., f_3844801), which expresses little semantics of the query. We thus try two other methods: (1) a constant function name; and (2) summary phrases from NL intents, e.g., find_max_value.
For (2), we conduct heuristic phrase extraction. We first split the NL intent into words by whitespace, then remove stop words ('in', 'of', 'a', 'to', 'and', 'for', 'with', 'that') and meaningless punctuation, and lastly concatenate the first M = 4 words with '_'. For example, given the intent "decode a hex string '4a4b4c' to UTF-8", the resulting function name would be "decode_a_hex_string". However, for languages that do not separate words with whitespace, this approach may produce less meaningful strings, hence contributing to the inferior performance shown below.
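This heuristic can be sketched as below; the punctuation stripping is our guess at what counts as "meaningless punctuation", so the exact output may differ from the paper's example:

```python
STOP_WORDS = {'in', 'of', 'a', 'to', 'and', 'for', 'with', 'that'}

def intent_to_name(intent, max_words=4):
    """Summarize an NL intent into a snake_case function name."""
    # split on whitespace, strip surrounding quotes/punctuation
    words = [w.strip("'\".,;:!?`") for w in intent.split()]
    # drop stop words and empty tokens, keep the first max_words
    words = [w for w in words if w and w.lower() not in STOP_WORDS]
    return "_".join(words[:max_words])

name = intent_to_name("decode a hex string '4a4b4c' to UTF-8")
# keeps the first four content words: decode, hex, string, 4a4b4c
assert name == "decode_hex_string_4a4b4c"
```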
To fairly compare with previous results, we do not add test cases in prompts.
From Figure 21 and Table 23, using more semantically meaningful function names barely improves over the default setting. Intuitively, names summarized from intents add no extra semantics, but may lose information during extraction, both contributing to the performance drop.
To further address the train-test data overlap issue in the CODEX and CODEGEN models, we additionally evaluate the SantaCoder (Allal et al., 2023) and StarCoder (Li et al., 2023) models, which have not been trained on any StackOverflow data. Table 24 shows the pass@1 of the 16B StarCoder and StarCoderBase models; both models show significant gaps between open- and closed-domain queries, a pattern that the broad domain coverage of ODEX makes measurable.

G Related Work
Code Evaluation via Execution Execution-based evaluation has long been adopted for domain-specific programming languages such as SQL queries (Zhong et al., 2017) or logical forms (Dong and Lapata, 2016). This execution-based paradigm was not introduced to general-purpose languages until recently, by the HumanEval dataset (Chen et al., 2021), where human-written test cases are provided for code execution. Many later works follow this approach, but focus more on closed-domain settings (Austin et al., 2021; Hendrycks et al., 2021) or specific libraries of interest (Lai et al., 2022; Huang et al., 2022). Toward broader execution environments, we provide executable test cases for as many as 79 libraries.
Coding Queries Versus Programming Challenges Programs from different sources serve different purposes. Coding contest websites such as LeetCode and Codeforces have been used to build many code generation benchmarks (Hendrycks et al., 2021; Li et al., 2022). However, they rarely align with how humans program in practical scenarios. To build datasets with natural and practical usage of code, many works use GitHub Jupyter Notebooks (Agashe et al., 2019; Huang et al., 2022) and StackOverflow forums (Yin et al., 2018; Wang et al., 2023; Lai et al., 2022) as sources of naturally-occurring code. We retain this naturalness by using StackOverflow posts, but uniquely draw from forums in various languages to also assist programmers worldwide.
Test Case Creation While most benchmarks use Python test cases annotated by human programmers (Chen et al., 2021; Nijkamp et al., 2023; Lai et al., 2022), challenge-style datasets take a more direct approach by crawling test cases from the web (Hendrycks et al., 2021; Li et al., 2022). Another line of work generates test cases automatically based on the Python grammar (Lukasczyk and Fraser, 2022), but is largely limited to basic Python functions. Some propose to leverage the power of neural LMs (Tufano et al., 2021; Li et al., 2022), even jointly considering solution and test case generation (Chen et al., 2023). However, the quality and diversity of such test cases are not robustly ensured. We hence use high-quality human-written test cases for ODEX evaluation.

Figure 1 :
Figure 1: Examples in the ODEX dataset. Inputs on the left are function-formatted with (1) library import expressions; (2) function signatures that declare the function name and input arguments; and (3) natural language intents as part of the docstrings (English translations are not included in the actual non-English inputs during inference). Gray boxes indicate places for code solutions. As shown on the right, a code LM fills the gray boxes with code solutions, which are then executed on the unit tests underneath. Notably, writing unit tests for open-domain queries is often more challenging: a requires simulated execution due to the difficulty of reproduction; b is verified through approximate equivalence. Prior work focuses more on basic assertions, as in c and d.

Figure 2 :
Figure 2: An example annotation comprising four steps.

Figure 5 :
Figure 5: Approximated natural distribution based on GitHub Python files in the open domain.
The CODEX Family At the time of this work, CODEX had three publicly available models. CODE-CUSHMAN-001 (C1) is a 12B CODEX model described in Chen et al. (2021). CODE-DAVINCI-001/002 (D1, D2) are two 175B GPT-3 models.
Prompt Design For a fair comparison, we use the same prompt for both model families. While prompting with few-shot in-context examples may improve performance, our experiments do not find this consistently helpful for both models. Therefore, we report zero-shot results as baselines and leave few-shot results to §7. Creating zero-shot prompts requires only content from the test sample. Following Chen et al. (2021), we construct prompts by concatenating the function context and a docstring; the docstring includes the NL intent and optional unit tests (compared in §7). Figure 6 shows an example prompt.
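The zero-shot prompt construction can be sketched as below. The exact spacing and docstring layout are assumptions on our part, since only Figure 6 shows the precise format:

```python
def build_zero_shot_prompt(imports, signature, intent, test_case=None):
    # Concatenate library imports, the function signature, and a docstring
    # holding the NL intent plus (optionally) one unit test.
    docstring = intent
    if test_case is not None:
        docstring += "\n\n    " + test_case
    return f'{imports}\n{signature}\n    """{docstring}\n    """\n'
```

For example, `build_zero_shot_prompt("import math", "def f_123(x):", "Compute the square root of x.", "assert f_123(4) == 2.0")` yields a function header whose docstring carries both the intent and one test.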

Figure 6 :
Figure 6: Zero-shot prompt with one test case in docstring.The gray box notes the place for code solution.
CODEX Results Figure 7 (left) shows pass@1 on open- and closed-domain problems. All CODEX models score much lower in the open domain than in the closed domain. Such large gaps hold across all languages, ranging from 4.34 in Spanish to 38.57 in Japanese for the best DAVINCI-002 model. Model upgrades (C1 → D1 → D2) do not always reduce the gaps.

Figure 7 :
Figure 7: CODEX (left) and CODEGEN (right) pass@1 on open-and closed-domain problems in each language.

Figure 8 :
Figure 8: CODEX pass@1 for domains of varied frequency. Domains are colored by frequency ranking: the 10 most frequent domains in red, the 10 least frequent domains in blue, and the remaining domains in yellow.

Figure 10 :
Figure 10: BLEU scores on passed and failed samples.

Figure 13:
Figure 13: ROUGE on passed and failed samples.

Figure 17 :
Figure 17: An alternative yet correct prediction that scores only 4.8 BLEU due to little lexical overlap with the canonical solution.

Figure 18 :
Figure 18: An example output with a correct execution result, yet achieving only 0.6 BLEU.

Figure 19 :
Figure 19: An example where the model prediction uses the wrong function, yet attains a very high BLEU score of 0.925.

Figure 20 :
Figure 20: An example where the model prediction differs slightly in copied strings, yet scores 100.0 in BLEU.

Open Domain Code Generation Code written in general-purpose programming languages often uses classes or functions from external libraries. A few datasets for code generation preserve this open-domain nature. The CONCODE (Iyer et al., 2018) dataset tested generation of Java class methods. Later works target Python generation given the interactive context of Jupyter Notebooks (Agashe et al., 2019) or natural language intents from StackOverflow posts (Yin et al., 2018; Wang et al., 2023). Despite their natural coverage, enabling open-domain code execution has faced great challenges given its diversity and complexity (Lai et al., 2022; Chandel et al., 2022). To address this issue, ODEX provides test cases as code execution contexts for evaluation.

Table 1 :
Number of open-and closed-domain examples, and number of libraries involved in each language.

Table 1
reports domain statistics and Figure 3 shows the library distribution. ODEX covers a diverse set of 79 libraries, which vary per language. A majority of samples (53.4%) use at least one library.

Table 3 :
Comparing ODEX with other NL-to-code generation datasets in terms of domain diversity (Domain), test-case execution support (Evaluation, Avg. Test Cases), and natural language contexts (NL). Since it is hard to compute the exact number of libraries for some open-domain datasets that do not explicitly import required libraries in the code, we mark their domains as open instead of giving an exact count.

Table 4 :
Table 4: Execution accuracy of CODEX and CODEGEN-MONO models.
Implementation Details We follow Chen et al. (2021) and use nucleus sampling (Holtzman et al., 2020) with top-p set to 0.95 and temperature set to 0.8. We set outputs to a maximum of 512 tokens.
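Given n sampled candidates per problem of which c pass, pass@k is computed with the unbiased estimator of Chen et al. (2021):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least
    # one of k candidates drawn (without replacement) from n passes.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```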
As shown on the left, for CUSHMAN-001 and DAVINCI-001, few-shot examples yield a clear improvement over the zero-shot setting; but for the strongest DAVINCI-002, they bring minimal gains in English. See similar results for other languages in §E.1.
Number of Test Cases in the Docstring Including test cases in inputs adds execution hints about the expected functionality of the solution, and hence may improve execution accuracy. We test this hypothesis by experimenting with prompts that contain varying numbers of test cases. Besides the default setting with zero tests, we compare adding one random test case and adding all annotated test cases.

Table 5 :
Table 5: ODEX library distribution. The table lists the number and percentage of occurrences of each library in the ODEX dataset.
Domain Statistics of Comparison Datasets Table 6 lists the library frequencies of the eight comparison datasets mentioned in §3: HumanEval, MBPP, APPS, MTPB, P3, DSP, DS-1000, and Exe-DS.

Table 6 :
Library statistics of eight comparison datasets.
Approximated Natural Domain Distribution To approximate the natural distribution of libraries in the open domain, we count the number of Python files on GitHub that import the library of interest. Following the GitHub search syntax, we use the query import ${library_name} to search for files that import a certain library, and the query NOT import to count files that do not use any libraries. The resulting frequencies are shown in Table 7.
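The two query strings follow directly from this description; a sketch of how they might be built (issuing them against the actual GitHub search API, with authentication and pagination, is out of scope here):

```python
def github_search_query(library=None):
    # `import <lib>` matches Python files importing the library;
    # `NOT import` matches files containing no import at all,
    # per the GitHub search syntax described above.
    if library is not None:
        return f"import {library}"
    return "NOT import"
```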

Table 7 :
Approximated natural domain distribution.

Table 10 :
CODEX pass rate in open and closed domains.

Table 13 :
CODEX results on non-execution metrics.

Table 14 :
CODEGEN results on non-execution metrics.

Table 18 :
CODE-DAVINCI-002 results when using zero (0), one (1), and all (n) test cases in the prompt input.
Furthermore, we experiment on the subset of examples with sufficient test cases, to prevent the n-test setting from being trivialized into the 1-test setting. Concretely, we filtered for examples with at least 3 test cases, yielding 112, 17, 25, and 45 examples in English, Spanish, Japanese, and Russian. Pass@1 results for the 0/1/n-test settings are shown in Table 19.
While the input construction process may introduce whitespace at the start and end of the text sequence, we find the CODEGEN models unexpectedly sensitive to trailing whitespace. As shown in Table 20, removing this whitespace from the prompt input increases the pass rate of all CODEGEN model sizes by over 20 percent.

Table 20 :
CODEGEN results when inputting prompts with and without trailing whitespaces (WS).

Table 22 :
Results on examples with 3 or more test cases, using one (1) or all (n) test cases at evaluation.