Explain-then-Translate: An Analysis on Improving Program Translation with Self-generated Explanations

This work explores the use of self-generated natural language explanations as an intermediate step for code-to-code translation with language models. Across three types of explanations and 19 programming languages constructed from the MultiPL-E dataset, we find the explanations to be particularly effective in the zero-shot case, improving performance by 12% on average. Improvements with natural language explanations are particularly pronounced on difficult programs. We release our dataset, code, and canonical solutions in all 19 languages.


Introduction
Program translation (i.e., translating code from one language to another) has significant value in real-life applications, including legacy software modernization and enabling programmers to quickly adapt to new languages. Within prompt-based approaches to code translation, Chen et al. (2023b) recently found that simply prompting an LLM to generate explanations of the source program before generating the target program can improve performance. However, this conclusion is drawn from a single translation direction, C++ to Python (Lachaux et al., 2020), and lacks evaluation on a broader set of programming languages, including low-resource languages, a key component of code-to-code translation tasks in a software modernization setting.
This paper systematically evaluates this "Explain-then-Translate" approach to code translation through the MultiPL-E dataset (Cassano et al., 2022). As the original dataset was constructed for the NL-to-code setting, we repurpose it into a code-to-code dataset, "MultiPL-C2C". We analyze our results in 36 different translation directions over different types of explanations. We find that Explain-then-Translate improves zero-shot performance consistently in 18 Python-to-X translation directions, but much less so in the few-shot setting. We observe detailed explanations to be more useful when translating into high-resource PLs and from low-resource into other low-resource PLs. In contrast, translating from high- to low-resource PLs benefits from more abstract explanations. To aid future research in code-to-code translation across diverse languages, we release our evaluation system, as well as canonical solutions in all languages, providing a 19-way parallel program translation evaluation set.

Explain-then-Translate for Code Translation
In code translation, we are given code x in a source language and must generate a program y in a target language that is functionally equivalent to x.
In this paper we are interested in whether a self-generated natural language explanation z can be used to improve this translation process. We consider explanations at three levels of granularity, the most detailed of which asks the model to explain each individual fragment of a line and then summarize the purpose of the entire line. This prompting method allows us to decompose compositionally difficult fragments of the code, re-using individual fragment explanations before explaining the whole line, similar to Zhou et al. (2022).
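As a concrete sketch of this two-stage prompting, the pipeline below generates an explanation first and then conditions the translation on it. The helper `llm_complete` and the exact prompt wording are illustrative assumptions, not the paper's implementation:

```python
def explain_then_translate(source_code, src_lang, tgt_lang, tgt_signature, llm_complete):
    """Two-stage prompting: (1) self-generate an explanation of the source
    program, (2) translate conditioned on the explanation and the translated
    signature. `llm_complete(prompt, stop=...)` is any prompt -> completion
    function (e.g. an API call); its interface here is an assumption."""
    explain_prompt = (
        f"### {src_lang} version\n\n{source_code}\n\n"
        "### Explanation\n\n"
    )
    # Stage 1: stop at "\n#" so the model does not run ahead into a translation.
    explanation = llm_complete(explain_prompt, stop=["\n#"])

    translate_prompt = (
        explain_prompt + explanation + "\n\n"
        f"### {tgt_lang} version\n\n{tgt_signature}\n"
    )
    # Stage 2: the target program is conditioned on both the explanation
    # and the target-language function signature.
    return llm_complete(translate_prompt, stop=None)
```

Because the explanation is produced before the target language is revealed in stage 2, the same explanation can in principle be reused across target languages.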

Prompt Variations
When generating explanations, we treat the token sequence \n# as a stopping sequence in order to prevent models from generating target translations (since we condition the target program on translated signatures in addition to explanations). Sometimes, a model might generate target-language-specific details (equivalent classes, attempted translations, etc.). To control for inconsistencies caused by such details, we re-use the same explanations (from Python-Java) for all Python-to-X translation experiments (Section 3.1). Before reusing an explanation, we remove any target-specific information with programmatic rules so it can be shared across experiments. For completeness, in Apx G.1 we show the impact of removing target-language-specific details for the exp experiments: the effects are generally insignificant, but are more pronounced in low-resource languages.
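The paper's exact filtering rules are not reproduced here; a minimal sketch of one plausible rule, dropping sentences that name a target language, could look like:

```python
import re

def strip_target_specific(explanation, target_langs=("Java", "C++", "TypeScript")):
    """Drop sentences that mention a target language by name, so that an
    explanation generated for one direction (e.g. Python-Java) can be reused
    for all Python-to-X directions. Illustrative only: the paper's actual
    programmatic rules may differ."""
    pattern = re.compile("|".join(re.escape(lang) for lang in target_langs))
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", explanation)
    return " ".join(s for s in sentences if not pattern.search(s))
```

A rule of this shape removes whole sentences rather than rewriting them, which keeps the remaining explanation fluent at the cost of occasionally dropping some content.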
Additional details on language-specific stop tokens and how few-shot programs are selected are described in Apx E and Apx F, respectively.

Dataset: MultiPL-C2C
MultiPL-E (Cassano et al., 2022) is a benchmark recently introduced to evaluate the NL-to-code generation capabilities of language models in 19 different programming languages. Concretely, taking the original HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021) datasets (where models are prompted with a problem description and tasked to generate a Python program that solves the problem), MultiPL-E built transpilers for the unit tests as well as the code generation prompts, such that models can be evaluated from NL-to-code in 19 different languages (Python + 18 additional languages). Cassano et al. (2022) group these languages by resource level:
• High-resource: JavaScript (js), Python (py), Java* (jv), C++* (cpp), TypeScript* (ts)
• Medium-resource: PHP (php), Ruby (rb), C#* (cs), Go* (go)
• Low-resource: Perl (pl), R (r), Rust* (rs), Scala* (sc), Swift* (sw)
• Extremely-low-resource: Bash (sh), Lua (lua), Racket (rkt), Julia* (jl), D* (d)
To repurpose MultiPL-E into a code-to-code translation dataset, we change the task formulation by including canonical Python programs in the prompt and removing the NL problem descriptions. We dub this version of the dataset MultiPL-C2C, and release it for future work in this area.

Metrics
We evaluate our methods using unit test pass rate (Chen et al., 2021; Cassano et al., 2022), as string-match-based evaluations do not capture the diverse ways in which a program can be translated and still be functionally equivalent to the source. We calculate the pass rate as pass@k = E[1 − C(n−c, k)/C(n, k)], where n is the total number of generations per problem, c is the number of correct generations, and C denotes the binomial coefficient. The best sampling temperature t (or top-p) (Holtzman et al., 2020) is often dependent on n/k: smaller temperatures are best for small n/k, while larger temperatures increase generation diversity (better recall) and can improve the pass rate with large n/k. We prioritize precision and calculate pass@1 with n = 20, t = 0.2, and top-p = 0.95, following Cassano et al. (2022).
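The unbiased pass@k estimator from Chen et al. (2021) can be computed in a numerically stable form that avoids large binomial coefficients:

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from Chen et al. (2021):
    1 - C(n-c, k) / C(n, k), i.e. the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, passes the unit tests. The product form below is the standard
    numerically stable rewriting of the binomial ratio."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For k = 1 this reduces to c/n, so pass@1 with n = 20 is simply the fraction of the 20 generations that pass.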

Models
We evaluated four models of varying sizes.
We mainly report the results from GPT-3.5 in the main paper unless otherwise specified, and defer the results from open-source models (CodeGen2-1B, CodeGen2-16B, and Llama2CodeInstruct-34B (Nijkamp et al., 2023; Rozière et al., 2023)) to the appendix. We also considered CodeXGLUE (Lu et al., 2021) and TransCoder (Lachaux et al., 2020) for unit-test evaluations, but initial studies suggested a significant number of examples (more than 25%) contain mistakes in gold programs or inadequate unit tests (see Apx A, B).

Experiments and Discussion
In our study we focus on two main sets of translation directions: Python-to-X, where we translate from Python to 18 other target languages ranging from high- to extremely-low-resource (§3.1), and X-to-X, where we target a representative set of translation directions varying source and target language resource levels and typing characteristics (§3.2). We analyze translation improvements across models of 4 different sizes (§3.3) and discuss improving individual explanations through heuristics (§3.4). Finally, we show our method improves more on difficult-to-translate examples (§3.5) and provide ablations to understand what about NL explanations improves performance and whether alternative self-generated contexts could help (§3.6).

Python-to-X Translation
In Table 1 we present results of the Python-to-X experiments in the zero- and four-shot settings with GPT-3.5. Results with open-source models show similar trends and are shown in Apx G.9.
Natural language explanations improve performance in the zero-shot setting, and this effect is more pronounced in low-resource languages. Providing explanations improves relative performance by 11.5% on average across 18 target languages. Across target language resource levels, the best explanation improves translation with an average relative improvement of 6.2% in high-resource languages and 14.5% in extremely-low-resource languages. There is no significant difference between improvements on translating into statically vs. dynamically typed languages. Self-generated explanations even slightly outperform human-written doc-string instructions that are part of the original HumanEval dataset (see Apx J).
High-resource target languages benefit from detailed explanations, while low-resource ones benefit from abstract explanations. We hypothesize that high-resource languages benefit from more detailed explanations due to higher co-occurrences of NL and PL in the pretraining corpora, whereas in low-resource languages we speculate the additional detail may introduce spurious correlations. Since we re-use explanations across translation directions, the translation performance difference can be attributed only to the code generation step.
Natural language explanations are less helpful in the few-shot setting, but good few-shot examples are crucial. In the four-shot setting, the average improvement is much smaller at 1.1%, although some language pairs observe as much as a 10.1% improvement. The average improvement in high-resource languages (1.2%) is smaller than that in extremely-low-resource languages (3.4%). The most detailed explanations perform best in 12 out of 18 language directions amongst explanation types. This is likely due to the carefully curated few-shot examples, which are semantically and syntactically complex enough to benefit from decomposition and explanations (see Apx F for more details).
Few-shot explanations result in worse performance than zero-shot explanations. The most abstract explanation (exp) performs the worst (best in only 3 out of 18 directions) in the few-shot setting. Since we source the few-shot explanations from minimally modified zero-shot explanations, including these self-generated explanations simply restricts the model's explanations to follow stylistic patterns and decreases their diversity. In Apx G.2, we disentangle the effect of target-specific explanations and the zero/four-shot setting to provide further evidence for this point.
Improvements in the zero-shot setting correlate with improvements in the few-shot setting. Except for a few outliers, Fig 2 shows a good correlation. This is interesting because the few-shot examples are manually curated and written in PL, while the explanations are self-generated and written in NL. In our ablations (§3.6) and Apx J, we further analyze to what extent the source of information provides the structure of the output, and whether the correctness of the sequence actually matters.

Alternative Translation Directions
To understand whether our findings only hold for Python (or only high-resource languages), we experiment on additional translation directions from different resource groups and typing characteristics, and present our results in Table 2. Since the source languages differ, we do not re-use explanations.
In the four-shot explanation setting, we use zero-shot generated explanations (§3.1). In the following, High denotes high-resource languages and Ext-Low extremely-low-resource languages. Results from open-source models are in Apx H.
High-to-ExtLow and High-to-High follow a similar pattern to Python-to-X. In zero-shot, High-to-High has varied performance across different explanation types, whereas High-to-ExtLow benefits mostly from simple explanations (exp). In four-shot, there is little to no improvement in High-to-High, but some improvement in High-to-ExtLow.
ExtLow-to-High: Models are poor at explaining low-resource language programs. The improvement in ExtLow-to-High trials is limited in both zero- and four-shot settings. Across explanation methods, we see a general decrease in performance from high-level (exp) to detailed (exp-lbl-d) explanations. We speculate that LLMs generally struggle to understand and explain lower-resource PLs; more details may introduce more errors which may compound into the translation phase.

Table 3: Explanation selection heuristics performance. We estimate heuristic performance (pass@1, n=1) and validate the best method by generating 20 programs and calculating pass@1 (n = 20). No heuristic is able to outperform the baseline with exp-lbl, so we did not verify it with pass@1.

Figure 3: Py-to-X translation (pass@1, zero-shot) improvements (best explanation over baseline) across models grouped by target language resource level.

Comparisons Across Different LMs
Improvements are robust across models. From Figs 3 and 9, we can see that, in general, the larger the model, the larger the absolute improvement with self-generated explanations. In terms of improvement over resource levels, our method improves low-resource language generation more with larger models, while improving high-resource languages more with smaller models. See detailed result tables in Apx H and G.9. CodeGen2-16B is the only model that does not improve consistently with explanations.
Better explanations are transferable and lead to better translations. We also experimented with CodeGen2-1B using GPT-3.5-generated explanations (Fig 3) and found them to improve performance further, outperforming self-generated explanations in 12 out of 18 directions. Comparing absolute improvements against CodeGen2-1B with self-explanations, we find that GPT-3.5-generated explanations improve more when generating higher-resource than lower-resource languages, indicating that smaller models are less sensitive to improvements. More analyses are given in Apx G.9.

Explanation Selection with Heuristics
In the context of chain-of-thought prompting, Wang et al. (2022) demonstrate the importance of sampling diverse "reasoning" paths. It is difficult to ensemble sampled programs from language models, but we find sampling diverse explanations (first sampling 20 explanations and then sampling one program from each) to improve recall for correct programs (pass@10) compared to sampling 20 programs from 1 explanation, or direct translation, in zero/four-shot settings. This indicates that there is significant room for improvement if we can select the explanation most likely to generate a correct program (oracle column in Table 3). Motivated by the potential of diverse explanations to improve translation results, we explore five explanation re-ranking heuristics: 1) length of explanation (in characters) excluding code; 2) lines of source code explained (line-e); 3) number of lines of explanation; 4) number of quoted code fragments (frag); 5) logprob (Zhang et al., 2022; Min et al., 2022a), ranking the explanations with a weighted combination α * p(code|explanation) + (1−α) * p(explanation|code) using CodeGen2 (Nijkamp et al., 2023) (more details in Apx L). For each explanation type, we generate 20 explanations and 1 program from each explanation (train set). We estimate each heuristic's performance by averaging the pass rates of its selected (argmax) explanations for each individual problem in the train set. For the random baseline, we select 1 explanation for each program randomly; for the oracle, we select the explanations with the highest pass rates in the train set. For each explanation type, we pick the heuristic with the best estimated pass@1 (n = 1), and generate 20 programs from its selected explanations for the pass@1 (n = 20) score (rightmost column in Table 3). We use zero-shot explanations for exp (see Sec 3.1) and four-shot for exp-lbl and exp-lbl-d. Our main results are shown in Table 3, from which we observe the following.
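The logprob heuristic (5) can be sketched as a simple argmax over candidate explanations. Here `logprob(context, continuation)` stands in for scoring with a model such as CodeGen2; its exact interface is an assumption, not the paper's implementation:

```python
def rerank_explanations(code, explanations, logprob, alpha=0.5):
    """Select the explanation maximizing a weighted combination of
    log p(code | explanation) and log p(explanation | code), following
    the logprob re-ranking heuristic. `logprob(context, continuation)`
    returns the log-probability the scoring model assigns to `continuation`
    given `context`; `alpha` trades off the two directions."""
    def score(expl):
        return (alpha * logprob(expl, code)
                + (1 - alpha) * logprob(code, expl))
    return max(explanations, key=score)
```

Scoring both directions approximates a mutual-information-style criterion: explanations that make the code likely, and that the code makes likely, are preferred over generic ones.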
Heuristics can improve performance, and this is robust across different target languages. With exp, logprob improves upon random by an absolute 2.54% (p = 0.055), and frag improves exp-lbl-d upon the random baseline by an absolute 2.2% (p = 0.033) in simulation. Both improvements can be reproduced with pass@1, so we include these heuristically selected explanations as two additional rows in Table 1. With logprob-selected exp, we improve or match performance in 15/18 directions, with an average improvement of 1.7% (p < 0.001). With frag-selected exp-lbl-d, we improve or match performance in 13/18 directions, averaging 0.48% (p = 0.022). See Apx G.3 for more comparisons.
Figure 4: We count explanation improvement cases over direct pass@1. Results include all trials between Python-to-X and X-to-X directions. For better contrast, all problems with the same exp pass@1 and direct pass@1 are removed.
Some heuristics generalize across explanation types. Only frag and logprob perform robustly. Intuitively, frag makes sense because text containing a piece of code and an explanation is more likely to be correct if the author refers to the code more frequently. logprob works because it effectively measures the mutual information between code and explanations (Zhang et al., 2022).
There is still ample room for improvement. As we can see in the difference between oracle and pass@1, no heuristic comes close to the oracle upper bound. This gap is much larger in the high-to-low-resource translation direction (py-rkt). Qualitatively, we could not manually distinguish a good explanation from a bad one (Apx L.3 and L.4), suggesting that the distinction between "good" and "bad" explanations may be hidden by stylistic noise (wording, spacing, etc.), or potentially due to chance.

Which programs benefit from explanations?
To understand where natural language explanations benefit most, we investigate how exp improvement varies with problem hardness, which is approximated by the direct translation pass@1. In Fig 4, we see that explanations improve difficult problems (left of x-axis) but can hurt easy problems (right of x-axis). This suggests we could determine which programs to explain using a hardness threshold, and improve performance further. We verified the validity of such an approach through our oracle hardness metric (direct pass@1), and show the full results in Apx I. We found selective explanation to improve performance over direct and exp in 35/36 directions. We leave building such a difficulty-based problem selector for future work.
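A difficulty-gated pipeline of this kind can be sketched as follows. The difficulty estimator, the threshold value, and the function names are all hypothetical (the analysis above uses an oracle based on direct pass@1, not a learned estimator):

```python
def selective_translate(problem, estimate_direct_pass1,
                        translate_direct, translate_with_exp,
                        threshold=0.5):
    """Only generate an explanation when direct translation looks hard.
    `estimate_direct_pass1` is a hypothetical difficulty estimator returning
    an estimate of direct-translation pass@1 in [0, 1]; the threshold of 0.5
    is illustrative, not tuned."""
    if estimate_direct_pass1(problem) >= threshold:
        # Easy problem: explanations can hurt, so translate directly.
        return translate_direct(problem)
    # Hard problem: explanations tend to help, so explain first.
    return translate_with_exp(problem)
```

The open question deferred to future work is how to build `estimate_direct_pass1` without oracle access to unit-test results.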

Ablation Studies
We perform additional ablation studies to understand what aspects of the explanations improve translation (§3.6.1), whether explanations are robust to variations in the surface semantics/readability of the source code (§3.7, Apx N), and whether self-generated context in PL could help few-shot examples (§3.8, Apx J). Additionally, we explore the relationship between context length and improvements in Apx G.6.

Explanation Ablation
We select 4 target languages of different resource levels where explanations provide the most improvements (zero-shot) for Python-to-X. With each explanation, we ablate in the following ways:
swap-s: We randomly reorder sentences (exp) or code-explanation segments (exp-lbl) to test whether the explanation provides structural supervision.
obf-exp: We obfuscate source programs (see examples in Apx N), where function and variable names are replaced systematically with templates (FUNC_0, VAR_1, etc.).This tests whether an explanation uses specific variable references (structural supervision at token level).
ret-exp, rand-exp, no-exp: We replace the explanation with another program's explanation, either randomly (rand-exp) or through retrieval (ret-exp, details in Apx O), or with an empty string (no-exp), to verify whether explanations need to be correct/relevant.

del-w: We randomly remove some of the words to see if fluency (i.e., high logprob) is necessary.
del-s: We randomly remove a percentage of sentences (exp) or code-explanation paragraphs (exp-lbl) to measure the dependency of the translation on the completeness of the explanation.
Lastly, when models receive completely irrelevant explanations (rand-exp), they are able to recover performance to some extent; but if the explanations are convincingly misleading (ret-exp), performance deteriorates.
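A minimal sketch of the obf-exp source-program obfuscation for Python, using the standard ast module (the paper's actual renaming rules may differ; requires Python 3.9+ for ast.unparse):

```python
import ast

class Obfuscator(ast.NodeTransformer):
    """Systematically rename user-defined functions and variables to opaque
    templates (FUNC_0, VAR_1, ...), removing surface-form semantics while
    preserving program structure."""
    def __init__(self):
        self.mapping = {}
        self.func_count = 0
        self.var_count = 0

    def _rename(self, name, is_func):
        if name not in self.mapping:
            if is_func:
                self.mapping[name] = f"FUNC_{self.func_count}"
                self.func_count += 1
            else:
                self.mapping[name] = f"VAR_{self.var_count}"
                self.var_count += 1
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, is_func=True)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, is_func=False)
        return node

    def visit_Name(self, node):
        # Rename identifiers defined locally; leave builtins (len, range, ...) alone.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._rename(node.id, is_func=False)
        return node

def obfuscate(source: str) -> str:
    """Return `source` with functions/variables renamed to templates."""
    return ast.unparse(Obfuscator().visit(ast.parse(source)))
```

For example, `obfuscate("def add(x, y): return x + y")` renames `add` to `FUNC_0` and the parameters to `VAR_0`, `VAR_1`, so any explanation of the result must rely on syntax rather than identifier names.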
Models rely on the semantics of explanations less when generating lower-resource languages. Different types of ablations affect lower-resource languages more uniformly than higher-resource languages. Relative to exp/exp-lbl, ablations that completely alter the semantics of the explanations (del-w) decrease improvements less in lower-resource languages than in their higher-resource counterparts, while ablations that keep the overall semantics of the explanation (swap-s) decrease improvements less in higher-resource languages.
The semantics of the explanation is not the whole picture. We obfuscate variable and function names in source programs, removing all surface-form semantic information (Apx N). Comparing direct translation vs. exp in Table 18, we find explanations to be robust regardless of the surface semantics of the code.
In half the trials, relative improvements using explanations are even higher for obfuscated source code than for the original code. This is potentially because explanations become more reliant on the actual syntax of the program, rather than hallucinating program semantics from surface variable names. This is promising because when using models in the real world, such as for app modernization, there is no guarantee of having readable code.

Few-shot Ablations
Since few-shot improvements correlate with explanation improvements (§3.1), we conduct ablations to check how sensitive the models are to the correctness of few-shot examples, and whether unverified self-generated few-shot examples can improve performance as well as explanations do. Here, we replace our correct few-shot examples with naturally generated programs from GPT-3.5 (high logprob, but formally unverified (unk) or incorrect (bad)), and observe how much self-generated few-shot examples improve translation and how sensitive models are to their correctness. We experiment with a fixed one-shot example as well as a retrieved one-shot example to observe the improvement/sensitivity when the example program is similar to or different from the test program.
When the few-shot program is similar, verification is important. In Table 5, we observe that when the retrieved one-shot example is paired with a wrong target program, the decrease in performance is much more significant than in the fixed-shot setting. In other words, curated few-shot examples are robust to label noise. This is consistent with the earlier conclusion from Table 4 that an "almost-correct" explanation (ret-exp) can influence generation more than an obviously incorrect one (rand-exp). If verification is available, retrieve-gold shows that a formally correct (similar) program is more useful than a natural language explanation. However, on average, self-generated unverified explanations (exp) still outperform one-shot examples in all directions (fixed-unk by 0.7-4.5%; retrieve-unk by 0.5-2.0%), indicating that NL generations often have higher quality than programs and can serve as a better medium for an intermediate reasoning step.
To further compare NL explanations with other formal/non-formal reasoning steps, in Apx J we experiment with translating to a pivot programming language before translating to the target language (e.g., Python-Java-PHP). By controlling the pivot-language correctness, we can more closely verify the model's sensitivity to correctness in context. The results indicate that mistakes in intermediate PL steps corrupt translation performance more than imperfect NL explanations do, suggesting that using self-generated NL as an intermediate step often helps more than self-generated PL, and that reasoning in NL space is advantageous for language models.

Related Work
Explanation in deep learning. Many works have explored using explanations to improve language models. Hase and Bansal (2022) investigate various ways explanations can be introduced in modeling and find them most useful for retrieving similar data. Joshi et al. (2022) find explanation regularization to improve OOD performance. Most works on LLMs generate explanations via zero-shot or few-shot prompting, or finetuning, before generating the target response (Ling et al., 2017; Nye et al., 2021; Wei et al., 2022; Mukherjee et al., 2023; Hsieh et al., 2023). A few works have also explored post-hoc explanations (Lampinen et al., 2022; Krishna et al., 2023). Wiegreffe et al. (2021) and Chan et al. (2022) propose metrics to quantify rationale-label association. We refer readers with further interest to surveys (Miller, 2019; Hartmann and Sonntag, 2022; Zhao et al., 2023).
Intermediate state prompting. The Explain-then-Translate approach is an instance of chain-of-thought prompting (Wei et al., 2022; Nye et al., 2021), where the model is prompted to generate reasoning steps before the final answer. Follow-up works have found it useful on numerous tasks beyond niche reasoning tasks (Wang et al., 2022; Zhou et al., 2022; Chowdhery et al., 2022; Suzgun et al., 2022; Yao et al., 2023). In our setting, we find most improvements come from the zero-shot setting (Kojima et al., 2022). Different from previous works, our task focuses on program translation, with significant token-level correspondence between the source and target. Ghazvininejad et al. (2023) and Lu et al. (2023) improve NL translation by augmenting prompts with dictionary translations, but their contexts are not self-generated. It would be interesting to explore whether other forms of "explanations" (e.g., BNF grammars (Wang et al., 2023a)) could further improve performance, especially on low-resource languages which may not have been encountered frequently during pretraining.
Code prompting and feedback. In the code-generation space, Chen et al. (2023b) briefly mention in their ablation studies that explanation improves translation in Python→C++, but our analysis reveals more nuanced settings in which explanations improve code translation.

Conclusion
This work conducts a thorough analysis of the performance of large language models in program translation using different types of self-generated explanations as an intermediate step. Models generate higher-quality detailed explanations for high-resource languages, while still generating good enough abstract explanations for low-resource languages. With simple heuristics, we have also demonstrated the potential to improve explanation quality and consequently translation quality. We identify key requirements for explanations and find that, on average, mistakes in NL are less detrimental to performance and do not require verification, compared to using PL as self-generated context.

Limitations
There are several limitations to our work. First, while we focus on the (adapted) MultiPL-E benchmark due to its widespread use, it is unclear whether programs in this benchmark are representative of programs that are targets for code-to-code translation. Second, while our conclusions largely hold across GPT-3.5 and other open-source models, it is unclear whether they will hold for more powerful LLMs such as GPT-4. Finally, somewhat disappointingly, we found natural language explanations to be less helpful in the few-shot setting, and insofar as obtaining several demonstration examples for each language pair is quite practical, natural language explanations for code-to-code translation may not be useful for many applications of interest.

B TransCoder Evaluation
We report the full evaluations on the TransCoder eval+test set. We do not have a direct measurement of the self-debug method on the fixed dataset. We can, however, infer that the self-debug method's improvements over the baseline come from its additional access to unit tests and compiler feedback. When unit tests or gold programs are wrong, relying on this feedback is the only way to improve translation performance. In other words, translation models start to deviate from a faithful translation and "over-correct" themselves to pass the unit tests. Our methods, however, assume no access to such feedback signals, and slightly under-perform.
Looking at the gold-sig trials, we see that asking the model to explain the program and then translate consistently improves over the baseline. In the no-sig trials, we see a consistent under-performance of our methods. Most of the errors result from the generated program having an incorrect signature (incorrect number of input variables). Often, these source C++ programs contain non-intuitive or redundant input variables. For example, in FREQUENT_ELEMENT_ARRAY_1, the input variable n is non-intuitive: a program that finds the most frequent element in an array should normally process the whole input array arr. By explaining before translating, the model aligns the generation to the explanation, which often follows the distribution of "typical programs" in the wild.
In typed-sig, we see general improvements over the other trials. This is expected because we can provide more accurate information regarding the input-output types. The slight under-performance of exp compared to direct could be due to the small size of the experiment.

### Explanation
This function takes an integer 'n' as input and returns a tuple of two integers. The first integer is the number of even palindromes between 1 and 'n', and the second integer is the number of odd palindromes between 1 and 'n'. A palindrome is a number that reads the same forwards and backwards.
The function first defines a helper function 'is_palindrome' that takes a number and returns 'True' if it is a palindrome and 'False' otherwise. This is done by converting the number to a string and checking if the string is equal to its reverse.
The function then initializes two counters, 'even_palindrome_count' and 'odd_palindrome_count', to 0. It then loops through all numbers between 1 and 'n' (inclusive) and checks if each number is a palindrome and even or odd. If the number is a palindrome and odd, it increments 'odd_palindrome_count'. If the number is a palindrome and even, it increments 'even_palindrome_count'.

### Explanation
The code is an implementation of finding the number of even and odd palindrome numbers between 1 and 'n' (inclusive). It takes an integer 'n' as input and returns a tuple of two integers: the even and odd palindrome counts. A palindrome is a number that reads the same forwards and backwards.
def is_palindrome(n): return str(n) == str(n)[::-1]
The function first defines a helper function 'is_palindrome' that takes a number and returns 'True' if it is a palindrome and 'False' otherwise. This is done by converting the number to a string and checking if the string is equal to its reverse.
even_palindrome_count = 0
odd_palindrome_count = 0
The counter variables 'even_palindrome_count' and 'odd_palindrome_count' are used to record the result.
for i in range(1, n+1):
The loop goes through all numbers between 1 and 'n' (inclusive).
if i%2 == 1 and is_palindrome(i): odd_palindrome_count += 1
elif i%2 == 0 and is_palindrome(i): even_palindrome_count += 1
Within the for-loop, the program checks if each number is a palindrome and even or odd. If the number is a palindrome and odd, it increments 'odd_palindrome_count'. If the number is a palindrome and even, it increments 'even_palindrome_count'.
The code is an implementation of sorting an array of integers. It takes a list of integers 'array' as input and returns a sorted list of integers.
return [] if len(array) == 0 else sorted(array, reverse=(array[0]+array[-1]) % 2 == 0)
The function first checks if the input list is empty. If it is, it returns an empty list. Otherwise, it sorts the list using the 'sorted' function. The 'reverse' parameter is set to 'True' if the sum of the first and last elements of the array is even (and the array will be sorted in descending order); otherwise, it is set to 'False' (and the array will be sorted in ascending order).

Although nested/local functions are allowed in C#, this increases the variety of generations which can be accepted by unit tests.

E.14 Bash
• original stops: \n} • modified stops: \n#, \nAnswer, \necho, \n'''. Although nested functions are technically allowed, it is quite conventional to write helper functions as separate functions. We also use post-processing to truncate additional unwanted generations.
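The post-processing step can be sketched as follows (a minimal illustration, not our exact script): cut the generation at the earliest occurrence of any stop sequence.

```python
def truncate_at_stops(generation, stops):
    # Find the earliest occurrence of any stop sequence and cut there.
    cut = len(generation)
    for stop in stops:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut]
```

For example, with the modified Bash stops, truncate_at_stops(gen, ["\n#", "\nAnswer", "\necho", "\n'''"]) keeps only the text before the first stop sequence.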

E.15 Lua
• original stops: \nlocal, \nfunction, \n--, \n\n • modified stops: \n--, \n#, \nend. The original stop \n\n is problematic because it prematurely ends translation as soon as the source program contains an extra empty line. We add the base-indentation-level \nend to truncate after the function ends.

E.18 D
• original stops: \n\n, \nvoid, \nbool, \nint

### Explanation (exp)
This function receives a list of strings and returns the string with the highest number of unique characters, and if there are ties, the string that comes first in lexicographic order.

### Explanation (exp-lbl)
The code is an implementation of finding the longest word in a list of words that has the fewest unique characters. It takes a list of strings 'words' as input and returns the longest string in the list that has the fewest unique characters.
sorted(words, key = lambda x: (-len(set(x)), x))[0]
The function first sorts the list of words using the 'sorted' function. The 'key' parameter is set to a lambda function that returns a tuple of two values: the negative length of the set of characters in the word (to sort by fewest unique characters first), and the word itself (to break ties by alphabetical order).
The '[0]' at the end of the line returns the first (i.e., longest) word in the sorted list.

### Explanation (exp-lbl-d)
The code is an implementation of a function that finds the word with the maximum number of unique characters.
The function takes a list of strings as input and returns the string with the maximum number of unique characters.
sorted(words, key = lambda x: (-len(set(x)), x))[0]
This statement sorts the input list 'words'. Let's break it down step by step:
set(x)
The 'set' function returns a set of unique characters in the string 'x'.
len(set(x))
The 'len' function returns the number of unique characters in the string 'x'.
-len(set(x))
The '-' sign is used to sort the list in descending order of the number of unique characters.
x
The 'x' parameter is used to sort the list in ascending order of the string itself. Hence, the 'key' parameter of the 'sorted' function is set to a lambda function that returns a tuple of two values: the negative length of the set of unique characters in the string and the string itself. The 'sorted' function sorts the input list based on this tuple. The '[0]' at the end of the statement returns the first element of the sorted list, which is the string with the maximum number of unique characters.
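For reference, the program these explanations describe (humaneval_158_find_max) reduces to the single statement analyzed above:

```python
def find_max(words):
    # Sort by (most unique characters first, then lexicographic order)
    # and return the first element. The negative sign on len(set(x))
    # is what makes the unique-character count sort descending.
    return sorted(words, key=lambda x: (-len(set(x)), x))[0]
```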
Above is an example explanation of a program in HumanEval (humaneval_158_find_max). With exp, the explanation is very high-level and does not go into details of the Python implementation. With exp-lbl, each line of the program is referenced along with an explanation. However, we can see that in the second paragraph exp-lbl explains that words are sorted "by fewest unique characters first" when it should be the other way around. This led to a wrong translation in which the negative sign is ignored. On the other hand, with exp-lbl-d, complicated lines like the one above are decomposed first, explained separately, and then combined. This particular explanation emphasizes the negative sign and improves the translation pass rate.

F How we selected few-shot programs and wrote explanations
To be consistent across trials, and because not every program has gold translations, we use fixed few-shot examples in our main experiments. In this case, the few-shot examples we pick are crucial to performance: they need to be representative of the dataset's characteristics and contain features that demonstrate the usefulness of explanations.
Once we select the few-shot programs, we simply use the model to generate zero-shot explanations, then modify them for correctness and structural preferences (e.g., code lines followed by explanations with exp-lbl).
Our "development" translation direction is Python → Java. We quickly noticed that GPT-3.5 is bad at translating nested functions, which occur many times in the canonical solutions in HumanEval. Compounded with the fact that the original MultiPL-E stop tokens do not allow models to generate multiple functions and that Java does not allow nested functions, the only way for GPT-3.5 to generate a correct translation of these programs is to use lambda expressions, which can be extremely convoluted if the nested function is long. Therefore, after loosening the single-function constraint (see the Java section in Apx E), we decided to use the first few-shot example, humaneval_107_even_odd_palindrome, as a demonstration of how to translate this type of function (by adding a private helper function after the main function).
The second and third few-shot examples were selected somewhat arbitrarily. The only criteria we had in mind were that these programs should be somewhat difficult, on the longer side length-wise, and that the semantics of the program should not be immediately clear from the function name or the first sentence of its description. Lastly, the programs should not be close to each other in the problem sequence (in case Chen et al. (2021), when designing the dataset, stress-tested different aspects of code generation in batches). Hence, we picked humaneval_126_is_sorted and humaneval_1_separate_paren_groups.
The last example is often the most influential due to its proximity to the test examples (Lu et al., 2022). We decided on this program after careful error analysis of exp and exp-lbl. With few-shot exp-lbl, we found the explanations to already be of great quality: the model chunks the program into several semantically independent segments and explains each of them separately. We noticed that the quality of the explanations (through manual inspection) is worse when the segments are longer or more complicated. In many cases, these coincide with programs that contain one extremely semantically complicated line. This is because the model has to spend paragraphs explaining these lines, and often produces confusing or wrong statements. Many of the remaining 17 assertion errors (semantic errors) in four-shot exp-lbl are due to incorrect or insufficient explanations. Section E.19 shows an example program.
These errors, however, can be effectively mitigated if such lines are further decomposed into smaller chunks of code. Due to the inherent tree-like structure (AST) of programs, asking models to decompose is another way of eliciting fine-grained parsing of the code. Surprisingly, models like GPT-3.5 are able to do this very well. Hence, we developed our third explanation method, exp-lbl-d: we ask the model to explain line by line, and to break down a line into smaller parts if it is too long or complicated. To demonstrate the usefulness of this method, we pick a program that is short but contains a long and complicated line (humaneval_88_sort_array). See E.19 for the explanations generated with few-shot prompts across the 3 explanation methods. With four-shot exp-lbl-d, we find the explanations generated for almost all 17 previously failing programs (in exp-lbl) to be correct, and the remaining 12 assertion errors all result from wrong program generation given correct explanations.

G Python-to-X Detailed Result Investigation
All experiments detailed here are conducted with GPT-3.5 unless otherwise specified.
G.1 Removing target-language-specific information in zero-shot explanations

Above is an example of what target-specific information in generated explanations looks like.
Since different types of explanations generate different kinds of target-specific information, we remove all of them in the Python-to-Java direction to disentangle the effect of target-specific information from the level of detail in the explanation. The removal is done through a script that truncates the explanation starting from the first sentence that mentions the target language.
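A minimal sketch of such a removal script, assuming sentences are delimited by terminal punctuation (the actual script may differ):

```python
import re

def remove_target_info(explanation, target="Java"):
    # Split the explanation into sentences and keep everything before the
    # first sentence that mentions the target language.
    sentences = re.split(r"(?<=[.!?])\s+", explanation)
    kept = []
    for sentence in sentences:
        if target.lower() in sentence.lower():
            break
        kept.append(sentence)
    return " ".join(kept)
```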
Here, we compare pass@1 before and after removing such information for all explanation types. As seen, there is only a slight decrease in performance. When translating into low-resource languages, target-specific information occurs much less frequently, leading to a much smaller difference between regular explanations and explanations with target information removed.
In addition, we compare target-independent explanations (exp (java)) vs. target-dependent explanations (exp (tgt-specific)) in Table 8. Target-specific explanations do not impact performance significantly in zero-shot (p = 0.090), but tend to decrease performance in lower-resource languages.

G.2 Four-shot explanation variations
To observe the difference between Python-to-Java four-shot explanations, four-shot target-language-specific explanations, and re-using zero-shot target-language-specific explanations in four-shot translation, we compare their Python-to-X translation performance in 18 target languages (Table 9). Judging by the best trials across language directions in Table 9, exp (java) wins in 6/18 directions, whereas exp (tgt-specific) and exp (zero-shot) win in 5/18 and 11/18 directions, respectively. A 2-tail paired t-test indicates that both alternative trials are significantly different from re-using the 4-shot examples from Java (p_tgt-specific = 0.036 and p_zero-shot = 0.024). Four-shot explanations are worse than zero-shot generated explanations. This is intuitive because the explanations produced by exp in the zero-shot setting are already good enough. By conditioning mostly on its own explanations in a few-shot setting, the model does not obtain more information, but is restricted from generating diverse types of explanations.

G.3 Heuristically selected explanations
For fairer comparison, we include here the heuristically selected explanations with their respective baselines. Since exp selection was done over zero-shot explanations, we compare exp (logprob) against exp (zero-shot). There is still a mean improvement of 0.73%, with a standard deviation of 1.8%.

G.4 Error types breakdown in Python-to-X
For each program, in addition to determining the unit test pass rate, we also use string matching on stderr and stdout to categorize the error type. In order to generalize across different languages, we group the errors into the following 6 types:

• Type Error includes all errors related to interactions between variables of the wrong types. For example, in Python, floats cannot be used to index lists, and a string cannot be multiplied by another string.
• Undeclared Error includes all errors calling methods or variables that do not exist. It ranges from undeclared variables to being unable to find an equivalent built-in function such as string.swapcase() in Python.

• Assertion Error includes cases where the program produces incorrect output. This indicates that the program runs, but is not functionally the same as the source.

Table 9: Comparing four-shot translation with different explanations: Python-Java explanations, target-specific 4-shot explanations, and target-specific 0-shot explanations across 18 languages. The 2-tail paired t-test between the first and second rows gives p = 0.036, and between the first and third rows p = 0.024.
• Runtime Error generally includes all errors that do not occur for every unit test. For instance, an index-out-of-bounds error may only occur with long list inputs.
• Other Syntax Errors includes all other types of errors not captured by the specific groups above.
• Unhelpful includes cases where the generated program contains exclusively comments like "TODO" or "Your Code Here".
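The categorization above can be sketched as a cascade of string matches on the captured output (the patterns below are illustrative examples, not our full per-language pattern set):

```python
def classify_error(stdout, stderr, passed):
    # Order matters: check pass first, then the most specific patterns.
    text = (stdout + " " + stderr).lower()
    if passed:
        return "pass"
    if "todo" in text or "your code here" in text:
        return "unhelpful"
    if "typeerror" in text:
        return "type_error"
    if "is not defined" in text or "cannot find symbol" in text:
        return "undeclared_error"
    if "assertionerror" in text or "assertion failed" in text:
        return "assertion_error"
    if "index out of" in text or "division by zero" in text:
        return "runtime_error"
    return "other_syntax_error"
```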
For better generalization, we also combine Assertion Error and Unhelpful into semantic errors, and Type, Undeclared, and Other Syntax Errors into syntactic errors. Here are some of our main conclusions:

Less syntax error across the board in zero-shot. In general, we see a decrease in syntactic errors across all target-language resource levels (Fig 5). Specifically, there is a significant decrease in unhelpful generations in trials with explanations. This is similar to the effect of having few-shot examples (Min et al., 2022b), except in this case we do not actually provide the format of the target translation. Other than reducing unhelpful generations, self-generated explanations also decrease undeclared and type errors (more so in higher-resource directions). This is intuitive: as the model reasons through a program explanation, it may generate sequences that specify variable types or specific methods used in the source program, which in turn provides more information for the translation step to call appropriate methods. Surprisingly, there is no sign of a decrease in semantic errors. This is likely because, by resolving syntactic errors, those programs switched to having semantic errors. In Fig 6 we look into this phenomenon specifically.
No significant difference in four-shot. Errors are distributed very similarly across all trials, with two exceptions. First, in high-resource target languages, other syntax errors drop significantly with exp-lbl-d compared to the other explanations, both of which contain more errors than the direct baseline. Second, in extremely low-resource target languages, there also seems to be a significant drop in other syntax errors.

G.5 Error conversion between direct translation and with explanations
To observe program status with and without explanations, we track each problem's status in the direct and explanation trials. In Fig 6, we plot direct status on the x-axis and the corresponding status with explanation on the y-axis. Here are some key take-aways:

More detailed explanations decrease semantic error rate. In the top two rows of Fig 6, we can observe the differences between the three explanation methods. In the zero-shot setting (row 1), exp-lbl converts more of direct's semantic errors to pass, and slightly more syntactic errors to semantic errors, both of which indicate improvement. In the four-shot setting (row 2), with more detailed explanations we see a consistent decrease in pass→semantic error (explanation misleading translation) and semantic error→pass, but an increase in pass→syntactic error. These all indicate that more detailed explanations indeed decrease the amount of semantic errors.

Higher target-language resource, proportionally more improvement with explanation, fewer misleads. In the bottom two rows of Fig 6, we can see the effectiveness of explanations across target languages of different resource levels. In zero-shot (row 3), a majority of the improved cases (not pass→pass) come from fixing syntactic errors. However, if we also count syntactic error→semantic error as improvement, the effect becomes similar. Proportionally, high-resource languages benefit the most from fixing syntactic errors. In lower-resource languages, there is proportionally more chance of the explanation misleading the translation (pass→no pass). In four-shot (row 4), the effect of explanations is much smaller (pass→no pass or no pass→pass).

G.6 Translation pass rate with different program lengths
Typically, generating longer programs is harder.
We look into the success rate of each of our trials with respect to source program length to see whether there are any patterns. We find that explanations affect translation roughly uniformly across lengths, with better performance on long programs in high-resource languages. In the top-left box plot of Fig 7, we can see a more significant improvement for the longest set of programs with explanations. This effect dampens slightly as we translate into lower-resource languages.
G.7 Python-to-X translation pass@10 (n=20)

In the main results (Table 1), we presented pass@1 results with GPT-3.5. For convenience and cost, we also report pass@10 results from the same trials (Table 11), but note that for a more accurate and optimal estimate, pass@10 should be estimated with n = 200 and t = 0.8 (Cassano et al., 2022).
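Pass@k from n samples is computed with the unbiased estimator of Chen et al. (2021); with n = 20 and k = 10 this can be sketched as:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), i.e., the probability that
    at least one of k samples drawn from n (of which c are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```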
Smaller relative improvements than pass@1. Overall, Table 11 shows smaller improvements in pass@10. This is reasonable because adding an explanation ultimately restricts the generation space and lowers the diversity of the output generations. Still, we see consistent improvements with explanations.
Zero-shot exp provides the best coverage. In the top 4 rows of Table 11, exp outperforms the rest in the most directions (9/19). This is probably because there are countless ways of explaining a program in free-form natural language, and abstract explanations impose the fewest constraints on generating a diverse set of programs (better recall).

Four-shot better/detailed explanations lead to better coverage. In the bottom 4 rows of Table 11, either direct translation, exp-lbl-d, or heuristically selected explanations win, indicating that with good-quality explanations we can still obtain improvements in the few-shot setting.
G.8 Python-to-X translation performance vs. NL-to-Code performance
To investigate whether NL-to-code performance correlates with Python-to-X translation performance, we compare our zero-shot results with those of Cassano et al. (2022) using code-davinci-002. In Fig 8, we can see a strong correlation between the two. On top of direct translation, explanations (the best explanation for each target language) improve translation (in absolute difference) uniformly across source languages, with higher relative improvements in languages that are hard for the NL-to-code task (lower-resource languages).

G.9 Python-to-X for open-source models

H X-to-X open-source model results

All experiments detailed here are conducted with GPT-3.5 unless otherwise specified.
There are still improvements with self-generated explanations across most directions. With the weaker open-source model CodeGen2-1B, self-generated explanations improve results more consistently than with the 16B model (the baseline is the worst in all X-to-X directions, and in 17/18 Python-to-X directions), with as much as 300%+ relative improvement (Table 12; Lua→Python, Python→JavaScript). In Python→Ruby, the model with explanations obtains a pass rate of 5.9, while the baseline does not generate a single correct translation (pass rate of 0). CodeGen2-16B shows weaker results, with the baseline outperforming in 10/18 directions in Tables 1 and 2. Perhaps it has weaker alignment between natural language and programming language, resulting in worse explanations for each problem; the majority of errors from translation with explanations are syntactic. For Llama2CodeInstruct-34B, there are consistent absolute improvements of 5%-10% and maximum relative improvements of up to 40% (Java→C++ in Table 2).

We plot per-problem direct pass@1 and whether exp improves over direct (Fig 4). The direct pass@1 rate serves as a good approximation of how difficult a problem is for the model. In the main text we discuss that exp improves difficult problems more often than easy problems. For easy problems (the right-most column), explanations can often decrease performance, perhaps because redundant or inconsistent information leads to confusion. This indicates that a potential way to improve performance further is to automatically pick only the difficult problems to explain.
To show that such a strategy works, we assume access to an oracle metric (direct pass@1) and leverage our cached generations from the direct and exp translations. For each problem, if the direct pass@1 rate is smaller than a threshold (i.e., a difficult problem), we use the explanation; otherwise we use direct translation. We repeat this for all 36 translation directions in Python-to-X and X-to-X and present full results in Table 14. Immediately we can see that 1) low-resource languages typically require more explanations, and 2) select almost always outperforms direct and exp (losing in only one case, D→C++). In the best case (Racket→Julia), we see as much as a 9.6% relative improvement over exp with select, while explaining fewer than half of the problems. This is, however, still impractical at test time: approximating hardness through direct translation requires more computation than generating a single explanation. Ideally, one could build classifiers or use heuristics to select which programs to explain. We leave this for future work.
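The oracle select strategy amounts to a simple per-problem switch (a sketch; the threshold is a free parameter here, not a value from our experiments):

```python
def select_generation(direct_pass_at_1, direct_output, exp_output, threshold=0.5):
    # Use the explanation-based translation only for difficult problems,
    # where difficulty is approximated by the oracle direct pass@1 rate.
    if direct_pass_at_1 < threshold:
        return exp_output
    return direct_output
```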

J Alternative latent language guidance
In addition to asking the model to generate explanations, we experimented with various other forms of latent language (ordered from structurally formal to free-form natural language). We report their pass@1 (n=1) for Python-Java.

• Pivot language: Instead of generating the target language directly, we ask the model to translate to a pivot language first and then translate that to the target language. For this initial experiment, we take the first generation from direct translation into the pivot language as the intermediate step regardless of its accuracy. pass@1 (C++) = 0.732, pass@1 (Bash) = 0.81, pass@1 (R) = 0.703. More experiments in the next section.
• Pseudocode: An intermediate form of program sketch described with a mix of mathematical operations and natural language. To ensure the format of the pseudocode, we prompt with \begin{algorithm} and use \end{algorithm} as the stop token. pass@1 = 0.861

• CoT: In chain-of-thought (CoT) prompting, we break down the input program and translate each sub-component before combining the results. In the decomposition phase, we try decomposing through the model's perception of "steps" within the algorithm, as well as programmatically extracting function calls within source programs that are often hard to translate (especially in low-resource languages). pass@1 = 0.734

• Steps: An ordered list of natural language steps describing the major steps of the program, following Jiang et al. (2023). pass@1 = 0.824

• Summary: A free-form natural language summary of what the program does. pass@1 = 0.854

• Gold summary: We use the original human-written docstring instructions (from HumanEval) as gold summaries and ask the model to translate given the program and summary. pass@1 = 0.813

J.1 Model's dependency on pivot program accuracy
Within the pivot program experiments (Python-C++-Java, Python-Bash-Java, Python-R-Java), we further analyzed Java accuracy by measuring subset accuracy: we split the set of source problems into those with a correct pivot translation and those with an incorrect one. As seen in Table 15, there is no clear difference between the subsets in which the pivot translation passes or fails:

• For C++ and R, regardless of pivot accuracy, Java translation accuracy drops with the pivot.
• For Bash, regardless of pivot accuracy, Java translation accuracy improves with the pivot.
Although the effect of the pivot language seems monotonic at an aggregate level, this is not to say that the pivot language has no effect on translation accuracy, because the programs that are correctly translated into the pivot language likely have characteristics that confound translation accuracy.
To further investigate whether we can change behavior at the individual problem level, we pick either only correct or only incorrect pivot programs sampled from ChatGPT and observe translation performance (Table 16). If no such correct/incorrect pivot program exists, we discard that problem. Since there could be bias in the dataset (for a specific language, harder problems may be more likely to have incorrect pivot programs than correct ones), we experiment with various high-/low-resource combinations of target and pivot languages in order to draw conclusions that overcome such bias.
Formal language as an intermediate step can achieve equivalent or better results than natural language. In Table 16, comparing exp (bottom row) against Correct Pass@1, we can see that using formal language as the intermediate step can indeed reach or surpass using explanations. Using a pivot language of higher resource than the target language almost always helps more than using a lower-resource one, except for rkt-java, which may simply have been evaluated on an easier subset. This is intuitive because higher-resource language generations are in general of better quality, and if the benefit of the extra information and generation length outweighs the noise, this is a valid way of boosting performance. Natural language can be thought of as the extreme case with the highest resource level, with a high probability of quality self-generated context.
Formal intermediate steps are highly unpredictable. Glancing at the difference between Correct Pass@1 and Incorrect Pass@1, we can see that incorrect pivot programs lead to drastically worse performance. In the breakdown of errors, we see a lot of Incorrect Unhelpful@1, indicating that the pivot programs themselves are unhelpful. Even if we assume the model generates no unhelpful generations and combine Incorrect Unhelpful@1 with Incorrect Pass@1, we still see a significant gap between incorrect and correct pivot programs. Specifically, Incorrect Semantic Error@1 tends to be much higher than Correct Semantic Error@1. In the ablation studies of Table 4, we learned that when a wrong intermediate step is highly related to the source program in semantics, it decreases translation performance more. In this experiment, since the semantics of the source and pivot programs are almost identical, mistakes in the pivot program can have deleterious effects on translation.
Natural language mistakes are taken less seriously. To compare the effect of mistakes in natural language vs. pivot programs, we include Incorrect retrieved exp Pass@1 from Table 4. Since swapping an explanation with the closely related explanation of a similar problem guarantees that the explanation is wrong, we can compare this with Incorrect Pass@1. We find that, on average, mistakes in natural language explanations do not decrease translation performance as much as programming language mistakes do.

K GPT-3.5 score heuristics
In addition to the two heuristics mentioned in Sec 3.4, we also tried prompting GPT-3.5 to select the best explanation among multiple explanations, following work on automatic generation evaluation with LLMs (Kocmi and Federmann, 2023; Wang et al., 2023b; Chen et al., 2023c; Fu et al., 2023). None of these methods outperformed random selection, so we do not include them in Table 3. GPT-3.5 scores (direct assessment) of the explanations almost always fall between 90 and 100.

All entries indicate 0-shot Pass@1 (n=20). The column labels indicate the translation direction, the pivot language, and the number of available problems. For example, rkt-java refers to translating Python to Java with Racket as the pivot language. Semantic Error is equivalent to assertion error; Unhelpful generations include incomplete code with comments like "// TODO", "// Write your code here". All other errors are grouped under Syntax Errors.

L Coder-reviewer details
Coder-Reviewer is a re-ranking method introduced by Zhang et al. (2022) to re-rank code generations without verification through symbolic tools (i.e., compilers) in NL-to-code tasks. They found that averaging the logprob score from the "coder" (which estimates a length-normalized p(code|NL)) and the "reviewer" (which estimates a length-normalized p(NL|code)) serves as a good metric for re-ranking code generations. Formally, the score is defined as

score(x, y) = α · log p(y|x) / |y| + (1 − α) · log p(x|y) / |x|

where x represents the natural language description of the code, y represents the generated code, and α is a hyperparameter that weighs the importance of the two terms.
In our problem, we have the inverse task of finding the best explanation x given y. Since the score is symmetric, we use the same formula during re-ranking.
To calculate the logprobs, we use CodeGen2-16B (Nijkamp et al., 2023). We use the prompts in L.1 and L.2. To obtain the best performance in estimated pass rate (Table 3), we try 0-, 1-, or 2-shot prompts (when GPU memory allows), and vary α between 0 and 1 in intervals of 0.1 (except for Python-Racket exp, where we additionally tried intervals of 0.02 between 0.8 and 1.0). In Figure 10 we plot the best-performing setting for each experimental trial across α.
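Under the weighted, length-normalized formulation above, re-ranking explanations can be sketched as follows (the candidate tuple layout and helper names here are illustrative, not our exact implementation):

```python
def coder_reviewer_score(logp_code_given_nl, code_len,
                         logp_nl_given_code, nl_len, alpha=0.5):
    # alpha weighs the length-normalized "coder" term p(code|NL)
    # against the "reviewer" term p(NL|code).
    coder = logp_code_given_nl / code_len
    reviewer = logp_nl_given_code / nl_len
    return alpha * coder + (1 - alpha) * reviewer

def rerank(candidates, alpha=0.5):
    # candidates: list of (explanation, logp_code, code_len, logp_nl, nl_len)
    return max(candidates,
               key=lambda c: coder_reviewer_score(c[1], c[2], c[3], c[4], alpha))
```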
In sections L.3 and L.4, we show examples of correct and incorrect explanation selection by coder-reviewer (to give an idea of what the explanations look like, we include only 5 of the 20 total explanations).

L.1 Coder prompt
Can you write a Python program given this explanation?

### Explanation
This function takes in a list of integers and returns a boolean indicating whether the list is sorted in non-descending order and has no more than two occurrences of any integer.The function first creates a dictionary where the keys are the integers in the list and the values are the number of times that integer appears in the list.

M Alternative explanation-selection setting
In Sec 3.4, we sample 20 explanations for each problem and generate 1 program from each explanation. By sampling more than 1 program per explanation, one could obtain variance estimates of performance (by simulating the pass status of each program according to the pass rates of the selected explanations on the train set), but we find that sampling one program is good enough for estimating final performance. Moreover, given the same budget, it is much better to sample 20 × 1 (Table 3) than 4 × 5 to maximize explanation diversity and retain the potential of sampling the best explanation. Table 17 shows the results for the 4 × 5 experiments.
Comparing the results in Table 17 to the main results in Table 3, we see much smaller improvements from the heuristics. It is likely that 4 is not a large enough sample size to obtain a correct explanation for some problems, resulting in low coverage and small improvements.

This function takes a string as input and returns a string. If the input string is empty, it returns an empty string. Otherwise, it finds the longest suffix of the input string that is also a prefix of the reverse of the input string. It then returns the input string concatenated with the reverse of the remaining part of the input string.

N Program obfuscation
Above is an example of the program (humaneval_10_make_palindrome) before and after obfuscation using tools from Lachaux et al. (2020). After obfuscation, function and variable names are all replaced with generic surface forms, while the functionality of the program remains unchanged. As the example indicates, explanation quality does not really decrease; in fact, explanations often become more detailed simply because there is no longer a generic way of describing an operation or term like "palindrome".
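A minimal obfuscation sketch in the spirit of the Lachaux et al. (2020) tooling (our simplified illustration, handling only a single top-level function; the real tool is more thorough):

```python
import ast

def obfuscate(src):
    """Rename the top-level function and its variables to generic names."""
    tree = ast.parse(src)
    fn = tree.body[0]
    mapping = {fn.name: "FUNC_0"}
    # Collect argument and assigned-variable names in order of appearance.
    for node in ast.walk(fn):
        if isinstance(node, ast.arg) and node.arg not in mapping:
            mapping[node.arg] = f"VAR_{len(mapping) - 1}"
        elif (isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
              and node.id not in mapping):
            mapping[node.id] = f"VAR_{len(mapping) - 1}"

    class Renamer(ast.NodeTransformer):
        def visit_FunctionDef(self, node):
            self.generic_visit(node)
            node.name = mapping.get(node.name, node.name)
            return node
        def visit_arg(self, node):
            node.arg = mapping.get(node.arg, node.arg)
            return node
        def visit_Name(self, node):
            # Builtins like str/len are never in the mapping, so they survive.
            node.id = mapping.get(node.id, node.id)
            return node

    return ast.unparse(Renamer().visit(tree))
```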
To qualitatively examine explanations' effect on translating semantically confusing programs, we translated obfuscated Python programs using direct, exp, and exp-lbl (Table 18). Similar to the Python-to-X experiments, we generate explanations with Python-Java, remove Java-specific information, and re-use the explanations across the rest of the directions.

Target-specific information improves performance. In the zero-shot and four-shot settings (Tables G.1, G.2), we see slight improvements with target-specific explanations. However, length-wise, we do not see a pattern between target-specific and target-independent explanations in the 0- and four-shot settings.
Heuristically selected explanations are longer. Comparing Python-to-X exp (four-shot, coder-reviewer) vs. Python-to-X exp (four-shot), and Python-to-X exp-lbl-d (four-shot, frag) vs. Python-to-X exp-lbl-d (four-shot), we can see that both kinds of heuristically selected explanations are longer than their random baselines. However, as seen in Table 3, the len heuristic does not do nearly as well as the winning heuristics. This indicates that length is important, but is not the only signal determining translation success.

Formal intermediate steps can be more efficient
In Table 16, we see improvements of a similar scale from using correct pivot programs as intermediate steps. We conclude from the table that using a higher-resource language as the pivot works better, and in this case higher-resource language generations do tend to be longer than lower-resource ones. It would be interesting to understand how the verbosity of a language correlates with its usefulness as an intermediate reasoning step.

Fig 1
Fig 1 shows an example of our prompts for program translation. In addition to the direct translation baseline (Fig 1, left), we experiment with 3 types of explanations (full prompts are given in Apx C):
1. exp: We ask the model to explain the source program in a few sentences (Fig 1, right).
2. exp-lbl: We ask the model to explain the source program line by line. This roughly mirrors the setup in Chen et al. (2023b).
3. exp-lbl-d: We ask the model to explain the source program line by line in additional detail. In particular, if an individual line is complicated, we ask the model to break it down and explain it further.
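The three variants can be sketched as prompt templates; the wording below is illustrative, and the paper's exact prompts are in Apx C:

```python
# Hypothetical prompt templates for the three explanation variants.
INSTRUCTIONS = {
    "exp": "Explain what this {src} program does in a few sentences.",
    "exp-lbl": "Explain this {src} program line by line.",
    "exp-lbl-d": ("Explain this {src} program line by line; if a line is "
                  "complicated, break it down and explain it in detail."),
}

def build_prompt(code, variant, src="Python", tgt="Java"):
    return (f"### {src} version\n{code}\n\n"
            + INSTRUCTIONS[variant].format(src=src)
            + f"\n\nThen rewrite the program in {tgt}.")

prompt = build_prompt("def f(): pass", "exp-lbl")
```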

Figure 1 :
Figure 1: Compared to the direct code translation prompt, exp (ours) prompts models to explain the code before translating. Blue highlights are model completions, and red highlights point out the crucial difference between the two translations. Example prompts and explanations for exp-lbl and exp-lbl-d are in Apx C, E.
) incorporate model-generated rationales given question-answer pairs as part of fine-tuning to improve model reasoning capabilities. Jiang et al. (2023) use few-shot examples to teach models to create NL steps from NL instructions before generating the code. Zelikman et al. (2023) decompose complex problems in NL and generate/verify subproblems to achieve high performance in NL-to-code. Chen et al. (2023a) finetune policy models to correct code given human critique. Wang et al. (2023c) search multiple hypotheses in NL before generating PL targets. Our method uses self-generated context without overly relying on feedback, few-shot examples, or complicated frameworks, and targets code translation specifically instead of NL-to-code generation. Chen et al. (

Figure 6 :
Figure 6: Python-to-X translation status conversion between baseline and explanation. The x-axis indicates baseline status and the y-axis indicates translation status with explanations. In the top figure, results are aggregated across target languages. In the bottom figure, results are aggregated across exp, exp-lbl, and exp-lbl-d.

Figure 9 :
Figure 9: X-to-X translation with GPT-3.5 (pass@1, zero-shot): improvements (best explanation over baseline) across models, grouped by source-target resource level.

Figure 10 :
Figure 10: Coder-Reviewer in the best few-shot setting, varying the α hyper-parameter. Black dotted lines are the average baseline performance where explanations are selected randomly.
a string as input and returns a palindrome by appending the reverse of the string's suffix to the string. The suffix is the smallest substring that needs to be appended to the string to make it a palindrome. If the input string is empty, the function returns an empty string.

### Obfuscated Python version
def FUNC_0(VAR_0: str) -> str:
    if (not VAR_0):

Table 1 :
Translation pass@1 from Python to X. * uses heuristically selected explanations (Sec 3.4). Parentheses in the trial column indicate the # of shots. The best within the same-shot setting (no heuristics) is underscored and the overall best is in bold.

Table 2 :
Translation pass@1 between 16 different pairs of languages.Resource indicates the language resource levels of the source and target.Type indicates the source and target language typing characteristics (D/S=dynamically/statically typed).The best runs within the same-shot setting are in bold.
Inadequate test: exact comparison Python tests on integer and float types are often inadequate for establishing equality. Depending on the library and the specific function called, integers or floats can be rounded with different precision. Because tests in the TransCoder evaluation dataset work by comparing a gold Python program's output with that of the generated Python program, if the generated program does not use the exact same function call, exact equality (==) is not sufficient. Thus, we changed all such cases from checking exact equality to checking approximate equality. We found 100+ examples of this type of error. We stress that we have only fixed existing issues we noticed in the C++-to-Python direction. With sampled programs from Java, we have also observed errors in gold programs and tests frequently.

Leakage to training data Another reason we do not formally evaluate our methods on the TransCoder dataset is that the TransCoder dataset is very likely to have been included in GPTs' training corpora. We are often able to generate the entire program in the test/valid split by providing only the signature. The unique detokenized program pattern makes it easy for the model to detect and regurgitate training sequences.
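The approximate-equality fix described under "Inadequate test" can be sketched as follows (the tolerance value is illustrative, not necessarily the one used for the fixes):

```python
import math

# Replace exact equality with approximate equality for numeric outputs,
# so that differing rounding/precision between the gold and generated
# programs does not cause spurious test failures.
def approx_equal(a, b, rel_tol=1e-6):
    if isinstance(a, float) or isinstance(b, float):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b

assert (0.1 + 0.2) != 0.3            # exact comparison fails
assert approx_equal(0.1 + 0.2, 0.3)  # approximate comparison passes
```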
in Table 6. The model explains that the function first checks if the input list is empty. If it is, it returns an empty list. Otherwise, it sorts the list using the 'sorted' function. The 'reverse' parameter is set to 'True' if the sum of the first and last elements of the array is even (and the array will be sorted in descending order); otherwise, it is set to 'False' (and the array will be sorted in ascending order).
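The program this explanation describes can be reconstructed roughly as follows (a hypothetical reconstruction of the HumanEval-style task; the name is illustrative):

```python
# Sketch matching the explanation above: an empty input returns an empty
# list; otherwise sort descending when the sum of the first and last
# elements is even, and ascending when it is odd.
def sort_array(array):
    if not array:
        return []
    return sorted(array, reverse=(array[0] + array[-1]) % 2 == 0)
```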

• Assertion Error catches all cases where the function output does not match the expected output.
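The role of AssertionError in evaluation can be sketched as below (an illustrative harness, not the actual evaluation code):

```python
# A test harness in which AssertionError specifically flags output
# mismatches, while any other exception indicates a runtime error.
def run_test(func, inp, expected):
    try:
        assert func(inp) == expected
        return "pass"
    except AssertionError:
        return "wrong output"
    except Exception as exc:
        return f"runtime error: {type(exc).__name__}"

status = run_test(lambda x: x + 1, 1, 2)  # -> "pass"
```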

Table 8 :
Comparing zero-shot target-independent explanations vs. target-specific explanations across 18 languages. Two-tailed paired t-test, p = 0.090.

Table 11 :
Translation pass@10 from Python to X. * indicates heuristically selected explanations (Section 3.4). Parentheses in the trial column indicate the # of shots. Res = target language resource level. The best within the same-shot setting (no heuristics) is underscored and the overall best is in bold.

Table 12 :
Translation pass@1 from Python to X (0-shot) with open-source models. The best within the same model is in bold. Between CodeGen2-1B with its own explanations and CodeGen2-1B with explanations from ChatGPT, the better result is underlined.

Table 15 :
Translation using a formal language as the pivot intermediate "explanation". Results are obtained with pass@1 (n=1). direct indicates translation accuracy with direct translation (no explanation), and exp means explain-then-translate. T indicates that the pivot program is correct, and similarly F that it is incorrect. Hence, pass(direct|T) means the pass rate for direct translation within the subset of programs whose pivot program is correct.

Table 16 :
We use correct/incorrect pivot program translations to observe how sensitive models are to formal intermediate reasoning steps.
It then checks if any integer appears more than twice in the list, and returns False if so. Finally, it checks if the list is sorted in non-descending order, and returns True if so, and False otherwise.

This program checks if two strings concatenated together form a valid string of balanced parentheses. It does this by iterating through each character of the concatenated string and keeping track of the number of open and closed parentheses. If the number of closed parentheses ever exceeds the number of open parentheses, the string is invalid. If the number of open and closed parentheses is equal at the end of the string, the string is valid. The function returns 'Yes' if either the original concatenated string or the reversed concatenated string is valid, and 'No' otherwise.

This Python program takes a list of two strings as input, and checks if the two strings can be concatenated in any order to form a string with balanced parentheses. If the parentheses are balanced, the function returns 'Yes', otherwise it returns 'No'. The function 'check' takes a string as input and checks if the parentheses in the string are balanced. It does this by iterating through the string and keeping track of the number of open and closed parentheses encountered so far. If the number of closed parentheses encountered is greater than the number of open parentheses encountered, the function returns False. If the number of open and closed parentheses is equal at the end of the string, the function returns True, otherwise it returns False. The function then concatenates the two input strings in both possible orders and checks if either of the resulting strings has balanced parentheses. If at least one of the strings has balanced parentheses, the function returns 'Yes', otherwise it returns 'No'. To rewrite this program in Java, you would need to replace the 'List' type with an array of strings, and use a for loop to iterate through the characters in the input strings. You would also need to use the 'StringBuilder' class to concatenate the strings. Here is an example Java implementation:

This program checks if it is possible to create a valid string of parentheses by concatenating two strings from a list of strings. The function 'match_parens' takes a list of strings as input and returns 'Yes' if it is possible to create a valid string of parentheses by concatenating two strings from the list, and 'No' otherwise. The function 'check' checks if a string of parentheses is valid by counting the number of opening and closing parentheses. If the number of closing parentheses is greater than the number of opening parentheses at any point, the string is invalid. If the number of opening and closing parentheses is equal at the end of the string, the string is valid. The function 'match_parens' concatenates the first two strings in the list in two different orders and checks if either of the resulting strings is valid. If either string is valid, the function returns 'Yes', otherwise it returns 'No'.

This program takes a string as input and returns a list of words. The input string is first converted into a list of characters. If a comma is encountered in the string, it is replaced with a space. The list of characters is then joined back into a string and split into a list of words using the default whitespace separator. If the input string is empty, an empty list is returned.

### Explanation 2 (pass 1/1)
This program takes a string as input, and returns a list of words. It does this by first checking if the string is empty, and if it is, it returns an empty list. If the string is not empty, it creates an empty list called 's_list', and then iterates over each letter in the input string. If the letter is a comma, it appends a space to 's_list', otherwise it appends the letter itself. After iterating over all the letters in the input string, it joins all the elements in 's_list' into a single string, and then splits that string into a list of words using the default whitespace separator.

This program takes in a string as input and returns a list of words. The input string is first split into a list of characters. If a character is a comma, it is replaced with a space. The list of characters is then joined back into a string and split into a list of words. If the input string is empty, an empty list is returned.

This function takes a string as input and returns a list of words. It does this by first converting all commas in the string to spaces, and then splitting the string into a list of words. If the input string is empty, it returns an empty list.
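The comma-splitting program these explanations describe can be reconstructed roughly as follows (a hypothetical reconstruction; the name is illustrative):

```python
# Sketch matching the explanations above: replace commas with spaces,
# then split on whitespace; an empty input yields an empty list.
def words_string(s):
    if not s:
        return []
    s_list = [' ' if letter == ',' else letter for letter in s]
    return ''.join(s_list).split()
```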

Table 19 :
Intermediate step (explanation, pivot program) lengths and their ratio to the source Python program.

More detailed explanations are longer However, as we have noted in Table 1, more detailed explanations do not always lead to larger improvements. In high-to-low-resource directions, more generic (shorter) explanations often work better. This is one of the cases where length does not correlate well with performance.