CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation

Recent code translation techniques exploit neural machine translation models to translate source code from one programming language to another, to satisfy production compatibility or to improve the efficiency of codebase maintenance. Most existing code translation datasets focus on only a single pair of popular programming languages. To advance research on code translation and meet the diverse requirements of real-world applications, we construct CodeTransOcean, a large-scale comprehensive benchmark that supports the largest variety of programming languages for code translation. CodeTransOcean consists of three novel multilingual datasets, namely, MultilingualTrans, supporting translation between multiple popular programming languages, NicheTrans, for translating between niche programming languages and popular ones, and LLMTrans, for evaluating the executability of code translated by large language models (LLMs). CodeTransOcean also includes a novel cross-framework dataset, DLTrans, for translating deep learning code across different frameworks. We develop multilingual modeling approaches for code translation and demonstrate their great potential in improving the translation quality of both low-resource and high-resource language pairs and boosting training efficiency. We also propose a novel evaluation metric, Debugging Success Rate@K, for program-level code translation. Last but not least, we evaluate the LLM ChatGPT on our datasets and investigate its potential for fuzzy execution predictions. We build baselines for CodeTransOcean and analyze challenges of code translation to guide future research. The CodeTransOcean datasets and code are publicly available at https://github.com/WeixiangYAN/CodeTransOcean.


Introduction
Early software systems were developed using programming languages such as Fortran and COBOL, which have a significantly smaller user base than modern mainstream programming languages (e.g., Python and Java). Hence, maintaining and modernizing early software systems is expensive (Opidi, 2020). Moreover, the readability and compatibility of a mixed multitude of programming languages are challenging when migrating existing software systems to new technology ecosystems or integrating software systems written in different programming languages. The code translation task aims to convert source code from one programming language to another and is of great value in industry.
Code translation methods have evolved from inefficient, costly, and error-prone manual rewriting to automatic methods. Automatic code translation methods can be categorized into compilers and transpilers, rule-based methods, and neural network based methods. Neural models (Feng et al., 2020; Wang et al., 2021, 2023b) have become dominant in code translation. Details of code translation methods are presented in Appendix A.1. The performance of neural models relies heavily on large-scale high-quality parallel data. However, existing code translation datasets are limited by insufficient coverage of programming languages (mostly focusing on a single pair of popular programming languages), limited scale, and uneven data distribution. The widely used CodeTrans (Lu et al., 2021) is a small dataset containing only Java-C# parallel data with quite short code samples. Other datasets (Ahmad et al., 2023; Rozière et al., 2020; Zhu et al., 2022b; Nguyen et al., 2013; Chen et al., 2018) suffer from the same limitations. Consequently, existing code translation models (Feng et al., 2020; Wang et al., 2021; Ahmad et al., 2021) are confined to a narrow range of one-to-one code translation scenarios. Moreover, deep learning has been broadly adopted and has achieved unprecedented success, yet there are barriers between different deep learning frameworks in actual production processes. Existing code translation datasets also neglect important demands from real-world applications, including modernizing early software systems developed in niche programming languages and migrating code across different deep learning frameworks.
To address these limitations and advance neural code translation models, we construct a large-scale comprehensive multilingual code translation benchmark, CodeTransOcean, summarized in Table 1. CodeTransOcean is an innovative benchmark that aims to provide a unified platform for evaluating various models on a comprehensive set of code translation tasks that reflect real-world demands. Based on this goal, each dataset in CodeTransOcean is specifically designed to tackle a key challenge in the field of code translation. CodeTransOcean includes three multilingual datasets, namely, the MultilingualTrans dataset (covering eight popular programming languages), the NicheTrans dataset (translating between thirty-seven niche programming languages and the eight popular ones), and a specialized dataset LLMTrans (including 350 data samples and their execution results) to evaluate the executability of code translated by large language models (LLMs), as well as a cross-framework dataset DLTrans facilitating our proposed task of translating code between deep learning frameworks to enhance code reusability. We define popular and niche programming languages based on the TIOBE Programming Community Index, a metric of the popularity of programming languages.
DLTrans includes 408 samples covering four mainstream deep learning frameworks.
Multilingual modeling shows great potential in neural machine translation (Aharoni et al., 2019; Wang et al., 2020; Zhu et al., 2023), but it has not been systematically explored for code translation. We investigate multilingual modeling for code translation using our MultilingualTrans, NicheTrans, and DLTrans datasets. Experimental results demonstrate that multilingual modeling significantly improves translation quality for both high-resource and low-resource language pairs and improves model training efficiency.
Recent research indicates that the proficiency of the LLM ChatGPT in natural language translation is on par with commercial-grade translation systems (Jiao et al., 2023). To the best of our knowledge, our work is the first to systematically investigate the potential of ChatGPT in code translation. We develop a fully automated translation-execution-evaluation pipeline, AutoTransExecuter, to support this study. Note that match-based metrics and execution-based metrics have been used for evaluating code translation methods, with details in Appendix A.1. In order to accurately evaluate the usability of translated code from ChatGPT, we propose a novel execution-based evaluation metric, Debugging Success Rate@K (DSR@K), which is the percentage of samples whose translation results successfully execute and produce the expected functionality after K debugging rounds. On our LLMTrans dataset, the baseline ChatGPT setting achieves 48.57% DSR@0. We find that self-debugging and one-shot prompting improve the performance while chain-of-thought strategies degrade the translation accuracy. Since our AutoTransExecuter still cannot cover arbitrary programming languages, we also propose a novel metric, fuzzy execution, attempting to address the limitations of existing evaluation metrics for code translation. Our preliminary study shows that ChatGPT is still inadequate for predicting fuzzy execution for arbitrary programming languages, which demands future research.
Our contributions can be summarized as follows:
• A large-scale multilingual code translation benchmark: CodeTransOcean covers the largest number of popular and niche programming languages so far, at the largest scale. It also includes an unprecedented dataset for translating code across different deep learning frameworks, as well as a dataset and an automated pipeline for evaluating LLMs on code translation. We establish baselines for all datasets in CodeTransOcean.
• Multilingual modeling for code translation: We are the first to systematically evaluate multilingual modeling on code translation for both high-resource and low-resource language pairs. Experimental results demonstrate that multilingual modeling significantly improves translation quality for both high-resource and low-resource language pairs and improves training efficiency.
• ChatGPT on code translation: We conduct the first comprehensive study of the potential of ChatGPT on code translation, investigating the efficacy of prompting strategies, hyperparameters, self-debugging, One-shot, and Chain-of-Thought.
• New evaluation metrics: We propose DSR@K to evaluate translation and debugging capabilities of LLMs. We also propose a fuzzy execution metric based on LLMs and conduct a preliminary study using ChatGPT on this metric.

Related Work
Code Translation Datasets The success of neural models for code translation relies heavily on large-scale high-quality parallel data. However, existing code translation datasets are plagued by issues such as insufficient coverage of programming languages, limited scale, and imbalanced data distribution. The widely used code translation dataset CodeTrans (Lu et al., 2021) contains only Java-C# parallel data with quite short code samples (Zhu et al., 2022b), making it insufficient for diverse code translation tasks. With the limitations of existing code translation datasets, neural models trained on them may encounter overfitting, underfitting, and poor generalizability. Clearly, these issues impede the development of neural models for code translation. Therefore, constructing datasets that effectively address these problems is critical to enhancing the performance of code translation algorithms.
Code Translation Methods and Evaluation Metrics Details of code translation methods and evaluation metrics are presented in Appendix A.1.

The CodeTransOcean Benchmark
In this section, we provide detailed descriptions and analyses of our CodeTransOcean benchmark, including the code translation tasks, their associated datasets, and dataset statistics. Details of data collection methods and licensing information, as well as quality control and quality assessment, are presented in Appendix A.2. Note that there is no overlap between the CodeTransOcean datasets and existing code translation datasets.

Multilingual Code Translation
With the increasing need to unify the language variety when implementing system integration or extensions in multilingual programming environments, we construct the MultilingualTrans dataset for multiple popular programming languages. Among programming languages in the rankings, we select the Top-10 languages as popular ones, except JavaScript and SQL, and construct the MultilingualTrans dataset based on the remaining eight programming languages. We treat the other languages in the rankings as niche languages and construct the NicheTrans dataset for translating between niche languages and popular languages. Additionally, in order to quantitatively evaluate the execution capabilities of the code generated by LLMs (e.g., ChatGPT, PaLM2 (Anil et al., 2023)), we construct LLMTrans, which includes the execution results for a subset of MultilingualTrans and facilitates evaluating LLMs for multilingual code translation.
MultilingualTrans Dataset This dataset contains 30,419 program samples covering eight popular programming languages, namely, C, C++, C#, Java, Python, Go, PHP, and Visual Basic. Table 11 shows the statistics of each language pair. Note that XLCoST (Zhu et al., 2022a) is the only existing multilingual code translation dataset. Compared to XLCoST, MultilingualTrans is advantageous in its more balanced data distribution across programming languages, the practicality of its language pairs, and its data quality. For example, the real-world requirement for translating Java into JavaScript, as in XLCoST, is quite limited. As to data quality, MultilingualTrans originates from a programming chrestomathy website, with all data already reviewed and verified by the website.

NicheTrans Dataset
The NicheTrans dataset contains 236,468 program samples, covering code translation pairs from thirty-seven niche programming languages, including Ada, COBOL, Pascal, Perl, Erlang, Fortran, Scala, Julia, and others, to the eight popular ones. Table 12 shows statistics of each niche language. Although many studies have highlighted the practical necessity of code translation for modernizing niche programming languages (Chen et al., 2018; Zhu et al., 2022b; Rozière et al., 2020), our NicheTrans dataset is the first for code translation between these niche languages and popular ones. We believe this dataset will not only facilitate modernization of outdated programming languages more effectively, but also help augment and evaluate the generalizability of neural models.

LLMTrans Dataset
The LLMTrans dataset aims to provide a benchmark for evaluating the performance of LLMs on code translation. The dataset covers translation from seven popular programming languages to Python, totaling 350 program samples. We compile and test these samples and record the execution results. Based on this dataset, we design and implement an automated pipeline, AutoTransExecuter, which automatically uses LLMs to conduct code translation, execution, and debugging, and calculates the success rate. This dataset and the automated pipeline ease investigation of the actual debugging success rate of LLMs on code translation and effectively measure the practical usability of LLMs. Details of the LLMTrans dataset are in Table 1.
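The execution-and-check step of such a pipeline can be sketched as below. This is an illustrative sketch only (the paper does not publish AutoTransExecuter's internals); it assumes the target language is Python and that success is approximated by a clean exit plus captured program output:

```python
import subprocess
import sys

def run_translated_python(code: str, timeout: int = 10):
    """Execute a translated Python snippet in a subprocess.

    Returns (ok, output): ok is True when the process exits cleanly;
    output is stdout on success or stderr on failure, so a failing
    run's error text can be fed back to the LLM for debugging.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    ok = proc.returncode == 0
    return ok, (proc.stdout if ok else proc.stderr)
```

A full pipeline would additionally compare the captured output against the recorded execution result of the source program.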

Cross-framework Code Translation
Cross-Deep-Learning-Framework Translation Task The widespread application of deep learning (DL) has spawned the emergence of various DL frameworks, such as PyTorch, TensorFlow, MXNet, and Paddle. However, there are significant differences in syntax and dependency libraries between frameworks, severely impeding the reusability of projects. Moreover, studies illustrate significant disparities in energy consumption and economic costs during training and inference between frameworks (Georgiou et al., 2022). Selecting an appropriate DL framework for green AI has become paramount in an era of large models (Ananthaswamy, 2023).

Experiments
We present experiments of multilingual training for code translation (Section 4.1). We then introduce a novel evaluation metric, Debugging Success Rate@K, for program-level code translation (Section 4.2) and the first comprehensive exploration of ChatGPT for code translation (Section 4.3).

Multilingual Modeling
Multilingual modeling has been pivotal in broadening the applicability of neural machine translation (Aharoni et al., 2019; Wang et al., 2020; Zhu et al., 2023; Johnson et al., 2017). This is primarily evidenced in enhancing the performance of low-resource languages and cross-language transfer learning (Mohammadshahi et al., 2022; Zoph et al., 2016; Nguyen and Chiang, 2017; Johnson et al., 2017). CodeTransOcean covers nearly fifty programming languages and deep learning frameworks. We use its datasets to explore multilingual modeling on code translation tasks.

Table 3: Average BLEU scores of the four multilingual modeling strategies, One-to-One, Many-to-One, Many-to-Many, and One-to-Many, for All language pairs, High-resource language pairs, and Low-resource language pairs.
Experimental Setups In this work, we use pretrained CodeT5+ (Wang et al., 2023b) as the backbone, based on its superior performance on code understanding and generation evaluations reported in (Wang et al., 2023b). We use the MultilingualTrans dataset to investigate four multilingual modeling strategies based on data sharing in the source language, the target language, or both, namely, One-to-One, One-to-Many, Many-to-One, and Many-to-Many, with One-to-One as the baseline. Details of the four strategies are in Appendix A.5. To understand the strengths and weaknesses of the four strategies, we compare their average performance on all language pairs and focus on low-resource and high-resource pairs. Since the CodeBLEU metric (Ren et al., 2020) does not cover all eight languages in MultilingualTrans, we use BLEU to measure translation accuracy for the four strategies. Then, we establish baselines for the DLTrans and NicheTrans datasets.
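The four data-sharing strategies amount to different filters over a multilingual parallel corpus. The sketch below is illustrative only (the actual setup is described in Appendix A.5); the target-language tag prepended to each input follows common multilingual NMT practice and is an assumption, not a claim about the paper's exact preprocessing:

```python
def build_training_pairs(corpus, strategy, src=None, tgt=None):
    """corpus: iterable of (src_lang, tgt_lang, src_code, tgt_code) tuples.

    Returns (input, output) training pairs for the given strategy, with a
    target-language tag prepended to the input so that a single model can
    serve multiple translation directions.
    """
    keep = {
        "one-to-one":   lambda s, t: s == src and t == tgt,
        "one-to-many":  lambda s, t: s == src,           # fixed source language
        "many-to-one":  lambda s, t: t == tgt,           # fixed target language
        "many-to-many": lambda s, t: True,               # all directions
    }[strategy]
    return [(f"<{t}> {sc}", tc) for s, t, sc, tc in corpus if keep(s, t)]
```

Under this view, Many-to-Many simply trains on the union of all directions, which is why it can act as a regularizer relative to Many-to-One.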
We rank the resource richness of the eight programming languages in MultilingualTrans in descending order based on their data volumes in the CodeT5+ pre-training data, as Java, PHP, C, C#, Python, C++, and Go (Visual Basic is not covered by the CodeT5+ pre-training data). Based on this ranking, we consider Visual Basic, C++, and Go as low-resource languages and Java, PHP, and C as high-resource languages.
Results and Analysis Detailed experimental results are shown in Table 14 in the Appendix. For All language pairs, the performance of the four strategies is ranked as One-to-Many > Many-to-Many > Many-to-One > One-to-One. (1) Under the One-to-Many strategy, the model encoder can provide more comprehensive information for source language translation due to its ability to absorb more source language features, thereby improving the generalizability of the model. (2) Many-to-Many can be considered as expanding the One-to-Many strategy by employing a greater volume of non-source-language data for training. Since the encoder must be attuned to the features of various languages simultaneously under Many-to-Many, parameter sharing may potentially undermine the performance. (3) Many-to-One helps the model to learn from a broader range of data than the baseline. Specific patterns or expressions in diverse source languages assist the model in more precisely comprehending how to translate into the target language. The shared semantic representations across different source languages allow the model to implement effective transfer learning strategies. Furthermore, the increase in training samples enables the model to optimize the loss function more stably. These results are consistent with previous findings on multilingual modeling for natural language translation (Aharoni et al., 2019): Many-to-Many models, trained across multiple target languages instead of just one target language, can function effectively as a regularization strategy for Many-to-One, thereby reducing the possibility of overfitting.
For High-resource and Low-resource languages, as shown in Table 3, the ranking of the four strategies is the same as for All, but there is a notable difference in their adaptability across languages of varying resource scales. High-resource languages can benefit more effectively from the shared information across multiple source languages, whereas low-resource languages are relatively less equipped to handle the additional uncertainty and noise introduced by shared parameters, and thus often have to rely on a larger volume of source language data to optimize their benefits.
Results from the Many-to-Many strategy on the DLTrans and NicheTrans datasets are shown in Tables 4 and 5. The experimental results suggest that significant improvements in translation accuracy can be achieved by swapping the source and target languages in the training set to facilitate data augmentation and training a bidirectional model. Notably, prior studies on multilingual neural machine translation often overlook the comparison between One-to-Many and other strategies. Nevertheless, One-to-Many demonstrates superiority over the One-to-One baseline across all our experiments. Overall, our results strongly recommend a targeted multilingual modeling strategy for code translation, as it not only can translate multiple language pairs with a single model, but also achieves better and more stable accuracy than baselines.

Debugging Success Rate@K
For evaluations, we adopt existing code translation evaluation metrics in our experiments, including Exact Match (EM), BLEU, and CodeBLEU (details are in Appendix A.1.2). However, all these metrics are based on surface-form matching (or with some adaptations, as for CodeBLEU) and are not suitable for our program-level translation tasks since they cannot reliably evaluate the functional correctness of translated code. Moreover, in real-world software development scenarios, developers typically ensure the functionality of code by testing and debugging upon completion, rather than writing and testing multiple versions of the code to achieve the expected functionality, as measured by the existing pass@k (Kulal et al., 2019) metric.
Meanwhile, recent research shows that LLMs such as ChatGPT demonstrate preliminary code debugging capabilities (Chen et al., 2023b,a). Hence, we propose a novel and robust evaluation metric for LLMs on code translation, Debugging Success Rate@K (DSR@K), which measures whether the translated code can be compiled and executed with the same behavior as the input source code, given K rounds of debugging. To the best of our knowledge, DSR@K is the first metric designed to accurately reflect real-world software development scenarios.
DSR@K is the percentage of samples that successfully execute and produce the expected results among all samples, where each sample is given K debugging rounds by an LLM after the initial translation. If the generated code successfully executes and produces the expected results within these K rounds, the sample is marked as successful. DSR@K is computed as (1/N) Σ_{i=1}^{N} S(i, K), where N denotes the total number of samples, S(i, K) = 1 if the i-th code sample succeeds within K attempts, and S(i, K) = 0 otherwise. Note that DSR@0 can be used for program-level code translation evaluation for any model. In this work, we employ DSR@K to evaluate the ability of LLMs such as ChatGPT to debug code and to translate code using debugging results.
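The metric can be computed directly from the round at which each sample first succeeds; a minimal sketch:

```python
def dsr_at_k(first_success_round, k):
    """Compute DSR@K as defined above.

    first_success_round[i] is the debugging round (0 = initial translation,
    before any debugging) at which sample i first executed with the expected
    result, or None if it never succeeded within the allowed attempts.
    """
    n = len(first_success_round)
    successes = sum(1 for r in first_success_round if r is not None and r <= k)
    return successes / n
```

For example, with four samples that first succeed at rounds 0, 1, never, and 2 respectively, DSR@0 = 0.25 and DSR@2 = 0.75.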

ChatGPT for Code Translation
The recent LLM ChatGPT demonstrates competitive performance on language generation tasks such as summarization and machine translation (Yang et al., 2023; Peng et al., 2023; Gao et al., 2023). However, ChatGPT for code translation has not been systematically explored. We study the effectiveness and potential of ChatGPT on code translation and investigate strategies to improve its performance. We use DSR@K as the principal evaluation metric since we focus on the practical usability of ChatGPT. We use the ChatGPT API with gpt-3.5-turbo as the default model and evaluate on the LLMTrans dataset in all experiments. We investigate the efficacy of prompts, hyperparameters, and context in the zero-shot setting, then compare one-shot versus zero-shot and study Chain-of-Thought.

Effect of Prompts and Hyperparameters
Prior works show that prompts can influence the performance of ChatGPT (Zhong et al., 2023; Peng et al., 2023; Jiao et al., 2023). We set an initial prompt "Translate [SL] to [TL]: [SC]." as the baseline, where [SL] and [TL] denote the source language and the target language respectively, and [SC] denotes the source code. We also add "Do not return anything other than the translated code." to each prompting strategy to require ChatGPT to return only code, in order to ease code execution. We design three prompt variants. Details of the experimental settings and prompt variants are in Appendix A.6. We also investigate the effect of hyperparameters on code translation performance.
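For illustration, the baseline prompt can be assembled as follows; the exact concatenation of the two sentences into one message is an assumption, since only the template text is given above:

```python
def baseline_prompt(source_lang: str, target_lang: str, source_code: str) -> str:
    """Assemble the baseline zero-shot translation prompt.

    Fills the [SL]/[TL]/[SC] placeholders and appends the code-only
    instruction; how the pieces are joined is an illustrative assumption.
    """
    return (
        f"Translate {source_lang} to {target_lang}: {source_code}. "
        "Do not return anything other than the translated code."
    )
```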
As shown in Table 6, implementing role assignment, clarifying usage, and polite inquiry in prompts all degrade the performance compared to the baseline prompt. These results show that the baseline with the most straightforward prompt produces the best performance, possibly because it provides clear, short, and unambiguous instructions for the task. More intricate prompting strategies may introduce noise and confuse ChatGPT. The performance of the polite inquiry prompt is comparable to but still worse than the baseline performance. We speculate that the improvement from polite inquiries in prior studies (Akın, 2023) may stem from their explicit and comprehensive formulations, which make it easier for the model to understand the task requirements. We also observe in Table 6 that, consistent with prior findings, BLEU and CodeBLEU have no obvious positive correlation with the debugging success rate (DSR@0). Since the reference target code exhibits the same functionality as the source language code but their execution results could differ slightly, EM also does not correlate with DSR@0. Therefore, in subsequent experiments, we only report DSR@0. We also evaluate the CodeT5+_220M model on LLMTrans with the Many-to-Many strategy and find that its DSR@0 is 0, suggesting that CodeT5+_220M is unable to generate executable translation results.
ChatGPT selects the token with the highest probability during generation. The hyperparameter temperature influences the randomness of the generated text, while top_p controls the range of vocabulary considered during generation. Higher temperature or top_p could increase diversity in the generated results from ChatGPT. However, as shown in Table 16 in the Appendix, independently varying temperature or top_p does not notably change the performance of ChatGPT; hence, for the other ChatGPT experiments, we set both temperature and top_p to 0 to ensure stability and reproducibility.

Effect of Context
We explore a Divide-and-Conquer strategy, which segments the source language code into snippets (e.g., functions and subfunctions), translates each snippet independently, and then merges the outputs into the final result. As shown in Table 6, Divide-and-Conquer significantly degrades the performance. We hypothesize that the lack of global context in Divide-and-Conquer could prevent ChatGPT from considering the overall structure and variable configurations of the code during translation.
Effect of Self-debugging Since ChatGPT has shown preliminary capability in error detection and correction during code generation (Shinn et al., 2023; Chen et al., 2023b; Kim et al., 2023; Nair et al., 2023; Madaan et al., 2023), we use ChatGPT to perform multiple rounds of self-debugging and investigate the impact on DSR. Specifically, ChatGPT first translates the source language code into the target language (Python, as in our AutoTransExecuter) and then attempts to execute the translated code. If the translated code executes and exhibits the same functionality as the source code, it is regarded as a successful execution. Otherwise, feedback from the compiler is fed back to ChatGPT for the next round of translation, and this process is repeated until a pre-defined number K of debugging rounds is reached. The whole process is shown in Table 17 in the Appendix. As shown in Table 7, DSR improves significantly with multiple rounds of self-debugging. The first self-debugging round improves DSR by 3% absolute. Each subsequent round brings further gains, but DSR begins to plateau after the second debugging round. This suggests that ChatGPT is limited in its capacity to rectify errors after multiple debugging cycles, which is consistent with human behavior.
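The loop described above can be sketched generically; here `translate` and `execute` are stand-ins for the ChatGPT call and the compiler/interpreter step in AutoTransExecuter, so the sketch shows only the control flow, not the actual API usage:

```python
def translate_with_self_debugging(translate, execute, source_code, max_debug_rounds):
    """Sketch of the self-debugging loop.

    translate(source_code, feedback) returns candidate target code; feedback
    is None on the first round, otherwise the error message from the previous
    attempt. execute(code) returns (ok, result). Returns (code, rounds_used)
    on success, or (None, max_debug_rounds) if every attempt fails.
    """
    feedback = None
    for round_idx in range(max_debug_rounds + 1):  # round 0 = initial translation
        code = translate(source_code, feedback)
        ok, result = execute(code)
        if ok:
            return code, round_idx
        feedback = result  # error message fed back for the next round
    return None, max_debug_rounds
```

A sample counted by DSR@K is exactly one for which this loop returns a non-None code with `rounds_used <= K`.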
Effect of One-shot In-context learning (Brown et al., 2020) allows the model to learn from input examples, enabling it to understand and manage each new task. This method has been validated as an effective strategy for enhancing the performance of model inference (Peng et al., 2023; Liu et al., 2023a). Therefore, we explore one-shot learning for ChatGPT on code translation. We investigate three one-shot sample selection strategies. Descriptions of the strategies and the corresponding prompts are in Appendix A.7. Table 8 shows that all three One-shot learning strategies effectively improve DSR@0 of ChatGPT over the Zero-shot baseline. The Experiment #2 strategy (the provided contextual example has both the same source and target languages as the original task) achieves the best performance, yielding a 1.72% absolute gain in DSR@0, with Experiments #1 (example with the same target language but a different source language) and #3 (example with different source and target languages) following closely with 1.14% and 0.29% absolute gains, respectively. These results show that One-shot learning entirely tailored to the translation requirements is most effective in boosting code translation performance for ChatGPT. The results corroborate previous findings in natural language translation (Peng et al., 2023) that the performance of ChatGPT is sensitive to the provided contextual example in One-shot learning.
Effect of Chain-of-Thought Chain-of-Thought (CoT) allows the model to simulate an orderly and structured way of thinking by laying out its reasoning process, guiding the model to output the final answer step by step (Wei et al., 2022; Peng et al., 2023; Kojima et al., 2022). However, we find that CoT strategies degrade code translation performance; in Experiment #2, DSR@0 even declines by 6% absolute. We study the translation results of ChatGPT and find that when CoT strategies are applied, the model tends to translate the source code line by line, neglecting compatibility issues between libraries and functions in different languages. CoT also compromises the global planning ability of the model. These observations are consistent with the findings in (Peng et al., 2023) that CoT may lead to word-by-word translations of natural language, thereby degrading the translation quality.
Fuzzy Execution To address the limitations of existing evaluation metrics and of our AutoTransExecuter, we propose another novel code translation evaluation metric, fuzzy execution using LLMs, in the Limitations section, inspired by recent progress in using LLMs as evaluation metrics for NLP tasks.
Our preliminary study evaluates the performance of ChatGPT in predicting whether a given piece of code can be executed and, if executable, in predicting the execution output. Experimental results show that using ChatGPT for fuzzy execution is not yet practical and demands future research.

Conclusion
We construct CodeTransOcean, a comprehensive code translation benchmark that includes multilingual and cross-framework datasets. We demonstrate that multilingual modeling has remarkable potential for enhancing code translation quality. We also study the code translation capability of ChatGPT and show that advanced strategies lead to significant performance gains. Moreover, we introduce fuzzy execution, which may overcome limitations of existing metrics but requires future research. In summary, we provide a comprehensive suite of resources, tools, and baselines for code translation.

Limitations
Existing match-based evaluation metrics for code translation (Papineni et al., 2002; Ren et al., 2020; Eghbali and Pradel, 2022; Zhou et al., 2023; Tran et al., 2019) focus solely on surface-form similarity to the reference, overlooking the executability of the code and functional equivalence under different implementations. Execution-based metrics (Kulal et al., 2019; Hao et al., 2022; Hendrycks et al., 2021; Rozière et al., 2020; Dong et al., 2023) require providing test cases and are expensive to apply in practice, and the significant overhead of executing numerous test cases and the heightened security risks during execution remain unresolved. It is crucial to establish an evaluation metric that overcomes these limitations.
Our proposed DSR@K and the automated AutoTransExecuter aim to measure the executability of the code and reflect real-world software development scenarios. However, AutoTransExecuter currently only supports Python as the target language. This is mainly due to the fact that different programming languages necessitate distinct runtime environments and libraries, making it particularly challenging to automatically detect and install the required dependencies for each piece of code. While certain existing tools, such as Dynatrace, can carry out dependency detection, the range of supported programming languages remains limited. Moreover, the configuration methods for compilers vary substantially among programming languages, which further complicates automated configuration. In addition, fully automated execution systems could be exploited by malicious code, necessitating further security measures. Therefore, achieving this goal requires overcoming many technical and practical difficulties.
To address the limitations of existing evaluation metrics and of AutoTransExecuter, we propose another novel code translation evaluation metric, fuzzy execution.
Recent studies have begun to utilize LLMs as evaluation metrics in the field of NLP (Chen et al., 2023c; Wang et al., 2023a; Fu et al., 2023; Kocmi and Federmann, 2023; Ji et al., 2023). Inspired by these works, we create a new dataset ExecuteStatus by randomly selecting 300 executable samples from MultilingualTrans and 300 non-executable samples from the translation results of ChatGPT. Each entry in this dataset includes the execution status and, if executable, the result of the execution.
We use ExecuteStatus and AutoTransExecuter to evaluate the performance of ChatGPT at predicting whether a given piece of code can be executed and, if executable, at predicting the executed output. The Zero-shot prompts are shown in Table 18 in the Appendix. For the Few-shot strategy, in addition to the Zero-shot baseline, we include an example of executable code and an example of non-executable code, as detailed in Table 18.
We define fuzzy execution as first testing the consistency between the actual pass rate and the pass rate predicted by ChatGPT, followed by further testing the accuracy of predicting execution results with ChatGPT, without relying on a compiler. Since we are interested in the ability of ChatGPT to accurately identify samples that cannot actually be executed, we present the confusion matrix in Table 9 based on the results. To evaluate the performance of ChatGPT on the fuzzy execution prediction task, we use the standard accuracy, precision, recall, and F1 scores. Experimental results based on these evaluation metrics are in Table 10. The low accuracy, recall, and F1 scores show that ChatGPT still has difficulty identifying errors in code, exhibiting about an 88% tendency to predict that the code is executable. Overall, ChatGPT has low accuracy on the binary classification task of "whether it can be executed", and its ability to predict execution results, at a scant 4%, clearly requires further enhancement. Thus, using ChatGPT for fuzzy execution is not yet practical (Liu et al., 2023b). Despite this, fuzzy execution with LLMs holds the potential to overcome the deficiencies of current code translation evaluation metrics. We will continue this exploration in future work.
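The scores in Table 10 follow the standard definitions over confusion-matrix counts. As a minimal illustration (with hypothetical counts, not the paper's actual numbers), treating "non-executable" as the positive class:

```python
def classification_scores(tp, fp, fn, tn):
    """Standard accuracy, precision, recall, and F1 from confusion-matrix
    counts; here the positive class is 'non-executable'."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 600-sample evaluation set: a classifier that
# rarely flags code as non-executable yields low recall and F1.
acc, p, r, f1 = classification_scores(tp=50, fp=30, fn=250, tn=270)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```

With such skewed predictions, recall on the non-executable class collapses even when accuracy looks moderate, which is exactly the failure mode the confusion matrix exposes.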

A Appendix
A.1 Related Work

A.1.1 Code Translation Methods

Naive Copy directly duplicates the source code as the target code without making any modifications. Given that the results produced by this method are often unusable, it is treated as the lower bound of performance for code translation. Early code translation relies heavily on manual rewriting, which requires developers to have a deep understanding of both the source and target languages, along with the ability to navigate various complex programming structures and semantic challenges. This method is inefficient, costly, and prone to errors.
Automatic code translation methods fall into several categories. Compilers and transpilers can automatically translate the source code into a target language, significantly saving time and effort. However, these methods cannot fully preserve all the linguistic features and behaviors of the source code, nor can they comprehend the intent and semantics inherent to the source code as humans do. Rule-based methods (Weisz et al., 2021, 2022; Rozière et al., 2020) treat the code translation task as a program synthesis problem. They define a set of transformation rules and employ the rules or pattern matching for code translation. Research on rule-based methods is quite scarce, mainly because they overly rely on the completeness of the rules and also require a considerable amount of manual preprocessing.
Neural network based methods have become dominant in the field of code translation in recent years. These methods mainly treat code translation as a sequence-to-sequence generation problem. Among them, Chen et al. (2018) are the first to successfully apply neural networks to code translation, designing a tree-to-tree neural model. CodeBERT (Feng et al., 2020) significantly improves code translation accuracy by pretraining models with masked language modeling and replaced token detection. GraphCodeBERT (Guo et al., 2021) further improves code translation accuracy by introducing two additional pre-training tasks, edge prediction and node alignment. CodeT5 (Wang et al., 2021), based on the Transformer encoder-decoder architecture, achieves excellent performance on code translation through four pre-training tasks, namely masked span prediction, identifier tagging, masked identifier prediction, and bimodal dual generation. With an architecture similar to CodeT5, PLBART (Ahmad et al., 2021) adopts three tasks of token masking, token deletion, and token infilling for denoising seq2seq pre-training, which enables PLBART to infer language syntax and semantics and to learn how to generate language coherently. NatGen (Chakraborty et al., 2022) forces the model to learn to capture the intent of the source code by setting up "Code-Naturalization" tasks during pre-training, pushing the generated code closer to a human-written style.
In the line of neural network based methods, recently released large language models (LLMs) (e.g., ChatGPT (OpenAI, 2023)) have shown remarkable performance on a wide range of NLP tasks given instructions and a few in-context examples. ChatGPT is built upon GPT and is optimized with Reinforcement Learning from Human Feedback. ChatGPT can efficiently understand and generate code sequences, and can learn from human feedback to improve the quality and accuracy of its outputs. This significant advancement has markedly propelled progress in the field of code translation.

A.1.2 Code Translation Metrics
Match-Based Evaluation Metrics These evaluation metrics are based on the similarity between the translation output and the reference translation. Among them, the Exact Match (EM) metric calculates the percentage of translation outputs that exactly match the reference translation, which overlooks the fact that the same function can be implemented in various ways. The Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) metric evaluates the similarity between the translation output and the reference translation by multiplying the geometric average of n-gram precision scores with a brevity penalty. The CodeBLEU (Ren et al., 2020) metric extends BLEU by considering syntactic and semantic characteristics of programming languages; it not only considers shallow matching but also pays attention to syntactic and semantic matching. CrystalBLEU (Eghbali and Pradel, 2022) focuses more on the inherent differences between source code and natural language, such as trivially shared n-grams. CodeBERTScore (Zhou et al., 2023) uses pre-trained models to encode the translation output and reference translation, then calculates the dot product similarity between them, enabling comparisons of code pairs with distinct lexical forms. However, CodeBLEU, CrystalBLEU, and CodeBERTScore have limitations, as they only support a limited range of programming languages and cannot be used in general multilingual scenarios. Ruby (Tran et al., 2019), a newer method for evaluating code translation, considers the lexical, syntactic, and semantic representations of source code; however, its codebase has not yet been open-sourced. These match-based evaluation metrics can only evaluate surface-form and semantic differences of the code, while neglecting the executability of the code and the functional equivalence of implementation variations.
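To make the BLEU definition above concrete, the following is a minimal sentence-level sketch (our illustration; production evaluations typically use smoothed, corpus-level implementations such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: brevity penalty times the geometric
    mean of clipped n-gram precisions (no smoothing, single reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:          # any zero precision zeroes the geo-mean
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

print(bleu("int x = 0 ;".split(), "int x = 0 ;".split()))  # 1.0 for an exact match
```

The clipping step (`min(c, ref[g])`) is what prevents a candidate from inflating its score by repeating a matching token.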

Execution-Based Evaluation Metrics Execution-based evaluation metrics mainly compare the executed result of the generated code with the expected result. The PASS@k score (Kulal et al., 2019) is evaluated by unit tests: if any of the k samples meets the expected result, the generated result is deemed successful. AvgPassRatio (Hao et al., 2022; Hendrycks et al., 2021) evaluates the overall executable result of code by calculating the average pass rate of test cases. Computational accuracy (Rozière et al., 2020) measures the quality of the generated code snippet by comparing its output with that of the reference code snippet given the same input. Additionally, CodeScore (Dong et al., 2023) claims that it can estimate the PassRatio of test cases for the generated code without executing the code, but its codebase has not yet been open-sourced. These execution-based evaluation metrics require construction of executable test cases.

… diversity ensures that CodeTransOcean reflects a wide variety of real-world scenarios.
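PASS@k is commonly computed with the unbiased estimator 1 − C(n−c, k)/C(n, k) over n generated samples of which c pass. The sketch below is our illustration of that widely used estimator (popularized in later code-generation work), not code from the cited papers:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased PASS@k estimator: probability that at least one of k
    samples drawn without replacement from n generations passes,
    given that c of the n generations pass all unit tests."""
    if n - c < k:
        return 1.0  # cannot pick k samples that all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3 -> the raw pass rate when k=1
```

For k = 1 the estimator reduces to c/n; larger k rewards models whose repeated samples occasionally succeed.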

A.3 Specific Challenges in Implementing Cross-framework Translation

Firstly, there are significant design differences between frameworks, including data processing methods, model-building strategies, and network connection techniques. Secondly, the inherent complexity of DL code increases the difficulty of conversion, as such code usually contains various components such as neural network layers, loss functions, optimizers, and learning rate schedulers. Thirdly, there are significant inconsistencies in the code structure of different frameworks, such as code organization and variable naming rules. Lastly, cross-platform compatibility must be considered, because DL code may encounter compatibility issues when executing on different hardware platforms (e.g., GPUs, CPUs, TPUs) and operating systems.

A.4 Code Examples on Different Deep Learning Frameworks
Figures 1 and 2 show the implementation of two different deep learning components in various deep learning frameworks.

A.5 Multilingual Modeling
One-to-One For each language pair in the dataset, we train an independent model, e.g., translating C++ to Java.
One-to-Many We train individual models from one language to many other languages, e.g., translating Python to all other languages.

Many-to-One We train individual models from multiple languages to one language, e.g., translating all other languages to Python.

Many-to-Many We train a single unified model over all language pairs in the dataset, which can handle translations between all languages.
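A common way to realize the Many-to-Many setting is the standard multilingual-NMT recipe: train one model on all pairs, marking each example with its translation direction. The tag format below is our illustration, not necessarily the exact preprocessing used in the paper:

```python
def make_training_example(source_code, source_lang, target_lang):
    """Prefix the input with source/target language tags so a single
    unified model can handle every translation direction."""
    return f"translate {source_lang} to {target_lang}: {source_code}"

# Hypothetical pooled training data covering two directions.
pairs = [
    ("print('hi')", "Python", "Java"),
    ('System.out.println("hi");', "Java", "C++"),
]
for code, src, tgt in pairs:
    print(make_training_example(code, src, tgt))
```

Pooling all directions into one model is also what lets low-resource pairs benefit from transfer across languages, consistent with the multilingual gains reported in the paper.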
We ensure all experiments are performed with the same hyperparameters and environment for fair comparison; Table 13 shows these in detail.

A.6 Prompt Variations
Role Assignment (Peng et al., 2023; AlKhamissi et al., 2023; Wu et al., 2023; Akın, 2023) We configure two distinct roles for the model, each with unique skills. This arrangement empowers the model to simulate more domain-adaptable and specialized expert roles.
Polite inquiry (Akın, 2023) These strategies add polite expressions and frame requests in both imperative and interrogative forms. Since ChatGPT is designed to simulate human conversational styles as closely as possible, including understanding and producing polite language, we expect these strategies to boost the model's comprehension and improve the quality of its generated results.
Clarify usage This strategy aims to make the model explicitly aware of its requirement during the code translation process: the generated code must be guaranteed to execute without issues. The translation prompts of the above four strategies (Wang et al., 2023b).

Naive denotes Naive Copy, which directly duplicates the source code as the target code without making any modifications. Method OtO, OtM, MtO, and MtM denote One-to-One, One-to-Many, Many-to-One, and Many-to-Many, respectively. The rows correspond to the source language and the columns to the target language. We run each experiment with three different random seeds and report the mean and standard deviation of BLEU scores. [Return to Section 4.1.]

Self-debug@0: Translate [source_language] to [target_language]: [source_code]. Here is the [target_language] code equivalent of the given [source_language] code: [translated_code].
Self-debug@n: The above Python code executes with the following errors, please correct them. [Compiler reports errors] Here is the modified [target_language] code: [translated_code].
Table 17: A simple demo: translation prompting of ChatGPT in the multi-round debugging strategy. The content in red is returned by the compiler. [Return to Section 4.3.]

Zero-shot prompting: Does the following Python code execute? [python_code]. Yes, the Python code executes without errors. Please predict the executed output of the Python code above. The predicted execution result of the Python code above is [output].
Few-shot prompting: This is an executable Python code [python_code], and this is a Python code [python_code] that cannot be executed. Does the following Python code execute? [python_code].

Table 1 :
Summary of our CodeTransOcean. We report #Samples, Avg. #Tokens/Sample, and Avg. Length for the Train/Dev/Test sets of each dataset. Note that LLMTrans is only for testing. #Samples are at the program level. #Tokens are based on the RoBERTa tokenizer.

Table 4 :
Results on DLTrans of Naive and CodeT5+_220M with the Many-to-Many strategy. We run each experiment with 3 random seeds and report the mean and standard deviation of EM, BLEU, and CodeBLEU scores.

Table 5 :
BLEU scores on NicheTrans of Naive and CodeT5+_220M with the Many-to-Many strategy. One-way denotes training models only from niche to popular languages, while Two-way denotes training in both directions.

Table 6 :
Zero-shot performance of ChatGPT with different prompt variants and contextual strategies. Baseline denotes ChatGPT with the baseline prompt. Details of the prompt variants (Expt #num) are in Appendix A.6.

Table 7 :
ChatGPT performance at the K-th debugging round.
For code translation, we investigate four CoT strategies. Detailed descriptions and translation prompts for each strategy are in Appendix A.8. As shown in Table 8, CoT degrades executability of the translated code.

Table 8 :
Performance of ChatGPT with One-shot and CoT strategies compared to the Zero-shot Baseline. Details of Expt #num are in Appendix A.7 and A.8.

Table 9 :
Confusion matrix of fuzzy execution prediction by ChatGPT with Zero-shot and Few-shot settings.

Table 10 :
Performance of ChatGPT on predicting fuzzy execution.

Table 14 :
BLEU scores from different multilingual modeling strategies by fine-tuning the pre-trained CodeT5+_220M model (220M is the model size).

… of the source code, then predict the output of the source code, and finally translate it, with the condition that the translated code must successfully execute. The specific translation prompts are shown in Table 19.

Table 18 :
Two simple demos: prompting in fuzzy execution experiments. [Return to Section 6.]