CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation

Weixiang Yan; Yuchen Tian; Yunzhe Li; Qian Chen; Wen Wang

doi:10.18653/v1/2023.findings-emnlp.337

Abstract

Recent code translation techniques exploit neural machine translation models to translate source code from one programming language to another to satisfy production compatibility or to improve efficiency of codebase maintenance. Most existing code translation datasets only focus on a single pair of popular programming languages. To advance research on code translation and meet diverse requirements of real-world applications, we construct **CodeTransOcean**, a large-scale comprehensive benchmark that supports the largest variety of programming languages for code translation. CodeTransOcean consists of three novel multilingual datasets, namely, **MultilingualTrans** supporting translations between multiple popular programming languages, **NicheTrans** for translating between niche programming languages and popular ones, and **LLMTrans** for evaluating executability of translated code by large language models (LLMs). CodeTransOcean also includes a novel cross-framework dataset, **DLTrans**, for translating deep learning code across different frameworks. We develop multilingual modeling approaches for code translation and demonstrate their great potential in improving the translation quality of both low-resource and high-resource language pairs and boosting the training efficiency. We also propose a novel evaluation metric **Debugging Success Rate@K** for program-level code translation. Last but not least, we evaluate LLM ChatGPT on our datasets and investigate its potential for fuzzy execution predictions. We build baselines for CodeTransOcean and analyze challenges of code translation for guiding future research. The CodeTransOcean datasets and code are publicly available at https://github.com/WeixiangYAN/CodeTransOcean.

Anthology ID: 2023.findings-emnlp.337
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5067–5089
URL: https://aclanthology.org/2023.findings-emnlp.337
DOI: 10.18653/v1/2023.findings-emnlp.337
Cite (ACL): Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. 2023. CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5067–5089, Singapore. Association for Computational Linguistics.
Cite (Informal): CodeTransOcean: A Comprehensive Multilingual Benchmark for Code Translation (Yan et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.337.pdf
