Data Augmentation for Code Translation with Comparable Corpora and Multiple References

Yiqing Xie; Atharva Naik; Daniel Fried; Carolyn Rose

doi:10.18653/v1/2023.findings-emnlp.917

Abstract

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code pairs with similar functionality), and another that augments existing parallel data with multiple reference translations. Specifically, we build and analyze multiple types of comparable corpora, including programs generated from natural language documentation using a code generation model. Furthermore, to reduce overfitting to a single reference translation, we automatically generate additional translation references for available parallel data and filter the translations by unit tests, which increases variation in target translations. Experiments show that our data augmentation techniques significantly improve CodeT5 for translation between Java, Python, and C++ by an average of 7.5% Computational Accuracy (CA@1), which verifies the correctness of translations by execution. The code is available at https://github.com/Veronicium/CMTrans.
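The unit-test filtering step and the execution-based metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`passes_tests`, `filter_references`, `ca_at_1`), the representation of a candidate as Python source text, and the test-case format of (arguments, expected value) pairs are all assumptions made for illustration. CA@1 is taken here to mean the fraction of top-1 translations that pass every unit test.

```python
from typing import List, Sequence, Tuple

# Hypothetical test-case format: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def passes_tests(source: str, func_name: str, tests: Sequence[TestCase]) -> bool:
    """Return True if the function defined in `source` passes every test case."""
    namespace: dict = {}
    try:
        # Assumption: candidates are trusted. Real pipelines should run
        # generated code in a sandboxed subprocess with a timeout.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_references(
    candidates: Sequence[str], func_name: str, tests: Sequence[TestCase]
) -> List[str]:
    """Keep each distinct candidate translation that passes all unit tests."""
    kept: List[str] = []
    for source in candidates:
        if source not in kept and passes_tests(source, func_name, tests):
            kept.append(source)
    return kept


def ca_at_1(
    top1_outputs: Sequence[str],
    func_name: str,
    tests_per_problem: Sequence[Sequence[TestCase]],
) -> float:
    """Fraction of top-1 translations that pass all of their unit tests."""
    if not top1_outputs:
        return 0.0
    passed = sum(
        passes_tests(source, func_name, tests)
        for source, tests in zip(top1_outputs, tests_per_problem)
    )
    return passed / len(top1_outputs)


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0)]
    candidates = [
        "def add(a, b):\n    return a + b\n",        # correct
        "def add(a, b):\n    return sum([a, b])\n",  # correct, different surface form
        "def add(a, b):\n    return a - b\n",        # wrong result
        "def add(a, b) return a + b",                # does not parse
    ]
    references = filter_references(candidates, "add", tests)
    print(len(references), "references kept")
    print("CA@1:", ca_at_1([candidates[0], candidates[2]], "add", [tests, tests]))
```

Keeping every distinct passing candidate, rather than only the first, is what yields multiple references per source program. Comparing candidates as exact strings is the simplest possible deduplication and is only a stand-in for whatever criterion a real pipeline would use.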

Anthology ID: 2023.findings-emnlp.917
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13725–13739
URL: https://aclanthology.org/2023.findings-emnlp.917
DOI: 10.18653/v1/2023.findings-emnlp.917
Cite (ACL): Yiqing Xie, Atharva Naik, Daniel Fried, and Carolyn Rose. 2023. Data Augmentation for Code Translation with Comparable Corpora and Multiple References. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13725–13739, Singapore. Association for Computational Linguistics.
Cite (Informal): Data Augmentation for Code Translation with Comparable Corpora and Multiple References (Xie et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.917.pdf