ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu


Abstract
Software engineers working with the same programming language (PL) may speak different natural languages (NLs), and vice versa, raising significant barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training on computer programs, yet such models remain English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model covering 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data; and pivot-based translation language modeling, which relies on parallel data spanning many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of code intelligence tasks, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further demonstrate its advantage in zero-shot prompting for multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
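
For readers unfamiliar with the span-corruption objective named in the abstract, below is a minimal illustrative sketch in Python of T5-style span corruption: random token spans are replaced by sentinel placeholders in the encoder input, and the decoder target reconstructs the dropped spans. The sentinel naming (<extra_id_k>), corruption rate, and span-length settings here are assumptions for illustration, not the authors' implementation.

import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Return (corrupted_input, target) token lists, T5-style."""
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))

    # Randomly mark contiguous spans for corruption until the budget is met.
    masked = [False] * n
    count = 0
    while count < num_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            if not masked[i]:
                masked[i] = True
                count += 1

    # Replace each masked span with a sentinel in the input;
    # the target lists each sentinel followed by the tokens it replaced.
    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < n:
        if masked[i]:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # closing sentinel, per T5 convention
    return inputs, targets

# Example: corrupt a short code token sequence (works the same for NL text).
inp, tgt = span_corrupt("def add ( a , b ) : return a + b".split())
print(inp)   # source with sentinel placeholders
print(tgt)   # sentinel-delimited spans the model must generate
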
Anthology ID:
2023.findings-acl.676
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10628–10650
URL:
https://aclanthology.org/2023.findings-acl.676
DOI:
10.18653/v1/2023.findings-acl.676
Cite (ACL):
Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu. 2023. ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10628–10650, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages (Chai et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.676.pdf