@inproceedings{wang-etal-2023-codet5,
title = "{C}ode{T}5+: Open Code Large Language Models for Code Understanding and Generation",
author = "Wang, Yue and
Le, Hung and
Gotmare, Akhilesh and
Bui, Nghi and
Li, Junnan and
Hoi, Steven",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.68",
doi = "10.18653/v1/2023.emnlp-main.68",
pages = "1069--1088",
abstract = "Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks, lacking the flexibility to operate in the optimal architecture for a specific task. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some tasks and hence result in substantial performance degrade. To address these limitations, we propose {``}CodeT5+{''}, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, and our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0{\%} pass@1 and 54.5{\%} pass@10 on the HumanEval code generation task against other open code LLMs, even surpassing the OpenAI code-cushman-001 model.",
}
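The abstract reports 35.0% pass@1 and 54.5% pass@10 on HumanEval. For readers unfamiliar with that metric, the sketch below shows the standard unbiased pass@k estimator used in HumanEval-style evaluation; it is the common estimator from the Codex evaluation protocol, not code from the CodeT5+ paper, and the sample counts in the usage lines are hypothetical.

```python
# Unbiased pass@k estimator: given n generated samples for a problem, of which
# c pass the unit tests, estimate the probability that at least one of k
# randomly drawn samples passes, i.e. 1 - C(n-c, k) / C(n, k).
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # fewer than k incorrect samples: every size-k draw contains a correct one
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical single problem: 200 samples generated, 70 of them passing.
print(pass_at_k(200, 70, 1))   # 0.35
print(pass_at_k(200, 70, 10))  # ~0.99
```

The reported benchmark score is the mean of this per-problem quantity over all HumanEval problems.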
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="wang-etal-2023-codet5">
<titleInfo>
<title>CodeT5+: Open Code Large Language Models for Code Understanding and Generation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yue</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hung</namePart>
<namePart type="family">Le</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Akhilesh</namePart>
<namePart type="family">Gotmare</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nghi</namePart>
<namePart type="family">Bui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junnan</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Steven</namePart>
<namePart type="family">Hoi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Houda</namePart>
<namePart type="family">Bouamor</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Juan</namePart>
<namePart type="family">Pino</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kalika</namePart>
<namePart type="family">Bali</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Singapore</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks, lacking the flexibility to operate in the optimal architecture for a specific task. Second, they often employ a limited set of pretraining objectives which might not be relevant to some tasks and hence result in substantial performance degradation. To address these limitations, we propose “CodeT5+”, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, and our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0% pass@1 and 54.5% pass@10 on the HumanEval code generation task against other open code LLMs, even surpassing the OpenAI code-cushman-001 model.</abstract>
<identifier type="citekey">wang-etal-2023-codet5</identifier>
<identifier type="doi">10.18653/v1/2023.emnlp-main.68</identifier>
<location>
<url>https://aclanthology.org/2023.emnlp-main.68</url>
</location>
<part>
<date>2023-12</date>
<extent unit="page">
<start>1069</start>
<end>1088</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T CodeT5+: Open Code Large Language Models for Code Understanding and Generation
%A Wang, Yue
%A Le, Hung
%A Gotmare, Akhilesh
%A Bui, Nghi
%A Li, Junnan
%A Hoi, Steven
%Y Bouamor, Houda
%Y Pino, Juan
%Y Bali, Kalika
%S Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F wang-etal-2023-codet5
%X Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks, lacking the flexibility to operate in the optimal architecture for a specific task. Second, they often employ a limited set of pretraining objectives which might not be relevant to some tasks and hence result in substantial performance degradation. To address these limitations, we propose “CodeT5+”, a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives, which cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) performance on various code-related tasks, and our instruction-tuned CodeT5+ 16B achieves new SoTA results of 35.0% pass@1 and 54.5% pass@10 on the HumanEval code generation task against other open code LLMs, even surpassing the OpenAI code-cushman-001 model.
%R 10.18653/v1/2023.emnlp-main.68
%U https://aclanthology.org/2023.emnlp-main.68
%U https://doi.org/10.18653/v1/2023.emnlp-main.68
%P 1069-1088
Markdown (Informal)
[CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://aclanthology.org/2023.emnlp-main.68) (Wang et al., EMNLP 2023)
ACL
Yue Wang, Hung Le, Akhilesh Gotmare, Nghi Bui, Junnan Li, and Steven Hoi. 2023. CodeT5+: Open Code Large Language Models for Code Understanding and Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1069–1088, Singapore. Association for Computational Linguistics.
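Since the title advertises open models, a minimal usage sketch may be useful. It assumes the Salesforce/codet5p-220m checkpoint id on the Hugging Face Hub and the standard transformers seq2seq API; the checkpoint name and the masked-span prompt are assumptions for illustration, not taken from the record above.

```python
# Minimal sketch: load a small CodeT5+ checkpoint and fill in a masked span
# (span denoising is one of the pretraining objectives listed in the abstract).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-220m"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# <extra_id_0> marks the span the model should predict.
code = "def print_hello_world():<extra_id_0>"
inputs = tokenizer(code, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```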