CoTexT: Multi-task Learning with Code-Text Transformer

Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Annibal, Alec Peltekian, Yanfang Ye


Abstract
We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarizing/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpus including both “bimodal” and “unimodal” data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data is merely code snippets. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both small and medium size featured in the CodeXGLUE dataset. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGlue dataset, including Code Generation and Defect Detection. We consistently achieve SOTA results in these tasks, demonstrating the versatility of our models.
Anthology ID:
2021.nlp4prog-1.5
Volume:
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)
Month:
August
Year:
2021
Address:
Online
Editors:
Royi Lachmy, Ziyu Yao, Greg Durrett, Milos Gligoric, Junyi Jessy Li, Ray Mooney, Graham Neubig, Yu Su, Huan Sun, Reut Tsarfaty
Venue:
NLP4Prog
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
40–47
Language:
URL:
https://aclanthology.org/2021.nlp4prog-1.5
DOI:
10.18653/v1/2021.nlp4prog-1.5
Bibkey:
Cite (ACL):
Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Annibal, Alec Peltekian, and Yanfang Ye. 2021. CoTexT: Multi-task Learning with Code-Text Transformer. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), pages 40–47, Online. Association for Computational Linguistics.
Cite (Informal):
CoTexT: Multi-task Learning with Code-Text Transformer (Phan et al., NLP4Prog 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.nlp4prog-1.5.pdf
Code
 justinphan3110/CoTexT
Data
CONCODECodeSearchNetCodeXGLUE