Unified Pre-training for Program Understanding and Generation

Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang


Abstract
Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation facilitates the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in English, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming conventions), and logical flow (e.g., an "if" block inside an "else" block is equivalent to an "else if" block), which are crucial to program semantics, and thus excels even with limited annotations.
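The denoising autoencoding objective mentioned in the abstract is easy to picture with a toy sketch: an input function (or NL text) is corrupted with BART-style noise such as span infilling and token deletion, and the sequence-to-sequence model is trained to reconstruct the original sequence. The snippet below is only an illustration of that setup, not the authors' released implementation; the `corrupt` helper, its noise rates, and the whitespace tokenization are simplifications introduced here.

```python
import random

MASK = "<mask>"

def corrupt(tokens, mask_prob=0.15, max_span=5, seed=0):
    """Return a noised copy of `tokens`: random spans are collapsed into a
    single <mask> token (span infilling) and a few tokens are dropped."""
    rng = random.Random(seed)
    noised, i = [], 0
    while i < len(tokens):
        r = rng.random()
        if r < mask_prob:
            # Span infilling: replace a short span with one <mask> token.
            # (Span length is uniform here; BART-style noising samples it
            # from a Poisson distribution.)
            span = rng.randint(1, max_span)
            noised.append(MASK)
            i += span
        elif r < mask_prob + 0.05:
            # Token deletion: silently drop this token.
            i += 1
        else:
            noised.append(tokens[i])
            i += 1
    return noised

# Toy example: a Python function, tokenized naively on whitespace.
source = "def maximum ( a , b ) : return a if a > b else b".split()
encoder_input = corrupt(source)   # corrupted sequence fed to the encoder
decoder_target = source           # the decoder is trained to reconstruct this
print(" ".join(encoder_input))
print(" ".join(decoder_target))
```

In PLBART, this objective is applied jointly to Java functions, Python functions, and associated natural-language text, so a single encoder-decoder learns representations shared across PL and NL before fine-tuning on downstream tasks.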
Anthology ID:
2021.naacl-main.211
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2655–2668
URL:
https://aclanthology.org/2021.naacl-main.211
DOI:
10.18653/v1/2021.naacl-main.211
Cite (ACL):
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2668, Online. Association for Computational Linguistics.
Cite (Informal):
Unified Pre-training for Program Understanding and Generation (Ahmad et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.211.pdf
Video:
https://aclanthology.org/2021.naacl-main.211.mp4
Code:
wasiahmad/PLBART
Data:
CONCODE, CodeSearchNet