CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training

Xin Wang, Yasheng Wang, Yao Wan, Jiawei Wang, Pingyi Zhou, Li Li, Hao Wu, Jin Liu


Abstract
Recent years have witnessed increasing interest in code representation learning, which aims to encode the semantics of source code as distributed vectors. Various approaches have been proposed to capture the complex semantics of source code from different views, including plain text, the Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., the Control/Data Flow Graph). However, most of them consider only a single view of source code in isolation, ignoring the correspondences among different views. In this paper, we propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training, and name our model CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and then learn the complementary information among them under a contrastive learning framework. Inspired by type checking in compilation, we also design a fine-grained type inference objective for pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP over several state-of-the-art baselines. For example, we achieve gains of 2.4/2.3/1.1 points in terms of MRR/MAP/accuracy on the natural language code retrieval, code similarity, and code defect detection tasks, respectively.
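The abstract describes learning complementary information across code views under a contrastive framework. The paper's exact objective is not reproduced here; as a rough illustration, a symmetric InfoNCE-style loss over paired view embeddings (e.g., an AST embedding and a data-flow embedding of the same snippet) could be sketched as follows — function names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss between two views.

    z1, z2: (batch, dim) arrays; row i of each is an embedding of the
    same code snippet under a different view, so the diagonal of the
    similarity matrix holds the positive pairs and all other entries
    serve as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature            # (batch, batch)
    idx = np.arange(z1.shape[0])
    # Cross-entropy with diagonal targets, averaged over both directions.
    loss_12 = -log_softmax(logits)[idx, idx].mean()
    loss_21 = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_12 + loss_21) / 2
```

When the two views of each snippet agree (aligned diagonal), the loss is near zero; shuffling one view's batch so positives no longer line up drives it up, which is the signal that pulls matching views together and pushes mismatched ones apart.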
Anthology ID:
2022.findings-naacl.80
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1066–1077
URL:
https://aclanthology.org/2022.findings-naacl.80
DOI:
10.18653/v1/2022.findings-naacl.80
Cite (ACL):
Xin Wang, Yasheng Wang, Yao Wan, Jiawei Wang, Pingyi Zhou, Li Li, Hao Wu, and Jin Liu. 2022. CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1066–1077, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training (Wang et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.80.pdf
Video:
https://aclanthology.org/2022.findings-naacl.80.mp4
Data
CoNaLa