CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search

Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, Nan Duan


Abstract
In this paper, we propose the CodeRetriever model, which learns function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised approach to build semantically related code pairs based on documentation and function names. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage a large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves a new state of the art, with significant improvements over existing code pre-trained models, on eleven domain/language-specific code search tasks covering six programming languages at different code granularities (function-level, snippet-level, and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever. The code and resources are available at https://github.com/microsoft/AR2/tree/main/CodeRetriever.
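The abstract's two objectives both reduce to contrastive learning over paired inputs: code-code pairs (unimodal) or code-text pairs (bimodal). Below is a minimal sketch of an in-batch contrastive (InfoNCE-style) loss, a standard choice for this kind of pre-training; it is not taken from the paper, and the names (`info_nce_loss`, `temperature`, the 768-dim toy embeddings) are illustrative assumptions, not CodeRetriever's actual implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  key_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss over paired embeddings.

    query_emb, key_emb: (batch, dim) encoder outputs for paired items,
    e.g. code/text for a bimodal objective or two semantically related
    functions for a unimodal objective. The i-th query's positive is the
    i-th key; all other keys in the batch serve as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    k = F.normalize(key_emb, dim=-1)
    logits = q @ k.t() / temperature          # (batch, batch) cosine-similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Toy usage: random tensors standing in for encoder outputs.
if __name__ == "__main__":
    code = torch.randn(8, 768)   # e.g. function encodings
    text = torch.randn(8, 768)   # e.g. documentation encodings
    print(info_nce_loss(code, text).item())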
Anthology ID:
2022.emnlp-main.187
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2898–2910
URL:
https://aclanthology.org/2022.emnlp-main.187
DOI:
10.18653/v1/2022.emnlp-main.187
Cite (ACL):
Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2898–2910, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search (Li et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.187.pdf