%0 Conference Proceedings
%T CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search
%A Li, Xiaonan
%A Gong, Yeyun
%A Shen, Yelong
%A Qiu, Xipeng
%A Zhang, Hang
%A Yao, Bolun
%A Qi, Weizhen
%A Jiang, Daxin
%A Chen, Weizhu
%A Duan, Nan
%Y Goldberg, Yoav
%Y Kozareva, Zornitsa
%Y Zhang, Yue
%S Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
%D 2022
%8 December
%I Association for Computational Linguistics
%C Abu Dhabi, United Arab Emirates
%F li-etal-2022-coderetriever
%X In this paper, we propose the CodeRetriever model, which learns function-level code semantic representations through large-scale code-text contrastive pre-training. We adopt two contrastive learning schemes in CodeRetriever: unimodal contrastive learning and bimodal contrastive learning. For unimodal contrastive learning, we design an unsupervised learning approach to build semantically related code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build code-text pairs. Both contrastive objectives can fully leverage a large-scale code corpus for pre-training. Extensive experimental results show that CodeRetriever achieves a new state of the art, with significant improvements over existing code pre-trained models, on eleven domain/language-specific code search tasks covering six programming languages at different code granularities (function-level, snippet-level and statement-level). These results demonstrate the effectiveness and robustness of CodeRetriever. The code and resources are available at https://github.com/microsoft/AR2/tree/main/CodeRetriever.
%R 10.18653/v1/2022.emnlp-main.187
%U https://aclanthology.org/2022.emnlp-main.187
%U https://doi.org/10.18653/v1/2022.emnlp-main.187
%P 2898-2910
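
The %X abstract above describes the unimodal and bimodal contrastive objectives only at a high level. As an illustration, the following is a minimal sketch (not the authors' released code) of an in-batch bimodal contrastive loss over paired code/text embeddings, assuming two encoders have already produced the embeddings. The function name, the temperature value, and the symmetric code-to-text / text-to-code formulation are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F

def bimodal_contrastive_loss(code_emb, text_emb, temperature=0.05):
    # code_emb, text_emb: [batch, dim]; row i of each tensor is a positive pair
    # (a code function and its paired documentation / in-line comment text).
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity of every code against every text in the batch;
    # off-diagonal entries act as in-batch negatives.
    logits = code_emb @ text_emb.t() / temperature
    targets = torch.arange(code_emb.size(0), device=code_emb.device)
    # Symmetric cross-entropy over code-to-text and text-to-code retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2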