Rethinking Negative Pairs in Code Search

Haochen Li, Xin Zhou, Anh Luu, Chunyan Miao


Abstract
Recently, contrastive learning has become a key component in fine-tuning code search models, which support efficient and effective software development. Given a search query, it pulls positive code snippets closer while pushing negative samples away. Among contrastive learning objectives, InfoNCE is the most widely used loss function due to its better performance. However, the following problems with the negative samples in InfoNCE may degrade its representation learning: 1) false negative samples exist in large code corpora due to duplication; 2) the loss fails to explicitly differentiate between the potential relevance of negative samples. For example, for a query about the quicksort algorithm, a bubble sort implementation is less “negative” than a file-saving function. In this paper, we tackle the above problems by proposing a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. In our proposed loss function, we apply three methods to estimate the weights of negative pairs and show that the vanilla InfoNCE loss is a special case of Soft-InfoNCE. Theoretically, we analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deriving a more precise mutual information estimation. We further discuss the advantages of the proposed loss function over other design alternatives. Extensive experiments demonstrate the effectiveness of Soft-InfoNCE and the weight estimation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages.
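The idea of inserting weight terms into InfoNCE can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the form below is an assumption: each negative pair's exponentiated similarity is multiplied by a weight in the denominator, and setting every weight to 1 recovers vanilla InfoNCE. The function name `soft_info_nce`, its parameters, and the example scores and weights are hypothetical and are not taken from the paper.

```python
import math


def soft_info_nce(sims, pos_index, weights=None):
    """Weighted InfoNCE loss for a single query (illustrative sketch).

    sims      -- similarity scores between the query and each candidate snippet
    pos_index -- position of the positive snippet in `sims`
    weights   -- assumed per-candidate weights on the negative terms; the entry
                 at `pos_index` is ignored. None means all weights are 1,
                 which reduces to vanilla InfoNCE.
    """
    if weights is None:
        weights = [1.0] * len(sims)

    # Subtract the maximum score before exponentiating for numerical stability.
    shift = max(sims)
    pos_term = math.exp(sims[pos_index] - shift)

    # Assumed form: each negative's exponentiated score is scaled by its weight.
    neg_term = sum(
        w * math.exp(s - shift)
        for j, (s, w) in enumerate(zip(sims, weights))
        if j != pos_index
    )
    return -math.log(pos_term / (pos_term + neg_term))


# Hypothetical scores for a quicksort query: the positive snippet, a bubble
# sort snippet (a related negative), and a file-saving function (unrelated).
scores = [2.0, 1.5, -1.0]

vanilla = soft_info_nce(scores, pos_index=0)
# Down-weight the related negative and up-weight the unrelated one.
soft = soft_info_nce(scores, pos_index=0, weights=[1.0, 0.2, 1.8])
```

In this sketch, lowering the weight of a negative that resembles the positive reduces how strongly the loss pushes it away from the query. How the weights are actually estimated is the subject of the paper's three estimation methods, which are not reproduced here.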
Anthology ID:
2023.emnlp-main.786
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
12760–12774
URL:
https://aclanthology.org/2023.emnlp-main.786
DOI:
10.18653/v1/2023.emnlp-main.786
Cite (ACL):
Haochen Li, Xin Zhou, Anh Luu, and Chunyan Miao. 2023. Rethinking Negative Pairs in Code Search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12760–12774, Singapore. Association for Computational Linguistics.
Cite (Informal):
Rethinking Negative Pairs in Code Search (Li et al., EMNLP 2023)
PDF:
https://aclanthology.org/2023.emnlp-main.786.pdf
Video:
https://aclanthology.org/2023.emnlp-main.786.mp4