Ivan Sedykh
2024
Searching by Code: A New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets
Ivan Sedykh
|
Nikita Sorokin
|
Dmitry Abulkhanov
|
Sergey I. Nikolenko
|
Valentin Malykh
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Code search is an important and well-studied task, but it usually means searching for code by a text query. We argue that using a code snippet (and possibly an error traceback) as a query while looking for bugfixing instructions and code samples is a natural use case not covered by prior art. Moreover, existing datasets use code comments rather than full-text descriptions as text, making them unsuitable for this use case. We present a new SearchBySnippet dataset implementing the search-by-code use case based on StackOverflow data; we show that on SearchBySnippet, existing architectures fall short of a simple BM25 baseline even after fine-tuning. We present a new single encoder model SnippeR that outperforms several strong baselines on SearchBySnippet with a result of 0.451 Recall@10; we propose the SearchBySnippet dataset and SnippeR as a new important benchmark for code search evaluation.