Bag-of-Words Baselines for Semantic Code Search

The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language. The semantic gap between natural language and programming languages has long been regarded as one of the most significant obstacles to the effectiveness of keyword-based information retrieval (IR) methods. It is a common assumption that “traditional” bag-of-words IR methods are poorly suited for semantic code search: our work empirically investigates this assumption. Specifically, we examine the effectiveness of two traditional IR methods, namely BM25 and RM3, on the CodeSearchNet Corpus, which consists of natural language queries paired with relevant code snippets. We find that the two keyword-based methods outperform several pre-BERT neural models. We also compare several code-specific data pre-processing strategies and find that specialized tokenization improves effectiveness.


Introduction
Community Question Answering forums like Stack Overflow have become popular methods for finding code snippets relevant to natural language questions (e.g., "How can I download a paper from arXiv in Python?"). Such forums require community members to provide answers, which means that potential questions are limited to public code, and a large portion of questions cannot be answered in real time. The task of semantic code search removes these limitations by treating a code-related natural language question as a query and using it to retrieve relevant code snippets. In this way, novel questions can be immediately answered whether in public or private code repositories.
Consequently, the semantic code search task is receiving an increasing amount of attention. Several early efforts showed promising results applying neural network models to various code search datasets (Gu et al., 2018; Sachdev et al., 2018; Cambronero et al., 2019; Zhu et al., 2020; Srinivas et al., 2020). To facilitate research on semantic code search, GitHub released the CodeSearchNet Corpus and Challenge (Husain et al., 2019), providing a large-scale dataset across multiple programming languages with unified evaluation criteria. This dataset has been utilized by multiple recent papers (Gu et al., 2021; Arumugam, 2020).
Work on semantic code search has focused on neural ranking models under the assumption that such methods are necessary to bridge the semantic gap between natural language queries and relevant results (i.e., code snippets). Such approaches usually design a task-specific joint vector representation to map natural language queries and programming language "documents" into a shared vector space (Gu et al., 2018; Sachdev et al., 2018; Cambronero et al., 2019). Inspired by progress in pretrained models (Devlin et al., 2019), researchers proposed CodeBERT (Feng et al., 2020), a pretrained transformer model specifically for programming languages, which yields impressive effectiveness on this task.
Beyond utilizing the raw text of code corpora, another thread of research conducts retrieval using structural features parsed from code, which are believed to contain rich semantic information (Srinivas et al., 2020). Multiple papers have also proposed incorporating structural information with neural ranking models (Gu et al., 2021; Ling et al., 2021). In contrast to these comparatively sophisticated methods, in this work we explore the effectiveness of traditional information retrieval (IR) methods on the semantic code search task. This exploration is of interest for two reasons: First, while neural methods can take advantage of distributed representations (i.e., static or contextual embeddings) to model semantic similarity, Yang et al. (2019) found that pre-BERT neural ranking models can underperform traditional IR methods like BM25 with RM3 query expansion, especially in the absence of large amounts of data for training. Prior work has claimed that traditional IR methods are unfit for code search (Husain et al., 2019), but there is a lack of empirical evidence supporting this claim. In fact, in one of the few comparisons with traditional IR methods available (Sachdev et al., 2018), BM25 performed well in comparison to the proposed neural methods on an Android-specific dataset.
Second, neural approaches are often reranking methods that rerank candidate documents identified by a first-stage ranking method. Even dense retrieval methods that perform ranking on shared vector representations directly can benefit from hybrid combinations with keyword-based signals as well as another round of reranking (Gao et al., 2020). It is thus useful to identify the best-performing traditional IR methods in this domain, so that they can provide a complementary source of evidence.
Thus, our work has two main contributions: First, we provide strong keyword baselines for semantic code search, demonstrating that traditional IR methods can in fact outperform several pre-BERT neural ranking models even without a semantic matching ability, which extends the conclusions drawn by Yang et al. (2019) on ad hoc retrieval to the semantic code search task. Second, we investigate and quantify the impact of specialized pre-processing for code search.

Related Work
As discussed above, joint-vector representations have been widely used in recent work on code search. NCS (Sachdev et al., 2018) proposed an approach integrating TF-IDF, word embeddings, and an efficient embedding search technique where the word embeddings are learned in an unsupervised manner. CODEnn (Gu et al., 2018) developed a neural model based on queries and separate code components. UNIF (Cambronero et al., 2019) investigated the necessity of supervision and sophisticated architectures for learning aligned vector representations. After concluding that supervision and a simpler network architecture are beneficial, the authors further enhanced NCS by adding a supervision module on top. In addition to introducing the dataset, the CodeSearchNet paper also proposed joint-embedding models as baselines, where the embeddings may be learned from neural bag of words (NBoW), bidirectional RNN, 1D CNN, or self-attention (SelfAtt). In this work, we compare against the best-performing of these baselines, NBoW and SelfAtt.
Unlike attempts to learn aligned vector representations from each dataset, CodeBERT (Feng et al., 2020) built a BERT-style pre-trained transformer encoder with code-specific training data and objectives, and then fine-tuned the model on downstream tasks. This approach has been highly successful.
Another line of work tries to enhance retrieval by incorporating structural information. In work where queries and code snippets are encoded separately, this is usually achieved by merging the encoded structure into the code vector. One approach extracted paths from the abstract syntax tree (AST) of the code and directly used the encoded paths to represent the code snippet. Gu et al. (2021) built a statement dependency matrix from the code and transformed it into a vector, which is then added to the code vector prepared from the text. Ling et al. (2021) utilized a graph neural network to embed the program graph into the code vector. Adopting a different approach, Guo et al. (2021) extended CodeBERT by adding two structure-aware pre-training objectives, and showed that the benefits of structural information are orthogonal to the benefits of large-scale pre-training.
While neural ranking models are popular approaches to the code retrieval task, we found few papers that compared them with traditional algorithms. To the best of our knowledge, only Sachdev et al. (2018) compared their embedding model with BM25, finding that BM25 performed acceptably.

Models
In this section, we describe the traditional IR methods that we used in our experiments and the neural ranking models that have been evaluated on the CodeSearchNet Corpus in previous work (Husain et al., 2019; Feng et al., 2020).

Traditional IR Baselines
To test the effectiveness of traditional IR methods, we chose two well-known and effective retrieval methods as our baselines: BM25 (Robertson and Zaragoza, 2009) and RM3 (Lavrenko and Croft, 2001; Abdul-Jaleel et al., 2004). Both have been widely used for ad hoc retrieval and have been demonstrated to be strong baselines compared to multiple pre-BERT neural ranking models (Yang et al., 2019).
BM25 is a ranking method based on the probabilistic relevance model (Robertson and Jones, 1976), which combines term frequency (tf) and inverse document frequency (idf) signals from individual query terms to estimate query-document relevance. RM3 is a query expansion technique based on pseudo relevance feedback (PRF) that can be combined with another ranking method such as BM25. It expands the original query with selected terms from initial retrieval results (e.g., results of BM25) and applies another round of retrieval (e.g., with BM25) using the expanded query. We omit a comprehensive explanation of these two methods here and refer interested readers to the cited papers.
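To make the tf-idf combination concrete, the following is a minimal sketch of the BM25 scoring function in its Lucene-style form (the exact variant and the parameter defaults here are illustrative, not necessarily those used by Anserini):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=0.9, b=0.4):
    """Score one document against a query with the BM25 formula:
    for each query term, an idf weight is combined with a saturated,
    length-normalized term frequency."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freqs.get(term, 0)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        norm = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score
```

On a toy corpus, a document matching more query terms scores higher, while a document sharing no terms with the query scores zero, which is exactly the exact-match limitation discussed in this paper.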

Neural Ranking Models
We compare the traditional IR methods described above with three neural ranking models: neural bag of words (NBoW), self-attention (SelfAtt), and CodeBERT. Results for the first two models are reported by Husain et al. (2019), and results for CodeBERT by Feng et al. (2020). We use their reported scores in this paper.
According to Husain et al. (2019), both NBoW and SelfAtt encode natural language queries and code into a joint vector space, and then aggregate the sequence representation into a single vector. The models are trained with the objective of maximizing the inner products of the aggregated query vectors and code vectors. The two models differ only in the encoding step, where NBoW encodes each token through a simple embedding matrix and SelfAtt encodes the sequence using BERT (Devlin et al., 2019). Feng et al. (2020) pre-trained a bi-modal (natural language and programming language) transformer encoder based on RoBERTa (Liu et al., 2019), with the hybrid objectives of Masked Language Modeling (MLM) and Replaced Token Detection (RTD). The model is then fine-tuned for the code search task on each programming language dataset. We refer readers to the original papers (Husain et al., 2019; Feng et al., 2020) for further model details and hyperparameters.
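The encode-then-aggregate scheme can be illustrated with a toy NBoW sketch (the embeddings below are hand-picked for illustration; the actual models learn them by maximizing the inner product of paired query and code vectors):

```python
def nbow_encode(tokens, embeddings):
    """Neural bag-of-words: look up each token's vector and mean-pool
    the sequence into a single aggregated representation."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for t in tokens:
        for i, x in enumerate(embeddings[t]):
            vec[i] += x
    return [x / len(tokens) for x in vec]

def inner_product(u, v):
    """Relevance score between a query vector and a code vector."""
    return sum(a * b for a, b in zip(u, v))
```

With this scheme, a query is matched against every code snippet by comparing their aggregated vectors, so semantically related tokens can match even without lexical overlap, provided the learned embeddings place them nearby.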

Dataset and Pre-processing
In this section, we introduce the CodeSearchNet Dataset (Husain et al., 2019) used in this paper and the code-specific pre-processing strategies (e.g., tokenization) that we compare.

Dataset
CodeSearchNet is a proxy dataset prepared from non-fork open-source GitHub repositories. It consists of 2M docstring-code pairs and 4M unlabeled code fragments, where the code fragments are function-level snippets and their respective docstrings (if any) serve as substitutes for natural language queries. Under CodeSearchNet, there are two sub-datasets, namely the CodeSearchNet Corpus and the CodeSearchNet Challenge. The CodeSearchNet Corpus dataset uses the 2M docstrings as automatically-labeled queries, whereas the CodeSearchNet Challenge dataset uses another 99 free-text queries that were manually judged. In this work we conduct all experiments on the CodeSearchNet Corpus dataset. The labeled data are split into training, validation, and test sets in a ratio of 80:10:10. Table 1 shows the overall dataset size and the number of unique docstrings in each data split. The test set is partitioned into segments of size 1000 at the evaluation stage, and the correct code snippet for a given query is compared against the other snippets within the same segment. That is, the code snippets in the 1000 <docstring, code snippet> pairs naturally form the distractor set for each other.
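The segment-based evaluation protocol can be sketched as follows, with any query-document scoring function plugged in (the helper name and toy data are ours, for illustration):

```python
def segment_mrr(score, pairs):
    """Evaluate one test segment as described above: rank each query's
    gold snippet against the other snippets in the segment (which act
    as distractors) and return the Mean Reciprocal Rank.
    `score(query, snippet)` is any query-document scoring function;
    `pairs` is a list of <docstring, code snippet> tuples."""
    snippets = [code for _, code in pairs]
    reciprocal_ranks = []
    for query, gold in pairs:
        ranked = sorted(snippets, key=lambda s: score(query, s), reverse=True)
        reciprocal_ranks.append(1.0 / (ranked.index(gold) + 1))
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

In the actual evaluation each segment holds 1000 pairs, so every query is ranked against 999 distractors.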

De-duplication
According to Husain et al. (2019), the crawled data are filtered according to certain heuristic rules, including removing (1) pairs where the docstring is shorter than three tokens, (2) functions that contain fewer than three lines, contain the "test" substring, or serve as constructors or standard extension methods, and (3) duplicate functions. Nevertheless, even though duplicate functions are removed, queries prepared from docstrings can still repeat. That is, different functions can share the same documentation. Such duplication may result from function overloading, oversimplified documentation, or mere coincidence. An example of this duplication is shown in Figure 1: two distinct Ruby implementations of putcat(k, v) carry the identical docstring "Appends the given string at the end of the current string value for key k." Table 2 shows that such query duplication can be observed in all programming languages to some degree, and most of the duplication arises from functions in the same repository. Considering the number of duplicate docstrings, it is inaccurate to consider all functions other than the one matched to the current query as negative samples. In this work, we aggregate all functions sharing the same docstring and regard all of them as relevant results.
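The aggregation step amounts to grouping functions by identical docstring; a minimal sketch (the function IDs below are hypothetical):

```python
from collections import defaultdict

def relevance_sets(pairs):
    """Group functions by identical docstring so that every function
    sharing a query's docstring counts as a relevant result rather
    than a distractor. `pairs` is an iterable of
    (docstring, function_id) tuples."""
    relevant = defaultdict(set)
    for docstring, func_id in pairs:
        relevant[docstring].add(func_id)
    return relevant
```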

Pre-processing
In all experiments, we apply the Porter stemmer and perform stopword removal using the default stopwords list in the Anserini toolkit (Yang et al., 2017), which is a Lucene-based IR system.
On top of this default configuration, we investigate the effectiveness of the following tokenization and stopword removal strategies specific to programming languages:
• no-code-tokenization: No extra pre-processing is applied other than the Porter stemmer and removal of English stopwords.
• code-tokenization: Tokens in both camelCase and snake_case in code snippets and documentation are further tokenized into separate tokens, e.g., camel case and snake case.
• code-tokenization + remove reserved tokens: Considering that reserved tokens in programming languages intuitively add little value for exact-match methods, we remove the reserved tokens of each programming language on top of the code-tokenization condition.
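The code-tokenization rule can be sketched with a simple regular expression (this is one plausible implementation of the splitting described above, not necessarily the exact rule used in our pipeline):

```python
import re

def split_code_token(token):
    """Split a snake_case or camelCase identifier into lowercase
    sub-tokens (a simple rule; all-caps acronyms are left unsplit)."""
    subtokens = []
    for part in token.split("_"):
        # Insert a break before each uppercase letter that follows a
        # lowercase letter or digit, then split on the breaks.
        pieces = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part).split()
        subtokens.extend(p.lower() for p in pieces)
    return subtokens
```

For example, this maps camelCase to camel/case and snake_case to snake/case, which is what drives the vocabulary shrinkage reported below.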
We show length and vocabulary statistics after applying each pre-processing strategy in Table 3. In the table, total vocab size is the number of tokens that appear in either docstrings or code, and overlapped vocabulary ratio is the percentage of the entire vocabulary that appears in both docstrings and code. The table shows that code tokenization greatly shrinks the vocabulary size and raises the overlapped vocabulary ratio. Interestingly, reserved token removal shortens the length of code snippets but shows little impact on the overall vocabulary size. This results from the fact that reserved tokens are commonly contained in variable names as subtokens and thus reappear after code tokenization (e.g., the variable name class_dir would be tokenized into class and dir, so class would still appear in the final vocabulary).

Experimental Setup
All our experiments were conducted with Capreolus (Yates et al., 2020), an IR toolkit integrating ranking and reranking tasks under the same data processing pipeline. We chose the toolkit to enhance reproducibility and to support future comparisons. Note that although Capreolus is primarily designed for text ranking with neural ranking models, in this work we do not use any of those features. The underlying implementations of BM25 and RM3 are provided by the Pyserini toolkit (Lin et al., 2021), which in turn is built on the Lucene open-source search library, but Capreolus provides simplified mechanisms for parameter tuning and other useful features for end-to-end experiments.

Table 3: Average length and vocabulary statistics after applying each pre-processing strategy.

Table 4: MRR on the test set of the CodeSearchNet Corpus, where each model searches for the correct code snippet against the 999 distractors. The highest scores among non-BERT models are highlighted in bold, and those among keyword-only models are underlined. We copied the scores of the neural ranking models from Husain et al. (2019) and Feng et al. (2020).

Evaluation and Parameter Tuning
Following the original paper (Husain et al., 2019), each correct code snippet was searched against a fixed set of 999 distractors, as described in Section 4.1. All experiments were evaluated with Mean Reciprocal Rank (MRR). In all experiments, we tuned the parameters k1 and b for BM25, and originalQueryWeight, fbDocs, and fbTerms for RM3, on the validation set, then applied the best-performing parameters on the test set. Note that since BM25 and RM3 only require parameter tuning, we did not use the training set mentioned in Table 1. After pilot experiments on the Ruby and Go datasets to determine reasonable parameter ranges to search, we performed a grid search on each language dataset over the values shown in Table 5.
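The tuning procedure is a standard exhaustive grid search; a minimal sketch (the evaluation callback and parameter values below are illustrative, not the actual grid from Table 5):

```python
import itertools

def grid_search(evaluate, param_grid):
    """Exhaustively evaluate every parameter combination and keep the
    setting with the highest validation MRR. `evaluate(params)`
    returns the validation-set MRR for one parameter setting."""
    names = list(param_grid)
    best_params, best_mrr = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        mrr = evaluate(params)
        if mrr > best_mrr:
            best_params, best_mrr = params, mrr
    return best_params, best_mrr
```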

Results and Analysis
The results are shown in Table 4. The first row reports the results of CodeBERT (Feng et al., 2020). We list this result here to better compare the IR baselines with the state-of-the-art model in the field. The next two rows are pre-BERT neural model results copied from Husain et al. (2019). The remaining rows show the scores of BM25 and RM3 with the three aforementioned pre-processing strategies on the six programming language datasets.
As Table 4 shows, BM25 and BM25 + RM3 in general outperform the NBoW and SelfAtt baselines despite variations in effectiveness across programming languages. The SelfAtt model only shows sizeable improvement over BM25 on Python and a modest improvement on PHP. This suggests that the gap between natural language and programming languages does not necessarily hinder traditional IR methods in the code search task, and that distributed representations are not necessarily better at addressing this gap.
Comparing the results of BM25 and BM25 + RM3, we observe that adding RM3, a technique that generally improves effectiveness in ad hoc retrieval, does not improve over BM25 on any of the language datasets. We suspect the cause of this unanticipated result is that most of the queries in CodeSearchNet have only a single relevant document, which may not be sufficient for pseudo relevance feedback techniques to show their benefits. This hypothesis is supported by a similar observation that adding RM3 degrades effectiveness on the MS MARCO dataset (Bajaj et al., 2018), where each query also has few relevant documents.
The results from each pre-processing strategy show the necessity of code tokenization, which improves MRR overall. On the other hand, removing the reserved tokens does not improve effectiveness. Possible reasons are that (1) some reserved tokens are in the English stopwords list and would be removed anyway (e.g., for, if, or), (2) some special reserved tokens rarely appear in queries and thus contribute little to the final score (e.g., elif, await), and (3) frequently-appearing reserved words are given small IDF weights in BM25, which minimizes their effect (e.g., final, return, var).
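The IDF-damping argument in point (3) can be checked directly with the BM25 idf formula: a term that appears in nearly every document receives a near-zero weight (the document frequencies below are hypothetical, chosen only to illustrate the effect):

```python
import math

def bm25_idf(df, num_docs):
    """BM25-style inverse document frequency (Lucene form)."""
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))

# Hypothetical document frequencies in a corpus of one million functions:
# a reserved word like 'return' appears in almost every function, while a
# domain-specific identifier sub-token appears rarely.
idf_common = bm25_idf(df=950_000, num_docs=1_000_000)
idf_rare = bm25_idf(df=50, num_docs=1_000_000)
```

The near-ubiquitous term ends up with an idf close to zero, so leaving reserved words in the index changes the final scores very little.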

Conclusion
In this paper we examined the effectiveness of traditional IR methods for semantic code search and found that while these exact match methods are not as effective as CodeBERT, they generally outperform pre-BERT neural models. We also compared the effects of code-specific tokenization strategies, showing that while splitting camelCase and snake_case is beneficial, removing reserved tokens does not necessarily help keyword-based methods.
There are also aspects of semantic code search that this paper does not cover. Sachdev et al. (2018) mentioned the nuance between different code components, such as how readability can differ for function names and local variables. We leave for future work an investigation of whether treating such components differently improves effectiveness. Nevertheless, the lesson from our work seems clear: even with advances in neural approaches, we should not neglect comparisons to and contributions from strong keyword-based IR methods.