Constructing Multilingual Code Search Dataset Using Neural Machine Translation

We create a multilingual code search dataset by translating an existing English dataset with a neural machine translation model and conduct baseline experiments on a code search task.


Introduction
Code search is the task of finding program code that semantically corresponds to a given natural language query by calculating their similarity. With the spread of large-scale code-sharing repositories and the rise of advanced search engines, high-performance code search is an important technology for assisting software developers. Since software developers worldwide search for code in their native languages, we expect code search models to be multilingual. Although many previous studies focus on multilingual code tasks other than code search (e.g., code generation, code explanation) (Wang et al., 2021; Ahmad et al., 2021; Fried et al., 2023; Zheng et al., 2023), the existing code search datasets (Husain et al., 2020; Huang et al., 2021; Shuai et al., 2021) contain only monolingual data for search queries.
In this research, we construct a new multilingual code search dataset by translating the natural language data of an existing large-scale dataset with a neural machine translation model. We also use our dataset to pre-train and fine-tune a Transformer (Vaswani et al., 2017)-based model and evaluate it on the multilingual code search test sets we create. We show that the model pre-trained with all natural and programming language data performs best under almost all settings. We also analyze the relationship between the dataset's translation quality and the model's performance by filtering the fine-tuning dataset using back-translation. Our model and dataset will be publicly available at https://github.com/ynklab/XCodeSearchNet. The contributions of this research are as follows: 1. Constructing a large code search dataset consisting of multilingual natural language queries and codes using machine translation.
2. Constructing a multilingual code search model and evaluating it on a code search task using our dataset.
3. Analyzing the correlation between translation quality and model performance on a code search task.
Background

The CodeSearchNet Corpus (CSN; Husain et al., 2020) is a large-scale code search dataset whose natural language data are function docstrings, used as pseudo queries. In contrast, several datasets are created based on natural language queries actually used for code search by humans. CodeXGLUE (Shuai et al., 2021), a benchmark for various code understanding tasks, includes two such code search datasets: WebQueryTest (WQT) and CoSQA (Huang et al., 2021). The queries of these datasets are collected from users' search logs of Microsoft Bing and the codes from CSN. Given these separately collected data, annotators with programming knowledge manually map corresponding queries and codes to construct the datasets. The common feature of these datasets is that all natural language data, such as docstrings and queries, are limited to English and do not support multiple languages.

CodeBERT
CodeBERT (Feng et al., 2020) is a model pre-trained and fine-tuned with CSN, based on the RoBERTa (Liu et al., 2019) architecture. CodeBERT uses Masked Language Modeling (MLM; Devlin et al., 2019; Lample and Conneau, 2019) and Replaced Token Detection (RTD; Clark et al., 2020) as pre-training tasks. Both docstring and code data in CSN are used in MLM, while only code data are used in RTD. CodeBERT is trained only with English data and is thus not applicable to a code search task with multilingual queries.
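To make the MLM objective concrete, the following toy sketch (ours, not CodeBERT's actual implementation) illustrates how tokens in a concatenated docstring-code pair are masked for the model to recover:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>"):
    """Randomly replace a fraction of tokens with a mask token;
    the model is trained to recover the original tokens."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)      # token the model must predict
        else:
            masked.append(tok)
            targets.append(None)     # ignored by the MLM loss
    return masked, targets

# A concatenated docstring-code pair, as in CodeBERT's bimodal MLM input.
pair = "Return the sum of two numbers </s> def add ( a , b ) : return a + b"
masked, targets = mask_tokens(pair.split())
print(masked)
```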

Dataset Construction Using Machine Translation
A possible way to construct a code search dataset for multiple languages is to translate an existing monolingual dataset. However, CSN's large data size makes manually translating all of its docstrings difficult. We therefore use a neural machine translation model to construct multilingual data efficiently. By translating CSN docstrings, we create a multilingual dataset consisting of four natural languages (English, French, Japanese, and Chinese) and four programming languages (Go, Python, Java, and PHP). We also translate the queries in the datasets Feng et al. (2020) used for fine-tuning and evaluating CodeBERT for our experiments in Section 4.1 and Section 4.2.
In their fine-tuning data, the numbers of positive and negative labels are balanced. Note that we do not use the JavaScript and Ruby data, whose sizes are much smaller than those of the other programming languages.
As the translation model, we use M2M-100 (Fan et al., 2022), which supports translation between 100 languages. M2M-100 achieved high accuracy in translating low-resource languages by classifying the 100 languages into 14 language families and creating bilingual training data within those families. We use the m2m_100_1.2B model provided by EasyNMT, a public framework of machine translation models, and set the model's beam size to 3.
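For illustration, the translation step can be run as in the following minimal sketch via the EasyNMT interface (the sample docstrings are hypothetical; this is not our released pipeline code):

```python
from easynmt import EasyNMT

# Load the same checkpoint used in our experiments.
model = EasyNMT("m2m_100_1.2B")

# Two illustrative (hypothetical) CSN-style docstrings.
docstrings = [
    "SetStatus sets the status of the given task.",
    "Retrieve the coin supply from the ledger.",
]

# Translate the English docstrings into French, Japanese, and Chinese
# with beam size 3, matching the settings described above.
for target_lang in ["fr", "ja", "zh"]:
    translations = model.translate(
        docstrings, source_lang="en", target_lang=target_lang, beam_size=3
    )
    print(target_lang, translations)
```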
We manually annotate labels for a sample of our fine-tuning dataset and check their agreement with the original labels, which is found to be 0.911 (see Appendix B for details).

Baseline Experiments
We conduct baseline experiments in which we train a Transformer-based model on our multilingual dataset under various data size settings and evaluate it on multiple code search test sets.

Training
We perform pre-training and fine-tuning on a model initialized with the XLM-R (Conneau et al., 2019) architecture and parameters. XLM-R is a RoBERTa-based model pre-trained on multilingual data covering 100 languages. Query and code data are concatenated as input to the model, and it predicts their similarity based on the vector representation of the output [CLS] token. See Appendix C for more details on training settings, including hyperparameters.
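The following is a minimal sketch of this setup with the Hugging Face transformers library (illustrative only; the checkpoint name and example inputs are placeholders, not our exact configuration):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "xlm-roberta-base" is a placeholder checkpoint for illustration.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2  # matched / not matched
)

query = "リストを降順にソートする"  # "sort a list in descending order"
code = "def sort_desc(xs):\n    return sorted(xs, reverse=True)"

# The query and code are concatenated into one input sequence; the
# classification head over the output of the first token (the
# [CLS]-equivalent <s>) predicts their similarity.
inputs = tokenizer(query, code, truncation=True, max_length=256,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("relevance score:", torch.softmax(logits, dim=-1)[0, 1].item())
```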

Evaluation
As with Feng et al. (2020), we use Mean Reciprocal Rank (MRR) as an evaluation metric.
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},$$

where $|Q|$ refers to the total number of queries. When a test set has 1,000 data pairs, given a natural language query $q_i$, the model calculates its similarity with the corresponding code $c_i$ and with the 999 distractor codes. If the similarity score given to $c_i$ is the 2nd highest among the 1,000 codes, $\mathrm{rank}_i$ equals 2. The average of the inverse of $\mathrm{rank}_i$ over all queries is then reported as MRR.
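As a concrete illustration, MRR can be computed from a query-code similarity matrix as in the following sketch (not our evaluation code; in our setting each row would contain 1,000 candidates):

```python
import numpy as np

def mean_reciprocal_rank(similarities: np.ndarray) -> float:
    """Compute MRR from a (num_queries, num_candidates) similarity
    matrix where the gold code for query i sits at column i."""
    reciprocal_ranks = []
    for i, scores in enumerate(similarities):
        # Rank of the gold code: 1 + number of candidates scoring higher.
        rank = 1 + int((scores > scores[i]).sum())
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# Example: 3 queries, 3 candidate codes each (gold on the diagonal).
sims = np.array([
    [0.9, 0.2, 0.1],   # gold ranked 1st -> 1/1
    [0.8, 0.5, 0.3],   # gold ranked 2nd -> 1/2
    [0.1, 0.2, 0.7],   # gold ranked 1st -> 1/1
])
print(mean_reciprocal_rank(sims))  # (1 + 0.5 + 1) / 3 ≈ 0.833
```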
Table 2 shows the sizes of the CSN data we use in our experiments. Each CSN test set for MRR evaluation contains 1,000 data pairs randomly sampled from the original test sets. We use CoSQA and WQT as test sets in addition to CSN. As with CSN, we create CoSQA test sets by sampling from the original 20,604 data pairs. For CSN and CoSQA, we compute the average of the MRR scores over three different sampled test sets. The original WQT test set has only 422 data pairs, so we use it as-is without sampling.
We translate the natural language queries in these test sets using the same machine translation model and parameter settings as for the training data.

Model Settings
We prepare three model settings that differ in the amount and pattern of training data.
No-pre-training: An XLM-R model used with its initial parameters, with no further training applied.
All-to-One: A model pre-trained on data pairs of multilingual queries and monolingual codes. The size of the pre-training data ranges from 1.2 million to 2.7 million pairs, depending on the programming language.
All-to-All: A model pre-trained on data pairs of multilingual queries and multilingual codes. The size of the pre-training data is over 7.6 million pairs.
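The difference between the two pre-training settings amounts to which query-code language combinations are included; the following sketch (illustrative, with our own variable names) enumerates them:

```python
from itertools import product

nl_langs = ["en", "fr", "ja", "zh"]          # natural (query) languages
pl_langs = ["go", "python", "java", "php"]   # programming languages

# All-to-One: multilingual queries paired with codes of one language,
# giving one pre-training dataset per programming language.
all_to_one = {pl: [(nl, pl) for nl in nl_langs] for pl in pl_langs}

# All-to-All: multilingual queries paired with codes of all languages.
all_to_all = list(product(nl_langs, pl_langs))  # 16 combinations
```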

Results
Table 3 shows the MRR scores under all settings. The CSN scores show that All-to-All performed best in Go, Java, and PHP for almost all natural languages. On the other hand, All-to-One scored better than All-to-All on the Python test set. It is possible that performance peaked at All-to-One on the Python test set, given that the difference in scores between All-to-One and All-to-All was relatively small (<0.1). On CoSQA and WQT, there were also cases where model settings other than All-to-All performed better.
The performance of the original CodeBERT on the code search task is shown in Table 4. Overall, All-to-All is on par with CodeBERT's performance on English data. In particular, All-to-All achieves better scores in Java and PHP than CodeBERT. Note that our experiments and those of CodeBERT differ in the number of test sets used, so it is difficult to compare the scores directly to discuss either model's superiority.
We observed a gradual trend that the scores decreased in English and French and increased in Japanese and Chinese as we increased the size of the pre-training data. This phenomenon might be due to the difference in knowledge of these languages acquired during the pre-training of XLM-R. The XLM-R pre-training data contain approximately 350 GiB each for English and French, but only approximately 69 GiB and 46 GiB for Japanese and Chinese, respectively. As the parameters of XLM-R were updated during our pre-training, the knowledge of English and French the model originally had was lost. On the other hand, the scores for Japanese and Chinese, for which the model had seen only a small amount of data, improved as the data size increased.

Table 5: The sizes of our fine-tuning dataset after back-translation filtering is applied (thresholds 0.2 to 0.7).

Threshold  0.2     0.3     0.4     0.5     0.6     0.7
FR         27,881  27,535  26,799  25,621  24,000  20,231
JA         27,433  26,524  24,901  21,981  16,327  10,304
ZH         27,115  26,178  24,971  22,280  18,445  10,792

Table 6: MRR scores with back-translation filtering of the fine-tuning data (threshold 0 means no filtering applied).

Threshold  0     0.2   0.3   0.4   0.5   0.6   0.7
FR         .808  .810  .808  .805  .811  .809  .807
JA         .816  .805  .803  .817  .813  .813  .802
ZH         .804  .818  .818  .807  .798  .802  .802

Back-translation Filtering
The translation quality of our dataset presumably affects the model's task performance. Therefore, we investigate whether filtering out low-quality data from the fine-tuning dataset changes the scores on the code search task.
We apply a back-translation filtering method based on previous studies that used machine translation to automatically build high-quality multilingual datasets from English ones (Sobrevilla Cabezudo et al., 2019; Dou et al., 2020; Yoshikoshi et al., 2020). We first back-translate the French, Japanese, and Chinese docstrings into English. We then calculate the uni-gram BLEU (Papineni et al., 2002) score between the back-translated docstrings and the original English ones and keep only data with scores higher than a certain threshold. In our experiments, we apply the filtering to the fine-tuning dataset of Go. Table 5 shows the data sizes after back-translation filtering. We set thresholds from 0.2 to 0.7 in increments of 0.1 and compare the model's performance at each threshold. We choose these values because the dataset sizes change relatively sharply when filtered with thresholds from 0.3 to 0.6 (Appendix D).
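The filtering step can be sketched as follows, assuming NLTK's BLEU implementation with uni-gram weights (this is an illustration, not our released code):

```python
from nltk.translate.bleu_score import sentence_bleu

def unigram_bleu(original_en: str, back_translated_en: str) -> float:
    # weights=(1, 0, 0, 0) restricts BLEU to uni-gram precision.
    return sentence_bleu([original_en.split()],
                         back_translated_en.split(),
                         weights=(1, 0, 0, 0))

def filter_by_backtranslation(examples, threshold):
    """Keep examples whose back-translated docstring is close enough
    to the original English one.

    `examples` is a list of (original_en, back_translated_en, data) tuples.
    """
    return [data for orig, back, data in examples
            if unigram_bleu(orig, back) > threshold]
```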

Results
Table 6 shows the MRR scores of the models whose fine-tuning data are filtered with different thresholds. In every language, the scores peak when the threshold is set between 0.2 and 0.5 and then drop as the threshold grows toward 0.7. This result implies that the filtering successfully removes low-quality data while maintaining the amount of training data, leading to better MRR scores. We assume that the change in size from the original dataset becomes more prominent with thresholds from 0.5 to 0.7 (around 100K to 400K), eventually lowering the overall scores.
However, the score changes among these thresholds are insignificant (±0.02). One possible reason is that the data size remains over 250K even after filtering, which may already be enough for fine-tuning in general.
In summary, the results show that filtering out some low-quality data improves the model's performance on the code search task, but removing over 150K data pairs worsens the test scores.

Conclusion
We created a large multilingual code search dataset using a neural machine translation model. We then constructed a multilingual code search model using our dataset. We found that the model pre-trained with all of the multilingual natural language and programming language data achieved the best performance on the code search task in almost all settings. We also investigated the relationship between the translation quality of our dataset and the model's performance. The results indicated that the data size contributed more to the model's code search performance than the translation quality.
Overall, this research showed that a publicly available machine translation model can help translate texts in the programming domain. Our method can be applied to extend datasets to languages other than French, Japanese, and Chinese and to construct models for various natural languages.

Limitations
We used XLM-R as the baseline model to train with our dataset because we wanted to keep the experimental settings as close as possible to those of the previous CodeBERT study while supporting multilingual data. Since CodeBERT is based on RoBERTa, we chose XLM-R, which is also RoBERTa-based and already trained on multilingual data.

A CodeSearchNet
Table 1 shows the size of CSN for each programming language used for pre-training CodeBERT with MLM and for fine-tuning on the code search task. The number of fine-tuning examples for Go is listed as 635,635 in Feng et al. (2020), but the publicly provided dataset contains 635,652 examples.

B Dataset Translation
We manually evaluate the translation quality of our dataset. Table 7 shows examples of query data translated from English to Japanese using M2M-100. Since CSN queries are based on source code descriptions, some of them contain strings that do not necessarily need to be translated, such as variable names, function names, and technical terms (e.g., SetStatus, retrieveCoinSupply). M2M-100 successfully translates the full sentences while leaving such domain-specific strings untranslated as needed.
On the other hand, we observe some errors, such as translations into unknown words (e.g., "alphanumeric" to "アルファナウマリ") or omission of some text from the translation.
We also manually annotate the labels of 45 data pairs sampled from the fine-tuning dataset of Japanese queries and Go codes and calculate how often they match the original labels. These 45 data pairs do not contain queries that failed to be translated and remained in English. Among the 45 data pairs, 28 have the label "1" and 17 the label "0". We measure agreement as accuracy; the score is 0.911 (i.e., 41 of the 45 labels match).

C Training Settings
As hyperparameters for pre-training, we set the batch size to 64, the maximum input length to 256, and the learning rate to 2e-4. For fine-tuning, we set the batch size to 16, the learning rate to 1e-5, and the maximum number of training epochs to 3. In both cases, we use Adam as the optimizer.
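For reference, these hyperparameters can be collected as in the following sketch (the dict layout and variable names are ours; the optimizer call mirrors the Adam setting described above):

```python
import torch

# Hyperparameters as described above.
PRETRAIN = {"batch_size": 64, "max_input_length": 256, "learning_rate": 2e-4}
FINETUNE = {"batch_size": 16, "learning_rate": 1e-5, "max_epochs": 3}

def make_optimizer(model, config):
    # Adam is used as the optimizer in both stages.
    return torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```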

D Back-translation Filtering
Table 8 shows an example of data removed by the filtering. Table 9 shows the data size at each filtering threshold.

Table 1:
Training data size of CSN for each programming language used for pre-training CodeBERT with MLM and fine-tuning on the code search task. In addition to docstring-code pairs, CSN contains code-only data; its natural language data is function documentation, which serves as pseudo data for the texts humans use to search for code.

Table 2:
The sizes of CSN data for training and evaluating the models in our baseline experiments.

Table 3:
MRR scores of models pre-trained with all natural language data with either one programming language data or all programming language data.

Table 4:
MRR scores of CodeBERT from Feng et al. (2020) for Go, Python, Java, and PHP. CODEONLY is RoBERTa pre-trained only with code data. INIT refers to how the parameters of the model are initialized: S is for training from scratch, and R is for initialization with the parameters of RoBERTa (Liu et al., 2019).
