CoDesc: A Large Code–Description Parallel Dataset

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistencies across published works. In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions. With extensive analysis, we identify and remove prevailing noise patterns from the dataset. We demonstrate the proficiency of CoDesc in two complementary tasks for code-description pairs: code summarization and code search. We show that the dataset helps improve code search by up to 22\% and achieves the new state-of-the-art in code summarization. Furthermore, we show CoDesc's effectiveness in pre-training--fine-tuning setup, opening possibilities in building pretrained language models for Java. To facilitate future research, we release the dataset, a data processing tool, and a benchmark at \url{https://github.com/csebuetnlp/CoDesc}.


Introduction
Neural models for natural language processing have benefited from large datasets and standard evaluation benchmarks (Wang et al., 2019b,a; Rajpurkar et al., 2016; Hermann et al., 2015; CommonCrawl). However, the programming language counterpart is lagging behind due to the lack of such large datasets and benchmarks. To put this into perspective, the original Transformer network (Vaswani et al., 2017) was trained on the WMT'14 English-German and English-French datasets (Bojar et al., 2014), containing 4.5 million and 36 million parallel sentences, respectively, whereas a similar network that achieved state-of-the-art results in source code summarization was trained on only 69 thousand code-description pairs (Ahmad et al., 2020). We argue that the existing models used for programming language tasks in the literature have significant scope for improvement given a large, good-quality dataset, and that such a dataset is the missing link for effectively applying deep learning methods to programming languages.
In this work, we collect and release a large (4.2 million) Java source code–natural language (NL) parallel dataset, along with denoising methods and baseline results. We apply our dataset to established works in both training-from-scratch and pretraining–fine-tuning settings, and demonstrate a notable performance gain in both. We gain 10% to 22% improvement over baseline code search models using CoDesc, and attain performance comparable to models with 8× more parameters. We achieve a new state-of-the-art BLEU score of 45.89 in code summarization by pretraining a Transformer network on our dataset for two epochs. With extensive empirical analysis, we propose a set of noise removal techniques for the source code and the NL descriptions in our dataset.
Our work brings together several datasets and multiple tasks at the intersection of Natural Language Processing (NLP) and Software Engineering (SE), such as code summarization, code search, and code synthesis, and allows researchers to compare their methods on the same benchmark. It also opens the door for building large pretrained models that jointly learn code and NL representations, which can be leveraged in downstream tasks that do not have adequate data, such as code refactoring and clone detection, as done by Feng et al. (2020).


CoDesc Dataset

Data Sources
We collect our data from several sources and formulate rules for data cleaning. Five of the authors spent 45-50 man-hours manually going over the dataset to identify noise patterns in the different data sources. Upon group discussion, common patterns were identified and a noise removal method was established. Details about these noise patterns are provided in Appendix A.
One of the datasets used in CoDesc is CodeSearchNet (CSN) (Husain et al., 2019), a parallel method-description dataset for code search. The other datasets used are DeepCom (Hu et al., 2018a), CONCODE (Iyer et al., 2018), and FunCom (LeClair and McMillan, 2019), which were created for code summarization. The CodeSearchNet dataset originally contains six programming languages; from it, the Java methods are used directly in CoDesc, whereas the Python methods are used after being automatically translated to Java. We combine all of the aforementioned datasets to create CoDesc. Appendix B shows a sample code-description parallel data point from each of these datasets. Table 1 describes our data sources and their characteristics in detail.
CSN Python to Java Translation To utilize the maximum possible data from the CSN corpus, we translate the Python methods to Java using TransCoder (Lachaux et al., 2020), a state-of-the-art neural source-to-source compiler. We modified and re-released the open-source implementation of TransCoder (https://github.com/csebuetnlp/TransCoder), enabling it to translate data in batches instead of one at a time, resulting in a 16× faster translation. Upon empirical inspection, we found that the converted Java code is human-readable and bears a strong resemblance to the intent of the original Python code. The converted code appears correct to the human eye, and its syntax matches Java syntax. However, the transcompiler struggles in a few cases, such as converting Python library methods to their Java counterparts, and converting Python coding conventions that have no Java equivalent (e.g., the use of self). These conversion errors, however, were not severe enough to prevent our models from learning the NL-source code mapping.
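To illustrate the batching change, the sketch below shows one way a batched translation loop could be wired up. It is only a sketch: `load_transcoder` is not shown here and the `model.translate` call with `src_lang`/`tgt_lang` arguments is a hypothetical placeholder for whatever interface the released fork exposes, not the actual TransCoder API.

```python
from typing import List


def chunks(items: List[str], batch_size: int):
    """Yield successive fixed-size batches from a list of source methods."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def translate_python_to_java(python_methods: List[str], model, batch_size: int = 32) -> List[str]:
    """Translate Python methods to Java in batches instead of one at a time."""
    java_methods = []
    for batch in chunks(python_methods, batch_size):
        # Hypothetical batched call; the real interface may differ.
        java_methods.extend(model.translate(batch, src_lang="python", tgt_lang="java"))
    return java_methods
```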

Data Cleaning and Noise Removal
We created an easy-to-use, parameterized data processing tool for removing the different types of noise that we observed in our dataset. From the natural language descriptions, we remove symbols and characters that do not carry meaning in a natural language description, such as comment tags (e.g., //, /*, */), stray code characters (e.g., @, #, {, }, etc.), HTML and XML tags, non-ASCII and escape characters, and some patterns of auto-generated tags (e.g., @param, @return, @throws). From the source code, we remove comments and the non-ASCII and escape characters. In previous studies, many meaningful data points were discarded for containing such noisy patterns or symbols in either the code or the description (Husain et al., 2019; Iyer et al., 2018; LeClair and McMillan, 2019). We instead identify and remove the noisy parts of the data points without excluding them from the dataset, reducing data loss during preprocessing.
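A minimal sketch of the description-cleaning rules named above is shown below. The released tool is parameterized and more thorough; these exact regular expressions are assumptions made for illustration, not the tool's implementation.

```python
import re


def clean_description(text: str) -> str:
    """Strip noise patterns from a natural language description (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)                                   # HTML/XML tags
    text = re.sub(r"/\*+|\*+/|//|^\s*\*", " ", text, flags=re.MULTILINE)   # comment tags
    text = re.sub(r"@(param|return|throws)\b.*", " ", text)               # auto-generated Javadoc tags
    text = re.sub(r"[^\x00-\x7F]", " ", text)                             # non-ASCII characters
    text = re.sub(r"[@#{}]", " ", text)                                    # stray code characters
    return re.sub(r"\s+", " ", text).strip()                               # collapse whitespace
```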
For both source code and NL descriptions, we split CamelCase and snake_case code tokens into subtokens (e.g., Camel Case, snake case) and separate letters joined with numbers (e.g., var0 to var 0) (Ahmad et al., 2020; LeClair and McMillan, 2019). After the aforementioned processing, we remove the data points where the source code has fewer than 3 tokens or the description contains fewer than 2 alphabetic characters (Husain et al., 2019). We lowercase the natural language descriptions, as case is not necessary for describing code. We release our data processing tool along with the CoDesc dataset to support applying the dataset to diverse tasks.
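The sub-tokenization described above can be sketched as follows. The regexes are assumptions chosen for illustration rather than the released tool's exact rules, but they implement the same CamelCase/snake_case splitting and letter-digit separation.

```python
import re


def split_token(token: str) -> str:
    """Split a code token into lowercased subtokens (illustrative sketch)."""
    token = token.replace("_", " ")                          # snake_case -> snake case
    token = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token)    # CamelCase -> Camel Case
    token = re.sub(r"([A-Za-z])([0-9])", r"\1 \2", token)    # var0 -> var 0
    token = re.sub(r"([0-9])([A-Za-z])", r"\1 \2", token)
    return token.lower()


print(split_token("updateProductVariation_deltaPrice0"))
# -> "update product variation delta price 0"
```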

Dataset Characteristics
After the previous steps, we are left with nearly 4.2 million Java method–description parallel data points. Table 1 presents the statistics of our dataset. The combined CoDesc dataset consists of more than one million unique tokens, which is significantly larger than a natural language vocabulary (Chen et al., 2019). This can be partially attributed to inseparable multi-word tokens (e.g., 'updateproductvariationlocalizeddeltaprice') in our dataset. Hence, we perform BPE (Sennrich et al., 2016) tokenization in our preprocessing pipeline. We also see that although the average token length of the Java source code varies across the different data sources, the natural language descriptions have a relatively uniform length. We create a balanced, deduplicated, and representative train-valid-test dataset by splitting each source dataset in an 8:1:1 ratio (Table 1).
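One simple way to realize a deterministic 8:1:1 split per source dataset is hashing each pair, as sketched below. This is an assumption about how such a split could be produced; the released benchmark's split procedure may differ.

```python
import hashlib


def assign_split(code: str, description: str) -> str:
    """Deterministically assign a code-description pair to train/valid/test (8:1:1)."""
    digest = hashlib.md5((code + description).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    if bucket < 8:
        return "train"
    return "valid" if bucket == 8 else "test"
```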

Experiments
We evaluate our code-description corpus on two well-known complementary tasks: source code summarization and natural language code search. In this section, we demonstrate that models trained on CoDesc bring a noticeable improvement over two established baselines in code search and code summarization. Each benchmark follows a standard cleaning, preprocessing, and train-test de-duplication process.
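The train-test de-duplication step mentioned above can be sketched as a simple exact-match filter over normalized code, as below. The released benchmark may apply a stricter or different criterion; this is only an illustrative assumption.

```python
def deduplicate(train_pairs, test_pairs):
    """Drop training pairs whose normalized code already appears in the test set."""
    normalize = lambda code: " ".join(code.split()).lower()
    test_codes = {normalize(code) for code, _ in test_pairs}
    return [(code, desc) for code, desc in train_pairs
            if normalize(code) not in test_codes]
```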

Natural Language Code Search
We use the code search models of Husain et al. (2019), which jointly train a source code encoder and an NL encoder to minimize the distance between their encoded vectors (Figure 1). We apply our dataset to CodeSearchNet (CSN) (Husain et al., 2019), a well-studied benchmark in the semantic code search literature. We train 5 different encoder networks (Table 2) with the CSN Java dataset and with CoDesc, respectively. We compare our results with CodeBERT and RoBERTa (code) (Feng et al., 2020), two pretrained models achieving state-of-the-art results on this benchmark. Table 2 shows our results, along with those of the state-of-the-art models (Liu et al., 2019; Feng et al., 2020), which have nearly 8-10 times more parameters than the baseline networks and a more complex training objective. We achieve remarkably close performance to the state-of-the-art models with much simpler and smaller networks.
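A simplified sketch of the joint-encoder objective used by the CodeSearchNet baselines is shown below: each description in a mini-batch is scored against every code snippet in the batch, and the model is trained to rank the true pair highest. Tensor shapes and the exact scoring function are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(code_vecs: torch.Tensor, desc_vecs: torch.Tensor) -> torch.Tensor:
    """code_vecs, desc_vecs: (batch_size, hidden_dim) outputs of the two encoders."""
    scores = desc_vecs @ code_vecs.t()                       # pairwise similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)                  # true pairs lie on the diagonal
```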

Source Code Summarization
For this task, we follow the methodology proposed by Ahmad et al. (2020). They used a seq2seq Transformer (Vaswani et al., 2017) network with 77M parameters and relative positional encoding (Shaw et al., 2018). We use BPE (Sennrich et al., 2016) to create a vocabulary of the same size as the previous works.
Training We train the Transformer model proposed by Ahmad et al. (2020) on the CoDesc-train dataset. We use the Adam optimizer with an initial learning rate of 10^-4, a mini-batch size of 32, a dropout rate of 0.2, and a vocabulary size of 50k for code and 30k for NL. However, we use a maximum input length of 200 tokens instead of 150, based on our observation of the CoDesc dataset in Table 1.
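The reported hyperparameters are collected below as a configuration sketch. The actual implementation follows Ahmad et al. (2020); the dictionary and optimizer setup here are illustrative assumptions rather than their code.

```python
import torch

config = {
    "learning_rate": 1e-4,
    "batch_size": 32,
    "dropout": 0.2,
    "code_vocab_size": 50_000,
    "nl_vocab_size": 30_000,
    "max_src_len": 200,   # raised from 150 based on CoDesc code lengths
}


def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    """Adam with the initial learning rate reported above."""
    return torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```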
Each epoch of the model took nearly 8 hours on an NVIDIA V100 16GB GPU. In comparison, an epoch on the train-small dataset took only 8.5 minutes. Due to limited computational resources, we saved the network weights after training on the large dataset for two epochs and, to be consistent with the original implementation, trained them further on the train-small dataset for a maximum of 198 more epochs. We perform early stopping if the validation performance does not improve for 20 consecutive epochs. The pretraining provides the network parameters a more favorable initialization than random, helping the network find better local minima.
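A minimal early-stopping helper matching the schedule described above (patience of 20 epochs on validation performance) could look like the following; it is a sketch of the described procedure, not the original training code.

```python
class EarlyStopper:
    """Stop fine-tuning when validation performance stalls for `patience` epochs."""

    def __init__(self, patience: int = 20):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def should_stop(self, val_score: float) -> bool:
        if val_score > self.best:
            self.best, self.stale_epochs = val_score, 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```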
Results Table 3 shows that our two-epoch pretraining with CoDesc significantly improves over the state-of-the-art code summarization methods in all three evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004). We observe that the pretrained model often generates a more descriptive summary even when it achieves a lower BLEU score (Fig. 2). We believe the model has more room for improvement with further pretraining, and we wish to validate this in future work.
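These metrics can be computed with common open-source implementations, as sketched below using nltk and the rouge-score package (recent nltk versions expect tokenized inputs for METEOR and need the WordNet data downloaded). The paper's exact metric scripts may differ from this sketch.

```python
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer


def evaluate(references, hypotheses):
    """Corpus BLEU, averaged METEOR, and averaged ROUGE-L F1 over paired summaries."""
    refs_tok = [[ref.split()] for ref in references]
    hyps_tok = [hyp.split() for hyp in hypotheses]
    bleu = corpus_bleu(refs_tok, hyps_tok)
    meteor = sum(meteor_score([ref.split()], hyp.split())
                 for ref, hyp in zip(references, hypotheses)) / len(references)
    scorer = rouge_scorer.RougeScorer(["rougeL"])
    rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for ref, hyp in zip(references, hypotheses)) / len(references)
    return {"BLEU": bleu, "METEOR": meteor, "ROUGE-L": rouge_l}
```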

Ablation & Analysis
To quantify the effect of individual data sources and our noise removal methodology, we train an NBOW model on each dataset before and after applying our data cleaning method, and test it on the CSN benchmark using their released test set. Although our collected data had already been cleaned by the respective authors, Table 4 shows that the performance on every dataset improves drastically after our noise removal. Interestingly, without our extra layer of data cleaning, the CoDesc dataset performs worse than training with only the CSN data despite being significantly larger. This shows the importance of a standard cleaning and processing method. Moreover, CSN (Java) has the highest accuracy, which can be attributed to the fact that it comes from the same distribution as the evaluation and test sets, and hence contains similar tokens and patterns (Husain et al., 2019). We can also see from Table 4 that the model trained with CSN (Python2Java) achieves an MRR score of 0.5548. Although this score is lower than those of the other datasets, it is still a good indication that the translated data helps the model learn the NL-code association.
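For reference, Mean Reciprocal Rank (MRR) over a batch of queries and candidate code snippets, with the true snippet on the diagonal of a similarity matrix, can be computed as below. This is a standard formulation of the CSN-style evaluation, not the benchmark's exact script.

```python
import numpy as np


def mean_reciprocal_rank(similarity: np.ndarray) -> float:
    """similarity[i, j]: score of query i against candidate j; true code is at column i."""
    ranks = (similarity >= similarity.diagonal()[:, None]).sum(axis=1)  # 1-indexed ranks
    return float(np.mean(1.0 / ranks))
```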

New Benchmark Results in Code Search
We provide a new set of benchmark results for the CoDesc dataset in natural language code search. We train, validate, and test an NBOW, an RNN, and a self-attention (SelfAttn) code search network with the balanced train, validation, and test data shown in Table 1. The three models achieve MRR scores of 0.812, 0.766, and 0.839, respectively.

Discussion and Conclusion
In this work, we have compiled CoDesc, a large code-description parallel dataset, and established baseline results. CoDesc brings a noteworthy improvement in two tasks: code search and code summarization. We believe CoDesc will serve as a base for future studies on joint code-description tasks. We also show that source code automatically translated by a source-to-source compiler can be applied in a code-NL parallel task, suggesting that translating our Java dataset to other programming languages can also be helpful.
The most striking finding of our study is that, by training with 2× larger parallel data, we achieve performance equivalent to models having 8× more parameters (Feng et al., 2020) in code search. This raises an interesting question: are we fully utilizing the model capacities in code-description studies? From our pretraining results in code summarization, it can reasonably be assumed that pretraining the larger models with our large dataset will improve them further as well. In future work, we wish to apply new techniques for code search and code summarization, along with exploring our dataset for general-purpose code synthesis, where the best models still struggle in accuracy (Wei et al., 2019; Yin and Neubig, 2017).

Although some noise was present in this dataset, we found it to be the least noisy in our manual observation. We find that, because the documentation is lowercased, some CamelCase tokens became inseparable. The dataset also contained non-English comments written in the English alphabet (mostly Italian), which we found hard to identify and remove.

FunCom LeClair and McMillan (2019) released a dataset of over 2.1 million pairs of Java methods and one-sentence method descriptions from over 28k Java projects. They collected this dataset by filtering over 51 million Java methods from the UCI Source Code datasets (Lopes et al., 2010). In their preprocessing step, LeClair and McMillan (2019) removed all data points where the method is more than 100 tokens long, or the method description is over 13 tokens or below 3 tokens.

In our observation of this dataset, we found method descriptions containing HTML tokens (e.g., <tt>), annotations (e.g., @link, @param), comment tokens, unwanted symbols, solely non-alphabetic characters, etc. It also contained comments inside methods, and a large portion of the data were getter, setter, tester, and toString methods.