The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation



Introduction
The advent of deep learning and advancements in large language models (LLMs) have spurred a revolution in the field of code representation learning. These developments, supported by the growing accessibility of vast open-source code repositories, have heralded the emergence of code large language models (CodeLLMs) for code generation and understanding tasks. The sheer volume of these repositories, and the rich, unprocessed raw data they contain, serves as an unparalleled resource for training LLMs, and current state-of-the-art models for coding tasks make effective use of it. In this paper, we describe our process for constructing and quality-controlling code-text pairs from raw source code, as well as an analysis of The Vault's metrics. We also share empirical results obtained from utilizing The Vault to fine-tune well-known foundational models. Our specific contributions include the following:
• A dataset with approximately 43M high-quality code-text pairs (over 10 times larger than CoDesc), 243M unimodal samples, and 69M pairs of line comments with context, covering 10 popular programming languages (Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, PHP); this is more diverse than CodeSearchNet, which covers six programming languages.
• A novel approach to use a pre-trained language model for detecting and removing noisy samples to complement traditional rule-based methods.
• A thorough process for transforming raw source code into code-text pairs and filtering noisy samples. We have released the toolkit used in this process to the open community via a public GitHub repository¹, including tools for parsing code and docstrings in different programming languages.
• An extensive evaluation in which we fine-tune different CodeLLMs on The Vault and compare against other datasets, such as CodeSearchNet, on various code understanding and generation tasks, including code generation, code summarization, and code search. The results show that models fine-tuned on The Vault outperform those fine-tuned on CodeSearchNet (code summarization, code search) and outperform the original models by a significant margin (code generation, measured by pass@k on the HumanEval and MBPP datasets).

Datasets for Code Representation Learning:
Code is commonly represented in training datasets for foundational LLMs, including the ROOTS corpus [Laurençon et al., 2023] for training BLOOM [Scao et al., 2022] and The Pile [Gao et al., 2020a] for training LLaMA [Touvron et al., 2023]. The code data represented in these datasets is unlabeled raw source code from GitHub. There is also a family of code-only datasets for training or fine-tuning coding-specific LLMs, including The Stack [Kocetkov et al., 2022], a 3TB corpus of permissively licensed source code, preceded by CodeParrot with 50GB of deduplicated source code [Tunstall et al., 2022]. These massive datasets are usually used to train CodeLLMs. However, labeled data is required for training and evaluating LLMs on coding tasks involving both source code and natural language descriptions. CodeXGLUE [Lu et al., 2021] is a benchmark for 10 coding tasks that includes 14 subsets, four of which are code-text pairs. Most of the code-text pairs in CodeXGLUE come from CodeSearchNet. CodeSearchNet (CSN) has also been employed for pretraining LLMs, enabling supervised learning techniques to achieve state-of-the-art performance for models such as CodeT5+ [Wang et al., 2023] and UniXcoder [Guo et al., 2022]. A few code-text pair datasets set out to surpass CSN in size. CoDesc combines existing parallel datasets (CSN, DeepCom [Hu et al., 2020], CONCODE [Iyer et al., 2018a], and FunCom [LeClair et al., 2019]) and then refines the results from the superset, which yields 4.2M Java data samples. PyMT5 [Clement et al., 2020] is a dataset with 7.7M Python code-text pairs. However, each of these datasets contains code for only a single programming language. Notable datasets created from Stack Overflow² include the code-text data necessary for generating post titles [Gao et al., 2020b, Liu et al., 2022].
3 The Vault dataset

Overview
In The Vault, we leverage a subset of The Stack [Kocetkov et al., 2022], recognized as the most expansive publicly available, multilingual, permissively licensed source code dataset, weighing in at 3TB. From this large-scale dataset, The Vault transforms raw source code into a collection of high-quality pairs of code and text. Our transformation pipeline is designed to efficiently extract data from source code, create text-code pairings, and remove noise, yielding three distinct output datasets, as detailed in Figure 2. We draw from a subset of The Stack that comprises code in 10 prevalent programming languages: C, C#, C++, Java, JavaScript, GoLang, PHP, Python, Ruby, and Rust (out of the 300 languages featured in The Stack). Each language-specific raw source code file feeds into a custom-built tree-sitter³ parser.
This parser is designed to extract functions, classes, methods, block code snippets, and their corresponding block or inline comments. Figure 1 illustrates the basic structure of a code file that contains multiple levels of code snippets. By applying a breadth-first search on the Abstract Syntax Tree (AST) from the root node, the parser is able to traverse different node and leaf levels (class, function, and inline), resulting in three separate datasets:
1. The first output dataset, referred to as D paired, contains pairs of classes (node 1) and functions (node 3) with corresponding block comments that serve as docstrings (node 2). After the initial construction, this dataset proceeds through a pipeline that employs both rule-based filters and neural-based filters to remove noisy samples that fail to meet the criteria detailed in Section 3.2.
2. The second output dataset, denoted as D unimodal, consists of standalone functions and classes, not paired with any docstring or comments, thereby forming a unimodal dataset.
3. The third and final dataset, D block, includes pairs of arbitrary code blocks (node 4) and inline comments (node 5). To construct this set, we capture all inline comments. Each comment is paired with the preceding code block, tagged as the "previous context" (node 4a), and the following code block, the "next context" (node 4b).
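To make the traversal concrete, the sketch below walks a toy AST breadth-first and pairs each function or class with an immediately preceding comment sibling; unpaired code goes to the unimodal set. The dict-based node schema is a simplified, hypothetical stand-in for tree-sitter nodes, not the released parser.

```python
from collections import deque

def walk_bfs(root):
    """Yield nodes of a toy AST in breadth-first order."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.get("children", []))

def extract_pairs(root):
    """Pair function/class nodes with an immediately preceding comment sibling."""
    paired, unimodal = [], []
    for node in walk_bfs(root):
        children = node.get("children", [])
        for i, child in enumerate(children):
            if child["type"] in ("function", "class"):
                prev = children[i - 1] if i > 0 else None
                if prev is not None and prev["type"] == "comment":
                    paired.append((prev["text"], child["text"]))  # docstring-code pair
                else:
                    unimodal.append(child["text"])                # code-only sample
    return paired, unimodal

module = {"type": "module", "children": [
    {"type": "comment", "text": "Add two numbers."},
    {"type": "function", "text": "def add(a, b): return a + b"},
    {"type": "function", "text": "def helper(): pass"},
]}
paired, unimodal = extract_pairs(module)
# paired   -> [("Add two numbers.", "def add(a, b): return a + b")]
# unimodal -> ["def helper(): pass"]
```

A real parser would additionally capture inline comments with their surrounding code blocks for D block, as described above.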
A large number of block comments adhere to widely accepted docstring formats (Appendix A.5), encompassing neatly organized details about the name (identifier) of the associated function or class, their parameters, arguments, and return types. We channel these block comments through docstring parsers, which we have developed and made publicly available, to extract this information as metadata for each sample in our dataset. We contend that this metadata could prove beneficial for downstream tasks, prompt settings, and other applications (Figure 8). Collectively, these three datasets (D block, D unimodal, and D paired) constitute The Vault. Note that throughout the evaluation process, only D paired is used, since it contains data that is suitable for training and comparison with other datasets.

Data Cleaning Pipeline
From a preliminary survey of the output dataset containing pairs of classes and functions with their corresponding block comments (D paired), we observed salient patterns that would impair training quality for code-related tasks. We implemented a set of rule-based filters (Section 3.2.1) to remove irrelevant information or reformat textual data to be more descriptive of the corresponding code block. To address cases where the code-text pairs have inadequate or erroneous semantic correlation, we trained a neural model based on CodeBERT (Section 3.2.2) to serve as a filter. This filter generates a score that is used to assess the alignment of a pair of code and text. Low-scoring samples are assumed to be misaligned and are removed.

Remove Noisy Samples by Rules
Our data pipeline employs 13 rule-based filters to eliminate noisy patterns in the source dataset. These filters, detailed in Table 1, are categorized into three main groups: enhancing readability, promoting consistency, and preserving the intended usage of the code.

Figure 1: The tree-sitter node structure. Classes (1) and functions (3) are extracted along with their corresponding docstring, which may be in the form of a block comment (2) or a line comment (5). The line comments (5) are extracted along with their preceding (4a) and succeeding (4b) code nodes for the inline dataset.
In terms of readability, we strip delimiters, math formulas, HTML tags, and metadata tags from the text. This ensures a cleaner and more coherent code-text pairing. For consistency, we remove elements that may cause irregularities in the dataset. This includes stripping hyperlinks and embedded code, and removing empty comments, overly short or long comments, non-English comments, auto-generated blocks, and work-in-progress comments. Lastly, to preserve the original purpose of the code, we remove comments that are questions or serve as examples or notes. This rigorous filtering process yields a high-quality dataset, improving the effectiveness of code-focused language models.
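As an illustration only (the actual 13 filters are defined in Table 1), a few update- and removal-style rules could be sketched with regular expressions as below; the patterns and length thresholds here are hypothetical simplifications:

```python
import re

# Hypothetical simplifications of a few rules; the real filters are in Table 1.
HTML_TAG = re.compile(r"</?\w+[^>]*>")        # readability: strip HTML tags
HYPERLINK = re.compile(r"https?://\S+")       # consistency: strip hyperlinks
METADATA = re.compile(r"@\w+(?:\s+[\w.]+)?")  # readability: strip metadata tags

def clean_docstring(text, min_tokens=3, max_tokens=256):
    """Apply update-style rules, then drop empty/too-short/too-long comments."""
    for pattern in (HTML_TAG, HYPERLINK, METADATA):
        text = pattern.sub("", text)
    text = " ".join(text.split())  # normalize whitespace
    n_tokens = len(text.split())
    if n_tokens < min_tokens or n_tokens > max_tokens:
        return None  # signal that the code-text pair should be removed
    return text

clean_docstring("Constructs a <code>Model</code> from a plain JavaScript object.")
# -> "Constructs a Model from a plain JavaScript object."
clean_docstring("TODO")  # -> None (too short)
```

Removal rules return None (the whole pair is dropped), while update rules rewrite the text in place; the paper's pipeline distinguishes these two actions per rule.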

Remove Noisy Samples by a Neural Model
To complement the rule-based filters, we trained a model designed to score the semantic relationship between a function or class and its corresponding docstring.
In our scoring model, we input code snippets and docstrings separated by a </s> token. Approximately 12% of the already rule-filtered code-text pair dataset was randomly selected for training. As labeled data was unavailable, we generated negative samples by randomly pairing functions and docstrings within the same programming language.
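The negative-sampling step can be sketched as follows; `make_negatives` is a hypothetical helper, and the derangement-style shuffle (no docstring keeps its own code) is our simplifying reading of "randomly pairing" within one language:

```python
import random

def make_negatives(pairs, seed=0):
    """Build misaligned (code, docstring) pairs within one language.

    `pairs` is a list of (code, docstring) tuples; the shuffle is retried
    until no docstring remains attached to its own code (label 0 = negative).
    """
    rng = random.Random(seed)
    codes = [code for code, _ in pairs]
    docs = [doc for _, doc in pairs]
    shuffled = docs[:]
    while any(a == b for a, b in zip(shuffled, docs)):
        rng.shuffle(shuffled)
    return [(code, doc, 0) for code, doc in zip(codes, shuffled)]
```

Positive samples keep their original pairing with label 1; mixing the two yields a self-supervised training set for the scorer.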
We then passed the representation of the <s> token to a linear layer, which produced a semantic correlation score between 0.0 and 1.0. Code-text pairs were then filtered using a binary classification gate with a threshold of 0.5.
To validate our model, we employed GPT 3.5-turbo for analogous predictions. A million predictions were generated from unseen instances, from which we selected 300 per language: 200 high-confidence instances (100 consistent and 100 inconsistent code-text predictions) and 100 low-confidence instances. GPT 3.5-turbo was instructed to assign a consistency score (1-10) to each instance's code-docstring pair, serving as a benchmark for our model's predictions. For high-confidence instances, our model agreed with the GPT 3.5-turbo scores over 80% of the time. Although our model faced challenges with ambiguous samples, the Area Under the Curve (AUC) metric proved suitable given our primary goal of excluding misalignments while preserving matched examples. An average AUC of 0.89 indicates that our approach effectively reduced dataset noise without discarding numerous informative samples. Detailed configurations and evaluation results are available in Appendix A.2.
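For reference, the AUC reported here can be computed directly from scores and binary labels via the rank-sum formulation; this small sketch is generic, not the paper's evaluation code:

```python
def auc(scores, labels):
    """ROC AUC via the Mann-Whitney rank formulation: the probability that
    a random positive scores higher than a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # -> 1.0 (perfect separation)
auc([0.9, 0.1, 0.3, 0.8], [1, 1, 0, 0])  # -> 0.5 (chance level)
```

Unlike accuracy, this quantity is independent of the 0.5 classification threshold, which is why it suits the goal of trading off false positives against true positives.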
In addition, we use our model to find noisy examples in the rule-based noise-removed version of CodeSearchNet in CodeXGLUE. Table 3 presents some inconsistent examples found by our model for Python, Java, JavaScript, and PHP in CSN. It can be observed that the detected pairs show strong inconsistency between docstring and code. For instance, the docstring of the example in Python does not give much insight into what the code does or its purpose. The code defines a method named 'has url' which checks whether the attributes have a non-empty value; however, the docstring mentions templates, which does not provide enough context to fully understand how this code relates to templates or its broader purpose.

Empirical Evaluation
In this section, we aim to assess the quality of The Vault in comparison with other datasets, such as CSN. To substantiate this quality, we fine-tune prominent CodeLLMs on tasks that necessitate the involvement of both code and text, including code summarization, code search, and code generation.
We then compare these models, which have been fine-tuned on The Vault, with those fine-tuned on CSN. The comparison is made using the same test datasets and commonly employed metrics, such as MRR, smoothed BLEU [Lin and Och, 2004], and pass@k [Chen et al., 2021].

Dataset Statistics
Table 2 provides the statistics of the samples for each programming language after undergoing our data-cleaning pipeline. In total, we have approximately 34M samples. The table also includes other information, such as the number of tokens for code and docstrings and the number of repositories. We split the training set into two smaller subsets: the small set and the medium set, which contain 5% and 20% of the full training set, respectively. To reduce data leakage during training, we employed the MinHash LSH technique [Zhu et al., 2023] to filter training instance clusters that are close to samples in the validation and test sets of CSN, HumanEval, and MBPP. Additionally, during dataset partitioning, we prevented content from the same repository from appearing in multiple sets, thereby avoiding any potential internal data leakage. A more detailed analysis of The Vault at the class and code block levels can be found in Appendix A.4.
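For intuition, MinHash deduplication can be sketched in a few lines: each text gets a signature of per-"permutation" minimum hashes, and the fraction of agreeing signature slots estimates Jaccard similarity over shingles. The paper uses the MinHash LSH implementation of Zhu et al.; this standalone sketch (md5-seeded hash functions, 3-word shingles) is illustrative only:

```python
import hashlib

def shingles(text, n=3):
    """Overlapping n-word shingles of a document."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(1, len(toks) - n + 1))}

def minhash(text, num_perm=64):
    """Signature = per-'permutation' minimum hash over the shingle set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def jaccard_estimate(a, b):
    """Fraction of agreeing signature slots approximates Jaccard similarity."""
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```

In LSH, these signatures are banded and hashed into buckets so that near-duplicate clusters can be found without comparing every pair.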

Experiment Setup
Data splitting: During the experiment phase, The Vault (D paired) was split into three distinct sets: training, validation, and test. To avoid data leakage, we enforced a policy that code samples from the same repository must all be in the same set. In the splitting algorithm, we also aimed to preserve the token length distribution of The Vault in each subset.
For richer comparisons, the training set was further divided into two smaller sets, the small and medium training sets, sampling 5% and 20% of the full training set, respectively. Details about the experiment data can be found in Table 5. Note that TheVault/small has a size comparable to CSN, making it fair to assess and compare the quality of these two datasets.
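The repository-level split policy can be sketched as below; the sample schema (a 'repo' key) and the split ratios are hypothetical, and the paper's actual algorithm additionally balances token-length distributions, which this sketch omits:

```python
import random
from collections import defaultdict

def split_by_repo(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole repositories to train/valid/test; no repo spans two sets."""
    by_repo = defaultdict(list)
    for sample in samples:
        by_repo[sample["repo"]].append(sample)
    repos = sorted(by_repo)
    random.Random(seed).shuffle(repos)
    n = len(repos)
    cut1 = int(n * ratios[0])
    cut2 = int(n * (ratios[0] + ratios[1]))
    buckets = {"train": repos[:cut1], "valid": repos[cut1:cut2], "test": repos[cut2:]}
    return {name: [s for r in rs for s in by_repo[r]] for name, rs in buckets.items()}
```

Splitting at the repository level, rather than the sample level, is what prevents near-identical functions from one project leaking across sets.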
In addition, to validate the effectiveness of our processing pipeline, we conduct a comparison between the performance of models trained on The Stack (raw data) and The Vault (processed data). Specifically, we established three function-level subsets, each approximately the size of TheVault/small (≈1.7M code-text instances). These subsets were created by randomly sampling the raw function-level dataset extracted from The Stack, without applying any filtering (referred to as raw/TheStack). We use three different seeds to sample raw/TheStack and report the average result. All experiments are conducted using 4 NVIDIA A100 GPUs.
Code search: We select CodeBERT [Feng et al., 2020a], RoBERTa [Liu et al., 2019], and UniXcoder [Guo et al., 2022] as encoders for embedding source code and natural language queries. We train each model for 10 epochs with a maximum sequence length of 512 and a learning rate of 2e-5.
Code summarization: CodeT5 [Wang et al., 2021] and PLBART [Ahmad et al., 2021a] are employed for the summarization task. We use the base versions and set the maximum input length to 512 tokens and the maximum output length to 400 tokens. We train for 5 epochs with a batch size of 512 and a learning rate of 2e-4.
Code generation: We use CodeGen 350M and 2B Multi [Nijkamp et al., 2023] to evaluate code generation. We use the same configuration as in the code summarization task.

Code Summarization
For this task, we utilize The Vault and CSN to fine-tune CodeT5 and PLBART to summarize source code. The Vault and CSN exhibit significant differences in docstring format. The Vault retains the complete docstring format, offering comprehensive descriptions of core logic, parameters, arguments, and return types. This feature enables versatile applications in code documentation and various downstream tasks. Additionally, we save the first sentence of each complete docstring as metadata, termed the short docstring. To facilitate a fair comparison between The Vault and CSN, we apply post-processing to our full-docstring and short-docstring training sets, thereby reducing the format distribution disparity.
Table 6 shows the results comparing CodeT5 and PLBART trained on CSN and The Vault for the code summarization task; we report the best score between full docstrings and short docstrings. We present further experimental outcomes using the Rouge-L [Lin, 2004] and BERTScore [Zhang et al., 2020] metrics in Appendix, Table 14. The results show that our pipeline is highly effective compared to unprocessed data (raw/TheStack). In particular, when training on the raw/TheStack dataset for the code summarization task, we found that PLBART and CodeT5 generate outputs with substantial noise, characterized by a prevalence of special tokens such as "// " and "*". This finding strongly underscores the efficacy of our filtering process in enhancing the quality of the dataset. However, the model trained on CSN shows superior performance on CSN's test set compared with the model trained on The Vault. The reason lies in the post-processing step mentioned above, applied to reduce the difference between the CSN and The Vault filtering methods: the syntactic distribution can still exhibit non-identical characteristics, which affects the BLEU score. This gap can be reduced by using the full version of The Vault, as shown in Table 14. Although the performance gap on the CSN test set is marginal (21.73 versus 21.24), it is worth noting that, despite the intermediary processing, CSN is a considerably smaller dataset with more consistent docstring patterns. In contrast, our dataset is substantially larger and exhibits greater diversity, thereby encouraging broader generalization. When evaluated on The Vault's test set, the model fine-tuned on CSN lags behind by over 10%.

Code Search
We utilize CodeBERT, RoBERTa, and UniXcoder, fine-tuned on both The Vault and CSN, for the code search task. We also furnish a baseline Mean Reciprocal Rank (MRR) score. MRR is a widely used metric for evaluating code search; in our case, each model is trained on 10 different programming languages and assessed using the test sets from CSN and The Vault. The results of this task, when fine-tuning the models on The Vault and CSN, are illustrated in Table 7. Remarkably, we attain superior results in most languages when fine-tuning on our smallest subset, TheVault/small, in contrast to solely fine-tuning on the CSN corpus. Surprisingly, RoBERTa, a model pretrained on natural language text, outperforms the two code-pretrained models when evaluated on code search. This could imply the importance of natural language text representation over code representation in this task. Furthermore, models trained on The Vault consistently outperform all baseline models trained on raw/TheStack, underscoring both the efficiency of our processing pipeline and the dataset's ability to generalize across different architectures.
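For reference, MRR averages the reciprocal rank of the first correct result across queries. A minimal sketch, assuming each query's correct snippet's 1-based rank in the retrieved list is already known:

```python
def mean_reciprocal_rank(ranks):
    """MRR: `ranks[i]` is the 1-based rank at which query i's correct
    code snippet appears in the retrieved list."""
    return sum(1.0 / r for r in ranks) / len(ranks)

mean_reciprocal_rank([1, 2, 4])  # -> (1 + 0.5 + 0.25) / 3, about 0.583
```

An MRR of 1.0 means every query ranked its correct snippet first; lower values indicate the correct snippet tends to appear deeper in the ranking.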

Code Generation
We experiment with two versions of CodeGen Multi [Nijkamp et al., 2023], the 350M and 2B models, on the HumanEval and MBPP benchmarks for code generation. The scope of our experiment is limited because these benchmarks only support Python. We take these checkpoints and continue fine-tuning them on The Vault, since the CodeGen Multi models were pretrained on a multilingual dataset.
To create Py/CodeSearchNet and Py/TheVault, we use the Python subsets of CSN and The Vault, respectively. We sampled the Python training set of The Vault to match the size of the Python subset in CSN, with 250K samples in the first round of fine-tuning. Additionally, raw/PyTheStack is a subset of Python data from The Stack mirroring the size of the Python data present in The Vault, which helps us demonstrate the advancements achieved by our data processing pipeline.
The results are shown in Table 8. We can see that fine-tuning CodeGen Multi 350M on The Vault improves the model significantly in terms of pass@1, pass@10, and pass@100 on the HumanEval and MBPP benchmarks. Additionally, CodeGen 2B is used to assess The Vault on larger-scale models. Similar to the experiments on the small model, Table 8 shows that The Vault can improve the performance of pretrained large-scale models. These results validate The Vault's ability to improve the performance of pre-existing pretrained models. In the future, we will expand our evaluation to even larger-scale models and assess The Vault's impact on them.
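pass@k is typically computed with the unbiased estimator of Chen et al. [2021]: given n generated samples per problem of which c pass the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k (Chen et al., 2021): with n samples per problem and
    c of them passing the tests, estimate P(at least one of k passes)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(10, 3, 1)   # -> 0.3
pass_at_k(10, 10, 5)  # -> 1.0
```

The benchmark score is this quantity averaged over all problems in HumanEval or MBPP.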

Conclusion
In this paper, we presented The Vault, a large dataset of high-quality code-text pairs covering ten programming languages, with over 43 million samples. The Vault was carefully curated to ensure that each pair meets quality standards, with detailed and informative descriptions and consistent coding styles. Our analysis uncovered a number of intriguing patterns and trends that shed light on the characteristics of programming languages and coding practices. We believe that The Vault will be a valuable resource for researchers and practitioners in this rapidly evolving field, providing a solid foundation for developing novel approaches and advancing state-of-the-art code large language models.

Limitations
In our approach, we employed 13 heuristic and context-specific rule-based filters, curated from manual data observations.While these filters effectively mitigated noisy patterns, their deterministic nature precluded comprehensive generalizability.
To address this, we supplemented these rules with a neural-based approach, as described in Section 3.2.2. However, the absence of labeled training data necessitated pseudo-random sample generation, which could compromise model soundness and potentially eliminate quality code-text pairs. Cross-validation with GPT 3.5-turbo occasionally revealed scoring inconsistencies, and we believe that human labeling and model fine-tuning could further refine the dataset.
Compared to The Stack and The Pile, our dataset is smaller, mainly due to our rigorous quality control procedures. Moreover, creating AST parsers for each programming language is a non-trivial task, limiting our dataset to 10 popular programming languages compared to The Stack's 300. Nonetheless, our framework's codebase is publicly available, encouraging future contributions to extend our parsers and rules to additional languages.
The current study primarily utilized small models with fewer than 2 billion parameters to illustrate the value of The Vault. These models effectively demonstrated the dataset's potential, but further research with larger models would shed light on its robustness and scalability across more complex tasks. In future work, we plan to conduct experiments using large-scale language models to further assess the impact of our dataset.
A.1 Rule-based Filters
Instead of discarding such characters outright, we selectively remove the noisy elements while aiming to capture as many informative sections as possible. We analyze each docstring block individually and retain the sections that meet our quality criteria. Table 9 provides comprehensive descriptions of our 13 rule-based filters, accompanied by illustrative examples. Additionally, Table 10 presents the corresponding percentages of code-text pairs generated through the application of these rule-based filters.

A.2 Neural-based refinement method
To detect semantic inconsistency between code-text pairs, we considered fine-tuning large foundational models such as CodeGen [Nijkamp et al., 2023] or BLOOM [Scao et al., 2022], or leveraging the GPT 3.5-turbo APIs. However, these approaches would incur very high costs in terms of financial resources, time, and computational power. We therefore decided to train a dedicated model for this specific task and use GPT 3.5-turbo to cross-check the predictions.
Training: We trained our model based on CodeBERT [Feng et al., 2020a]. The model assigns a score for semantic correspondence between code and text, followed by a binary classification into Consistent and Inconsistent categories. We randomly chose 5M samples (500K for each language in The Vault) and divided them into training, validation, and testing sets at a ratio of 3:1:1. The input to the model is the concatenation of the docstring and the code, with the </s> token used to separate them (Figure 3). We use the representation of the <s> token and feed it into a linear layer to obtain the output logit.
Since labeled data was unavailable, we utilized self-supervised learning. We created negative samples by randomly pairing a function with a docstring from the same programming language (Figure 3).
Cross-check: We used GPT 3.5-turbo to perform similar classifications for the semantic consistency of code-text pairs. We used a prompting template asking GPT 3.5-turbo to score each code-text pair on a scale of 1 to 10 for semantic correspondence with a detailed explanation, and ran this template on 300 systematically selected data points from each language, with 100 data points in each of the following groups:
• Consistency group: Examples for which the model gives a high-confidence prediction of class Consistent. We select the top 100 based on the output probability for class 1.
• Inconsistency group: Examples for which the model gives a high-confidence prediction of class Inconsistent. We select the top 100 based on the output probability for class 0.
• Uncertainty group: Examples for which the model gives uncertain predictions. We select the 50 least confident examples for each class.
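The three-group selection can be sketched as below; `select_groups` is a hypothetical helper, and treating "uncertain" as the probabilities closest to 0.5 is our simplification of taking the 50 least confident examples per class:

```python
def select_groups(examples, k=100):
    """Split scored examples into the three analysis groups.

    `examples` is a list of (example_id, p_consistent) tuples, where
    p_consistent is the model's probability for class Consistent.
    """
    by_conf = sorted(examples, key=lambda e: e[1])
    consistent = by_conf[-k:]      # highest P(Consistent)
    inconsistent = by_conf[:k]     # highest P(Inconsistent)
    uncertain = sorted(examples, key=lambda e: abs(e[1] - 0.5))[:k]
    return consistent, inconsistent, uncertain
```

Selecting by output probability like this concentrates the expensive GPT 3.5-turbo scoring on the cases that are most informative about the model's behavior.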
The systematic sampling scheme helped us select 2,994 function-level samples to be scored out of millions, reducing the cost of the GPT 3.5-turbo API requests while enabling meaningful analysis. The prompt input to GPT 3.5-turbo is as follows: I want you to act as an unbiased docstring evaluator for code. I will give you a docstring along with a source code, and you will give me a score for the consistency between them. The score will be on a scale of 1 to 10, where 10 means the docstring effectively summarizes the code and 1 means they are inconsistent.
The response must contain the score and an explanation following the response format.
- Response format: Score: X; Explanation: Y
- Docstring: "{docstring}"
- Code: "{code}"

Empirical Evaluation Results: Table 11 presents the performance of our model with GPT 3.5-turbo's scores as a reference, along with the scoring result for each group. In the high-confidence groups, we observe a strong correlation between our model and GPT 3.5-turbo, with a high average score for the Consistency group (7.81) and a low average score for the Inconsistency group (3.15). A similar pattern holds in the Uncertainty group, where the average score is close to the middle of the scale at 5.74.

The following before → after examples illustrate some of the update-action rules from Table 9:
• "Recursive filter design using a least-squares method. [B,A] = YULEWALK(N,F,M) finds the N-th order recursive filter coefficients B and A." → "Recursive filter design using a least-squares method."
• Metadata tag (metadata tags or annotations): "Creates a slice of 'array' with 'n' elements dropped from the end. @static @memberOf @since 3.0.0" → "Creates a slice of 'array' with 'n' elements dropped from the end."
• Special tags: "Constructs a <code>GeneralStoresProductModel</code> from a plain JavaScript object." → "Constructs a GeneralStoresProductModel from a plain JavaScript object."
• Example and note (code examples and notes from developers): "Pull packages data dir. note: Uses su to access package's data dir." → "Pull packages data dir."

In addition, we use GPT 3.5-turbo's scores to generate pseudo-labels and calculate accuracy and AUC for our model, setting a threshold of 5 to determine the labels. Our model performs well in the high-confidence groups but struggles in the uncertainty group. Since accuracy is influenced by the choice of threshold, we also consider the Area Under the Curve (AUC), which measures the trade-off between false positive and true positive rates. The metric shows a convincing result, averaging 0.89, enabling us to effectively reduce a large amount of noise in our dataset while avoiding the exclusion of too many informative examples. Finally, after removing noisy data using the proposed neural-based method, we observe a decrease of 1.3% in the dataset size.

Table 15 illustrates some examples found in 6 programming languages. It can be observed that the detected pairs show strong inconsistency between docstring and code. For instance, the docstring of the first example in Python does not give much insight into what the code does or its purpose. The code defines a method named 'has url' which checks whether the attributes have a non-empty value; however, the docstring mentions templates, which does not provide enough context to fully understand how this code relates to templates or its broader purpose. A similar pattern is also present in the remaining examples. An example that provides more clarity is the second example in Ruby: the docstring describes a function with a 'YAML filePath' parameter, but the function itself does not actually have this parameter. Moreover, our model is able to identify non-English samples (the second example in PHP) that are not captured by the rule-based method.

A.3 Analysis of Function-Level Data in The Vault
A detailed description of function-level data in The Vault can be found in Figure 4. Analyzing token length distributions helps choose appropriate input and output lengths for training. This can help improve the performance of training a language model and prevent the resulting LLMs from producing outcomes too short or too long for the intended use cases [Kaplan et al., 2020, Brown et al., 2020].
Our tokenization process utilizes the tree-sitter framework to parse source code into nodes of an abstract syntax tree; each node is considered a token. For docstring tokenization, we tokenize by word and punctuation. The code and docstring token length distributions for each programming language are illustrated in Figure 5. The number of tokens present in a function (around 100 tokens on average) is considerably larger than the number of tokens found in the docstrings that describe it (15-30 tokens on average). In particular, among the 10 programming languages, C and C++ have the highest number of tokens per function. This can be attributed to the fact that these are low-level languages, which typically require more code to perform a task compared to higher-level languages. In the case of docstrings, the number of tokens is determined not only by the naturalness of the description in practice but also by the cleaning rules outlined in Section 3.2.1. From Figure 5 (right) and Table 10, it can be observed that the docstrings in Java and C are lengthy but only slightly altered by update-action rules, indicating that the docstrings in these two languages are typically long and more detailed in practice. Meanwhile, the number of docstring tokens in C# is the lowest. The cleaning rules may have played a role, as a significant proportion of the samples in C# were updated by the Comment Delimiter (16.7%) and HTML Tags (17.15%) rules.
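The word-and-punctuation docstring tokenization can be approximated with a single regular expression; this is a sketch of the idea, not the released tokenizer:

```python
import re

def tokenize_docstring(text):
    """Split a docstring into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

tokenize_docstring("Returns the sum of a and b.")
# -> ['Returns', 'the', 'sum', 'of', 'a', 'and', 'b', '.']
```

Counting the resulting tokens is what produces docstring-length statistics of the kind shown in Figure 5.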
Table 2 depicts the overall number of distinct tokens for each programming language.As our dataset contains extensive unique tokens, we believe that model training on The Vault can effectively handle unseen tokens.Besides, we find that multiple function names are reused due to the relatively small number of unique identifiers compared to the total number of functions in the dataset.This finding implies that even for humans, naming functions might be a difficult task.
Docstring Styles: Alongside typical docstrings that provide brief descriptions of the source code, many adhere to formatting and style conventions like the Google, Jsdoc, and reST styles, among others. Our toolkit, designed to parse docstrings and extract metadata into a dictionary, supports 11 prevalent docstring styles. The styles we support and the information we aim to extract are depicted in Figures 10 and 8 in Appendix A.5. This rich dataset could inspire research on advanced problems, such as controlling docstring style during generation or crafting explanations for function parameters.
Figure 9 provides statistics on the number of docstrings following a standard style. The data suggests that styled docstrings constitute a small fraction of the overall code-text dataset. One possible explanation is that our style detection rules are stringent, excluding docstrings with even minor syntax deviations, which might result in underestimating the number of docstrings adhering to a specific format. For styled docstrings, Figure 9 (bottom) presents the distribution of the number of extracted attributes for each programming language, with most having between 1 and 5 elements. We make our docstring-style parser available to the community to facilitate easy customization and enhancement.
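As a simplified illustration of what such a parser extracts (the released toolkit supports 11 styles; this sketch handles only a minimal Google-style "Args:" section, and its regexes are our own approximation):

```python
import re

def parse_google_docstring(doc):
    """Extract the one-line summary and Args entries from a Google-style docstring."""
    summary = doc.strip().splitlines()[0].strip()
    params = {}
    args = re.search(r"Args:\n((?:[ \t]+.+\n?)+)", doc)
    if args:
        for line in args.group(1).splitlines():
            m = re.match(r"\s*(\w+)(?:\s*\(([^)]*)\))?:\s*(.*)", line)
            if m:
                params[m.group(1)] = {"type": m.group(2), "desc": m.group(3)}
    return {"summary": summary, "params": params}

doc = """Compute the area of a rectangle.

Args:
    width (float): The rectangle width.
    height (float): The rectangle height.
"""
parse_google_docstring(doc)
# summary: 'Compute the area of a rectangle.'
# params:  width -> float, height -> float
```

A full parser would additionally handle Returns/Raises sections, multi-line descriptions, and the other ten supported styles.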

A.4 Analysis of the Class and Inline Comment Sets
In Table 12, we provide a statistical analysis of the number of classes and inline comments in both the raw and filtered sets. Since C and Go do not define a class structure, they are omitted from this table.
Initially, we excluded from the raw dataset a substantial number of class samples that lacked docstrings. The remaining class-docstring pairs underwent additional processing. Since classes and functions are similar in nature, their functionality can be meaningfully described by a pair of code snippet and docstring. However, one problem when constructing paired data for class-comment samples is the large code-snippet length of the class structure. We therefore cap the number of code tokens a class can have at 5,000. On average, the code-token length in the class set is approximately 500, around five times longer than the average in the function set, while the docstring-token lengths are similar between the two sets, as shown in Figure 6. Each class-docstring pair is also examined via the rule-based filtering process described in Section 3.2.1, and serves as a sample point in the D_paired dataset.
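The class-level filtering described above amounts to two checks per sample: a docstring must be present, and the code must stay under the token cap. A minimal sketch, assuming each sample is a dictionary with pre-computed `code_tokens` and `docstring_tokens` lists (the field names are illustrative, not the pipeline's actual schema):

```python
MAX_CLASS_CODE_TOKENS = 5000  # cap on code tokens per class, as described above

def filter_class_samples(samples):
    """Keep class-docstring pairs that have a docstring and whose code
    does not exceed the token cap."""
    return [
        s for s in samples
        if s["docstring_tokens"] and len(s["code_tokens"]) <= MAX_CLASS_CODE_TOKENS
    ]

samples = [
    {"code_tokens": ["class", "A", ":", "pass"], "docstring_tokens": ["A", "demo"]},
    {"code_tokens": ["tok"] * 6000, "docstring_tokens": ["too", "long"]},   # dropped: over cap
    {"code_tokens": ["class", "B", ":", "pass"], "docstring_tokens": []},   # dropped: no docstring
]
print(len(filter_class_samples(samples)))  # 1
```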
For the D_block analysis, we form the sub-dataset by identifying and extracting inline comments within code functions. The extracted comments undergo a series of cleaning procedures similar to those applied to docstrings (Section 3.2.1). After eliminating noisy samples, we experimented with various intervals for the number of comment tokens, aiming to determine the upper and lower bounds that yield high-quality comments. Our observations reveal that inline comments exceeding 15 tokens typically incorporate code snippets, while comments with fewer than 3 tokens lack substantial meaning. Consequently, this interval serves as the filtering criterion for the final version of D_block. Figure 7 shows the distribution of code-token and docstring-token lengths in the D_block set.

A.5 Docstring Styles

A docstring is typically enclosed in dedicated delimiters (e.g., triple quotes or a star slash (*/)). Depending on developer habits or the docstring style format, docstrings take two forms: one-line docstrings and multi-line (block) docstrings. A docstring can provide a concise summary of the functionality as well as a detailed description of the code block, including its parameters, return values, exceptions, and other relevant information (as illustrated in Figure 8). The primary purpose of a docstring is to provide clear, concise, and easily accessible documentation for a code block. Docstring styles are conventions followed when writing docstrings to ensure consistency, readability, and ease of understanding throughout a codebase. They have become a standard for clean code in industry and save developers considerable time when understanding or (auto-)generating documentation (using Sphinx, Doxygen, etc.).
There are several popular docstring styles, such as the Google, NumPy, and reStructuredText (reST) styles for Python programmers, or JavaDoc and Doxygen for Java users, each with its own formatting rules, structure, and target programming language (docstring style examples and their preferred languages are listed in Figure 10). Statistics for docstring styles at the function level are presented in Figure 9. We believe the information inside a docstring is extremely useful and can provide numerous advantages for applications of AI for source code: for example, more precise and relevant results for code search and retrieval, and significantly improved code analysis or refactoring when a parameter's identifier and its corresponding docstring description are available. Furthermore, the presence of various data types allows for the exploration of scenarios such as continual learning [Van et al., 2022, Nguyen et al., 2023, Yadav et al., 2023] and multitask learning [Zhang et al., 2023], which have lacked investigation in the context of source code data.

A.6 Experimental results on code summarization
We report ROUGE-L, BERTScore, and BLEU-4 metrics on the test sets of CSN and The Vault in Table 14.
The experimental results clearly indicate that models trained on our dataset consistently outperform those trained on CSN across all three evaluation metrics. This notable improvement serves as strong evidence for the syntactic and semantic richness embedded in our dataset for code summarization, and highlights its effectiveness in enabling models to grasp contextual information and generate high-quality summaries that accurately reflect the underlying code functionality.

A.7 Experimental results on code search
In this section, we assess The Vault's versatility and adaptability by providing additional experimental results for several architectures (RoBERTa [Liu et al., 2019], UniXcoder [Guo et al., 2022], PLBART [Ahmad et al., 2021a]) on code search. Table 13 presents the results. Models trained on The Vault consistently outperform all baseline models, underscoring both the efficiency of our pipeline and the dataset's ability to generalize across different architectures.

Figure 3: Input representation and negative sample generation for code-docstring inconsistency detection.
We use our model to find noisy examples in the rule-based noise-removed version of CodeSearchNet in CodeXGLUE.

Figure 4: Distribution and number of functions by the presence of docstrings. Functions with docstrings are further divided into two categories: functions removed by rule-based filters and functions in the final code-text dataset.
Figure 5: Code- and docstring-token length distributions for each programming language.

Pipeline to create datasets of code blocks with comments D_block, unimodal code D_unimodal, and code-text pairs D_paired from raw source code.
Table 2: The size of extracted function data in each programming language.

Table 3: Examples of inconsistent pairs in CodeSearchNet found by our model in Python, Java, JavaScript, and PHP. "//" denotes the docstring section. More examples are shown in Table 15 in the Appendix.

Table 4: Comparison of The Vault function set to other code-text datasets.
its broader purpose. Besides, our model is able to identify non-English samples (shown in the PHP example) that are not captured by the rule-based methods.

Table 5: The proportion of the training, validation, and test sets of The Vault.

Table 4 offers a comparison between The Vault and other parallel datasets frequently used for pretraining and fine-tuning on downstream tasks.
Table 6: Smoothed BLEU-4 results for code summarization. The "Total" column reports BLEU computed on the combined data of all languages, while "Avg" is the average BLEU score at the language level.

Table 7: Comparison between models fine-tuned on CodeSearchNet and on different The Vault training subsets on the code search task.

Table 8: Results on code generation benchmarks using CodeGen-Multi 350M and 2B models.

Table 9: Rule-based filters and examples.

Table 10: The percentage of constructed code-text pairs from The Stack caught by each rule-based filter, by programming language.

Table 11: Evaluation of CodeBERT using the consistency score provided by GPT-3.5-turbo. We report the mean ± standard deviation of the score for each subset.

Table 13: Code search results for various architectures and training datasets.