CodeExp: Explanatory Code Document Generation

Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings. Based on that, we collected and refined a large-scale code-docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance in explanation generation than unrefined data 15x larger, and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy can boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.


Introduction
Code documentation improves program comprehension (Garousi et al., 2015) and reduces software maintenance cost (Chen and Huang, 2009). Recently, many automated code summary tools have been developed to reduce the effort of document creation: for example, Denigma is an IDE extension for generating inline function summaries, and GitHub Copilot Labs is a code summary model built on top of Codex (Chen et al., 2021a) for generating explanations of AI-generated code. However, these existing code summary tools focus on generating short, high-level descriptions of source code semantics (Haiduc et al., 2010; Roy et al., 2021; Zhu and Pan, 2019), and code summaries alone are insufficient to meet software understanding and maintenance needs: a recent survey shows that 85% of developers expect tools to generate method-level documentation explaining the functionality, usage, and design rationale of code (Hu et al., 2022). For example, as shown in Figure 1, the code summary captures only the high-level code functionality, while the explanatory docstring explains the arguments, return values, and the computation process in a detailed and informative way. However, due to the lack of training and evaluation resources, few code explanation models have been developed.
In this work, we introduce the code explanation generation task. We provide (1) a training corpus, (2) a fine-tuning strategy, and (3) a human-evaluation protocol and recommended automatic evaluation metrics to support developing code explanation models. Our contributions include:
• We provide a Python code-docstring corpus, CodeExp, which contains (1) a large partition of 2.3 million raw code-docstring pairs, (2) a medium partition of 158 thousand pairs refined from the raw corpus using a learned filter, and (3) a partition of 13 thousand pairs with rigorous human annotations. Our data collection process leverages an annotation model learned from human annotations to automatically filter high-quality code-docstring pairs from raw GitHub datasets.
• We propose a two-stage strategy for fine-tuning large language models using the collected data: first on the raw data and then on the medium-size refined data. Our experiments show that the best fine-tuned model achieves human-comparable performance, and highlight the importance of high-quality data for the code explanation task.

Code:
def make_rng(rng_or_seed=None, default_seed=None, constructor=None):
    if (rng_or_seed is not None) and isinstance(rng_or_seed, RNG):
        rng = rng_or_seed
    elif rng_or_seed is not None:
        rng = constructor(rng_or_seed)
    elif default_seed is not None:
        rng = constructor(default_seed)
    else:
        rng = constructor(42)
    return rng

Summary:
Returns a random number generator.

Explanatory Docstring:
Returns a random number generator. The RNG object is generated using the first of these cases that produces a valid result: 1) rng_or_seed itself, 2) constructor(rng_or_seed), 3) constructor(default_seed), 4) constructor(42).

Parameters:
    rng_or_seed (int or RNG): If `rng_or_seed` is a random number generator, then it is returned. If `rng_or_seed` is an integer, then a random number generator is created using `constructor` and seeded with `rng_or_seed`.
    default_seed (int): Seed used if rng_or_seed is None.
    constructor (function or class): Must return an RNG object. constructor is called with rng_or_seed, default_seed, or 42.

Returns: An RNG object.

Figure 1: Example code paired with summary and explanatory docstring. Difference between the two styles: summaries outline the highest-level intent of the code, while docstrings are more informative and detailed, explaining the semantics of specific code pieces.
• We evaluated our models on seven automatic evaluation metrics and examined their consistency with human evaluation on 180 test examples. Our study shows that BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) best reflect docstring generation quality, and we recommend using them in future research.

Code Explanation Generation
Application scenarios. We focus on generating code explanations that describe both low-level and high-level code semantics. An automatic code explanation tool can benefit developers in many scenarios. For example, it can reduce software engineers' development effort by automatically generating function comments during development; it can help learners and codebase maintainers better understand undocumented code; and it can explain code produced by code generation models like Codex, helping developers understand and verify its correctness. Note that in these scenarios, because the developer needs to understand both the design rationale and implementation details of the code, an explanation covering detailed code semantics is more appropriate than a short high-level description. For example, if a developer aims to create a test for the function make_rng in Figure 1 when maintaining a codebase, it is crucial to understand how the variable rng_or_seed is defined and used.
Task definition. Based on these observations, we define the code explanation task as text generation given a code snippet (a function), where the generated text describes the code semantics. Concretely, a high-quality code explanation should meet the following criteria: the description should be informative, covering important code behaviors; coherent with the semantics of the source code; fluent; and grammatically correct. We follow these criteria to set up our annotation and evaluation protocols in Sections 3.1 and 5, respectively.
Challenges. The first key challenge for developing code explanation models is the shortage of high-quality paired training and evaluation data. Despite the existence of large-scale public code corpora, directly using functions and their comments for modeling is not ideal because these comments can be oversimplified or misleading: Clement et al. (2020) found that 44% of Python function documents are very short, one-line comments, and Wen et al. (2019) showed that code changes rarely (<20%) trigger comment updates, potentially making the inconsistency between code and comments a severe issue. Second, developing code explanation models is technically challenging. Besides understanding code logic to generate a global summary, the explanation model also needs to generate detailed comments based on fine-grained local code structures (e.g., examining the control and data flow in order to explain how a variable is used).
Third, while prior work (Gros et al., 2020; Roy et al., 2021) empirically studied criteria for high-quality code summaries, translating these criteria/guidelines into actionable automatic evaluation metrics for code explanation remains a challenge.
We next present how we collect high-quality datasets, define evaluation metrics, and fine-tune language models to address the above challenges.

CodeExp Data Collection
This section describes our data selection process. The code explanation corpus (CodeExp) consists of three sets of code-docstring pairs: (1) CodeExp(annotated) with 13K human-annotated pairs, (2) CodeExp(raw) with 2.3 million pairs collected from open-source repositories, and (3) CodeExp(refined) with 158K pairs selected by a learned filter.

Examining docstring quality with human annotations - CodeExp(annotated)
In order to understand how developers evaluate the quality of explanatory docstrings, we first conduct a user study in which developers annotate the quality of code-docstring pairs from the code-doc-corpus collected by Barone and Sennrich (2017). For each pair of code and docstring, the human annotator is asked to give a score from 0 (worst) to 3 (best) for each step if it is applicable (only step 1 is always applicable); otherwise, the annotator leaves a blank score. For the coverage evaluation, the annotator also marks the specific code spans and text spans that are associated with code blocks (Figure 2). Due to feasibility concerns, we do not require the step 2 annotation to cover all aspects of details, but only the branching blocks.
We refer to the human-annotated data as CodeExp(annotated) in later sections. We show the statistics of the annotations in Appendix A.1. In the annotated results, we found that human-written docstrings mostly perform well in explaining the general logic and providing accurate type definitions, but are less optimal with respect to the coverage requirement. In other words, examples have high scores for steps 1 and 3, but the scores for step 2 interestingly diverge. Of the 11,900 code examples with branching blocks, 6,300 do not describe any block (step 2 score equals zero); the remaining 5,400 examples (33%) describe at least one code block.

Figure 2: Example annotation. The "CodeSpan" field extracts the specific code lines described by the texts of the "DocSpan" field.

CodeExp(raw) with Open-source Pairs
As it is economically infeasible to manually annotate a dataset of code-docstring pairs large enough for training a machine learning model, we instead create a suboptimal dataset of unlabelled pairs. Following Husain et al. (2019), we pair each function with its corresponding documentation to form a code-docstring pair. We leverage open-source Python repositories on GitHub to collect CodeExp(raw). To ensure code quality, we only keep repositories with more than 60 stars; this yields around 55,000 repositories as of December 2021. We downloaded all files with the '.py' extension and parsed the source code into abstract syntax trees (ASTs) with Tree-Sitter. We then keep the functions that come with docstrings. In total, we collected a corpus of 2,285,387 pairs of Python function code and docstrings. To the best of our knowledge, CodeExp(raw) is one of the largest parallel programming language (PL) and natural language (NL) datasets to date. In comparison, CodeSearchNet (Husain et al., 2019) contains 457,461 samples of Python code.
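The extraction step above can be sketched with Python's standard ast module; this is an illustrative stand-in for Tree-Sitter, which the paper actually uses, and the function name extract_pairs is an assumption made here.

```python
import ast

def extract_pairs(source: str):
    """Collect (function_source, docstring) pairs from a Python module."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only functions that come with a docstring
                pairs.append((ast.unparse(node), doc))
    return pairs
```

Functions without a docstring are simply dropped, mirroring the "functions provided with docstrings" criterion.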

Data selection with a learned filter - CodeExp(refined)
The code and docstrings collected from GitHub are of mixed quality and potentially introduce noise if used for training. Hence, we aim to refine a higher-quality subset for modeling. Our key insight is to train a machine learning filter that mimics the human annotators on the annotated dataset, and then apply the learned filter to refine the raw data. The filter is fine-tuned from a pretrained BERT base model (uncased) on the collected human annotations. It takes as input a code-docstring pair and predicts the step 1 and step 2 scores, with the target scores normalized to [0, 1]. We used 11,208 examples, 85% of CodeExp(annotated), for training and the rest for validation. The model achieved mean squared errors (MSE) of 0.027 and 0.018 for steps 1 and 2, respectively, on the validation set. We apply the same workflow as in Section 3.1 to filter the raw 2.3M corpus of CodeExp(raw): we used the same complexity and length thresholds to select candidate examples, and then applied the ML filter. Finally, we kept the data pairs with predicted step 1 and 2 scores greater than 1.0 (after scaling back to the 0-3 range). In this way, we collected 158,024 refined examples, which we name the CodeExp(refined) partition.

Table 1 shows the statistics for the three partitions. Note that the annotated subset's quality is "mixed" because it contains both low- and high-scored examples annotated by humans (we only use higher-scored ones for testing in Section 4.2). Because the refined partition is selected by the learned filter, it better matches the definition of code explanation documents.
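The final selection rule can be sketched as follows. Since scores are normalized to [0, 1], the cutoff of 1.0 on the original 0-3 scale corresponds to 1/3; the field names (pred_step1, pred_step2) are assumptions for illustration.

```python
def refine(pairs, cutoff=1.0 / 3.0):
    """Keep pairs whose predicted step 1 and step 2 scores both exceed
    the cutoff, i.e. greater than 1.0 after scaling back to 0-3."""
    return [
        p for p in pairs
        if p["pred_step1"] > cutoff and p["pred_step2"] > cutoff
    ]
```

The filter model itself (a BERT regression head) is omitted here; only the thresholding logic is shown.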

Experiment Settings
In this paper, we formulate code explanation generation as a sequence-to-sequence problem, where the source sequence is the code of a function (including both the function signature and body), and the target sequence is an explanatory documentation string. We select four strong pretrained language models for programming languages and fine-tune them on our proposed CodeExp dataset. We next introduce the baseline models, experiment settings, and evaluation metrics.

Baseline models
We evaluate four popular pretrained language models in this paper: GPT-2, GPT-Neo, CodeT5, and Codex. We fine-tune these models with our collected data, except for Codex, for which we only report zero-shot performance due to the inaccessibility of its model weights.

Fine-tuning
We fine-tune the baseline models with collected data of different sizes and qualities (Table 2). This yields three fine-tuning strategies: S1. Fine-tune on all collected examples, i.e., CodeExp(raw); S2. Fine-tune on the refined subset, CodeExp(refined); S3. Fine-tune on the raw and refined partitions consecutively, in a "curriculum learning" manner. For each strategy, we leave out 1% of the raw partition and/or 5% of the refined partition for validation.
For evaluation, we select examples with high scores from CodeExp(annotated) and remove duplicates that also appear in the raw or refined partitions (Appendix A.2). This yields a high-quality test set of 2,677 examples.
The pretrained models are fine-tuned using a cross-entropy loss on 16 Nvidia V100 32GB GPUs. We select the checkpoint with the best perplexity on the corresponding validation set for further evaluation. For strategy S3, the best checkpoint after fine-tuning on CodeExp(raw) is used as the starting point for the subsequent phase on CodeExp(refined). The hyperparameters, including the max tokens, learning rate, batch size, number of epochs, etc., are listed in Appendix Table 7. At inference time, we sample the top generated text using the default inference settings (Appendix A.3).
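The checkpoint-selection criterion, perplexity, is simply the exponentiated mean token-level cross-entropy; a minimal sketch (the helper name is an assumption):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

Lower perplexity on the validation set indicates the model assigns higher probability to the reference docstrings.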

Automatic metrics
We adopt four existing metrics widely used in related tasks and propose two new metrics, dubbed CER and CodeBERTScore, to evaluate our models. The efficacy of each metric is verified against human evaluation in the next section. We include both statistical and recent model-based metrics.
Statistical Metrics BLEU (Papineni et al., 2002), ROUGE (Lin and Och, 2004), and METEOR (Banerjee and Lavie, 2005) are three commonly used metrics in code summarization and machine translation. These methods compute the matching of n-grams between candidate and target texts in various manners. We use the sentence-level BLEU (with smoothing method 4) and METEOR interfaces provided by NLTK (https://www.nltk.org/). For ROUGE, we use the F1 scores of ROUGE-1 and ROUGE-L.
We propose a new metric dubbed Common Entity Recall (CER). It first counts the 1-grams common to the code, the generated docstring, and the reference docstring, then divides by the number of 1-grams common to the code and the reference docstring. The intuition is that the 1-grams shared by the code and the reference docstring often contain important variable names, function identifiers, or important keywords (e.g., if, int).
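A minimal sketch of CER as described; whitespace tokenization is an assumption made here, as the paper does not specify the tokenizer.

```python
def cer(code: str, generated: str, reference: str) -> float:
    """Common Entity Recall: fraction of 1-grams shared by the code and
    the reference docstring that also appear in the generated docstring."""
    common = set(code.split()) & set(reference.split())
    if not common:
        return 0.0
    return len(common & set(generated.split())) / len(common)
```

Because the shared 1-grams tend to be identifiers and keywords, CER rewards generations that mention the entities the reference considered worth naming.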
Model-based Metrics BERTScore (Zhang et al., 2019) is an automatic metric that employs a BERT model to measure the similarity between a generation and its reference. The semantic similarity is computed as the cosine similarity between the average token embeddings of the generated and target texts.
As BERTScore handles only natural language, we propose CodeBERTScore, an adapted version of BERTScore for evaluating code-related tasks. This metric replaces the language model in BERTScore with CodeBERT (Feng et al., 2020), a state-of-the-art language model pretrained on code and natural language. We use the average token embeddings of the 9th CodeBERT layer to compute similarity.
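Given token embeddings produced by the underlying model (BERT or CodeBERT), the similarity described above reduces to a cosine over mean-pooled vectors; a sketch with plain lists standing in for real model outputs:

```python
import math

def avg_embedding_cosine(gen_embs, ref_embs):
    """Cosine similarity between the average token embeddings of the
    generated and reference texts (each a list of equal-length vectors)."""
    g = [sum(col) / len(gen_embs) for col in zip(*gen_embs)]
    r = [sum(col) / len(ref_embs) for col in zip(*ref_embs)]
    dot = sum(a * b for a, b in zip(g, r))
    norm = math.sqrt(sum(a * a for a in g)) * math.sqrt(sum(b * b for b in r))
    return dot / norm
```

In practice the embeddings would come from a specific transformer layer (the 9th, per the text above); that model call is omitted here.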

Metric selection based on human evaluation
We conducted human evaluations on generated docstrings and compared the aforementioned automatic metrics against the human-evaluation results. Our human-evaluation protocol covers four important aspects: A1. General adequacy, A2. Coverage, A3. Coherence, A4. Fluency. Annotators rate each aspect with a score within 0-4 on a 5-point Likert scale (Likert, 1932). The detailed setup is listed in Appendix A.4. Aspects 1, 3, and 4 have been used in both machine translation (Reiter, 2018) and code summarization studies (Song et al., 2019; Roy et al., 2021). The coverage aspect emphasizes the preference for informative explanations of code pieces. Notably, these scores provide reference-free assessments: both the original reference docstring (with its identity hidden) and the generated ones are shown to annotators. We calculate the adapted Kendall's τ (Graham et al., 2015) to measure the agreement between automatic metrics and human evaluation. For an arbitrary pair of two examples, it considers whether the two metrics both prefer one example over the other. The τ value is τ = (#Con − #Dis) / (#Con + #Dis), where # denotes the number of concordant (Con) and discordant (Dis) pairs. The concordant, discordant, and tie pairs are determined as in Table 3, where s1 and s2 are the scores for the two docstrings within a pair.
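The pairwise agreement computation can be sketched as follows; tie handling is simplified here (tied pairs are dropped from the count), which is an assumption about the adapted variant.

```python
from itertools import combinations

def adapted_tau(metric_scores, human_scores):
    """Agreement between an automatic metric and human scores over all
    pairs of examples: (#Con - #Dis) / (#Con + #Dis)."""
    con = dis = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        m = metric_scores[i] - metric_scores[j]
        h = human_scores[i] - human_scores[j]
        if m * h > 0:
            con += 1   # both prefer the same example
        elif m * h < 0:
            dis += 1   # the two disagree
    return (con - dis) / (con + dis) if (con + dis) else 0.0
```

A τ of 1.0 means the metric ranks every pair the same way the human scores do; -1.0 means it always ranks them oppositely.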
Results and discussion

Results of auto-metric evaluations
We evaluated the generated docstrings of all the aforementioned models (Section 4.1). We also included the CodeT5 checkpoint for code summarization, i.e., CodeT5-multi-sum (https://huggingface.co/Salesforce/codet5-base-multi-sum). The results are shown in Table 4. In later sections, we use the notation "[model name]-(raw), -(refined), -(r+r)" for a model fine-tuned with strategy S1, S2, or S3, respectively. We observed that all fine-tuned models show significant improvements over the off-the-shelf versions across all metrics; for example, the BLEU scores increased from below 0.4 (GPT-Neo13, CodeT5-multi-sum) to around 10.0 (GPT-Neo13-(r+r), CodeT5-(r+r)). Since GPT-Neo and CodeT5-multi-sum are reported as strong summarization baselines, this result again highlights the large difference between summaries and explanatory docstrings.
Two-stage fine-tuning achieves the best performance. Among the three fine-tuning strategies (S1-3 in Section 4), S3 yields the best performance for all baseline models. To recall, it trains on all collected examples (i.e., CodeExp(raw)) and then on the high-quality partition (i.e., CodeExp(refined)) consecutively. Comparing across models, the fine-tuned CodeT5 models perform best under S2 and S3 fine-tuning. In particular, CodeT5-(r+r) achieves the highest scores on 6 out of 7 metrics.

Results of human evaluations
We randomly select 180 examples from the test set and collect human evaluations using the protocol in Section 5. For each evaluated docstring, the annotator gives four scores for aspects A1-4, and an overall average score is computed as well. The evaluated docstrings are generated by 6 fine-tuned models given the selected 180 Python functions. The models and respective results are shown in Table 5. For comparison, we also include the human-written reference docstrings from the original codebases and two strong baselines using the OpenAI Codex API, i.e., Codex-Py2Doc and Codex-Py2NL. These two APIs generate docstrings and NL explanations, respectively, and we follow the official settings (Appendix A.5).
Comparing the overall scores in Table 5, the relative superiority of models is largely consistent with the auto-metric results (Table 4): (1) The fine-tuning strategies (S2, S3) that use the high-quality data partition outperform S1. In fact, we found that fine-tuning using only CodeExp(raw) often generates short one-line texts (like summarization) due to the majority of one-line docstrings in the data, which explains the annotators' low ratings of CodeT5-(raw). (2) CodeT5-(refined) and CodeT5-(r+r) outperform the other models.
Achieving human-comparable performance. We also observe that CodeT5-(refined) achieves overall scores comparable to the human-written references. In particular, it has a higher Coverage score than the references, indicating more detailed explanations. Figure 3 plots the (kernel-density-estimated) distribution of overall scores across the 180 examples. CodeT5-(refined) has the highest density accumulated in the high score range (3.5-4.0), and its distribution is very close to that of the reference docstrings. We also observe that the best fine-tuned models outperform Codex-Py2Doc and Codex-Py2NL with respect to all evaluation aspects, even though the parameter count of Codex (12B) is 60 times larger than that of the fine-tuned CodeT5.
BLEU and METEOR are most consistent with human evaluations. To examine the consistency between automatic metrics and human evaluation, we calculated the adapted Kendall's τ (Section 5); the results are shown in Table 6. We found that BLEU and METEOR best match the human evaluations for aspects A1, A3, A4 and the overall score. For aspect A2, Coverage, the newly introduced CER has the highest τ. Interestingly, ROUGE scores show less alignment with human evaluations, although they have been widely used in code summarization. In summary, we recommend applying BLEU and METEOR to the task of code explanation generation.

Data quality matters
Reviewing both automatic and human evaluations, we find that fine-tuning solely on the high-quality partition (CodeExp(refined)) significantly improves performance compared to fine-tuning on CodeExp(raw). Taking the best-performing model series, CodeT5, as an example: (1) For human evaluation, CodeT5-(refined) achieves an overall score of 3.446 (ranking 1st), a 31.8% improvement over the 2.614 of CodeT5-(raw) (Table 5).
(2) For automatic metrics, BLEU and METEOR are the two most faithful metrics recommended in Section 6.2. CodeT5-(refined) improves the BLEU score from 5.39 to 8.02 and the METEOR score from 17.51 to 24.38 compared to CodeT5-(raw). In fact, the preference for -(refined) over -(raw) can be observed for most automatic metrics and model types (Table 4). This pattern is particularly interesting considering that the refined partition is only 1/15 of the total size. One reason is that the majority of CodeExp(raw) examples are short and do not satisfy the requirements of code explanation documents (Section 3.3); they therefore introduce observable noise during optimization. This result demonstrates the importance of data quality for code explanation generation.

Figure 4 shows an example of code and a generated docstring; more examples are listed in Appendix A.6. The CodeT5-(r+r) model successfully captures the main logic of "Indent text by a given number of characters". It also describes in detail the types and semantics of the input parameters and return value. Interestingly, the model also captures the condition for the first "ValueError", although a more faithful description would be "ValueError: if the number of characters differs from the number of lines". Notably, achieving human-written quality does not mean being perfect: in this example, the human-written reference misses the description of the ValueErrors. We found the coverage aspect challenging for both human-written and model-generated docstrings in general, as also reflected by the A2 scores in Table 5.

Reference:
Indent lines of text in the string ``text`` using the indentation character(s) given in ``indent_chars`` ``level`` times.

:param text: A string containing the lines of text to be indented.
:param level: The number of times to indent lines in ``text``.
:param indent_chars: The characters to use for indentation. If a string, uses repetitions of that string for indentation. If a list of strings, uses repetitions of each string to indent each line.
:return: The indented text.

Figure 4: Case study of a generated docstring. The CodeT5-(r+r) correctly captured some detailed code semantics (highlighted in cyan). An ambiguous span is highlighted in pink.

Related works
Several approaches have been proposed for method/function-level automatic code documentation. Early studies use template-based and information-retrieval approaches to generate long-form documents (Wong et al., 2013; McBurney and McMillan, 2014; Moreno et al., 2013). Recent efforts have mainly focused on summarization. Barone and Sennrich (2017) collected a large parallel corpus of Python code and docstrings and trained an LSTM-based machine translation model.
DeepCom (Hu et al., 2018) introduced structural information from the AST to help generate summaries for Java methods. Zhou et al. (2019) proposed ContextCC, which encodes the context information of external dependencies using the API calls in the source Java method. More recently, Ahmad et al. (2020) proposed a transformer approach for method-level summarization. Zhang et al. (2020) built a retrieval-based approach that uses similar code snippets during generation. The aforementioned studies employed several popular PL-NL corpora (Barone and Sennrich, 2017; Hu et al., 2018; Husain et al., 2019). The average docstring length in these corpora is below 20 tokens, and they target the summarization task, where the docstrings are often one-liners. Apart from these efforts, few deep-learning-based approaches generate full-length documentation. Clement et al. (2020) pretrained T5 models to generate Python docstrings in numerous styles. OpenAI Codex (Chen et al., 2021b) trained GPT-3 for both code and document generation in six programming languages.
In the studies mentioned above, machine translation (MT) metrics have been widely used for code comment assessment. Gros et al. (2020) questioned this adoption of reference-based MT metrics by examining and showing the semantic differences between code and NL. Roy et al. (2021) rigorously examined the consistency between automatic metrics and human assessments, also using Kendall's τ, and recommended the use of BLEU, METEOR, and chrF.

Conclusion
Code explanation generation is an important task for code understanding. On the one hand, we show that existing summarization methods do not directly apply to this task. On the other hand, we built data collection pipelines, explored the consistency between automatic and human evaluations, and provided a framework for fine-tuning existing pretrained models to generate explanatory docstrings of human-comparable quality. We highlight the importance of data quality by showing that fine-tuning on high-quality data exceeds the performance of using raw data at 15 times the scale. We expect the proposed infrastructure, including the annotated dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy, to boost future research on code explanation.

Limitations
The examined automatic metrics provide insufficient semantic verification for the generated docstrings. Moreover, the absolute τ values of all automatic metrics are below 0.5, which indicates limited consistency with human evaluations. We look forward to potential factuality-based metrics that better model the correctness and coverage of the explained semantics. Apart from evaluating stand-alone generations, a user study in a production/developer environment could more accurately reflect the effectiveness of AI-generated explanatory documents. As for model performance, increasing the coverage of detailed code semantics remains a challenge for the tested models; in fact, both generated and human-written docstrings received low coverage scores in the human assessments. Lastly, we tested various fine-tuning strategies in this work, while large-scale pretraining for code explanation is also worth exploring.

A Appendices
A.1 Statistics of CodeExp annotation results

Figure A5: Histogram of Step 1 (General adequacy) and Step 2 (Coverage) scores. A blank step 2 score indicates that there are no branching if/else conditions in the code example.

A.2 Test set configuration
The test set mainly consists of examples from the CodeExp(annotated) partition. We select the group of high-quality data, i.e., with scores of all steps ≥ 1 (including blank scores), and remove duplicates that also appear in CodeExp(raw) and CodeExp(refined). This process generates 1,744 examples. We also applied the same quality filtering and deduplication procedure to a small held-out set of GitHub code-doc pairs. Altogether, the test set contains 2,677 examples.
In detail, the deduplication works by computing the Levenshtein distance between a candidate code/document string and each code/document in the 2.3 million examples of CodeExp(raw). To accelerate, we compare only the first 300 characters of the candidate and target strings. If any computed distance is less than 5% of the total string length, the candidate code-doc example is considered a duplicate and excluded from the test set.
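The check above can be sketched as follows. The interpretation of "5% of the total string length" as 5% of the longer compared prefix is an assumption, and the Levenshtein distance is implemented inline for self-containment.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_duplicate(candidate: str, corpus, prefix: int = 300,
                 ratio: float = 0.05) -> bool:
    """True if any corpus string is within 5% edit distance of the
    candidate, comparing only the first `prefix` characters."""
    cand = candidate[:prefix]
    return any(
        levenshtein(cand, t[:prefix]) < ratio * max(len(cand), len(t[:prefix]), 1)
        for t in corpus
    )
```

Scanning the full 2.3M corpus this way is O(n) distance computations per candidate; the 300-character truncation is what keeps each distance computation cheap.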

A.3 Inference settings
At inference time, generated tokens are sampled with the temperature set to 0.1 for all models. The maximum number of generated tokens is set according to each model's capacity: 512 tokens for CodeT5 and GPT-Neo models, and 256 tokens for GPT-2-base.
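Low-temperature sampling sharpens the softmax over logits before drawing each token, making generation nearly greedy at temperature 0.1; a toy sketch (the helper name is an assumption, not any model's actual decoder):

```python
import math
import random

def sample_token(logits, temperature=0.1, rng=random):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numeric stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]
```

At temperature 0.1 even a modest logit gap becomes an overwhelming probability gap, which is why the generations are close to deterministic.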

A.4 Human evaluation settings
The human evaluation consists of four aspects. For each aspect, a question about docstring quality is asked, and the annotator gives an integer score within 0-4, where 0 stands for "does not satisfy the question at all" and 4 stands for "perfectly satisfies the question". The four aspects are A1. General adequacy, A2. Coverage, A3. Coherence, and A4. Fluency (Section 5). Notably, the first three aspects mirror the aspects of the data annotations in Section 3.1. Given one piece of source code, each annotator must evaluate the generated docstrings of all models (including the reference docstring) to remove inter-model bias. The source of each docstring is hidden from the annotators. In total, ten annotators provided 6,480 scores: 180 examples × 9 models × 4 aspects. All annotators have over 2 years of Python development experience.

A.5 Codex API settings
The Codex-Py2Doc stands for the Codex API example of "Write a Python docstring". The official prompt includes the # Python 3.7 header, the source code, and appends at the end the prompting sentence # An elaborate, high quality docstring for the above function: """.
The model stops generating when the stop token # or """ is generated. Similarly, the Codex-Py2NL denotes the Codex API of "Python to natural language". The prompt includes the # Python 3 header, the source code, and the prompting line # Explanation of what the code does \n\n #.
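The Codex-Py2Doc prompt described above can be assembled as follows; build_py2doc_prompt is a hypothetical helper name used only for illustration, while the header and trailing prompt line follow the official example quoted in the text.

```python
def build_py2doc_prompt(function_source: str) -> str:
    """Assemble the 'Write a Python docstring' prompt: version header,
    the function source, then the official trailing prompt line."""
    return (
        "# Python 3.7\n"
        + function_source.rstrip()
        + "\n\n"
        + '# An elaborate, high quality docstring for the above function:\n"""'
    )
```

The trailing `"""` opens the docstring so the model's completion is the docstring body, terminated by the stop tokens described above.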
We follow all official settings for these APIs, including setting the temperature to 0, top_p to 1.0, frequency penalty to 0.0, and presence penalty to 0.0. We use the most capable engine available, "code-davinci-002", and increase the maximum number of generated tokens to 256. The default stop token # for Codex-Py2NL is removed in our settings because otherwise the API would only generate one line of text.

Table 7: Fine-tuning hyperparameters. Notes: (1) All max tokens are set to the upper limits of the model_max_token of the pretrained model from Huggingface.co. (2) The two-stage fine-tuning for each model adopts the settings of -(raw) and -(refined) at each stage, respectively.
A.6 Examples of generated docstrings

See Figure A6.