Impact of Evaluation Methodologies on Code Summarization

There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite a substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.

Despite solid progress in generating more accurate summaries, the evaluation methodology, i.e., the way we obtain training, validation, and test sets, is based solely on conventional ML practices in natural language summarization, without taking into account the domain knowledge of software engineering and software evolution. For example, temporal relations among samples in the dataset are important because the style of newer code summaries can be affected by older code summaries; however, they are not explicitly modeled in the evaluation of code summarization in prior work, which assumed the samples in the dataset are independent and identically distributed. This gap could lead to inflated values for automatic metrics reported in papers and to misunderstanding of whether a model might actually be useful once adopted.
The key missing piece in prior work is the description of the targeted use cases for their ML models. Prior work has implicitly targeted only the batch-mode use case: applying the model to existing code regardless of when the code is written. However, a more realistic scenario could be the continuous-mode use case: training the model with code available at a timestamp, and using the model on new code after that timestamp (as illustrated in Figure 1). Considering that programming languages evolve and coding styles are constantly revised, results obtained in batch mode could be very different from those obtained in continuous mode. Thus, it is insufficient to only report the task being targeted in a paper; it is also necessary to explain the intended use cases for the ML models.
Once the task and use cases are clearly defined, an appropriate evaluation methodology (or potentially several methodologies) should be used.
In this paper, we study recent literature on ML models for code summarization. By reasoning about their evaluation methodologies (which we call mixed-project and cross-project), we define two use cases that could be evaluated by these methodologies. Next, we define a more practical use case in which a developer uses a fixed model continuously over some period of time. We describe an appropriate evaluation methodology for this use case: time-segmented. Finally, we evaluate several existing ML models using the three methodologies. We highlight two key findings. First, depending on the employed methodology, we end up with conflicting conclusions, i.e., using one methodology, model A is better than model B, and using another methodology, model B is better than model A. Second, our results show that the absolute values for automatic metrics vary widely across the three methodologies, which indicates that models might be useful only for some use cases but not others. Thus, it is imperative that future work describes what use case is being targeted and uses the appropriate evaluation methodology.
In summary, this paper argues that we need to choose evaluation methodologies and report results of ML models more diligently. Regardless of whether or not the conclusions of prior work hold across methodologies, we should always choose the methodology appropriate for the targeted task and use case. We hope the community will join us in the effort to define the most realistic use cases and the evaluation methodology for each use case.
We hope that our work will inspire others to design and formalize use cases and methodologies for other tasks. Only a few research studies on defect prediction (D'Ambros et al., 2012; Tan et al., 2015; Wang et al., 2016; Kamei et al., 2016), program repair (Lutellier et al., 2020), and bug localization (Pradel et al., 2020) took software evolution into consideration when evaluating ML models. Taking software evolution into account in those tasks appears more natural, but it is no more important than in code summarization. Moreover, for the first time, we present an extensive list of potential use cases and evaluation methodologies side-by-side, as well as the impact of choosing various methodologies on the performance of ML models.
Table 1 lists prior work on developing new ML models for code summarization. The last three columns show which methodology/methodologies were used in the evaluation in each work (MP: mixed-project, CP: cross-project, T: time-segmented). Out of 18 papers we found, 15 used the mixed-project methodology and 4 used the cross-project methodology. No prior work used the time-segmented methodology.

Mixed-Project
The mixed-project methodology, which is the most commonly used methodology in prior work, extracts samples (code and comments) at a single timestamp (τ) from various projects, then randomly shuffles the samples and splits them into training, validation, and test sets. Figure 2 illustrates this methodology, where each box represents a project and each circle represents a sample. This methodology is time-unaware, i.e., it does not consider if samples in the test sets are committed into a project before or after samples in the training or validation sets.
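To make the distinction between the methodologies concrete, the following sketches (here and in §2.2 and §2.3) use Python; the sample representation (a list of dicts with project and timestamp fields) and the split ratios are assumptions for illustration, not part of the methodologies' definitions. A minimal mixed-project split could look like this:

```python
import random

def mixed_project_split(samples, ratios=(0.7, 0.1, 0.2), seed=7):
    """Mixed-project: pool samples from all projects, shuffle, and split,
    ignoring both project boundaries and timestamps."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * ratios[0])
    n_val = int(len(samples) * ratios[1])
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```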

Cross-Project
The cross-project methodology, also commonly used in prior work, extracts samples at a single timestamp (τ) from various projects as well. Unlike the mixed-project methodology, the cross-project methodology splits the set of projects into three disjoint sets for training, validation, and test. Thus, the samples from one project are contained in only one of the training, validation, and test sets. Figure 3 illustrates this methodology. The cross-project methodology explicitly evaluates the ability to generalize a model to new projects. However, cross-project is also time-unaware, i.e., it does not consider if the samples from a project in the test set come before or after the samples from the projects in the training set.
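Under the same illustrative assumptions as the sketch in §2.1 (each sample carries a "project" field), a cross-project split could be sketched as:

```python
import random
from collections import defaultdict

def cross_project_split(samples, ratios=(0.7, 0.1, 0.2), seed=7):
    """Cross-project: assign whole projects to training/validation/test,
    so that no project contributes samples to more than one set."""
    by_project = defaultdict(list)
    for s in samples:
        by_project[s["project"]].append(s)
    projects = list(by_project)
    random.Random(seed).shuffle(projects)
    n_train = int(len(projects) * ratios[0])
    n_val = int(len(projects) * ratios[1])
    groups = (projects[:n_train],
              projects[n_train:n_train + n_val],
              projects[n_train + n_val:])
    return tuple([s for p in g for s in by_project[p]] for g in groups)
```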

Time-Segmented
We introduce a novel methodology: time-segmented. Unlike the methodologies explained earlier, the time-segmented methodology is time-aware, i.e., the samples in the training set were available in the projects before the samples in the validation set, which were in turn available before the samples in the test set.
Figure 1 illustrates this methodology. The samples available before τ−2 (i.e., their timestamps are earlier than τ−2) are assigned to the training set. The samples available after τ−2 and before τ−1 are assigned to the validation set. And finally, the samples available after τ−1 and before τ (which is the time when the dataset is collected) are assigned to the test set. This assignment may not be the only approach to satisfy the definition of the time-segmented methodology, but it is one approach that utilizes all samples collected at τ. Alternative assignments, e.g., excluding samples available before τ−3 (a timestamp earlier than τ−2) from the training set, may have other benefits, which we leave for future work to study.
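Continuing the illustrative sketches from §2.1 and §2.2 (the "timestamp" field is an assumed representation), the assignment described above could be sketched as:

```python
def time_segmented_split(samples, tau_minus_2, tau_minus_1):
    """Time-segmented: training samples were available before tau-2,
    validation samples between tau-2 and tau-1, and test samples
    between tau-1 and tau (the dataset collection time)."""
    train = [s for s in samples if s["timestamp"] < tau_minus_2]
    val = [s for s in samples if tau_minus_2 <= s["timestamp"] < tau_minus_1]
    test = [s for s in samples if s["timestamp"] >= tau_minus_1]
    return train, val, test
```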

Use Cases
Methodologies are used to set up experiments and obtain an appropriate dataset split for the evaluation. However, they do not describe the envisioned usage of an ML model. Prior work picked a methodology in order to set up experiments, but we argue that ML models should be described with respect to use cases, i.e., how the developers will eventually use the models. Once a use case is chosen, an appropriate methodology can be selected to evaluate the model.
In this section, we define three use cases via examples of the comment generation task. The first two use cases are "extracted" from prior work. Namely, we reason about the mixed-project and the cross-project methodologies used in prior work and try to link each to a (somewhat) realistic use case. The third use case is inspired by our own development and can be evaluated using the time-segmented methodology. Note that we do not try to provide an exhaustive list of use cases, but rather to start off this important discussion on the distinction between a use case and an evaluation methodology. For the simplicity of our discussion, we only focus on the training and test sets (since the validation set can be regarded as the "open" test set for tuning).

In-Project Batch-Mode Use Case
Consider Alice, a developer at a large software company. Alice has been developing several software features in her project over an extended period of time (since τ−1), but she only wrote comments for a part of her code. At one point (τ), she decides it is time to add documentation for the methods without comments, with the help of an ML model. Alice decides to train a model using already existing samples (i.e., (code, comment) pairs for the methods with comments) in her code, and since this may provide only a small number of training samples, she also uses the samples (available at time τ) from other projects. We call this the in-project batch-mode use case, because Alice trains a new model every time she wants to use the model, and she applies it to a large number of methods that may be added before or after the methods in the training set. This use case can be evaluated using the mixed-project methodology (§2.1).
Because prior work using the mixed-project methodology did not set any limit on the timestamps of samples in the training and test sets, the time difference between samples in the two sets can be arbitrarily large. Moreover, the model is applied on all projects that it has been trained on. These two facts make the in-project batch-mode use case less realistic: for example, a sample from project A available at time τ may be used to predict a sample from project B available at time τ−1, and a sample from project B available at time τ may be used to predict a sample from project A available at time τ−1, simultaneously.

Cross-Project Batch-Mode Use Case
In this case, we assume that Alice works on a project (since τ−1) without writing any documentation for her code. At some point (τ), Alice decides to document all her methods, again with the help of an ML model. Since Alice does not have any comments in her code, she decides to train only on the samples (i.e., (code, comment) pairs) from other projects (at time τ). Once the model is trained, she uses it to generate comments for all the methods in her project. We call this the cross-project batch-mode use case, because Alice trains a new model at a specific timestamp and applies it to all the methods in a new project. (Note that once she integrates the comments that she likes, she can use them in the future for training a new ML model, which matches the in-project batch-mode use case; alternatively, she could decide to ignore those comments and always generate new comments, but this is unlikely.) This use case can be evaluated using the cross-project methodology (§2.2).
While the cross-project methodology is reasonable for evaluating model generalizability, the cross-project batch-mode use case does make strong assumptions (e.g., no documentation exists for any method in the targeted projects).

Continuous-Mode Use Case
In this case, Alice writes comments for each method around the same time as the method itself. For example, Alice might integrate a model for comment generation into her IDE that would suggest comments once Alice indicates that a method is complete. (Updating and maintaining comments as code evolves (Panthaplackel et al., 2020; Liu et al., 2020; Lin et al., 2021) is an important topic, but orthogonal to our work.) Suppose that at τ−1, Alice downloads the latest model trained on the data available in her project and other projects before τ−1; such a model could be trained by her company and retrained every once in a while (finding an appropriate frequency at which to retrain the model is a topic worth exploring in the future). She can keep using the same model until τ, when she decides to use a new model. We call this the continuous-mode use case, because the only samples that can be used to train the model are the samples from the past. This use case can be evaluated using the time-segmented methodology (§2.3).

Application of Methodologies
We describe the steps to apply the methodologies following their definitions (§2) with a given dataset, as illustrated in Figure 4. The input dataset contains samples with timestamps, and the outputs include: a training and validation set for each methodology to train models; a standard test set for each methodology to evaluate the models for this methodology only; and a common test set for each pair of methodologies to compare the same models on the two methodologies. Appendix A presents the formulas of each step.
Step 1: time-segment. See Figure 4, top left part. A project is horizontally segmented into three parts by timestamps τ−2 and τ−1.
Step 2: in-project split. See Figure 4, top right part. A project is further vertically segmented into three parts randomly, which is orthogonal to the time segments in step 1.
Step 3: cross-project split. See Figure 4, middle part. Projects are assigned to training, validation, and test sets randomly, which is orthogonal to the time segments and in-project splits in steps 1 and 2.
Step 4: grouping. Now that the dataset is broken down into small segments across three dimensions (time, in-project, and cross-project), this step groups the appropriate segments to obtain the training (Train), validation (Val), and standard test (TestS) sets for each methodology. This is visualized in Figure 4, bottom left part.
Step 5: intersection. The common test (TestC) set of two methodologies is the intersection of their TestS sets. This is visualized in Figure 4, bottom right part.
In theory, we could compare all three methodologies on the intersection of the three TestS sets, but in practice, this set is too small (far less than 4% of all samples when we assign 20% of projects and 20% of samples in each project to the test set).
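As a rough illustration of step 5 (not the exact implementation we used), the common test set can be computed by intersecting two TestS sets on some sample identity; the identity key below is an assumption of this sketch:

```python
def common_test_set(test_s_a, test_s_b,
                    key=lambda s: (s["project"], s["code"], s["summary"])):
    """TestC of two methodologies: the samples that appear in both of
    their standard test (TestS) sets."""
    keys_b = {key(s) for s in test_s_b}
    return [s for s in test_s_a if key(s) in keys_b]
```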
Step 6: postprocessing. To avoid being impacted by the differences in the number of training samples for different methodologies, we (randomly) downsample their Train sets to the same size (i.e., the size of the smallest Train set). The evaluation (Val, TestS, TestC) sets may contain samples that are duplicates of some samples in the Train set, due to code clones (Sajnani et al., 2016; Roy et al., 2009) and software evolution (Fluri et al., 2007; Zaidman et al., 2011). We remove those samples as they introduce noise into the evaluation of ML models (Allamanis, 2019). We present the results of removing exact-duplicates in the main paper, but we also perform experiments removing near-duplicates to further reduce this noise and report their results in Appendix B (which do not affect our main findings).
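A minimal sketch of step 6, assuming samples are dicts with "code" and "summary" fields (an assumption for illustration; the task-specific clean function we actually use is defined in Appendix A):

```python
import random

def downsample_train_sets(train_sets, seed=7):
    """Downsample every Train set to the size of the smallest one, so that
    differences in training-set size do not confound the comparison."""
    n = min(len(t) for t in train_sets.values())
    rng = random.Random(seed)
    return {m: rng.sample(t, n) for m, t in train_sets.items()}

def remove_exact_duplicates(eval_set, train_set,
                            key=lambda s: (s["code"], s["summary"])):
    """Drop evaluation samples that exactly duplicate a training sample."""
    seen = {key(s) for s in train_set}
    return [s for s in eval_set if key(s) not in seen]
```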
We run several existing ML models using different methodologies to understand their impact on automatic metrics, which are commonly used to judge the performance of models.

Tasks
We focus on the two most studied code summarization tasks: comment generation and method naming. We did our best to select well-studied, representative, publicly-available models for each task; adding more models may reveal other interesting observations but is computationally costly, which we leave for future work.

Comment generation. Developers frequently write comments in natural language together with their code to describe APIs, deliver messages to users, and to communicate among themselves (Padioleau et al., 2009; Nie et al., 2018; Pascarella et al., 2019). Maintaining comments is tedious and error-prone, and incorrect or outdated comments could lead to bugs (Tan et al., 2007, 2012; Ratol and Robillard, 2017; Panthaplackel et al., 2021). Comment generation tries to automatically generate comments from code. Prior work mostly focused on generating an API comment (e.g., a JavaDoc summary) given a method.
We used three models: the DeepComHybrid model from Hu et al. (2018a, 2020), and the Transformer model and Seq2Seq baseline from Ahmad et al. (2020). We used four automatic metrics that are frequently reported in prior work: BLEU (Papineni et al., 2002) (average sentence-level BLEU-4 with smoothing (Lin and Och, 2004b)), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004a), and EM (exact match accuracy).

Method naming. Descriptive names for code elements (variables, methods, classes, etc.) are a vital part of readable and maintainable code (Høst and Østvold, 2009; Allamanis et al., 2015). Naming methods is particularly important and challenging, because the names need to be both concise (usually containing only a few tokens) and comprehensible, such that they describe the key functionality of the code (Lawrie et al., 2006).
We used two models: Code2Vec from Alon et al. (2019b) and Code2Seq from Alon et al. (2019a). We used four automatic metrics that are frequently reported in prior work: Precision, Recall, F1, and EM (exact match accuracy).
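For reference, the automatic metrics for both tasks can be computed roughly as follows; this is a hedged sketch (using NLTK for BLEU), and the exact smoothing variant and subtoken handling used by prior work may differ from what is shown here:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def avg_sentence_bleu4(references, hypotheses):
    """Average sentence-level BLEU-4 with smoothing; references and
    hypotheses are lists of token lists."""
    smooth = SmoothingFunction().method4  # one of several smoothing options
    scores = [sentence_bleu([ref], hyp, smoothing_function=smooth)
              for ref, hyp in zip(references, hypotheses)]
    return sum(scores) / len(scores)

def name_metrics(predicted_subtokens, gold_subtokens):
    """Subtoken-level precision, recall, F1, and exact match for one
    predicted method name (averaged over the test set in practice)."""
    pred, gold = set(predicted_subtokens), set(gold_subtokens)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    em = float(list(predicted_subtokens) == list(gold_subtokens))
    return precision, recall, f1, em
```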

Data
We could not easily reuse existing datasets from prior work because the timestamps of samples are not available. We extracted samples with timestamps from popular and active open-source Java projects on GitHub whose summaries (comments and names) are written in English. We collected samples before τ = Jan 1st 2021, and we time-segmented samples by τ−2 = Jan 1st 2019 and τ−1 = Jan 1st 2020. The splitting ratios for in-project and cross-project splits are 70%, 10%, 20%.
Table 2 presents the number of samples in each set for each methodology. We present more details and metrics of data collection in Appendix C.

Results
We use the hyper-parameters provided in the original papers. Validation sets are used for early-stopping if needed by the model. We run each model three times with different random seeds. Appendix D presents more details of our experiments to support their reproducibility.
Tables 3 and 4 present the results for comment generation and method naming, respectively. Each table has four parts, and each part contains the results for one metric. Each number is the metric of a model (name in column 1) trained on the Train set of a methodology (name in row 1) and evaluated on a TestC set involving that methodology (name in row 2). The best results are in bold text. The results marked with the same Greek letter are not statistically significantly different. Appendix E presents the results on the Val and TestS sets, and bar plots visualizing the results.

Findings
Depending on the methodology, one model can perform better or worse than another. On the method naming task, we found that Code2Seq outperforms Code2Vec only in the cross-project methodology but not the other methodologies, consistently on all metrics. Our observation aligns with the finding in the original paper (Alon et al., 2019a) that Code2Seq outperforms Code2Vec when using the cross-project methodology. The reason is that, in contrast to Code2Seq, which generates a name as a sequence of subtokens, Code2Vec generates a name by retrieving a name in the Train set, and thus has better chances to generate correct names under the mixed-project and time-segmented methodologies, where the names in the test set are similar to the names in the Train set. This finding suggests that a model may work better for one use case but not another; in this case, Code2Seq performs better in the cross-project batch-mode use case, but Code2Vec performs better in the in-project batch-mode and the continuous-mode use cases.

Depending on the methodology, the differences between models may or may not be observable. For example, for comment generation, on the TestC set of the cross-project and time-segmented methodologies when using the METEOR metric (Table 3, columns 6-7), Transformer significantly outperforms Seq2Seq when trained on the time-segmented Train set, but does not when trained on the cross-project Train set. Similar observations can be made on the BLEU and EM metrics for comment generation, and the EM metric for method naming. Two models' results being not statistically significantly different indicates that their difference is not reliable. We could not find reference points for this finding in prior work (unfortunately, Ahmad et al. (2020) did not compare Seq2Seq with Transformer, though both were provided in their replication package).

Results under the mixed-project methodology are inflated. We found that the results under the mixed-project methodology are always higher than under the other two methodologies. This is not surprising, as ML models have difficulty generalizing to samples that are different from the Train set.
Considering that the mixed-project methodology represents a less realistic use case than the other two methodologies, the mixed-project methodology always over-estimates the models' usefulness. As such, we suggest that the mixed-project methodology should never be used unless the model is specifically targeted at the in-project batch-mode use case (§3).

Results under the cross-project methodology may be an under-estimation of the more realistic continuous-mode use case. We found that the results under the cross-project methodology are always lower than the results under the time-segmented methodology, consistently on all metrics in both tasks. We have discussed that the continuous-mode use case is more realistic than others (§3). This suggests that the usefulness of the models in prior work using the cross-project methodology may have been under-estimated.

Findings in prior work may not hold when using a different methodology or a different dataset. We found that the findings reported by prior work may not hold in our experiment: for example, the finding "DeepComHybrid outperforms Seq2Seq" from Hu et al. (2020) does not hold on our dataset (one reason could be that the Seq2Seq code we used is more recent than the version that DeepComHybrid was based on). This indicates that researchers should specify the targeted use case, the employed methodology, and the used dataset when reporting findings, and expect that the findings may not generalize to a different use case or dataset.
6 Future Work

Methodologies for Other SE Areas Using ML Models
We studied the impact of different evaluation methodologies in the context of code summarization, and future work can study their impacts on other software engineering (SE) areas using ML models. We briefly discuss the potential ways and challenges of transferring our methodologies from code summarization to ML models for other SE tasks, including generation tasks (e.g., commit message generation and code synthesis) and non-generation tasks (e.g., defect prediction and bug localization). The key is to modify the application steps of the methodologies based on the format of samples (inputs and outputs) in the targeted task.
For most tasks where inputs and outputs are software-related artifacts with timestamps, the methodologies, use cases, and application steps defined by us should still apply. For example, transferring our methodologies from the code summarization task to the commit message generation task only requires replacing "(code, comment) pairs" with "(code change, commit message) pairs".
For some tasks, the input or output of one sample may change when observed at different timestamps. For example, in defect prediction (as pointed out by Tan et al. (2015)), suppose a commit at τ−2 was discovered to be buggy at τ; then, when training the model at τ−1, that commit should be labeled as not buggy. The correct version of the sample should be used according to its timestamp.
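A minimal sketch of this point, with hypothetical field names used only for illustration: the label of a commit depends on the timestamp at which it is observed.

```python
def is_buggy_at(commit, observation_time):
    """A commit counts as buggy only if its bug had already been discovered
    by the observation timestamp; when building a training set at an
    earlier timestamp, the same commit must be labeled as clean."""
    discovered_at = commit.get("bug_discovered_at")  # None if never found buggy
    return discovered_at is not None and discovered_at <= observation_time
```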

Other Use Cases and Methodologies
Out of many other use cases and methodologies, we discuss two that are closely related to the continuous-mode use case and the time-segmented methodology. Future work can expand our study and perform experiments on them.

Cross-project continuous-mode use case. Compared to the continuous-mode use case, when training the model at τ, instead of using all projects' samples before τ, we only use other projects' samples. The corresponding methodology is a combination of the cross-project and time-segmented methodologies. From the ML model users' perspective, this use case is less realistic than the continuous-mode use case, because using samples from the targeted projects can improve the model's performance. However, from the ML model researchers' perspective, this methodology may be used to better evaluate the model's effectiveness on unseen samples (while considering software evolution).

Online continuous-mode use case. Compared to the continuous-mode use case, when we train a new model at τ, instead of discarding the previous model trained at τ−1 and training from scratch, we continue training the previous model using the samples between τ−1 and τ, e.g., using online learning algorithms (Shalev-Shwartz, 2012). The corresponding methodology is similar to the time-segmented methodology, but with multiple training and evaluation steps. Compared to the time-segmented methodology, the model trained using this methodology may have better performance as it is continuously tuned on the latest samples (e.g., with the latest language features).

Applications of Our Study in Industry
We provide generic definitions of several representative use cases (in-project batch-mode, cross-project batch-mode, and continuous-mode). We believe these three use cases, plus some variants of the continuous-mode use case (§6.2), should cover most use cases of ML models in the SE industry. In practice, it may not always be possible to determine the target use cases in advance of deploying ML models, in which case performing a set of experiments (similar to the one in our study) to compare different methodologies and use cases can guide the switching of use cases. We leave studying the usages of ML models in the SE industry and deploying the findings of our study as techniques to benefit the SE industry as future work.
7 Related Work

Evaluation Methodologies
To the best of our knowledge, ours is the first work to study the evaluation methodologies of code summarization ML models and to use the time-segmented methodology in this area. Outside of the code summarization area, a few works on defect prediction (D'Ambros et al., 2012; Tan et al., 2015; Wang et al., 2016; Kamei et al., 2016), one work on program repair (Lutellier et al., 2020), and one work on bug localization (Pradel et al., 2020) have taken into account the timestamps during evaluation, specifically for their task. The methodologies we proposed in this paper may also be extended to those areas. Moreover, our work is the first to study the impact of the mixed-project, cross-project, and time-segmented methodologies side-by-side. Tu et al. (2018) revealed the data leakage problem when using issue tracking data, caused by the unawareness of the evolution of issue attributes. We revealed that a similar problem (unawareness of the timestamps of samples in the dataset) exists in the evaluation of code summarization tasks, and we propose a time-segmented methodology that can be used in future research. Bender et al. (2021) pointed out a similar issue in NLP: ML models evaluated using the standard cross-validation methodology may incur significant bias in realistic use cases, as the models cannot adapt to the new norms, language, and ways of communicating produced by social movements.

Code Summarization
Code summarization studies the problem of summarizing a code snippet into a natural language sentence or phrase. The two most studied tasks in code summarization are comment generation and method naming (§5.1). Table 1 already listed the prior work on these two tasks. Here, we briefly discuss their history.
The first models for comment generation (Iyer et al., 2016) and method naming (Allamanis et al., 2016) were developed based on encoder-decoder neural networks and the attention mechanism. Other prior work extended this basic framework in many directions: by incorporating tree-like code context such as ASTs (Wan et al., 2018; Xu et al., 2019; LeClair et al., 2019; Hu et al., 2018a, 2020); by incorporating graph-like code context such as call graphs and data flow graphs (Xu et al., 2018; Fernandes et al., 2019; Yonai et al., 2019; LeClair et al., 2020); by incorporating path-like code context such as paths in the AST (Alon et al., 2019b,a); by incorporating environment context, e.g., the class name when generating method names (Nguyen et al., 2020); by incorporating type information (Cai et al., 2020); or by using more advanced neural architectures such as transformers (Ahmad et al., 2020).
Recently, pre-trained models for code learning (Feng et al., 2020; Guo et al., 2021; Ahmad et al., 2021; Wang et al., 2021; Chen et al., 2021) were built on large datasets using general tasks (e.g., masked language modeling), and these models can be fine-tuned on specific code learning tasks, including comment generation and method naming. Evaluating pre-trained models involves a pre-training set, in addition to the regular training, validation, and test sets. Our methodologies can be extended for pre-trained models; for example, in the time-segmented methodology, the pre-training set contains samples that are available before the samples in all other sets. No prior work on pre-trained models has considered the timestamps of samples during evaluation.

Conclusion
We highlighted the importance of specifying targeted use cases and adopting the correct evaluation methodologies during the development of ML models for code summarization tasks (and for other software engineering tasks). We revealed the importance of the realistic continuous-mode use case, and introduced the time-segmented methodology, which is novel to code summarization. Our experiments comparing ML models using the time-segmented methodology and the mixed-project and cross-project methodologies (which are prevalent in the literature) showed that the choice of methodology impacts the results and findings of the evaluation. We found that the mixed-project methodology tends to over-estimate the effectiveness of ML models, while the cross-project methodology may under-estimate it. We hope that future work on ML models for software engineering will dedicate extra space to document intended use cases and report findings using various methodologies.

A Formulas of Application of Methodologies

§4 described the steps to apply the methodologies on a given dataset. In this section, we present the formulas used in those steps.
Table 5 lists the symbols and functions that we use. Recall that Figure 4 visualizes all the steps. In the following discussion, we use these abbreviations: MP = mixed-project; CP = cross-project; T = time-segmented; Train = training; Val = validation; TestS = standard test; TestC = common test.
Step 1: time-segment. We first obtain the samples in each project at the three selected timestamps τ−2, τ−1, τ: E_{τ−2,p}, E_{τ−1,p}, E_{τ,p}. Then, we compute the differences of the sets to get: the samples after τ−2 and before τ−1, denoted as E_{τ−1\τ−2,p} = E_{τ−1,p} \ E_{τ−2,p}; and the samples after τ−1 and before τ, denoted as E_{τ\τ−1,p} = E_{τ,p} \ E_{τ−1,p}.

Step 2: in-project split. We perform the split with the following formula (r_x, r_y, r_z are the splitting ratios, and following ML practices, r_x ≫ r_z ⪆ r_y): for each project p, E^x_{τ,p}, E^y_{τ,p}, E^z_{τ,p} = split(shuffle(E_{τ,p}), r_x, r_y, r_z).

Step 3: cross-project split. Given the set of projects P and the splitting ratios r_x, r_y, r_z, we perform the split with the following formula: P_{train}, P_{val}, P_{test} = split(shuffle(P), r_x, r_y, r_z).

Step 4: grouping. Table 6, left part, lists the formulas used in this step.
Step 5: intersection. Table 6, right part, lists the formulas used in this step.

The symbols and functions in Table 5 are:
P: a set of projects, from which samples are derived.
p: a project.
E_{τ,p}: the set of samples extracted from project p at timestamp τ.
E_{τ\τ−1,p} = E_{τ,p} \ E_{τ−1,p} (where \ is the set difference operator), i.e., the samples extracted from project p at timestamp τ that were not available at timestamp τ−1.
shuffle(s): given a set (of samples or projects) s, returns a set with the same items after random shuffling.
split(E, r_x, r_y, r_z): given a set of samples (or projects) E and splitting ratios r_x, r_y, r_z, returns three disjoint subsets of E whose sizes are proportional to the ratios.

Step 6: postprocessing. To remove from the evaluation sets the samples that are duplicates of training samples (exact-duplicates in the main paper, near-duplicates in Appendix B), we define clean(E_eval, E_train), which is task-specific. It takes two inputs: the samples in the evaluation set that need to be cleaned, and the samples used for training; and it returns the cleaned evaluation set. Note that when the evaluation set is the TestS or TestC set, we also consider samples in the Val set as used for training (because they are used for hyper-parameter tuning or early-stopping). The formulas for this step are, for m ∈ {MP, CP, T} and (m, m′) ∈ {(MP, CP), (MP, T), (CP, T)}:

Val_m = clean(Val_m, Train_m)
TestS_m = clean(TestS_m, Train_m ∪ Val_m)
TestC_{m,m′} = clean(TestC_{m,m′}, Train_m ∪ Val_m ∪ Train_{m′} ∪ Val_{m′})

B Filtering Near-Duplicates
We experimented with whether filtering near-duplicates can lead to any change to our findings. We used the following three configurations to define near-duplicates (there are many other ways to define near-duplicates, which we leave for future work).
The three configurations are same_code, same_summary, and high_similarity. The experiment results are presented in the following tables and plots:
• Using the same_code configuration: comment generation: Table 9, Figure 7.
We can draw several conclusions. First of all, our findings in Section 5.4 still hold. The metrics under same_code are closest to the metrics without filtering near-duplicates, which indicates that this filtering configuration has little impact on evaluation results. In contrast, the metrics under same_summary and high_similarity are lower than the metrics without filtering near-duplicates, which means the models become less effective. This indicates that current ML models for code summarization are better at following the samples in the training set than at generating novel summaries.
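For illustration only (the token-level similarity measure and the 0.8 threshold below are assumptions of this sketch, not the exact definitions of our configurations), a near-duplicate filter based on token overlap could look like:

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def is_near_duplicate(sample, train_sample, threshold=0.8):
    """Flag an evaluation sample as a near-duplicate of a training sample
    when its code tokens are highly similar; using code tokens alone and
    the 0.8 threshold are assumptions for this sketch."""
    return jaccard(sample["code_tokens"], train_sample["code_tokens"]) >= threshold
```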

C Data Collection Details
This section extends §5.2 and describes our data collection process in detail. Overall, our datasets are collected and processed following the steps in §4 and Appendix A. We started by collecting samples of methods with comments from open-source GitHub projects, and then performed task-specific processing to get the dataset for each task.

Selecting projects. We initially chose 1,793 popular Java projects on GitHub: 1,000 Java projects with the highest number of stars (indicating how many GitHub users bookmarked a project) and another 793 Java projects whose owner is one of the famous open-source organizations on GitHub. We chose to use only Java projects because most prior work focused on this language (see Table 1). Then, we only kept the projects meeting the following criteria: (1) the number of stars should be larger than 20; (2) the lines of code of the project (as reported by the GitHub API) should be in the range of [10^6, 2 × 10^6], to keep the number of samples balanced across projects; (3) the project should have at least one commit after Jan 1st 2018. 160 projects satisfied all the criteria.
Collecting the raw dataset. We set the timestamps τ−2 to Jan 1st 2019, τ−1 to Jan 1st 2020, and τ to Jan 1st 2021. For each project and for each year in [2019, 2020, 2021], we identified the last commit in the project before Jan 1st of that year, checked out that commit, used JavaParser (https://javaparser.org/) to parse all Java files, and collected samples of Java methods in the form of (code, comment, name, project, year) tuples, where the comment is the first sentence in the JavaDoc of the method. We discarded the samples where: (1) the code or the comment contains non-English characters (157 and 5,139 cases respectively); (2) the code is longer than 10,000 characters (60 cases); (3) the method body is empty, i.e., abstract methods (77,769 cases); (4) the comment is empty after removing tags such as @inheritDoc (21,779 cases). If two samples are identical except for the "year" label, we keep the one with the earliest year (e.g., if two samples from the 2018 and 2019 years have identical code, comment, name, and project, we only keep the 2018 one). We ended up with 77,475 samples in the raw dataset. Then, we followed the steps described in §4 to split the raw dataset into Train, Val, and TestS sets for each methodology and a TestC set for each pair of methodologies. The splitting ratios (for in-project and cross-project splits) are: r_x = 70%, r_y = 10%, r_z = 20%.
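A minimal sketch of the discarding criteria above (the ASCII-based non-English check and the tag stripping are simplifying assumptions of this sketch):

```python
def keep_sample(code, comment, body_is_empty):
    """Return True if a (code, comment) sample passes the filters used
    when collecting the raw dataset."""
    def is_english(text):
        return all(ord(c) < 128 for c in text)  # crude ASCII-only check (assumption)
    if not is_english(code) or not is_english(comment):
        return False          # (1) non-English characters
    if len(code) > 10_000:
        return False          # (2) overly long code
    if body_is_empty:
        return False          # (3) abstract methods
    if not comment.strip():
        return False          # (4) empty comment after removing tags
    return True
```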
Comment generation. Table 7 shows the statistics of the comment generation dataset. The rows, from top to bottom, are: the number of samples; the average number of subtokens in code; the percentage of samples whose number of subtokens in the code is less than 100, 150, 200; the average number of subtokens in comments; and the percentage of samples whose number of subtokens in the comment is less than 20, 30, 50. Figure 5 visualizes the distributions of the number of subtokens in code (x-axis) and the number of subtokens in comments (y-axis).
Method naming. For each sample, we replaced the appearances of its name in its code with the special token "METHODNAMEMASK", such that the models cannot cheat by looking up the name in the signature line or in the method body of recursive methods. Table 8 shows the statistics of the method naming dataset. The rows, from top to bottom, are: the number of samples; the average number of subtokens in code; the percentage of samples whose number of subtokens in the code is less than 100, 150, 200; the average number of subtokens in names; and the percentage of samples whose number of subtokens in the name is less than 2, 3, 6. Figure 6 visualizes the distributions of the number of subtokens in code (x-axis) and the number of subtokens in names (y-axis).
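A minimal sketch of the masking step described above (whole-word matching via a regular expression is an assumption of this sketch):

```python
import re

def mask_method_name(code, name, mask="METHODNAMEMASK"):
    """Replace every whole-word occurrence of the method's own name in its
    code with a mask token, so models cannot copy the name from the
    signature or from recursive calls."""
    return re.sub(r"\b" + re.escape(name) + r"\b", mask, code)
```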

D Experiments Details
This section presents details of our experiments to support their reproducibility.
Computing infrastructure. We run our experiments on a machine with four NVIDIA 1080-TI GPUs and two Intel Xeon E5-2620 v4 CPUs.

Reproducibility of prior work. We used the replication packages provided in the original papers of the models when possible. We made (small) updates to all models to: (1) upgrade outdated data processing code (because our dataset contains samples with new programming language features that were not considered in the past); (2) export evaluation results in a format compatible with our scripts. We integrated these updates in our replication package.

E Additional Experiment Results
We present the following additional tables and figures to help characterize our experiment results and support our findings:
• Evaluation results on the Val and TestS sets.
• Bar plots of the automatic metrics per sample.

Figure 1: Continuous-mode use case that can be evaluated with the proposed time-segmented methodology.

Figure 4: Steps of processing a dataset into training, validation, standard test, and common test sets.

Table 1: Methodologies used in prior work on code summarization; we use the highlighted lines in our experiments.

Table 2: Number of samples in our datasets.

Table 3: Comment generation models' results on TestC sets. The six results in each block are comparable because they use the same set and metric; results marked with the same Greek letter are not statistically significantly different. Depending on the methodology, we may or may not observe statistically significant differences between models.

Table 4: Method naming models' results on TestC sets. The four results in each block are comparable because they use the same set and metric; results marked with the same Greek letter are not statistically significantly different. Surprisingly, Code2Seq outperforms Code2Vec in the cross-project methodology, but the opposite holds in the other two methodologies.

Table 5: Definitions of symbols and functions.

Table 6: The formulas (at steps 4 and 5) to get the training (Train), validation (Val), and standard test (TestS) sets for each methodology, and the common test (TestC) set for each pair of methodologies.
Estimated runtime of models. The approximate model training times are: DeepComHybrid, 7 days; Seq2Seq, 4 hours; Transformer, 10 hours; Code2Seq, 4 hours; Code2Vec, 15 minutes. The evaluation time is around 1-10 minutes per model per evaluation set.

Number of parameters. The number of parameters in each model is: DeepComHybrid, 15.6M; Seq2Seq, 31.3M; Transformer, 68.2M; Code2Seq, 5.7M; Code2Vec, 33.1M.

Random seeds. The random seed used for preparing the dataset (performing in-project and cross-project splits) is 7. The random seeds used for the three training runs are 4182, 99243, and 3705.