Reproducibility in NLP: What Have We Learned from the Checklist?

Scientific progress in NLP rests on the reproducibility of researchers' claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission time, reminding them of key information to include. We provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, we find evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist's introduction. Further, we show that acceptance rate grows for submissions with more Yes responses. We find that the 44% of submissions that gather new data are 5% less likely to be accepted than those that do not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. We find that only 46% of submissions claim to open-source their code, though submissions that do have an 8% higher reproducibility score relative to those that do not, the largest gap for any item. We discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing code and appendices to be submitted one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.


Introduction
Reproducibility is a foundational component of scientific progress. NLP systems are complex, and even when their behavior is carefully measured, incentives to publish quickly and limitations in the publishing process can lead to underreporting of information necessary for reproducible science. The ramifications of this extend beyond the research community; the audience of NLP papers published years ago was largely other NLP researchers, but today the world is watching developments in the field, looking for advances that will lead to broadly adopted applications. As the impact of NLP grows, so too do the consequences of reproducibility in our field.
Of course, NLP is not the first field to evaluate reproducibility; some have even described a "reproducibility crisis" in science (Aarts et al., 2015; Baker, 2016). One tool designed to improve reproducibility is a checklist filled out at paper submission time. Such a checklist can descriptively remind authors of relevant information to report, while preserving the freedom for authors to do so however they see fit. For example, the journal Nature requires authors to fill out a Reporting Checklist for Life Sciences Articles (Nature, 2018). In 2019, NeurIPS started to require that submissions fill out the ML Reproducibility Checklist (Pineau et al., 2021), partly inspired by the Nature checklist, and in 2021 AAAI required their own checklist. CVPR, ICCV, ECCV, and IJCAI provide a checklist but do not collect responses.
In this work, we provide the first analysis of the NLP Reproducibility Checklist (Dodge et al., 2019). We have gathered 10,405 anonymized responses from EMNLP 2020 and 2021, NAACL 2021, and ACL 2021. For the latter two we are also able to obtain reviewer scores, reproducibility judgements, and feedback on the Checklist. Our findings include: (1) Most checklist items are frequently reported, and submissions reporting them are more often accepted and perceived as reproducible. (2) Submissions that collect new data are accepted less often and viewed as less reproducible, and these gaps are not explained by non-reporting of any current Checklist items. (3) Only about half of submissions report open-sourcing code, and many that do not also lack reporting on efficiency measures and even evaluation metrics. (4) A majority of reviewers describe the checklist as useful, and by contrasting responses with observed rates prior to the Checklist we find evidence of a possible increase in reporting. We conclude with a discussion of what can be inferred from these findings about the state of reproducibility in NLP and offer recommendations to address the gaps we have measured.

The NLP Reproducibility Checklist
The NLP Reproducibility Checklist was originally introduced by Dodge et al. (2019). Each item on the checklist is phrased as a statement, like "The number of parameters in each model," and authors can mark YES if they include that information in their paper, NO if they do not include it in their paper, or N/A if that information does not make sense for their submission (e.g., they do not use any models to report parameter counts for). The checklist items were a part of the submission form, and authors were required to fill it out to submit their paper. Thus, the checklist responses act as a (self-reported) overview of the contents of papers submitted to NLP conferences. Importantly, authors were not required to include any information in their papers; they were only required to indicate whether or not they did include it. Answers were made available to reviewers, who were expressly asked to assess the reproducibility of the work. The filled checklists were not released with the published papers. There are three categories of items: (1) for all reported experimental results, (2) for results involving multiple experiments, like hyperparameter search, and (3) for all datasets used. 16 of 19 items appeared in all four conferences. We compare specific checklist items between the ML Reproducibility Checklist and the NLP Reproducibility Checklist conference variations in Appendix A.1. Full phrasing for each conference is listed in Table 6 (Appendix).
Our data includes, for a given submission, the checklist responses (YES, NO, N/A for each item), MAIN + FINDINGS acceptance status (ACCEPT ∈ {accepted, rejected}), and the TRACK. No data includes any deanonymizing information, such as authors or paper titles. For NAACL 2021 and ACL 2021, we have the following metadata for each review: overall recommendation score ("Should this paper be accepted to <conference name>?") averaged to AVGREC ∈ [1, 5]; perceived reproducibility score ("How do you rate the paper's reproducibility? Will members of the ACL community be able to reproduce or verify the results in this paper?") averaged to AVGREPROD ∈ [1, 5], or N/A if any reviewer responds N/A; and reproducibility checklist feedback ("Are the authors' answers to the Reproducibility Checklist useful for evaluating the submission?") aggregated by majority vote to CHECKLISTFEEDBACK ∈ {Not useful, Somewhat useful, Very useful}. As shown in Table 2, there were a total of 13,655 submissions across the four conferences. We remove all withdrawn and desk-rejected submissions from analysis, comprising 3,250 submissions (23.8% of the data), leaving a total of 10,405 submissions for analysis.
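As a concrete illustration, the sketch below shows how per-review metadata like that described above could be aggregated to the submission level; the field names and example values are ours, not the actual schema used by the conference organizers.

```python
from statistics import mean, mode

def aggregate_reviews(reviews):
    """Aggregate per-review metadata into submission-level scores."""
    # AVGREC: mean of overall recommendation scores, each in [1, 5].
    avg_rec = mean(r["recommendation"] for r in reviews)
    # AVGREPROD is N/A if any single reviewer answered N/A (here: None).
    reprod = [r["reproducibility"] for r in reviews]
    avg_reprod = None if None in reprod else mean(reprod)
    # CHECKLISTFEEDBACK is aggregated by majority vote.
    feedback = mode(r["checklist_feedback"] for r in reviews)
    return {"AVGREC": avg_rec, "AVGREPROD": avg_reprod,
            "CHECKLISTFEEDBACK": feedback}

reviews = [
    {"recommendation": 4, "reproducibility": 3, "checklist_feedback": "Somewhat useful"},
    {"recommendation": 3, "reproducibility": 4, "checklist_feedback": "Somewhat useful"},
    {"recommendation": 5, "reproducibility": 3, "checklist_feedback": "Very useful"},
]
print(aggregate_reviews(reviews))
```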
We recognize that the checklist responses are self-reported information, and thus in some cases might not be accurate representations of the associated submission (e.g., authors may mark YES to an item on the checklist when in fact their paper does not include that information).We discuss this in Appendix A.2.
To the best of our knowledge, the creators of the checklist indicated that the data would not be made public; while we currently do not plan to fully open source the data, it can be made available upon request (and we welcome feedback on this policy). In all analyses, error bars represent 95% confidence intervals. These are computed by the Clopper-Pearson interval for binary values and by bootstrap for continuous values, both using scipy version 1.9.1 (Virtanen et al., 2020). All comparisons of differences in results are absolute differences unless explicitly stated as relative.
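For concreteness, a minimal sketch of both interval computations with scipy follows; the counts and scores are illustrative placeholders, not our data.

```python
import numpy as np
from scipy import stats

# Clopper-Pearson (exact) 95% CI for a binary quantity, e.g. an ACCEPT rate:
accepted, total = 240, 1000  # illustrative counts
ci = stats.binomtest(accepted, total).proportion_ci(confidence_level=0.95,
                                                    method="exact")
print(ci.low, ci.high)

# Bootstrap 95% CI for a continuous quantity, e.g. a mean AVGREPROD score:
scores = np.random.default_rng(0).uniform(1, 5, size=500)  # stand-in values
res = stats.bootstrap((scores,), np.mean, confidence_level=0.95)
print(res.confidence_interval)
```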

What Can We Learn About How Reproducibility Already Works?
We begin by measuring current practice, according to the (self-reported) Checklist data. Across all items and conferences, 62.7% of responses were YES. Figure 4 shows that most items are reported in most submissions. Moreover, we can measure reviewers' perception of reproducibility as well as differences in rates of reporting for items among papers that do and do not get accepted by *CL review.
More YES responses to checklist items associate with higher acceptance. In Figures 1 and 2 we show positive associations between answering more items as YES and ACCEPT rate. Each point in these figures represents the ACCEPT rate among all the submissions with the same number of YES responses. We regress the ACCEPT rate on a single variable counting the checklist items a submission answered YES. When pooling responses across all shared questions at all conferences, r² = 0.53. Notably, submissions with YES responses to all items are consistently below the trend. We hypothesize in Appendix A.2 that these submissions include responses which do not accurately represent the associated paper; recall that authors were required to fill out the checklist in order to submit, so marking the same response to all items is, in some sense, as close as they can get to not filling it out. The lower acceptance rate suggests that reviewers are not scoring papers based on the responses to the checklist itself, instead evaluating the contents of the paper, as intended.
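A minimal sketch of this analysis appears below: submissions are binned by YES count, per-bin ACCEPT rates are computed, and a one-variable regression is fit. The arrays are toy stand-ins for the real data.

```python
import numpy as np
from scipy import stats

yes_counts = np.array([16, 12, 9, 14, 16, 7, 13, 15, 11, 14])  # YES responses per submission
accepted = np.array([1, 0, 0, 1, 0, 0, 1, 1, 0, 1])            # 1 = accepted

# One point per distinct YES count: the ACCEPT rate within that bin.
bins = np.unique(yes_counts)
rates = np.array([accepted[yes_counts == b].mean() for b in bins])

fit = stats.linregress(bins, rates)
print(f"slope={fit.slope:.3f}, r^2={fit.rvalue ** 2:.2f}")
```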
Reviewer-assessed reproducibility associates with acceptance rate. In Figure 3 we compare ACCEPT rate across quantiles of AVGREPROD and AVGREC. Though ACCEPT rate grows much more slowly for AVGREPROD than for AVGREC, reviewers' assessment of reproducibility is still evidently associated with acceptance.
In all but three checklist items, YES responses are associated with higher ACCEPT rates. Figure 5.A presents ACCEPT rates conditioned on a given response. YES responses receive a 0.9% higher rate than the overall average, while NO and N/A receive 0.7% and 1.1% lower rates, respectively. Figure 5.B presents the three exceptions where answering YES to that checklist item receives a lower than average rate. These are discussed in detail in Section 5 and Appendix A.3.
In all but one checklist item, YES responses are associated with higher AVGREPROD scores. Figure 6.A shows the mean AVGREPROD score conditioned on a given response in NAACL 2021 and ACL 2021. Reassuringly, YES responses receive 0.04 higher scores than average, while NO and N/A each score 0.04 lower than average. Figure 6.B shows that LINKTOCODE has the highest score, 0.18 above average. We also highlight NEWDATADESCRIPTION, as it is the only item with a lower than average score when answered YES. This exception is discussed further in Section 5, where we hypothesize it reflects a lower than average perceived reproducibility of submissions presenting new data.

The Data Collection Gap
Natural language processing has long been a field driven by data. A body of work has proposed best practices for documenting the characteristics and creation of datasets (Bender and Friedman, 2018; Gebru et al., 2018; Hutchinson et al., 2020; Dodge et al., 2021; Rogers et al., 2021; Pushkarna et al., 2022). Among other concerns, such documentation is critical for the difficult task of dataset reproduction (Recht et al., 2019, inter alia). From the checklist item NEWDATADESCRIPTION, which asks that collection be described if new data is presented, we find that 38.2% and 6.3% of submissions mark YES and NO, respectively. This implies that 44.5% of submissions to our NLP conferences collect new data; if almost half of submissions collect new data, we argue that data collection and dissemination practices deserve further attention. This also highlights clear room for improvement in the community: 14.1% of submissions that collect new data do not describe how it was collected, totaling 650 papers.
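The arithmetic behind these estimates is simple enough to make explicit. (The published 14.1% is computed from exact submission counts; recomputing from the rounded percentages gives approximately the same value.)

```python
yes_rate, no_rate = 38.2, 6.3            # % marking YES / NO to NEWDATADESCRIPTION
new_data_rate = yes_rate + no_rate       # 44.5% of submissions collect new data
undocumented = no_rate / new_data_rate   # ~14% of those omit a collection description
print(new_data_rate, round(100 * undocumented, 1))
```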
Submissions with new data have lower than average ACCEPT rate and AVGREPROD scores. Alarmingly, submissions that collect new data (i.e., submissions that mark YES or NO to NEWDATADESCRIPTION) have a 5.1% lower acceptance rate than those that do not (i.e., mark N/A or BLANK). A low acceptance rate for answering NO to NEWDATADESCRIPTION would, by itself, be encouraging, perhaps indicating that reviewers expect data collection to be well documented. However, Figure 5.B shows that these submissions receive lower than average ACCEPT rates even when answering YES, and Figure 6.B shows the same pattern for AVGREPROD. We hypothesize that these phenomena arise both because dataset papers may indeed be more challenging to (re)produce and also because of the persistent (and problematic) tendency to value modeling over data collection (Rogers, 2021).
High compliance among the dataset checklist items does not reveal the source of the ACCEPT rate and AVGREPROD gap. DATASTATS, DATASPLIT, DATADOWNLOAD, and DATALANGUAGES receive the highest rates of reporting other than MODELDESCRIPTION and METRICS. Compliance only grows when looking just at submissions presenting new datasets, reaching 97.3%, 91.7%, 91.4%, and 86.6% respectively, and NEWDATADESCRIPTION is also reported in 86.0% of these submissions. Unlike the other dataset items, DATADOWNLOAD is less frequently reported, but its occurrence and associated ACCEPT rates and AVGREPROD scores are similar whether considering submissions presenting new data or not. This suggests that additional checklist items for data collection should be introduced to measure where this gap in perceived reproducibility is coming from.
28.9% of submissions with new data do not provide a downloadable version of the data. More generally, a clear area for improvement is that 25.3% of submissions overall answer NO to DATADOWNLOAD; providing a link to download a dataset is still important for previously released datasets, as it might be ambiguous which version of a dataset was used. But for newly collected data, answering NO to DATADOWNLOAD implies the data is not publicly available at all. Moreover, when DATADOWNLOAD is NO, the rate of submissions reporting the collection process in NEWDATADESCRIPTION drops 14.2%. Figure 7 further reveals the interaction between DATADOWNLOAD and NEWDATADESCRIPTION. When a new dataset submission provides neither a description of the data collection process nor access to the data itself, this leaves very little for reviewers to assess, at least with regard to the data contributions of the paper. Yet 28.9% of these papers are accepted.
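The conditional rates in this section can be read off a simple cross-tabulation; a sketch with pandas (our choice for illustration, not necessarily the original tooling) is below, with a toy response table standing in for the data.

```python
import pandas as pd

df = pd.DataFrame({
    "DATADOWNLOAD":       ["YES", "NO", "YES", "NO", "N/A", "YES"],
    "NEWDATADESCRIPTION": ["YES", "NO", "YES", "YES", "N/A", "NO"],
})

# Restrict to submissions with new data (YES or NO to NEWDATADESCRIPTION).
new_data = df[df["NEWDATADESCRIPTION"].isin(["YES", "NO"])]

# Rate of describing the collection process, conditioned on data availability:
rates = (new_data.groupby("DATADOWNLOAD")["NEWDATADESCRIPTION"]
                 .apply(lambda s: (s == "YES").mean()))
print(rates)
```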

Code (Un)availability
We see that on average 45.9% of submissions report linking to code (47.5% for accepted papers). We see in Figure 6 that whether submissions answer LINKTOCODE as YES or NO has the largest difference in AVGREPROD scores, with a gap of 0.30. Yet ACCEPT rates for submissions with or without LINKTOCODE are nearly the same.
We find similar rates of links to code as at ML conferences. Pineau et al. (2021) report a 38.8% self-reported rate of code availability at submission time for NeurIPS 2019. They find this number drops to 27.7% when checked by at least one reviewer. Extrapolating from this false reporting rate, the true code availability rate among accepted papers in our data might be 32.8%. Meanwhile, a study on ICML 2019 by Chaudhuri and Salakhutdinov (2019) finds 36% of submitted and 43% of accepted papers have code at submission time, though it is unclear if these are self-reported.
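Made explicit, the extrapolation applies NeurIPS 2019's verified-to-self-reported ratio to our self-reported rate:

```python
verified_ratio = 27.7 / 38.8  # fraction of NeurIPS 2019 self-reports that held up when checked
self_reported = 45.9          # % of our submissions self-reporting a code link
print(round(self_reported * verified_ratio, 1))  # ~32.8
```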
Previous efforts to measure camera-ready code availability have found widely different rates than our reported LINKTOCODE at submission time. Unfortunately, our data does not cover code availability at camera-ready, except insofar as some authors may interpret this checklist item to permit promises to later release code. 24.3% of papers at NAACL 2022 opted in to submitting a code link to the Reproducibility Track and received an Open Source Code badge. We recognize this was optional for authors, and thus it is likely that the true number of camera-ready papers that included a link to code was higher. The studies mentioned above found that 74.4% and 64% of camera-ready papers had links to code at NeurIPS 2019 and ICML 2019, respectively. Narrowing the range of these measurements should be a worthwhile effort, as these studies found that code availability during review was useful in 1,315 NeurIPS 2019 reviews, and that 18.3% of surveyed ICML 2019 reviewers were able to look at code and found it useful.
Items on compute efficiency are completely reported in only 29.8% of submissions without code. Figure 8 shows patterns for the efficiency items that occurred more than 100 times. While ACCEPT rates are somewhat lower when items are not reported, 21.2% of these without-code submissions report none of the efficiency measures. There may be unavoidable impediments to making code available, such as intellectual property. But in that case even greater emphasis should be placed on reporting efficiency measures, as estimating these without code is quite difficult. Similarly, 19.6% of submissions with no code answer NO to explaining METRICS, which may render evaluations irrecoverably ambiguous if there are varying implementations of a metric.
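A sketch of the pattern analysis behind Figure 8 follows, assuming pandas and a synthetic response table; the item names are an illustrative subset, not the exact checklist items.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
items = ["LINKTOCODE", "RUNTIME", "PARAMETERS", "MEMORY"]  # assumed item subset
df = pd.DataFrame({c: rng.choice(["YES", "NO", "N/A"], size=10000) for c in items})

efficiency = ["RUNTIME", "PARAMETERS", "MEMORY"]
no_code = df[df["LINKTOCODE"] == "NO"]

# Joint response patterns over the efficiency items, keeping patterns
# observed more than 100 times (as in Figure 8).
patterns = no_code[efficiency].value_counts()
print(patterns[patterns > 100])
```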
How Effective Is the Checklist?
Dodge and Smith (2020) describe the Checklist as intended to improve "reporting of the setup and results of the experiments that authors have conducted." Though self-reported data do not directly answer this question, we find potential evidence of such an improvement. Diachronic analysis also shows that reporting rates may have stagnated after initial improvement. We also examine reviewer and author views on the Checklist.
Compared against manually checked data from before the Checklist's introduction, our data shows increases in 8 of 10 items. Figure 9 shows rates for a subset of items that were manually checked by Dodge et al. (2019); among such submissions we would expect fewer N/As, given the Checklist's focus on empirical work.
There is little variation in response proportions between conferences. Excluding two types of outliers likely caused by changes in the Checklist (see Appendix A.3), the maximum difference between conferences for an item is 6.6%, and the maximum difference averaged over all items is 2.2%. This demonstrates that measured response patterns are robust across conferences. However, it also indicates that reproducibility reporting has stagnated over this one-year period.
When asked, a majority of reviewers found the Checklist to be somewhat or very useful. In NAACL 2021 and ACL 2021, reviewers gave feedback on the Checklist. 59.9% found the checklist "Somewhat Useful," 17.0% found it "Very Useful," and 23.2% found it "Not Useful." While this is higher than the 34% of reviewers who answered "yes" that the similar NeurIPS 2019 Checklist was "useful for evaluating the submission" (Pineau et al., 2021), it is worth noting that respondents to the NeurIPS question could answer that they did not read the checklist responses.
Author comments from submissions where the majority of reviewers found the Checklist "Not Useful" show possible gaps in checklist coverage. Some comment on not training models or on using hyperparameters from previous work. Many such submissions are represented among the 22.0% that answer N/A to all hyperparameter questions. Others comment on referring readers to citations for details of standard models, data, or metrics. Re-elaboration is pedagogically important but, comments argue, especially onerous for survey papers. Finally, a comment notes that the Checklist is less relevant to psycholinguistics and cognitive modeling, and indeed the N/A rate of "Linguistic Theories, Cognitive Modeling and Psycholinguistics" TRACK submissions is 33.7%, an increase of 14.6% above the N/A rate over all TRACKs.

Discussion
Our findings from the NLP Reproducibility Checklist can both help inform new interventions and guide improvements to future checklists that will measure the outcomes of those interventions. These findings suggest that, after an initial increase, rates of reporting have stagnated in the period examined and will need new approaches to improve further.
The conference system should better support papers that collect new data. As discussed in Section 5, papers that collect new data have a 5.1% lower acceptance rate than those that do not. Whether this gap is a cause or an effect of the lack of prestige given to data work that Rogers (2021) describes, increasing awareness and resources for this work can help more high-quality data reach publication. Checklists should also increase coverage of this topic. In our data a single item, NEWDATADESCRIPTION, covers all reporting regarding data collection. We find that papers with new data are perceived as less reproducible both when answering NO and when answering YES to describing how they collected data. Likely a combination of several factors leads reviewers to score the reproducibility of papers with new data 2.4% lower relative to papers without. To discover which factors are lacking, best practices in data reproducibility documentation (Gebru et al., 2018; Dodge et al., 2021) should be tracked individually with checklists.
Incentivize authors to release code. We find that releasing code is the checklist item most strongly associated with perceived reproducibility. This aligns with work across diverse fields arguing that open source code is key for transparent and reproducible science (Eglen et al., 2017; Celi et al., 2019; Shamir et al., 2013). These works also suggest that beyond reproducibility, open source code enables more impactful research by allowing other researchers to build on introduced methods and better understand findings through reading code. However, we find that less than half of papers in our study report releasing code at submission. We encourage conferences to incentivize code release at submission and especially at camera-ready, and authors should be made aware of the significant benefit that code submission can have for the review process. Initiatives like the NAACL 2022 Reproducibility Track are a step in the right direction, as they publicly recognize open source code and verify code availability rather than relying only on self-reporting. However, in our data we see no evidence that code availability is increasing over time, so more direct incentives from publication venues are needed.
Make checklist responses public. Self-reported data is notoriously unreliable, but making the checklist responses public will add accountability. In addition, the checklist responses can reference specific sections and act as an index of the paper, so a reader knows where to look for what information. This will be implemented at ACL 2023, and we recommend other conferences follow.
Conferences should allow submission of checklists, unlimited appendices, and code a week after the main deadline. Doing so can help establish a norm of code submission as part of the review process. Likewise, additional time could improve the completeness and accuracy of the checklist. Many pieces of information important for reproducibility are appropriate to include in the appendix of a paper without counting towards the page limit (e.g., a full list of hyperparameter values). This need not increase the burden on reviewers, as they can consult checklists rather than the appendix to assess reporting.

Looking Forward
Checklists collected during submission can measure practices in NLP at a comprehensive scale. To our knowledge, our work and Pineau et al.'s (2021) are the only analyses of submitted reproducibility checklists at AI conferences. These are examples of metascience in AI, or applying scientific rigor to the process of AI research; we expect that as NLP matures, we will see more examples of work analyzing and improving the scientific process. There is also other work which manually audits papers (Fokkens et al., 2013; Gundersen and Kjensmo, 2018; McDermott et al., 2019; Haibe-Kains et al., 2020; Marie et al., 2021), which can complement self-reported checklists, as can other conference submission metadata (Chen et al., 2022) with validated samples.
As standard practices in our field evolve, we will have to update all parts of the conference process, from checklists to reviews to paper presentations. As a positive example, ACL Rolling Review implemented the Responsible NLP Checklist (aclrollingreview.org/responsibleNLPresearch/), which includes ethics as well as reproducibility items. While we do not have data with which to evaluate the Responsible NLP Checklist, our findings show the need for just such efforts to expand the coverage of checklists to better serve the community.

Limitations
Our analyses rely on data from checklists filled in by authors and ratings provided by reviewers. Checklists are self-reported and thus not necessarily accurate. We discuss where bad-faith responses might influence our results in Appendix A.2. Another data limitation is that phrasing changes between conferences for some items, and 3 items do not appear in all conferences (see Appendix A.1). NAACL 2021 also introduces BLANK as a possible answer when respondents do not choose any answer. There is also possible ambiguity between the NO and N/A answers, as it is apparent from the checklist open-text comments that some authors used NO when the item was not applicable to their work. Our data also only covers four conferences across 2020 and 2021, and as such it is difficult to assess any temporal trends. Reviewer data is also subject to inaccuracy; for instance, reviewers' perceived reproducibility scores are only subjective estimates of the likelihood of actual reproducibility. Rushed reviewers could easily miss where some important information is reported in a paper. Moreover, we only have reviewer data for 2 of the 4 conferences.
Our finding that papers that collect data have a gap in acceptance and perceived reproducibility relies on an indirect inference about which papers collect data. Checklists did not ask this explicitly; rather, NEWDATADESCRIPTION should be answered N/A by all papers that do not collect data.
Our findings about code and data availability are limited by the ambiguity of when they must be made available to qualify for answering YES. It is evident from the open-text checklist comments that some authors answer YES, NO, or even N/A when they have not yet made code or data available but plan to do so on acceptance.
Any self-reported inaccuracies in our data would particularly affect our findings about the impact of the Checklist's introduction on reporting rates. By definition, we are not able to compare to self-reported rates from before the Checklist's introduction, so we instead rely on Dodge et al.'s (2019) manually checked rates. 9 of the items in our data are not covered in the previous work, but the items that are shared have similar phrasing.
Finally, pooling results over conferences can obscure conference-specific dynamics, such as differences in which items have lower than average YES ACCEPT rates, discussed in Appendix A.3. We check that trends that we highlight in our analyses are consistent across conferences, and we present unaggregated figures in the appendix. Likewise, we find that ACCEPT rates are nearly identical across conferences (see Appendix A.4), enabling us to contrast against an overall acceptance rate.

Ethics Statement
Scientific reproducibility is key to the benefits science can bring to society. Simply put, findings that cannot be reproduced cannot be relied upon, which can lead to wasted societal resources or even to harmfully incorrect understandings that misguide interventions. Our work focuses on the use of checklists to improve reporting of reproducibility information in scientific publications. While overly prescriptive and general rules about reproducibility could stifle less represented research communities whose practices may be less well understood by conference organizers, checklists attempt to mitigate this risk by only reminding authors of possibly salient information while still permitting authors to determine which items are or are not applicable.
At the same time, checklists which are filled out and collected for data analysis have the additional ethical risks associated with work that attempts to make social practices legible. That is, a checklist may neglect to cover practices used in a research community and thereby efface their role in the overall scientific endeavor, or conversely some practice may receive unfair scrutiny in excess of that given to other, more prestigious practices. In the long term, checklists are perhaps most important as documents for guiding new generations of researchers writing their first papers, and thus even without being enforced they may still be taken as normative statements about best practices in the field.
To guide efforts to improve reproducibility in the field of NLP, we have analyzed responses to the NLP Reproducibility Checklist collected by four conferences. The Checklist data is covered by the default terms, as it has no stated license, and we use it with direct permission from the conference organizers who collected it. The authors of the first version of the checklist state that it is intended for "improved reporting of the setup and results of the experiments that authors have conducted" and that it will be used to "quantitatively analyze our checklist responses" (Dodge and Smith, 2020).
We have endeavored to maintain the privacy of respondents by keeping the data anonymized and presenting results at a sufficient level of aggregation to prevent deanonymization. Nevertheless, all work that seeks to describe the opinions of groups of humans carries an ethical burden to do so accurately and consistently with the wishes of those represented. To that end, we take care to point out limitations in what can be inferred from the data, and as originally intended by the data creators we do not make the data publicly available.
Table 3: Checklist item phrasing differences across conferences. ∆ marks differing item phrasing. N/A marks conferences with no equivalent item.
Two more items, VALIDATIONPERF and HYPERSEARCH, are incorporated; these are manually evaluated along with 8 items from Pineau et al. (2021) on a random sample of 50 papers from EMNLP 2018 in Dodge et al. (2019). In that analysis, at least one checklist item was found per paper and each checklist item occurred in at least one paper. Finally, PARAMETERS is included for its important role in measuring the complexity of models, and DATALANGUAGES is included because of the importance of acknowledging which communities of speakers are being served by a language technology, as noted by Bender (2019).
The phrasing overlap between the NLP and ML Checklists permits comparison of our data to responses from NeurIPS 2019. Pineau et al. (2021) find similar rates of reporting for dataset and efficiency items, though fewer submissions respond N/A to describing data collection. They find higher rates for items concerning hyperparameters and multiple experiments. Most notably, their acceptance rates conditioned on items differ dramatically from ours. All but one item for "empirical results" get lower than average acceptance for YES and higher for N/A, while our data shows lower YES ACCEPT rates for only 3 empirical items. This suggests the applicability of the NLP Checklist is more aligned with reviewing at the studied conferences.
The phrasing of the Checklist items was determined by distinct groups of organizers for each conference. While 9 items maintain the same phrasing, 6 see phrasing changes, and 3 are only asked at some conferences (see Tables 3 and 6). LINKTOCODE remains the same in substance, while phrasing variations address logistics such as file formats and anonymization. RUNTIME varies in EMNLP 2021 and NAACL 2021 by asking for runtime or energy cost. METRICS varies in NAACL 2021 by not specifying links to metric code. EXPECTEDPERF varies in EMNLP 2020 and ACL 2021 by asking for mean and variance of hyperparameters, where in the other phrasings any summary statistic of results is sufficient. DATASTATS includes languages and label distributions in its variations. DATADOWNLOAD varies only in file formats, except for NAACL 2021, which also allows for a simulation environment.

A.2 Bad Faith Responses
As expected, the MODELDESCRIPTION question was answered YES by almost all submissions (96.3% of responses over all three conferences). This question was intended as an attention check and was designed such that almost all submissions should answer YES. This helps ensure that respondents are not using the N/A (or NO) response in protest or bad faith to quickly fill in meaningless answers, as only 2.6% (or 0.2%) of submissions answer MODELDESCRIPTION this way. However, this does not preclude answering questions YES in bad faith to bypass the checklist. Likewise, we see 8.0% of NAACL 2021 respondents leave this field BLANK.
Submissions with all identical answers have lower than average acceptance. In Table 4 we show counts and change from average acceptance rates for submissions whose answers are all identical. This pattern is most prevalent for YES and BLANK, each accounting for several percent of all submissions. All-NO and all-N/A submissions, however, are quite infrequent. One possible explanation is that selecting all YES or BLANK is an expedient way to bypass the checklist during the submission process. Though we cannot know what portion of submissions with this pattern exhibit this issue, it is important to be aware of this limitation.
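Flagging such submissions is straightforward; a small sketch of one way to do it:

```python
def identical_answer(responses):
    """Return the single repeated answer if all responses match, else None."""
    return responses[0] if len(set(responses)) == 1 else None

print(identical_answer(["YES"] * 16))            # "YES" -> flagged
print(identical_answer(["YES"] * 15 + ["N/A"]))  # None  -> not flagged
```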

A.3 Additional Results
YES was the most common response to checklist questions. The proportion of a given answer in responses for each question is shown in Figure 12. 62.7% of responses to checklist questions across all conferences were YES, with 62.8%, 63.7%, 60.0%, and 62.5% respectively for EMNLP 2020 and 2021, NAACL 2021, and ACL 2021.

The checklist items which receive lower than average YES acceptance rates are not consistent across all conferences. Figure 13 shows acceptance rates for all checklist items over all conferences. From this figure we see that ACL 2021 also has LINKTOCODE, PARAMETERS, DATASTATS, and DATADOWNLOAD YES acceptance rates below average, though all of these estimates include the average acceptance rate within their 95% confidence intervals. NAACL 2021 has no YES acceptance rates below average, though VALIDATIONPERF and NEWDATADESCRIPTION remain the two lowest. Likely all NAACL 2021 YES acceptance rates are elevated because in this conference respondents could leave questions BLANK, possibly diverting some low-quality responses to BLANK instead of YES. Also of note, however, is that across conferences RUNTIME receives a high YES acceptance rate, the best overall at 4.3% higher than average.
There are outliers to this otherwise small variation in response proportions between conferences, but they are likely artifacts of changes in the checklist. The rate of YES responses is generally lower for NAACL 2021, but this is likely due to the ability to leave checklist responses BLANK. Excluding NAACL 2021, the largest difference in YES rate (27.3%) occurs on EXPECTEDPERF, whose phrasing changes substantially between EMNLP 2020 and EMNLP 2021.
PCA analysis. To identify clusters of checklist items that relate to each other, we take inspiration from similar analysis in Michael et al. (2022) and use principal component analysis (PCA) on all responses to shared checklist items across the four conferences. This results in 16 features, which we linearize as {NO → −1, N/A/BLANK → 0, YES → 1}. We run PCA using scikit-learn version 1.1.1 (Pedregosa et al., 2011) and find that the first 4 components cover 55.9% of the variance in the data. Table 5 shows these components and their coefficients with magnitude > 0.20. The first component assigns weights of a single polarity to the checklist items with middling frequencies, highlighting practices where perhaps community norms have not settled. Another component weights towards work that adapts choices other than traditional hyperparameters to a validation set.
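A minimal sketch of this analysis follows, with random responses standing in for the real 10,405 x 16 response matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Linearized responses: NO -> -1, N/A or BLANK -> 0, YES -> 1.
X = rng.choice([-1, 0, 1], size=(10405, 16))  # stand-in for the real matrix

pca = PCA(n_components=4)
pca.fit(X)
print(pca.explained_variance_ratio_.sum())  # variance covered by 4 components

# For each component, report item indices with coefficient magnitude > 0.20.
for i, comp in enumerate(pca.components_):
    big = np.flatnonzero(np.abs(comp) > 0.20)
    print(f"component {i}: items {big.tolist()}")
```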

A.4 Baseline Acceptance Rates
To aid in analyzing how publication decisions differ based on responses to the checklist, we first must establish the average acceptance rate across all papers in our data. In Table 7 we provide basic statistics about submissions and decisions. The acceptance rates reported by the conferences all include an unknown and varying number of withdrawn and desk-rejected papers and thus are not easily comparable. For the rest of our analysis we instead use acceptance rates computed from our data, always removing all withdrawn and desk-rejected papers. With this approach we find that all of the conferences have similar acceptance rates when including both acceptance to the MAIN conference and to FINDINGS.

HyperMethod
All four conferences use essentially identical phrasing: "The method of choosing hyperparameter values (e.g., uniform sampling, manual tuning, etc.) and the criterion used to select among them (e.g., accuracy)." One conference differs only in punctuation and the order of the examples.

NewDataDescription
All four conferences use the same phrasing: "For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control."

DataLanguages
"For natural language data, the name of the language(s)." This item appeared at only one of the four conferences (N/A at the other three).

Figure 1:
Submissions to EMNLP 2021 binned by count of YES responses to the NLP Reproducibility Checklist items. The ACCEPT rate is given for each bin. Papers with more YES responses are more likely to be accepted, except those that mark YES to all checklist items, which we hypothesize contain responses that do not accurately represent the associated paper.

Figure 2:
ACCEPT rate among submissions binned by count of YES responses. YES response count and ACCEPT rate trend consistently positive. All-YES responses are notably below trend, as discussed in Appendix A.2.

Figure 3:
ACCEPT rates across quantiles for perceived reproducibility (AVGREPROD) and overall recommendation (AVGREC) for NAACL and ACL 2021. Perceived reproducibility trends positively with acceptance.

Figure 7:
Proportion (row labels) and ACCEPT rates (horizontal purple bars) for all response patterns on dataset availability and creation (excluding instances where either item is N/A or BLANK). Nearly 1 in 11 of these neither share the data nor describe its collection, yet 28.9% of those are accepted.

Figure 8:
Proportion (row labels) and ACCEPT rates (horizontal purple bars) for efficiency response patterns with > 100 submissions when LINKTOCODE is NO and no responses are N/A or BLANK. More than 1 in 5 of these do not report any efficiency items, which are difficult to infer without source code.

Figure 10:
Phi coefficient between the binary YES (vs. not YES) answer for each item and the binary ACCEPT (vs. not ACCEPT) decision for each submission.

Figure 11:
Phi coefficient between items shared over all conferences for the binary variable YES (vs. not YES). Unsurprisingly, related groups of items about efficiency, hyperparameters, and data each correlate together.
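For reference, the phi coefficient used in Figures 10 and 11 equals Pearson's r computed on two binary indicators; a sketch with illustrative arrays:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
is_yes = rng.integers(0, 2, size=1000)     # 1 if the item was answered YES
is_accept = rng.integers(0, 2, size=1000)  # 1 if the submission was accepted

phi, _ = stats.pearsonr(is_yes, is_accept)  # phi == Pearson's r for binary variables
print(round(phi, 3))
```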

Figure 12:
The portion of submissions giving a particular response per question. Note that NAACL 2021 respondents were able to leave questions BLANK; these are still counted in the total responses for these ratios.

Figure 14:
Reviewer perceived reproducibility score (AVGREPROD ∈ [1, 5]) for submissions with a given response. Column (A) shows score conditioned on response regardless of item. (B) conditions on answer and item. Rows present the two conferences with such data and pooled results overall.

Table 2:
Submissions, Withdrawn/Desk-Rejects, and MAIN conference and FINDINGS acceptance rates in our data.
Figure 4:
YES response rate per item. Most items are reported for most submissions. Note that NAACL 2021 respondents were able to leave questions BLANK. Other answers are shown in Figure 12 (Appendix).

Table 4:
Submissions with all Checklist responses given the same answer (e.g., responding N/A to all items) and their change in MAIN and FINDINGS acceptance rate from overall rate.

Table 5:
The top four components from running PCA on shared checklist items from four conferences, with percent variance explained in parentheses. Each component lists checklist items and their coefficients with magnitude > 0.20.

Table 6:
Exact checklist item phrasing for each conference. Items listed as N/A did not appear on the checklist for that conference.

Table 7:
Submissions and decisions statistics. Reported acceptance rates include varying amounts of withdrawn and desk-rejected submissions. We exclude all of these to standardize the rates.
Figure 13: ACCEPT rates for submissions with a given response. Column (A) shows rate conditioned on response regardless of item. (B) conditions on answer and item. Rows present each conference and pooled results overall.
Proportion (row labels) and ACCEPT rates (horizontal purple bars) over all conferences for top response patterns for items split into three sections.