A Systematic Review of Reproducibility Research in Natural Language Processing

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP.


Introduction
Reproducibility is one of the cornerstones of scientific research: inability to reproduce results is, with few exceptions, seen as casting doubt on their validity. Yet it is surprisingly hard to achieve: 70% of scientists report failure to reproduce someone else's results, and more than half report failure to reproduce their own, a state of affairs that has been termed the 'reproducibility crisis' in science (Baker, 2016). Following a history of troubling evidence regarding difficulties in reproducing results (Pedersen, 2008; Mieskes et al., 2019), where 24.9% of attempts to reproduce a team's own results, and 56.7% of attempts to reproduce another team's results, are reported to fail to reach the same conclusions (Mieskes et al., 2019), the machine learning (ML) and natural language processing (NLP) fields have recently made great strides towards recognising the importance of, and addressing the challenges posed by, reproducibility. There have been several workshops on reproducibility in ML/NLP, including the Reproducibility in ML Workshop at ICML'17, ICML'18 and ICLR'19, and the Reproducibility Challenge at ICLR'18, ICLR'19, NeurIPS'19 and NeurIPS'20; LREC'20 had a reproducibility track and shared task (Branco et al., 2020); and NeurIPS'19 had a reproducibility programme comprising a code submission policy, a reproducibility challenge for ML results, and the ML Reproducibility Checklist (Whitaker, 2017), later also adopted by EMNLP'20 and AAAI'21. Other conferences have foregrounded reproducibility via calls, chairs' blogs,[1] special themes and social media posts. Sharing code, data and supplementary material providing details about data, systems and training regimes[2] is firmly established in the ML/NLP community, with virtually all main events now encouraging and making space for it. Reproducibility even seems set to become a standard part of reviewing processes via checklists.
Far from beginning to converge in terms of standards, terminology and underlying definitions, however, this flurry of work is currently characterised by growing diversity in all these respects. We start below by surveying concepts and definitions in reproducibility research and areas of particular disagreement, then identify categories of work in current NLP reproducibility research. We use the latter to structure the remainder of the paper.

Selection of Papers:
We conducted a structured review employing a stated systematic process for identifying all papers in the field that met specific criteria. Structured reviews are a type of meta-review more common in fields like medicine, but beginning to be used more in NLP (Reiter, 2018; Howcroft et al., 2020).
Specifically, we selected papers as follows. We searched the ACL Anthology for titles containing either of the character strings reproduc or replica, either capitalised or not.[3] This yielded 47 papers; following inspection we excluded 12 of the papers as not being about reproducibility in the present sense.[4] We found 25 additional papers in non-ACL NLP/ML sources, and a further 7 in other fields. Figure 1 shows[5] how the 35 papers from the ACL Anthology search are distributed over years: at most one paper a year until 2017/18, when interest seems to have increased spontaneously, before dropping off again. The renewed high numbers for 2020 are almost entirely due to the LREC REPROLANG shared task (see Section 5 below).

[1] https://2020.emnlp.org/blog/2020-05-20-reproducibility
[2] There are some situations where it is difficult to share data, e.g. because the data is commercially confidential or because it contains sensitive personal information. But the increasing expectation in NLP is that authors should share as much as possible, and justify cases where it is not possible.

Terminology and Frameworks
Reproducibility research in NLP and beyond uses a bewildering range of closely related terms, often with conflicting meanings, including reproducibility, repeatability, replicability, recreation, re-run, robustness, repetition and generalisability. The fact that no formal definition of any of these terms singly, let alone in relation to each other, is generally accepted as standard, or even predominant, in NLP at present is clearly a problem for a survey paper. In this section, we review usage before identifying common-ground terminology that will enable us to talk about the research we survey.
The two most frequently used 'R-terms', reproducibility and replicability, are also the most problematic. For the ACM (Association for Computing Machinery, 2020), results have been reproduced if "obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author," and replicated if "obtained in a subsequent study by a person or team other than the authors, without the use of author-supplied artifacts" (although initially the terms were defined the other way around[6]). The definitions are tied to team and software (artifacts), but it is unclear how much of the latter has to be the same for reproducibility, and how different the team needs to be for either concept. Rougier et al. (2017) tie definitions (just) to new vs. original software: "Reproducing the result of a computation means running the same software on the same input data and obtaining the same results. [...] Replicating a published result means writing and then running new software based on the description of a computational model or method provided in the original publication, and obtaining results that are similar enough to be considered equivalent." It is clear from the many reports of failures to obtain 'same results' with 'same software and data' in recent years that the above definitions raise practical questions, such as how to tell 'same software' from 'new software', and how to determine equivalence of results. Wieling et al. (2018) define reproducibility as "the exact re-creation of the results reported in a publication using the same data and methods," but then discuss (failing to) replicate results without defining that term, while also referring to the "unfortunate swap" of the definitions of the two terms put forward by Drummond (2009).

[3] grep -E 'title *=.*(r|R)(eproduc|eplica)' anthology.bib
[4] Most were about annotation and data replication.
[5] Data and code: https://github.com/shubhamagarwal92/eacl-reproducibility
Whitaker (2017), followed by Schloss (2018), tie definitions to data as well as code. The different definitions of reproducibility and replicability above, put forward in six different contexts, are not compatible with each other. Grappling with degrees of similarity between properties of experiments, such as the team, data and software involved, and between the results obtained, each draws the lines between terms differently, and moreover variously treats reproducibility and replicability as properties of either systems or results. All are patchy: not accounting for some circumstances, e.g. a team reproducing its own results; not defining some concepts, e.g. sameness; or not specifying what can serve as a 'result', e.g. leaving the status of human evaluations and dataset recreations unclear.
The extreme precision of the definitions of the International Vocabulary of Metrology (VIM) (JCGM, 2012) (which the ACM definitions are supposed to be based on but aren't quite) offers a common terminological denominator. The VIM definitions of reproducibility and repeatability (no other R-terms are defined) are entirely general, made possible by two key differences compared to the NLP/ML definitions above. Firstly, in a key conceptual shift, reproducibility and repeatability are properties of measurements (not of systems or abstract findings). The important difference is that the concept of reproducibility now references a specified way of obtaining a measured quantity value (which can be an evaluation metric, statistical measure, human evaluation method, etc. in NLP). Secondly, reproducibility and repeatability are defined as the precision of a measurement under specified conditions, i.e. the distribution of the quantity values obtained in repeat (or replicate) measurements.
In VIM, repeatability is the precision of measurements of the same or similar object obtained under the same conditions, as captured by a specified set of repeatability conditions, whereas reproducibility is the precision of measurements of the same or similar object obtained under different conditions, as captured by a specified set of reproducibility conditions. See Appendix C for a full set of VIM definitions of the bold terms above.
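The VIM view of reproducibility as the precision of repeat measurements can be made concrete with a small sketch. One simple way to summarise precision, purely for illustration, is the sample standard deviation and coefficient of variation of repeated metric scores; the function name and the numbers below are our own, not taken from VIM or any surveyed paper:

```python
import statistics

def precision_summary(scores):
    """Summarise the spread of repeat measurements of the same quantity
    (e.g. a metric score from several reruns of one system): mean,
    sample standard deviation, and coefficient of variation in %."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n-1 denominator)
    cv = 100.0 * std / mean if mean != 0 else float("nan")
    return mean, std, cv

# Five hypothetical reruns of one system under repeatability conditions:
mean, std, cv = precision_summary([27.1, 26.8, 27.3, 26.9, 27.0])
```

On this view, what is reported is the spread itself, under stated repeatability or reproducibility conditions, rather than a single pass/fail judgement.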
To make the VIM terms more recognisable in an NLP context, we also call repeatability reproducibility under same conditions, and (VIM) reproducibility reproducibility under varied conditions. Finally, we refer to experiments carrying out repeat measurements, regardless of same/varied conditions, as 'reproduction studies.'

Categories of Reproducibility Research:
Using the definitions above, the work we review in the remainder of the paper falls into three categories (corresponding to Sections 3-5):
Reproduction under same conditions: As near as possible exact recreation or reuse of an existing system and evaluation set-up, and comparison of results.[7]
Reproduction under varied conditions: Reproduction studies with deliberate variation of one or more aspects of system and/or measurement, and comparison of results.
Multi-test studies: Multiple reproduction studies connected e.g. because of an overall multi-test design, and/or use of same methodology.

Reproduction Under Same Conditions
Papers reporting reproductions under same conditions account for the bulk of NLP reproducibility research to date. The difficulty of achieving 'sameness of system' has taken up much of the discussion space. As stressed by many papers (Crane, 2018; Millour et al., 2020), recreation attempts require access to code, data, full details and assumptions of algorithms, parameter settings, software and dataset versions, initialisation details, random seeds, run-time environment, hardware specifications, etc.
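As a minimal illustration of why such details matter, consider a toy stand-in for a training run whose outcome depends on random number generation: with the seed reported, the run is exactly repeatable; without it, it generally is not. The example is ours, and deliberately simplified; real frameworks additionally require settings such as deterministic GPU kernels to achieve the same effect.

```python
import random

def toy_experiment(seed):
    """Stand-in for a full training run: the outcome depends on the RNG
    state, much as real training depends on initialisation and data
    shuffling order."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(samples) / len(samples)

# Reporting the seed makes the 'experiment' exactly repeatable:
assert toy_experiment(42) == toy_experiment(42)
# An unreported (here: different) seed changes the result:
assert toy_experiment(42) != toy_experiment(43)
```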
A related and striking finding, confirmed by multiple repeatability studies, is that results often depend in surprising ways, and to surprising degrees, on seemingly small differences in model parameters and settings, such as rare-word thresholds, treatment of ties, or case normalisation (Fokkens et al., 2013; Van Erp and Van der Meij, 2013; Dakota and Kübler, 2017). Effects are often discovered during system recreation from incomplete information, when a range of values is tested for missing details. The concern is that the ease with which such NLP results are perturbed casts doubt on their generalisability and robustness.
The difficulties in recreating, or even just rerunning, systems with the same results have led to growing numbers of reproducibility checklists (Olorisade et al., 2017; Pineau, 2020), and tips for making system recreation easier, e.g. the PyTorch (Paszke et al., 2017) recommended settings.[8]

We analysed reproduction studies under same conditions from 34 pairs of papers, and identified 549 individual score pairs where reproduction object, method and outcome were clear enough to include in comparisons (for a small number of papers this meant excluding some scores). Table 1 in Appendix A provides a summary of the results. In 36 cases, the reproduction study did not produce scores, e.g. because resource limits were reached or code did not work. This left 513 cases where the reproduction study produced a value that could be compared to the original score. Of these, just 77 (14.03%) were exactly the same. Of the remaining 436 score pairs, the reproduction score was better than the original in 178 cases (40.8%), and worse in 258 cases (59.2%).
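The same/better/worse breakdown above can be computed mechanically from (original, reproduction) score pairs. The following sketch is our own, with made-up scores, and shows the classification applied:

```python
def compare_score_pairs(pairs):
    """Classify (original, reproduction) score pairs as exactly the same,
    better (reproduction > original), or worse (reproduction < original)."""
    same = sum(1 for o, r in pairs if r == o)
    better = sum(1 for o, r in pairs if r > o)
    worse = sum(1 for o, r in pairs if r < o)
    return same, better, worse

# Illustrative pairs, not taken from the surveyed papers:
same, better, worse = compare_score_pairs(
    [(41.2, 41.2), (27.0, 26.5), (88.0, 90.1), (63.3, 61.0)]
)
```

Note that "exactly the same" here means numerical equality of the reported scores, which is the strictest possible criterion.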
We also examined the size of the differences between original and reproduction scores. For this purpose we computed the percentage change (increase or decrease) for each score pair, and looked at the distribution of the size and direction of change. For this analysis, we excluded score pairs where one or both scores were 0, as well as 4 very large outliers (all increases). Results are shown in the form of a histogram with bin width 1 (clipped to the range -20..20) in Figure 2. The plot clearly shows the imbalance between worse (60% of non-same scores) and better (40%) reproduction scores. It also shows that a large number of differences fall in the -1..1 range; however, the majority of differences, about 3/5, are greater than +/-1%, and about 1/4 are greater than +/-5%.
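The percentage-change computation underlying this analysis can be sketched as follows; the helper function and the example scores are ours, and pairs with a zero original score are excluded, as in the analysis above:

```python
def percent_change(original, reproduction):
    """Percentage change from original to reproduction score;
    only defined when the original score is non-zero."""
    if original == 0:
        raise ValueError("zero original score: pair excluded from analysis")
    return 100.0 * (reproduction - original) / original

# A reproduction score 0.5 points below an original of 27.0
# corresponds to a change of about -1.85%:
change = percent_change(27.0, 26.5)
```

With changes computed this way, binning at width 1 and clipping to -20..20 yields a histogram of the kind shown in Figure 2.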

Reproduction Under Varied Conditions
Reproduction studies under varied conditions deliberately vary one or more aspects of system, data or evaluation in order to explore if similar results can be obtained. There are far fewer papers of this type (see Table 2 in Appendix B for an overview) than papers reporting reproduction studies under same conditions; however, note that we are not including papers here that use an existing method for a new language, dataset or domain, without controlling for other conditions being the same in experiments.
Horsmann and Zesch (2017) pick up strong results by Plank et al. (2016) showing LSTM tagging to outperform CRF and HMM taggers, and test whether they can be confirmed for datasets with finer-grained tag sets. Using 27 corpora (21 languages) with finer-grained tag sets, they systematically compare results for the three models, and show that LSTMs do perform better, and that their superiority grows in proportion to tag set size.
Htut et al. (2018a) recreate the grammar induction model PRPN (Shen et al., 2018), testing different versions with different data. PRPN is confirmed to be "strikingly effective" at latent tree learning. In a subsequent repeat study under same conditions, Htut et al. (2018b) test PRPN using the authors' own code, obtaining the same headline result.

Millour et al. (2020) attempt to get the POS tagger for Alsatian from Magistry et al. (2018) to work with the same accuracy for a different dataset. Collaborating with, and using resources provided by, the original authors, and recreating some resources where necessary, the best result obtained was 0.87, compared to the original 0.91.

Abdellatif and Elgammal (2020) varied conditions of reproduction for classification results by Howard and Ruder (2018), and were able to confirm outcomes for three new non-English datasets, in all three respects (value, finding, conclusion) identified by Cohen et al.

Vajjala and Rama (2018)'s automatic essay scoring classification system was tested on different datasets and/or languages in three studies (Arhiliuc et al., 2020; Caines and Buttery, 2020; Huber and Çöltekin, 2020), all of which found performance to drop on the new data.

Multi-test and Multi-lab Studies
Work in this category is multi-test in the sense of involving multiple reproduction studies in a uniform framework using uniform methods. Some of it is also multi-lab, in that the reproduction studies are carried out by more than one research team. For example, in one multi-test repeatability study, Wieling et al. (2018) randomly selected five papers each from ACL'11 and ACL'16 for which code/data was available. In a uniform design, original authors were contacted for help if needed, a maximum time limit of 8 hours was imposed, and all work was done by the same Masters student. It is not clear how scores were selected (not all were attempted), and reasons for failure are not always clear, even from linked material. Of the 120 score pairs obtained, 60 were the same, 12 reproduction scores were better, 22 were worse, and 26 runs failed (including by exceeding the time limit). See Table 1 for a summary.
Rodrigues et al. (2020) recreated six SemEval'18 systems from the Argument Reasoning Comprehension Task, following system descriptions and/or reusing code, with no time limit. Scores were the same for one system, and within +/-0.036 points for the other five; the SemEval ranking was exactly the same. The systems were also run on a corrected version of the shared-task data (the original data contained unwitting clues). This resulted in much lower scores, casting doubt on the validity of the original shared-task results.

REPROLANG (Branco et al., 2020) is so far the only multi-lab (as well as multi-test) study of reproducibility in NLP. It was run as a selective shared task, and required participants to conform to uniform rules. 11 papers were selected for reproduction via an open call and direct selection. Participants had to 'reproduce the paper' using the information contained or linked in it. Participants submitted (a) a report on the reproduction, and (b) the software used to obtain the results, as a Docker container (controlling variation from dependencies and run-time environments) on GitLab. Submissions were reviewed in great detail, and submitted code was test-run and checked for hard-coding of results. 11 out of 18 submissions were judged to conform with requirements. One original paper (Vajjala and Rama, 2018) attracted four reproductions (Bestgen, 2020; Huber and Çöltekin, 2020; Caines and Buttery, 2020; Arhiliuc et al., 2020), in what must be a groundbreaking first in NLP. See Table 1 for summaries of all 11 reproductions. An aspect the organisers did not control was how to draw conclusions about reproducibility: most contributions provide binary conclusions, but vary in how similar they require results to be for success. E.g., the four papers reproducing Vajjala and Rama (2018) all report similarly large deviations, but only one (Arhiliuc et al., 2020) concludes that the reproduction was not a success.

Conclusions
It seemed so simple: share all data, code and parameter settings, and other researchers will be able to obtain the same results. Yet the systems we create remain stubbornly resistant to this goal: a tiny 14.03% of the 513 original/reproduction score pairs we looked at were the same. At the same time, worryingly small differences in code have been found to result in big differences in performance.
Another striking finding is that reproduction under same conditions yields worse results far more frequently than better ones: we found 258 out of 436 non-same reproduction results (59.2%) to be worse, echoing findings from psychology (Open Science Collaboration, 2015). Why this should be the case for reproduction under same conditions is unclear. It is probably to be expected for reproduction under varied conditions, since larger parameter spaces, more datasets and languages, etc. are tested subsequently, and the original work may have selected better results.
There is a lot of work going on in NLP on reproducibility right now; it is to be hoped that we can soon solve the vexing and scientifically uninteresting problem of how to rerun code and get the same results, and move on to addressing the far more interesting questions of how reliable, stable and generalisable promising NLP results really are.

Appendix A

Table 1: Tabular overview of individual repeatability tests from 34 paper pairs, and a total of 549 score pairs. * = additional information obtained from hyperlinked material.
- Where scores obtained in a repeatability study (reproduction under same conditions) are worse than in the original work, this should not be interpreted as casting the original work in a negative light: it is normally not possible to create the exact same conditions in repeatability studies (and lower scores can result from such differences), and the outcome from multiple repeatability studies may be very different.
- For a small number of papers, the score pairs included in this table are a subset of the scores reported in the paper. More generally, the summary in the last column should not be interpreted as a summary of the whole paper and its findings.
- Our intention here is to summarise differences that have been reported in the literature, rather than to draw conclusions about what may have caused the differences.
Appendix B

Table 2: Tabular overview of individual studies to confirm a previous research finding. * = additional information obtained from hyperlinked material; ** = reproduction study had minor differences, e.g. hyperparameter tuning was omitted (Abdellatif and Elgammal, 2020).
- The comments from the caption of Table 1 also apply here, but note that some differences between original and reproduction study are overt and intentional in the case of the papers in this table, whereas they are not intentional, and often inadvertent, in the case of the papers in Table 1.

Appendix C: Verbatim VIM and ACM Definitions

2.1 (2.1) measurement: process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity
2.3 (2.6) measurand: quantity intended to be measured
2.15 measurement precision (precision): closeness of agreement between indications or measured quantity values obtained by replicate measurements on the same or similar objects under specified conditions
2.20 (3.6, Notes 1 and 2) repeatability condition of measurement (repeatability condition): condition of measurement, out of a set of conditions that includes the same measurement procedure, same operators, same measuring system, same operating conditions and same location, and replicate measurements on the same or similar objects over a short period of time
2.21 (3.6) measurement repeatability (repeatability): measurement precision under a set of repeatability conditions of measurement
2.24 (3.7, Note 2) reproducibility condition of measurement (reproducibility condition): condition of measurement, out of a set of conditions that includes different locations, operators, measuring systems, and replicate measurements on the same or similar objects
2.25 (3.7) measurement reproducibility (reproducibility): measurement precision under reproducibility conditions of measurement

Table 3: VIM definitions of repeatability and reproducibility (JCGM, 2012).
Repeatability (Same team, same experimental setup): The measurement can be obtained with stated precision by the same team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same location on multiple trials. For computational experiments, this means that a researcher can reliably repeat her own computation.

Reproducibility (Different team, same experimental setup)*: The measurement can be obtained with stated precision by a different team using the same measurement procedure, the same measuring system, under the same operating conditions, in the same or a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using the author's own artifacts.

Replicability (Different team, different experimental setup)*: The measurement can be obtained with stated precision by a different team, a different measuring system, in a different location on multiple trials. For computational experiments, this means that an independent group can obtain the same result using artifacts which they develop completely independently.
Results Validated: This badge is applied to papers in which the main results of the paper have been successfully obtained by a person or team other than the author. Two levels are distinguished:

Results Reproduced v1.1: The main results of the paper have been obtained in a subsequent study by a person or team other than the authors, using, in part, artifacts provided by the author.

Results Replicated v1.1: The main results of the paper have been independently obtained in a subsequent study by a person or team other than the authors, without the use of author-supplied artifacts.

In each case, exact replication or reproduction of results is not required, or even expected. Instead, the results must be in agreement to within a tolerance deemed acceptable for experiments of the given type. In particular, differences in the results should not change the main claims made in the paper.