Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP

Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, but in particular where they are used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried in NLP, let alone formally tested, and the repeatability of such experiments and the reproducibility of their results are currently open questions. This paper reports our review of human evaluation experiments published in NLP papers over the past five years, which we assessed in terms of (i) their ability to be rerun, and (ii) their results being reproduced where they can be rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about experimental design is publicly available for rerunning them. Our estimate goes up to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of human evaluations where those are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of similarity of scores and of the findings supported by them. We summarise what insights can be gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.


Introduction
Human evaluation is widely seen as the most reliable form of evaluation in NLP. The traditional view in the field, here expressed for MT, is that "automatic measures are an imperfect substitute for human assessment of translation quality" (Callison-Burch et al., 2008). Numerous papers have reported meta-evaluations of metrics in terms of correlation with human judgments (Belz and Reiter, 2006; Espinosa et al., 2010; Hashimoto et al., 2019; Clark et al., 2019; Sellam et al., 2020). However, recently several papers have highlighted issues arising from lack of standardisation (Belz et al., 2020), incomplete details reported for evaluation design (Howcroft et al., 2020), and poor experimental standards (van der Lee et al., 2019). In this paper, we address issues that intersect with all of these.
Our starting premise is that in order to act as a litmus test of quality, human evaluations need to be able to be relied upon to produce the same results, at least in the sense of supporting the same conclusions, when run multiple times. This ought to be a low-threshold requirement, but is in fact very rarely assessed at all, let alone routinely established for new evaluation methods. Inter-evaluator agreement is more commonly assessed, but falls far short of establishing whether an experiment when repeated produces similar results and/or supports similar findings. Our aim in the work reported here is establishing the reproducibility, or otherwise, of current human evaluation practices, in order to provide evidence-based indications regarding how they can be improved, thereby going beyond recent opinion-based recommendations regarding better practice.
This paper makes five main contributions: (i) an annotation scheme capturing experimental properties playing a role in repeatability and reproducibility (Section 2 and Table 1); (ii) an assessment of the repeatability of human evaluation experiments in NLP (Section 2); (iii) a state-of-the-field assessment of the reproducibility of human evaluations in NLP (Section 3); (iv) the dataset of paper details and annotations our analyses are based on; and (v) evidence-based recommendations regarding how to improve the repeatability and reproducibility of human evaluations in NLP (Section 4).
We use the terms repeatability and reproducibility as follows. Repeatable is a property of experiments, meaning able to be repeated with identical experimental design. Reproducible is a property of evaluations, meaning producing the same results and/or findings when run multiple times.

Repeatability of Human Evaluation Experiments
In this section, we describe (Section 2.1) our 4-stage process for assessing human evaluations in terms of repeatability as a precondition for inclusion in a coordinated set of reproductions. As part of this process, we annotated papers and then experiments with evaluation properties, and we examine what these reveal. Because the final stage of this selection process introduced non-systematic selection (to meet the needs of the coordinated studies design), we also verify our findings on a separate, randomly selected subset of papers (Section 2.2).
For an overview of the selection/filtering process, see the flow diagram in Appendix A.

Identifying repeatable experiments
Selection procedure. To start, we extracted all papers containing the key phrases "human evaluation" and "participants" from TACL and the ACL main conference in the ACL Anthology (177 papers). We included papers from 2018 to 2022 inclusive. We manually checked and excluded papers that did not report a new human evaluation of system outputs.
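For illustration, this first filtering step amounts to a simple keyword filter over paper full texts; a minimal sketch is given below. The directory layout, file naming, and case-insensitive matching rule are our assumptions, not part of the original procedure, which only specifies the two key phrases and the venues.

```python
import pathlib

# Hypothetical local dump of ACL Anthology full texts, one .txt file per
# paper, named by Anthology ID (the layout is an assumption for illustration).
ANTHOLOGY_DIR = pathlib.Path("anthology_texts")
KEY_PHRASES = ("human evaluation", "participants")

def is_candidate(text: str) -> bool:
    """A paper is a candidate if it contains every key phrase (case-insensitive)."""
    lower = text.lower()
    return all(phrase in lower for phrase in KEY_PHRASES)

candidates = [
    path.stem
    for path in sorted(ANTHOLOGY_DIR.glob("*.txt"))
    if is_candidate(path.read_text(encoding="utf-8", errors="ignore"))
]
print(f"{len(candidates)} candidate papers")  # 177 for our venues and years
```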
Paper-level properties. In the second stage we annotated seven paper-level properties including language(s), number of systems, dataset and participants (for details, see Belz et al., 2023). During the annotation process, we excluded papers that had prohibitive barriers to reproduction, which meant those that we estimated to have cost >USD 2,000, and/or that had a longitudinal design, and/or that used highly specialised experts as evaluators, such as doctors. This left 116 papers, of which 29 are from TACL and 87 from ACL.
Experiment-level properties. We then split each paper into the experiments it reports and started annotating each experiment with our fine-grained annotation scheme (Table 1; for additional details see Appendix B). At this point we estimated we had enough information to complete the annotations in the case of just 5% of our papers. We therefore started contacting authors to obtain the missing information. Following the prolonged contacting process (for details see Belz et al., 2023, and Appendix C), we obtained the requested information for just 20 papers (containing 28 experiments). Using both the publicly available and author-provided information, we were able to collate property values to the extent shown in Table 3: 20 experiments had no unclear properties, and 8 had one or more.
That we were able to find clear properties for 20 of the 28 experiments in Table 3 does not indicate that these experiments could definitely be recreated, just that we have the minimal level of information required to attempt recreation. That we can only clear this first hurdle for 17% of the 115 papers we started with (116 minus one paper we excluded after receiving a response from the author) is alarming.
Bugs, errors and flaws. Moreover, in the process of collating and checking experiment details, we found several types of issues that in some cases called into question whether the experiments should be repeated at all, for ethical and/or scientific reasons (for details see Belz et al., 2023).

Verification on random subset of papers
In order to verify the above finding that only 5% of papers are repeatable from publicly available information, we sampled a new batch of papers from an expanded set of 631 ACL, TACL and EMNLP papers that matched the keyword search and did not fail any of the inclusion tests described above.
We annotated the 26 experiments reported in these 20 randomly sampled papers using the same procedure as in Section 2.1, except that we only used information that was publicly available either from the paper, supplementary material, or hyperlinks in the paper, e.g., a GitHub repository. In particular, we tried to find the system outputs that were shown to participants, and the interface, form, or document that participants completed.
We found the above information for just 5% of papers, confirming our estimate from Section 2.1. Three papers made either just the interface or just the system outputs available. Table 2 shows the number of experiments out of all 26 where a given property was clear, for all properties in our annotation scheme. It is clear from the numbers in the table that very basic information such as number and type of participants is very often not findable.

Reproducibility of Results from Human Evaluations
To complement the assessment of the repeatability of human evaluations in NLP above, here we look at the reproducibility of results, as collated from recent reproduction studies. We examine similarity in system-level scores between original and reproduction studies (Section 3.1), and assess whether scores support the same conclusions, which can be the case even for dissimilar scores (Section 3.2).

Similarity of scores
Table 5 provides an overview of reproducibility results from reproduction studies of human system quality evaluations performed as part of the REPROLANG (Branco et al., 2020), ReproGen 2021 (Belz et al., 2021a), and ReproGen 2022 (Belz et al., 2021b) shared tasks. We exclude evaluations based on text annotation where a single overall aggregated score per system was not computed.
Column 1 identifies the original and reproduction study and the evaluation criteria assessed. The last two columns show the corresponding mean study-level and mean criterion-level coefficients of variation (CV*) (Belz et al., 2022), and rank preservation, respectively. The columns in between show seven properties of each study/criterion, as per the HEDS datasheet (Shimorina and Belz, 2022); column headings identify HEDS question numbers (see the table caption for explanation).
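As a rough worked example of the headline statistic, the sketch below computes a small-sample coefficient of variation over the original and reproduction values of one system-level score, following our reading of Belz et al. (2022); the precise de-biasing and scale-shifting steps of CV* are specified in that paper, and the scores used here are invented for illustration.

```python
import numpy as np

def cv_star(scores, scale_min=0.0):
    """Small-sample coefficient of variation over repeated measurements of one
    system-level score (a sketch after Belz et al., 2022). `scores` holds the
    original and reproduction values of the same score; `scale_min` is
    subtracted first so the scale starts at 0. The (1 + 1/(4n)) factor is the
    standard small-sample correction."""
    x = np.asarray(scores, dtype=float) - scale_min
    n = len(x)
    sd = x.std(ddof=1)  # unbiased sample variance under the square root
    return (1 + 1 / (4 * n)) * (sd / x.mean()) * 100

# Invented example: original score 72.0, reproduction score 68.5, 0-100 scale.
print(round(cv_star([72.0, 68.5]), 2))  # -> 3.96; smaller = more reproducible
```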

Confirmation of conclusions
Another perspective on reproducibility is whether the same conclusions can be drawn from two evaluations. Table 4 assesses the (dis)similarity of ranks between the pairs of original and reproduction experiments.
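For concreteness, a rank comparison of this kind can be sketched as below, using scipy's standard correlation functions; the system scores are invented, chosen so that the ranking is preserved even though the absolute scores differ considerably.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Invented system-level scores for the same four systems in an original
# study and one reproduction (values are illustrative only).
original = np.array([71.2, 65.4, 58.9, 49.0])
reproduction = np.array([63.0, 58.1, 50.2, 44.7])

rho, _ = spearmanr(original, reproduction)  # rank correlation
r, _ = pearsonr(original, reproduction)     # linear correlation, for reference

# "Same conclusions" in the simplest sense: the system ranking is preserved
# (assuming no ties), even though the absolute scores differ considerably.
ranks_preserved = np.array_equal(np.argsort(-original), np.argsort(-reproduction))
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}, "
      f"ranks preserved: {ranks_preserved}")
```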

Discussion
When corresponding with authors to find missing information (Section 2.1), and when trying to find information from publicly available sources (Section 2.2), properties were often not obtainable for similar reasons. High-level properties such as the dataset, task, and language could usually be found in the paper. The total number of items was usually available, but the relationship between participants and items was not. Information regarding recruitment of participants, as well as what they saw and did during the experiment, almost always required additional information from the authors. If authors were to make available the files they used for the experiment, and a record of how these were processed (including the way they were presented to participants), it would go a long way towards making the recreation of more experiments possible.
Repeatability, in the sense of being able to be repeated, is a basic requirement of all scientific experiments, perhaps most importantly as a prerequisite to independent verification through reproduction: "An experimental result is not fully established unless it can be independently reproduced" (ACM, 2020). It is therefore of concern in and of itself that the large majority (95%) of human evaluations in NLP are not repeatable from publicly available information (Section 2.2). This is further compounded by our finding (Section 2.1) that even with considerable effort (up to three emails to first and, if necessary, other authors) to obtain missing information to enable repetition, 80% of experiments remain non-repeatable.
Finally, where we were able to obtain and review all information needed for a repetition, we found multiple reporting mistakes, errors in scripts, and ad hoc manual interference in live experiments, all of which call into question, for scientific and/or ethical reasons, whether the experiments should be repeated at all.
Our analysis of reproduction results (Table 5) showed that for the simplest binary output categorisation task, a good degree of reproducibility could be achieved (CV* = 6.11), but for most of the other, more cognitively complex, evaluations, the degree of reproducibility was poor. Most significantly, the same set of conclusions could not be drawn regarding ranks of systems evaluated in any of the reproductions at the experiment level.
We would argue that we urgently need to (i) improve the repeatability of human evaluation experiments by making publicly available, as standard, full information about how the experiment was conducted, in sufficient detail to enable others to re-run it; (ii) test the reproducibility of results of new evaluation methods prior to running full evaluation experiments with them; and (iii) standardise evaluation methods, especially the measurand (evaluation criterion) and measurement procedure, so that the reproducibility of each, once established, does not have to be tested every time. The worrying levels of errors and flaws in reporting and design we found can be in part addressed through standardisation and establishing reproducibility for standardised methods, but will also require a shift in expectations and awareness of how to conduct good quality human evaluations for NLP.

Conclusion
NLP needs human evaluation as a litmus test of quality, including as a reliable reference for meta-evaluating other types of evaluation. In order to play this role, human evaluation needs to be verifiably reliable, and that includes being reproducible; and in order to assess the reproducibility of results, we need to be able to repeat an experiment. However, our results showed that current human evaluations have very poor repeatability (we estimated that just 5% do not have prohibitive barriers to being repeated and can be re-run without recourse to non-public information), and where we are able to repeat human evaluations, the growing number of results from human evaluation reproduction studies shows that they have low degrees of reproducibility of both scores and conclusions. We derived recommendations for making human evaluations in NLP more repeatable and more reproducible, something that we surely need to do if we are to continue treating them as our most trusted assessment of system quality.

A Flow Diagram of Paper Selection/Filtering Process
The following diagram shows the steps in the paper selection/filtering process (reproduced from Belz et al. (2023) for ease of reference):

B Details of evaluation experiment properties
All of the property names and values from our detailed annotations are listed below, along with descriptions of what was recorded for each property:

1. Specific data sets used. In the case of a relative evaluation, an item refers to the set of outputs, e.g., a pair, that is being compared.
9. How many participants evaluated each item; for some experiments, this varied.
10. How many items were evaluated by each participant; for some experiments, this varied. In particular, for the 13 of 28 experiments that were crowd-sourced, 5 were known integers, 4 varied, and 4 could not be determined (we suspect these also varied).
11. Were training and/or practice sessions provided for participants; see the discussion below.
12. Were participants given instructions? Were they given definitions of evaluation criteria? See the discussion below.
13. Were participants required to have a specific expertise? If so, what type, and was this self-reported or externally assessed? See the discussion below.
14. Were participants required to be native speakers? If so, was this self-reported or externally assessed? For the first part we used the options yes, no, and crowd-source region filters, plus, in one case, that the experiment was performed with students at a university where the language was the native language. The latter two are inherently self-reported, although with some limited control by the researchers. Only for one of the experiments with native speakers did the researchers indicate that they had confirmed this; all others were self-reports.
15. How complex was the evaluation task (low, medium, high); assessment by authors of this paper.
16. How complex was the interface (low, medium, high); assessment by authors of this paper.
Classifying the type of participant, training, instruction, and expertise was very difficult. Firstly, not all experiments necessarily require detailed instructions, but setting a threshold beyond which instructions become non-perfunctory is difficult. The same is true for training. In the end, we decided to record whether there was non-perfunctory training, instruction, practice, or criterion definition. Expertise was also difficult to classify. Some papers would have originally reported 'expert annotators', but following our queries stated that participants were graduate students or colleagues. Such participants were often called 'NLP experts'. In the end, we considered participants to be experts if the authors of the original study indicated that they were.

C Process for contacting authors
When we contacted authors of papers we followed a standard procedure. We considered the corresponding author to be the first author of the paper, unless a different corresponding author was explicitly stated. First they were sent the following email:

The ReproHum project at the University of Aberdeen is running a multi-lab study where over 20 partner labs from across the world will be reproducing human evaluation experiments from NLP papers. The project is being led by Prof. Anya Belz, with Prof. Ehud Reiter as co-investigator, and myself as a research assistant.
To create a shortlist of papers to reproduce, we looked for papers containing human evaluations at high-profile conferences such as «VENUE». We identified your paper "«TITLE»" from «VENUE» «YEAR» as a candidate for inclusion in our study. If included, the human evaluation that was performed for the paper would initially be reproduced by 2 different labs. One of our main objectives is to identify types of human evaluation that are associated with higher degrees of reproducibility so that the NLP community can then use this information to select the most appropriate methods for their studies.
We are writing to you today to ask if you can provide us with more information about your experiment to enable us to reproduce it under conditions that are as close to the original as possible. We are particularly hoping that you can provide the system outputs and questions that were shown to participants.
We would be most grateful if you could initially confirm that you are able to send us (links to) the below information (for each human evaluation that is reported in the paper):

1. The system outputs that were shown to participants.
2. The interface, form, or document that participants completed; the exact document or form that was used would be ideal.
3. Details on the number and type of participants (students, researchers, Mechanical Turk, etc.) that took part in the study.
4. The total cost of the original study.
If you are able to provide the above information, we would be grateful if you could also confirm how soon this would be possible. If you have any questions, please contact us.

With best regards, Anya, Ehud, and Craig
Project web page: https://reprohum.github.io

If there was no response to the above email, a second email was sent with only minor adjustments to reflect that we had tried to contact the author previously. A third email was sent in cases where we still had no response. At least one week passed between each email sent to an author. The first two emails were sent from the academic email account of a research assistant, although addressed from the whole project team. The third email was sent by a professor, and whilst this did elicit a small number of responses, most came from the first two emails. In the event that an email address was no longer valid, we searched for a more recent address for the author, primarily by checking their most recent papers. In the event that we could not find any email address for an author, we attempted to contact the next author in the same way. We were able to find a working email address for one author of all bar one paper. Most emails were sent using a mail merge, although some were aggregated and sent manually, in cases where one author had many papers.

Table 1: Properties in experiment annotation scheme.

Table 2: Number of experiments out of 26 for which a given property was clear (random sample of 20 papers using publicly available information only).

Table 3: Number of experiments out of 28 for which a given property was clear (non-random set of 20 papers where authors provided missing information).

Table 4: Spearman's ρ as an indication of how closely matched system ranks are between original and reproduction studies (Pearson's r for reference).

Table 5: Overview of reproducibility results from existing reproduction studies in terms of (mean) CV* and rank preservation (last two columns). Evaluations are characterised in terms of some properties from HEDS datasheets: 3.1.1 = number of items assessed per system; 3.2.1 = number of evaluators in original/reproduction experiment; 4.3.4 = list/range of possible responses; 4.3.8 = form of response elicitation (DQE: direct quality estimation, RQE: relative quality estimation, Anno: evaluation through annotation); 4.1.1 = correctness/goodness/features; 4.1.2 = form/content/both; 4.1.3 = each output assessed in its own right (iiOR) / relative to inputs (RtI) / relative to an external frame of reference (EFoR); scores/item = number of evaluators who evaluate each evaluation item.
References

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.