Exploring Variation of Results from Different Experimental Conditions



Introduction
Recently there has been a promising surge of interest in the reproducibility of NLP models, supported by challenges (Pineau et al., 2021), shared tasks (Belz et al., 2020), conference tracks (Carpuat et al., 2022), and even the Reality Check theme at this conference. The outcome of this surge has been a flurry of reproducibility studies and related investigations (Belz et al., 2022a; Arvan et al., 2022a; Chen et al., 2022b). However, the collective findings from these efforts have been alarming.
With interest in reproducibility growing, the evidence is mounting that scores are substantially affected by changes not only to arbitrary factors like random seed and data splits, but also to incidental factors such as the type of GPU on which an experiment is run and the run-time environment. In many cases, near-identical scores can be guaranteed only when an experiment is re-run in fully containerised form. In effect, this means that even perfect sharing of information (once regarded as the answer to all our reproducibility problems (Sonnenburg et al., 2007)) cannot guarantee identical results in all cases.
All this raises questions about reporting, experimental design and the informativeness of scores regarding the relative merits of different methods. Underlying these is the question of where the boundary lies, seemingly between two extremes. On the one hand, exploration of methodological variations and reporting of separate scores is part and parcel of method development. On the other hand, arbitrary and incidental factors such as random seed are not part of method development, because they do not generalise to future applications of the same method. For the former, clearly, comparing and reporting different scores is important; for the latter, how to interpret, address or report variation in scores is an open question.
In this paper, we tackle this question by conducting a systematic and comprehensive investigation, coordinated across two NLP groups, of the variation in results across three neural text simplification (NTS) models under many different experimental conditions. We experiment with different random seeds, run-time environments, and dependency versions to ensure broad coverage. We observe that reporting the average score and its coefficient of variation is a more reliable standard than reporting the maximum value, and we urge researchers to record all methodological conditions, control incidental ones, and abstract away arbitrary factors to promote the reproducibility of their scientific contributions.

Task and Experimental Set-up
Our starting point for this exploration is the first neural text simplification system, reported by Nisioi et al. (2017). This work was selected because it is suitable for our purposes: the authors provided a repository which contains comprehensive information about the original work and the resources, thus facilitating repeat runs of their experiments and exploration of variation in their experimental conditions, which is not often the case for NLP papers. Moreover, the work has been reproduced before (Cooper and Shardlow, 2020; Popović and Belz, 2021; Popović et al., 2022; Belz et al., 2022a; Arvan et al., 2022b) as part of the REPROLANG 2020 (Branco et al., 2020) and ReproGen 2021/2022 (Belz et al., 2021, 2022b) shared tasks, which gave us another reason to choose it.
In the following subsections we describe the four different systems (§2.2), the single data set/split and four text processing variants (§2.3), and the two evaluation methods (§2.4) which were included in our exploration, either because they were part of the original study or because we added them. §2.5 provides an overview of the incidental and arbitrary variation arising in our different runs, which we also analysed.

Task Background
Briefly, text simplification aims to transform a specified text into a simpler form while retaining the same meaning. This is potentially useful for a broad range of real-world applications, because it makes the text readable and understandable for wider audiences and also easier to process by automatic NLP tools. The notion of simplicity itself may be tied to a variety of factors ranging from lexical complexity to content coverage or sentence/document structure. Automatic text simplification (ATS) can be rule-based or data-based. Many data-based techniques approach the task of simplifying text by adopting methods from machine translation (MT), which is also the case for our experiments. Our work does not seek to develop innovations in ATS specifically, but rather to use ATS models as a convenient case study for studying variation of results. Nonetheless, we provide this background to facilitate fuller understanding of the problem scope and goals of the reproduced systems.

Systems
The original work of Nisioi et al. (2017) is one of the first to explore neural networks for ATS (neural ATS, or NTS). They used Long Short-Term Memory (LSTM) recurrent neural networks with attention in an encoder-decoder architecture. Two models were trained: one standard neural MT model (which we call LSTM), and one (LSTM-w2v) using external pre-trained word2vec word representations (Mikolov et al., 2013). All their experiments were carried out using the openNMT tool (Klein et al., 2017). The version used is the initial one, based on LuaTorch and released in December 2016.
The authors provided information about all necessary external libraries and specific Python and Lua dependencies, and also released the two models they trained (LSTM and LSTM-w2v). It is worth noting that the source code uses Python 2.7 and Torch. The Python environment uses older versions of openNMT, NLTK, and gensim. This version of openNMT is no longer maintained and most of the libraries and dependencies have become obsolete; it is therefore advised not to use this version anymore but to switch to one of the two newer ones (openNMT-py based on PyTorch or openNMT-tf based on TensorFlow). As a result, it has become extremely challenging to recreate the same environment to regenerate and retrain the models using the released source code.
Other than variation in the libraries and environments, we conduct a random search for the LSTM models using the original repository. In this scenario, all the hyper-parameters are kept the same except the random seed. Knowing that the random seed affects the weight initialisation, the data order used in training, and the sampling used in the generation, we suspected that we might observe a wide range of results.
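To make concrete how a single seed can control several sources of randomness at once, the following is a minimal Python sketch. It is not the original LuaTorch training code: the function `seeded_run` and its toy "weights" and "data order" are hypothetical stand-ins chosen only for illustration.

```python
import random

def seeded_run(seed, n_examples=8, n_weights=3):
    """Toy stand-in for one training run: returns what the seed controls."""
    rng = random.Random(seed)  # one private generator per run
    # Hypothetical "weight initialisation": small random values.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_weights)]
    # Hypothetical "data order": a shuffled list of example indices.
    order = list(range(n_examples))
    rng.shuffle(order)
    return weights, order

# Re-running with the same seed reproduces the run exactly;
# changing only the seed changes initialisation and data order together.
print(seeded_run(42) == seeded_run(42))
print(seeded_run(42) == seeded_run(43))
```

Because both quantities are drawn from the same generator, changing the seed alters them jointly, which is one reason the effect of the seed on final scores is hard to predict.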
Given that LSTM models have generally been superseded by transformer models (Vaswani et al., 2017), we additionally trained a transformer model on the data provided by the authors, using another publicly available tool, Sockeye. We used two versions of the tool: the first version, based on MXNet (Hieber et al., 2018), and the newest (third) version, based on PyTorch (Hieber et al., 2022). We treat these two versions as two different systems using the same model type. Thus, to summarise, our systems are LSTM, LSTM-w2v, Transformer Sockeye v1, and Transformer Sockeye v3. We report results achieved under numerous conditions for each of these systems, ensuring broad coverage and supporting the robustness of the investigation.

Data Set and Text Processing
Nisioi et al.'s (2017) repository contains the preprocessed data set, but neither the original data nor the pre-processing scripts. Their data set was a popular corpus of parallel English Wikipedia and Simple English Wikipedia (EW-SEW) articles (Hwang et al., 2015), and we used the same data for our experiments. The corpus statistics for the parallel data in both the training and test sets are presented in Table 1. We report the number of sentences and words and the overall vocabulary size for each partition (original/simplified × train/test) of the data.
In the original paper, it is reported that Named Entities were treated separately: they were first identified, then replaced by an 'unknown' symbol for the training, and for generating output, each 'unknown' symbol was replaced by the word with the highest probability score from the attention layer. However, no scripts or guidelines were provided for this step. Also, there was no mention of words being segmented into sub-word units, which is nowadays the standard for all state-of-the-art neural systems. Word segmentation enables better coverage of large vocabularies and treatment of rare and unseen words. The standard word segmentation method for the Sockeye tool is byte-pair encoding (BPE) (Sennrich et al., 2016), which is one of the most widely used segmentation methods.
According to the Sockeye guidelines, segmentation is performed after the original text is tokenised. In our experiments, we explored both original and additionally tokenised data, both with BPE word segmentation.
After generating outputs with our transformer models, sub-word units are joined together to form original words. This is usually followed by a detokenisation step. However, since the outputs of the original models are all tokenised, we evaluated both versions: tokenised and detokenised. Finally, due to the lack of special treatment of named entities, the transformer outputs contain a number of 'unknown' symbols, referring to unseen sub-word units. We computed metric scores for two versions of the output: with 'unknown' symbols left in place, and with 'unknown' symbols removed.
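The two post-processing choices described above (joining sub-word units, and optionally removing 'unknown' symbols) can be sketched as follows. This is an illustrative sketch rather than the scripts used in our runs: the `@@` continuation marker follows the convention of the subword-nmt BPE tool, and `<unk>` is assumed as the unknown symbol; both are assumptions that should be checked against the actual tool output.

```python
import re

BPE_MARKER = "@@"  # assumption: subword-nmt style continuation marker
UNK = "<unk>"      # assumption: symbol emitted for unknown units

def merge_subwords(line):
    """Join sub-word units back into words, e.g. 'simpli@@ fied' -> 'simplified'."""
    return re.sub(re.escape(BPE_MARKER) + r"( |$)", "", line)

def remove_unknown(line):
    """Drop unknown symbols and normalise the remaining whitespace."""
    return " ".join(tok for tok in line.split() if tok != UNK)

hypothesis = "the text was simpli@@ fied by <unk> yester@@ day ."
merged = merge_subwords(hypothesis)
print(merged)                  # 'unknown' symbols left in place
print(remove_unknown(merged))  # 'unknown' symbols removed
```

Each of these switches yields a different string to be scored, so the same model output can produce several metric values depending on the post-processing applied.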

Evaluation
We performed automatic evaluation of generated outputs using the script provided by the authors, which calculates two metrics: BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016). Previous work also explored differences arising from different BLEU implementations (Popović and Belz, 2021), but these are not relevant to present purposes. BLEU is based on matching between the generated text and a manually simplified reference text, while SARI compares the generated text both to the reference text and to the original text.
Table 2 provides an overview of the experimental conditions (first column) for which we explored variation. The conditions are grouped into three categories: (i) methodological factors, i.e., variation in the methods used in a solution for a task with the aim of improving performance, where better performance can to some degree be expected to generalise to similar types of tasks; (ii) arbitrary factors, where an arbitrary (often random) selection is made with respect to a given parameter; and (iii) incidental factors, where selection is not under the direct control of the system creators, e.g., changes from one version of a dependency to another. All of these conditions may reasonably be expected to vary during replication experiments.

Methodological, Arbitrary and Incidental Variations
Methodological factors may occur when the group replicating a given model decides to update some component of its design based on recent findings. An example in our own work reported here is the inclusion of the transformer-based model, motivated by the success of these models for a wide range of NLP tasks in the time since the publication of Nisioi et al. (2017).
Arbitrary factors may occur due to underreporting of necessary parameters in the original work. For instance, if a hyper-parameter must be specified in order for the model to run but no specifications are provided by the model creators, the group replicating the work may select that hyper-parameter randomly or using their own heuristic.
Incidental factors may occur due to library or package updates that render the versions reported in the original publication obsolete. They may also arise from differences in run-time environment, for example when experiments are run on different computers.
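One lightweight way to make incidental factors visible is to store a snapshot of the run-time environment alongside every reported score. The sketch below uses only the Python standard library; the helper name `snapshot_environment` and the chosen fields are illustrative assumptions, and the sketch does not capture hardware details such as GPU type, which would need a tool-specific query.

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment(packages):
    """Collect interpreter, OS, and package versions for an experiment log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }

# Example: log the versions of two dependencies mentioned in this paper.
print(json.dumps(snapshot_environment(["nltk", "gensim"]), indent=2))
```

Saving such a record with each run makes it possible to tell afterwards whether two differing scores were produced under the same dependency versions.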
By including each of these factors in our study, we sought to ensure broad coverage of the range of results variation that may realistically occur when attempting to replicate a previously reported model.

Results
We report the results from both team A and team B for each of the studied conditions. Both teams struggled to get the original repository into a working state: team A failed to install all the required dependencies, as many are deprecated. Team B reported similar concerns about reproducing and reusing the original source code; however, they ultimately managed to get the repository into a running state.
Table 3 shows the two automatic scores generated by the evaluation script provided by the authors for all explored variations (see Table 2), grouped together by system: LSTM, LSTM-w2v, Transformer Sockeye v1 and Transformer Sockeye v3. Where they exist, results provided by the authors of the original paper are included as well. For the random seed search, we included only the worst- and best-performing models in this table; full results of this search can be found in the Appendix.
Averaged scores for each of the three models, together with the standard deviations and coefficients of variation (Belz et al., 2022a), are presented in Table 4. For each of the models, 'all' refers to the average value of all scores for this model presented in Table 3. For the LSTM model, 'random seed' is averaged only over the random seed scores, and 'other' is averaged over all scores except the random seed scores. For the transformer model, 'v1' means only the scores from version 1, and 'v3' means only the scores from version 3. According to the averaged SARI score, the transformer model performs best; however, the newest version performs worse than the old one. According to the averaged BLEU score, LSTM-w2v and Transformer have very similar performance, but the newest version of the transformer is the best of all while the first version is the worst. (The slightly different scores on the original outputs stem from yet another source of variation which we did not explore here, namely incidental variations of BLEU scores related to dependencies and run-time environment.)
We used the R package cvequality (Version 0.2.0; Marwick and Krishnamoorthy, 2019) to test for significant differences between coefficients of variation (CV). This package implements two of the most widely used statistical significance tests, proposed by Feltz and Miller (1996) and Krishnamoorthy and Lee (2014). The null hypothesis for each of the two automatic metrics is that there is no difference in CV between the three models.
We use the results reported in Table 4 in the row 'all' for the three model variants. Conducting the two tests resulted in the statistical significance values shown in Table 5. Neither the test statistics nor the p-values indicate statistical significance at α = 0.05. Therefore, we cannot reject the null hypothesis.
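For readers without access to R, the asymptotic test can be sketched in a few lines of Python. This is our reading of the Feltz and Miller (1996) statistic, not the cvequality implementation, and it should be checked against that package before being relied upon. The scores below are hypothetical and are not taken from Table 3 or Table 4. The closed-form p-value applies only to the three-group case, where the reference chi-square distribution has two degrees of freedom.

```python
import math
import statistics

def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean (a ratio, not a percentage)."""
    return statistics.stdev(scores) / statistics.mean(scores)

def feltz_miller_statistic(groups):
    """Asymptotic statistic for H0: all groups share the same CV.

    Returns (statistic, degrees_of_freedom). Under H0 the statistic is
    approximately chi-square distributed with k - 1 degrees of freedom.
    """
    m = [len(g) - 1 for g in groups]                    # m_j = n_j - 1
    cv = [coefficient_of_variation(g) for g in groups]  # c_j
    pooled = sum(mj * cj for mj, cj in zip(m, cv)) / sum(m)
    numerator = sum(mj * (cj - pooled) ** 2 for mj, cj in zip(m, cv))
    statistic = numerator / (pooled ** 2 * (0.5 + pooled ** 2))
    return statistic, len(groups) - 1

def p_value_two_df(statistic):
    """Chi-square survival function for exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)

# Hypothetical scores for three systems (NOT the values in our tables).
groups = [
    [87.1, 88.4, 86.9, 89.0],
    [88.0, 87.2, 89.5, 88.8],
    [90.3, 84.9, 92.1, 88.6],
]
stat, df = feltz_miller_statistic(groups)
print(f"statistic={stat:.3f}, df={df}, p={p_value_two_df(stat):.3f}")
```

A p-value above the chosen α means the null hypothesis of equal CVs cannot be rejected, which is the situation reported in Table 5.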

Discussion
Nisioi et al. (2017) reported that using pre-trained word embeddings improves the model's performance. Results in Table 3 and Table 4 suggest that while this may be true, the differences are too small to draw clear conclusions. For one model alone, the LSTM variant, we have observed BLEU scores ranging from 84.47 to 89.59; the average, on the other hand, is 87.90 with a CV of 1.36. Compared to LSTMs, transformer models have a higher variance in their performance. This can be attributed to the transformer's complexity and the fact that transformers are harder to train. Also, variations in tokenisation were included only for the transformer models. The performance difference between the best and worst transformer models is even higher than for the LSTM variants. With a 13.42 BLEU score difference, assessing the true performance of the model is a challenging task. Judging the results by the average BLEU score (Table 4), we can observe that the transformer model trained using v3 of the Sockeye tool outperforms the rest of the models. This model achieves an average BLEU of 91.24 with a CV of 1.91. To put the CV into context, this value is higher than three other LSTM variants but lower than the rest of the transformer models. As can be expected, using an averaged performance metric and CV enables a better comparison between models in different conditions.
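The contrast between reporting a maximum and reporting an average with its CV can be illustrated with a short Python sketch. The scores are hypothetical, and the CV here is the plain, unadjusted ratio of sample standard deviation to mean expressed as a percentage, which may differ from the exact variant used for Table 4.

```python
import statistics

def summarise(scores):
    """Return max, mean, sample standard deviation, and CV (in percent)."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return {
        "max": max(scores),
        "mean": mean,
        "sd": sd,
        "cv_percent": 100.0 * sd / abs(mean),
    }

# Hypothetical BLEU scores from repeated runs of one system (illustrative only).
runs = [84.5, 88.1, 89.6, 87.9, 88.4]
summary = summarise(runs)
print({key: round(value, 2) for key, value in summary.items()})
```

The maximum depends on a single favourable run, whereas the mean and CV describe what a new run under comparable conditions is likely to yield.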
Beyond the analysis above, we found it hard to draw distinct observations from the results. This is likely due to the fact that the results are not conclusive and the variance is high. We do not believe this is a flaw in our experimental design but rather a good representation of the complexities of comparing different models across varying conditions. The number of experiments conducted in this study is more than 60, a number that exceeds the number of experiments conducted in most other studies by a large margin.
One of the concerning issues we encountered is software deprecation. While this problem is as old as software itself, it is becoming more and more prevalent. This is due to extreme reliance on empirical results and the complexity of publications that utilise neural networks. Source code often uses several external libraries and dependencies, any of which may become deprecated at any time. Increased availability of source code and the abundance of tools are signs of a healthy research community. Seeing new tools and libraries developed and improved daily is encouraging. At the same time, we believe researchers should exercise caution when introducing new tools and libraries into their experiments, as doing so may shorten the usable lifetime of their source code.

Addressing Experimental Variation in Experimental Design
Many factors can affect the results of an experiment. Some of these factors are under the experimenter's control, and some are not. Before we address these variations, we highlight that scientific experiments are designed as abstract counterparts of real-world problems. Data sets are created with this in mind, consisting of training, validation, and test sets, of which the latter, in particular, is created to represent unseen real-world data.
Research on improving the generalisation of machine learning algorithms is another good example of leveraging scientific experiments to understand real-world challenges.
We can use another analogy to explore these variations further. Bogosort is a sorting algorithm that generates random permutations of the input until the input is sorted. While in the best case, it may take O(n) steps to sort the input, its worst-case performance is unbounded, making it impractical to use. Theoretically, it is possible to find the random seed that achieves best-case performance for a specific input; nonetheless, the slightest change in hardware, environment, or even the input itself will render this seed useless. Although neural networks are far more complicated than a simple sorting algorithm, the basis of reliance on the evidence is the same. Similar to Bogosort, recording all the random numbers used in an experiment is possible (Chen et al., 2022a), but the question is: should we? We do not think so. Instead of optimising the random seed or other arbitrary factors, researchers should focus on the methods that minimize the impact of these variables. Ultimately, we believe the correct approach for conducting scientific experiments is to thoroughly report methodological variations, control incidental variations, and abstract away arbitrary variations.
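The Bogosort analogy can be made concrete with a small Python sketch. The function names and the step limit are illustrative assumptions. The sketch searches for the seed that sorts one particular input fastest, and then applies that same seed to a different input.

```python
import random

def bogosort_steps(items, seed, max_steps=100_000):
    """Shuffle with a seeded generator until sorted; return the number of shuffles.

    Returns None if the input is still unsorted after max_steps shuffles.
    """
    rng = random.Random(seed)
    data = list(items)
    for step in range(max_steps + 1):
        # O(n) check of whether the current permutation is sorted.
        if all(a <= b for a, b in zip(data, data[1:])):
            return step
        rng.shuffle(data)
    return None

def best_seed(items, candidate_seeds):
    """Pick the seed that sorts this particular input in the fewest shuffles."""
    def cost(seed):
        steps = bogosort_steps(items, seed)
        return float("inf") if steps is None else steps
    return min(candidate_seeds, key=cost)

first = [3, 1, 4, 2]
second = [2, 4, 1, 3]
seed = best_seed(first, range(50))
# The seed is tuned to `first`; nothing guarantees it helps with `second`.
print(seed, bogosort_steps(first, seed), bogosort_steps(second, seed))
```

A seed selected in this way describes one input and one random number generator, not the algorithm, which is the sense in which a tuned random seed does not generalise.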

Conclusions
In this work, we conducted a series of experiments for a single task using the same data under different experimental conditions. We categorized these conditions into three different categories: methodological, arbitrary, and incidental. We report the results of our experiments to demonstrate the wide results variation that can occur due to these factors.
We propose that researchers should record all methodological conditions, control incidental ones, and abstract away arbitrary factors. Lastly, we observed that using the average score and its coefficient of variation (CV) instead of the maximum value provides far more reliable results. We recommend that researchers adopt this practice when documenting the findings from their own studies.
We are aware that this is easier said than done. We are, however, optimistic that the field can move closer to this ideal over time. In the meantime, it is our hope that this recommendation highlights the contrast between what is currently a common practice (unfortunately, inadequate recording and reporting that do not address necessary factors for reproducibility) and what is needed to support successful, reproducible research in our field.

Limitations
Our work is limited by several factors. First, our findings are supported only by experiments on a single NLP task (neural text simplification). We selected this task because it offered an intriguing sandbox for studying varying experimental conditions, ranging from differences in random seeds to modifications in compile-time and run-time environments and dependency versions. Comparing the multifaceted outcomes arising from these experiments allowed us to better quantify the degree of reproducibility of the selected NTS systems. However, the dimensions of variation that we explored in this work are common to many NLP tasks; none is unique to text simplification. Because of this, we believe that our findings would generalise broadly across NLP tasks.
We used a single data set, the same as in the original paper by Nisioi et al. (2017), to foster controlled study of our other experimental variables. The data set comprises aligned sentences between English Wikipedia and Simple English Wikipedia. Thus, it is unclear whether our findings would be similar if the study were conducted using data from other languages, including those with richer morphology such as Czech or Arabic.
Finally, although we conducted a robust set of experiments for the selected models across two research groups, our experiments are limited to a small set of NTS models due to the extensive set of conditions tested for each model. Although these models vary in their architecture, we do not know if other NTS models may be more or less stable across experimental conditions. Taken together, the limitations accompanying our findings suggest compelling avenues for future research.

Ethics Statement
This research was guided by a broad range of ethical considerations, taking into account factors associated with environmental impact, equitable access, and reproducibility. We summarize those that we consider most critical in this section. It is our hope that by building a holistic understanding of these factors, we develop an improved perspective on the challenges associated with reproducibility studies and the positive broader impacts that improved reproducibility standards may promote.
Environmental Impact. In this work, we seek to study the complex and murky relationship between experimental conditions and experimental outcomes. To address research questions surrounding this relationship, we conduct many experimental runs to replicate the same models across an extensive set of variable conditions. Although necessary for justifying our claims, a downside of this process is that it may produce environmental harm. One might argue that the advantages of assurance that the 'true' evaluation score is found do not outweigh the disadvantages of repeatedly running models that are known to produce large carbon footprints (Strubell et al., 2019). We attenuate this risk by controlling for as many variables as allowable (e.g., data set and architectural variations) while still fostering robust study of our core question, to minimize the number of experimental runs required.
Equitable Access. A concern closely related to environmental impact is that of equitable access to this line of research. By studying a problem that requires many repeated experimental runs with subtle variations, we may exclude disadvantaged researchers from performing meaningful follow-up studies, since they may not have the requisite resource bandwidth (Bommasani et al., 2021, §5.6). However, although reproducibility studies themselves may pose a barrier to entry for researchers with limited access to compute hardware, the innovations resulting from these studies (e.g., improved community standards for reproducibility of reported results) may stand to greatly benefit marginalised researchers, by minimising the potential for bottlenecks in attempting to perform impossible and costly replications to establish performance baselines.
Reproducibility. To ensure reproducibility of our own work, we report all experimental parameters, computational budget, and computing infrastructure used. We discuss our experimental setups in depth, as they are the primary focus of this study. We report descriptive statistics about our results to enhance transparency of our findings, and we report all implementation settings (e.g., package version number) needed to successfully replicate our work. Although reproducibility studies are not specified as an intended use of the referenced systems (Nisioi et al., 2017), this use is compatible with the original access conditions and the authors have consented to the paper's use in numerous reproducibility studies since its publication (Belz et al., 2022b).

Table 1: Data set statistics showing the number of sentence pairs in the training and test set, and the number of words and vocabulary size for the non-simplified and simplified versions of sentences separately.

Table 2: Summary of different experimental conditions explored in our runs (see the text for an explanation of the three broad categories).

Table 3: BLEU and SARI scores for different experimental variations. † marks the best-performing model in the random seed search; ‡ marks the worst-performing model in the random seed search.

Table 4: Average SARI and BLEU scores, standard deviations and coefficients of variation (CV) for the three models.

Table 5: Test statistics and p-values for differences between the coefficients of variation (CVs) of the three models reported in Table 4, for both BLEU and SARI.