On the Limitations of Simulating Active Learning

Active learning (AL) is a human-and-model-in-the-loop paradigm that iteratively selects informative unlabeled data for human annotation, aiming to improve over random sampling. However, performing AL experiments with human annotations on-the-fly is a laborious and expensive process, and thus unrealistic for academic research. An easy fix to this impediment is to simulate AL by treating an already labeled, publicly available dataset as the pool of unlabeled data. In this position paper, we first survey recent literature and highlight the challenges across all the different steps within the AL loop. We further unveil neglected caveats in the experimental setup that can significantly affect the quality of AL research. We continue with an exploration of how the simulation setting can govern empirical findings, arguing that it might be one of the answers behind the oft-posed question ``why do active learning algorithms sometimes fail to outperform random sampling?''. We argue that evaluating AL algorithms on available labeled datasets might provide a lower bound on their effectiveness with real data. We believe it is essential to collectively shape the best practices for AL research, particularly as engineering advancements in LLMs push the research focus towards data-driven approaches (e.g., data efficiency, alignment, fairness). In light of this, we have developed guidelines for future work. Our aim is to draw attention to these limitations within the community, in the hope of finding ways to address them.


Introduction
Based on the assumption that "not all data is equal", active learning (AL) (Cohn et al., 1996; Settles, 2009) aims to identify the most informative data for annotation from a pool (or a stream) of unlabeled data (i.e., data acquisition). With multiple rounds of model training, data acquisition and human annotation (Figure 1), the goal is to achieve data efficiency. A data-efficient AL algorithm is one with which a model achieves satisfactory performance on a held-out test set while being trained with only a fraction of the acquired data.
The usual pool-based AL setting is to acquire data from an unlabeled pool, label it, and use it to train a supervised model that, hopefully, obtains satisfactory performance on a test set for the task at hand. This is very similar to the general model-in-the-loop paradigm (Karmakharm et al., 2019; Bartolo et al., 2020, 2022; Kiela et al., 2021; Wallace et al., 2022), with the main difference being the AL-based data acquisition stage. The assumption is that iteratively selecting data for annotation according to an informativeness criterion will result in better model predictive performance compared to randomly sampling and annotating data of the same size.
However, this does not always seem to be the case. A body of work has shown that AL algorithms that make use of uncertainty (Lewis and Gale, 1994; Cohn et al., 1996; Houlsby et al., 2011; Gal et al., 2017), diversity sampling (Brinker, 2003; Bodó et al., 2011; Sener and Savarese, 2018) or even more complex acquisition strategies (Ducoffe and Precioso, 2018; Ash et al., 2020; Yuan et al., 2020; Margatina et al., 2021) often fail to improve over a simple random sampling baseline (Baldridge and Palmer, 2009; Ducoffe and Precioso, 2018; Lowell et al., 2019; Kees et al., 2021; Karamcheti et al., 2021; Snijders et al., 2023). Such findings pose a serious question about the practical usefulness of AL, as they do not corroborate its core hypothesis that not all data is equally useful for training a model. In other words, if we cannot show that one subset of the data is "better" than another, why do AL in the first place?
Only a small body of work has attempted to explore the pain points of AL. For instance, Karamcheti et al. (2021), leveraging visualisations from data maps (Swayamdipta et al., 2020), show that AL algorithms tend to acquire collective outliers (i.e., groups of examples that deviate from the rest of the examples but cluster together), thus explaining the utter failure of eight AL algorithms to outperform random sampling in visual question answering. Building on this work, Snijders et al. (2023) more recently corroborate these findings for the task of natural language inference and further show that uncertainty-based AL methods recover and even surpass random selection when hard-to-learn data points are removed from the pool. Lowell et al. (2019) show that the benefits of AL with certain models and domains do not generalize reliably across models and tasks. This could be problematic since, in practice, one might not have the means to explore and compare alternative AL strategies. They also show that a dataset actively acquired with a certain model-in-the-loop may be disadvantageous for training models of a different family, raising the issue of whether the downsides inherent to AL are worth the modest and inconsistent performance gains it tends to afford.
In this paper, we aim to explore all possible limitations that researchers and practitioners currently face when doing research on AL (Zhang et al., 2022d). We first describe the process of pool-based AL (Figure 1) and identify challenges in every step of the iterative process (§2). Next, we unearth obscure details that are often left unstated and under-explored (§3). We then delve into a more philosophical discussion of the role of simulation and its connection to real practical applications (§4). Finally, we provide guidelines for future work (§5) and conclusions (§6), aspiring to promote neglected, but valuable, ideas to improve the direction of research in active learning.

Challenges in the Active Learning Loop
We first introduce the typical steps in the pool-based AL setting (Lewis and Gale, 1994) and identify several challenges that an AL practitioner has to deal with, across all steps (Figure 2).

Problem Definition
Consider the experimental scenario where we want to model a specific NLP task for which we do not yet have any labeled data, but we have access to a large pool of unlabeled data D pool . We assume that it is unrealistic (e.g., laborious, expensive) to have humans annotate all of it. D pool constitutes the textual corpus from which we want to sample a fraction of the most useful (e.g., informative, representative) data points for human annotation. In order to perform active learning, we need an initial labeled dataset D lab , often called the "seed" dataset, to be used for training a task-specific model with supervised learning. To evaluate the model, we need a (usually small) validation set D val for model selection and a held-out test set D test to evaluate the model's generalization. We use D lab and D val to train the first model and then test it on D test .
In this stage, we start acquiring labeled data for model training. Data points are sampled from D pool via an acquisition strategy and subsequently passed to human annotators for labeling. The acquisition function selects a batch of data Q ⊂ D pool according to some informativeness criterion and can either use the model-in-the-loop or not. We employ crowdsourcing or expert annotators to label the selected batch Q, which is then appended to the labeled dataset D lab . Now that we have augmented the seed dataset with more data, we re-train the model on the new training dataset, D lab . We test the new model on D test and we stop if we obtain satisfactory performance or if the budget for annotation has run out (or using any other stopping criterion). If we do not want to stop, we use the acquisition function to select more unlabeled data from D pool , which we annotate and append to D lab , and so on. This is the AL loop shown in Figure 2.
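The loop just described can be sketched in a few lines of Python. Here `train`, `acquire` and `annotate` are hypothetical stand-ins for the model training routine, the acquisition function and the human annotation step; none of them refers to a specific library:

```python
import random

def active_learning_loop(pool, seed_size, query_size, budget,
                         train, acquire, annotate, rng=random):
    """Minimal sketch of pool-based AL: sample a seed, then iterate
    acquire -> annotate -> re-train until the budget is spent."""
    seed = rng.sample(pool, seed_size)              # cold-start seed
    labeled = annotate(seed)                        # D_lab
    unlabeled = [x for x in pool if x not in seed]  # remaining D_pool
    model = train(labeled)
    while len(labeled) < budget and unlabeled:
        query = acquire(model, unlabeled)[:query_size]  # batch Q of D_pool
        labeled += annotate(query)                      # human labeling
        unlabeled = [x for x in unlabeled if x not in query]
        model = train(labeled)                          # re-train on D_lab
    return model, labeled
```

Note that the stopping condition here is the simple budget check mentioned above; anything more principled is itself an open problem (§2.5).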

Active Learning Design
Seed dataset We start the AL loop ( §2.1) by defining an initial labeled "seed dataset" (Figure 2: 1 ). The seed dataset plays an important role, as it will be used to train the first model-in-the-loop (Tomanek et al., 2009; Horbach and Palmer, 2016). In AL research, we typically address the cold-start problem by sampling from D pool with a uniform distribution for each class, either retaining the true label distribution or choosing data that form a balanced label distribution. This is merely a convenient design choice, as it is simple and easy to implement. However, sampling the seed dataset this way does not really reflect a real-world setting, where the label distribution of the (unlabeled data of the) pool is actually unknown. Prabhu et al. (2019) performed a study of such sampling bias in AL, showing no effect of different seed datasets across the considered methods. Ein-Dor et al. (2020) also experimented with different imbalanced seed datasets, showing that AL improves over random sampling in the settings with the highest imbalance.
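To make the two conventions concrete, both can be written as a short stratified-sampling helper. Note that the helper consumes the pool's gold labels, which exist only in simulation; the function and argument names are our own:

```python
import random
from collections import defaultdict

def sample_seed(pool, labels, k, balanced=False, rng=random):
    """Simulated seed selection. balanced=True draws k/|classes| examples
    per class; otherwise uniform sampling roughly retains the true label
    distribution. Neither option exists in a real deployment, where the
    pool labels are unknown."""
    if not balanced:
        return rng.sample(pool, k)       # approximates the true distribution
    by_class = defaultdict(list)
    for x, y in zip(pool, labels):
        by_class[y].append(x)
    per_class = k // len(by_class)       # equal share per class
    return [x for xs in by_class.values() for x in rng.sample(xs, per_class)]
```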
Furthermore, the choice of the seed dataset has a direct effect on the entire AL design, because the first model-in-the-loop marks the reference point of the performance on D test . In other words, the performance of the first model is essentially the baseline, according to which a practitioner will plan the AL loop based on the goal performance and the available budget. It is thus essential to revisit existing approaches for choosing the seed dataset (Kang et al., 2004; Vlachos, 2006; Hu et al., 2010; Yuan et al., 2020) and evaluate them towards a realistic simulation of an AL experiment.

Number of iterations & acquisition budget
After choosing the seed dataset, it is natural to decide the number of iterations, the acquisition size (the size of the acquired batch Q) and the budget (the size of the actively collected D lab ) of the AL experiment. This is another part where the literature does not offer concrete guidance on the design choice. Papers that address the cold-start problem naturally focus on the first few AL iterations (Yuan et al., 2020), while others might simulate AL until a certain percentage of the pool has been annotated (Prabhu et al., 2019; Lowell et al., 2019; Zhao et al., 2020; Zhang and Plank, 2021; Margatina et al., 2022) or until a certain fixed and predefined number of examples has been annotated (Ein-Dor et al., 2020; Kirsch et al., 2021).

Model Training
We now train the model-in-the-loop with the available labeled dataset D lab (Figure 2: 2 ). Interestingly, there are not many studies that explore how we should properly train the model in the low-resource data setting of AL. Existing approaches include semi-supervised learning (McCallum and Nigam, 1998; Tomanek and Hahn, 2009; Dasgupta and Ng, 2009; Yu et al., 2022), weak supervision (Ni et al., 2019; Qian et al., 2020; Brantley et al., 2020; Zhang et al., 2022a) and data augmentation (Zhang et al., 2020; Zhao et al., 2020; Hu and Neubig, 2021), with the currently most prevalent approach being transfer learning from pretrained language models (Ein-Dor et al., 2020; Margatina et al., 2021; Tamkin et al., 2022). Recently, Margatina et al. (2022) showed large performance gains by adapting the pretrained language model to the task using the unlabeled data of the pool (i.e., task-adaptive pretraining; Gururangan et al. (2020)). The authors also proposed an adaptive fine-tuning technique to account for the varying size of D lab , showing a further increase in D test performance.
Still, there is room for improvement in this rather under-explored area, especially now that state-of-the-art NLP pretrained language models consist of many millions or even billions of parameters. In AL we often deal with a small D lab of a few hundred examples, so adapting the training strategy is not trivial.

Data Acquisition
The data acquisition step (Figure 2: 4 ) is probably the core of the AL process and can be performed in various ways. Zhang et al. (2022d) provide a thorough literature review of query strategies, dividing them into two broad families. The first is based on informativeness; methods in this family treat each candidate instance individually, assign it a score and select the top (or bottom) instances based on the ranking of the scores. Major sub-categories of methods that belong to the informativeness family are uncertainty sampling (Lewis and Gale, 1994; Culotta and McCallum, 2005; Zhang and Plank, 2021; Schröder et al., 2022), divergence-based algorithms (Ducoffe and Precioso, 2018; Margatina et al., 2021; Zhang et al., 2022b), disagreement-based methods (Seung et al., 1992; Houlsby et al., 2011; Gal et al., 2017; Siddhant and Lipton, 2018; Kirsch et al., 2019; Zeng and Zubiaga, 2023), gradient-based methods (Settles et al., 2007; Settles and Craven, 2008) and performance prediction (Roy and McCallum, 2001; Konyushkova et al., 2017; Bachman et al., 2017; Liu et al., 2018).
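For concreteness, the simplest member of this family, entropy-based uncertainty sampling, can be sketched as follows. Here `predict_proba` is a hypothetical stand-in for the model-in-the-loop's predictive distribution over classes:

```python
import math

def predictive_entropy(probs):
    """Entropy of a predictive distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sample(predict_proba, unlabeled, k):
    """Score every candidate by predictive entropy and take the top k."""
    ranked = sorted(unlabeled,
                    key=lambda x: predictive_entropy(predict_proba(x)),
                    reverse=True)
    return ranked[:k]
```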
The second family is representativeness; it takes into account how instances of the pool correlate with each other, in order to avoid the sampling bias that results from treating each instance individually. Density-based methods choose the most representative instances of the unlabeled pool (Ambati et al., 2010; Zhao et al., 2020; Zhu et al., 2008), while others opt for discriminative data points that differ from the already labeled dataset (Gissin and Shalev-Shwartz, 2019; Erdmann et al., 2019). A commonly adopted category in this family is batch diversity, where algorithms select a batch of diverse data points from the pool at each iteration (Brinker, 2003; Bodó et al., 2011; Zhu et al., 2008; Geifman and El-Yaniv, 2017; Zhdanov, 2019; Yu et al., 2022), with core-set (Sener and Savarese, 2018) being the most common approach.
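A minimal sketch of the greedy k-center heuristic behind core-set (Sener and Savarese, 2018) is shown below; `embed` and `dist` are hypothetical stand-ins for the model's representation function and a distance metric:

```python
def coreset_greedy(embed, labeled, unlabeled, k, dist):
    """Greedy k-center: repeatedly add the pool point farthest from its
    nearest already-covered point, yielding a diverse batch."""
    centers = [embed(x) for x in labeled]   # points already covered
    candidates = list(unlabeled)
    batch = []
    for _ in range(k):
        farthest = max(candidates,
                       key=lambda x: min(dist(embed(x), c) for c in centers))
        batch.append(farthest)
        centers.append(embed(farthest))
        candidates.remove(farthest)
    return batch
```

Run on 1-D toy data, the sketch picks points far from both the labeled set and each other, which is exactly the batch-diversity behavior described above.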
Naturally, there are hybrid acquisition functions that combine informativeness and representativeness (Yuan et al., 2020; Ash et al., 2020; Shi et al., 2021). Still, among the aforementioned methods there is no universally superior acquisition function that consistently outperforms all others. Thus, which data to acquire remains an active area of research.

Data Annotation
Once an acquisition function is applied to D pool , a subset Q is chosen and the corresponding unlabeled data is subsequently forwarded to human annotators for annotation (Figure 2: 5 ). In simulation-based active learning, this aspect is not the primary focus, since the labels for the actively acquired batch are already available. However, a question naturally arises: are all examples equally easy to annotate? In simulation, all instances take equally long to label. This does not account for the fact that instances that are hard for the classifier are often hard for humans as well (Hachey et al., 2005; Baldridge and Osborne, 2004); therefore the current experimental setting is limiting, and research on cost-aware selection strategies (Donmez and Carbonell, 2008; Tomanek and Hahn, 2010; Wei et al., 2019) is required. This would include explicit exploration of the synergies between random or actively acquired data and annotator expertise (Baldridge and Palmer, 2009).

Stopping Criterion
Finally, another active area of research is the development of effective methods for stopping AL (Figure 2: 3 ). In simulation, we typically decide on a budget, i.e., a number of examples or a percentage of D pool up to which we can "afford" to annotate. However, in both research and real-world applications, it is not clear when model performance has reached a plateau. The stopping criterion should not be pre-defined by a heuristic, but rather be a product of a well-designed experimental setting (Vlachos, 2008; Tomanek and Hahn, 2010; Ishibashi and Hino, 2020; Pullar-Strecker et al., 2022; Hacohen et al., 2022; Kurlandski and Bloodgood, 2022).
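To make the contrast concrete, the kind of pre-defined heuristic commonly used in simulation, stopping once test performance plateaus, can be sketched as follows; the thresholds are illustrative assumptions, not recommendations:

```python
def should_stop(test_scores, patience=3, min_delta=0.002):
    """Plateau heuristic: stop when the best test score of the last
    `patience` AL iterations fails to beat the earlier best by min_delta."""
    if len(test_scores) <= patience:
        return False
    best_recent = max(test_scores[-patience:])
    best_before = max(test_scores[:-patience])
    return best_recent < best_before + min_delta
```

A heuristic like this is brittle in exactly the way the works cited above point out: it cannot distinguish a genuine plateau from the noisy, non-monotonic test curves that are common across AL iterations.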

The Fine Print
Previously, we presented specific challenges across the different steps of the AL loop that researchers and practitioners need to address; these challenges have long attracted the attention of the research community. Interestingly, there are further caveats that someone with no AL experience might never have encountered or even imagined. Hence, in this section we aim to unveil several such small details that still remain unexplored.

Hyperparameter Tuning
A possibly major issue of the current academic status quo in AL is that researchers often do not tune the models-in-the-loop. This is mostly due to limitations related to time and compute constraints. A paper that proposes a new acquisition function is expected to run experiments for multiple baselines, iterations, random seeds and datasets. For example, a modest experiment including a = 5 acquisition functions, i = 10 AL iterations, n = 5 random seeds and d = 5 datasets would reach a minimum of a × i × n × d = 1,250 trained models in total. This makes it rather hard to perform hyperparameter tuning of all these models in every AL loop, so the norm is to use the same model architecture and hyperparameters to train all models.
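The arithmetic above, and how it explodes once tuning is added, is easy to make explicit (the grid of 20 hyperparameter configurations below is a hypothetical choice):

```python
def total_trained_models(acq_fns, al_iters, seeds, datasets, hparam_configs=1):
    """Lower bound on the number of models trained in an AL study;
    hparam_configs > 1 models the cost of tuning every model-in-the-loop."""
    return acq_fns * al_iters * seeds * datasets * hparam_configs
```

With the figures from the text, `total_trained_models(5, 10, 5, 5)` gives 1,250; tuning each model over even a modest grid of 20 configurations raises this to 25,000 training runs, which is why tuning is usually skipped.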
In reality, practitioners that want to use AL apply it once. They can therefore most likely afford to tune the one and only model-in-the-loop. The question that arises, then, is: do the findings of AL experiments that do not tune the models generalize to scenarios where all models-in-the-loop are tuned? In other words, if an AL algorithm A performs better than B according to an experimental finding, would this still be the case if we applied hyperparameter tuning to the models of both algorithms? Might it not be possible that, with another hyperparameter configuration, B would perform better in the end?

Model Stability
In parallel, another undisclosed detail is what researchers do when the models-in-the-loop are unstable (i.e., crash). This essentially means that, for some reason, the optimisation of the model might fail and the model never converges, leading to extremely poor predictive performance. Perhaps before the deep learning era such a problem did not exist, but now it is a likely phenomenon. Dodge et al. (2020) showed that many fine-tuning experiments diverged part of the way through training, especially on small datasets. AL is by definition connected with low-resource data settings, as the gains of data efficiency are meaningful precisely when labeled data is scarce. In light of this challenge, there is no consensus as to what an AL researcher or practitioner should do to alleviate this problem. One can choose to re-train the model with a different random seed, or do nothing. However, it is non-trivial to decide under which conditions one should re-train the model, since test performance does not always improve from one AL iteration to the next.
Furthermore, there is currently no study that explores how much AL algorithms that use the model-in-the-loop for acquisition suffer from this problem. For instance, consider an uncertainty-based AL algorithm that uses the predictive probability distribution of the model to select the most uncertain data points from the pool. If the model crashes, then its uncertainty estimates are not meaningful, and the data acquisition function does not work as expected. In effect, the sampling method reduces to a uniform distribution (i.e., the random sampling baseline).

Active Learning Evaluation
Another important challenge is the evaluation framework for AL. Evaluating the actual contribution of an AL method against its competitors would require performing the same iterative train-acquire-annotate experiment (Figure 1) for all AL methods in the exact same data setting and with real human annotations. Certainly, such a laborious and expensive process is prohibitive for academic research, which is why we perform simulations by treating an already labeled and open-source dataset as a pool of unlabeled data.
Still, even if we were able to perform the experiments in real life, it is not trivial to properly define when one method is better than another. This is because AL experiments include multiple rounds of annotation, and thus multiple trained models and multiple scores on the test set(s). In cases with no clear difference between the algorithms compared, how should we make a fair comparison?
Previous work presents tables comparing the test set performance of the last model, often ignoring performance in previous loops (Prabhu et al., 2019; Mussmann et al., 2020). The vast majority of previous work, though, uses plots to visualize performance over the AL iterations (Lowell et al., 2019; Ein-Dor et al., 2020), and in some cases offers a more detailed visualization with the variance due to the random seeds (Yuan et al., 2020; Kirsch et al., 2021; Margatina et al., 2021).

The Test of Time
Settles (2009) eloquently defines the "test of time" problem that AL faces: "A training set built in cooperation with an active learner is inherently tied to the model that was used to generate it (i.e., the class of the model selecting the queries). Therefore, the labeled instances are a biased distribution, not drawn i.i.d. from the underlying natural density. If one were to change model classes - as we often do in machine learning when the state of the art advances - this training set may no longer be as useful to the new model class".
Several years later, in the deep learning era, Lowell et al. (2019) indeed corroborate this concern. They demonstrate that a model from a certain family (e.g., convolutional neural networks) might perform better when trained with a random subset of a pool than with a dataset actively acquired using a model of a different family (e.g., recurrent neural networks). Interestingly, Jelenić et al. (2023) recently showed that AL methods with similar acquisition sequences produce highly transferable datasets regardless of the model architecture. Related to the "test of time" challenge, it is rarely investigated whether training data actively acquired with one model will confer benefits when used to train a second model (as compared to randomly sampled data from the same pool). Given that datasets often outlive learning algorithms, this is an important practical consideration (Baldridge and Osborne, 2004; Lowell et al., 2019; Shelmanov et al., 2021).

Active Learning in Simulated vs. Real World Settings
Is it truly logical to consider an already cleaned (preprocessed), typically published open-source labeled dataset as an unlabeled data pool for pool-based active learning simulation, with the expectation that any conclusions drawn will be applicable to real-world scenarios?
The convenience and scalability of simulation make it an undoubtedly appealing approach for advancing machine learning research. In NLP, when tackling a specific task, for instance summarization, researchers often experiment with the few labeled summarization datasets available, aiming to gain valuable insights and improve summarization models across various domains and languages. While this approach may not be ideal, it is a practical solution. What makes the sub-field of active learning different?
Admittedly, progress has been, and will continue to be, made in AL research by leveraging simulation environments, as in other areas of machine learning. Thus, there is no inherent requirement for a radically different approach in AL. We believe that simulating AL is indispensable for developing new methods and advancing the state-of-the-art.
Nonetheless, we argue that a subtle distinction should be taken into account. AL is an iterative process that aims to obtain the smallest possible amount of labeled data, given a substantially larger pool of unlabeled data, that maximizes predictive performance on a given task. The difference between developing models and constructing datasets lies in the fact that if a model is poorly trained, it can simply be retrained. Conversely, in AL there exists a finite budget for acquiring annotations, and once it is expended, there is no going back. Consequently, we must have confidence that the AL state-of-the-art established through research simulations will perform equally well in practical applications.
Given these considerations, we advocate for a more critical approach to conducting simulated AL experiments. We should address all the challenges (§2) and the experimental limitations (§3) discussed previously, while acknowledging the disparities between the simulation environment and real-world applications (§4.1). Given that datasets tend to outlast models (Lowell et al., 2019), we firmly believe that it is crucial to ensure the trustworthiness of AL research findings and their generalizability to real-world active data collection. This will contribute to the generation of high-quality datasets that stand the test of time (§3.4).

Simulation as a Lower Bound of Active Learning
The distribution gap between benchmark datasets in common ML tasks and data encountered in real-world production settings is well known (Bengio et al., 2020; Koh et al., 2021; Wang and Deng, 2018; Yin et al., 2021).
High Quality Data It is common practice for researchers to carefully curate the data to be labeled, often collecting multiple human annotations per example and discarding instances with disagreeing labels. When datasets are introduced in papers published in prestigious conferences or journals, they are expected to be of the highest quality, with an in-depth analysis of the data collection procedure, label distribution and other statistics. Nonetheless, it is important to acknowledge that such datasets may not encompass the entire spectrum of language variation encountered in real-world environments (Yin et al., 2021). Consequently, it remains uncertain whether an AL algorithm would generalize effectively to unfiltered raw data. Specifically, we hypothesize that the filtered data would be largely more homogeneous than the initial "pool". Assuming that the simulation D pool is a somewhat homogeneous dataset, we can expect that any subset of data points drawn from it would, consequently, be more or less identical. Therefore, if we train a model on each such subset, we would expect to obtain similar performance on test data due to the similarity between the training sets. From this perspective, random (uniform) sampling from a homogeneous pool can be considered a rudimentary form of diversity sampling.
Low Quality Data In contrast, it is possible that a publicly available dataset used for AL research contains data of inferior quality, characterized by outliers such as repetitive instances, inadequately filtered text, incorrect labels and implausible examples, among others. In such cases, an AL acquisition strategy, particularly one based on model uncertainty, may consistently select these instances for labeling due to their high level of difficulty and uncertainty. Previous studies (Karamcheti et al., 2021; Snijders et al., 2023) have demonstrated the occurrence of this phenomenon, which poses a significant challenge as it undermines the potential value of AL. In a real-world AL scenario, it is plausible to have a dedicated team responsible for assessing the quality of acquired data and discarding instances of subpar quality. Within the confines of a simulation, however, such data filtering is typically absent, leading to potentially misleading experimental outcomes. Snijders et al. (2023) tried to address this issue in a multi-source setting for the task of natural language inference, and showed that while uncertainty-based strategies perform poorly due to the acquisition of collective outliers, when the outliers are removed from the pool, AL algorithms exhibit a noteworthy recovery and outperform random baselines.

Simulation as an Upper Bound of Active Learning
However, one might argue for the exact opposite.
Favored Design Choices Previously, we mentioned that when selecting the seed dataset ( §2.2) we typically randomly sample data from D pool while keeping the label distribution of the true training set. Hence, a balanced seed dataset is typically obtained, given that most classification datasets tend to exhibit a balanced label distribution. In effect, the label distribution of D pool is also balanced, setting a strict constraint for AL simulation, as the actual label distribution of the unlabeled data should in reality be unknown. In other words, such subtle choices in the experimental design can introduce bias, making the simulated settings more trivial than the more challenging real-world AL settings, where there is uncertainty as to the quality and label distribution of the (typically web-crawled) data that constitute the unlabeled pool.
Temporal Drift & Model Mismatch Datasets intended for research purposes are often constructed within a fixed timeframe, with minimal consideration for temporal concept drift (Röttger and Pierrehumbert, 2021; Lazaridou et al., 2021; Margatina et al., 2023b). However, it is important to recognize that this may not align with real-world applications, where the data distribution changes over time. The random and standard splits commonly employed in AL research can lead to overly optimistic performance estimates (Søgaard et al., 2021) that may not generalize to the challenges presented by real-world scenarios. Consequently, practitioners should consider this limitation when designing their active learning experiments. Lowell et al. (2019) also raise several practical obstacles neglected in AL research, such as the possibility that the acquired dataset may be disadvantageous for training subsequent models, and conclude that academic investigations of AL typically omit key real-world considerations that might overestimate its utility.

Main Takeaways
In summary, there exist compelling arguments supporting both perspectives: simulation can serve as a lower bound by impeding the true advancement of AL methods, or it can implicitly favor the AL experimental design, thus providing an upper bound for evaluation. The validity of these arguments likely varies across different cases. We can claim with certainty that the simulation setting, as described in this paper, is a far from perfect framework for evaluating AL algorithms against one another and against random sampling. Nevertheless, we hypothesize that the lower bound argument (§4.1) may be closer to the truth. It is conceivable that AL data selection approaches exhibit similar performance levels, either due to a lack of variation and diversity in the sampled pool of data or due to the presence of outliers that are not eliminated during the iterations. Hence, we contend that simulation can be perceived as a lower bound for AL performance, which helps explain why AL methods struggle to surpass the performance of random sampling. We believe that we can only obtain such answers by exploring the AL simulation space in depth and by performing thorough analyses and extensive experiments to contrast the two theories.

Active Learning in the LLMs Era
The field of active learning holds considerable importance in the current era of Large Language Models (LLMs). AL has recently been explored as a framework to identify the most useful demonstrations for in-context learning with LLMs (Zhang et al., 2022c; Diao et al., 2023; Margatina et al., 2023a). Additionally, AL is inherently intertwined with the data-driven approaches that underpin recent advancements in artificial intelligence, such as reinforcement learning from human feedback (RLHF) (Christiano et al., 2023; OpenAI, 2022, 2023; Bai et al., 2022a). AL and RLHF represent two distinct approaches that tackle different aspects of the overarching problem of AI alignment (Askell et al., 2021). AL primarily focuses on optimizing the data acquisition process by selectively choosing informative instances for labeling, primarily within supervised or semi-supervised learning paradigms.
RLHF, on the other hand, aims to train reinforcement learning agents by utilizing human feedback as a means to surmount the challenges associated with traditional reward signals. Despite their disparate methodologies, both AL and RLHF emphasize the criticality of incorporating human involvement to enhance the performance of machine learning and AI systems. Through the active engagement of humans in the training process, AL and RLHF contribute to the development of AI systems that exhibit greater alignment with human values and demonstrate enhanced accountability (Bai et al., 2022a,b; Ganguli et al., 2022; Glaese et al., 2022; Sun et al., 2023; Kim et al., 2023). Consequently, the synergistic relationship between these two approaches warrants further exploration, as it holds the potential to leverage AL techniques to improve the data efficiency and robustness of RLHF methods.

Guidelines for Future Work
Given the inherent limitations of simulated AL settings, we propose guidelines to improve trustworthiness and robustness in AL research.
Transparency Our first recommendation is a call for transparency, which essentially means to report everything (Dodge et al., 2019). Every detail of the experimental setup, the implementation, and the results would be extremely helpful to properly evaluate the soundness of the experiments. We urge AL researchers to make use of the Appendix (or other means, such as more detailed technical reports) to communicate interesting (or not) findings and problems, so that all details (§3) are accessible.
Thorough Experimental Settings We aim to incentivize researchers to thoughtfully consider ethical and practical aspects in their experimental settings. It is crucial to compare a wide range of algorithms, striving for generalizable results and findings across datasets, tasks, and domains. Moreover, we endorse research endeavors that aim to simulate more realistic settings for AL, such as exploration of AL across multiple domains (Longpre et al., 2022; Snijders et al., 2023). Additionally, we advocate for investigations into active learning techniques for languages beyond English, as the prevailing body of research predominantly focuses on English datasets (Bender, 2011).
Evaluation Protocol We strongly encourage researchers to prioritize the establishment of fair comparisons among different methods and to provide extensive presentation of results, including the consideration of variance across random seeds, in order to ensure robustness and reliability of findings. More generally, we argue that there is room for improvement in the active learning evaluation framework, and we should explore approaches from other fields that promote more rigorous experimental and evaluation frameworks (Artetxe et al., 2020).
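As a hypothetical illustration of reporting variance across random seeds, the sketch below aggregates per-round test accuracies from multiple AL runs into a mean and sample standard deviation per acquisition round. The accuracy values are fabricated for demonstration; any real report would substitute actual results from repeated runs.

```python
import numpy as np

# Hypothetical test accuracies: rows = random seeds, columns = AL rounds.
# These numbers are illustrative placeholders, not real experimental results.
acc = np.array([
    [0.62, 0.70, 0.75, 0.79],
    [0.58, 0.68, 0.77, 0.80],
    [0.65, 0.69, 0.74, 0.78],
])

mean = acc.mean(axis=0)            # mean accuracy per round
std = acc.std(axis=0, ddof=1)      # sample std across seeds per round

for r, (m, s) in enumerate(zip(mean, std), start=1):
    print(f"round {r}: {m:.3f} +/- {s:.3f}")
```

Reporting the full learning curve as mean plus or minus standard deviation (rather than a single run's endpoint) makes comparisons between acquisition strategies, and against the random-sampling baseline, far less sensitive to seed-level noise.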
Analysis Particularly when employing AL with large-scale models, it becomes crucial to reuse the actively acquired data from prior studies as baselines, rather than re-running the entire process from scratch. Such an approach would not only enhance transparency, but also promote efficiency and eco-friendly practices within the research community.

Conclusion
In this position paper, we examine the numerous challenges encountered throughout the various stages of the active learning pipeline. Additionally, we provide a comprehensive overview of the often-overlooked limitations within the AL research community, with the intention of illuminating obscure experimental design choices. Furthermore, we delve into a thorough exploration of the limitations associated with simulation in AL, engaging in a critical discussion regarding its potential as either a lower or upper bound on AL performance. Lastly, we put forth guidelines for future research directions, aimed at enhancing the robustness and credibility of AL research for effective real-world applications. This perspective is particularly timely given the notable advancements in modeling within the field of NLP (e.g., ChatGPT, Claude). These advancements have resulted in a shift of emphasis towards a more data-centric approach in machine learning research, emphasizing the significance of carefully selecting relevant data to enhance models and ensure their alignment with human values.

Limitations
In this position paper, we have strived to provide a comprehensive overview, acknowledging that there may be relevant research papers that have inadvertently escaped our attention. While we have made efforts to include a diverse range of related work from various fields, such as machine learning and computer vision, it is important to note that our analysis predominantly focuses on AL papers presented at NLP conferences. Moreover, it is worth mentioning that the majority, if not all, of the AL papers examined and referenced in this survey are centered around the English language, thereby limiting the generalizability and applicability of our findings and critiques to other languages and contexts. We wish to emphasize that the speculations put forth in this position paper carry no substantial risks, as they are substantiated by peer-reviewed papers, and our hypotheses (§4) are explicitly stated as such, representing conjectures rather than definitive findings regarding the role of simulation in AL research. We sincerely hope that this paper stimulates robust discussions and undergoes thorough scrutiny by experts in the field, with the ultimate objective of serving as a valuable guideline for AL researchers, particularly graduate students, seeking to engage in active learning research. Above all, we earnestly urge researchers equipped with the necessary resources to conduct experiments and analyses that evaluate our hypotheses, striving to bridge the gap between research and real-world settings in the context of active learning.

Figure 1: High-level overview of the train-acquire-annotate steps of the active learning loop.

Figure 2: Distinct steps of the active learning loop (1-6). We use blue for the unlabeled data, purple for the labeled data, and red for the (labeled) test data.
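The simulated train-acquire-annotate loop described above can be sketched in a few lines. The following is a minimal illustration, not any specific system from the literature: it uses synthetic two-cluster data, a toy logistic regression fit by gradient descent, and least-confidence uncertainty sampling as the acquisition function. Note how "annotation" reduces to reading labels off an already labeled dataset, which is exactly the simulation setting this paper scrutinizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, lr=0.1, epochs=200):
    # Toy logistic regression fit by full-batch gradient descent (illustrative only).
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def least_confidence(probs):
    # Uncertainty sampling: pick the instance whose prediction is closest to 0.5.
    return int(np.argmin(np.abs(probs - 0.5)))

# Simulated "unlabeled" pool: a fully labeled synthetic dataset whose labels
# are only revealed when an instance is queried.
X_pool = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [2, 2]], 100, axis=0)
y_pool = np.repeat([0, 1], 100)
X_test = rng.normal(size=(100, 2)) + np.repeat([[0, 0], [2, 2]], 50, axis=0)
y_test = np.repeat([0, 1], 50)

# Seed set, then iterate: (1) train, (2) acquire, (3) "annotate".
labeled = list(rng.choice(len(X_pool), size=10, replace=False))
pool = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(20):  # 20 acquisition rounds, one instance per round
    w, b = train_logreg(X_pool[labeled], y_pool[labeled])
    probs = predict_proba(X_pool[pool], w, b)
    picked = pool.pop(least_confidence(probs))
    labeled.append(picked)  # the "oracle" label is simply read off the dataset

w, b = train_logreg(X_pool[labeled], y_pool[labeled])
acc = ((predict_proba(X_test, w, b) > 0.5) == y_test).mean()
print(f"labeled: {len(labeled)}, pool: {len(pool)}, test acc: {acc:.2f}")
```

In a real deployment, the `labeled.append(picked)` step would instead trigger human annotation of a genuinely unlabeled instance, and the pool would carry the noise, outliers, and class imbalance that curated benchmark datasets have already had filtered out.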