ALToolbox: A Set of Tools for Active Learning Annotation of Natural Language Texts

We present ALToolbox, an open-source framework for active learning (AL) annotation in natural language processing. Currently, the framework supports text classification, sequence tagging, and seq2seq tasks. Besides state-of-the-art query strategies, ALToolbox provides a set of tools that help to reduce the computational overhead and duration of AL iterations and to increase the reusability of annotated data. The framework aims to support data scientists and researchers by providing an easy-to-deploy GUI annotation tool directly in the Jupyter IDE and an extensible benchmark for novel AL methods. A small demonstration of ALToolbox capabilities is available online.1,2 The code of the framework is published under the MIT license.3


Introduction
The development of text processing applications based on machine learning (ML) usually requires a lot of labeled data. Despite numerous annotated corpora designed for various tasks and available for resource-rich languages, in practice, the business logic of an application is often very specific and cannot be implemented using only public resources. Manual annotation of natural language corpora is a tedious and time-consuming task, which can take up to 30-40% of the application development time.
For rather simple tasks, corpus annotation can be organized via crowd-sourcing. However, crowd-sourcing is not suitable for specific domains like medicine, finance, information technology, or any other field that requires specific qualifications or knowledge of business logic. It is also problematic to apply crowd-sourcing when the annotation scheme is complex and requires preliminary training of annotators. In each of the aforementioned cases, annotating each instance becomes expensive because it requires hiring people with a high qualification or a specific skill set.
Active learning (AL) is a well-known technique that speeds up data annotation by leveraging model output for selecting instances demonstrated to human experts (Cohn et al., 1996; Settles and Craven, 2008). It focuses human effort on instances that are the most informative for model training, decreasing redundancy and filtering out noisy outliers. AL helps to achieve a certain level of model performance using only a fraction of the labor required to exhaustively annotate a given dataset.
In this work, we present ALToolbox, an open-source framework that contains a comprehensive set of tools for practical AL annotation in text classification, sequence tagging, and seq2seq tasks. The main goal of the framework is to support data scientists and researchers: they usually need to test new ideas very quickly, and the lack of annotation is a common obstacle to this. ALToolbox aims to address several practical obstacles to deploying AL: (1) data annotated with AL should be reusable; (2) AL should not consume excessive computational resources, while the annotation process should be interactive without delays for annotators; (3) annotation should be quick and fluent.
(1) Instances selected with AL that are informative for one model can be uninformative for a model of a different type. This hurts the reusability of data annotated with the help of AL. For example, Lowell et al. (2019) show that if we use predictions of one model for selecting instances during AL but train a model of a different type on the selected data, the performance of the latter can be even worse compared to the case when it is trained on data labeled without AL. Lowell et al. (2019) call this effect the acquisition-successor mismatch (ASM) problem, where the acquisition model is the model used for selecting instances during AL and the successor is the model trained on the labeled data for the final application. To address this problem, we include in the framework several pipelines for the preparation of acquisition models and for post-processing of data annotated with the help of AL. These pipelines leverage the Pseudo-Labeling for the Acquisition-Successor Mismatch (PLASM) algorithm based on the effect of knowledge distillation (Hinton et al., 2015) in AL revealed by Shelmanov et al. (2021) and Tsvigun et al. (2022b). PLASM effectively mitigates ASM, making data collected with AL reusable for training models of various architectures.
(2) Applying AL is not free. It introduces additional computational overhead, which usually stems from training an acquisition model and performing its inference. For resource-intensive models such as modern neural networks, this overhead might be prohibitive due to the cost of GPU-accelerated computations for their training and inference. Due to the ASM problem, it is not possible to simply replace a resource-intensive model (e.g., ELECTRA) with a small one (e.g., DistilBERT). PLASM addresses this problem and makes it possible to use small versions of acquisition models obtained via distillation, which speeds up training and inference. ALToolbox also implements an unlabeled pool subsampling algorithm, which leverages the uncertainty of instances to avoid repetitive predictions on part of the unlabeled pool, speeding up the inference phase of AL iterations (Tsvigun et al., 2022b).
(3) AL itself speeds up the annotation procedure, but the time required for deploying an AL-empowered annotation system and integrating annotation with existing data processing pipelines can diminish its benefits. Removing obstacles between the data processing workflow and annotation tools can facilitate rapid evaluation of new ideas. Therefore, in ALToolbox, besides a set of state-of-the-art query strategies, we also provide a serverless AL-empowered annotation tool that is natively integrated directly into the Jupyter Notebook IDE. This tool is suitable for labeling small datasets and testing new ideas quickly, which, we believe, is useful for data scientists and researchers. It is easy to launch and fully integrated with the familiar IDE, while also being flexible and extensible.
There are many UI-centric academic and commercial annotation systems for end users that support AL annotation: WebAnno (Yimam et al., 2013), AlpacaTag (Lin et al., 2019), Paladin (Nghiem et al., 2021), ActiveAnno (Wiechmann et al., 2021), FAMIE (Van Nguyen et al., 2022), Prodigy (Montani and Honnibal, 2018) (a commercial system), and others. However, they lack many practical features that serve the goals of rapid annotation, compatibility with data analysis pipelines and IDEs, and reusability of the annotated data. There are also several low-level AL packages that focus on providing various query strategies and can serve as building blocks for more elaborate systems: LibAct (Yang et al., 2017), ModAL (Danka and Horvath, 2018), Baal (Atighehchian et al., 2020), and Small-Text (Schröder et al., 2021). However, most of them also overlook the problems of reusability and computational efficiency. Only Small-Text is specifically tailored to NLP tasks.
The contributions of the proposed framework are:
• a comprehensive collection of state-of-the-art query strategies for sequence tagging, text classification, and seq2seq tasks;
• a benchmarking tool for experimental evaluation of novel AL methods;
• pipelines for acquisition model preparation and for data post-processing that provide reusability of annotated data and computational efficiency of AL;
• a serverless GUI for AL annotation integrated directly into the Jupyter IDE.

Framework Description
The ALToolbox framework is a Python library with several executable scripts, as well as a Jupyter widget implemented in JavaScript. In this section, we describe the key features of the framework.

Query Strategies
One of the key components of an AL pipeline is a query strategy that specifies which instances are selected for annotation. ALToolbox provides classical and state-of-the-art query strategies for text classification, sequence tagging, and seq2seq tasks. The classical uncertainty-based strategies include Least Confidence (LC), Maximum Normalized Log-Probability (MNLP) (Shen et al., 2017), Breaking Ties (BT) (Luo et al., 2004), Prediction Entropy (PE) (Roy and McCallum, 2001), and Normalized Sequence Probability (NSP) (Ueffing and Ney, 2007). Since the predictive distribution of a single deterministic neural network cannot be used to obtain reliable uncertainty scores (Sener and Savarese, 2018; Mukhoti et al., 2021), some works have ventured into the development of Bayesian query strategies (Siddhant and Lipton, 2018). ALToolbox implements one of the widely adopted strategies, Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011). It selects instances that would provide the most information about the true model parameters if their true labels were known. In practice, the strategy approximates variational inference in a Bayesian neural network using Monte-Carlo dropout (Gal and Ghahramani, 2016). ALToolbox also includes a batched version of BALD, BatchBALD (Kirsch et al., 2019), which is modified to jointly score and select multiple instances for annotation on each AL iteration.
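To make these scores concrete, the following is a minimal sketch, in plain NumPy and with illustrative function names rather than ALToolbox's actual API, of how LC, BT, PE, and BALD can be computed from model probabilities:

```python
import numpy as np

def least_confidence(probs: np.ndarray) -> np.ndarray:
    # probs: softmax outputs of shape (n_samples, n_classes).
    # Higher score = less confident top prediction.
    return 1.0 - probs.max(axis=1)

def breaking_ties(probs: np.ndarray) -> np.ndarray:
    # Small margin between the two most likely classes = high uncertainty.
    part = np.sort(probs, axis=1)
    return -(part[:, -1] - part[:, -2])  # negated so larger = more uncertain

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def bald(mc_probs: np.ndarray) -> np.ndarray:
    # mc_probs: (n_mc_runs, n_samples, n_classes) from stochastic forward
    # passes with dropout enabled (Monte-Carlo dropout).
    mean_p = mc_probs.mean(axis=0)
    entropy_of_mean = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)
    mean_of_entropy = -(mc_probs * np.log(mc_probs + 1e-12)).sum(axis=2).mean(axis=0)
    # Mutual information between predictions and model parameters.
    return entropy_of_mean - mean_of_entropy
```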
An alternative to uncertainty sampling is diversity-based sampling. In this category, the coreset algorithm (Sener and Savarese, 2018) leverages data geometry and aims to minimize the bound between the average loss over any given subset of the dataset and the remaining data points. The recently proposed Contrastive Active Learning (CAL) prioritizes instances whose predictive likelihoods diverge the most from their neighbors in the training set (Margatina et al., 2021). The Cluster-Margin algorithm (Citovsky et al., 2021) is designed to select large batches for annotation; it prioritizes instances that are diverse and that the model is not confident about. BERT-KM (Yuan et al., 2020) clusters texts in the unlabeled pool using their contextualized embeddings and selects the nearest neighbors of cluster centers. Active Learning by Processing Surprisal (ALPS) (Yuan et al., 2020) leverages pre-trained models, a self-supervised learning objective, and clustering to solve the cold-start problem in AL. AcTune (Yu et al., 2022) can be used as a wrapper over uncertainty-based query strategies: it selects the most uncertain instances from regions obtained by clustering the unlabeled pool, ranking them by uncertainty and diversity.
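As an illustration of the diversity-based family, here is a minimal sketch of BERT-KM-style selection, assuming `embeddings` holds [CLS] or mean-pooled representations of the unlabeled pool; the function name and interface are our own, not ALToolbox's API:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def bert_km_select(embeddings: np.ndarray, query_size: int) -> np.ndarray:
    # Cluster the pool into `query_size` groups in embedding space.
    km = KMeans(n_clusters=query_size, n_init=10).fit(embeddings)
    # Pick the pool instance closest to each cluster center (a simplification:
    # duplicates are possible if two centers share a nearest neighbor).
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, embeddings)
    return closest  # indices into the unlabeled pool
```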
ALToolbox also contains several gradient-based query strategies. Expected Gradient Length (EGL) aims to prioritize instances that would impart the greatest change to the current model if they were added to the training set with their labels (Settles et al., 2007). Batch Active Learning by Diverse Gradient Embeddings (BADGE) measures uncertainty as the gradient magnitude with respect to parameters in the final (output) layer (Ash et al., 2020). Batch Active learning via Information maTrices (BAIT) selects batches of instances by optimizing a bound on the MLE error in terms of the Fisher information (Ash et al., 2021).
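For intuition, the following sketch shows how BADGE's gradient embeddings can be formed for a softmax classifier with cross-entropy loss. The closed form (the outer product of the penultimate features with the prediction-error vector) is standard, but the function itself is an illustrative simplification of Ash et al. (2020):

```python
import numpy as np

def badge_embeddings(hidden: np.ndarray, probs: np.ndarray) -> np.ndarray:
    # hidden: penultimate-layer features, shape (n_samples, d).
    # probs:  softmax outputs, shape (n_samples, C).
    n, d = hidden.shape
    c = probs.shape[1]
    pseudo = probs.argmax(axis=1)
    # Gradient of cross-entropy w.r.t. the output layer, using the model's
    # own prediction as a pseudo-label: (p - onehot(pseudo)) outer h.
    delta = probs.copy()
    delta[np.arange(n), pseudo] -= 1.0
    return (delta[:, :, None] * hidden[:, None, :]).reshape(n, c * d)
    # BADGE then selects a diverse, high-magnitude batch by running
    # k-means++ seeding over these embeddings.
```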
Furthermore, ALToolbox provides several query strategies for seq2seq tasks. NSP (Ueffing and Ney, 2007) is an analogue of LC for text generation, which calculates the length-normalized total probability of a generated sequence. ENSP (Wang et al., 2019) makes several stochastic runs using Monte-Carlo dropout and averages the probabilities of the sequences. The BLEUVar (Xiao et al., 2020) algorithm strives to measure the variance of texts generated under Monte-Carlo dropout using the BLEU metric (Papineni et al., 2002). The IDDS (Tsvigun et al., 2022a) strategy, shown to be state-of-the-art for the abstractive text summarization task, selects instances that are semantically dissimilar from the already annotated instances, avoiding outliers and borderline instances.
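As an example of how an NSP-style score can be computed, here is a minimal sketch for a HuggingFace seq2seq model such as BART; the function name and the generate-then-rescore shortcut are our illustrative assumptions, not ALToolbox's actual implementation:

```python
import torch

@torch.no_grad()
def normalized_sequence_logprob(model, tokenizer, source: str) -> float:
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    generated = model.generate(**inputs, max_new_tokens=64)
    labels = generated[:, 1:]  # drop the decoder start token
    # Re-score the generated sequence with a single forward pass.
    out = model(**inputs, labels=labels)
    logprobs = out.logits.log_softmax(dim=-1)  # (1, seq_len, vocab)
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Mean over tokens = length-normalized log-probability; low values
    # indicate uncertain candidates for annotation.
    return token_logprobs.mean().item()
```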
Finally, the framework provides the ability to use different strategies on different AL iterations. For example, one could use a cold-start method (e.g., ALPS) for the first several iterations and later switch to another strategy such as LC.
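A minimal sketch of such switching, with the strategies passed in as placeholder callables since the exact ALToolbox interface is not shown here:

```python
def choose_query_strategy(iteration: int, cold_start_strategy, main_strategy,
                          cold_start_iters: int = 3):
    """Return the query strategy to use on the given AL iteration."""
    # Use a cold-start method (e.g., ALPS) while labeled data is scarce,
    # then switch to an uncertainty-based strategy such as LC.
    if iteration < cold_start_iters:
        return cold_start_strategy
    return main_strategy
```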

Supported Models
ALToolbox is compatible with the HuggingFace Transformers library (Wolf et al., 2020), allowing the usage of state-of-the-art Transformer models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), ELECTRA (Clark et al., 2020), and others. The support of some older RNN-based models like CNN-BiLSTM-CRF (Ma and Hovy, 2016) for sequence tagging is implemented via a wrapper around the Flair library (Akbik et al., 2019). Users can also implement their own models directly in PyTorch. ALToolbox provides several custom neural model implementations in PyTorch, including the classical CNN for text classification (Le et al., 2018).
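For reference, here is a minimal PyTorch sketch of such a classical CNN text classifier (a Kim-style architecture with max-over-time pooling); this is illustrative and not necessarily the exact implementation shipped with ALToolbox:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_classes=4,
                 kernel_sizes=(3, 4, 5), n_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.classifier = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb, seq)
        # Max-over-time pooling for each kernel size, then concatenate.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classifier(torch.cat(pooled, dim=1))
```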

Jupyter Annotation Tool
ALToolbox provides a simple serverless tool with a GUI for AL annotation integrated directly into Jupyter Notebook, which is one of the most popular IDEs for the Python language and data analysis (Figure 1). It supports annotation for text classification and for sequence tagging tasks like named entity recognition and event extraction.
The tool is implemented using Jupyter widgets, a built-in feature of the Jupyter IDE for creating extensions. The widget can be configured with various AL query strategies and models, including Transformers. After the tool object is invoked, the IDE displays the widget in a notebook cell, and AL annotation begins. For example, to add a NER annotation, a user can select the corresponding text fragment with the mouse and add a label to it. For text classification, a label can be chosen from a predefined list via selectable buttons. On each iteration, the user receives instances for annotation in mini-batches. The user can annotate all or just a part of them and invoke the next iteration of the AL algorithm with the "Next iteration" button, asking for a new mini-batch of unlabeled instances.
The annotation tool performs all necessary computations asynchronously with the GUI and returns new instances without any delay. It keeps a list of instances sorted by their "informativeness" and updates it in the background as soon as possible.
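A minimal sketch of this asynchronous pattern, purely illustrative of the idea rather than the tool's actual code: batches are served from an already-ranked queue while re-scoring happens in a background thread.

```python
from concurrent.futures import ThreadPoolExecutor

class BackgroundRanker:
    """Serve batches from a pre-ranked queue; re-rank in the background."""

    def __init__(self, score_fn, pool):
        self.score_fn = score_fn   # e.g., an uncertainty query strategy
        self.ranked = list(pool)   # current "informativeness" ordering
        self._executor = ThreadPoolExecutor(max_workers=1)

    def next_batch(self, k):
        # Returned immediately: the annotator never waits for the model.
        batch, self.ranked = self.ranked[:k], self.ranked[k:]
        return batch

    def refresh(self, model):
        # Re-score the remaining pool asynchronously and swap in the new
        # ordering once it is ready.
        future = self._executor.submit(self.score_fn, model, self.ranked)
        future.add_done_callback(lambda f: setattr(self, "ranked", f.result()))
```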
The user can interrupt annotation at any time and resume it later. The tool persists changes made by the user to the hard drive: the annotation is accumulated in easy-to-parse JSON files.
The target audience of this tool is data scientists and researchers. It is very easy to launch and modify: new graphical elements can be added using Jupyter widgets as well. Using Jupyter also helps to reduce the effort of combining the system with data processing pipelines. We believe this tool can be useful for rapid annotation in small to medium projects and for testing new ideas quickly. However, we note that it lacks many useful features of full-fledged annotation systems, e.g., the ability to work with multiple users simultaneously. Creating a complex GUI for annotation is out of the scope of this project since a wide range of similar projects have already been released, e.g., doccano (Nakayama et al., 2018), Label Studio (Tkachenko et al., 2020-2022), and ActiveAnno (Wiechmann et al., 2021). The ALToolbox framework can be easily integrated into such annotation systems with the help of an API.

Tools for Computationally Efficient Active Learning and Reusable Annotation
ALToolbox contains a set of scripts that help to improve the computational efficiency of AL while keeping annotated data reusable. AL requires a substantial amount of computation on each iteration, which depends on the complexity and size of the acquisition model. Using smaller and lighter models can lead to performance degradation of AL due to the ASM problem discussed in the introduction. We mitigate this problem by implementing tools for the Pseudo-Labeling for Acquisition-Successor Mismatch (PLASM) algorithm (Tsvigun et al., 2022b). This algorithm leverages small distilled models (e.g., DistilBERT) during the acquisition of instances, but after the annotation is finished, it trains the original full-sized model (e.g., BERT) on the acquired data and uses it for automatic pseudo-labeling of the whole unlabeled pool of instances. Mistakes in the automatic annotation are cleaned with the help of the TracIn method (Pruthi et al., 2020). Finally, the successor model is trained on the data that contains both gold-standard labels and cleaned automatically labeled instances. PLASM reduces or completely removes the gap in performance that appears when the successor model is different from the acquisition model. It makes the annotated data reusable for training successor models of various architectures. ALToolbox provides scripts for automatic model distillation and a pipeline for data post-processing with PLASM. All the necessary post-processing can be done by invoking a single function.
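The following sketch mirrors these steps, with the training, prediction, and noise-filtering operations passed in as placeholder callables; in ALToolbox itself the whole pipeline is invoked with a single function call:

```python
def plasm_postprocess(labeled, unlabeled_pool, successor_model,
                      full_model, train, predict, filter_noisy):
    # 1. Train the full-sized "labeling" model on the gold annotations.
    labeling_model = train(full_model, labeled)
    # 2. Pseudo-label the whole unlabeled pool with it.
    pseudo_labeled = predict(labeling_model, unlabeled_pool)
    # 3. Clean likely mislabeled instances, e.g., with TracIn-style
    #    influence scores.
    cleaned = filter_noisy(pseudo_labeled, labeling_model)
    # 4. Train a successor of an arbitrary architecture on gold labels
    #    plus the cleaned pseudo-labels, mitigating ASM.
    return train(successor_model, labeled + cleaned)
```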
For large datasets, making predictions for the whole unlabeled set on each iteration to obtain uncertainty estimates may require an enormous amount of time and resources. Consequently, the framework also implements the unlabeled pool subsampling (UPS) algorithm (Tsvigun et al., 2022b), which samples instances from the unlabeled pool according to their uncertainty estimates on previous iterations.
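A simplified sketch of such uncertainty-driven subsampling; the exploration fraction and exact ranking details here are our assumptions, and the actual UPS algorithm is described in Tsvigun et al. (2022b):

```python
import numpy as np

def subsample_pool(prev_uncertainty: np.ndarray, budget: int,
                   explore_frac: float = 0.2, rng=np.random) -> np.ndarray:
    n_explore = int(budget * explore_frac)
    n_exploit = budget - n_explore
    # Keep the instances that looked most uncertain on the previous
    # iteration...
    top = np.argsort(-prev_uncertainty)[:n_exploit]
    # ...plus a random slice so newly informative instances can resurface.
    rest = np.setdiff1d(np.arange(len(prev_uncertainty)), top)
    explore = rng.choice(rest, size=min(n_explore, len(rest)), replace=False)
    return np.concatenate([top, explore])  # indices to score this iteration
```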
Figures 19a, 20a, and 22a show that the performance of the successor model does not deteriorate when these algorithms are used. Figures 19b, 20b, and 22b, in turn, show that the ASM problem leads to a substantial decrease in model performance.
We also provide scripts for domain adaptation of acquisition models. Margatina et al. (2022) demonstrate that self-supervised adaptation (Gururangan et al., 2020) of pre-trained Transformers on the unlabeled pool of instances helps to speed up AL.
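Such self-supervised adaptation can be sketched with standard HuggingFace components as masked language modeling on the pool texts; this is a minimal illustration of task-adaptive pretraining assuming an MLM-style backbone, not ALToolbox's exact script:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

def adapt_on_pool(model_name, pool_texts, out_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    # Tokenize the unlabeled pool for the MLM objective.
    ds = Dataset.from_dict({"text": pool_texts}).map(
        lambda b: tokenizer(b["text"], truncation=True, max_length=256),
        batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(out_dir, num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds, data_collator=collator)
    trainer.train()
    trainer.save_model(out_dir)  # adapted checkpoint for acquisition
```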

Benchmarking Tool for Query Strategies
ALToolbox provides an extensible and easy-to-use benchmarking tool for testing new AL query strategies and unlabeled pool subsampling strategies. To experiment with a new strategy, a user implements it in the form of a Python class and runs the evaluation script, specifying the path to the corresponding class module as an argument. The script performs several iterations of simulated AL annotation and plots the dependence of model performance scores on the size of the labeled data. Experiments are launched multiple times with different random seeds to obtain confidence intervals for the results.
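A hypothetical example of such a user-defined strategy class is shown below; the `select` interface and its arguments are illustrative assumptions, since the exact signature is defined by ALToolbox:

```python
import numpy as np

class RandomSampling:
    """Baseline query strategy: pick instances uniformly at random."""

    def __init__(self, seed: int = 42):
        self.rng = np.random.default_rng(seed)

    def select(self, model, unlabeled_pool, query_size: int) -> np.ndarray:
        # Return indices of instances to send for annotation. A real
        # strategy would score `unlabeled_pool` with `model` here.
        return self.rng.choice(len(unlabeled_pool), size=query_size,
                               replace=False)
```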
Using this tool, we provide evaluation results for the implemented query strategies, which can be used as a reference. The experiments with text classification are conducted on AG News (Zhang et al., 2015), IMDB (Maas et al., 2011), and CoLA (Warstadt et al., 2018); with sequence tagging, on CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003); with abstractive text summarization, on AESLC (Zhang and Tetreault, 2019), WikiHow (Koupaee and Wang, 2018), and PubMed (Cohan et al., 2018). We provide results with big and lightweight Transformers and with several different query sizes:
• Selecting k% of instances (for text classification and abstractive text summarization) / tokens (for sequence tagging). In this setting, we randomly select and annotate k% of instances / tokens as the initial seed and select k% of instances / tokens for annotation on each AL iteration according to the query function. This configuration aims to benchmark strategies in a high-resource AL mode. We refer to it as query size = k%.
• Selecting 100 instances / tokens on each AL iteration and as the initial seed. This configuration aims to benchmark strategies in a medium-resource AL mode. We refer to it as query size = 100.
• Selecting 10 instances / tokens on each AL iteration. The initial seeding procedure differs between tasks in this mode. For text classification, we randomly select and annotate one instance of each class as the initial seed. For other tasks, we annotate 10 randomly chosen instances / tokens. This configuration aims to benchmark strategies in a low-resource AL mode. We refer to it as query size = 10.
Dataset statistics, model details, and hyperparameters are presented in Tables 4-6. Sequence tagging results are presented in Tables 19-23. MNLP demonstrates the best quality in terms of the F1-micro score excluding the "no entity" tag ("O"). Figures 11-13 show the iteration-wise scores. The duration of computations for various strategies is presented in Figure 23.
For abstractive text summarization, due to the big size of the unlabeled pools of WikiHow and PubMed, on each AL iteration we randomly subsample the unlabeled pool to 10,000 instances. Tables 24-27 provide the average results throughout the AL cycle and results on several iterations, while Figures 15-18 illustrate the results during the entire AL cycle. Finally, Figure 14 compares the execution time of the seq2seq query strategies.

Related Work
The comparison of ALToolbox with other frameworks from the related work on AL in NLP is presented in Table 1.
First of all, ALToolbox supports the two most in-demand NLP tasks: text classification and sequence tagging. It also works with abstractive text summarization, which is a seq2seq task. The other frameworks support only one of these tasks: Paladin, ActiveAnno, and Small-Text work only with text classification, while AlpacaTag and FAMIE support only sequence tagging.
Table 2 compares AL frameworks by implemented query strategies. Paladin, ActiveAnno, and AlpacaTag implement only the basic strategies. FAMIE implements several modern methods like ALPS and BADGE, but lacks many others. We note that Small-Text implements many recently proposed query strategies, including CAL, BADGE, and BERT-KM. However, ALToolbox provides the most comprehensive set of state-of-the-art query strategies and also allows combining them.
Except for ALToolbox and FAMIE (Van Nguyen et al., 2022), the computational overhead and the time delays caused by AL have been largely overlooked in prior work. FAMIE trains a bigger model in the background during the labeling of each batch while using a smaller one as a proxy for acquisition. Such knowledge distillation makes the AL annotation process more interactive but also carries an additional computational burden, requiring extra resources for training and running two models. On the contrary, the knowledge distillation within our framework reduces both the time needed to complete an AL iteration and the overall amount of computation.
We note that neither FAMIE nor other frameworks address the ASM problem that hinders the reusability of annotated data. The tools for model distillation and annotated data post-processing based on the PLASM algorithm in our framework help to mitigate the ASM, so a user, for example, can train XLNet using data acquired with DistilBERT without significant performance penalties.
Most of the considered systems provide an elaborate GUI for annotation by end users. Our framework aims to support data scientists and researchers and provides a fast-to-deploy, minimalistic annotation system directly in the Jupyter IDE.
None of the considered systems provides easy-to-use scripts for conducting experiments with new AL methods. ALToolbox implements an extensible benchmarking tool that we hope will simplify research on AL for NLP.
One problem that is currently out of the scope of ALToolbox is the efficient assignment of tasks to multiple annotators. Proactive learning, implemented in Paladin, addresses this problem. We consider this feature future work.

Conclusion
We introduced ALToolbox, an open-source framework for practical AL in NLP. Among many other features, the framework addresses the problems of computational efficiency of AL and data reusability. We hope that our framework will foster the development of new AL methods and remove some practical obstacles to deploying AL annotation.
In future work, we plan to add support for more text generation tasks, introduce proactive learning, and provide tools for hyperparameter selection in AL.

Table 6 :
Hyperparameter values of Transformers. The hyperparameters are chosen according to evaluation scores on the validation datasets when the models are trained using all available training data. Adapt refers to adaptive length, where the maximum generation length is equal to the maximum summary length on the training set.

B Query Strategy Benchmark
For the tables in this section, we highlight in bold the state-of-the-art results with respect to the confidence intervals. When all the values are within the confidence interval, we only highlight the largest average value. The results are averaged over 10 runs with different seeds for query size = 10 and over 5 runs for the other query size settings to ensure stability. The Average column refers to the average result throughout the AL cycle.

Figure 1 :
Figure 1: Serverless GUI annotation tool integrated into the Jupyter IDE.

Figure 2 :
Figure 2: Duration in seconds of all the training and inference phases of the simulated AL with different acquisition settings on AG News with query size = 1% and 15 AL iterations. ELECTRA is used as the successor model, and DistilBERT is used for acquisition in PLASM.

Figure 14 :
Figure 14: Average duration in seconds of one AL query with different strategies on AESLC with BART as the acquisition model and query size = 10. The hardware configuration is provided in Appendix C.

Figure 15 :
Figure 15: ROUGE scores of the best performing query strategies with BART as the acquisition model on AESLC with query size = 10.

Figure 20 :
Figure 20: IMDB dataset: performance of the PLASM and UPS algorithms compared to classic AL and acquisition-successor mismatch (ASM) settings. For all the experiments, RoBERTa is used as the successor model (and therefore as the acquisition model in "classic AL" as well), and DistilELECTRA is used for acquisition in PLASM and ASM.

Table 1 :
Comparison of NLP-related AL frameworks.

Table 2 :
The comparison of AL frameworks by implemented query strategies.

Table 3 :
Accuracy of RoBERTa on AG News with various AL strategies on several AL iterations with query size = 1% (1200 instances). Average refers to the average result throughout the AL cycle. We highlight in bold the state-of-the-art results with respect to confidence intervals. The results are averaged over 5 runs with different seeds to ensure stability.

Table 3 depicts the results on AG News with RoBERTa-base as the acquisition model and query size = 1%. We can see that most of the strategies perform roughly similarly, with CAL and LC showing the best performance across all AL iterations. Figure 3 also demonstrates the results throughout the whole AL cycle of the best-performing query strategies according to the average accuracy throughout the AL cycle. Figure 4 provides a comparison of the duration of computations for various query strategies. Tables 7-18 compare query strategies on text classification datasets for various settings and models. Figures 5-10 visualize the results of the best-performing query strategies.

Table 4 :
Dataset statistics. We provide the number of instances / tokens (for sequence tagging) for the training and test sets and the average document length in tokens. C is the number of classes / entity types for text classification and sequence tagging datasets.

Table 7 :
Accuracy of RoBERTa on AG News with various AL strategies with query size = 100.

Figure 3 :
Figure 3: Accuracy of the best performing query strategies according to average accuracy throughout the AL cycle (BT, CAL, and LC) on AG News with RoBERTa with query size = 1%.

Figure 4 :
Figure 4: Average duration in seconds of one AL query with different strategies on AG News with RoBERTa as the acquisition model and query size = 1% (1200 instances). The hardware configuration is provided in Appendix C.

Table 8 :
Accuracy of DistilBERT on AG News with various AL strategies with query size = 100.

Table 9 :
Accuracy of RoBERTa on AG News with various AL strategies with query size = 10.

Table 10 :
Accuracy of DistilBERT on AG News with various AL strategies with query size = 10.

Accuracy of the best performing query strategies with different acquisition models on IMDB with query size = 100.

Table 11 :
Accuracy of RoBERTa on IMDB with various AL strategies with query size = 100.

Table 12 :
Accuracy of DistilBERT on IMDB with various AL strategies with query size = 100.

Table 13 :
Accuracy of RoBERTa on IMDB with various AL strategies with query size = 10.

Table 20 :
Overall F1-micro score of ELECTRA on CoNLL-2003 with various AL strategies with query size = 100 (tokens).

Table 22 :
Overall F1-micro score of ELECTRA on CoNLL-2003 with various AL strategies with query size = 10 (tokens).

Table 24 :
ROUGE scores of BART on AESLC with various AL strategies with query size = 10.

Table 25 :
ROUGE scores of PEGASUS on AESLC with various AL strategies with query size = 10.

Table 26 :
ROUGE scores of BART on WikiHow with various AL strategies with query size = 10.

Figure 18 :
Figure 18: ROUGE scores of the best performing query strategies with BART as the acquisition model on PubMed with query size = 10.

Table 27 :
ROUGE scores of BART on PubMed with various AL strategies with query size = 10.