nerblackbox: A High-level Library for Named Entity Recognition in Python

We present **nerblackbox**, a Python library that facilitates the use of state-of-the-art transformer-based models for named entity recognition. It provides simple-to-use yet powerful methods to access data and models from a wide range of sources, for fully automated model training and evaluation as well as versatile model inference. While many technical challenges are solved and hidden from the user by default, **nerblackbox** also offers fine-grained control and a rich set of customizable features. It is thus targeted at application-oriented developers as well as machine learning experts and researchers.


1 Introduction
Named Entity Recognition (NER) is an important natural language processing task with a multitude of applications (Lorica and Nathan, 2021). While generative AI is currently ubiquitous in the scientific literature and public debate, it has not (yet) replaced discriminative AI for information extraction tasks like NER. Fine-tuned, transformer-based encoder models are both state of the art (SOTA) in research and commonly used by developers to solve real-world problems, see e.g. (Raza et al., 2022; Stollenwerk et al., 2022). Popular open source frameworks, like the ones provided by HuggingFace (Wolf et al., 2020; Lhoest et al., 2021; Von Werra et al., 2022), greatly facilitate the use of such models. They cover the whole workflow consisting of dataset integration, model training, evaluation and inference, see Fig. 1. However, they do require a certain degree of expertise and often some significant, use-case specific effort. Some of the (general and NER-specific) challenges are:
(i) There exist various sources for datasets. Regarding public datasets, HuggingFace and GitHub repositories are important sources. Private datasets may be stored on local filesystems or be created using annotation tools. Additional complexity is introduced by the fact that datasets often come in different formats, sometimes even within the same source. These issues typically require customized data preprocessing code for every new use case.
(ii) Data for NER is processed on three different levels: tokens, words and entities. Different parts of the workflow may operate on different levels, as shown in Tab. 1. Datasets may be pre-tokenized (word level) or not (entity level). At training time, labels for tokens that are not the first token of a word may be ignored (word level) or included (token level) in the computation of the loss. Model evaluation takes place primarily on the entity level (although it is labels on the token or word level that are employed for the computation). Finally, while model predictions are often made on the entity level, some use cases may require predictions on the word level, for instance if the associated probabilities are to be used for active learning. Handling these technical intricacies requires expert knowledge.
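To make the three levels concrete, here is a small illustrative example (not code from the library) of how the same annotation can be represented on the entity, word and token level:

```python
# Illustrative example (not nerblackbox code): the annotation of
# "Hi, Anna!" with a single PER entity, on three different levels.

# entity level: raw text plus character offsets
entity_level = {
    "text": "Hi, Anna!",
    "entities": [{"start": 4, "end": 8, "label": "PER"}],
}

# word level: pre-tokenized words with one label per word
word_level = {
    "words": ["Hi", ",", "Anna", "!"],
    "labels": ["O", "O", "B-PER", "O"],
}

# token level: a subword tokenizer may split words further; non-first
# subtokens either receive a label of their own (token level) or are
# excluded from the loss (word level), here marked as "IGNORE"
token_level = {
    "tokens": ["Hi", ",", "An", "##na", "!"],
    "labels": ["O", "O", "B-PER", "I-PER", "O"],              # token level
    "labels_word_level": ["O", "O", "B-PER", "IGNORE", "O"],  # word level
}

print(entity_level["text"][4:8])  # → Anna
```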
(iii) There exists a multitude of NER-specific annotation schemes and variants, and it is important to be aware of the differences. For instance, during data preprocessing, existing word or entity labels need to be mapped to token labels, which is an annotation scheme dependent process. At evaluation time, there are different ways to cope with predictions that do not obey the rules of the given annotation scheme (we will get back to this in Sec. 4.6).
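As an illustration of such a scheme-dependent mapping, word labels in the BIO scheme can be propagated to subword tokens roughly as follows (a simplified sketch, not nerblackbox's actual implementation; real tokenizers additionally emit special tokens that must be masked):

```python
# Map word-level BIO labels to token-level labels, given each token's
# word index as produced by a subword tokenizer.
def word_to_token_labels(word_labels, word_ids):
    token_labels = []
    for i, word_id in enumerate(word_ids):
        label = word_labels[word_id]
        first_subtoken = i == 0 or word_ids[i - 1] != word_id
        if not first_subtoken and label.startswith("B-"):
            # only the first subtoken of a word keeps the "B-" prefix
            label = "I-" + label[2:]
        token_labels.append(label)
    return token_labels

# "Anna" is split into two subtokens, both belonging to word 0
print(word_to_token_labels(["B-PER", "O"], [0, 0, 1]))
# → ['B-PER', 'I-PER', 'O']
```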
(iv) Training hyperparameters that lead to reasonable performance may depend on the employed model and dataset. For instance, while a small dataset often requires more training epochs, larger datasets can usually be trained for fewer.

The aim of nerblackbox is to provide a high-level framework which makes the usage of SOTA NER models as simple as possible. As we will see in detail in Sec. 3, it offers easy access to datasets from various sources, automated training and evaluation as well as simple but versatile model inference. It does so by hiding all technical complications from the user and is targeted at developers as well as people who are not necessarily experts in machine learning or NLP. However, nerblackbox also allows fine-grained control over all sorts of low-level parameters and provides many advanced features, some of which we will cover in Sec. 4. This might make the library appealing also to researchers and experts.

2 Related Work
The most commonly used framework for transformer-based NLP is arguably the HuggingFace ecosystem, in particular the open source libraries transformers (Wolf et al., 2020), datasets (Lhoest et al., 2021) and evaluate (Von Werra et al., 2022). Another popular alternative is spacy (Honnibal et al., 2020).
High-level libraries that are built on top of transformers exist in the form of Simple Transformers (Rajapakse, 2019) and T-NER (Ushio and Camacho-Collados, 2021). Simple Transformers is a high-level library that covers a broad range of NLP tasks with basic support for NER. T-NER is specific to NER, with an emphasis on cross-domain and cross-lingual model evaluation. Of all the mentioned libraries, it is arguably the most similar to nerblackbox. However, as will be discussed in the following sections, nerblackbox offers many unique and powerful features that, to the best of our knowledge, make it distinct from any existing framework.
3 Basic Usage
nerblackbox provides a simple API to automate each step in the life cycle of a NER model (cf. Fig. 1) using very few lines of code. It does so in terms of a small number of classes, such as Dataset, Training and Model. A high-level overview of the involved components is shown in Fig. 2.

3.1 Dataset Integration
nerblackbox allows seamless access to datasets from the following sources: HuggingFace (HF), the local filesystem (LF), built-in datasets (BI) and annotation tools (AT).
Basically, a dataset can be set up for training and evaluation as in the following example. While this works out-of-the-box for the sources HF and BI, some additional information needs to be provided for the sources LF and AT in order for nerblackbox to be able to find the data. Integrating different datasets can be challenging as they may have different formatting (even on HuggingFace) and annotation schemes. Some datasets are pre-tokenized and split into training/validation/test subsets, while others are not. The set_up() method automatically deals with these challenges and makes sure that every dataset, irrespective of the source, is transformed into a standard format. Apart from downloading, reformatting, and dataset splitting (if needed), this also includes an analysis of the data. For details, we refer to the library's documentation.
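A minimal sketch of such a setup, based on the set_up() method described above; the Dataset class name and its arguments follow the library's documentation and may differ in detail:

```python
# Hypothetical sketch of the dataset setup; exact signatures may differ.
from nerblackbox import Dataset

dataset = Dataset(name="conll2003", source="HF")  # HF = HuggingFace
dataset.set_up()  # download, reformat, split and analyze the data
```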

3.2 Training
In order to train a model, one only needs to choose a name for the training run (for later reference) and specify the model and dataset names. To ensure stable results irrespective of the dataset, the training employs well-established hyperparameters by default (Mosbach et al., 2021). In particular, a specific learning rate schedule (Stollenwerk, 2022) based on early stopping and warm restarts (Loshchilov and Hutter, 2017) is used to accommodate different dataset sizes.
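A training run might then be started as follows (a hypothetical sketch; the Training class is named in this paper, while the exact argument names are assumptions based on the library's documentation):

```python
# Hypothetical sketch: a training run only needs a name plus model and
# dataset identifiers; sensible hyperparameters are used by default.
from nerblackbox import Training

training = Training(
    "my_training",            # name for later reference
    model="bert-base-cased",  # any suitable HuggingFace model
    dataset="conll2003",
)
training.run()
```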

3.3 Evaluation
Any NER model, whether it was trained using nerblackbox or taken directly from HuggingFace (HF), can be evaluated on any dataset that is accessible via nerblackbox (see Sec. 3.1). The standard metrics for NER are used, i.e. precision, recall and the f1 score. Each metric is computed as a micro- and macro-average as well as for the individual classes. All metrics are determined both on the entity and word level.
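Such an evaluation might look as follows (hypothetical sketch: evaluate_on_dataset() is the method named in this paper, while Model.from_training and the phase argument are assumptions):

```python
# Hypothetical sketch: evaluate a nerblackbox-trained model on the
# test subset of any accessible dataset.
from nerblackbox import Model

model = Model.from_training("my_training")
results = model.evaluate_on_dataset("conll2003", phase="test")
# results contain precision, recall and f1, micro-/macro-averaged and
# per class, on the entity and word level
```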

3.4 Inference
Similar to evaluation, both NER models trained using nerblackbox and models taken directly from HuggingFace (HF) can be used for inference.
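Single-document inference might look as follows (hypothetical sketch; Model.from_huggingface and predict are assumptions based on the library's documentation, and the model name is only an example):

```python
# Hypothetical sketch of entity-level predictions for a single document.
from nerblackbox import Model

model = Model.from_huggingface("dslim/bert-base-NER")
predictions = model.predict("The United Nations was founded in 1945.")
```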
Apart from the predictions on the entity level for a single document shown above, nerblackbox also supports predictions on the word level (with or without probabilities) and batch inference. In addition, a model can be applied directly to a file containing raw data, which may be useful for inference at large scale (e.g. in production).

4 Advanced Usage
The nerblackbox workflow and the API are designed to be as simple as possible and to conceal technical complications from the user. However, they are also highly customizable in terms of optional function arguments, which may be particularly interesting for machine learning experts and researchers. In this section, we are going to cover a non-exhaustive selection of nerblackbox's advanced features, with a slight emphasis on the training part. For further information, the reader is referred to the library's documentation.

4.1 Training Hyperparameters and Presets
While nerblackbox uses sensible default values for the training hyperparameters (see Sec. 3.2), one may also opt to specify them manually. In particular, all aspects of the learning rate schedule (e.g. maximum learning rate, number of epochs, early stopping parameters) can be chosen at one's own discretion. In addition, the Training class offers several popular hyperparameter presets via the instantiation argument from_preset. Among them are the learning rate schedules from (Devlin et al., 2019) and (Mosbach et al., 2021), which may work well for larger and smaller datasets, respectively. Hyperparameter search is also supported.
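For instance (hypothetical sketch; from_preset is the argument named above, while the preset name and the override argument are assumptions):

```python
# Hypothetical sketch: select a hyperparameter preset and override
# individual values manually.
from nerblackbox import Training

training = Training(
    "my_training_preset",
    model="bert-base-cased",
    dataset="conll2003",
    from_preset="original",  # preset name illustrative, e.g. Devlin et al. (2019)
    lr_max=2e-5,             # manual override; argument name is an assumption
)
training.run()
```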

4.2 Dataset Pruning
nerblackbox provides the option to use only a subset of the training, validation or test data by specifying parameters like train_fraction. This may be useful to accelerate training (for instance in the development phase of a product) or to investigate the effect of the dataset size (for instance to see if the model performance has saturated, or for research purposes).
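For instance (hypothetical sketch; argument names other than train_fraction are assumptions):

```python
# Hypothetical sketch: train on only 10% of the training data,
# e.g. during product development.
from nerblackbox import Training

training = Training(
    "my_training_small",
    model="bert-base-cased",
    dataset="conll2003",
    train_fraction=0.1,
)
training.run()
```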

4.3 Annotation Schemes
While every dataset is associated with a certain annotation scheme, nerblackbox provides the option to translate between schemes at training time. The desired annotation scheme can simply be specified via the training parameter annotation_scheme. This may be interesting for users who aim to optimize their model's performance as well as for researchers who want to systematically investigate the impact of the annotation scheme.
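For instance (hypothetical sketch; annotation_scheme is the parameter named above, while the scheme identifier and other argument names are assumptions):

```python
# Hypothetical sketch: train with BILOU labels even if the dataset
# uses a different scheme such as BIO.
from nerblackbox import Training

training = Training(
    "my_training_bilou",
    model="bert-base-cased",
    dataset="conll2003",
    annotation_scheme="bilou",
)
training.run()
```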

4.4 Multiple Runs
Since the training of a neural network includes stochastic processes, the performance of the resulting model depends on the employed random seed. In order to gain control over the associated statistical uncertainties, one may train multiple models using different random seeds. With nerblackbox, this can trivially be done by setting the training parameter multiple_runs to an integer greater than 1. In that case, the evaluation metrics will be given in terms of the mean and its associated uncertainty. For inference, the best model is automatically used.
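For instance (hypothetical sketch; multiple_runs is the parameter named above, other argument names are assumptions):

```python
# Hypothetical sketch: five runs with different random seeds;
# metrics are then reported as mean plus uncertainty.
from nerblackbox import Training

training = Training(
    "my_training_runs",
    model="bert-base-cased",
    dataset="conll2003",
    multiple_runs=5,
)
training.run()
```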

4.5 Detailed Results
nerblackbox saves detailed training and evaluation results (e.g. loss curves, confusion matrices) using MLflow (https://pypi.org/project/mlflow/) and TensorBoard. This is useful in order to keep an overview of trained models, inspect their detailed properties as well as optimize and crosscheck the training process.

4.6 Careful Evaluation
A model may predict labels for a sequence of tokens that are inconsistent with the employed annotation scheme. For instance, if the BIO annotation scheme is used, the combination O I-PER is incorrect. When translated to entity predictions, nerblackbox ignores incorrect labels by default, both at evaluation and inference time. However, the popular evaluate (Von Werra et al., 2022) and seqeval (Nakayama, 2018) libraries do take inconsistent predictions into account during evaluation. For this reason, the evaluate_on_dataset() method (see Sec. 3.3) returns results for both approaches.
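The default behavior can be illustrated with a small sketch (not nerblackbox's actual implementation): BIO labels are converted to entities, and labels that violate the scheme are simply ignored.

```python
# Extract (type, start, end) entities from BIO labels, ignoring labels
# that violate the scheme, e.g. an I-PER that does not continue a
# PER entity.
def bio_to_entities(labels):
    entities, start, typ = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if start is not None:
                entities.append((typ, start, i))
            start, typ = i, label[2:]
        elif label.startswith("I-") and start is not None and typ == label[2:]:
            continue  # valid continuation of the open entity
        else:
            # "O" or an inconsistent label: close any open entity
            if start is not None:
                entities.append((typ, start, i))
            start, typ = None, None
    if start is not None:
        entities.append((typ, start, len(labels)))
    return entities

print(bio_to_entities(["B-PER", "I-PER", "O", "I-PER"]))
# → [('PER', 0, 2)]  -- the trailing I-PER is inconsistent and ignored
```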

4.7 Compatibility with transformers
nerblackbox is heavily based on transformers (Wolf et al., 2020) such that compatibility is guaranteed. In particular, the Model class has the attributes tokenizer and model, which are ordinary transformers classes and can be used as such. GPU support (i.e. automatic detection and use) is also provided through transformers.
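A hypothetical sketch of this interoperability (the tokenizer and model attributes are named above; Model.from_huggingface and the model name are assumptions):

```python
# Hypothetical sketch: the underlying transformers objects can be
# accessed and used directly.
from nerblackbox import Model

model = Model.from_huggingface("dslim/bert-base-NER")
encoding = model.tokenizer("Hi, Anna!", return_tensors="pt")  # transformers tokenizer
logits = model.model(**encoding).logits                       # transformers model
```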

5 Resources and Code Quality
nerblackbox is available as a package on PyPI. The associated GitHub repository is public at https://github.com/flxst/nerblackbox and contains the source code as well as multiple example notebooks. A detailed documentation is provided. It includes a pedagogical introduction to the library, an in-depth discussion of its features as well as docs for the python API. Consistent code syntax and typing are ensured by usage of black and mypy, respectively. We employ unit and end-to-end testing. As an additional crosscheck, numerical results from the literature are reproduced using nerblackbox (details can be found in the documentation).

Figure 1: Essential stages in the life cycle of a machine learning model.

Figure 2: High-level overview of the nerblackbox library. It allows users to easily fine-tune, evaluate and apply models for named entity recognition. The symbols to the left and right represent the sources that nerblackbox provides seamless access to: the Local Filesystem (LF), HuggingFace (HF), Annotation Tools (AT) as well as Built-in (BI) datasets that are fetched from GitHub.

Table 1: Overview of the data levels that the different parts of a NER model workflow can operate on.