T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition

Language model (LM) pretraining has led to consistent improvements in many NLP downstream tasks, including named entity recognition (NER). In this paper, we present T-NER (Transformer-based Named Entity Recognition), a Python library for NER LM finetuning. In addition to its practical utility, T-NER facilitates the study and investigation of the cross-domain and cross-lingual generalization ability of LMs finetuned on NER. Our library also provides a web app where users can get model predictions interactively for arbitrary text, which facilitates qualitative model evaluation for non-expert programmers. We show the potential of the library by compiling nine public NER datasets into a unified format and evaluating the cross-domain and cross-lingual performance across the datasets. The results from our initial experiments show that in-domain performance is generally competitive across datasets. However, cross-domain generalization is challenging even with a large pretrained LM, which nevertheless has the capacity to learn domain-specific features if fine-tuned on a combined dataset. To facilitate future research, we also release all our LM checkpoints via the Hugging Face model hub.


Introduction
Language model (LM) pretraining has become one of the most common strategies within the natural language processing (NLP) community to solve downstream tasks (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018, 2019; Devlin et al., 2019). LMs trained on large textual corpora only need to be finetuned on downstream tasks to outperform most task-specific models. Among the NLP tasks impacted by LM pretraining, named entity recognition (NER) is one of the most prevalent and practical applications. However, the availability of open-source NER libraries for LM training is limited. In this paper, we introduce T-NER, an open-source Python library for cross-domain analysis of NER with pretrained Transformer-based LMs, available at https://github.com/asahi417/tner, with model checkpoints at https://huggingface.co/models?search=asahi417/tner. Figure 1 shows a brief overview of our library and its functionalities. The library facilitates NER experimental design, including easy-to-use features such as model training and evaluation. Most notably, it makes it possible to organize cross-domain analyses, such as training a NER model and testing it on a different domain, with minimal configuration. We also report initial experimental results, which show that although cross-domain NER is challenging, LMs can successfully learn new domain knowledge when given access to data from new domains. These results suggest that LMs are capable of learning a variety of domain knowledge, but that ordinary finetuning on a single dataset is likely to cause overfitting and poor domain generalization.
In terms of system design, T-NER is implemented in Pytorch (Paszke et al., 2019) on top of the Transformers library (Wolf et al., 2019). Moreover, the interfaces of our training and evaluation modules are highly inspired by Scikit-learn (Pedregosa et al., 2011), enabling interoperability with recent models and an intuitive way to integrate them. In addition to the versatility of our toolkit for NER experimentation, we also include an online demo and robust pre-trained models trained across domains. In the following sections, we provide a brief overview of NER (Section 2), explain the system architecture of T-NER along with basic usage (Section 3), and describe experimental results on cross-domain transfer with our library (Section 4).

Named Entity Recognition
Given an arbitrary text, the task of NER consists of detecting named entities and identifying their type. For example, given the sentence "Dante was born in Florence.", a NER model would identify "Dante" as a person and "Florence" as a location. Traditionally, NER systems have relied on a classification model on top of hand-engineered feature sets extracted from corpora (Ratinov and Roth, 2009; Collobert et al., 2011), later improved by carefully designed neural network approaches (Lample et al., 2016; Chiu and Nichols, 2016; Ma and Hovy, 2016). This paradigm shift was mainly due to neural models' efficient access to contextual information and their flexibility, as human-crafted feature sets were no longer required. Later, contextual representations produced by pretrained LMs further improved the generalization ability of neural network architectures in many NLP tasks, including NER (Peters et al., 2018; Devlin et al., 2019). In particular, LMs see millions of plain texts during pretraining, knowledge that can then be leveraged in downstream NLP applications. This property has been studied in the recent literature by probing their generalization capacity (Hendrycks et al., 2020; Aharoni and Goldberg, 2020; Desai and Durrett, 2020; Gururangan et al., 2020). When it comes to LM generalization studies in NER, the literature is more limited and mainly restricted to in-domain (Agarwal et al., 2021) or multilingual settings (Pfeiffer et al., 2020a; Hu et al., 2020b). Our library facilitates future research on cross-domain and cross-lingual generalization by providing a unified benchmark for several languages and domains, as well as a straightforward implementation of NER LM finetuning.

T-NER: An Overview
A key design goal was to create a self-contained, universal system to train, evaluate, and use NER models easily, not only for research purposes but also for practical use cases in industry. Moreover, we provide a demo web app (Figure 2) where users can interactively get predictions from a trained model for a given sentence. This way, users (even those without programming experience) can conduct qualitative analyses with their own or existing pre-trained models.
In the following we provide details on the technicalities of the package, including how to train and evaluate any LM-based architecture. Our package, T-NER, allows NLP practitioners to get started with NER in a few lines of code while diving into recent progress in LM finetuning. We use Python as our core implementation language, as it is one of the most widely used languages in the machine learning and NLP communities. Our library gives Python users access to features such as model training, in- and cross-domain model evaluation, and an interface for getting predictions from trained models with minimal effort.

Datasets
For model training and evaluation, we compiled nine public NER datasets from different domains, unifying them into the same format: OntoNotes5 (Hovy et al., 2006), CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), WNUT 2017 (Derczynski et al., 2017), WikiAnn (Pan et al., 2017), FIN (Salinas Alvarado et al., 2015), BioNLP 2004 (Collier and Kim, 2004), BioCreative V CDR (Wei et al., 2015), the MIT movie review semantic corpus, and the MIT restaurant review corpus. These unified datasets are also made available as part of our T-NER library. Except for WikiAnn, which contains 282 languages, all the datasets are in English, and only the MIT corpora are lowercased. The original BioCreative V CDR corpus consists of long documents that cannot be fed to an LM because of their length, so we split them into sentences. The MIT movie corpus includes two datasets (eng and trivia10k13) coming from different data sources; both are integrated into our library, but we only used the larger trivia10k13 in our experiments. The original MIT NER corpora can be downloaded from https://groups.csail.mit.edu/sls/downloads/.

Figure 2: A screenshot from the demo web app. In this example, the NER transformer model is fine-tuned on OntoNotes 5 and a sample sentence is fetched from Wikipedia (en.wikipedia.org/wiki/Sergio_Mendes).

As the MIT corpora are commonly used for the slot filling task in spoken language understanding (Liu and Lane, 2017), the characteristics of the entities and the annotation guidelines are quite different from the other datasets, but we included them for completeness and to analyze the differences across datasets. Table 1 shows statistics of each dataset. In Section 4, we train models on each dataset and assess their in- and cross-domain accuracy.
Dataset format and customization. Users can utilize their own datasets for both model training and evaluation by formatting them into the IOB scheme (Tjong Kim Sang and De Meulder, 2003) which we used to unify all datasets. In the IOB format, all data files contain one word per line with empty lines representing sentence boundaries. At the end of each line there is a tag which states whether the current word is inside a named entity or not. The tag also encodes the type of named entity. Here is an example from CoNLL 2003:
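The fragment below illustrates the format with the well-known opening sentence of the CoNLL 2003 training set (tags shown here in the IOB2 variant for clarity), together with a minimal standard-library reader for such files; the reader is an illustration of the on-disk format, not T-NER's actual data loader.

```python
# Illustrative IOB fragment: one token and tag per line,
# blank lines separate sentences.
SAMPLE = """\
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O
"""

def read_iob(text):
    """Parse IOB-formatted text into (tokens, tags) sentence pairs."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        if not line.strip():          # blank line = sentence boundary
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.rsplit(" ", 1)
        tokens.append(token)
        tags.append(tag)
    if tokens:                        # flush the final sentence
        sentences.append((tokens, tags))
    return sentences
```

For the nine bundled datasets, T-NER performs this loading automatically; user-supplied corpora only need to follow the same one-token-per-line layout.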

Model Training
We provide modules to facilitate LM finetuning on any given NER dataset. Following Devlin et al. (2019), we add a linear layer on top of the last embedding layer of each token, and train all weights with a cross-entropy loss. The model training component relies on the Huggingface Transformers library (Wolf et al., 2019), one of the largest Python frameworks for distributing pretrained LM checkpoints. Our library is therefore fully compatible with the Transformers framework: as soon as a new model is deployed on the Transformers model hub, it can immediately be used with our library as a NER model. To reduce computational cost, in addition to multi-GPU support, we implement mixed precision training using the apex library (https://github.com/NVIDIA/apex). Model training on a given dataset can be run in an intuitive way, as displayed below:

    from tner import TrainTransformersNER
    model = TrainTransformersNER(
        dataset="ontonotes5",
        transformer="roberta-base")
    model.train()

This sample code finetunes RoBERTa BASE on the OntoNotes5 dataset. We also provide an easy extension to train on multiple datasets at the same time. Once training is completed, checkpoint files with model weights and other statistics are generated. These are automatically organized for each configuration and can be easily uploaded to the Hugging Face model hub. Ready-to-use code samples can be found in our Google Colab notebook, and details on additional options and arguments are included in the GitHub repository. Finally, our library supports Tensorboard to visualize learning curves.
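The training objective described above can be sketched with the standard library alone. This is an illustration of the per-token cross-entropy loss only, not T-NER's actual PyTorch implementation; the logits here stand in for the linear layer's outputs over the LM's last-layer token embeddings.

```python
import math

def token_cross_entropy(logits, gold):
    """Average cross-entropy over tokens.

    logits: one score vector per token (one score per tag class),
            as produced by a linear layer over token embeddings.
    gold:   one gold tag index per token.
    """
    total = 0.0
    for scores, y in zip(logits, gold):
        # negative log-softmax of the gold class score
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[y]
    return total / len(gold)

# Two tokens, two tag classes; confident-correct logits
# yield a much lower loss than confident-wrong ones.
loss = token_cross_entropy([[10.0, 0.0], [0.0, 10.0]], [0, 1])
```

In the library itself, this loss is averaged over all tokens of a batch and backpropagated through the full Transformer, so every weight (not just the linear head) is updated during finetuning.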

Model Evaluation
Once a NER model is trained, users may want to test it on the same dataset or on a different one to assess its performance across domains. To this end, we implemented flexible evaluation modules that facilitate cross-domain comparison, aided by the unification of all datasets into the same format (see Section 3.1) with a unique label reference lookup.
The basic usage of the evaluation module is described below.

Evaluation
In this section, we assess the reliability of T-NER with experiments in standard NER datasets.

Implementation details
Throughout the experiments, we use XLM-R, which has been shown to be one of the most reliable multilingual pretrained LMs for discriminative tasks at the moment. In all experiments we use the default configuration and hyperparameters of Huggingface's XLM-R implementation. For WikiAnn/ja (Japanese), we convert the original character-level tokenization into proper morphological chunks with MeCab.

Evaluation metrics and protocols
As is customary in the NER literature, we report the span micro-F1 score computed by seqeval, a Python library that computes metrics for sequence prediction evaluation. We refer to this F1 score as the type-aware F1 score to distinguish it from the type-ignored metric used to assess cross-domain performance, which we explain below.
In a cross-domain evaluation setting, the type-aware F1 score easily fails to represent cross-domain performance when the granularity of entity types differs across datasets. For instance, the MIT restaurant corpus has entities such as amenity and rating, while plot and actor are entities from the MIT movie corpus. Thus, we report the type-ignored F1 score for the cross-domain analysis. In this type-ignored evaluation, the entity type of both predictions and gold labels is disregarded, reducing the task to simpler entity span detection. This evaluation protocol can be customized by the user at test time.
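The type-ignored protocol can be sketched with a small standard-library example (T-NER itself delegates metric computation to seqeval): stripping entity types from the extracted spans before matching reduces the evaluation to span detection.

```python
def spans(tags, ignore_type=False):
    """Extract entity spans from an IOB tag sequence as
    (start, end, type) triples; type is "" when ignored."""
    out, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel to flush
        inside = tag.startswith("I-") and start is not None and etype == tag[2:]
        if not inside:
            if start is not None:                 # close the open span
                out.append((start, i, "" if ignore_type else etype))
                start, etype = None, None
            if tag != "O":                        # B- (or stray I-) opens one
                start, etype = i, tag[2:]
    return out

def micro_f1(gold, pred, ignore_type=False):
    """Span-level micro-F1 over parallel lists of IOB tag sequences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = set(spans(g, ignore_type)), set(spans(p, ignore_type))
        tp += len(gs & ps)
        fp += len(ps - gs)
        fn += len(gs - ps)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, a gold PER span predicted as ORG with the correct boundaries counts as an error under the type-aware score but as a match under the type-ignored score, which is exactly the behavior needed when entity inventories differ across datasets.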

Results
We conduct three experiments on the nine datasets described in Table 1: (i) in-domain evaluation (Section 4.2.1), (ii) cross-domain evaluation (Section 4.2.2), and (iii) cross-lingual evaluation (Section 4.2.3). While the first experiment tests our implementation in standard datasets, the second experiment is aimed at investigating the cross-domain performance of transformer-based NER models. Finally, as a direct extension of our evaluation module, we show the zero-shot cross-lingual performance of NER models on the WikiAnn dataset.

In-domain results
The main results are displayed in Table 2, where we report the type-aware F1 scores of the XLM-R BASE and XLM-R LARGE models along with the current state-of-the-art (SoTA). Our framework with XLM-R LARGE achieves scores comparable to the SoTA, even surpassing it on the WNUT 2017 dataset. In general, XLM-R LARGE performs consistently better than XLM-R BASE but, interestingly, the base model outperforms the large one on the FIN dataset. This can be attributed to the limited training data in this dataset, which may have caused overfitting in the large model.
Better accuracy can generally be expected from domain-specific or larger language models, which can also be integrated into our library. Nonetheless, our goal in these experiments was not to achieve SoTA but rather to provide a competitive and easy-to-use framework. In the remaining experiments we report results for XLM-R LARGE only; the results for XLM-R BASE can be found in the appendix.
In Table 3, we present the type-ignored F1 results across datasets. Overall, cross-domain scores are not as competitive as in-domain results. This gap reveals the difficulty of transferring NER models to different domains, which may also be attributed to differing annotation guidelines or data construction procedures across datasets. In particular, training on the bionlp and bc5cdr datasets leads to null accuracy when evaluating on the other datasets, and vice versa. These datasets are very domain-specific, with entities such as DNA, Protein, Chemical, and Disease, which results in poor adaptation to other domains. On the other hand, some datasets transfer more easily, such as wnut and conll. The wnut-trained model achieves 85.7 on the conll dataset and, surprisingly, the conll-trained model actually works better than the wnut-trained model when evaluated on the wnut test set. This could also be attributed to data size, as wnut has only 1,000 sentences while conll has 14,041. Nevertheless, the fact that ontonotes has 59,924 sentences but does not perform better than conll on wnut reveals a certain domain similarity between conll and wnut. Finally, the model trained on the training sets of all datasets achieves a type-ignored F1 score close to the in-domain baselines. This indicates that an LM is capable of learning representations of different domains. Moreover, leveraging domain similarity as explained above could lead to better results, as distant datasets such as bionlp and bc5cdr surely cause performance drops. This is an example of the type of experiments that could be facilitated by T-NER, which we leave for future work.

Cross-lingual results
Finally, we present results for zero-shot cross-lingual NER on the WikiAnn dataset, covering six distinct languages: English (en), Japanese (ja), Russian (ru), Korean (ko), Spanish (es), and Arabic (ar). Table 4 shows the cross-lingual evaluation results; the diagonal contains the results of models trained on the training data of the target language itself. There are a few interesting findings. First, we observe strong mutual transfer between Russian and Spanish, which are generally considered distant languages and do not share an alphabet. Second, Arabic also transfers well to Spanish; despite the Arabic (lexical) influence on the Spanish language (Stewart et al., 1999), they are still languages from distant families.
Clearly, this is a shallow cross-lingual analysis, but it highlights the potential of our library for research in cross-lingual NER. Recently, Hu et al. (2020a) proposed a compilation of multilingual benchmark tasks that includes the WikiAnn dataset, on which XLM-R proved to be a strong baseline for multilingual NER. This is in line with the results of Conneau et al. (2020), which showed a high capacity for zero-shot cross-lingual transfer. In this respect, Pfeiffer et al. (2020b) proposed language- and task-specific adapter modules that can further improve cross-lingual adaptation in NER. Given the recent advances in cross-lingual language models, we expect our library to help practitioners experiment with and test these advances in NER.

Conclusion
In this paper, we have presented a Python library for getting started with Transformer-based NER models. The paper focuses especially on LM finetuning, and empirically shows the difficulty of cross-domain generalization in NER. Our framework is designed to be as simple as possible so that users of any level can start running NER experiments on any given dataset. To this end, we have also facilitated evaluation by unifying some of the most popular NER datasets in the literature, including languages other than English. We believe our initial experimental results emphasize the importance of NER generalization analysis, and we hope that our open-source library can help the NLP community conduct such research in an efficient and accessible way.

A Appendices
In all experiments we make use of the default configuration and hyperparameters of Huggingface's XLM-R implementation.

A.1 Cross-lingual Results
In this section, we show the cross-lingual analysis for XLM-R BASE, with results reported in Table 5. As in Section 4.2.3, we rely on the WikiAnn dataset and conduct zero-shot cross-lingual NER over six distinct languages: English (en), Japanese (ja), Russian (ru), Korean (ko), Spanish (es), and Arabic (ar).

A.2 Cross-domain Results
In this section, we show a few more results from our cross-domain analysis, which is based on the non-lowercased English datasets: OntoNotes5 (ontonotes), CoNLL 2003 (conll), WNUT 2017 (wnut), WikiAnn/en (wiki), BioNLP 2004 (bionlp), BioCreative V (bc5cdr), and FIN (fin). Table 6 shows the type-aware F1 scores of the XLM-R LARGE and XLM-R BASE models trained on all the datasets. Furthermore, Table 7 shows the corresponding results with all datasets converted into lowercase. Tables 8 and 9 show the type-ignored F1 scores across models trained on the different English datasets, including the lowercased corpora, with XLM-R LARGE and XLM-R BASE respectively.

Table 9: Type-ignored F1 score in the cross-domain setting over lowercased English datasets with XLM-R BASE. We compute the average accuracy over the test sets, denoted avg. The model trained on all datasets listed here is denoted all.