T2NER: Transformers based Transfer Learning Framework for Named Entity Recognition

Recent advances in deep transformer models have achieved state-of-the-art results in several natural language processing (NLP) tasks, whereas named entity recognition (NER) has traditionally benefited from long short-term memory (LSTM) networks. In this work, we present a Transformers-based Transfer Learning framework for Named Entity Recognition (T2NER), created in PyTorch for NER with deep transformer models. The framework is built upon the Transformers library as the core modeling engine and supports several transfer learning scenarios, from sequential transfer to domain adaptation, multi-task learning, and semi-supervised learning. It aims to bridge the gap between the algorithmic advances in these areas by combining them with state-of-the-art transformer models, providing a unified platform that is readily extensible and can be used both for transfer learning research in NER and for real-world applications. The framework is available at: https://github.com/suamin/t2ner.


Introduction
Named entity recognition (NER) is an important task in information extraction, benefiting downstream applications such as entity linking (Cucerzan, 2007), relation extraction (Culotta and Sorensen, 2004), and question answering (Krishnamurthy and Mitchell, 2015). NER has been a challenging task in NLP due to large variations in entity names and flexibility in how entities are mentioned. These challenges are amplified in cross-lingual and cross-domain NER settings, where the added difficulty comes from differences in text genre and entity names across languages and domains (Jia et al., 2019). Furthermore, NER models have shown relatively high variance even when trained on the same data (Reimers and Gurevych, 2017). These models generalize poorly when tested on data from different domains and languages, and even more so when the data contains unseen entity mentions (Augenstein et al., 2017; Agarwal et al., 2020; Wang et al., 2020). These challenges make transfer learning an important and well-studied area of NER research.
Recent successes in transfer learning have mainly come from pre-trained language models (Devlin et al., 2019; Radford et al., 2019) with contextualized word embeddings based on deep transformer models (Vaswani et al., 2017). These models achieve state-of-the-art results in several NLP tasks such as named entity recognition, document classification, and question answering. Due to their wide success and community adoption, successful frameworks like Transformers have emerged. In NER, existing frameworks like NCRF++ lack the core infrastructure to support such models directly with state-of-the-art transfer learning algorithms.
In this paper, we present an adaptable and user-friendly development framework for the growing research in transfer learning with deep transformer models for NER, including underexplored areas such as semi-supervised learning. This is in contrast to the standard LSTM-based approaches which have largely and successfully dominated NER research. Our framework aims to bridge several gaps through the core design principles discussed in the next section.

Design Principles
T2NER is divided into several components as shown in Figure 1. The core design principle is to seamlessly integrate the Transformers (Wolf et al., 2020) library as the backend for modeling, while extending it to support different transfer learning scenarios with a range of existing algorithms. Transformers offers optimized implementations of several deep transformer models, including BERT (Devlin et al., 2019), GPT (Radford et al., 2019), RoBERTa (Liu et al., 2019), and XLM (Conneau and Lample, 2019) among others, with multi-GPU, distributed, and mixed precision training.
The second design principle is inspired by frameworks from computer vision, Dassl.pytorch (Zhou et al., 2020) and Trans-Learn (Jiang et al., 2020), which unify domain adaptation, domain generalization, and semi-supervised learning, thus allowing easy benchmarking, fair comparisons, and reproducibility. T2NER unifies these major algorithmic approaches for NER to bridge the gap between them and advance transfer learning research in NER.

Data Sources
The main data source is the NER data, which is expected to be labeled or unlabeled in the CoNLL format. We adopt the widely used BIO tagging scheme; in practice, the differences in results that arise from different schemes are negligible (Ratinov and Roth, 2009). A simple preprocessing routine is provided to standardize the data files, along with the required metadata, that is used throughout the framework. In particular, for a given collection named domain.datasetname (possibly split into train, development, and test files), T2NER creates output data files named lang.domain.datasetname-split and lang.domain.datasetname.labels, where the language information is provided by the user. In case of missing metadata, a placeholder xxx can be used. For preprocessing, we tokenize via Transformers and split sentences that are longer than the user-defined maximum length. An example output file could be en.news.conll-train, referring to the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003).
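To illustrate the naming convention, the following minimal Python sketch builds the output file names from user-provided metadata; the helper function is hypothetical and not part of the T2NER API:

```python
import os

def standardize_name(lang, domain, dataset, split, out_dir="data"):
    """Illustrative only: build output file names following the
    lang.domain.datasetname-split convention described above.
    Missing metadata is replaced with the placeholder 'xxx'."""
    lang = lang or "xxx"
    domain = domain or "xxx"
    data_file = os.path.join(out_dir, f"{lang}.{domain}.{dataset}-{split}")
    labels_file = os.path.join(out_dir, f"{lang}.{domain}.{dataset}.labels")
    return data_file, labels_file

# e.g. ('data/en.news.conll-train', 'data/en.news.conll.labels')
print(standardize_name("en", "news", "conll", "train"))
```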
Besides NER data, additional task data can also be provided, such as that for language modeling, POS tagging, and alignment resources (e.g. bilingual dictionaries or parallel sentences).

Data Readers
These are classes designed to serve the data needs of a given transfer learning scenario in a modular and extensible way. The framework provides SimpleData, SimpleAdaptationData, MultiData, and SemiSupervisedData, which are suitable for single-dataset NER, cross-lingual and cross-domain NER, multi-dataset NER, and single-dataset semi-supervised NER, respectively. Each class is derived from a base class BaseData and can be extended for further scenarios. As a concrete example, consider the dataset reader class SimpleAdaptationData in T2NER, which can provide training data for a source and a target language or domain up to a requested number of copies.
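The snippet below sketches what such an adaptation reader could look like; the class and method names are illustrative assumptions and do not reflect the exact T2NER signatures:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NERDataset:
    name: str       # e.g. "en.news.conll"
    examples: List  # tokenized, label-encoded sentences

class SimpleAdaptationDataSketch:
    """Illustrative reader pairing a labeled source dataset with an
    unlabeled target dataset for cross-lingual / cross-domain NER."""

    def __init__(self, source: NERDataset, target: NERDataset, copies: int = 1):
        self.source = source
        self.target = target
        self.copies = copies  # number of copies of the training data requested

    def training_pairs(self):
        # yield the (source, target) pair the requested number of times
        for _ in range(self.copies):
            yield self.source, self.target
```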

Models
A model is composed of three main components: a base encoder from the Transformers (Wolf et al., 2020), any additional networks (X-nets) on top of the encoder, and the prediction layer(s).
Encoder is the main model component that takes tokenized text as input and returns hidden states, such as those from BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). There are five encoder modes that we support:
• finetune: Fine-tunes the encoder and uses the last layer hidden states.
• freeze: Freezes the encoder and uses the last layer hidden states.
• firstn: Freezes only the first n layers of the encoder and uses the last layer hidden states (Wu and Dredze, 2019); a minimal sketch of this mode follows the list.
• lastn: Freezes the encoder and uses the aggregated hidden states obtained by summing the outputs from the last n layers.
• embedonly: Uses and fine-tunes only the embedding layer of the encoder.
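As an illustration of the firstn mode, the sketch below freezes the embedding layer and the first n transformer layers of a BERT-style encoder from Transformers; whether the embeddings are included in the frozen portion is an assumption made here for exposition:

```python
from transformers import AutoModel

def freeze_first_n_layers(encoder, n: int):
    """Illustrative 'firstn' mode: freeze the embeddings and the first n
    transformer layers, fine-tune the rest. Assumes the encoder exposes
    .embeddings and .encoder.layer (true for BERT-style models)."""
    for param in encoder.embeddings.parameters():
        param.requires_grad = False
    for layer in encoder.encoder.layer[:n]:
        for param in layer.parameters():
            param.requires_grad = False
    return encoder

encoder = freeze_first_n_layers(AutoModel.from_pretrained("bert-base-cased"), n=6)
```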
X-nets are additional neural architectures that can be placed on top of the encoder to further transform its hidden states. T2NER provides multi-layer Transformer and BiLSTM X-nets by default.
Prediction Layers provide the final classification layer for sequence labeling. Following Devlin et al. (2019), the default prediction layer in T2NER is a linear layer; however, support for a linear-chain conditional random field (CRF) is also included. In the multi-task setting, several output layers from different datasets in different domains or languages may be available, with partially overlapping or identical entity type sets as outputs. To facilitate transfer across tasks, private and shared prediction layers are also supported (Wang et al., 2020; Lin et al., 2018).
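The sketch below illustrates one way private and shared linear prediction heads could be organized; the module and argument names are assumptions for exposition, not the T2NER implementation:

```python
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Illustrative private + shared prediction layers for multi-task NER."""

    def __init__(self, hidden_size, num_labels_per_task, num_shared_labels):
        super().__init__()
        # one private linear head per dataset / task
        self.private = nn.ModuleDict({
            task: nn.Linear(hidden_size, n)
            for task, n in num_labels_per_task.items()
        })
        # one shared head over the union of entity types across tasks
        self.shared = nn.Linear(hidden_size, num_shared_labels)

    def forward(self, hidden_states, task):
        # returns (task-specific logits, shared logits) per token
        return self.private[task](hidden_states), self.shared(hidden_states)
```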
With these underlying components, models are mainly implemented as single- or multi-task architectures. To support a wide range of encoders in a unified API, T2NER adopts the Auto classes design from Transformers. Figure 3 shows the class hierarchies, outlining the customized extensions and the possibility of plugging in external model implementations.

Criterions
For a given sequence of length $L$ with tokens $x = [x_1, x_2, \dots, x_L]$ and labels $y = [y_1, y_2, \dots, y_L]$, where each $y_i \in \Delta^C$ is a one-hot entity type vector over $C$ types, and a linear prediction layer, the NER loss is defined as

$$\mathcal{L}_{\mathrm{NER}} = -\sum_{j=1}^{L} \sum_{i=1}^{C} y_{j,i} \log p(h_j = i \mid x_j),$$

where $p(h_j = i \mid x_j)$ is the probability of token $x_j$ being labeled as entity type $i$ and $h_j$ is the model output. When $p$ is the softmax, this becomes the cross-entropy loss. To tackle class imbalance in real-world applications, T2NER also offers two class-sensitive loss functions (a focal loss sketch follows this list):
• Focal Loss adds a modulating factor to the standard softmax which reduces the loss contribution from easy examples and extends the range in which an example receives low loss (Lin et al., 2017).
• LDAM Loss is a label-distribution-aware margin loss that encourages the model to reach an optimal trade-off between per-class margins by promoting larger margins for minority classes (Cao et al., 2019).
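As referenced above, the following is a minimal token-level focal loss sketch in PyTorch; the masking and reduction details are assumptions rather than the exact T2NER implementation:

```python
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0, ignore_index=-100):
    """Illustrative token-level focal loss (Lin et al., 2017): the
    (1 - p_t)^gamma factor down-weights easy, well-classified tokens.
    logits: (num_tokens, num_types), labels: (num_tokens,)."""
    log_probs = F.log_softmax(logits, dim=-1)
    mask = labels != ignore_index
    labels = labels.clamp(min=0)  # avoid gathering with the ignore index
    log_pt = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```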

Auxiliary Tasks
Multi-task learning has greatly benefited transfer learning in NER (Lin et al., 2018; Wang et al., 2020; Jia et al., 2019; Jia and Zhang, 2020). Several auxiliary tasks are supported in a multi-task model by default:
• Language Classification: In the cross-lingual setting, this task provides an additional classification signal over the languages (e.g., English and Spanish) used in the training data (Keung et al., 2019).
• Domain Classification: In the cross-domain setting, this task provides an additional classification signal over the domains (e.g., News and Biomedical) used in the training data (Wang et al., 2020).
• Shared Tagging: In NER settings where the entity types might differ, a shared prediction layer across all the entity types provides an additional signal to the base NER tasks.
• All-Outside Classification: A binary classification task which predicts whether a sentence contains any entity type other than the outside (O) type (see the sketch after this list).
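The snippet below sketches how the all-outside binary target could be derived from the gold NER labels; the tensor layout and padding convention are assumptions:

```python
def all_outside_targets(label_ids, outside_id, ignore_index=-100):
    """Illustrative construction of the 'all-outside' auxiliary target:
    1 if the sentence contains any non-O entity token, else 0.
    label_ids: (batch, seq_len) gold NER label ids, padded with ignore_index."""
    valid = label_ids != ignore_index
    has_entity = ((label_ids != outside_id) & valid).any(dim=1)
    return has_entity.long()  # (batch,) binary targets
```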

Trainers
Trainer is the main class concept that glues together all the components and provides a unified setup to develop, test, and benchmark algorithms. Figure 3 shows the organization of the trainer classes. Each transfer learning scenario inherits from the BaseTrainer class, and each scenario can further be extended to create an algorithm-specific training regime. This allows researchers to focus mainly on the algorithm logic while the framework fulfills the requirements of the chosen transfer scenario. Following Zhou et al. (2020) and Jiang et al. (2020), a few training algorithms are implemented by default, which we briefly describe below. In the following, the term feature extractor refers to the base encoder together with any X-nets. An optional pooling strategy {mean, sum, max, attention, ...} can be applied to aggregate the hidden states.
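As an illustration of the pooling strategies, the following minimal sketch aggregates token hidden states into a sentence-level feature; only the mean, sum, and max variants are shown, and the masking details are assumptions:

```python
def pool_hidden_states(hidden_states, attention_mask, strategy="mean"):
    """Illustrative pooling of (batch, seq, dim) token states to (batch, dim)."""
    mask = attention_mask.unsqueeze(-1).float()  # (batch, seq, 1)
    if strategy == "sum":
        return (hidden_states * mask).sum(dim=1)
    if strategy == "mean":
        return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == "max":
        return hidden_states.masked_fill(mask == 0, float("-inf")).max(dim=1).values
    raise ValueError(f"unknown pooling strategy: {strategy}")
```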
In what follows, domain and language can be used interchangeably; for consistency, we use the word domain. Gradient Reversal Layer (GRL) adds a domain classifier which is trained to discriminate whether input features come from the source or the target domain, whereas the feature extractor is trained to deceive the domain classifier and thereby match the feature distributions.
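A gradient reversal layer is commonly implemented with a custom autograd function, as in the sketch below; this mirrors the standard recipe rather than the exact T2NER code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass, so minimizing the domain classification loss
    trains the feature extractor to confuse the domain classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    # pooled features pass through this before the domain classifier
    return GradReverse.apply(x, lambd)
```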
Earth Mover Distance (EMD) adds a critic that maximizes the difference between the unbounded scores of source and target features. This effectively yields an approximation of the Wasserstein distance between the source and target feature distributions (Arjovsky et al., 2017). The overall objective jointly minimizes the NER cross-entropy loss and the Wasserstein distance. Theoretically, GRL effectively minimizes the Jensen-Shannon (JS) divergence, which suffers from discontinuities and thus provides poor gradients for the feature extractor. In contrast, the Wasserstein distance is stable and less prone to hyperparameter selection (Chen et al., 2018). For stable training, the gradient penalty is also provided (Gulrajani et al., 2017).
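The sketch below shows one way to write an EMD-style critic objective with the gradient penalty of Gulrajani et al. (2017); it assumes equally sized source and target feature batches and is not the exact T2NER implementation:

```python
import torch

def critic_loss(critic, src_feats, tgt_feats, gp_weight=10.0):
    """Illustrative critic objective: maximize the score gap between source
    and target features (Wasserstein estimate), with a gradient penalty for
    stability. The feature extractor is trained with the opposite sign."""
    wd = critic(src_feats).mean() - critic(tgt_feats).mean()

    # gradient penalty on random interpolates of source and target features
    alpha = torch.rand(src_feats.size(0), 1, device=src_feats.device)
    interpolates = (alpha * src_feats + (1 - alpha) * tgt_feats).requires_grad_(True)
    grads = torch.autograd.grad(critic(interpolates).sum(), interpolates,
                                create_graph=True)[0]
    gp = ((grads.norm(2, dim=1) - 1) ** 2).mean()

    return -wd + gp_weight * gp  # the critic minimizes this, i.e. maximizes wd
```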
Keung Adversarial is closely related to GRL but additionally uses a generator loss such that the features become difficult for the discriminator to classify correctly as source or target. The optimization is carried out in a step-wise fashion for the feature extractor, discriminator, and generator (Keung et al., 2019).
Maximum Classifier Discrepancy (MCD) adds a second classifier and measures the discrepancy between the predictions of the two classifiers on target samples. The intuition is that target samples outside the support of the source distribution are detected by the disagreement between the two classifiers. Overall, MCD solves a minimax problem in which the goal is to find two classifiers that maximize the discrepancy on target samples, and a feature generator that minimizes this discrepancy (Saito et al., 2018).
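A common choice for the MCD discrepancy is the mean absolute difference between the two classifiers' softmax outputs, as sketched below; the exact discrepancy measure used in T2NER is an assumption here:

```python
import torch.nn.functional as F

def classifier_discrepancy(logits1, logits2):
    """Illustrative MCD discrepancy (Saito et al., 2018): mean absolute
    difference between the two classifiers' probability outputs on target
    tokens. The classifiers maximize it; the feature generator minimizes it."""
    p1, p2 = F.softmax(logits1, dim=-1), F.softmax(logits2, dim=-1)
    return (p1 - p2).abs().mean()
```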
Minimax Entropy (MME) decreases the entropy of unlabeled target features in an adversarial manner by using GRL, in order to obtain high-quality discriminative features (Saito et al., 2019). Besides unsupervised domain adaptation, the method can additionally be used in semi-supervised and few-shot learning scenarios when some labeled target samples are available.
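The entropy term that MME optimizes adversarially can be computed as in the following sketch, assuming token- or sentence-level logits over the unlabeled target data:

```python
import torch
import torch.nn.functional as F

def target_entropy(logits):
    """Illustrative entropy of unlabeled target predictions used in MME
    (Saito et al., 2019): maximized w.r.t. the classifier and minimized
    w.r.t. the feature extractor via a gradient reversal layer."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
```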
Further algorithms are also provided, such as classical conditional entropy minimization (CEM) for semi-supervised learning (Grandvalet and Bengio, 2004) and recent work based on maximum mean discrepancy (MMD) for multi-source domain adaptation (Peng et al., 2019). In general, extending T2NER with newer algorithms is simple and flexible.

Usage
T2NER offers a single entry point to the framework, which takes a base JSON configuration file, an experiment-specific JSON configuration file, and optionally the name of the algorithm to run. An example experiment-specific configuration file is shown in Figure 4. Like other frameworks, T2NER can also be developed further and used as a standard Python library.

Conclusion and Future Work
In this work we presented a transformer-based framework for transfer learning research in named entity recognition (NER). We laid out the design principles, detailed the architecture, and presented the transfer scenarios along with some representative algorithms. T2NER aims to bridge the gap between the growing research in deep transformer models, NER transfer learning, and domain adaptation. T2NER has the potential to serve as a unified benchmark for existing and newer algorithms with state-of-the-art models.
For future work, we consider the following:
• We would like to create benchmark data and perform a comparison of the transfer learning algorithms (Ramponi and Plank, 2020; Kashyap et al., 2020).
• Assess the performance of the framework in terms of speed and efficiency and compare it with other tools.
• While we focused on the task of NER here, we would also like to add related tasks such as relation extraction, entity linking, and question answering.